
Machine Learning -- Unsupervised -- PCA (Principal Components Analysis)

2021-11-20  小贝学生信

PCA (Principal Components Analysis) is a widely used algorithm for reducing the dimensionality of high-dimensional, complex data.

1. The basic idea

PCA projects the original, often correlated features onto a new set of orthogonal axes (the principal components), ordered so that the first components capture as large a share of the total variance as possible. Dimension reduction then amounts to keeping only the first few components.
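To make this concrete, here is a minimal base-R sketch (my own illustration with simulated data, separate from the h2o workflow below): when two features are strongly correlated, the first component absorbs almost all of the variance.

# two correlated features; PCA should pack most of the variance into PC1
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)        # y is essentially a noisy copy of x
p <- prcomp(cbind(x, y), scale. = TRUE)  # scale. = TRUE standardizes first
summary(p)  # PC1 explains the vast majority of the variance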

2. Hands-on code

2.1 Example data

# example shopping-basket data: 2,000 baskets x 42 grocery items
url <- "https://koalaverse.github.io/homlr/data/my_basket.csv"
my_basket <- readr::read_csv(url)
dim(my_basket)
## [1] 2000 42
my_basket[1:4,1:8]
# A tibble: 4 x 8
# `7up` lasagna pepsi   yop red.wine cheese   bbq bulmers
# <dbl>   <dbl> <dbl> <dbl>    <dbl>  <dbl> <dbl>   <dbl>
# 1     0       0     0     0        0      0     0       0
# 2     0       0     0     0        0      0     0       0
# 3     0       0     0     0        0      0     0       0
# 4     0       0     0     2        1      0     0       0
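Two quick base-R sanity checks (my addition, not in the original post) before modeling:

sum(is.na(my_basket))        # count of missing values (any would be handled by impute_missing below)
range(as.matrix(my_basket))  # purchase counts are small non-negative integers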

2.2 The R package

library(h2o) # performing dimension reduction
h2o::h2o.prcomp() # the PCA function used throughout this post

(1) The data set must first be converted to h2o's H2OFrame format.
(2) Before running PCA, the data need cleaning: handle missing values, convert categorical variables to numeric ones (one-hot encoding; see the sketch after this list), and standardize.
(3) Key parameters:
pca_method selects the algorithm. The default, "GramSVD", suits data whose features are mostly numeric; "GLRM" suits data whose features are mostly categorical.
k sets how many principal components to keep. A practical approach is to keep as many components as there are original features (the maximum) and choose among them afterwards.
transform controls standardization; the default is "NONE". When standardization is needed, "STANDARDIZE" is recommended.
impute_missing controls missing-value handling; when TRUE, missing entries are replaced with the column mean.
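my_basket is already all-numeric, so no encoding is needed here; as a hedged illustration of point (2), though, here is one way to one-hot encode a hypothetical data frame df with a factor column colour in base R, before calling as.h2o():

# hypothetical toy data frame with one categorical column
df <- data.frame(
  price  = c(1.2, 3.4, 2.2),
  colour = factor(c("red", "green", "red"))
)
# model.matrix() expands factors into 0/1 indicator columns; "- 1" drops the intercept
df_numeric <- as.data.frame(model.matrix(~ . - 1, data = df))
df_numeric  # columns: price, colourgreen, colourred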

2.3 Running the analysis

# run PCA
h2o.no_progress() # turn off progress bars for brevity
h2o.init(max_mem_size = "5g") # connect to H2O instance
# convert data to h2o object
my_basket.h2o <- as.h2o(my_basket)
my_pca <- h2o.prcomp(
  training_frame = my_basket.h2o,
  pca_method = "GramSVD",
  k = ncol(my_basket.h2o),
  transform = "STANDARDIZE",
  impute_missing = TRUE,
  max_runtime_secs = 1000
)

(1) Loading (contribution) of each feature variable on each principal component

my_pca@model$eigenvectors[,c(1:4)]
#           pc1         pc2         pc3        pc4
# 7up      -0.007301550 -0.05311817 -0.27323018 -0.1827380
# lasagna  -0.114769325 -0.24694768  0.11432441  0.2226329
# pepsi    -0.002710653 -0.09608739 -0.29949758 -0.2326095
# yop       0.005479054 -0.06277877 -0.18666903 -0.1335696
# red.wine  0.272476523 -0.11728046  0.08847515 -0.2388557
# 
# ---
#               pc1        pc2        pc3         pc4
# soup         -0.08011242 -0.1508818 0.03499992  0.11826536
# toad.in.hole -0.04994771 -0.1329308 0.11074945  0.12391478
# coco.pops    -0.06154965  0.1633611 0.01836804 -0.11058194
# kitkat        0.18344895  0.2155878 0.08236436  0.09445834
# broccoli     -0.22706222  0.0835667 0.03656377 -0.11632245
# cigarettes    0.14824297  0.1998176 0.10945296  0.04793923

library(dplyr)   # mutate() and the %>% pipe
library(ggplot2)

# dot plot of each feature's loading on PC1
my_pca@model$eigenvectors %>%
  as.data.frame() %>%
  mutate(feature = row.names(.)) %>%
  ggplot(aes(pc1, reorder(feature, pc1))) +
  geom_point()

(2) Importance of each principal component / proportion of variance explained (PVE)

my_pca@model$importance[,1:4]
#                             pc1        pc2        pc3        pc4
# Standard deviation     1.51391887 1.47376825 1.45911373 1.44063487
# Proportion of Variance 0.05457025 0.05171412 0.05069078 0.04941497
# Cumulative Proportion  0.05457025 0.10628436 0.15697514 0.20639012
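A quick arithmetic check of what these rows mean (my reading, not stated in the original post): with standardized features the total variance equals the number of features, here 42, so the Proportion of Variance is simply the squared standard deviation divided by 42.

sd1 <- my_pca@model$importance[1, 1]  # standard deviation of PC1, ~1.5139
sd1^2 / ncol(my_basket)               # ~0.0546, matching PC1's Proportion of Variance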

# plot the PVE (row 2 of the importance table) for each principal component
data.frame(
  PC = my_pca@model$importance %>% seq_along(),
  PVE = my_pca@model$importance %>% .[2,] %>% unlist()
) %>%
  tidyr::gather(metric, variance_explained, -PC) %>%
  ggplot(aes(PC, variance_explained)) +
  geom_point() 

(3) Score of each consumer on the principal components

# project each basket onto the principal components
pred <- predict(my_pca, my_basket.h2o)
pred[1:4, 1:4]
#     PC1      PC2         PC3         PC4
# 1  1.4129574 1.345266  1.68478079  0.38638015
# 2 -2.6380764 1.915549  0.04180579 -0.53584123
# 3 -0.9495209 2.173075 -0.46798305 -0.09323332
# 4  3.1338889 1.042791 -0.52911233 -0.47940245
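What predict() returns can be reproduced by hand (a sketch under my own assumptions, not part of the original post): the scores are the standardized data matrix multiplied by the eigenvector matrix, so base R gives the same values up to the arbitrary sign of each component.

Z <- scale(as.matrix(my_basket))                          # standardize, matching transform = "STANDARDIZE"
V <- as.matrix(as.data.frame(my_pca@model$eigenvectors))  # 42 x 42 loading matrix
scores <- Z %*% V
scores[1:4, 1:4]  # should agree with pred[1:4, 1:4] up to component sign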

(4) How many principal components should be kept?

The frank answer is that there is no one best method for determining how many components to use.

That said, several heuristics can guide the choice, for example:

Cumulative variance explained (CVE): keep enough components to reach a target share of the total variance.

# How many PCs are required to explain at least 75% of total variability?
cve <- my_pca@model$importance[3,]
min(which(cve >= 0.75))
# [1] 27

data.frame(
  PC = my_pca@model$importance %>% seq_along(),
  CVE = my_pca@model$importance %>% .[3,] %>% unlist()
) %>%
  tidyr::gather(metric, variance_explained, -PC) %>%
  ggplot(aes(PC, variance_explained)) +
  geom_point() +
  geom_vline(aes(xintercept = 27), colour = "red")

Scree plot criterion: look for the "elbow", the point where the PC-versus-PVE scatter plot drops off sharply.

data.frame(
  PC = my_pca@model$importance %>% seq_along(),
  PVE = my_pca@model$importance %>% .[2,] %>% unlist()
) %>%
  ggplot(aes(PC, PVE, group = 1, label = PC)) +
  geom_point() +
  geom_line() +
  geom_text(nudge_y = -.002)