Microarray data analysis: correlation plots in R

[R>>PCA] Principal Component Analysis

2021-05-11  高大石头

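PCA (principal component analysis) applies a linear transformation to n original variables (dimensions) and produces n new, mutually uncorrelated variables called principal components (PCs), ordered by the share of variance each one explains.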
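The three properties in this definition (linear transformation, uncorrelated components, ordering by variance) can be checked directly. The sketch below is only an illustration: it assumes the built-in USArrests data set, which is not the data used in this post, and relies on base R alone.

```r
# Minimal sketch; USArrests is an assumed stand-in data set (ships with base R).
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# 1. The PCs are ordered by variance: PC1 has the largest, the last PC the smallest.
pc_var <- apply(pca$x, 2, var)
print(round(pc_var, 3))

# 2. Different PCs are uncorrelated: off-diagonal correlations are numerically zero.
pc_cor <- cor(pca$x)
print(round(pc_cor, 3))

# 3. The transformation is linear: scores = standardized data %*% rotation matrix.
scores_manual <- scale(USArrests) %*% pca$rotation
print(all.equal(unname(scores_manual), unname(pca$x)))
```

The same checks should work on any numeric data frame without missing values or constant columns.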


Core functions:

  • prcomp()
  • princomp()
  • FactoMineR::PCA()

Because the plots produced by the base functions are fairly plain, dedicated R packages were developed to address this: FactoMineR and factoextra.

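# data is an expression matrix; prcomp() treats rows as observations.
# By default prcomp() centers but does not scale (center = TRUE, scale. = FALSE).
# If the data are already standardized, switch both off:
res.pca <- prcomp(data, center = FALSE, scale. = FALSE)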

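The examples below use the decathlon2 data set that ships with the factoextra package.
decathlon2 holds the decathlon (ten-event) results of 27 athletes at two meets, Decastar and OlympicG. The last 4 athletes (rows 24 to 27) are set aside as supplementary individuals, so only rows 1 to 23 are used to fit the PCA.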

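Note: the input variables should be standardized (centered and scaled to unit variance) before PCA, so that variables measured in different units are comparable.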
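To make the note above concrete, here is a small sketch comparing PCA with and without standardization. It assumes the built-in USArrests data set (not the data used in this post), whose variables are on very different scales; the helper `share_pc1` is a hypothetical name introduced only for this illustration.

```r
# Minimal sketch; USArrests is an assumed stand-in data set.
x <- USArrests

pca_raw <- prcomp(x)                 # centered only (the prcomp default)
pca_std <- prcomp(x, scale. = TRUE)  # centered and scaled to unit variance

# Hypothetical helper: share of total variance captured by PC1.
share_pc1 <- function(p) p$sdev[1]^2 / sum(p$sdev^2)
print(c(raw = share_pc1(pca_raw), scaled = share_pc1(pca_std)))

# Without scaling, PC1 is dominated by the variable with the largest variance.
print(round(pca_raw$rotation[, 1], 3))

# Standardizing by hand and turning off centering/scaling gives the same
# standard deviations as scale. = TRUE.
pca_manual <- prcomp(scale(x), center = FALSE, scale. = FALSE)
print(all.equal(pca_manual$sdev, pca_std$sdev))
```

If a variable has a much larger variance than the others, it tends to take over the first component unless the data are scaled first.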

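Parameters of FactoMineR::PCA():

PCA(X,                  # data frame: rows are individuals, columns are numeric variables
    scale.unit = TRUE,  # scale the variables to unit variance
    ncp = 5,            # number of dimensions kept in the results
    graph = TRUE)       # whether to draw the plots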

1. Basic version

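library(factoextra); data(decathlon2)  # decathlon2 ships with factoextra
decathlon2.active <- decathlon2[1:23, 1:10]  # active individuals and variables
res.pca <- prcomp(decathlon2.active, scale. = TRUE)
res.pca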
Standard deviations (1, .., p=10):
 [1] 2.0308159 1.3559244 1.1131668 0.9052294 0.8375875 0.6502944 0.5500742 0.5238988
 [9] 0.3939758 0.3492435

Rotation (n x k) = (10 x 10):
                      PC1         PC2         PC3         PC4        PC5          PC6
X100m        -0.418859080  0.13230683 -0.27089959  0.03708806 -0.2321476  0.054398099
Long.jump     0.391064807 -0.20713320  0.17117519 -0.12746997  0.2783669 -0.051865558
Shot.put      0.361388111 -0.06298590 -0.46497777  0.14191803 -0.2970589 -0.368739186
High.jump     0.300413236  0.34309742 -0.29652805  0.15968342  0.4807859 -0.437716883
X400m        -0.345478567 -0.21400770 -0.25470839  0.47592968  0.1240569 -0.075796432
X110m.hurdle -0.376265119  0.01824645 -0.40325254 -0.01866477  0.2676975  0.004048005
Discus        0.365965721 -0.03662510 -0.15857927  0.43636361 -0.4873988  0.305315353
Pole.vault   -0.106985591 -0.59549862 -0.08449563 -0.37447391 -0.2646712 -0.503563524
Javeline      0.210864329 -0.28475723 -0.54270782 -0.36646463  0.2361698  0.556821016
X1500m        0.002106782 -0.57855748  0.19715884  0.49491281  0.3142987  0.064663250
                     PC7         PC8         PC9        PC10
X100m        -0.16604375 -0.19988005 -0.76924639  0.12718339
Long.jump    -0.28056361 -0.75850657 -0.13094589  0.08509665
Shot.put     -0.01797323  0.04649571  0.12129309  0.62263702
High.jump     0.05118848  0.16111045 -0.28463225 -0.38244596
X400m         0.52012255 -0.44579641  0.20854176 -0.09784197
X110m.hurdle -0.67276768 -0.01592804  0.41058421 -0.04475363
Discus       -0.25946615 -0.07550934  0.03391600 -0.49418361
Pole.vault   -0.01889413  0.06282691 -0.06540692 -0.39288155
Javeline      0.24281145  0.10086127 -0.10268134 -0.01103627
X1500m       -0.20245828  0.37119711 -0.25950868  0.17991689
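The standard deviations printed at the top of this output are the square roots of the eigenvalues, so the variance explained by each component can be recovered from them. The sketch below copies the ten values from the output above into a vector; the variable names are chosen only for this illustration.

```r
# Standard deviations copied from the prcomp() output above.
sdev <- c(2.0308159, 1.3559244, 1.1131668, 0.9052294, 0.8375875,
          0.6502944, 0.5500742, 0.5238988, 0.3939758, 0.3492435)

# Eigenvalue of each component = squared standard deviation.
eigenvalue <- sdev^2

# Percentage of total variance explained, and its running total.
variance_percent <- 100 * eigenvalue / sum(eigenvalue)
eig_table <- data.frame(
  eigenvalue = eigenvalue,
  variance.percent = variance_percent,
  cumulative.variance.percent = cumsum(variance_percent),
  row.names = paste0("Dim.", seq_along(sdev))
)
print(round(eig_table, 3))
```

Because the ten variables were standardized, the eigenvalues sum to about 10 (the number of variables), and the first eigenvalue comes out at about 4.124.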

Two ways to extract the PC score matrix

res.pca1 <- res.pca$x 
res.pca2 <- predict(res.pca)
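A quick way to confirm that the two approaches agree is sketched below. It assumes the built-in iris data set so that it runs without extra packages; the object names are illustrative only.

```r
# Minimal sketch on the built-in iris data (numeric columns only).
pca <- prcomp(iris[, -5], scale. = TRUE)

scores_x <- pca$x              # scores stored in the fitted object
scores_predict <- predict(pca) # predict() without newdata returns the same scores
print(identical(scores_x, scores_predict))

# predict() can additionally project new observations onto the fitted PCs.
new_obs <- iris[1:3, -5]
new_scores <- predict(pca, newdata = new_obs)
print(new_scores)
```

The practical difference is that `predict()` also accepts `newdata`, which is useful for projecting samples that were not used to fit the PCA.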

2. Advanced version

2.1 Example data

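rm(list = ls())
library(FactoMineR)
library(factoextra)  # provides the decathlon2 data set
data(decathlon2)
decathlon2.active <- decathlon2[1:23, 1:10]  # active individuals and variables
res.pca <- PCA(decathlon2.active, graph = FALSE)
res.pca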
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23 individuals, described by 10 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

2.2 Visualization and interpretation

Core functions:

library(factoextra)
eig.val <- get_eigenvalue(res.pca)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1       4.124            41.24                        41.2
## Dim.2       1.839            18.39                        59.6
## Dim.3       1.239            12.39                        72.0
## Dim.4       0.819             8.19                        80.2
## Dim.5       0.702             7.02                        87.2
## Dim.6       0.423             4.23                        91.5
## Dim.7       0.303             3.03                        94.5
## Dim.8       0.274             2.74                        97.2
## Dim.9       0.155             1.55                        98.8
## Dim.10      0.122             1.22                       100.0

Visualizing the eigenvalues (scree plot)

fviz_eig(res.pca, addlabels = TRUE)

Color variables by group (here, k-means clusters of the variable coordinates)

var <- get_pca_var(res.pca)
set.seed(123)
res.km <- kmeans(var$coord,centers = 3,nstart = 25)
grp <- as.factor(res.km$cluster)
fviz_pca_var(res.pca,col.var = grp,
             palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
             legend.title = "Cluster")
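For readers without FactoMineR at hand, the grouping step can be approximated in base R. This is a sketch under stated assumptions: it uses the built-in mtcars data set rather than decathlon2, and it computes the variable coordinates as loadings multiplied by the component standard deviations, which for a standardized PCA equals the correlation between each variable and each PC.

```r
# Minimal sketch; mtcars is an assumed stand-in data set with 11 numeric variables.
x <- mtcars
pca <- prcomp(x, scale. = TRUE)

# Variable coordinates: each loading column multiplied by that PC's standard deviation.
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# For standardized data these equal the variable-PC correlations.
print(all.equal(var_coord, cor(x, pca$x), check.attributes = FALSE))

# Cluster the variables on their first five coordinates (mirroring ncp = 5).
set.seed(123)
km <- kmeans(var_coord[, 1:5], centers = 3, nstart = 25)
grp <- as.factor(km$cluster)
print(split(names(grp), grp))
```

The resulting factor plays the same role as `grp` in the factoextra call above: one cluster label per variable, used to color the arrows.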

3. Practical version

Using the iris data set as an example, add cluster (ellipse) and group information.

iris.pca <- PCA(iris[,-5],graph = F)
fviz_pca_ind(iris.pca,
             geom = "point",
             col.ind = iris$Species,
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             legend.title = "Groups")

Adding arrows for the variables (biplot)

fviz_pca_biplot(iris.pca, 
                col.ind = iris$Species, palette = "jco", 
                addEllipses = TRUE, label = "var",
                col.var = "black", repel = TRUE,
                legend.title = "Species") 

ggplot2 version

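library(ggplot2)
library(ggsci)  # provides scale_color_lancet()
iris.pca1 <- prcomp(iris[, -5], scale. = TRUE)  # standardize, matching PCA()'s default
pcapredict <- predict(iris.pca1)
rt <- data.frame(PC1 = pcapredict[, 1], PC2 = pcapredict[, 2], group = iris[, 5])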
ggplot(rt,aes(PC1,PC2))+
  geom_point(aes(color=group))+
  scale_color_lancet()+
  theme_bw()+
  theme(plot.margin = unit(rep(1.5,4),"lines"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
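A common addition to a hand-built PCA plot is to show the percentage of variance explained in the axis titles. The sketch below builds such labels with base R only; `axis_labels` is a name introduced for this illustration, and passing the labels to `labs()` is shown as a comment because it assumes the ggplot2 plot above.

```r
# Minimal sketch on the built-in iris data.
pca <- prcomp(iris[, -5], scale. = TRUE)

# Percentage of total variance explained by each component.
pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Axis titles such as "PC1 (xx.x%)".
axis_labels <- sprintf("PC%d (%.1f%%)", seq_along(pct), pct)
print(axis_labels)

# With the ggplot2 plot above, the labels could be added as:
#   ... + labs(x = axis_labels[1], y = axis_labels[2])
```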
