R语言4_PCA and tSNE

PCA-Statistics is the new sexy!!

2019-04-28  本文已影响111人  Juan_NF
Shelork Holmes里的一句台词是,Brainy is the new sexy;学了PCA的推导后,我觉得,统计学对我简直是降维打击,所以改下这句话,Statistics is the new sexy!!!

理解PCA

PCA是为了更好地展示多维数据,通过线性转化,展示保留最多信息的主成分;将样本尽可能地分散地展示在坐标轴中达到可视化的目的;

PCA步骤:

1)数据为m行n列的原始矩阵(sample为行,gene为列)
2)矩阵X每一个元素减去该列的均值(中心化)
目的是使所有维度的偏移都是以0为基础的(我们必须对数据中individual(sample)和observations(gene)有区分和了解)
3)求出协方差矩阵
协方差矩阵为对称矩阵,对角线为每行方差,其他元素分别为不同行的协方差,协方差表示的是各行元素之间的线性相关性;
4)目的是协方差矩阵中除对角线外的元素为0,即实现协方差矩阵对角化;
协方差矩阵为可对角化矩阵,对角化后的矩阵中,对角线上的元素保持不变,但对角线外的元素为0;这样便实现了使各行元素之间无相关性;对协方差矩阵进行特征值分解即可获得P;

image.png

5)将P按特征值进行排序,因为Y=PX,所以,中心化后的矩阵(转置)与特征向量矩阵(转置)乘积后得到新的样本矩阵,取前两行即PC1和PC2;

我们用R的基础函数简略演示下新的样本矩阵的由来,也更好的理解一下PCA
1.以下是原始矩阵:
> rm(list=ls())
> library("FactoMineR")
> library("factoextra")
> data("decathlon2")
> decathlon2.active <- decathlon2[1:23, 1:10]
> decathlon2.active
            X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
SEBRLE      11.04      7.58    14.83      2.07 49.81        14.69  43.75
CLAY        10.76      7.40    14.26      1.86 49.37        14.05  50.72
BERNARD     11.02      7.23    14.25      1.92 48.93        14.99  40.87
YURKOV      11.34      7.09    15.19      2.10 50.42        15.31  46.26
ZSIVOCZKY   11.13      7.30    13.48      2.01 48.62        14.17  45.67
McMULLEN    10.83      7.31    13.76      2.13 49.91        14.38  44.41
MARTINEAU   11.64      6.81    14.57      1.95 50.14        14.93  47.60
HERNU       11.37      7.56    14.41      1.86 51.10        15.06  44.99
BARRAS      11.33      6.97    14.09      1.95 49.48        14.48  42.10
NOOL        11.33      7.27    12.68      1.98 49.20        15.29  37.92
BOURGUIGNON 11.36      6.80    13.46      1.86 51.16        15.67  40.49
Sebrle      10.85      7.84    16.36      2.12 48.36        14.05  48.72
Clay        10.44      7.96    15.23      2.06 49.19        14.13  50.11
Karpov      10.50      7.81    15.93      2.09 46.81        13.97  51.65
Macey       10.89      7.47    15.73      2.15 48.97        14.56  48.34
Warners     10.62      7.74    14.48      1.97 47.97        14.01  43.73
Zsivoczky   10.91      7.14    15.31      2.12 49.40        14.95  45.62
Hernu       10.97      7.19    14.65      2.03 48.73        14.25  44.72
Bernard     10.69      7.48    14.80      2.12 49.13        14.17  44.75
Schwarzl    10.98      7.49    14.01      1.94 49.76        14.25  42.43
Pogorelov   10.95      7.31    15.10      2.06 50.79        14.21  44.60
Schoenbeck  10.90      7.30    14.77      1.88 50.30        14.34  44.41
Barras      11.14      6.99    14.91      1.94 49.41        14.37  44.83
            Pole.vault Javeline X1500m
SEBRLE            5.02    63.19 291.70
CLAY              4.92    60.15 301.50
BERNARD           5.32    62.77 280.10
YURKOV            4.72    63.44 276.40
ZSIVOCZKY         4.42    55.37 268.00
McMULLEN          4.42    56.37 285.10
MARTINEAU         4.92    52.33 262.10
HERNU             4.82    57.19 285.10
BARRAS            4.72    55.40 282.00
NOOL              4.62    57.44 266.60
BOURGUIGNON       5.02    54.68 291.70
Sebrle            5.00    70.52 280.01
Clay              4.90    69.71 282.00
Karpov            4.60    55.54 278.11
Macey             4.40    58.46 265.42
Warners           4.90    55.39 278.05
Zsivoczky         4.70    63.45 269.54
Hernu             4.80    57.76 264.35
Bernard           4.40    55.27 276.31
Schwarzl          5.10    56.32 273.56
Pogorelov         5.00    53.45 287.63
Schoenbeck        5.00    60.89 278.82
Barras            4.60    64.55 267.09
2.以下是对原始矩阵的(维度)进行中心化的结果:
> #######R_base协方差|特征值|特征向量
> #tmp<-rowMeans(t(decathlon2.active))
> #center_deca<-t(decathlon2.active)-tmp
> center_d<-scale(decathlon2.active,center=TRUE,scale=FALSE)
> center_d
                  X100m   Long.jump Shot.put    High.jump       X400m
SEBRLE       0.04043478  0.23043478     0.21  0.062608696  0.37695652
CLAY        -0.23956522  0.05043478    -0.36 -0.147391304 -0.06304348
BERNARD      0.02043478 -0.11956522    -0.37 -0.087391304 -0.50304348
YURKOV       0.34043478 -0.25956522     0.57  0.092608696  0.98695652
ZSIVOCZKY    0.13043478 -0.04956522    -1.14  0.002608696 -0.81304348
McMULLEN    -0.16956522 -0.03956522    -0.86  0.122608696  0.47695652
MARTINEAU    0.64043478 -0.53956522    -0.05 -0.057391304  0.70695652
HERNU        0.37043478  0.21043478    -0.21 -0.147391304  1.66695652
BARRAS       0.33043478 -0.37956522    -0.53 -0.057391304  0.04695652
NOOL         0.33043478 -0.07956522    -1.94 -0.027391304 -0.23304348
BOURGUIGNON  0.36043478 -0.54956522    -1.16 -0.147391304  1.72695652
Sebrle      -0.14956522  0.49043478     1.74  0.112608696 -1.07304348
Clay        -0.55956522  0.61043478     0.61  0.052608696 -0.24304348
Karpov      -0.49956522  0.46043478     1.31  0.082608696 -2.62304348
Macey       -0.10956522  0.12043478     1.11  0.142608696 -0.46304348
Warners     -0.37956522  0.39043478    -0.14 -0.037391304 -1.46304348
Zsivoczky   -0.08956522 -0.20956522     0.69  0.112608696 -0.03304348
Hernu       -0.02956522 -0.15956522     0.03  0.022608696 -0.70304348
Bernard     -0.30956522  0.13043478     0.18  0.112608696 -0.30304348
Schwarzl    -0.01956522  0.14043478    -0.61 -0.067391304  0.32695652
Pogorelov   -0.04956522 -0.03956522     0.48  0.052608696  1.35695652
Schoenbeck  -0.09956522 -0.04956522     0.15 -0.127391304  0.86695652
Barras       0.14043478 -0.35956522     0.29 -0.067391304 -0.02304348
            X110m.hurdle     Discus   Pole.vault   Javeline      X1500m
SEBRLE        0.15608696 -1.4104348  0.223478261  4.0752174  13.8221739
CLAY         -0.48391304  5.5595652  0.123478261  1.0352174  23.6221739
BERNARD       0.45608696 -4.2904348  0.523478261  3.6552174   2.2221739
YURKOV        0.77608696  1.0995652 -0.076521739  4.3252174  -1.4778261
ZSIVOCZKY    -0.36391304  0.5095652 -0.376521739 -3.7447826  -9.8778261
McMULLEN     -0.15391304 -0.7504348 -0.376521739 -2.7447826   7.2221739
MARTINEAU     0.39608696  2.4395652  0.123478261 -6.7847826 -15.7778261
HERNU         0.52608696 -0.1704348  0.023478261 -1.9247826   7.2221739
BARRAS       -0.05391304 -3.0604348 -0.076521739 -3.7147826   4.1221739
NOOL          0.75608696 -7.2404348 -0.176521739 -1.6747826 -11.2778261
BOURGUIGNON   1.13608696 -4.6704348  0.223478261 -4.4347826  13.8221739
Sebrle       -0.48391304  3.5595652  0.203478261 11.4052174   2.1321739
Clay         -0.40391304  4.9495652  0.103478261 10.5952174   4.1221739
Karpov       -0.56391304  6.4895652 -0.196521739 -3.5747826   0.2321739
Macey         0.02608696  3.1795652 -0.396521739 -0.6547826 -12.4578261
Warners      -0.52391304 -1.4304348  0.103478261 -3.7247826   0.1721739
Zsivoczky     0.41608696  0.4595652 -0.096521739  4.3352174  -8.3378261
Hernu        -0.28391304 -0.4404348  0.003478261 -1.3547826 -13.5278261
Bernard      -0.36391304 -0.4104348 -0.396521739 -3.8447826  -1.5678261
Schwarzl     -0.28391304 -2.7304348  0.303478261 -2.7947826  -4.3178261
Pogorelov    -0.32391304 -0.5604348  0.203478261 -5.6647826   9.7521739
Schoenbeck   -0.19391304 -0.7504348  0.203478261  1.7752174   0.9421739
Barras       -0.16391304 -0.3304348 -0.196521739  5.4352174 -10.7878261
attr(,"scaled:center")
       X100m    Long.jump     Shot.put    High.jump        X400m X110m.hurdle 
   10.999565     7.349565    14.620000     2.007391    49.433043    14.533913 
      Discus   Pole.vault     Javeline       X1500m 
   45.160435     4.796522    59.114783   277.877826 
3.以下是对中心化后矩阵求协方差矩阵结果:
> cov_deca<-cov(center_d)
> cov_deca
                    X100m   Long.jump     Shot.put    High.jump       X400m
X100m         0.090686166 -0.07183202 -0.113395455 -0.011728458  0.18034684
Long.jump    -0.071832016  0.09821344  0.116295455  0.010353360 -0.16100771
Shot.put     -0.113395455  0.11629545  0.713518182  0.043518182 -0.26718636
High.jump    -0.011728458  0.01035336  0.043518182  0.009347431 -0.03579170
X400m         0.180346838 -0.16100771 -0.267186364 -0.035791700  1.01420395
X110m.hurdle  0.106465415 -0.09000731 -0.155872727 -0.011534783  0.28359209
Discus       -0.483517984  0.47429111  1.992863636  0.109001186 -1.21791502
Pole.vault    0.006589328  0.00160751  0.003827273 -0.012027668  0.06342925
Javeline     -0.436175099  0.56543854  1.992636364  0.106272134 -0.66853340
X1500m       -0.668191897  0.69294901 -0.396386364 -0.251305929  2.95288419
             X110m.hurdle     Discus   Pole.vault   Javeline      X1500m
X100m          0.10646542 -0.4835180  0.006589328 -0.4361751  -0.6681919
Long.jump     -0.09000731  0.4742911  0.001607510  0.5654385   0.6929490
Shot.put      -0.15587273  1.9928636  0.003827273  1.9926364  -0.3963864
High.jump     -0.01153478  0.1090012 -0.012027668  0.1062721  -0.2513059
X400m          0.28359209 -1.2179150  0.063429249 -0.6685334   2.9528842
X110m.hurdle   0.23411581 -0.8522518  0.017505138 -0.1694377  -0.2303729
Discus        -0.85225178 11.0625316 -0.157434783  4.6430569   2.5717646
Pole.vault     0.01750514 -0.1574348  0.062278261  0.2798992   0.9729648
Javeline      -0.16943775  4.6430569  0.279899209 24.3063261   4.4479336
X1500m        -0.23037292  2.5717646  0.972964822  4.4479336 100.2299542
4.对协方差矩阵求特征向量和特征值:
> deca_rotation<-eigen(cov_deca)
> deca_rotation
eigen() decomposition
$values
 [1] 1.006840e+02 2.578783e+01 9.968741e+00 8.647143e-01 2.822286e-01 1.232549e-01
 [7] 4.927276e-02 4.052578e-02 1.742956e-02 3.137752e-03

$vectors
              [,1]         [,2]         [,3]        [,4]         [,5]
 [1,]  0.006986928 -0.021005809  0.035210712 -0.19034812 -0.071024702
 [2,] -0.007312503  0.025532679 -0.029814632  0.16625248  0.067170085
 [3,]  0.002193901  0.101099662 -0.134997417 -0.01096451  0.970306608
 [4,]  0.002402614  0.006076201 -0.008013291  0.01753102  0.059193880
 [5,] -0.028740478 -0.049811926  0.111757272 -0.91565675  0.057823061
 [6,]  0.002581313 -0.017216936  0.081278216 -0.27124976 -0.055223739
 [7,] -0.031355058  0.306079695 -0.924526837 -0.14964630 -0.160947293
 [8,] -0.009777235  0.005798730  0.025326389 -0.01932767  0.101075155
 [9,] -0.059802247  0.942306214  0.324118956 -0.01034241 -0.053718141
[10,] -0.997195770 -0.064897755  0.006522677  0.02866866  0.006768862
              [,6]         [,7]        [,8]         [,9]         [,10]
 [1,] -0.291402114  0.320507924  0.13921549  0.861711536 -0.0861683397
 [2,]  0.251412655 -0.413861269 -0.75078605  0.408212303  0.0380464490
 [3,] -0.109846963 -0.021747143  0.09819153  0.047978292  0.0738364582
 [4,]  0.002220758 -0.234201394  0.08344638 -0.013058493 -0.9664847617
 [5,]  0.361562371 -0.078839429 -0.06140280 -0.041452501  0.0008151440
 [6,] -0.840883209 -0.290782847 -0.29340766 -0.193359991  0.0367327281
 [7,] -0.030340336  0.017305544 -0.02788910 -0.013974306 -0.0095425190
 [8,] -0.026222073  0.761309040 -0.55608405 -0.221336323 -0.2239225812
 [9,]  0.011114470  0.002280716  0.01715753  0.003689028  0.0005149581
[10,] -0.016174502 -0.001957392  0.01320850  0.006200789 -0.0005126593
5.中心化后矩阵(转置)与特征向量(转置)乘积后取前两行即PC1和PC2:
> (t(deca_rotation$vectors)%*%t(center_d))[1:2,]
         SEBRLE       CLAY   BERNARD   YURKOV ZSIVOCZKY  McMULLEN MARTINEAU
[1,] -13.996304 -23.795992 -2.289494 1.160683 10.082984 -7.027159 16.050504
[2,]   2.517846   1.125352  1.965726 4.489585 -2.806377 -3.391762 -4.696647
         HERNU    BARRAS      NOOL BOURGUIGNON    Sebrle      Clay    Karpov
[1,] -7.128017 -3.789459 11.582324  -13.417010 -2.892805 -4.901404 -0.149155
[2,] -2.451095 -4.777138 -3.270108   -6.749538 11.953568 11.340372 -1.102863
        Macey    Warners Zsivoczky      Hernu   Bernard  Schwarzl Pogorelov
[1,] 12.38074  0.1296911  8.046438 13.6052392  1.815421  4.542704 -9.409190
[2,]  1.30346 -3.8728285  4.827768 -0.4938572 -3.599063 -3.256776 -6.154349
     Schoenbeck    Barras
[1,]  -1.049881 10.449143
[2,]   1.358511  5.740215

我们汲汲以求的PCA其实早有对统计学烂熟于心的人做了R包,不得不说,数学才是王道啊!!!

对比下在R的现成的PCA功能的结果
  • FactoMineR和factoextra配合做PCA和可视化;
  • prcomp(stats base级别)和autoplot配合做PCA和可视化;
######prcomp为stats自带的PCA函数
deca_re<-prcomp(decathlon2.active)
#####rotation-包含特征向量的矩阵
deca_re$rotation[, 1]
#####x-如果retx参数设为TRUE,则返回rotated矩阵(中心化(scaled,如果有设定)矩阵乘以rotation矩阵)的结果;cov(x)为协方差矩阵;
deca_re$x
#####ggfortify使ggplot2功能更加丰富,使autoplot能够处理prcomp的结果
library(ggfortify)
autoplot(prcomp(decathlon2.active),label=TRUE,loadings=TRUE)
res<-PCA(X = decathlon2.active, scale.unit = FALSE, ncp = 5, graph = FALSE)
res$ind$coord
#####res调出对PCA函数结果的更详尽说明
res
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 23 individuals, described by 10 variables
*The results are available in the following objects:

   name               description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error of the variables"    
14 "$call$row.w"      "weights for the individuals"        
15 "$call$col.w"      "weights for the variables"
######fviz_pca_ind对individual绘图,fviz_pca_var对variable绘图
fviz_pca_ind(res)
#####对coord处理后获得特征向量,与prcomp中的rotation一致
loadings<-sweep(res$var$coord,2,sqrt(res$eig[1:5,1]),FUN="/")
loadings
PCA
看着跟前面的图坐标位置哪儿哪儿不一样,后面再用$x画图后,看到是坐标比例的差异,再对比发现,与上图是某种镜像的关系,相对位置其实是一样的:
prcomp
Rplot_deca$x

参考内容
1.https://www.zhihu.com/question/36481348
2.http://www.sthda.com/english/wiki/print.php?id=204
3.https://wenku.baidu.com/view/ce7ee04bcc175527072208ca.html
4.https://zhuanlan.zhihu.com/p/21580949
5.https://groups.google.com/forum/#!topic/factominer-users/BRN8jRm-_EM
6.http://blog.sciencenet.cn/home.php?mod=space&uid=443073&do=blog&id=321347


课程分享
生信技能树全球公益巡讲
https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
B站公益74小时生信工程师教学视频合辑
https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
招学徒:
https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw

上一篇 下一篇

猜你喜欢

热点阅读