
Hierarchical clustering on PCA results

2020-06-06  caokai001

References:

Clustering with selected Principal Components
R: principal component analysis (PCA) with prcomp
https://github.com/vallotlab/scChIPseq/blob/master/R_scChIP_seq_analysis.R
A bit of R each day: hierarchical clustering in practice with dist, hclust, heatmap, etc.

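Motivation:

When running hierarchical clustering after dimensionality reduction, the usual choice is dist() to compute the distances between samples in the reduced space. The referenced article uses 1-cor() in place of that distance, so I am recording both approaches here.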


1. Simulated data

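set.seed(1995)
# fix the random seed so the result is reproducible
data=matrix(abs(round(rnorm(100, mean=1000, sd=500))), 10, 10)
# random non-negative integers, 10 rows, 10 columns
colnames(data)=paste("variable", 1:10, sep=".")
# column names: variables
rownames(data)=paste("sample", 1:10, sep=".")
# row names: samples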

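2. Standardization

R function: scale(data, center=T/F, scale=T/F)
center (centering): subtract each column's mean from that column
scale (scaling): divide each centered column by its standard deviation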
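
As a quick check of the two steps described above, the sketch below redoes them by hand and compares the result with scale(). The matrix m is a hypothetical toy input that is not part of the original post.

```r
# Hypothetical toy matrix, used only for this check
set.seed(1)
m <- matrix(rnorm(20, mean = 10, sd = 3), nrow = 5, ncol = 4)

# Centering: subtract each column's mean
centered <- sweep(m, 2, colMeans(m), FUN = "-")

# Scaling: divide each centered column by its sample standard deviation
manual <- sweep(centered, 2, apply(m, 2, sd), FUN = "/")

# scale() attaches extra attributes, so compare the values only
all.equal(manual, scale(m), check.attributes = FALSE)
```

After this step every column has mean 0 and standard deviation 1, so no single variable dominates the PCA because of its units.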

# standardize the data with scale()
data2=scale(data) ## default arguments: center=T, scale=T

# plot() on a matrix only draws the first two columns
plot(data2, main="scaled data")

3. PCA

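data2.pca <- stats::prcomp(data2, center=F, scale=F)
# run PCA; data2 is already standardized, so centering and scaling are switched off
data2.pca
# print the PCA result
plot(data2.pca$x)
# scatter plot of the samples on PC1 and PC2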
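
The clustering in the next section keeps only the first two or three components. One way to judge how much of the total variance those components retain is to look at the cumulative proportion of variance. The sketch below regenerates the same simulated data so that it runs on its own; the names pca and var_explained are introduced here for illustration only.

```r
# Same simulation as in section 1, repeated so this block is self-contained
set.seed(1995)
data <- matrix(abs(round(rnorm(100, mean = 1000, sd = 500))), 10, 10)
pca <- stats::prcomp(scale(data), center = FALSE, scale. = FALSE)

# Proportion of the total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative proportion: how much is kept by the first k components
round(cumsum(var_explained), 3)

# summary() reports the same quantity
summary(pca)$importance["Cumulative Proportion", ]
```

If the first two or three values are low, a clustering built on them alone discards a large part of the structure in the data.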

4. Hierarchical clustering: clustering the samples on their PCA coordinates

Distances computed with dist()

x = data2.pca$x[,1]
y = data2.pca$x[,2]
z = data2.pca$x[,3]  
#*****************************************************************
# Create clusters
#******************************************************************         
# create and plot clusters based on the first and second principal components
hc = hclust(dist(cbind(x,y)), method = 'ward.D2')
plot(hc, axes=F,xlab='', ylab='',sub ='', main='Comp 1/2')
rect.hclust(hc, k=3, border='red')
# create and plot clusters based on the first, second, and third principal components
hc = hclust(dist(cbind(x,y,z)), method = 'ward.D2')
plot(hc, axes=F,xlab='', ylab='',sub ='', main='Comp 1/2/3')
rect.hclust(hc, k=3, border='red')

Correlation used in place of distance

# create and plot clusters based on the correlation among samples
# cor() correlates columns, so transpose to put one sample per column
mati <- as.matrix(t(data2.pca$x[,1:10]))

hc = hclust(as.dist(1-cor(mati)), method = 'ward.D2')
plot(hc, axes=F,xlab='', ylab='',sub ='', main='Correlation')
rect.hclust(hc, k=3, border='red')
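
One way to read 1-cor(): if each sample's vector is first standardized to mean 0 and standard deviation 1, the squared Euclidean distance between two samples equals 2(n-1)(1-r), where n is the number of features and r is their Pearson correlation. The correlation dissimilarity therefore acts like a Euclidean distance that ignores each sample's overall level and spread. The sketch below checks this identity on a hypothetical random matrix m that is not taken from the post.

```r
# Hypothetical toy data: 5 samples (rows) by 6 features (columns)
set.seed(1)
m <- matrix(rnorm(30), nrow = 5, ncol = 6)

# Correlation between samples: cor() works on columns, so transpose first
r <- cor(t(m))

# Standardize each sample (row) to mean 0 and sd 1,
# then take squared Euclidean distances between samples
z <- t(scale(t(m)))
d2 <- as.matrix(dist(z))^2

# Identity: squared distance = 2 * (n - 1) * (1 - r)
n <- ncol(m)
all.equal(d2, 2 * (n - 1) * (1 - r), check.attributes = FALSE)
```

Plain dist() on the PCA coordinates skips that per-sample standardization, which is the practical difference between the two approaches.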

Thoughts:

I am not sure which of the two approaches is better. Comments and discussion are welcome~😂
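
A possible starting point for the comparison is to cut both trees into the same number of clusters and cross-tabulate the memberships. This is only a sketch: it regenerates the simulated data so it runs on its own, uses the first three components for the dist() version as an assumption, and introduces the names hc_dist, hc_cor, cl_dist and cl_cor for illustration.

```r
# Same simulation as in section 1, repeated so this block is self-contained
set.seed(1995)
data <- matrix(abs(round(rnorm(100, mean = 1000, sd = 500))), 10, 10)
rownames(data) <- paste("sample", 1:10, sep = ".")
pcs <- stats::prcomp(scale(data), center = FALSE, scale. = FALSE)$x

# Tree 1: Euclidean distance on the first three components
hc_dist <- hclust(dist(pcs[, 1:3]), method = "ward.D2")

# Tree 2: 1 - correlation between samples over all components
hc_cor <- hclust(as.dist(1 - cor(t(pcs))), method = "ward.D2")

# Cut both trees into k = 3 clusters
cl_dist <- cutree(hc_dist, k = 3)
cl_cor <- cutree(hc_cor, k = 3)

# Cross-tabulate: one non-zero cell per row and column would mean
# the two partitions agree up to relabeling
table(dist = cl_dist, cor = cl_cor)
```

The data here are pure random noise, so no grouping is the correct one and the table only shows how far the two methods diverge. Data with known groups would be needed to say which one recovers them better.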
