R语言与统计分析生物信息学与算法诗翔的R语言学习之路

「r<-统计分析」层次聚类与非层次聚类

2020-07-03  本文已影响0人  王诗翔

原文链接:https://www.rpubs.com/dvallslanaquera/clustering

层次聚类 (HC)

在这个分析中,我们将看到如何创建层次聚类模型。目的是探索数据库中是否存在相似性组,并查看它们的行为。

例如,我们将使用Doubs数据库,该数据库基于从法国Doubs河中提取的鱼类样本的物理特征。其目的是查看样本的行为以及如何对数据进行分组。

1- 数据准备

我们需要删除带有双零或NA值的行,否则当我们尝试创建树状图时,它们将会出现问题。然后我们需要根据它们的距离对值进行规格化。这次我们将使用欧氏距离,但也有其他有用的距离方法。

library(cluster)
library(vegan)
library(ade4)

data(doubs)
spe <- doubs$fish[-8,] 
env <- doubs$env[-8,]
spa <- doubs$xy[-8,]

spe.norm <- decostand(spe, "normalize") 
spe.ch <- vegdist(spe.norm, "euc") 

2- 聚类方法选择

我们将选择所有可用的方法,然后我们将选择一个最佳的验证分析。方法有单点法、完整法、平均法、质心法和Ward法(single, complete, average, centroid and Ward)。

par(mfrow = c(2, 3)) 

spe.ch.single <- hclust(spe.ch, method = "single") 
plot(spe.ch.single, sub = "Single method")

spe.ch.complete <- hclust(spe.ch, method = "complete")
plot(spe.ch.complete, sub = "Complete method") 

spe.ch.UPGMA <- hclust(spe.ch, method = "average") 
plot(spe.ch.UPGMA, sub = "Average method")

spe.ch.centroid <- hclust(spe.ch, method = "centroid") 
plot(spe.ch.centroid, sub = "Centroid method")

spe.ch.ward <- hclust(spe.ch, method = "ward.D") 
plot(spe.ch.ward, sub = "Ward method")

这个树状图的结构证明是有问题的。我们可以在树状图上观察到重叠,因此这种方法不再有效。

3- 选择最佳方法

在质心法的情况下,我们可以看到过拟合。在其他情况下,我们必须计算:

3.1. 同表型相关性

spe.ch.single.coph <- cophenetic(spe.ch.single) # Single
paste("Single cophenetic:", cor(spe.ch, spe.ch.single.coph))
## [1] "Single cophenetic: 0.599192957534406"
spe.ch.comp.coph <- cophenetic(spe.ch.complete) # Complete
paste("Complete cophenetic:", cor(spe.ch, spe.ch.comp.coph))
## [1] "Complete cophenetic: 0.765562801324477"
spe.ch.UPGMA.coph <- cophenetic(spe.ch.UPGMA) # Average
paste("UPGMA cophenetic:", cor(spe.ch, spe.ch.UPGMA.coph))
## [1] "UPGMA cophenetic: 0.860832629864453"
spe.ch.ward.coph <- cophenetic(spe.ch.ward) # Ward 
paste("Ward cophenetic:", cor(spe.ch, spe.ch.ward.coph))
## [1] "Ward cophenetic: 0.759393401189188"
cor(spe.ch, spe.ch.ward.coph, method = "spearman")
## [1] 0.7661171

同表型相关的图示(谢泼德图 Shepard diagram)。越靠近对角线越好。

par(mfrow = c(2, 2))
plot(spe.ch, spe.ch.single.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
     main = c("Single linkage", paste("Cophenetic correlation =",
                                    round(cor(spe.ch, spe.ch.single.coph), 3))))
abline(0, 1);  lines(lowess(spe.ch, spe.ch.single.coph), col = "red")

plot(spe.ch, spe.ch.comp.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)), ylim = c(0,sqrt(2)),
     main = c("Complete linkage", paste("Cophenetic correlation =",
                                      round(cor(spe.ch, spe.ch.comp.coph), 3))))
abline(0, 1);  lines(lowess(spe.ch, spe.ch.comp.coph), col = "red")

plot(spe.ch, spe.ch.UPGMA.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)), ylim = c(0,sqrt(2)),
     main = c("UPGMA", paste("Cophenetic correlation =",
                           round(cor(spe.ch, spe.ch.UPGMA.coph), 3))))
abline(0, 1);  lines(lowess(spe.ch, spe.ch.UPGMA.coph), col = "red")

plot(spe.ch, spe.ch.ward.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)), 
     ylim = c(0,max(spe.ch.ward$height)),
     main = c("Ward clustering", paste("Cophenetic correlation =",
                                     round(cor(spe.ch, spe.ch.ward.coph), 3))))
abline(0, 1);  lines(lowess(spe.ch, spe.ch.ward.coph), col = "red")

最佳拟合方法为UPGMA,同位相关系数为0.861,比较高。

高尔距离值(越低越好)

(gow.dist.single <- sum((spe.ch - spe.ch.single.coph) ^ 2))
## [1] 95.41391
(gow.dist.comp <- sum((spe.ch - spe.ch.comp.coph) ^ 2))
## [1] 40.48897
(gow.dist.UPGMA <- sum((spe.ch - spe.ch.UPGMA.coph) ^ 2))
## [1] 11.6746
(gow.dist.ward <- sum((spe.ch - spe.ch.ward.coph) ^ 2))
## [1] 8001.85

最佳方法仍然是UPGMA

3- 最后聚类数目的选择

为了达到这个目的,我们需要 3 个不同的检验:

a- Fussion 水平图

dev.off()
## null device 
##           1
plot(spe.ch.ward$height, nrow(spe) : 2, type = "S", 
     main = "Fusion levels - Chord - Ward", 
     ylab = "k (number of clusters)", xlab = "h (node height)", col = "grey")
text(spe.ch.ward$height, nrow(spe) : 2, nrow(spe) : 2, col = "red", cex = 0.8)

根据图结果,最佳数目是 2 或 4。

b- Silhouette 图

asw <- numeric(nrow(spe))
for(k in 2:(nrow(spe) - 1)){
  sil <- silhouette(cutree(spe.ch.ward, k = k), spe.ch)
  asw[k] <- summary(sil)$avg.width}
k.best <- which.max(asw)

plot(1: nrow(spe), asw, type="h", 
     main = "Silhouette-optimal number of clusters", 
     xlab = "k (number of groups)", ylab = "Average silhouette width")
axis(1, k.best, paste("optimum", k.best, sep = "\n"), col = "red", font = 2,
     col.axis = "red")
points(k.best, max(asw), pch = 16, col = "red", cex = 1.5)
cat("", "Silhouette-optimal number of clusters k =", k.best, "\n", 
    "with an average silhouette width of", max(asw), "\n")
##  Silhouette-optimal number of clusters k = 2 
##  with an average silhouette width of 0.3658319

c- Mantel 值

grpdist <- function(X){
  require(cluster)
  gr <- as.data.frame(as.factor(X))
  distgr <- daisy(gr, "gower")
  distgr}

kt <- data.frame(k = 1:nrow(spe), r = 0)
for(i in 2:(nrow(spe) - 1)){
  gr <- cutree(spe.ch.ward, i)
  distgr <- grpdist(gr)
  mt <- cor(spe.ch, distgr, method = "pearson")
  kt[i, 2] <- mt}
k.best <- which.max(kt$r)

plot(kt$k, kt$r, type = "h", main = "Mantel-optimal number of clusters", 
     xlab = "k (number of groups)", ylab = "Pearson's correlation")
axis(1, k.best, paste("optimum", k.best, sep = "\n"), col = "red", font = 2,
     col.axis = "red")
points(k.best, max(kt$r), pch = 16, col = "red", cex = 1.5)
cat("", "Mantel-optimal number of clusters k =", k.best, "\n", 
    "with a matrix linear correlation of", max(kt$r), "\n")
##  Mantel-optimal number of clusters k = 4 
##  with a matrix linear correlation of 0.7154912

第一和第二种方法表明,聚类的最佳数量是k = 2,而 Mantel 值表明,数量必须是4。由于理论表明物种的数量为4,我们将在k = 4时验证我们的模型。

4- 最后模型验证 k = 4

k <- 4 
cutg <- cutree(spe.ch.ward, k = k)
sil <- silhouette(cutg, spe.ch)
rownames(sil) <- row.names(spe)

plot(sil, main = "Silhouette plot", 
     cex.names = 0.8, col = 2:(k + 1), nmax = 100)

这些组是一致的,但是第二组有两个未分类的值。

5- 最后的图

到目前为止,我们已经通过UPGMA方法将我们的数据分组为4个簇。现在我们将使用Francois Gillet(2012)创建的hcoplot函数来描述树图的行为。

hcoplot <- function(tree, diss, k, 
                      title = paste("Reordered dendrogram from", deparse(tree$call), sep = "\n"))
{
  require(gclus)
  gr <- cutree(tree, k = k)
  tor <- reorder.hclust(tree, diss)
  plot(tor, hang = -1, xlab = paste(length(gr),"sites"), sub = paste(k, "clusters"), 
       main = title)
  so <- gr[tor$order]
  gro <- numeric(k)
  for (i in 1:k)
  {
    gro[i] <- so[1]
    if (i<k) so <- so[so != gro[i]]
  }
  rect.hclust(tor, k = k, border = gro + 1, cluster = gr)
  legend("topright", paste("Cluster", 1:k), pch = 22, col = 2:(k + 1), bty = "n")
}

hcoplot(spe.ch.ward, spe.ch, k = 4)

非层次聚类 (NHC)

这次我们将做一个k均值聚类模型。

1-数据标准化

之前已经做过。

2- 选择聚类方法

set.seed(1) 
spe.kmeans <- kmeans(spe.norm, centers = 4, nstart = 100)

我们创建了包含4组的模型,与之前的HC模型相同。这里是每一组的中心,每一组的方差。该模型可以解释总方差的66.7%。

3- 选择聚类数和模型验证

我们使用以下标准:

3.1. Calinski method + SSI + SSE

spe.KM.cascade <- cascadeKM(spe.norm, inf.gr = 2, sup.gr = 10, iter = 100, criterion = "ssi")
spe.KM.cascade$results 
##      2 groups  3 groups  4 groups  5 groups  6 groups  7 groups   8 groups
## SSE 8.2149405 6.4768108 5.0719796 4.3015573 3.5856120 2.9523667 2.48405487
## ssi 0.1312111 0.1675852 0.1244603 0.1149501 0.1281785 0.1306085 0.07275788
##     9 groups  10 groups
## SSE 2.052189 1.75992916
## ssi 0.126707 0.07094594
plot(spe.KM.cascade, sortg = TRUE) 
img

统计数字不能决定一切。通过SSE方法,最好的聚类数必须是2,通过SSI方法则必须是3。

3.2. Silhouette 图

我们试着绘制 3 组的轮廓系数图。

spe.kmeans <- kmeans(spe.norm, centers = 3, nstart = 100)
dissE <- daisy(spe.norm) 
sk <- silhouette(spe.kmeans$cl, dissE) 
plot(sk)

4- 最后的图形和解释

现在我们继续比较前一个树状图和之前的分类。

spebc.ward.g <- cutree(spe.ch.ward,k = 4)
table(spe.kmeans$cluster, spebc.ward.g)
##    spebc.ward.g
##      1  2  3  4
##   1  0  6  0  0
##   2 11  1  0  0
##   3  0  0  8  3

它们只在一种情况的分类上有所不同。

clusplot(spe.norm, spe.kmeans$cluster, color = TRUE, shade = TRUE, 
         labels = 2, lines = 0)

组与组之间有一些重叠,但我们解释了高达61.75%的变异性,这是一个很好的百分比。

上一篇下一篇

猜你喜欢

热点阅读