「r<-统计分析」层次聚类与非层次聚类
原文链接:https://www.rpubs.com/dvallslanaquera/clustering
层次聚类 (HC)
在这个分析中,我们将看到如何创建层次聚类模型。目的是探索数据库中是否存在相似性组,并查看它们的行为。
例如,我们将使用Doubs数据库,该数据库基于从法国Doubs河中提取的鱼类样本的物理特征。其目的是查看样本的行为以及如何对数据进行分组。
1- 数据准备
我们需要删除带有双零或NA值的行,否则当我们尝试创建树状图时,它们将会出现问题。然后我们需要根据它们的距离对值进行规格化。这次我们将使用欧氏距离,但也有其他有用的距离方法。
library(cluster)
library(vegan)
library(ade4)
data(doubs)
spe <- doubs$fish[-8,]
env <- doubs$env[-8,]
spa <- doubs$xy[-8,]
spe.norm <- decostand(spe, "normalize")
spe.ch <- vegdist(spe.norm, "euc")
2- 聚类方法选择
我们将选择所有可用的方法,然后我们将选择一个最佳的验证分析。方法有单点法、完整法、平均法、质心法和Ward法(single, complete, average, centroid and Ward)。
par(mfrow = c(2, 3))
spe.ch.single <- hclust(spe.ch, method = "single")
plot(spe.ch.single, sub = "Single method")
spe.ch.complete <- hclust(spe.ch, method = "complete")
plot(spe.ch.complete, sub = "Complete method")
spe.ch.UPGMA <- hclust(spe.ch, method = "average")
plot(spe.ch.UPGMA, sub = "Average method")
spe.ch.centroid <- hclust(spe.ch, method = "centroid")
plot(spe.ch.centroid, sub = "Centroid method")
spe.ch.ward <- hclust(spe.ch, method = "ward.D")
plot(spe.ch.ward, sub = "Ward method")
这个树状图的结构证明是有问题的。我们可以在树状图上观察到重叠,因此这种方法不再有效。
3- 选择最佳方法
在质心法的情况下,我们可以看到过拟合。在其他情况下,我们必须计算:
- 同表型相关性(Cophenetic correlation)
- 高尔距离值(Gower distance value)
3.1. 同表型相关性
spe.ch.single.coph <- cophenetic(spe.ch.single) # Single
paste("Single cophenetic:", cor(spe.ch, spe.ch.single.coph))
## [1] "Single cophenetic: 0.599192957534406"
spe.ch.comp.coph <- cophenetic(spe.ch.complete) # Complete
paste("Complete cophenetic:", cor(spe.ch, spe.ch.comp.coph))
## [1] "Complete cophenetic: 0.765562801324477"
spe.ch.UPGMA.coph <- cophenetic(spe.ch.UPGMA) # Average
paste("UPGMA cophenetic:", cor(spe.ch, spe.ch.UPGMA.coph))
## [1] "UPGMA cophenetic: 0.860832629864453"
spe.ch.ward.coph <- cophenetic(spe.ch.ward) # Ward
paste("Ward cophenetic:", cor(spe.ch, spe.ch.ward.coph))
## [1] "Ward cophenetic: 0.759393401189188"
cor(spe.ch, spe.ch.ward.coph, method = "spearman")
## [1] 0.7661171
同表型相关的图示(谢泼德图 Shepard diagram)。越靠近对角线越好。
par(mfrow = c(2, 2))
plot(spe.ch, spe.ch.single.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
main = c("Single linkage", paste("Cophenetic correlation =",
round(cor(spe.ch, spe.ch.single.coph), 3))))
abline(0, 1); lines(lowess(spe.ch, spe.ch.single.coph), col = "red")
plot(spe.ch, spe.ch.comp.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)), ylim = c(0,sqrt(2)),
main = c("Complete linkage", paste("Cophenetic correlation =",
round(cor(spe.ch, spe.ch.comp.coph), 3))))
abline(0, 1); lines(lowess(spe.ch, spe.ch.comp.coph), col = "red")
plot(spe.ch, spe.ch.UPGMA.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)), ylim = c(0,sqrt(2)),
main = c("UPGMA", paste("Cophenetic correlation =",
round(cor(spe.ch, spe.ch.UPGMA.coph), 3))))
abline(0, 1); lines(lowess(spe.ch, spe.ch.UPGMA.coph), col = "red")
plot(spe.ch, spe.ch.ward.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)),
ylim = c(0,max(spe.ch.ward$height)),
main = c("Ward clustering", paste("Cophenetic correlation =",
round(cor(spe.ch, spe.ch.ward.coph), 3))))
abline(0, 1); lines(lowess(spe.ch, spe.ch.ward.coph), col = "red")
最佳拟合方法为UPGMA,同位相关系数为0.861,比较高。
高尔距离值(越低越好)
(gow.dist.single <- sum((spe.ch - spe.ch.single.coph) ^ 2))
## [1] 95.41391
(gow.dist.comp <- sum((spe.ch - spe.ch.comp.coph) ^ 2))
## [1] 40.48897
(gow.dist.UPGMA <- sum((spe.ch - spe.ch.UPGMA.coph) ^ 2))
## [1] 11.6746
(gow.dist.ward <- sum((spe.ch - spe.ch.ward.coph) ^ 2))
## [1] 8001.85
最佳方法仍然是UPGMA。
3- 最后聚类数目的选择
为了达到这个目的,我们需要 3 个不同的检验:
- a- Fussion 水平图
- b- Silhouette 图(轮廓系数图)
- c- Mantel 值
a- Fussion 水平图
dev.off()
## null device
## 1
plot(spe.ch.ward$height, nrow(spe) : 2, type = "S",
main = "Fusion levels - Chord - Ward",
ylab = "k (number of clusters)", xlab = "h (node height)", col = "grey")
text(spe.ch.ward$height, nrow(spe) : 2, nrow(spe) : 2, col = "red", cex = 0.8)
根据图结果,最佳数目是 2 或 4。
b- Silhouette 图
asw <- numeric(nrow(spe))
for(k in 2:(nrow(spe) - 1)){
sil <- silhouette(cutree(spe.ch.ward, k = k), spe.ch)
asw[k] <- summary(sil)$avg.width}
k.best <- which.max(asw)
plot(1: nrow(spe), asw, type="h",
main = "Silhouette-optimal number of clusters",
xlab = "k (number of groups)", ylab = "Average silhouette width")
axis(1, k.best, paste("optimum", k.best, sep = "\n"), col = "red", font = 2,
col.axis = "red")
points(k.best, max(asw), pch = 16, col = "red", cex = 1.5)
cat("", "Silhouette-optimal number of clusters k =", k.best, "\n",
"with an average silhouette width of", max(asw), "\n")
## Silhouette-optimal number of clusters k = 2
## with an average silhouette width of 0.3658319
c- Mantel 值
grpdist <- function(X){
require(cluster)
gr <- as.data.frame(as.factor(X))
distgr <- daisy(gr, "gower")
distgr}
kt <- data.frame(k = 1:nrow(spe), r = 0)
for(i in 2:(nrow(spe) - 1)){
gr <- cutree(spe.ch.ward, i)
distgr <- grpdist(gr)
mt <- cor(spe.ch, distgr, method = "pearson")
kt[i, 2] <- mt}
k.best <- which.max(kt$r)
plot(kt$k, kt$r, type = "h", main = "Mantel-optimal number of clusters",
xlab = "k (number of groups)", ylab = "Pearson's correlation")
axis(1, k.best, paste("optimum", k.best, sep = "\n"), col = "red", font = 2,
col.axis = "red")
points(k.best, max(kt$r), pch = 16, col = "red", cex = 1.5)
cat("", "Mantel-optimal number of clusters k =", k.best, "\n",
"with a matrix linear correlation of", max(kt$r), "\n")
## Mantel-optimal number of clusters k = 4
## with a matrix linear correlation of 0.7154912
第一和第二种方法表明,聚类的最佳数量是k = 2,而 Mantel 值表明,数量必须是4。由于理论表明物种的数量为4,我们将在k = 4时验证我们的模型。
4- 最后模型验证 k = 4
k <- 4
cutg <- cutree(spe.ch.ward, k = k)
sil <- silhouette(cutg, spe.ch)
rownames(sil) <- row.names(spe)
plot(sil, main = "Silhouette plot",
cex.names = 0.8, col = 2:(k + 1), nmax = 100)
这些组是一致的,但是第二组有两个未分类的值。
5- 最后的图
到目前为止,我们已经通过UPGMA方法将我们的数据分组为4个簇。现在我们将使用Francois Gillet(2012)创建的hcoplot函数来描述树图的行为。
hcoplot <- function(tree, diss, k,
title = paste("Reordered dendrogram from", deparse(tree$call), sep = "\n"))
{
require(gclus)
gr <- cutree(tree, k = k)
tor <- reorder.hclust(tree, diss)
plot(tor, hang = -1, xlab = paste(length(gr),"sites"), sub = paste(k, "clusters"),
main = title)
so <- gr[tor$order]
gro <- numeric(k)
for (i in 1:k)
{
gro[i] <- so[1]
if (i<k) so <- so[so != gro[i]]
}
rect.hclust(tor, k = k, border = gro + 1, cluster = gr)
legend("topright", paste("Cluster", 1:k), pch = 22, col = 2:(k + 1), bty = "n")
}
hcoplot(spe.ch.ward, spe.ch, k = 4)
非层次聚类 (NHC)
这次我们将做一个k均值聚类模型。
1-数据标准化
之前已经做过。
2- 选择聚类方法
set.seed(1)
spe.kmeans <- kmeans(spe.norm, centers = 4, nstart = 100)
我们创建了包含4组的模型,与之前的HC模型相同。这里是每一组的中心,每一组的方差。该模型可以解释总方差的66.7%。
3- 选择聚类数和模型验证
我们使用以下标准:
- Calinski & Harabasz 值
- Simple structure index (SSI)
- Sum of squared errors (SSE)
- Silhouette 图
3.1. Calinski method + SSI + SSE
spe.KM.cascade <- cascadeKM(spe.norm, inf.gr = 2, sup.gr = 10, iter = 100, criterion = "ssi")
spe.KM.cascade$results
## 2 groups 3 groups 4 groups 5 groups 6 groups 7 groups 8 groups
## SSE 8.2149405 6.4768108 5.0719796 4.3015573 3.5856120 2.9523667 2.48405487
## ssi 0.1312111 0.1675852 0.1244603 0.1149501 0.1281785 0.1306085 0.07275788
## 9 groups 10 groups
## SSE 2.052189 1.75992916
## ssi 0.126707 0.07094594
plot(spe.KM.cascade, sortg = TRUE)
img
统计数字不能决定一切。通过SSE方法,最好的聚类数必须是2,通过SSI方法则必须是3。
3.2. Silhouette 图
我们试着绘制 3 组的轮廓系数图。
spe.kmeans <- kmeans(spe.norm, centers = 3, nstart = 100)
dissE <- daisy(spe.norm)
sk <- silhouette(spe.kmeans$cl, dissE)
plot(sk)
4- 最后的图形和解释
现在我们继续比较前一个树状图和之前的分类。
spebc.ward.g <- cutree(spe.ch.ward,k = 4)
table(spe.kmeans$cluster, spebc.ward.g)
## spebc.ward.g
## 1 2 3 4
## 1 0 6 0 0
## 2 11 1 0 0
## 3 0 0 8 3
它们只在一种情况的分类上有所不同。
clusplot(spe.norm, spe.kmeans$cluster, color = TRUE, shade = TRUE,
labels = 2, lines = 0)
组与组之间有一些重叠,但我们解释了高达61.75%的变异性,这是一个很好的百分比。