单细胞测序分析rna_seq生物信息学技能

5、Clustering

2020-09-25  本文已影响0人  小贝学生信

原文链接Chapter 10 Clustering

1、基础知识

load("fluidigm.pca")
fluidigm.hvg
colnames(colData(fluidigm.hvg))
#可以看到此示例数据已由分类注释,我们可以据此验证后面的分群结果

2、方法method

简单介绍三种方法,一种是Seurat包采用的Graph-based;另外两种分别是k-means与层次聚类。

2.1 Graph-based

(1)基本原理

以每个细胞的高维度坐标为依据,寻找若干(参数k)与之最邻近的细胞。Two cells are connected by an edge if any of their nearest neighbors are shared,并会设置权重。

library(scran)
g <- buildSNNGraph(fluidigm.hvg, k=10, use.dimred = 'PCA')
#基于PCA结果的聚类
clust <- igraph::cluster_walktrap(g)$membership
table(clust) #分成四类
#clust
# 1  2  3  4 
#42 63  8 14 
(2)优点
g.5 <- buildSNNGraph(fluidigm.hvg, k=5, use.dimred = 'PCA')
clust.5 <- igraph::cluster_walktrap(g.5)$membership
table(clust.5)
#clust.5
# 1  2  3 
#44 57 26
g.50 <- buildSNNGraph(fluidigm.hvg, k=50, use.dimred = 'PCA')
clust.50 <- igraph::cluster_walktrap(g.50)$membership
table(clust.50)
#clust.50
# 1  2 
#85 42
(3)聚类评价 access
ratio <- clusterModularity(g, clust, as.ratio=TRUE)
dim(ratio)
library(pheatmap)
pheatmap(log2(ratio+1), cluster_rows=FALSE, cluster_cols=FALSE,
         color=colorRampPalette(c("white", "blue"))(100))

如下图,A dataset containing well-separated clusters should contain most of the observed total weight on the diagonal entries。即在对角线上的颜色对应的值最大(群的独立性越好)


clusterModularity
cluster.gr <- igraph::graph_from_adjacency_matrix(log2(ratio+1), 
                                                  mode="upper", weighted=TRUE, diag=FALSE)
set.seed(11001010)
plot(cluster.gr, edge.width=igraph::E(cluster.gr)$weight*5,
     layout=igraph::layout_with_lgl)

如下图,表示四个群之间的关系(线越粗,代表关系越紧密。理想的结果是越细越好)


clusters' nodes

2.2 K-means

set.seed(100)
clust.kmeans <- kmeans(reducedDim(fluidigm.hvg, "PCA"), centers=4)
#设置分为4类
table(clust.kmeans$cluster)
colData(fluidigm.hvg)$kmeans <- factor(clust.kmeans$cluster)
library(scater)
plotReducedDim(fluidigm.hvg, "TSNE", colour_by="kmeans")

-如上,kmeans的特点之一在于其必须设定要分类的群数,一般是缺点。因为我们处理未知数据,事先肯定不必知道centers值。

library(cluster)
set.seed(110010101)
#计算理想的k值,再kmeans
gaps <- clusGap(reducedDim(fluidigm.hvg, "PCA"), kmeans, K.max=20)
best.k <- maxSE(gaps$Tab[,"gap"], gaps$Tab[,"SE.sim"])
best.k
plot(gaps$Tab[,"gap"], xlab="Number of clusters", ylab="Gap statistic")
abline(v=best.k, col="red")
plot(dend)
library(dynamicTreeCut)
clust.fluidigm <- cutreeDynamic(tree.fluidigm, distM=as.matrix(dist.fluidigm),
                            minClusterSize=10, deepSplit=1)
table(clust.fluidigm)
#clust.fluidigm
# 1  2  3  4  5 
#34 30 25 24 14 
labels_colors(dend) <- clust.fluidigm[order.dendrogram(dend)]
plot(dend)
colData(fluidigm.hvg)$dend <- factor(clust.fluidigm)
plotReducedDim(fluidigm.hvg, "TSNE", colour_by="dend")
plot(dend)

3 聚类稳定性检验

myClusterFUN <- function(x) {
  g <- buildSNNGraph(x, use.dimred="PCA", type="jaccard")
  igraph::cluster_louvain(g)$membership
}

originals <- myClusterFUN(fluidigm.hvg)

set.seed(0010010100)
coassign <- bootstrapCluster(fluidigm.hvg, FUN=myClusterFUN, clusters=originals)
#检查original的稳定性
dim(coassign)
pheatmap(coassign, cluster_row=FALSE, cluster_col=FALSE,
         color=rev(viridis::magma(100)))

4、亚类再聚类

g <- buildSNNGraph(fluidigm.hvg, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
samp <- sample(rownames(fluidigm.hvg),4)
#随便绘制4个基因在不同clust的表达情况
plotExpression(fluidigm.hvg, features=samp,
               x=I(factor(clust)), colour_by=I(factor(clust)))
4-1
fluidigm.sub <- fluidigm.hvg[,clust==1] #挑选4分类的第一类
#注意此时fluidigm.hvg的assay是基于之前hvg筛选的矩阵。
#正常应按照original assay进行亚类分群
dim(fluidigm.sub)
fluidigm.sub.hvg <- modelGeneVar(fluidigm.sub)
fluidigm.sub <- denoisePCA(fluidigm.sub, technical=fluidigm.sub.hvg,
                         subset.row=getTopHVGs(fluidigm.sub.hvg, prop=0.1))
dim(fluidigm.sub)
g.sub <- buildSNNGraph(fluidigm.sub, use.dimred="PCA")
#分成了两个亚类
clust.sub <- igraph::cluster_walktrap(g.sub)$membership
plotExpression(fluidigm.sub, features=samp,
               x=I(factor(clust.sub)) , colour_by=I(factor(clust.sub)))
4-2
fluidigm.clust <- fluidigm.hvg
save(fluidigm.clust,file = "fluidigm.clust.RData")

以上是第十章Clustering部分的简单流程笔记。三种聚类方法的具体细节描述见Chapter 10 Clustering
本系列笔记基于OSCA全流程的大致流程梳理,详细原理可参考原文。如有错误,恳请指正!
此外还有刘小泽老师整理的全文翻译笔记,详见目录

上一篇 下一篇

猜你喜欢

热点阅读