学习一篇NC的单细胞文章(四):Clustering
我们上次基于各种marker对1189个细胞进行分类,然而,仅基于marker对细胞进行分类可能是不精确的,特别是考虑到scRNA-seq数据的high dropout rate 。因此,再进行t-SNE降维之前,作者又进一步将细胞进行分类。
首先需要对不同病人样本的效应进行回归分析,以确保得到的聚类结果不是患者样本之间的异质性导致的。然后选择了在细胞间分散度高的基因表达子集(平均表达量>0.1),以增加簇的稳健性。首先计算经验离散值(empirical dispersion value )去拟合离散度-均值(dispersion parameter )的关系,为每个基因选择离散度参数(dispersion-mean relationship )。
# 将得到的1189个细胞分类结果更新到sceset_final中
colData(sceset_final)$cell_types_markers <- cell_types_simple
pd_norm <- colData(sceset_final)
#使用monocle包对细胞类型进行无监督聚类方法,monocle_unsup_clust_plots是包装好的一个函数,
HSMM_clustering_ct <- monocle_unsup_clust_plots(sceset_obj = sceset_final, mat_to_cluster = mat_norm, anno_colors = anno_colors,
name_in_phenodata = "cluster_allregr_disp", disp_extra = 1, save_plots = 0,
path_plots = NULL, type_pats = "allpats", regress_pat = 1)
HSMM_clustering_ct$Cluster <- HSMM_clustering_ct$cluster_allregr_disp
table(HSMM_clustering_ct$Cluster)
> table(HSMM_clustering_ct$Cluster)
1 2 3 4 5 6 7
64 141 95 74 202 559 54
我们看到,此时可分为7簇,跟文中的9簇不一样,作者这么解释:由于reduceDimension and clusterCells函数的更新,因此结果会有所不同。
# due to changes in Monocle's functions (reduceDimension and clusterCells), the resulting clustering is slightly different from
# the original clustering from the paper. for reproducibility, we read in the original cluster assignment
为了更好的的进行下面的学习,我们就拿已经处理好的original clustering跑下去看看。
#读取已经处理的的original clustering
original_clustering <- readRDS(file="data/original_clustering.RDS")
HSMM_clustering_ct$Cluster <- original_clustering
table(HSMM_clustering_ct$Cluster)
> table(HSMM_clustering_ct$Cluster)
1 2 3 4 5 6 7 8 9
52 18 156 197 144 119 129 328 46
这时候就有9簇细胞,接下来需要对每1簇细胞再进行细分。
# 将得到的9簇细胞分类结果更新到sceset_final中
colData(sceset_final) <- cbind(colData(sceset_final), pData(HSMM_clustering_ct)[,c(96:104)])
colData(sceset_final)$cell_types_cl_all <- colData(sceset_final)$cell_types_markers
pd_norm <- colData(sceset_final)
#首先在每个簇中确定unknown and undecided cell types
cells_to_assign <- list()
cells_to_reassign <- list()
mean_exprs <- list()
mean_reassign_exprs <- list()
# cluster1: 只有52 epithelial cells
cluster_here <- 1
table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
mean_exprs[[cluster_here]] <- NULL
mean_reassign_exprs[[cluster_here]] <- NULL
cells_to_assign[[cluster_here]] <- NULL
cells_to_reassign[[cluster_here]] <- NULL
#cluster2:确定clsters2的大致细胞类型:3 epithelial, 13 macrophage, 1 undecided and 1 unknown cells
cluster_here <- 2
table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
> table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
epithelial macrophage undecided unknown
3 13 1 1
#计算undecided and unknown cells的免疫标志物的平均表达量
to_assign_cluster <- c("undecided", "unknown")
mean_exprs[[cluster_here]] <- compute_mean_expr_types(types = c("epithelial", "immune", "other"), mat_expr = mat_norm,
cells_pos = intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% to_assign_cluster)),
epithelial_markers = epithelial_markers, immune_markers = immune_markers, other_markers = other_markers)
mean_exprs[[cluster_here]]
#对于undecided 和 unknown cell,如果免疫标志物的平均表达量最高,指定为巨噬细胞。
cells_to_assign[[cluster_here]] <- intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% to_assign_cluster))[which(apply(mean_exprs[[cluster_here]], 1, function(x){if (x[2] >= max(x)) return(1) else {return(0)}}) == 1)]
pd_norm$cell_types_cl_all[cells_to_assign[[cluster_here]]] <- "macrophage"
table(pd_norm$cell_types_cl_all[which(pd_norm$Cluster == cluster_here)])
#确定clusters2中3个上皮细胞(epithelial cells)免疫标志物的平均表达水平
mean_reassign_exprs[[cluster_here]] <- compute_mean_expr_types(types = c("epithelial", "immune", "other"), mat_expr = mat_norm,
cells_pos = intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% "epithelial")),
epithelial_markers = epithelial_markers, immune_markers = immune_markers, other_markers = other_markers)
#类似地,clusters2有3个上皮细胞(epithelial cells)免疫标志物的平均表达最强,也被指定为巨噬细胞。
mean_reassign_exprs[[cluster_here]] <- compute_mean_expr_types(types = c("epithelial", "immune", "other"), mat_expr = mat_norm,
cells_pos = intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% "epithelial")),
epithelial_markers = epithelial_markers, immune_markers = immune_markers, other_markers = other_markers)
cells_to_reassign[[cluster_here]] <- intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% "epithelial"))[which(apply(mean_reassign_exprs[[cluster_here]], 1, function(x){if (x[2] >= max(x)) return(1) else {return(0)}}) == 1)]
pd_norm$cell_types_cl_all[cells_to_reassign[[cluster_here]]] <- "macrophage"
#因此,clusters2由18个巨噬细胞组成。
table(pd_norm$cell_types_cl_all[which(pd_norm$Cluster == cluster_here)])
## cluster3:46 epithelial, 86 stroma, 14 endothelial, 5 undecided and 5 unknown cells.
cluster_here <- 3
table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
> table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
endothelial epithelial stroma undecided unknown
14 46 86 5 5
mean_exprs[[cluster_here]] <- NULL
mean_reassign_exprs[[cluster_here]] <- NULL
cells_to_assign[[cluster_here]] <- NULL
cells_to_reassign[[cluster_here]] <- NULL
# cluster 4: 将unknown and undecided cells进一步细分
cluster_here <- 4
table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
to_assign_cluster <- c("undecided", "unknown")
mean_exprs[[cluster_here]] <- compute_mean_expr_types(types = c("epithelial", "immune", "other"), mat_expr = mat_norm,
cells_pos = intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% to_assign_cluster)),
epithelial_markers = epithelial_markers, immune_markers = immune_markers, other_markers = other_markers)
# clusters3中unknown and undecided cells共有5个细胞的epithelial markers的平均表达最强,被指定为epithelial cells
cells_to_assign[[cluster_here]] <- intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% to_assign_cluster))[which(apply(mean_exprs[[cluster_here]], 1, function(x){if (x[1] >= max(x)) return(1) else {return(0)}}) == 1)]
pd_norm$cell_types_cl_all[cells_to_assign[[cluster_here]]] <- "epithelial"
table(pd_norm$cell_types_cl_all[which(pd_norm$Cluster == cluster_here)])
# 确定个stroma cell是否高表达epithelial marker,若epithelial marker表达较高,则被指定为 epithelial cells
mean_reassign_exprs[[cluster_here]] <- compute_mean_expr_types(types = c("epithelial", "immune", "other"), mat_expr = mat_norm,
cells_pos = intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% "stroma")),
epithelial_markers = epithelial_markers, immune_markers = immune_markers, other_markers = other_markers)
# only re-assign the stroma cells if their epithelial expression is higher
cells_to_reassign[[cluster_here]] <- intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% "stroma"))[which(apply(mean_reassign_exprs[[cluster_here]], 1, function(x){if (x[1] >= max(x)) return(1) else {return(0)}}) == 1)]
pd_norm$cell_types_cl_all[cells_to_reassign[[cluster_here]]] <- "epithelial"
table(pd_norm$cell_types_cl_all[which(pd_norm$Cluster == cluster_here)])
cluster 5:46 epithelial, 4 stroma, 52 T, 3 B, 11 undecided and 28 unknown cells
cluster_here <- 5
table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
mean_exprs[[cluster_here]] <- NULL
mean_reassign_exprs[[cluster_here]] <- NULL
cells_to_assign[[cluster_here]] <- NULL
cells_to_reassign[[cluster_here]] <- NULL
接下来的第6、7、8和9簇细胞,按照同样的办法细分,代码太长了,我就不贴上来了。在这里先小结一下,我们开始看到的是1189个细胞,然后将这些细胞用无聚类监督分析先分为9个簇,因为相似的细胞更容易成为一个簇,但是簇里面的细胞类型我们已经基于文献的markers分类好了,若簇里面有多种细胞类型的细胞存在,那么我们需要进一步鉴定他们跟簇里面那群最大比例的细胞是不是同一类细胞(不然怎么解释为何他们会聚集到了同一簇),还是说有自己的独特身份?但是我们怎么定义那个最大比例的细胞,作者将这个比例定为80%,举个例子,在clusters4总共有197个细胞(185 epithelial,3 stroma, 4 undecided and 5 unknown cells),其中 epithelial细胞占了94%,超过了80%的比例,那么其他12个细胞有可能“伪装”了自己,其真实身份有可能是epithelial细胞,不然怎么解释他们跟epithelial细胞聚集到了一起呢?这时候就需要进一步确定其他12个细胞epithelial markers 平均表达量是否高于其他的markers,是的话他们的“庐山真面目”就是epithelial细胞,若没有,那么就是冤枉到它了,让它保持原来的身份。而对于clusters3( 46 epithelial, 86 stroma, 14 endothelial, 5 undecided and 5 unknown cells)来说,没有1种细胞的比例超过80%,因此他们都可以保持自己的身份。我感觉这个过程还是蛮有趣的,在对核实细胞身份的同时,其实就是不断接近客观世界的过程。今天这些内容已经蛮多了,下一节我们再继续学习。