Monocle2包学习笔记（二）：Classifying and

2020-04-23 本文已影响0人 Davey1220

image

在单细胞实验中，我们所取的组织样本通常是混杂有多种细胞类型的复杂混合物。解离的组织样本可能包含有两种，三种甚至许多不同的细胞类型。在这种情况下，最好使用已知的marker标记基因根据基因的表达对细胞进行分类。

Classifying cells by type

Monocle2提供了一个简单的系统，可根据我们选择的marker基因的表达来对不同的细胞类型进行标记。

# 选择marker标记基因
MYF5_id <- row.names(subset(fData(HSMM), gene_short_name == "MYF5"))
ANPEP_id <- row.names(subset(fData(HSMM), gene_short_name == "ANPEP"))

cth <- newCellTypeHierarchy()
# 根据MYF5基因的表达标记Myoblast细胞
cth <- addCellType(cth, "Myoblast", classify_func = function(x)
 { x[MYF5_id,] >= 1 })
# 根据MYF5和ANPEP基因的表达标记Fibroblast细胞
cth <- addCellType(cth, "Fibroblast", classify_func = function(x)
 { x[MYF5_id,] < 1 & x[ANPEP_id,] > 1 })

# 使用classifyCells函数对细胞进行分类
HSMM <- classifyCells(HSMM, cth, 0.1)
# 查看分类的结果
table(pData(HSMM)$CellType)

Fibroblast  Myoblast    Unknown
56  85  121

Monocle2使用一个小型的数据结构CellTypeHierarchy对细胞进行分类。首先，初始化一个新的CellTypeHierarchy对象，然后在其中根据筛选条件添加所需标记的细胞类型，建立好数据结构后，我们就可以使用classifyCells函数对所有的细胞进行分类，分类完成后就会在返回的CellDataSet对象中添加了一个新的CellType列，该列中存储着细胞分类后的结果。

# 对细胞分类后的结果进行可视化
pie <- ggplot(pData(HSMM),
aes(x = factor(1), fill = factor(CellType))) + geom_bar(width = 1)
pie + coord_polar(theta = "y") +
theme(axis.title.x = element_blank(), axis.title.y = element_blank())

image

从上图可以看出，许多细胞被分类标记为“Unknown”类型。这很常见，主要是因为在大多数单细胞RNA-Seq实验中，mRNA的捕获率较低。如果一个细胞表达很少量的MYF5 mRNA，我们有可能无法捕获到它。当细胞不满足分类功能中指定的任何条件时，就会将其标记为“Unknown”类型。如果满足多个筛选条件，则会标记为“Ambiguous”类型。虽然我们可以预先排除此类细胞，但会丢弃很多数据。在这种情况下，我们将可能会损失一半以上的细胞！

Clustering cells without marker genes（不依赖marker基因对细胞进行聚类分群）

Monocle2提供了可用于对未知细胞类型进行聚类分群的算法，它通过clusterCells函数根据全局基因的表达情况实现对细胞的聚类分群。clusterCells函数既可以以无监督聚类的方式使用，也可以以半监督聚类的方式使用。
首先，我们看一下无监督聚类模式分类的方法，即不依赖marker基因的表达对细胞进行聚类分群：

disp_table <- dispersionTable(HSMM)
# 筛选基因
unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1)
# 过滤基因
HSMM <- setOrderingFilter(HSMM, unsup_clustering_genes$gene_id)
plot_ordering_genes(HSMM)

image

首先，我们决定选用哪些基因用于后续的细胞聚类分群。当然，我们可以选择直接使用所有的基因，但是这些基因中会存在一些表达水平不足以提供有意义的信号，而且也会增加许多系统的噪音。所以，我们可以根据基因的平均表达水平对其进行一定的过滤，还可以选择那些在不同细胞中表达变化较大的基因。这些基因倾向于提供有关细胞状态的有用信息。
plot_ordering_genes函数显示了基因表达的变异性（分散性）相对于整个细胞平均表达的分布。其中，红线表示Monocle2基于此关系拟合的变异期望值，黑点标记的为用于后续聚类的基因，而其他基因显示为灰点。

# HSMM@auxClusteringData[["tSNE"]]$variance_explained <- NULL
plot_pc_variance_explained(HSMM, return_all = F) # norm_method='log'

image

# 降维可视化
# 使用tSNE方法进行降维
HSMM <- reduceDimension(HSMM, max_components = 2, num_dim = 6,
                reduction_method = 'tSNE', verbose = T)
# 使用clusterCells方法进行聚类
HSMM <- clusterCells(HSMM, num_clusters = 2)
# 使用plot_cell_clusters函数对细胞聚类后的结果进行可视化
plot_cell_clusters(HSMM, 1, 2, color = "CellType",
    markers = c("MYF5", "ANPEP"))

image

plot_cell_clusters(HSMM, 1, 2, color = "Media")

image

Monocle2还允许我们消除那些“uninteresting”变异源的影响，以减少它们对聚类的影响，我们可以使用residualModelFormulaStr参数来执行此操作。

HSMM <- reduceDimension(HSMM, max_components = 2, num_dim = 2,
            reduction_method = 'tSNE',
            residualModelFormulaStr = "~Media + num_genes_expressed",
            verbose = T)
HSMM <- clusterCells(HSMM, num_clusters = 2)
plot_cell_clusters(HSMM, 1, 2, color = "CellType")

image

HSMM <- clusterCells(HSMM, num_clusters = 2)
plot_cell_clusters(HSMM, 1, 2, color = "Cluster") +
    facet_wrap(~CellType)

image

Clustering cells using marker genes（依赖marker基因对细胞进行聚类分群）

接下来，我们看一下半监督聚类模式分类的方法，即依赖marker基因的表达对细胞进行聚类分群：
First, we'll select a different set of genes to use for clustering the cells. Before we just picked genes that were highly expressed and highly variable. Now, we'll pick genes that co-vary with our markers. In a sense, we'll be building a large list of genes to use as markers, so that even if a cell doesn't have MYF5, it might be recognizable as a myoblast based on other genes.

marker_diff <- markerDiffTable(
            HSMM[expressed_genes,],
            cth,
            residualModelFormulaStr = "~Media + num_genes_expressed",
            cores = 1)

markerDiffTable函数接受CellDataSet和CellTypeHierarchy作为输入，并根据提供的函数将所有的细胞进行分类。然后，先删除所有“Unknown”和“Ambiguous”的细胞，并识别不同细胞类型之间差异表达的基因。在该函数返回的结果中，我们可以选择要用于后续细胞聚类的基因。通常，最好选择每种细胞类型最特异的前10个或20个基因，这确保了聚类基因不会被某一种细胞类型的标记所控制。

# 选择用于细胞聚类的基因
candidate_clustering_genes <- row.names(subset(marker_diff, qval < 0.01))
# 使用calculateMarkerSpecificity函数计算不同细胞类型特异的marker基因
marker_spec <- calculateMarkerSpecificity(HSMM[candidate_clustering_genes,], cth)
# 查看top marker基因
head(selectTopMarkers(marker_spec, 3))

这里，我们选择每种细胞类型top 500的marker基因用于后续的细胞聚类:

# 选择marker基因
semisup_clustering_genes <- unique(selectTopMarkers(marker_spec, 500)$gene_id)
# 基因过滤
HSMM <- setOrderingFilter(HSMM, semisup_clustering_genes)
plot_ordering_genes(HSMM)

image

plot_pc_variance_explained(HSMM, return_all = F)

image

# 细胞降维聚类可视化
HSMM <- reduceDimension(HSMM, max_components = 2, num_dim = 3,
  norm_method = 'log',
  reduction_method = 'tSNE',
  residualModelFormulaStr = "~Media + num_genes_expressed",
  verbose = T)
HSMM <- clusterCells(HSMM, num_clusters = 2)
plot_cell_clusters(HSMM, 1, 2, color = "CellType")

image

Imputing cell type 估算填充细胞类型

如果为clusterCells函数提供CellTypeHierarcy，Monocle2将使用它对整个细胞集群进行分类，而不仅仅是单个的细胞。实际上，clusterCells的工作原理与以前完全相同，只是在构建细胞cluster之后，它会计算每个cluster中每种细胞类型占比的频率。
当某个cluster由某种细胞类型的特定百分比（如10％）组成时，该cluster中的所有细胞都将会被标记为该类型。如果某个cluster由多个细胞类型组成，则整个cluster都标记为“Ambiguous”。如果没有超出特定百分比阈值的细胞类型，则将该cluster标记为“Unknown”。因此，即使缺少marker标记基因，Monocle2也可以帮助我们估算填充每个细胞的的类型。

HSMM <- clusterCells(HSMM,
              cell_type_hierarchy = cth,
              num_clusters = 2,
              frequency_thresh = 0.1)
plot_cell_clusters(HSMM, 1, 2, color = "CellType",
    markers = c("MYF5", "ANPEP"))

image

# 分类结果可视化
pie <- ggplot(pData(HSMM), 
              aes(x = factor(1), 
              fill = factor(CellType))) +
       geom_bar(width = 1)

pie + coord_polar(theta = "y") +
      theme(axis.title.x = element_blank(), axis.title.y = element_blank())

image

参考来源：http://cole-trapnell-lab.github.io/monocle-release/docs/

image

Monocle2包学习笔记（二）：Classifying and

image

Classifying cells by type

Clustering cells without marker genes（不依赖marker基因对细胞进行聚类分群）

Clustering cells using marker genes（依赖marker基因对细胞进行聚类分群）

Imputing cell type 估算填充细胞类型

猜你喜欢

热点阅读