Monocle2 学习笔记

2020-04-11 本文已影响0人 caokai001

参考：

测试所用的数据链接: https://pan.baidu.com/s/1TUysqHHbXrWGi7QGt-6L1g 提取码: uqgh

image.png

Monocle[7]是一个基于pseudotime分析的软件包。它通过从一系列差异表达的基因来描述细胞间的距离远近，由此得到类似树状的分叉图象来表示细胞间的差异性。提供了聚类，pseudotime, 差异分析等多种功能，该项目的网址如下

https://cole-trapnell-lab.github.io/monocle-release/

image.png

主要内容

1. 读取数据
2.预处理过滤细胞
3. 细胞分群/聚类
step1：dispersionTable()
step2：plot_pc_variance_explained() 耗时
step3: 进行降维和聚类
4.筛选基因(用于拟分析时序)
4.1 一般过滤标准
4.2 使用差异表达基因BEAM分析
5.推断发育轨迹
5.1. 选合适基因
5.2 降维
5.3 细胞排序
小结：

主要介绍使用该R包进行pseudotime分析的步骤.

Tips : pseudotime分析本质上就是输入基因及其表达量矩阵，计算拟时序的值。并且选取一些基因，用于降维可视化。聚类只是降维后颜色填充而已.

1. 读取数据

monocle包有很多种读取数据的方式，这里只展示其中三种：（I）直接读取;（II）来源于seurat2 包; （III）来源于seurat3包。

Tips : 由于版本更替较快，seurat2 可以直接导入monocle2.但是seurat3却要手动写代码导入.

library(monocle)
library(DDRTree)
library(pheatmap)
library(data.table)
library(plyr)

下面以10x 数据为例，最好通过seurat对象直接读入monocle2 方便一些.

（1）直接读取：当有三个文件，expr_matrix ; sample_info ;gene_annotation.

expr_matrix <- read.delim(nUMI.summary.file)
sample_info <- data.frame(sampleID=colnames(expr_matrix))
gene_annotation <- data.frame(symbol=rownames(expr_matrix))
pd <- new("AnnotatedDataFrame", data = sample_info)
fd <- new("AnnotatedDataFrame", data = gene_annotation)
d <- newCellDataSet(as.matrix(expr_matrix), phenoData = pd, featureData = fd)

（2）来源于seurat2 包，数据如下：

image.png

## 切换到工作目录，读入三个文本文件.
pbmc.data <- Read10X(data.dir = ".")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "10XPBMC")
## seurat2 可以直接读入
sce <- importCDS(pbmc)

（3）来源于seurat3包：

差别之处，不能直接用importCDS 导入.

## 由seurat3 导入函数newimport 替代上面的importCDS功能.

newimport <- function(otherCDS, import_all = FALSE) {
  if(class(otherCDS)[1] == 'Seurat') {
    requireNamespace("Seurat")
    data <- otherCDS@assays$RNA@counts
    
    if(class(data) == "data.frame") {
      data <- as(as.matrix(data), "sparseMatrix")
    }
    
    pd <- tryCatch( {
      pd <- new("AnnotatedDataFrame", data = otherCDS@meta.data)
      pd
    },
    #warning = function(w) { },
    error = function(e) {
      pData <- data.frame(cell_id = colnames(data), row.names = colnames(data))
      pd <- new("AnnotatedDataFrame", data = pData)
      
      message("This Seurat object doesn't provide any meta data");
      pd
    })
    
    # remove filtered cells from Seurat
    if(length(setdiff(colnames(data), rownames(pd))) > 0) {
      data <- data[, rownames(pd)]
    }
    
    fData <- data.frame(gene_short_name = row.names(data), row.names = row.names(data))
    fd <- new("AnnotatedDataFrame", data = fData)
    lowerDetectionLimit <- 0
    
    if(all(data == floor(data))) {
      expressionFamily <- negbinomial.size()
    } else if(any(data < 0)){
      expressionFamily <- uninormal()
    } else {
      expressionFamily <- tobit()
    }
    
    valid_data <- data[, row.names(pd)]
    
    monocle_cds <- newCellDataSet(data,
                                  phenoData = pd,
                                  featureData = fd,
                                  lowerDetectionLimit=lowerDetectionLimit,
                                  expressionFamily=expressionFamily)
    
    if(import_all) {
      if("Monocle" %in% names(otherCDS@misc)) {
        otherCDS@misc$Monocle@auxClusteringData$seurat <- NULL
        otherCDS@misc$Monocle@auxClusteringData$scran <- NULL
        
        monocle_cds <- otherCDS@misc$Monocle
        mist_list <- otherCDS
        
      } else {
        # mist_list <- list(ident = ident)
        mist_list <- otherCDS
      }
    } else {
      mist_list <- list()
    }
    
    if(1==1) {
      var.genes <- setOrderingFilter(monocle_cds, otherCDS@assays$RNA@var.features)
      
    }
    monocle_cds@auxClusteringData$seurat <- mist_list
    
  } else if (class(otherCDS)[1] == 'SCESet') {
    requireNamespace("scater")
    
    message('Converting the exprs data in log scale back to original scale ...')
    data <- 2^otherCDS@assayData$exprs - otherCDS@logExprsOffset
    
    fd <- otherCDS@featureData
    pd <- otherCDS@phenoData
    experimentData = otherCDS@experimentData
    if("is.expr" %in% slotNames(otherCDS))
      lowerDetectionLimit <- otherCDS@is.expr
    else
      lowerDetectionLimit <- 1
    
    if(all(data == floor(data))) {
      expressionFamily <- negbinomial.size()
    } else if(any(data < 0)){
      expressionFamily <- uninormal()
    } else {
      expressionFamily <- tobit()
    }
    
    if(import_all) {
      # mist_list <- list(iotherCDS@sc3,
      #                   otherCDS@reducedDimension)
      mist_list <- otherCDS
      
    } else {
      mist_list <- list()
    }
    
    monocle_cds <- newCellDataSet(data,
                                  phenoData = pd,
                                  featureData = fd,
                                  lowerDetectionLimit=lowerDetectionLimit,
                                  expressionFamily=expressionFamily)
    # monocle_cds@auxClusteringData$sc3 <- otherCDS@sc3
    # monocle_cds@auxOrderingData$scran <- mist_list
    
    monocle_cds@auxOrderingData$scran <- mist_list
    
  } else {
    stop('the object type you want to export to is not supported yet')
  }
  
  return(monocle_cds)
}

导入monocle2

## 读入seurat3
pbmc.data <- Read10X(data.dir = ".")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "10XPBMC")

## 读入monocle2
sce <- newimport(pbmc)

本文采用第1种方式，读入文件

library(Seurat)
library(monocle)

expression_matrix <- readRDS("packer_embryo_expression.rds")
cell_metadata <- readRDS("packer_embryo_colData.rds")
gene_annotation <- readRDS("packer_embryo_rowData.rds")

## 数据框需要先保存为AnnotatedDataFrame 格式，才可以创建newCellDataSet 对象
# Error: CellDataSet 'phenoData' is class 'data.frame' but should be or extend 'AnnotatedDataFrame'
pd <- new("AnnotatedDataFrame", data = cell_metadata)
fd <- new("AnnotatedDataFrame", data = gene_annotation)

sce <- newCellDataSet(expression_matrix,
                      phenoData = pd, 
                      featureData = fd,
                      lowerDetectionLimit = 0.1, 
                      expressionFamily = VGAM::negbinomial.size()) # 默认参数

查看sce 类型：

> sce
CellDataSet (storageMode: environment)
assayData: 20222 features, 6188 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: AAACCTGCAAGACGTG-300.1.1 AAACCTGGTGTGAATA-300.1.1 ...
    TTTGTCAAGTACACCT-b02 (6188 total)
  varLabels: cell n.umi ... State (21 total)
  varMetadata: labelDescription
featureData
  featureNames: WBGene00010957 WBGene00010958 ... WBGene00007064 (20222 total)
  fvarLabels: id gene_short_name num_cells_expressed use_for_ordering
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

查看一下数据读入到monocle2里面，它的表达分布类型：

sce@expressionFamily

## Family:  negbinomial.size 
## 
## Negative-binomial distribution with size known
## 
## Links:    loge(mu)
## Mean:     mu
## Variance: mu * (1 + mu / size) for NB-2

注意到构建CDS对象过程中有一个参数是：expressionFamily ，它是选择了一个数据分布，例如FPKM/TPM 值是log-正态分布的；UMIs和原始count值用负二项分布模拟的效果更好。负二项分布有两种方法，这里选用了negbinomial.size，另外一种negbinomial稍微更准确一点，但速度大打折扣，它主要针对非常小的数据集，如下表：

由于输入10x genomic 数count 类似，所以表达分布为 negbinomial.size()

Family function	数据类型	备注
negbinomial.size()	原始计数值, 比如UMIs counts	Negative binomial distribution with fixed variance
negbinomial()	原始计数值, 比如UMIs counts	比negbinomial.size()精确一点, 但是非常慢.
tobit()	FPKM, TPM	Tobits are truncated normal distributions. 注意这点不是log-transformed FPKM/TPM
gaussianff()	log-transformed FPKM/TPMs, Ct values from single-cell qPCR	符合高期分布的数值

2.预处理

和处理RNA-seq数据一样，monocle会有一些预处理的步骤，这包括估计size factors和dispersions，以及去除一些质量比较差的细胞。

第一行代码用于评估每个细胞的sizefactor, 用于后续归一化；
第二行代码计算基因的方差,这一步可能会提醒我们有多少的outlier的细胞。所以下一步就是要去除掉这些细胞。

cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds)
# Removing 143 outliers

通过pData和fData两个函数，可以查看基因和细胞的相关信息，示意如下

查看细胞注释信息：pData.

image.png

查看基因注释信息：fData

image.png

过滤细胞

在准备样品细胞的过程中，不可避免地会引入一下死的细胞，或者说是doublets(两个细胞被标记成一个细胞)。这些细胞对下游分析的影响会比较大，它们很有可能会占去很大分析权重。所以我们需要把它们去除掉。去除它们的办法就是设置一个比较合理的表达总值的上下限，然后把处于上下限外围的细胞都去除掉。

首先我们来查看一下低表达的基因。

### detectGenes计算每一个细胞表达的基因个数.
sce <- detectGenes(sce, min_expr = 0.1)
### 保留在10个细胞中表达的基因，当做表达基因.
expressed_genes <- row.names(subset(fData(sce),
                                    num_cells_expressed >= 10))

我们将表达值在10以上的基因保存为expressed_genes。下面的步骤会用到保存好的表达的基因。

接下来我们要去除掉死细胞或者doublets了。
```
head(pData(sce))
```

一般的，我们是没有数据告诉我们哪些细胞是死的或者空的，哪些细胞是doublets的。但是通常而方言，我们都会拿到RPC(reads per cell)的计数，我们可以人为的设定一下表达总值的上下限。

pData(sce)$Total_mRNAs <- Matrix::colSums(exprs(sce))
sce <- sce[, pData(sce)$Total_mRNAs < 1e6 ]

##设置上下限：
upper_bound <- 10^(mean(log10(pData(sce)$Total_mRNAs)) +
                     2*sd(log10(pData(sce)$Total_mRNAs)))
lower_bound <- 10^(mean(log10(pData(sce)$Total_mRNAs)) -
                     2*sd(log10(pData(sce)$Total_mRNAs)))

## 可视化
qplot(Total_mRNAs, data = pData(sce), geom = "density") +
  geom_vline(xintercept = lower_bound) +
  geom_vline(xintercept = upper_bound) + xlim(0, 6000)+
  theme_classic()

可视化RPC(reads per cell) 分布

image.png

过滤不合格细胞

## 保留下通过标准的细胞
sce <- sce[,pData(sce)$Total_mRNAs > lower_bound &
             pData(sce)$Total_mRNAs < upper_bound]

## 新的sce 数据,评估细胞表达基因数.
sce <- detectGenes(sce, min_expr = 0.1)

> sce
CellDataSet (storageMode: environment)
assayData: 20222 features, 5943 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: AAACCTGCAAGACGTG-300.1.1 AAACCTGGTGTGAATA-300.1.1
    ... TTTGTCAAGTACACCT-b02 (5943 total)
  varLabels: cell n.umi ... Total_mRNAs (20 total)
  varMetadata: labelDescription
featureData
  featureNames: WBGene00010957 WBGene00010958 ... WBGene00007064
    (20222 total)
  fvarLabels: id gene_short_name num_cells_expressed
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

细胞数目有6188 变成现在的5943个.

检查过滤细胞的效果，过滤后的值基本上是符合lognormal分布的。

# Log-transform each value in the expression matrix.
L <- log(exprs(sce[expressed_genes,]))

# Standardize each gene, so that they are all on the same scale,
# Then melt the data with plyr so we can plot it easily
melted_dens_df <- melt(Matrix::t(scale(Matrix::t(L))))

# Plot the distribution of the standardized gene expression values.
qplot(value, geom = "density", data = melted_dens_df) +
  stat_function(fun = dnorm, size = 0.5, color = 'red') +
  xlab("Standardized log(counts)") +
  ylab("Density")

理想的图：

image.png

得到过滤后的矩阵，可以分别进行聚类和拟时序分析.

3. 细胞分群/聚类

不使用marker 基因的细胞分类方法

使用函数clusterCells()，根据整体的表达量对细胞进行分组。例如，细胞表达了大量的与成肌细胞相关的基因，但就是没有成肌细胞的marker--MYF5 ，我们依然可以判断这个细胞属于成肌细胞。

step1：dispersionTable()

首先就是判断使用哪些基因进行细胞分群。当然，可以使用全部基因，但这会掺杂很多表达量不高而检测不出来的基因，反而会增加噪音。挑有差异的，挑表达量不太低的

disp_table <- dispersionTable(sce)
unsup_clustering_genes <- subset(disp_table, 
                                 mean_expression >= 0.1) ## 数值因实验而不同
sce <- setOrderingFilter(sce, unsup_clustering_genes$gene_id) 
plot_ordering_genes(sce)

image.png

其中, setOrderingFilter函数设置了将来会用于分群的细胞。这一步对于分群非常关键，所以可以试着使用不同的方法来过滤基因，以期达到满意的效果。

plot_ordering_genes根据这些基因的平均表达水平展示了基因的表达差异程度，红线表示Monocle基于这种关系对分散的期望。上图中实黑的为会使用的基因，灰色的是过滤掉的基因。

step2：plot_pc_variance_explained() 耗时

然后选一下主成分

plot_pc_variance_explained(cds, return_all = F,max_components = 100) # norm_method='log'

image.png

上图与Seurat中使用到的PCElbowPlot类似，我们可以看到大约多少的PC将会被我们使用。可以看到大约在15的时候就已经到的平原区了。下面试着分群。

step3: 进行降维和聚类

根据上面👆的图，选择合适的主成分数量（这个很主观，可以多试几次），这里选前15个成分（需要注意的使用主成分会影响后面结果）

关于处理批次效应：例如在芯片数据中经常会利用SVA的combat函数。
磨平批次效应实际上就是去掉各个组的前几个主成分

# 进行降维 ，设置降维维度15.
cds <- reduceDimension(sce, max_components = 2, norm_method = 'log',
                       num_dim = 15, reduction_method = "tSNE", verbose = TRUE)
# Remove noise by PCA ...
# Reduce dimension by tSNE ...



### 如果需要去除批次效应(batch 和 几群细胞基因表达量数目)
# cds <- reduceDimension(cds, max_components = 2, num_dim = 6,
#                          reduction_method = 'tSNE',
#                          residualModelFormulaStr = "~batch + #num_genes_expressed",
#                          verbose = T)



# 进行聚类 （指定cluster为4的时候，只能画出来3个,为什么？）
cds <- clusterCells(cds, num_clusters = 4) 
# Distance cutoff calculated to 4.826512 

## 查看聚类效果
head(cds@phenoData@data)


# 查看tSNE图
plot_cell_clusters(cds,x=1,y=2, color_by = "Cluster")

# 画具体基因的表达分布图
#plot_cell_clusters(HSMM, 1, 2, color = "Cluster",
# +                    markers = c("CDC20", "GABPB2"))

image.png

4.筛选基因(用于拟分析时序)

4.1 一般过滤标准

筛选基因没有一个固定标准，可以根据自己的需要灵活选择，这里只展示其中1种 .

disp_table <- dispersionTable(sce)
unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1)

## 得到用于拟时序分析基因
ordering_genes <- row.names(unsup_clustering_genes)

image.png

4.2 使用差异表达基因

对于差异表达分析，可以像RNA-seq一样，直接到不同属性的样品进行分析

fullModelFormulaStr ：分组进行差异分析的变量.

我们可以看到，只要有不同的pData属性，我们都可以做差异表达分析。比如State：

## 差异分析非常耗时，运行时间在几分钟至十几分钟不等
diff_test_res <- differentialGeneTest(sce[expressed_genes, ],
                                      fullModelFormulaStr = "Cluster")
sig_genes <- subset(diff_test_res, qval < 0.1)
sig_genes[1:6,c("gene_short_name", "pval", "qval")]

## 得到用于拟时序分析基因
ordering_genes <- row.names(sig_genes)

下面是可选部分👇

再比如 Pseudotime

diff_test_res <- differentialGeneTest(sce[expressed_genes, ],
                                      fullModelFormulaStr = "~sm.ns(Pseudotime)")
sig_genes <- subset(diff_test_res, qval < 0.1)
sig_genes[1:6,c("gene_short_name", "pval", "qval")]

作图（注意要将基因名变成character）

cg=as.character(head(sig_genes$gene_short_name))
# 普通图
plot_genes_jitter(sce[cg,], 
                  grouping = "State", ncol= 2)
# 还能上色
plot_genes_jitter(sce[cg,],
                  grouping = "State",
                  color_by = "State",
                  nrow= 3,
                  ncol = NULL )

还可以绘制热图

sig_genes <- sig_genes[order(sig_genes$qval), ]
plot_pseudotime_heatmap(sce[row.names(sig_genes)[1:20],],
                num_clusters = 3,
                cores = 1,
                show_rownames = T)

得到类似图

image.png

BEAM分析

BEAM分析的目的是比较分枝点与分枝末端的差异。

BEAM_res <- BEAM(sce[expressed_genes, ], branch_point = 1, cores = 4)
BEAM_res <- BEAM_res[order(BEAM_res$qval),]
BEAM_res <- BEAM_res[,c("gene_short_name", "pval", "qval")]
plot_genes_branched_heatmap(sce[row.names(subset(BEAM_res,
                                          qval < 1e-4)),],
                                          branch_point = 1,
                                          num_clusters = 4,
                                          cores = 1,
                                          use_gene_short_name = T,
                                          show_rownames = T)

image.png

可以针对基因做图

plot_genes_branched_pseudotime(sce[rownames(subset(BEAM_res, qval < 1e-8)), ], 
                                 branch_point = 1, color_by = "orig.ident", 
                                 ncol = 3)

image.png

上面是可选部分 👆.

5.推断发育轨迹

用上面过滤的基因进行降维，以方便下一步将细胞分布到不同的状态之间，可视化拟时序得分.

5.1: choosing genes that define progress

5.2: reducing the dimensionality of the data

5.3: ordering the cells in pseudotime

5.1. 选合适基因

得到上面两种过滤的基因集：unsup_clustering_genes 或者 diff_test_res，选取合适基因进行排序.

## 以差异基因为例：
## ordering_genes <- row.names (subset(diff_test_res, qval < 0.01))
cds <- setOrderingFilter(cds, ordering_genes)
plot_ordering_genes(cds)

image.png

5.2 降维

默认使用DDRTree的方法进行降维分析，代码如下

cds <- reduceDimension(
  cds,
  max_components = 2,
  method = 'DDRTree')

5.3 细胞排序

然后就是细胞排序，代码如下：

cds <- orderCells(cds)

运行之后，对于每个细胞，会给出一个pseudotime值，示意如下

image.png

可视化代码：

默认用"state" 填充。应该是cluster 含义.

plot_cell_trajectory(sce)

image.png

降维之后，在二维空间展示细胞pseudotime的分布，可以看到是一个树状结构，除了上述方法外，还可以根据pseudotime的值给细胞赋颜色，代码如下

plot_cell_trajectory(sce, color_by = "Pseudotime")

image.png

其实就是将fData中对应的列设置为颜色，如果想要观察不同细胞亚型的分布，可以在fData中新增一列细胞对应的cluster ID, 然后用这一类来设置颜色。

对于pseudotime分析，我们需要明白它的基本输入就是一张基因在细胞中表达量的表格，与细胞的聚类结果无关，只不过在可视化的时候根据聚类的结果填充了颜色而已。

另外，plot_genes_in_pseudotime 可以对基因在不同细胞中的表达量变化进行绘图

plot_genes_in_pseudotime(cds[cg,],
                         color_by = "State")

小结：

【Workflow以及与Seurat的异同】

①创建CellDataSet对象（下简称CDS对象）
若要创建新的CDS对象，则需要整理出3个输入文件（基因-细胞表达矩阵、细胞-细胞特征注释矩阵、基因-基因特征注释矩阵），但方便的是，Monocle支持从Seurat中直接导入对象，通过importCDS命令实现。
在创建之后，也会进行数据过滤和标准化，不同的是Seurat是基于作图的方法去筛选掉异常的细胞，而Monocle的过滤方法则是根据表达数据的数学关系来实现。
上限： image.png

下限：

image.png

其中X为基因表达总数, n 为细胞数，sd为表达水平方差
大概就是以一个大致的细胞表达水平为基准，表达量太高的跟太低的去除掉。

②通过已知的Marker基因分类细胞
方法一：通过一些现有的生物/医学知识手动赋予它们细胞名，将这些细胞分为几大类，然后再聚类细胞。
方法二：与Seurat包的分类细胞方法类似，筛选出差异表达基因用于聚类，然后再根据每一个cluster的marker赋予细胞名。
③聚类细胞
采用的也是t-SNE算法。
④将细胞按照伪时间的顺序排列在轨迹上
- Step1：选择输入基因用于机器学习
  这个过程称为feature selection（特征选择），这些基因对轨迹的形状有着最重要的影响。我们应该要选择的是最能反映细胞状态的基因。
  如果直接通过Seurat输出的一些重要的基因（比如每个cluster中的高FC值基因）作为输入对象的话就能够实现一个“无监督”分析。或者也可以利用生物学知识手动选择一些重要的基因进行“半监督”分析。
- Step2：数据降维
  利用Reversed graph embedding算法将数据降维。
  相对于PCA来说这个算法更能够反应数据的内部结构（据monocle网站说是这样）
- Step3：将细胞按照伪时间排序
  这个过程称为manifold learning（流形学习）。Monocle利用轨迹来描述细胞如何从一个状态转换到另一个状态。得到的是一个树状图，树的根部或树干表示的是细胞最初的状态（如果有的话），随着细胞的变化就沿着分叉树枝发展。一个细胞的伪时间值（pseudotime value）为它的位置沿着树枝到根部的距离。

分析scRNA-seq的数据的关键在于如何对细胞进行cluster。这其中有很多的算法，而之后的降维分析比如tSNE其实主要还是为了数据图形化显示方便。在细胞分群之后，差异表达分析其实与第三章的RNA-seq并无二致，我们只需要对需要比较的因素做到心中有数即可。

Monocle2 学习笔记

参考：

主要内容

1. 读取数据

2.预处理

过滤细胞

3. 细胞分群/聚类

step1：dispersionTable()

step2：plot_pc_variance_explained() 耗时

step3: 进行降维和聚类

4.筛选基因(用于拟分析时序)

4.1 一般过滤标准

4.2 使用差异表达基因

BEAM分析

5.推断发育轨迹

5.1. 选合适基因

5.2 降维

5.3 细胞排序

小结：

猜你喜欢

热点阅读