OSCA单细胞数据分析笔记-12、Intergrating Da

2021-06-21  本文已影响0人  小贝学生信

对应原版教程第13章http://bioconductor.org/books/release/OSCA/overview.html
无论是scRNA-seq,还是Bulk RNA-seq,批次效应都是一个很头疼的问题,如何有效地校正、并且正确地使用校正后的数据是很值得讨论的分析点。

源网,侵删~

笔记要点

1、单细胞数据的批次效应 batch effect
2、评判批次效应(示例数据)
3、两种消除批次效应的方法
4、进一步评价校正批次效应的结果
5、关于校正之后的细胞表达水平的思考

1、单细胞数据的批次效应 batch effect

2、评判批次效应(示例数据)

2.1 示例数据

#均已分别完成质控、hvg、降维、聚类。具体见原教程
pbmc3k
#class: SingleCellExperiment 
#dim: 31232 2609 
dec3k
# DataFrame with 31232 rows and 6 columns
# mean     total      tech          bio   p.value       FDR
# <numeric> <numeric> <numeric>    <numeric> <numeric> <numeric>
#   ENSG00000243485         0         0         0            0       NaN       NaN

pbmc4k
#class: SingleCellExperiment 
#dim: 31232 4182
dec4k
# DataFrame with 31232 rows and 6 columns
# mean       total        tech          bio   p.value       FDR
# <numeric>   <numeric>   <numeric>    <numeric> <numeric> <numeric>
#   ENSG00000243485 0.000000000 0.000000000 0.000000000  0.00000e+00       NaN       NaN

(1) 取测序基因交集

universe <- intersect(rownames(pbmc3k), rownames(pbmc4k))
length(universe)

# Subsetting the SingleCellExperiment object.
pbmc3k <- pbmc3k[universe,]
pbmc4k <- pbmc4k[universe,]

# Also subsetting the variance modelling results, for convenience.
dec3k <- dec3k[universe,]
dec4k <- dec4k[universe,]

(2) 使两个批次的测序深度相统一

library(batchelor)
rescaled <- multiBatchNorm(pbmc3k, pbmc4k)
pbmc3k <- rescaled[[1]]
pbmc4k <- rescaled[[2]]

2.2 评判批次效应

想看看多个批次间的测序数据是否存在明显的批次效应,首先需要直接合并多个批次的数据集,然后降维聚类,观察聚类结果,是否存在仅仅由单个batch的细胞组成的cluster。若存在,则表明的确有批次效应。此外还可通过绘制t-SNE二维可视化,直接观察批次间的“分离度”

rowData(pbmc3k) <- rowData(pbmc4k)
pbmc3k$batch <- "3k"
pbmc4k$batch <- "4k"
uncorrected <- cbind(pbmc3k, pbmc4k)
library(scran)
combined.dec <- combineVar(dec3k, dec4k)
chosen.hvgs <- combined.dec$bio > 0 #筛选标准放宽很多
sum(chosen.hvgs)
#13431
# Using RandomParam() as it is more efficient for file-backed matrices.
library(scater)
set.seed(0010101010)
uncorrected <- runPCA(uncorrected, subset_row=chosen.hvgs,
                      BSPARAM=BiocSingular::RandomParam())
library(scran)
snn.gr <- buildSNNGraph(uncorrected, use.dimred="PCA")
clusters <- igraph::cluster_walktrap(snn.gr)$membership
tab <- table(Cluster=clusters, Batch=uncorrected$batch)
tab
基于uncorrected的聚类结果
set.seed(1111001)
uncorrected <- runTSNE(uncorrected, dimred="PCA")
plotTSNE(uncorrected, colour_by="batch")
基于uncorrected的t-SNE降维结果

3、两种消除批次效应的方法

3.1 Linear regression

This(rescaleBatches()) is roughly equivalent to applying a linear regression to the log-expression values per gene, with some adjustments to improve performance and efficiency. For each gene, the mean expression in each batch is scaled down until it is equal to the lowest mean across all batches.

library(batchelor)
rescaled <- rescaleBatches(pbmc3k, pbmc4k)
rescaled
#class: SingleCellExperiment 
#dim: 31232 6791
# To ensure reproducibility of the randomized PCA.
set.seed(1010101010) 
rescaled <- runPCA(rescaled, subset_row=chosen.hvgs, 
                   exprs_values="corrected",
                   BSPARAM=BiocSingular::RandomParam())

snn.gr <- buildSNNGraph(rescaled, use.dimred="PCA")
clusters.resc <- igraph::cluster_walktrap(snn.gr)$membership
tab.resc <- table(Cluster=clusters.resc, Batch=rescaled$batch)
tab.resc
基于线性回归校正批次效应的聚类结果
rescaled <- runTSNE(rescaled, dimred="PCA")
rescaled$batch <- factor(rescaled$batch)
plotTSNE(rescaled, colour_by="batch")
基于线性回归校正批次效应的t-SNE降维结果

在评判批次效应时,要时刻记住batch-specific cluster可能是由于批次效应造成的异质性,同时也有可能是每个batch具有specific subpopulation。因此需要分析者对测序样本有一定的把握。

3.2 MNN(Mutual nearest neighbour)

# Again, using randomized SVD here, as this is faster than IRLBA for
# file-backed matrices. We set deferred=TRUE for greater speed.
set.seed(1000101001)
mnn.out <- fastMNN(pbmc3k, pbmc4k, d=50, k=20, subset.row=chosen.hvgs,
                   BSPARAM=BiocSingular::RandomParam(deferred=TRUE))
mnn.out
# class: SingleCellExperiment 
# dim: 13431 6791 
# metadata(2): merge.info pca.info
# assays(1): reconstructed
# rownames(13431): ENSG00000239945 ENSG00000228463 ... ENSG00000198695 ENSG00000198727
# rowData names(1): rotation
# colnames: NULL
# colData names(1): batch
# reducedDimNames(2): corrected TSNE
# altExpNames(0):
library(scran)
snn.gr <- buildSNNGraph(mnn.out, use.dimred="corrected")
clusters.mnn <- igraph::cluster_walktrap(snn.gr)$membership
tab.mnn <- table(Cluster=clusters.mnn, Batch=mnn.out$batch)
tab.mnn
基于MNN校正批次效应的聚类结果
library(scater)
set.seed(0010101010)
mnn.out <- runTSNE(mnn.out, dimred="corrected")
mnn.out$batch <- factor(mnn.out$batch)
plotTSNE(mnn.out, colour_by="batch")
基于MNN校正批次效应的t-SNE降维结果

最后关于K值的意义与选择:从最理想的结果来看相同组织、不同批次的相同细胞类型的细胞间最有可能组成cp(MNN pairs);K值越大,一个细胞就会参与到更多的MNN pairs当中,即适合放宽相同细胞类型的标准的分析情况。
此外如果先前已对每个批次进行单独的全套流程的单细胞数据分析,然后再尝试多个批次的合并效果。fastMNN()subset.row参数可以是每个批次的全部marker gene的组合,详见原教程13.7,这里就不多展开了。

4、进一步评价校正批次效应的结果

以下的示例,均以3.2MNN的校正结果为例进行分析

4.1 同一cluster里不同batch来源的细胞组成比例

chi.prop <- colSums(tab.mnn)/sum(tab.mnn)
chi.results <- apply(tab.mnn, 1, FUN=chisq.test, p=chi.prop)
p.values <- vapply(chi.results, "[[", i="p.value", 0)
p.values
# 1            2            3            4            5            6            7            8 
# 9.047103e-02 3.093226e-02 6.699936e-03 2.626832e-03 8.423646e-20 2.774871e-01 5.546015e-05 2.273510e-11 
# 9           10           11           12           13           14           15           16 
# 2.136201e-04 5.479656e-05 4.019086e-03 2.971868e-03 1.538104e-12 3.936414e-05 2.197300e-04 7.172327e-01

4.2 校正后的结果相比原始batch的异质性是否丢失

library(bluster)
ri3k <- pairwiseRand(clusters.mnn[rescaled$batch==1], colLabels(pbmc3k), mode="index")
ri3k
#0.7360574
ri4k <- pairwiseRand(clusters.mnn[rescaled$batch==2], colLabels(pbmc4k), mode="index")
ri4k
#0.8300956
# For the first batch.
tab <- pairwiseRand(colLabels(pbmc3k), clusters.mnn[rescaled$batch==1])
heat3k <- pheatmap(tab, cluster_row=FALSE, cluster_col=FALSE,
                   col=rev(viridis::magma(100)), main="PBMC 3K probabilities", silent=TRUE)
heat3k

5、关于校正之后的细胞表达水平的思考

This strategy is based on the expectation that any genuine DE between clusters should still be present in a within-batch comparison where batch effects are absent.

# uncorrected 为直接合并的原始表达矩阵
# clusters.mnn 为基于校正批次效应合并后的表达矩阵的聚类分群结果
m.out <- findMarkers(uncorrected, clusters.mnn, block=uncorrected$batch,
                     direction="up", lfc=1, row.data=rowData(uncorrected)[,3,drop=FALSE])

# A (probably activated?) T cell subtype of some sort:
demo <- m.out[["10"]]
as.data.frame(demo[1:20,c("Symbol", "Top", "p.value", "FDR")]) 

plotExpression(uncorrected, x=I(factor(clusters.mnn)), 
               features="ENSG00000177954", colour_by="batch") + facet_wrap(~colour_by)

我想:基于marker gene的细胞类型鉴定应该也需要基于校正批次效应合并后的表达矩阵的聚类分群结果,再结合对各个批次原始数据单独分析加以判断。

上一篇 下一篇

猜你喜欢

热点阅读