single cell RNA-seq analysis生信文献阅读汇总单细胞测序

OSCA单细胞数据分析笔记-14、Empty/Doublet d

2021-07-13  本文已影响0人  小贝学生信

对应原版教程第15、16章http://bioconductor.org/books/release/OSCA/overview.html
现行主流的Droplet-based单细胞测序技术主要思路是一个磁珠捕获一个细胞置于油包水的腔室里完成添加标签、建库操作。但在磁珠捕获的过程会出现未捕获到细胞或者两个细胞的异常情况。这就需要我们在分析单细胞数据中识别、过滤掉这些bad barcode(cell)。

(源网侵删~)

1、Empty droplet

1.1 关于Empty

#--- loading ---#
library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "pbmc4k")
untar(raw.path, exdir=out.path)
library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)
sce.pbmc
#class: SingleCellExperiment 
#dim: 33694 737280

我们可以根据根据每个细胞的counts数大小从高到底排序,观察是否有明显的断崖趋势。

library(DropletUtils)
bcrank <- barcodeRanks(counts(sce.pbmc))
# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
     xlab="Rank", ylab="Total UMI count", cex.lab=1.2)
abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2)
abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2)
legend("bottomleft", legend=c("Inflection", "Knee"), 
       col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)

如下图呈现结果,在count数1000左右,细胞数量突然减少可能就意味着empty droplet。


1.2 去除empty droplet

set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
summary(e.out$FDR <= 0.001)
##    Mode   FALSE    TRUE    NA's 
## logical     989    4300  731991
#最后保留得到4300个细胞
#NA值表示count < 100的empty droplet(default parameter,可自定义修改)
retained <- sce.pbmc[,which(e.out$FDR <= 0.001)]
bcrank <- barcodeRanks(counts(retained))
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
     xlab="Rank", ylab="Total UMI count", cex.lab=1.2)
table(Sig=e.out$FDR <= 0.001, Limited=e.out$Limited)
##        Limited
## Sig     FALSE TRUE
##   FALSE   989    0
##   TRUE   1728 2572

2、Doublet droplet

2.1 关于Doublet droplet

sce.mam
#class: SingleCellExperiment 
#dim: 27998 2772

2.2 方法一:Detection with clusters

library(scDblFinder)
dbl.out <- findDoubletClusters(sce.mam)
dbl.out

如下结果含义可通过?findDoubletClusters查看。关键的几个指标有 num.de: query cluster与 source cluster的差异基因数(越少越有可能为doublet),lib.size: source cluster的median cell library/query cluster的median cell library(越少越有可能为doublet,因为doublet cell的library size会大一些)。此外prop表示query cluster的细胞数占总细胞数的比例,这可为我们对预期的doublet 数量提供判断(一般小于5%)。

library(scater)
chosen.doublet <- rownames(dbl.out)[isOutlier(dbl.out$num.de, type="lower", log=TRUE)]
chosen.doublet
dbl.out[chosen.doublet,]
#DataFrame with 1 row and 9 columns
#      source1     source2    num.de median.de        best    p.value lib.size1 lib.size2      prop
#  <character> <character> <integer> <numeric> <character>  <numeric> <numeric> <numeric> <numeric>
#6           2           1        13     507.5       Pcbp2 0.00128336  0.811531  0.516399  0.030303
#basal cells (Acta2) and alveolar cells (Csn2)
plotExpression(sce.mam, features=c("Acta2", "Csn2"), 
    x="label", colour_by="label")

如下图,的确cluster6高表达两种细胞类型的marker gene,很有可能就是doublet cluster。再后续分析种予以去除。


最后关于这种基于cluster的判断doublet方法比较依赖合适的聚类结果。如教程所说:Clusters that are too coarse will fail to separate doublets from other cells, while clusters that are too fine will complicate interpretation.

2.2 方法二:Detection by simulation

library(BiocSingular)
set.seed(100)
# Setting up the parameters for consistency with denoisePCA();
# this can be changed depending on your feature selection scheme.
dbl.dens <- computeDoubletDensity(sce.mam, subset.row=top.mam, 
    d=ncol(reducedDim(sce.mam)))
summary(dbl.dens)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.249   0.527   1.041   1.188  14.570
sce.mam$DoubletScore <- dbl.dens
plotTSNE(sce.mam, colour_by="DoubletScore")
plotColData(sce.mam, x="label", y="DoubletScore", colour_by="label")

最后强调一点:无论是empty droplet还是doublet,都应当基于单批次的样本进行操作处理。因为每个批次的测序环境都是不一致的,不能够符合统一的假设。即对于多批次的样本,应该进行分别的鉴别、过滤,再进行后续的分析流程。

上一篇下一篇

猜你喜欢

热点阅读