1、Quality control

2020-09-19  本文已影响0人  小贝学生信

原文链接Chapter 6 Quality Control

一、基础知识

1 为什么QC

因为低质量的cell通常会contribute to misleading results in downstream analyses

2 QC指标

spike-in是已知浓度的外源RNA分子。在单细胞裂解液中加入spike-in后,再进行反转录。最广泛使用的spike-in是由External RNA Control Consortium (ERCC)提供的。ERCC含量大,则说明total sum变小

3、示例数据

#scRNAseq包的一个公共数据
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b") 
sce.416b$block <- factor(sce.416b$block)
sce.416b
dim(sce.416b) #46604个基因,192个细胞

二、QC metrics 计算

## 注释线粒体基因
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
chr.loc <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
                  keytype="GENEID", column="SEQNAME")
sort(table(chr.loc))
is.mito.alt <- which(chr.loc=="MT")
table(is.mito.alt)
#返回所有是线粒体基因的行名号
#biomaRt法
library("biomaRt")
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl") #人
attributes = listAttributes(ensembl)
test <- getBM(attributes=c('ensembl_gene_id', 'chromosome_name', 'hgnc_symbol'), 
      mart = ensembl)
2-1

如果基因名为SYMBOL ID,那么我们可以直接grepl("^MT-", rownames(sce))

# 基于线粒体注释,利用scater包直接计算上述三者指标
library(scater)
df <- perCellQCMetrics(sce.416b, subsets=list(Mito=is.mito.alt))
dim(df) ##192个细胞里表达情况(包括线粒体的相关情况)
colnames(df)
df[1:4,1:4]
2-2

关于 percent_top_number,表示的是一个比例。以percent_top_50为例,表示某细胞sum数前50个gene的sum和除以该细胞的counts sum总和。

colnames(colData(sce.416b))
sce.416b <- addPerCellQC(sce.416b, subsets=list(Mito=is.mito))
colnames(colData(sce.416b))
#除Mito信息外,也还自动注释了ERCC信息

计算所有细胞的这三个及指标,就可以筛选低质量细胞了。注意筛选的前体是我们假设the QC metrics are independent of the biological state of each cell. Poor values (e.g., low library sizes, high mitochondrial proportions) are presumed to be driven by technical factors rather than biological processes, meaning that the subsequent removal of cells will not misrepresent the biology in downstream analyses.

三、鉴定低质量细胞

1、fixed thresholds

即设置固定的阈值,一般通过经验判断,例如(1)library sizes below 100,000 reads;(2)express fewer than 5,000 genes;(3)have spike-in proportions above 10%; or have mitochondrial proportions above 10%.

qc.lib <- df$sum < 1e5
qc.nexprs <- df$detected < 5e3
qc.spike <- df$altexps_ERCC_percent > 10
qc.mito <- df$subsets_Mito_percent > 10
discard <- qc.lib | qc.nexprs | qc.spike | qc.mito
DataFrame(LibSize=sum(qc.lib), NExprs=sum(qc.nexprs),
    SpikeProp=sum(qc.spike), MitoProp=sum(qc.mito), Total=sum(discard))
3-1

2、adaptive thresholds

qc.lib2 <- isOutlier(df$sum, log=TRUE, type="lower")  
#lower表示取小,结果返回的是逻辑值T/F
attr(qc.lib2, "thresholds") #查看筛选阈值
qc.nexprs2 <- isOutlier(df$detected, log=TRUE, type="lower")
attr(qc.nexprs2, "thresholds")
qc.spike2 <- isOutlier(df$altexps_ERCC_percent, type="higher")
attr(qc.spike2, "thresholds")
qc.mito2 <- isOutlier(df$subsets_Mito_percent, type="higher")
attr(qc.mito2, "thresholds")
discard2 <- qc.lib2 | qc.nexprs2 | qc.spike2 | qc.mito2
#整合
DataFrame(LibSize=sum(qc.lib2), NExprs=sum(qc.nexprs2),
    SpikeProp=sum(qc.spike2), MitoProp=sum(qc.mito2), Total=sum(discard2))
3-2
reasons <- quickPerCellQC(df, percent_subsets=c("subsets_Mito_percent",
    "altexps_ERCC_percent"))
#返回维度与df相同的数据框,内容变为逻辑值,即是否应该抛弃
colSums(as.matrix(reasons))
tips:batch factor

考虑批次效应可能带来的错误筛选,计算予以调整,在上面的quickPerCellQC可以设置batch参数

batch <- paste0(sce.416b$phenotype, "-", sce.416b$block)
batch.reasons <- quickPerCellQC(df, percent_subsets=c("subsets_Mito_percent",
                                                      "altexps_ERCC_percent"), batch=batch)
colSums(as.matrix(batch.reasons))
3-3

注意这里同样也假设所有的batch都是高质量的,如果已知某一个batch可能都有问题,那么就不应该调整这个batch,以免影响整体的判断。具体可分别判断每个batch的阈值,观察比较有无明显异常值。

四、绘图观察剔除细胞

1、指标细胞阈值可视化

sce.416b$phenotype <- ifelse(grepl("induced", sce.416b$phenotype),
                             "induced", "wild type")
sce.416b$discard <- reasons$discard

gridExtra::grid.arrange(
  plotColData(sce.416b, x="block", y="sum", colour_by="discard",
              other_fields="phenotype") + facet_wrap(~phenotype) + 
    scale_y_log10() + ggtitle("Total count"),
  plotColData(sce.416b, x="block", y="detected", colour_by="discard", 
              other_fields="phenotype") + facet_wrap(~phenotype) + 
    scale_y_log10() + ggtitle("Detected features"),
  plotColData(sce.416b, x="block", y="subsets_Mito_percent", 
              colour_by="discard", other_fields="phenotype") + 
    facet_wrap(~phenotype) + ggtitle("Mito percent"),
  plotColData(sce.416b, x="block", y="altexps_ERCC_percent", 
              colour_by="discard", other_fields="phenotype") + 
    facet_wrap(~phenotype) + ggtitle("ERCC percent"),
  ncol=1
)
4-1

2、线粒体基因相关性绘图

plotColData(sce.416b, x="sum", y="subsets_Mito_percent", colour_by="discard")

如图表明,剔除的细胞都是sum小,而造成mito含量相对提高的异常细胞。正常细胞是mito含量与sum成正比的。


4-2

五、去除低质量细胞

straightforward!

filtered <- sce.416b[,!reasons$discard]
最后的验证
lost <- calculateAverage(counts(sce.416b)[,!discard])
kept <- calculateAverage(counts(sce.416b)[,discard])

library(edgeR)
logged <- cpm(cbind(lost, kept), log=TRUE, prior.count=2)
logFC <- logged[,1] - logged[,2]
abundance <- rowMeans(logged)
plot(abundance, logFC, xlab="Average count", ylab="Log-FC (lost/kept)", pch=16)
points(abundance[is.mito], logFC[is.mito], col="dodgerblue", pch=16)

关于empty droplets

e.out <- emptyDrops(counts(sce.pbmc))
# See ?emptyDrops for an explanation of why there are NA values.
summary(e.out$FDR <= 0.001)
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

以上是第六章QC部分的简单流程笔记。原文还介绍了判断是否为empty droplets (第五点),以及保留低质量细胞的意义(第七点),详见Chapter 6 Quality Control
本系列笔记基于OSCA全流程的大致流程梳理,详细原理可参考原文。如有错误,恳请指正!
此外还有刘小泽老师整理的全文翻译笔记,详见目录

上一篇 下一篇

猜你喜欢

热点阅读