生物信息学R语言源码

3、Feature selection

2020-09-20  本文已影响0人  小贝学生信

原文链接Chapter 8 Feature selection

1、motivation

load("fluidigm_lgnm.Rdata")
fluidigm_lgnm
#加载上一步得到的结果

2、Quantifying per-gene variation

法一:modelGeneVar()(推荐)

library(scran)
#默认使用logcounts矩阵
dec.fluidigm <- modelGeneVar(fluidigm_lgnm)
dec.fluidigm[order(dec.fluidigm$bio, decreasing=TRUE),]
2-1
fit.fluidigm <- metadata(dec.fluidigm)
plot(fit.fluidigm$mean, fit.fluidigm$var, xlab="Mean of log-expression",
     ylab="Variance of log-expression")
curve(fit.fluidigm$trend(x), col="dodgerblue", add=TRUE, lwd=2)

法二:modelGeneCV2()

dec.cv2.fluidigm <- modelGeneCV2(fluidigm_lgnm)
head(dec.cv2.fluidigm[order(dec.cv2.fluidigm$ratio, decreasing=TRUE),])
2-3

Both the CV2 and the variance of log-counts are effective metrics for quantifying variation in gene expression.
原文推荐还是使用基于logcounts的第一种方法,By using log-values here, we ensure that our quantitative definition of heterogeneity is consistent throughout the entire analysis.

补充: diminish batch effect

dec.block.fluidigm <- modelGeneVar(fluidigm_lgnm, block=fluidigm_lgnm$Coverage_Type)
dec.block.fluidigm[order(dec.block.fluidigm$bio, decreasing=TRUE),][1:4,1:4] 
#sce.416b(未加载,仅参考)有block、phenotype两类
design <- model.matrix(~factor(block) + phenotype, colData(sce.416b))
dec.design.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC", design=design)
dec.design.416b[order(dec.design.416b$bio, decreasing=TRUE),]

法三:考虑technical noise技术噪音(后来补充)

思路1 ERCC
#代码直接摘自原文
dec.spike.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC")
dec.spike.416b[order(dec.spike.416b$bio, decreasing=TRUE),]

plot(dec.spike.416b$mean, dec.spike.416b$total, xlab="Mean of log-expression",
    ylab="Variance of log-expression")
fit.spike.416b <- metadata(dec.spike.416b)
points(fit.spike.416b$mean, fit.spike.416b$var, col="red", pch=16)
curve(fit.spike.416b$trend(x), col="dodgerblue", add=TRUE, lwd=2)
思路2 泊松分布
set.seed(0010101)
dec.pois.pbmc <- modelGeneVarByPoisson(sce.pbmc)
dec.pois.pbmc <- dec.pois.pbmc[order(dec.pois.pbmc$bio, decreasing=TRUE),]
head(dec.pois.pbmc)
plot(dec.pois.pbmc$mean, dec.pois.pbmc$total, pch=16, xlab="Mean of log-expression",
    ylab="Variance of log-expression")
curve(metadata(dec.pois.pbmc)$trend(x), col="dodgerblue", add=TRUE)

还是注意这句话:This revised assumption(法1、2) is generally reasonable but may be problematic in some scenarios where many genes at a particular abundance are affected by a biological process.

3、Selecting highly variable genes

3.1 Based on the largest metrics(推荐)

(1)modelGeneVar()结果按bio列选择最大的subset(default)

hvg.fluidigm.var <- getTopHVGs(dec.fluidigm, n=1000)
str(hvg.fluidigm.var)

(2)modelGeneCV2()结果按ratio列选择最大的subset

hvg.fluidigm.cv2 <- getTopHVGs(dec.cv2.fluidigm, var.field="ratio", n=1000)
str(hvg.fluidigm.cv2)

一般取500-5000,都是合理的。关于具体取多少个,原文建议为Our recommendation is to simply pick an arbitrary X and proceed with the rest of the analysis, with the intention of testing other choices later, rather than spending much time worrying about obtaining the “optimal” value.

3.2 Based on significance

hvg.fluidigm.var.2 <- getTopHVGs(dec.fluidigm, fdr.threshold=0.05)
length(hvg.fluidigm.var.2)

3.3 Keeping all genes above the trend

hvg.fluidigm.var.3 <- getTopHVGs(dec.fluidigm, var.threshold=0)

3.4 proportion (above the trend)

getTopHVGs(dec.fluidigm, prop=0.1)

4、get result sce

hvg.fluidigm.var <- getTopHVGs(dec.fluidigm, n=1000)
str(hvg.fluidigm.var)

4.1 subset the SingleCellExperiment

fluidigm.hvg <- fluidigm_lgnm[hvg.fluidigm.var,]
dim(fluidigm.hvg)   #1000  127
#加入原始矩阵
altExp(fluidigm.hvg, "original") <- fluidigm_lgnm
altExpNames(fluidigm.hvg)
#取回原始矩阵
fluidigm.original <- altExp(sce.fluidigm.hvg, "original", withColData=TRUE)#取回原始矩阵

4.2 subset.row=

fluidigm_lgnm <- runPCA(fluidigm_lgnm, subset_row=hvg.fluidigm.var)
reducedDimNames(fluidigm_lgnm)
#保存第一种结果,供后面的分析使用
save(fluidigm_lgnm,file = "fluidigm_lgnm_hvg.Rdata")

以上是第八章Feature selection部分的简单流程笔记。原文还介绍了基于spike in 与泊松分布的两种挑选高变基因的高级方法,详见Chapter 8 Feature selection
本系列笔记基于OSCA全流程的大致流程梳理,详细原理可参考原文。如有错误,恳请指正!
此外还有刘小泽老师整理的全文翻译笔记,详见目录

上一篇 下一篇

猜你喜欢

热点阅读