3、Feature selection
原文链接Chapter 8 Feature selection
1、motivation
- select genes that contain useful information about the biology of the system while removing genes that contain random noise.
- preserve interesting biological structure without the variance that obscures that structure
- reduce the size of the data to improve computational efficiency of later steps.(clustering and dimensionality reduction)
- simplest approach to feature selection is to select the most variable genes based on their expression across the population.
load("fluidigm_lgnm.Rdata")
fluidigm_lgnm
#加载上一步得到的结果
2、Quantifying per-gene variation
- most variable genes即基因在不同细胞中表达值的方差值最大的基因。
-但一般来说,变量值平均水平高,其离散程度的测度值越大,反之越小。 - 因此在计算过程中要消除均值过大(abundance)带来的不平衡影响。
-
scater
包提供了基于不同思路的方法,简单介绍如下
法一:modelGeneVar()
(推荐)
library(scran)
#默认使用logcounts矩阵
dec.fluidigm <- modelGeneVar(fluidigm_lgnm)
dec.fluidigm[order(dec.fluidigm$bio, decreasing=TRUE),]

- 原理,可通过mean与var的点图拟合曲线了解,本质为方差分解。
fit.fluidigm <- metadata(dec.fluidigm)
plot(fit.fluidigm$mean, fit.fluidigm$var, xlab="Mean of log-expression",
ylab="Variance of log-expression")
curve(fit.fluidigm$trend(x), col="dodgerblue", add=TRUE, lwd=2)
-
如下图,点(基因)y坐标表示总方差,拟合曲线下方表示tech var,y坐标到拟合曲线距离即表示细胞异质性波动,为我们关心的部分。
total:Variance of the normalized log-expression per gene.
bio:Biological component of the variance.
tech:Technical component of the variance.
2-2
法二:modelGeneCV2()
- 通过变异系数(CV=sd/mean)平方消除均值影响,得到的ratio(=total/trend)列为我们重点关注的部分
- 值得注意的是此函数基于名为
counts
的原始矩阵normlized(未log处理)计算的
dec.cv2.fluidigm <- modelGeneCV2(fluidigm_lgnm)
head(dec.cv2.fluidigm[order(dec.cv2.fluidigm$ratio, decreasing=TRUE),])

- mean:Mean normalized expression per gene.
- total:Squared coefficient of variation of the normalized expression per gene.
- trend:Fitted value of the trend.
- ratio:Ratio of total to trend.
-
p.value, FDR:Raw and adjusted p-values for the test against the null hypothesis that ratio<=1.
2-4
Both the CV2 and the variance of log-counts are effective metrics for quantifying variation in gene expression.
原文推荐还是使用基于logcounts的第一种方法,By using log-values here, we ensure that our quantitative definition of heterogeneity is consistent throughout the entire analysis.
补充: diminish batch effect
- 消除一类批次影响
dec.block.fluidigm <- modelGeneVar(fluidigm_lgnm, block=fluidigm_lgnm$Coverage_Type)
dec.block.fluidigm[order(dec.block.fluidigm$bio, decreasing=TRUE),][1:4,1:4]
- 消除多类批次影响
#sce.416b(未加载,仅参考)有block、phenotype两类
design <- model.matrix(~factor(block) + phenotype, colData(sce.416b))
dec.design.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC", design=design)
dec.design.416b[order(dec.design.416b$bio, decreasing=TRUE),]
法三:考虑technical noise技术噪音(后来补充)
- Strictly speaking, the use of a trend fitted to endogenous genes assumes that the expression profiles of most genes are dominated by random technical noise.
- This revised assumption is generally reasonable but may be problematic in some scenarios where many genes at a particular abundance are affected by a biological process.
思路1 ERCC
- fitting a mean-dependent trend to the variance of the spike-in transcripts;
- For spike-ins should not be affected by biological variation;
#代码直接摘自原文
dec.spike.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC")
dec.spike.416b[order(dec.spike.416b$bio, decreasing=TRUE),]
plot(dec.spike.416b$mean, dec.spike.416b$total, xlab="Mean of log-expression",
ylab="Variance of log-expression")
fit.spike.416b <- metadata(dec.spike.416b)
points(fit.spike.416b$mean, fit.spike.416b$var, col="red", pch=16)
curve(fit.spike.416b$trend(x), col="dodgerblue", add=TRUE, lwd=2)
思路2 泊松分布
- 很多情况下按到的数据可能没有ERCC,那么可以利用泊松分布假设
set.seed(0010101)
dec.pois.pbmc <- modelGeneVarByPoisson(sce.pbmc)
dec.pois.pbmc <- dec.pois.pbmc[order(dec.pois.pbmc$bio, decreasing=TRUE),]
head(dec.pois.pbmc)
plot(dec.pois.pbmc$mean, dec.pois.pbmc$total, pch=16, xlab="Mean of log-expression",
ylab="Variance of log-expression")
curve(metadata(dec.pois.pbmc)$trend(x), col="dodgerblue", add=TRUE)
还是注意这句话:This revised assumption(法1、2) is generally reasonable but may be problematic in some scenarios where many genes at a particular abundance are affected by a biological process.
3、Selecting highly variable genes
3.1 Based on the largest metrics(推荐)
- 使用
getTopHVGs()
函数
(1)modelGeneVar()
结果按bio
列选择最大的subset(default)
hvg.fluidigm.var <- getTopHVGs(dec.fluidigm, n=1000)
str(hvg.fluidigm.var)
(2)modelGeneCV2()
结果按ratio
列选择最大的subset
hvg.fluidigm.cv2 <- getTopHVGs(dec.cv2.fluidigm, var.field="ratio", n=1000)
str(hvg.fluidigm.cv2)
一般取500-5000,都是合理的。关于具体取多少个,原文建议为Our recommendation is to simply pick an arbitrary X and proceed with the rest of the analysis, with the intention of testing other choices later, rather than spending much time worrying about obtaining the “optimal” value.
3.2 Based on significance
hvg.fluidigm.var.2 <- getTopHVGs(dec.fluidigm, fdr.threshold=0.05)
length(hvg.fluidigm.var.2)
3.3 Keeping all genes above the trend
hvg.fluidigm.var.3 <- getTopHVGs(dec.fluidigm, var.threshold=0)
3.4 proportion (above the trend)
getTopHVGs(dec.fluidigm, prop=0.1)
4、get result sce
hvg.fluidigm.var <- getTopHVGs(dec.fluidigm, n=1000)
str(hvg.fluidigm.var)
4.1 subset the SingleCellExperiment
- 直接截取SingleCellExperiment的感兴趣基因部分。
- 若想保留原始信息,可将原始矩阵加入新sce的altExp slot里
fluidigm.hvg <- fluidigm_lgnm[hvg.fluidigm.var,]
dim(fluidigm.hvg) #1000 127
#加入原始矩阵
altExp(fluidigm.hvg, "original") <- fluidigm_lgnm
altExpNames(fluidigm.hvg)
#取回原始矩阵
fluidigm.original <- altExp(sce.fluidigm.hvg, "original", withColData=TRUE)#取回原始矩阵
4.2 subset.row=
- 仍然使用原始矩阵,当后面的分析用到差异基因时,加上
subset.row=
参数即可。 - 例如PCA分析时--
fluidigm_lgnm <- runPCA(fluidigm_lgnm, subset_row=hvg.fluidigm.var)
reducedDimNames(fluidigm_lgnm)
#保存第一种结果,供后面的分析使用
save(fluidigm_lgnm,file = "fluidigm_lgnm_hvg.Rdata")
以上是第八章Feature selection部分的简单流程笔记。原文还介绍了基于spike in 与泊松分布的两种挑选高变基因的高级方法,详见Chapter 8 Feature selection
本系列笔记基于OSCA全流程的大致流程梳理,详细原理可参考原文。如有错误,恳请指正!
此外还有刘小泽老师整理的全文翻译笔记,详见目录