单细胞测序专题集合单细胞转录组单细胞空间转录组

几个scRNA找高变异基因(HVGs)的方法

2019-09-15  本文已影响0人  刘小泽

刘小泽写于19.9.10
高变异基因就是highly variable features(HVGs),就是在细胞与细胞间进行比较,选择表达量差别最大的

1. Seurat

参考:https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html

利用FindVariableFeatures函数,会计算一个mean-variance结果,也就是给出表达量均值和方差的关系并且得到top variable features

计算方法主要有三种:

例如:使用Seurat 版本3

# V3 代码来自官方教程https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(pbmc), 10)

# 分别绘制带基因名和不带基因名的
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2))

使用Seurat版本2

# V2
pbmc <- FindVariableGenes(object = pbmc, 
                         mean.function = ExpMean, 
                         dispersion.function = LogVMR )
length( pbmc@var.genes) 
# 默认值是:x.low.cutoff = 0.1, x.high.cutoff = 8, y.cutoff = 1,就是说取log后的平均表达量(x轴)介于0.1-8之间的;分散程度(y轴,即标准差)至少为1的

2. Monocle

参考:https://cole-trapnell-lab.github.io/monocle-release/docs/#clustering-cells

# V2
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds)
disp_table <- dispersionTable(cds) 
> head(disp_table)
     gene_id mean_expression dispersion_fit dispersion_empirical
1 AL669831.5     0.011673004      42.669671             0.000000
2      NOC2L     0.140316168       5.221419             1.696712
3    PLEKHN1     0.016292004      31.089206             0.000000
4 AL645608.8     0.009537725      51.814220             0.000000
5       HES4     0.265523990       3.619078            28.119205
6      ISG15     0.793811626       2.424032            11.047583

unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1) 
cds <- setOrderingFilter(cds, unsup_clustering_genes$gene_id) 
plot_ordering_genes(cds) 

3. scran

参考:https://bioconductor.riken.jp/packages/3.7/workflows/vignettes/simpleSingleCell/inst/doc/work-5-mnn.html

利用函数trendVar()decomposeVar() 根据表达量计算highly variable genes (HVGs)

fit <- trendVar(sce, parametric=TRUE) 
dec <- decomposeVar(sce, fit)

如果有不感兴趣的差异来源(例如plate、donor),为了确保后面鉴定HVGs过程不会扩大真实的数据偏差,可以设置block

block <- paste0(sce$Plate, "_", sce$Donor)
fit <- trendVar(sce,block=block, parametric=TRUE) 
dec <- decomposeVar(sce, fit)

最后作图

plot(dec$mean, dec$total, xlab="Mean log-expression", 
    ylab="Variance of log-expression", pch=16)
OBis.spike <- isSpike(sce)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)

decomposeVar函数帮助文档中有一句描述:Highly variable genes (HVGs) can be identified as those with large biological components. The biological component is determined by subtracting the technical component from the total variance.

HVGs能够代表大部分的生物学差异,而这种差异是由总体差异减去技术因素差异得到的

dec$Symbol <- rowData(dec)$Symbol
dec <- dec[order(dec$bio, decreasing=TRUE),]

> head(dec,2)
DataFrame with 2 rows and 7 columns
                            mean            total              bio
                       <numeric>        <numeric>        <numeric>
ENSG00000254647 2.83712754306791 6.30184692631371 5.85904290864641
ENSG00000129965 1.88188510741958 5.96360144483475  5.5152391307155
                             tech   p.value       FDR      Symbol
                        <numeric> <numeric> <numeric> <character>
ENSG00000254647 0.442804017667299         0         0         INS
ENSG00000129965 0.448362314119254         0         0    INS-IGF2

4. M3Drop

Brennecke et al. (2013) Accounting for technical noise in single-cell RNA-seq experiments. Nature Methods 10, 1093-1095. doi:10.1038/nmeth.2645

library(M3DExampleData)
# 需要提供表达矩阵(expr_mat)=》normalized or raw (not log-transformed) 
HVG <- BrenneckeGetVariableGenes(expr_mat=M3DExampleData, spikes=NA, suppress.plot=FALSE, fdr=0.1, minBiolDisp=0.5, fitMeanQuantile=0.8)
HVG_spike <- BrenneckeGetVariableGenes(Mmus_example_list$data, spikes=5550:5600)

5. 自定义函数

Extract genes with a squared coefficient of variation >2 times the fit regression (Brennecke et al 2013 method)
实现了:Select the highly variable genes based on the squared coefficient of variation and the mean gene expression and return the RPKM matrix the the HVG

getMostVarGenes <- function(
  data=data,                # RPKM matrix
  fitThr=1.5,           # Threshold above the fit to select the HGV
  minMeanForFit=1           # Minimum mean gene expression level
){
  # data=females;fitThr=2;minMeanForFit=1   
  # Remove genes expressed in no cells
  data_no0 <- as.matrix(
    data[rowSums(data)>0,]
  )
  # Compute the mean expression of each genes
  meanGeneExp <- rowMeans(data_no0)
  names(meanGeneExp)<- rownames(data_no0)
  
  # Compute the squared coefficient of variation
  varGenes <- rowVars(data_no0)
  cv2 <- varGenes / meanGeneExp^2
  
  # Select the genes which the mean expression is above the expression threshold minMeanForFit
  useForFit <- meanGeneExp >= minMeanForFit
  
  # Compute the model of the CV2 as a function of the mean expression using GLMGAM
  fit <- glmgam.fit( cbind( a0 = 1, 
                            a1tilde = 1/meanGeneExp[useForFit] ), 
                     cv2[useForFit] )
  a0 <- unname( fit$coefficients["a0"] )
  a1 <- unname( fit$coefficients["a1tilde"])
  
  # Get the highly variable gene counts and names
  fit_genes <- names(meanGeneExp[useForFit])
  cv2_fit_genes <- cv2[useForFit]
  fitModel <- fit$fitted.values
  names(fitModel) <- fit_genes
  HVGenes <- fitModel[cv2_fit_genes>fitModel*fitThr]
  print(length(HVGenes))
  
  # Plot the result
  plot_meanGeneExp <- log10(meanGeneExp)
  plot_cv2 <- log10(cv2)
  plotData <-  data.frame(
    x=plot_meanGeneExp[useForFit],
    y=plot_cv2[useForFit],
    fit=log10(fit$fitted.values),
    HVGenes=log10((fit$fitted.values*fitThr))
  )
  p <- ggplot(plotData, aes(x,y)) +
    geom_point(size=0.1) +
    geom_line(aes(y=fit), color="red") +
    geom_line(aes(y=HVGenes), color="blue") +
    theme_bw() +
    labs(x = "Mean expression (log10)", y="CV2 (log10)")+
    ggtitle(paste(length(HVGenes), " selected genes", sep="")) +
    theme(
      axis.text=element_text(size=16),
      axis.title=element_text(size=16),
      legend.text = element_text(size =16),
      legend.title = element_text(size =16 ,face="bold"),
      legend.position= "none",
      plot.title = element_text(size=18, face="bold", hjust = 0.5),
      aspect.ratio=1
    )+
    scale_color_manual(
      values=c("#595959","#5a9ca9")
    )
  print(p)
  
  # Return the RPKM matrix containing only the HVG
  HVG <- data_no0[rownames(data_no0) %in% names(HVGenes),]
  return(HVG)
}

P.S.

参考:https://bioconductor.org/packages/release/workflows/vignettes/simpleSingleCell/inst/doc/var.html
https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html

还有其他的一些函数:


欢迎关注我们的公众号~_~  
我们是两个农转生信的小硕,打造生信星球,想让它成为一个不拽术语、通俗易懂的生信知识平台。需要帮助或提出意见请后台留言或发送邮件到jieandze1314@gmail.com

Welcome to our bioinfoplanet!
上一篇 下一篇

猜你喜欢

热点阅读