DESeq2差异分析十全大补使用笔记

2020-04-22 本文已影响0人 NICE_AGIS

写在前面：新手老菜鸟，个人心得，不足之处欢迎交流指正；
写作不易，转载前请联系我征得同意，谢谢。

1.基本原理

1.1 概述

全称：DESeq2 package for differential analysis of count data;

利用负二项分布广义线性模型( negative binomial generalized linear models），同时，还利用了离散型估计、logFoldChange;

负二项分布是一个离散分布，符合测序reads分布；

1.2 构建dds

要求输入原始 reads count 数；不接受已经做过处理的FPKM/TPM等，因为软件有自己的标准化计算方法；

构建dds。需要设置design公式，即告诉软件你的数据是怎样来的，基本试验设计如何，软件会根据几个变量综合计算；
一般：design =~ variable1 + variable2 + ...；
只有一个变量时：design=~ condition；
很多医学分析会加入年龄、性别等：design=~sex+disease+condition；
可以对应几个变量，但如果没有额外参数，log2FC和p值是默认对design公式中的最后一个变量或者最后一个因子与参考因子进行比较；

1.3 函数与计算

1.3.1 标准化：DESeq函数

不同样品的测序量有差异，最简单的标准化方式是计算counts per million (CPM) = 原始reads count ÷ 总reads数 x 1,000,000；

这种计算方式，易受到极高表达且在不同样品中存在差异表达的基因的影响：这些基因的打开或关闭会影响到细胞中总的分子数目，可能导致这些基因标准化之后就不存在表达差异了，而原本没有差异的基因标准化之后却有差异了；

RPKM、FPKM、TPM 是 CPM按照基因或转录本长度归一化后的表达，都会受到这一影响；

DESeq2的方法：

量化因子 (size factor,SF)，首先计算每个基因在所有样品中表达的几何平均值；每个细胞的SF是所有基因与其在所有样品中的表达值的几何平均值的比值的中位数；由于几何平均值的使用，只有在所有样品中表达都不为0的基因才能用来计算。这一方法又被称为RLE(relative log expression)。

不但考虑了测序深度的问题，还考虑了表达量超高或者极显著差异表达的基因导致count的分布出现偏倚。

DESeq函数分析：

三步：estimation of size factors（estimateSizeFactors)， estimation of dispersion（estimateDispersons)， Negative Binomial GLM fitting and Wald statistics（nbinomWaldTest）；

可以分步运行，也可一步到位，最后返回 results可用的DESeqDataSet对象。

1.3.2 归一化：rlog/vst

是我自己取的名字，可能不准确，我用在对dds进行vst然后做PCA分析。

全称：快速估算离散趋势并应用方差稳定转换；
若 samples<30 用 rlog函数，>30用 vst；
类似的函数：gmodels - fast.prcomp，输入数据为TPM；或者TMM；

1.3.3 数据收缩：lfcShrink：

shrink the log2 foldchange，不会改变显著差异的基因总数，作者很推荐这个新功能。

为何采用lfcShrink？log2FC estimates do not account for the large dispersion we observe with low read counts. 因此，两种数据特别需要：低表达量占比高的；数据特别分散的。

说的就是我的数据啊！但是我只用来做MA plot并没用来差异分析，因为：

lfcShrink 不改变p值q值，但改变了fc，使 foldchange范围变小，所以选择DEG时会有不同结果，一般会偏少！所以，根据数据情况，本次分析DEG还是不做shrink。

1.3.4 p-value和q-value

作者给出的建议：
Need to filter on adjusted p-values, not p-values, to obtain FDR control. 10% FDR is common because RNA-seq experiments are often exploratory and having 90% true positives in the gene set is ok.
即：用padj为标准做结果筛选。

事实上，在软件计算过程中，多次以alpha表示padj，并默认alpha=0.1；

1.3.5 MA plot

MA plot也叫 mean-difference plot或者Bland-Altman plot，用来估计模型中系数的分布;

X轴, the "A"(average)；Y轴，the "M"(minus) – subtraction of log values is equivalent to the log of the ratio;

M表示log fold change，衡量基因表达量变化，上调还是下调；A表示每个基因的count的均值；

根据summary(res)可知，low count的比率很高，所以大部分基因表达量不高，也就是集中在0的附近（log2(1)=0，也就是变化1倍），提供了模型预测系数的分布总览。

本次做了lfcshrink+batch之后，MA图趋于正常，比较集中在0附近，一些差异基因也可以明显看出。

1.4 DESeq(dds)结果矩阵每一列的含义：

baseMean： is a just the average of the normalized count values, dividing by size factors, taken over all samples in the DESeqDataSet；是对照组的样本标准化counts的均值；
log2FoldChange： the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment with dexamethasone in comparison to untreated samples；也不是简单的用标准化的counts进行计算，因为计算的时候需要考虑零值以及其他效应；结果是log2fc(trt/untrt)所以要注意对照和处理的指定；
lfcSE： the standard error estimate for the log2 fold change estimate，(the effect size estimate has an uncertainty associated with it,)；
p value: statistical test , the result of this test is reported as a p value. Remember that a p value indicates the probability that a fold change as strong as the observed one, or even stronger, would be seen under the situation described by the null hypothesis；

p value有时候是NA：Sometimes a subset of the p values in res will be NA (not available); This is DESeq's way of reporting that all counts for this gene were zero, and hence no test was applied. In addition, p values can be assigned NA if the gene was excluded from analysis because it contained an extreme count outlier. For more information, see the outlier detection section of the DESeq2 vignette.

padj: adjusted p-value;

1.5 实际遇到的其它问题

1.5.1 pre-analysis 预分析

就是开始熟悉你的数据，知道ta的特点，ta是什么样子什么脾气秉性，什么方式更能发现真实的ta，什么方式能扬长避短！选择合适的分析方法；

先做PCA！
方法包括且不仅包括：
gmodels - fast.prcomp
DESeq2 - vst - plotPCA

1.5.2 批次效应

1. 本项目的批次效应：

根据预先进行的PCA和correlation结果，本项目样品individual间的差异性大于不同处理的差异；

若不去除，则根据不同A处理寻找DEG时必然会因为individual的影响而覆盖掉一些，导致结果偏少；（test事实证明确是如此）；（本项目也做了removebatchEffect但是仅仅为了展示和验证）；

因此，将4个体当做4个批次，写入design公式：

design =~ individual + condition

一般批次效应：

可以用limma removeBatchEffect或者Combat等去除；

但是在做差异分析时，ballgown, DESeq2等软件建议不要提前去批次，而是将批次作为covariate进行分析。

1.5.3 多组比较

若有好几组处理，想两两比较，是分开准备数据、DESeq、results？还是全部数据一起？（注意：是几组之间两两比较，不是一比多，一比多请移步多重t检验或者WGCNA，作者Mike Love也说不能做！）

官方建议：

如有多个组需要比较，建议不要将其两两分开而是一起分析，通过在results时指定contrast对象，获得两两的比较结果，这样可以综合考虑所以样品中的表达计算量化因子做DESeq；

但是，如果你通过PCA/EDA分析发现某一组或某几组的within-group variability比其他组的大，那么还是还是两两分组分开比较吧！

举例，下面的情况就不适合一起做：（这可不就是我的数据嘛！所以，分开比较分别做DESeq更sensitive。）

PCA

1.5.4 低表达量基因过滤

实际分析完后，我发现一些假阳性DEG，即由于一个样本中出现极高reads而其它样本都是0 reads导致的DEG，在bioconductor上发帖得到了作者回复，于是加入附加条件：筛除在少于3个样品中表达量少于10 reads的基因，再做DESeq标准化和DEG筛选。**
举例：有一个基因仅仅因为在处理2中有一个表达而其他都是0，就被认为是处理2/1的上调和处理3/2的下调，很不靠谱；
解决：（这个标准可以根据数据特点，自己制定）

keep <- rowSums(counts(dds) >= 10) >= 3   
dds <- dds[keep, ]

1.5.5 outliers离群值处理

这部分提前写了，outliers是在results之后，summary(res)可以看到差异比较的一个基本结果，有一项outliers，若占比很高数量成百上千，则要引起警惕。
解决方法：

软件作者提到：
The automatic outlier filtering/replacement is most useful in situations which the number of outliers is limited. When there are thousands of reported outliers, it might make more sense to turn off the outlier filtering/replacement (DESeq with minReplicatesForReplace=Inf and results with cooksCutoff=FALSE) and perform manual inspection.
就是说离群值太多的话还是关闭这一筛选功能，方法也提到了：

minReplicatesForReplace = Inf 
results with cooksCutoff = FALSE

实践证明两个条件择一即可，outliers消失，很多假阴性降低。

分析前可以做一个欧氏距离计算，看是不是有样本特别高或者低，可以剔除，方法见脚本。

2. 实操

经过多次分析和调整，最后用的代码是：

（1）安装DESeq2包

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("DESeq2")

（2）导入数据两两比较

setwd('xxx')
colData <- read.table('group.txt', header=TRUE, row.names=1)
readscount <- read.table('readscount.txt', header=TRUE, row.names=1)
condition <- factor(c(rep("treat1",10),rep("treat2",10)))
individual <- factor(c(rep("idv1",3),"idv2",rep("idv3",3),rep("idv4",3), 
                  rep("idv1",3),rep("idv2",3),rep("idv4",4)))
colData
head(readscount)
condition
individual
library(DESeq2) 
dds <- DESeqDataSetFromMatrix(readscount, colData, design =~ individual + condition)
keep <- rowSums(counts(dds) >= 10) >= 3   #过滤低表达基因，至少在3个样品中都有10个reads 
dds <- dds[keep, ]

（3）PCA（还有全部样品的PCA在另一个脚本）

vsdata <- vst(dds, blind=FALSE)  #归一化
assay(vsdata) <- limma::removeBatchEffect(assay(vsdata), vsdata$individual)  #去批次效应
plotPCA(vsdata, intgroup = "condition") #自带函数

（4）差异分析

dds_norm <- DESeq(dds, minReplicatesForReplace = Inf) #标准化; 不剔除outliers; 与cookscutoff结果相同
dds_norm$condition   #保证是levels是按照后一个比前一个即trt/untrt，否则需在results时指定
res <- results(dds_norm, contrast = c("condition","treat2","treat1"), cooksCutoff = FALSE) #alpha=0.05可指定padj; cookCutoff是不筛选outliers因为太多了
summary(res)  
#resOrdered <- res[order(res$pvalue), ] #排序
sum(res$padj<0.05, na.rm = TRUE)
res_data <- merge(as.data.frame(res),
              as.data.frame(counts(dds_norm,normalize=TRUE)),
              by="row.names",sort=FALSE)
up_DEG <- subset(res_data, padj < 0.05 & log2FoldChange > 1)
down_DEG <- subset(res_data, padj < 0.05 & log2FoldChange < -1)
write.csv(res_data, "all.csv") #全部基因不筛选，做火山图的背景
write.csv(up_DEG, "up.csv")
write.csv(down_DEG, "down.csv")

（5）判断欧氏距离，若有异常样品则不用cooksCutoff；当有上千个异常值时也不用：（完全可以不做）

par(mar=c(8,5,2,2))
boxplot(log10(assays(dds_norm)[["cooks"]]), range=0, las=2)

（6）lfcshrink & MA plot

library(apeglm)  
resultsNames(dds_norm)  #看一下要shrink的维度;shrink数据更加紧凑,少了一项stat，但并未改变padj，但改变了foldchange
res_shrink <- lfcShrink(dds_norm, coef="condition_treat2_vs_treat1", type="apeglm") #最推荐apeglm算法;根据resultsNames(dds)的第5个维度，coef=5，也可直接""指定;apeglm不allow contrast，所以要指定coef
pdf("MAplot.pdf", width = 6, height = 6) 
plotMA(res_shrink, ylim=c(-10,10), alpha=0.1, main="MA plot: ")
dev.off()

（7）火山图

library(ggplot2)
voldata <-read.csv(file = "all.csv",header = TRUE, row.names =1)
pdf("volcano.pdf", width = 6.13, height = 5.18)
ggplot(data=voldata, aes(x=log2FoldChange,y= -1*log10(padj))) +
  geom_point(aes(color=significant)) +
  scale_color_manual(values=c("#546de5", "#d2dae2","#ff4757")) + 
  labs(title="Volcano Plot: ", x=expression(log[2](FC), y=expression(-log[10](padj)))) +
  geom_hline(yintercept=1.3,linetype=4) +  #反对数,代表0.05的线
  geom_vline(xintercept=c(-1,1),linetype=4) +
  theme_bw() + theme(panel.grid = element_blank())  #主次网格线均为空白
dev.off()

3. 心得

必须充分了解自己的数据，分析方法万万千，选最合适的才是最好的！
任何软件教程都不如官网README 和文章，遇到问题时要想“一定不是我一个人”，善于用Bioconductor论坛。
不要钻牛角尖，完整比完美重要，任何分析都有不周到，学会带着遗憾继续走，不必十全十美，因为永远也不可能做到。