Using DESeq2-normalized data for PCA, clustering, and other visualizations
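DESeq2 offers several ways to normalize or transform count data:
- counts(dds, normalized=T)
- rlog (regularized log) and VST (variance stabilizing transformation)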
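The difference between the two is that the former are “only” library-size normalised, whereas the latter are more advanced.
For PCA, clustering, and similar visualizations, use the latter, because downstream processing generally requires more advanced normalisation.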
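To make "library-size normalised" concrete, here is a minimal sketch of the median-of-ratios idea behind DESeq2's size factors. It is plain Python rather than DESeq2 code, and the function names size_factors and normalize and the toy matrix are hypothetical, chosen only for illustration.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of genes, each a list of raw counts (one per sample)."""
    n_samples = len(counts[0])
    # Use only genes with a nonzero count in every sample, so the log is defined
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean of each usable gene across samples
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Ratio of each gene in sample j to its geometric mean, on the log scale
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

def normalize(counts):
    """Divide every count by the size factor of its sample."""
    sf = size_factors(counts)
    return [[c / s for c, s in zip(row, sf)] for row in counts]

# Toy data (assumed): sample 2 was sequenced exactly twice as deeply as sample 1
toy = [[10, 20], [100, 200], [1000, 2000], [0, 5]]
sf = size_factors(toy)
norm = normalize(toy)
print(sf)
print(norm[:3])
```

Each sample is divided by a single number, so the relative spread between weakly and strongly expressed genes within a sample is left unchanged. That is the sense in which this step is "only" a library-size correction.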
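My understanding
- counts(dds, normalized=T) is the normalization that underlies DEG analysis. A DEG test only asks whether the same gene differs between samples, so the counts are only divided by a per-sample size factor, which makes the same gene comparable across samples. (Strictly speaking, DESeq2 fits its model to the raw counts and uses the size factors as offsets; the normalized counts are the raw counts divided by those size factors.)
- PCA, clustering, and similar analyses compare not only the same gene across samples but also weigh the contributions of different genes within one sample. counts(dds, normalized=T) clearly does not adjust for this, so using it directly lets highly expressed genes dominate the result. A simple log(count + 1) transformation, however, lets lowly expressed genes dominate instead, which is why DESeq2 provides the rlog and VST methods.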
In RNA–Seq data, however, variance grows with the mean. For example, if one performs PCA directly on a matrix of normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples.
A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount; however, now the genes with low counts tend to dominate the results because, due to the strong Poisson noise inherent to small count values, they show the strongest relative differences between samples. Note that this effect can be diminished by adding a relatively high number of pseudocounts, e.g. 32, since this will also substantially reduce the variance of the fold changes.
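A small arithmetic sketch of the effect described in the quoted passage. The two genes and their counts are made-up values, and the helper names abs_diff and log_diff are hypothetical; the point is only to show which gene contributes the larger between-sample difference under each treatment.

```python
import math

# Hypothetical normalized counts for two genes in two samples
high = (10000, 11000)  # strongly expressed gene, 10% difference
low = (2, 6)           # weakly expressed gene, difference could be counting noise

def abs_diff(pair):
    """Absolute difference on the count scale."""
    return abs(pair[1] - pair[0])

def log_diff(pair, pseudocount):
    """Absolute difference of log2(count + pseudocount)."""
    a, b = pair
    return abs(math.log2(b + pseudocount) - math.log2(a + pseudocount))

raw = (abs_diff(high), abs_diff(low))
log1 = (log_diff(high, 1), log_diff(low, 1))
log32 = (log_diff(high, 32), log_diff(low, 32))

print("raw counts      high=%d      low=%d" % raw)
print("log2(x + 1)     high=%.3f  low=%.3f" % log1)
print("log2(x + 32)    high=%.3f  low=%.3f" % log32)
```

On the count scale the strongly expressed gene dominates, with log2(x + 1) the weakly expressed gene dominates, and with a pseudocount of 32 the two differences end up in the same range.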
Question
If I extract the counts(dds, normalized=T) data and then apply scaling or a log transformation, can I get an effect similar to rlog and VST?
Result
Normalizing the counts with counts(dds, normalized=T) and then scaling them gives a PCA result close to that of rlog, but not identical.
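One possible reading of this result, sketched in plain Python. It assumes that "scale" means centering each gene and dividing by its standard deviation across samples, as R's scale() or prcomp(..., scale. = TRUE) would do with genes as columns; the post does not state this. The function name scale_genes and the toy matrix are hypothetical.

```python
from statistics import mean, stdev

def scale_genes(matrix):
    """Center each gene (row) to mean 0 and scale it to standard deviation 1."""
    scaled = []
    for row in matrix:
        m, s = mean(row), stdev(row)
        # A gene that is constant across samples has s == 0; map it to zeros here
        # (a choice made for this sketch only)
        scaled.append([(x - m) / s if s > 0 else 0.0 for x in row])
    return scaled

# Hypothetical normalized counts: rows are genes, columns are samples
normalized = [
    [10000.0, 11000.0, 9000.0, 10500.0],  # strongly expressed gene
    [2.0, 6.0, 1.0, 4.0],                 # weakly expressed gene
]

scaled = scale_genes(normalized)
variances = [stdev(row) ** 2 for row in scaled]
print(variances)
```

After scaling, every gene has the same variance, so strongly expressed genes no longer dominate a PCA. Noisy weakly expressed genes, however, now carry as much weight as well-measured ones, whereas rlog shrinks sample differences for low-count genes. This may be one reason the two results are close but not identical.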
References
How can I extract normalized read count values from DESeq2 results
QC methods for DE analysis using DESeq2
Analysis of RNAseq data