『读书笔记』- DESeq2 官方文件阅读笔记 01

2019-10-25 本文已影响0人 ShawnMagic

正文前先容我扯两句蛋

昨天群里交流WGCNA的时候和大家聊到软件教程，现在很多人来了直接就网上各种教程搜一大堆，然后follow别人的code，照葫芦画瓢做出一堆结果，符合预期就是皆大欢喜，做不出来就愁眉不展...

回想起大三时准备考研，天天埋头图书馆啃王镜岩的生化，Gene8，分子生物学，当时总结出来所有的东西必须自己过一遍才能成为自己的东西，不要迷信别人总结的东西，别人总结的终究是别人的，背的滚瓜烂熟也变不成自己的，所以学习没有捷径，还是需要静下心来慢慢自己啃，这样遇到问题追本溯源，才能更好的解决问题。

follow别人的流程大多会遇到一个问题，就是如果出错了会一头雾水，找不到出错的原因，教程固然不错，但是仅仅限于参考，如果真正的想要“学会“一个软件，还是要回归作者官方的manual，FAQ，article等等，作为数学渣渣的我，算法（article）就不去碰了，太难了，但至少做到会使用软件，关键function的原理和功能，FAQ，manual是必须看看的。虽然follow别人，edgeR，DESeq2已经可以跑通了，但回去啃下官方的文档还是非常必要的。

官方文档 Source: bioconductor

关于算法原理部分，statquest连出了多个教程，慢慢看

StatQuest -edgeR 和 DESeq2 之 Independent Filtering Source： B站
StatQuest - edgeR 之 Library Normalization Source: B站

Part1. 数据准备和参数设置

Why un-normalized counts?

As input, the DESeq2 package expects count data as obtained, e.g., from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of integer values. The value in the i-th row and the j-th column of the matrix tells how many reads can be assigned to gene i in sample j. Analogously, for other types of assays, the rows of the matrix might correspond e.g. to binding regions (with ChIP-Seq) or peptide sequences (with quantitative mass spectrometry). We will list method for obtaining count matrices in sections below.

首先，inputdata应该是一个表达矩阵，类似下图:
行名为转录本的ID，列名为样本的名字，
矩阵中为表达量。

Transcript_ID	Sample1	Sample2	Sample3	Sample4
Transcript1	0	1	10	5
Transcript2	100	104	107	100
Transcript3	30	35	37	39
Transcript4	42	44	45	200

The values in the matrix should be un-normalized counts or estimated counts of sequencing reads (for single-end RNA-seq) or fragments (for paired-end RNA-seq). The RNA-seq workflow describes multiple techniques for preparing such count matrices. It is important to provide count matrices as input for DESeq2’s statistical model (Love, Huber, and Anders 2014) to hold, as only the count values allow assessing the measurement precision correctly. The DESeq2 model internally corrects for library size, so transformed or normalized values such as counts scaled by library size should not be used as input.

这里指出了标准化后的counts是不能用的，比如FPKM，RPKM，TPM等.
原因是DESeq2会在内部校正文库大小。

The DESeqDataSet

The object class used by the DESeq2 package to store the read counts and the intermediate estimated quantities during statistical analysis is the DESeqDataSet, which will usually be represented in the code here as an object dds.

A technical detail is that the DESeqDataSet class extends the RangedSummarizedExperiment class of the SummarizedExperiment package. The “Ranged” part refers to the fact that the rows of the assay data (here, the counts) can be associated with genomic ranges (the exons of genes). This association facilitates downstream exploration of results, making use of other Bioconductor packages’ range-based functionality (e.g. find the closest ChIP-seq peaks to the differentially expressed genes).

A DESeqDataSet object must have an associated design formula. The design formula expresses the variables which will be used in modeling. The formula should be a tilde (~) followed by the variables with plus signs between them (it will be coerced into an formula if it is not already). The design can be changed later, however then all differential analysis steps should be repeated, as the design formula is used to estimate the dispersions and to estimate the log2 fold changes of the model.

Note: In order to benefit from the default settings of the package, you should put the variable of interest at the end of the formula and make sure the control level is the first level.

We will now show 4 ways of constructing a DESeqDataSet, depending on what pipeline was used upstream of DESeq2 to generated counts or estimated counts:

From transcript abundance files and tximport

From a count matrix

From htseq-count files

From a SummarizedExperiment object

一个叫dds的object来储存DESeqDataSet，这个dds必须关联一个设计公式（design formula）该公式会由一个~符号开始，后面连接相应的变量，不同变量间用+连接.
根据上游不同的流程分别对应了不同的建立dds的方法.

这里主要看基于count matrix的，因为我比较喜欢用featurecount来做定量,此外测序公司返回的数据中一般也有readcount的matrix

Count matrix input

Alternatively, the function DESeqDataSetFromMatrix can be used if you already have a matrix of read counts prepared from another source. Another method for quickly producing count matrices from alignment files is the featureCounts function (Liao, Smyth, and Shi 2013) in the Rsubread package. To use DESeqDataSetFromMatrix, the user should provide the counts matrix, the information about the samples (the columns of the count matrix) as a DataFrame or data.frame, and the design formula.

To demonstate the use of DESeqDataSetFromMatrix, we will read in count data from the pasilla package. We read in a count matrix, which we will name cts, and the sample information table, which we will name coldata. Further below we describe how to extract these objects from, e.g. featureCounts output.

官方例子

library("pasilla")
pasCts <- system.file("extdata",
                      "pasilla_gene_counts.tsv",
                      package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
                       "pasilla_sample_annotation.csv",
                       package="pasilla", mustWork=TRUE)
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
coldata <- read.csv(pasAnno, row.names=1)
coldata <- coldata[,c("condition","type")]
## 得到的结果

head(cts,2)
##             untreated1 untreated2 untreated3 untreated4 treated1 treated2
## FBgn0000003          0          0          0          0        0        0
## FBgn0000008         92        161         76         70      140       88
##             treated3
## FBgn0000003        1
## FBgn0000008       70
coldata
##              condition        type
## treated1fb     treated single-read
## treated2fb     treated  paired-end
## treated3fb     treated  paired-end
## untreated1fb untreated single-read
## untreated2fb untreated single-read
## untreated3fb untreated  paired-end
## untreated4fb untreated  paired-end

## 发现顺序好像不一致

rownames(coldata) <- sub("fb", "", rownames(coldata))
all(rownames(coldata) %in% colnames(cts))
## [1] TRUE
all(rownames(coldata) == colnames(cts))
## [1] FALSE
cts <- cts[, rownames(coldata)]
all(rownames(coldata) == colnames(cts))
## [1] TRUE

## 拿到cts 和colnames后可以着手构建dds
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds
## class: DESeqDataSet 
## dim: 14599 7 
## metadata(1): version
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574
##   FBgn0261575
## rowData names(0):
## colnames(7): treated1 treated2 ... untreated3 untreated4
## colData names(2): condition type
############################################################################
# If you have additional feature data, it can be added to the DESeqDataSet by adding to the metadata columns of a newly constructed object. (Here we add redundant data just for demonstration, as the gene names are already the rownames of the dds.)
##########################################################################

featureData <- data.frame(gene=rownames(cts))
mcols(dds) <- DataFrame(mcols(dds), featureData)
mcols(dds)

## DataFrame with 14599 rows and 1 column
##                    gene
##                <factor>
## FBgn0000003 FBgn0000003
## FBgn0000008 FBgn0000008
## FBgn0000014 FBgn0000014
## FBgn0000015 FBgn0000015
## FBgn0000017 FBgn0000017
## ...                 ...
## FBgn0261571 FBgn0261571
## FBgn0261572 FBgn0261572
## FBgn0261573 FBgn0261573
## FBgn0261574 FBgn0261574
## FBgn0261575 FBgn0261575

Collapsing technical replicates

DESeq2 provides a function collapseReplicates which can assist in combining the counts from technical replicates into single columns of the count matrix. The term technical replicate implies multiple sequencing runs of the same library. You should NOT collapse biological replicates using this function. See the manual page for an example of the use of collapseReplicates.

合并技术重复的counts，这里特别强调了不可以合并生物学重复的！！！

Differential expression analysis

The standard differential expression analysis steps are wrapped into a single function, DESeq. The estimation steps performed by this function are described below, in the manual page for ?DESeq and in the Methods section of the DESeq2 publication (Love, Huber, and Anders 2014).

Results tables are generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values. With no additional arguments to results, the log2 fold change and Wald test p value will be for the last variable in the design formula, and if this is a factor, the comparison will be the last level of this variable over the reference level (see previous note on factor levels). However, the order of the variables of the design do not matter so long as the user specifies the comparison to build a results table for, using the name or contrast arguments of results.

Details about the comparison are printed to the console, directly above the results table. The text, condition treated vs untreated, tells you that the estimates are of the logarithmic fold change log2(treated/untreated).

结果表格中通常包含了以下几个内容：
1. log2foldchange;
2. p value
3. adjusted p value.（校正过的p值，或者叫 q value）
4. 如果实验设计没有特殊的要求，log2FC和q value将会成为最终判定是否差异的标准。
再实验设计的时候需要对sample进行标记，分清楚case和control，这部分在condition中分清楚。

举个🌰


dds <- DESeq(dds)
res <- results(dds)
res
## log2 fold change (MLE): condition treated vs untreated 
## Wald test p-value: condition treated vs untreated 
## DataFrame with 9921 rows and 6 columns
##                     baseMean      log2FoldChange             lfcSE
##                    <numeric>           <numeric>         <numeric>
## FBgn0000008 95.1442917575889 0.00227644122547027  0.22372865161848
## FBgn0000014 1.05652281859341  -0.495120386253493  2.14318579304427
## FBgn0000017 4352.55356876647   -0.23991894353759 0.126336905404352
## FBgn0000018  418.61048415965  -0.104673911941353 0.148489059780011
## FBgn0000024   6.406199980976   0.210847791831903 0.689587553167661
## ...                      ...                 ...               ...
## FBgn0261570 3208.38861003698    0.29553288972209 0.127350479276721
## FBgn0261572 6.19718814545467  -0.958822964597639 0.775314665691102
## FBgn0261573 2240.97951122377  0.0127194420456142 0.113299975906513
## FBgn0261574 4857.68037348332  0.0153924162130344 0.192567170474486
## FBgn0261575 10.6825203335563   0.163570514198072  0.93091062692625
##                           stat             pvalue              padj
##                      <numeric>          <numeric>         <numeric>
## FBgn0000008 0.0101750098121193  0.991881656848261 0.997210766670927
## FBgn0000014 -0.231020748579247  0.817298682951798                NA
## FBgn0000017  -1.89904084455535 0.0575591059082212 0.288001711413016
## FBgn0000018 -0.704926760910396  0.480855815353132 0.826833683766379
## FBgn0000024  0.305759277213403  0.759787936488384 0.943501114514855
## ...                        ...                ...               ...
## FBgn0261570   2.32062644287284 0.0203070137750067 0.144240002513884
## FBgn0261572  -1.23668880136939  0.216202637789157 0.607847805203261
## FBgn0261573  0.112263413507779  0.910614550167173  0.98265666676088
## FBgn0261574 0.0799327121809364  0.936290772501257 0.988179230260634
## FBgn0261575  0.175710223373603   0.86052160317937  0.96792800379094
##############################################
# Note that we could have specified the coefficient or contrast we want to build a results table for, using either of the following equivalent commands:
##############################################
res <- results(dds, name="condition_treated_vs_untreated")
res <- results(dds, contrast=c("condition","treated","untreated"))

One exception to the equivalence of these two commands, is that, using contrast will additionally set to 0 the estimated LFC in a comparison of two groups, where all of the counts in the two groups are equal to 0 (while other groups have positive counts). As this may be a desired feature to have the LFC in these cases set to 0, one can use contrast to build these results tables. More information about extracting specific coefficients from a fitted DESeqDataSet object can be found in the help page ?results. The use of the contrast argument is also further discussed below.

存在一种情况，当这两个group中所有count都为0的时候，这两个group的log2fc也需要设为0。

Log fold change shrinkage for visualization and ranking

Shrinkage of effect size (LFC estimates) is useful for visualization and ranking of genes. To shrink the LFC, we pass the dds object to the function lfcShrink. Below we specify to use the apeglm method for effect size shrinkage (Zhu, Ibrahim, and Love 2018), which improves on the previous estimator.

We provide the dds object and the name or number of the coefficient we want to shrink, where the number refers to the order of the coefficient as it appears in resultsNames(dds).

刚开始理解不了log2FC收缩，看标题意思是可视化过程会变得好看，看了徐洲更老师的简书稍微理解了一点：

正常的log2FC MA plot

收缩后的log2FC

效果显而易见了。

其他参数：

Using parallelization

The above steps should take less than 30 seconds for most analyses. For experiments with complex designs and many samples (e.g. dozens of coefficients, ~100s of samples), one can take advantage of parallelized computation. Parallelizing DESeq, results, and lfcShrink can be easily accomplished by loading the BiocParallel package, and then setting the following arguments: parallel=TRUE and BPPARAM=MulticoreParam(4), for example, splitting the job over 4 cores. Note that results for coefficients or contrasts listed in resultsNames(dds) is fast and will not need parallelization. As an alternative to BPPARAM, one can register cores at the beginning of an analysis, and then just specify parallel=TRUE to the functions when called.

library("BiocParallel")
register(MulticoreParam(4))

多线程并行，利用BiocParallel 开始的时候将job拆分给4个线程，会加快计算速度。
主要就是用register(MulticoreParam(4))

p-values and adjusted p-values

# We can order our results table by the smallest p value:

resOrdered <- res[order(res$pvalue),]

# We can summarize some basic tallies using the summary function.

summary(res)
## 
## out of 9921 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 518, 5.2%
## LFC < 0 (down)     : 536, 5.4%
## outliers [1]       : 1, 0.01%
## low counts [2]     : 1539, 16%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
How many adjusted p-values were less than 0.1?

sum(res$padj < 0.1, na.rm=TRUE)
## [1] 1054
## The results function contains a number of arguments to customize the results table which is generated. You can read about these arguments by looking up ?results. Note that the results function automatically performs independent filtering based on the mean of normalized counts for each gene, optimizing the number of genes which will have an adjusted p value below a given FDR cutoff, alpha. Independent filtering is further discussed below. By default the argument alpha is set to 0.1
## . If the adjusted p value cutoff will be a value other than 0.1
## , alpha should be set to that value:

res05 <- results(dds, alpha=0.05)
summary(res05)
## 
## out of 9921 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 407, 4.1%
## LFC < 0 (down)     : 431, 4.3%
## outliers [1]       : 1, 0.01%
## low counts [2]     : 1347, 14%
## (mean count < 5)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
sum(res05$padj < 0.05, na.rm=TRUE)
## [1] 838

Independent hypothesis weighting

A generalization of the idea of p value filtering is to weight hypotheses to optimize power. A Bioconductor package, IHW, is available that implements the method of Independent Hypothesis Weighting (Ignatiadis et al. 2016). Here we show the use of IHW for p value adjustment of DESeq2 results. For more details, please see the vignette of the IHW package. The IHW result object is stored in the metadata.

Note: If the results of independent hypothesis weighting are used in published research, please cite:

Ignatiadis, N., Klaus, B., Zaugg, J.B., Huber, W. (2016) Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods, 13:7. 10.1038/nmeth.3885

直译过来是独立假设权重，数学渣渣表示根本不知道这是个什么鬼，下面的解释是对p值得一个筛选，这个筛选基于一个优化的权重值。（还是听不懂，😭）
太费脑经不打算搞懂了，反正姑且认为是一个和adjusted p value类似的东西吧...
需要使用到一个叫IHW包，用法如下：
如果使用了这个参数，需要引用上面的文献。

library("IHW")
resIHW <- results(dds, filterFun=ihw)
summary(resIHW)
## 
## out of 9921 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 507, 5.1%
## LFC < 0 (down)     : 545, 5.5%
## outliers [1]       : 1, 0.01%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting
sum(resIHW$padj < 0.1, na.rm=TRUE)
## [1] 1052
metadata(resIHW)$ihwResult
## ihwResult object with 9921 hypothesis tests 
## Nominal FDR control level: 0.1 
## Split into 6 bins, based on an ordinal covariate

发现和上面P value < 0.05的结果相比，差异的基因更多了。

下一部分更新 Exploring and exporting results