RNA-seq nomalization
7 Normalization
As different libraries will be sequenced to different depths, offsets are built in the statistical model of DESeq2 to ensure that parameters are comparable.
The term normalization is often used for that, but it should be noted that the raw read counts are not actually altered. DESeq2defines a virtual reference sample by taking the median of each gene’s values across samples and then computes size factors as the median of ratios of each sample to the reference sample.
Generally, the ratios of the size factors should roughly match the ratios of the library sizes. Dividing each column of the count table by the corresponding size factor yields normalized count values, which can be scaled to give a counts per million interpretation.
Thus, if all size factors are roughly equal to one, the libraries have been sequenced equally deeply.
Count data sets typically show a (left–facing) trombone shape in average vs mean–difference plots (MA–plots), reflecting the higher variability of log ratios at lower counts. In addition, points will typically be centered around a log ratio of 0 if the normalization factors are calculated appropriately, although this is just a general guide.
In the code below we produce density estimates of the sample counts and pairwise mean–difference plots after the computation of the size factors. The MA–plots are saved to a file in order to avoid cluttering the document.
We first relevel the condition factor to make sure that we have the wild type samples as a base level. This will allow us to obtain the fold changes always as WT–deletion. We also rename the deletion level to “Del”.
colData(DESeq2Table)$condition <- factor(colData(DESeq2Table)$condition, levels = c("WT", "Del"))
colData(DESeq2Table)$condition <- factor(ifelse(is.na(colData(DESeq2Table)$condition), "Del", "WT"), levels = c("WT", "Del"))
We now estimate estimate the size factors an print them
DESeq2Table <- estimateSizeFactors(DESeq2Table)
sizeFactors(DESeq2Table)
## WT-SRR1042885 WT-SRR1042886 WT-SRR1042887 WT-SRR1042888
## 0.8362 0.6322 1.3975 1.3709
## Del_8_17_homo-SRR1042889 Del_8_17_homo-SRR1042890 Del_8_17_homo-SRR1042891 Del_8_17_homo-SRR1042892
## 1.0086 0.4473 0.8602 2.6086
To assess whether the normalization has worked, we plot the densities of counts for the different samples. Since most of the genes are (heavily) affected by the experimental conditions, a succesful normalization will lead to overlapping densities.
multidensity( counts(DESeq2Table, normalized = T)[idx.nz ,],
xlab="mean counts", xlim=c(0, 1000))
image.png
This is indeed the case here. The same holds true for the empircal cummulative distribution functions (ECDFs) which can be thought of as integrals of the densities and give the probability of observing a certain number of counts equal to <nobr aria-hidden="true" style="transition: none 0s ease 0s; border: 0px; padding: 0px; margin: 0px; max-width: none; max-height: none; min-width: 0px; min-height: 0px; vertical-align: 0px; line-height: normal; text-decoration: none; white-space: nowrap !important;">x</nobr><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math> or less given the data.
In an ECDF plot, the estimated probility is plotted on the y–axis and the count values on the x–axis. I.e. you can read of the median and other quantiles from this plot.
As already mentioned, if the normalization has worked, the ECDFs of the different samples should be overlapping.
multiecdf( counts(DESeq2Table, normalized = T)[idx.nz ,],
xlab="mean counts", xlim=c(0, 1000))
image.png
To further assess systematic differences between the samples, we can also plot pairwise mean–average plots: We plot the average of the log–transformed counts vs the fold change per gene for each of the sample pairs.
The function combn
helps us to automatically generate all sample pairs and the function MDPlot
from the EDASeq package then generates the the pairwise MA plots.
pdf("pairwiseMAs.pdf")
MA.idx = t(combn(1:8, 2))
for( i in seq_along( MA.idx[,1])){
MDPlot(counts(DESeq2Table, normalized = T)[idx.nz ,],
c(MA.idx[i,1],MA.idx[i,2]),
main = paste( colnames(DESeq2Table)[MA.idx[i,1]], " vs ",
colnames(DESeq2Table)[MA.idx[i,2]] ), ylim = c(-3,3))
}
dev.off()
## png
## 2
The plots look good, if there is no systematic shift between the samples, the log–fold–change should scatter around zero, which is the case here.