R_学习专题单细胞测序后续分析RNA seq

DESeq2详细用法

2020-04-25  本文已影响0人  果蝇饲养员的生信笔记

1.构建DESeq2对象

(1)从SummarizedExperiment对象构建DESeqDataSet对象

dds <- DESeqDataSet( se, design = ~ cell + dex) 

(2)从表达矩阵countData和样品信息colData构建DESeqDataSet对象

dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ Group) 
构建DESeqDataSet对象

2.过滤低丰度数据

dds <- dds[rowSums(counts(dds)) > 1, ] 

或者在构建dds之前加上

 countData <- count[apply(count, 1, sum) > 1 , ] 

3.两种数据转化方法

(1)方差稳定变换,The variance stabilizing transformation

vsd <- vst(object=dds,blind=T) 

(2)正则化对数变换,The regularized-logarithm transformation

rld <- rlog(object=dds,blind=F) 

(3)用法

(4)为什么要转换?为了确保所有基因有大致相同的贡献。

对于RNA-seq raw counts,方差随均值增长。如果直接用size-factor-normalized read counts:counts(dds, normalized=T) 进行主成分分析,结果通常只取决于少数几个表达最高的基因,因为它们显示了样本之间最大的绝对差异。为了避免这种情况,一个策略是采用the logarithm of the normalized count values plus a small pseudocount:log2(counts(dds2, normalized=T) +1)。但是这样,有很低counts的基因将倾向于主导结果。作为一种解决方案,DESeq2为counts数据提供了stabilize the variance across the mean的转换。其中之一是regularized-logarithm transformation or rlog2。对于counts较高的基因,rlog转换可以得到与普通log2转换相似的结果。然而,对于counts较低的基因,所有样本的值都缩小到基因的平均值。
用于绘制PCA图或聚类的数据可以有多种:counts、CPM、log2(counts+1)、log2(CPM+1)、vst、rlog等。

4. DESeq2的标准化方法

(1)计算归一化系数sizeFactor

dds <- estimateSizeFactors(dds) 

(2)标准化之后的数据

normalized_counts <- counts(dds,normalized=T) 

5. 差异表达分析

(1)一步

dds <- DESeq(dds) 
DESeq2

(2)用法

DESeq(object, test = c("Wald", "LRT"), fit Type = c("parametric", "local", "mean"), sfType = c("ratio", "poscounts", "iterate"),betaPrior, full = design(object), reduced, quiet = FALSE, minReplicatesForReplace = 7, modelMatrixType, useT = FALSE, minmu = 0.5, parallel = FALSE, BPPARAM = bpparam())

(3)分步

dds <- estimateSizeFactors(dds) 
dds <- estimateDispersions(dds) 

DESeq2假定基因的表达量符合负二项分布,有两个关键参数,总体均值和离散程度α值。这个α值衡量的是均值和方差之间的关系。


负二项分布
dds <- nbinomWaldTest(dds) 

6. 获得分析结果

(1)默认情况

res <- results(dds) 

(2)比较任何两组数据

resultsNames(dds) 
resultsNames
res <- results(dds, name="Group_MMF_vs_WTF") 
res <- results(dds, contrast=c("Group"," MMF "," WTF ")) #后面的是对照 

(3)用法

results(object, contrast, name, lfcThreshold = 0, altHypothesis = c("greaterAbs", "less Abs", "greater", "less"), listValues = c(1, -1), cooksCutoff, independentFiltering = TRUE, alpha = 0.1, filter, theta, p Adjust Method = "BH", filterFun, format = c("DataFrame", "GRanges", "GRangesList"), test,
addMLE = FALSE, tidy = FALSE, parallel = FALSE, BPPARAM = bpparam(), minmu = 0.5)

dds <- makeExampleDESeqDataSet(n=100,m=18)
dds$genotype <- factor(rep(rep(c("I","II","III"),each=3),2))
design(dds) <- ~ genotype + condition + genotype:condition
dds$condition
[1] A A A A A A A A A B B B B B B B B B
Levels: A B
dds$genotype
[1] I   I   I   II  II  II  III   III   III   I   I   I   II  II  II  III  III  III
Levels: I   II   III
dds <- DESeq(dds)
resultsNames(dds)
[1] "Intercept"              "genotype_II_vs_I" 
[3] "genotype_III_vs_I"      "condition_B_vs_A" 
[5] "genotypeII.conditionB"  "genotypeIII.conditionB"
# the condition effect for genotype I (the main effect)(基因型I的条件效应,主要效应)
results(dds, contrast=c("condition","B","A"))
log2 fold change (MLE): condition B vs A 
Wald test p-value: condition B vs A
# the condition effect for genotype III.(基因型III的条件效应)
# this is the main effect *plus* the interaction term (主要效应加上相互作用项)
# (the extra condition effect in genotype III compared to genotype I).(基因型III与基因型I相比额外的条件作用效果)
results(dds, contrast=list( c("condition_B_vs_A","genotypeIII.conditionB") ))
log2 fold change (MLE): condition_B_vs_A+genotypeIII.conditionB effect 
Wald test p-value: condition_B_vs_A+genotypeIII.conditionB effect
# the interaction term for condition effect in genotype III vs genotype I.(基因型III与基因型I相比额外的条件作用效果)
# this tests if the condition effect is different in III compared to I(检测了条件效果在基因型III与基因型I之间是否有区别)
results(dds, name="genotypeIII.conditionB")
log2 fold change (MLE): genotypeIII.conditionB 
Wald test p-value: genotypeIII.conditionB 
# the interaction term for condition effect in genotype III vs genotype II. (基因型III与基因型II相比额外的条件作用效果)
# this tests if the condition effect is different in III compared to II(检测了条件效果在基因型III与基因型II之间是否有区别)
results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))
log2 fold change (MLE): genotypeIII.conditionB vs genotypeII.conditionB 
Wald test p-value: genotypeIII.conditionB vs genotypeII.conditionB
# Note that a likelihood ratio could be used to test if there are any
# differences in the condition effect between the three genotypes.
# This is a useful construction when users just want to compare
# specific groups which are combinations of variables.
dds$group <- factor(paste0(dds$genotype, dds$condition))
design(dds) <- ~ group
dds <- DESeq(dds)
resultsNames(dds)
# the condition effect for genotypeIII
results(dds, contrast=c("group", "IIIB", "IIIA"))

(4)结果的简单统计

summary(res) 
summary
summary(res, alpha=0.1) 

(5)排序和筛选

resOrdered <- res[order(res$pvalue), ]  #从小到大排序,默认decreasing = F 
sum(res$padj < 0.1, na.rm=TRUE)  #有多少padj小于0.1的 
diff_gene <-subset(res, padj < 0.1 & abs(log2FoldChange) > 1) 

(6)2种更严格的方法筛选显著差异基因

res.05 <- results(dds, alpha=0.05) 
#alpha为padj的阈值,默认padj=0.1。
resLFC <- results(dds, lfcThreshold=1) 
#提升log2 fold change threshold,结果中不满足lfc阈值的gene的p值都是1。

7. LFC校正lfcShrink

(1)介绍

Adds shrunken log2 fold changes (LFC) and SE to a results table from DESeq run without LFC shrinkage. For consistency with results, the column name lfcSE is used here although what is returned is a posterior SD. Three shrinkage estimators for LFC are available via type (see the
vignette for more details on the estimators). The apeglm publication demonstrates that ’apeglm’ and ’ashr’ outperform the original ’normal’ shrinkage estimator.

The shrunken fold changes are useful for ranking genes by effect size and for visualization.
缩小的倍数变化有助于按效应大小对基因进行排序和可视化。
log2FC estimates do not account for the large dispersion we observe with low read counts.
log2 FC估计不能解释我们在低read counts下观察到的大的离散程度。
As with the shrinkage of dispersion estimates, LFC shrinkage uses information from all genes to generate more accurate estimates.
与估计离散程度的收缩一样,LFC收缩使用来自所有基因的信息来生成更准确的估计。
如果要根据LFC值提取差异基因,需要shrunken values。另外,进行功能分析例如GSEA时,需要提供shrunken values。

(2)应用

resultsNames(dds) 
[1] "Intercept"  "condition_treated_vs_untreated"
resLFC <- lfcShrink(dds, coef="condition_treated_vs_untreated", type="apeglm") 
#或
resLFC <- lfcShrink(dds, contrast = c("condition"," treated "," untreated "))
names(resLFC) 
[1] "baseMean"  "log2FoldChange"  "lfcSE"  "pvalue"  "padj"       

(3)用法

lfcShrink(dds, coef, contrast, res, type = c("normal", "apeglm", "ashr"), lfcThreshold = 0, svalue = FALSE, return List = FALSE, format = c("DataFrame", "GRanges", "GRangesList"), apeAdapt = TRUE, apeMethod = "nbinomCR", parallel = FALSE, BPPARAM = bpparam(), quiet = FALSE, ...)

8. 似然比检验LRT

ddsLRT <- DESeq(dds, test="LRT", reduced=~1)
resLRT <- results(ddsLRT)

9. 开启多线程

library("BiocParallel")
register(MulticoreParam(4))
#先预定4个核,等需要的时候直接使用parallel=TRUE来调用。

10. Plotting results

(1)样本间关系热图(总体相似度)

library(pheatmap)
library(RColorBrewer)
rld
sampleDist <- dist(t(assay(rld)))  #样品距离,欧氏距离,t转置 
#为确保所有基因大致相同的contribution用rlog-transformed data 
#画某些基因在样本间的heatmap也可以用rlog数据 
#用PoiClaClu包计算泊松距离(Poisson Distance),必须是原始表达矩阵 
#poisd <- PoissonDistance(t(counts(dds))) 
sampleDistMatrix <- as.matrix(sampleDist)  #样品间距离的矩阵
rownames(sampleDistMatrix) <- paste0(rld $cell,"-", rld$dex)
colnames(sampleDistMatrix) <- NULL
head(sampleDistMatrix)  #样品间距离的数据框
colors <- colorRampPalette(rev(brewer.pal(9,"Blues")))(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDist,
         clustering_distance_cols=sampleDist,
         color = colors)

(2)多维尺度分析(multidimensional scaling,MDS)或主坐标分析(principal coordinates analysis,PCoA)

library(ggplot2)
#把样本之间的距离转化成二维坐标,在降维过程中保证样品点之间的距离不变
#MDS基于最小线性距离(欧氏距离)的聚类,与PCA的最大线性相关是一样的
#适合在没有表达矩阵值,但只有一个距离矩阵的情况下使用
mdsdata <- data.frame(cmdscale(sampleDistMatrix))
#cmdscale(classical multidimensional scaling)
mdsdata  #返回MDS的坐标
mds <- cbind(mdsdata,as.data.frame(colData(vsd)))
mds  #按列合并
ggplot(data=mds,aes(X1,X2,color=cell,shape=dex)) +
  geom_point(size=3)

(3)主成分分析(Principal Component Analysis,PCA)

pcadata <- plotPCA(vsd,intgroup = c("Batch","Group"), returnData=TRUE)
percentVar <- round(100*attr(pcadata,"percentVar"),1)
ggplot(pcadata, aes(PC1, PC2, color=Group, shape=Batch)) + 
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) +
  geom_text_repel(aes(PC1, PC2,color=Group,label=colnames(vsd)),size=3) +
  theme_bw()

(4)plotCounts()函数查看特定基因的表达量

topGene <- rownames(res)[which.min(res$padj)]  #padj最小的一个基因
plotCounts(dds, gene=topGene, intgroup=c("dex"))  #画出这个基因的标准化后的表达量
#以散点图的形式画出这个基因在各样本中的表达量
data <- plotCounts(dds, gene=topGene, intgroup=c("dex","cell"), returnData=TRUE)
ggplot(data, aes(x=dex, y=count, color=cell)) +  
 scale_y_log10() +
 geom_point(position=position_jitter(width=.1,height=0), size=3)  
ggplot(data, aes(x=dex, y=count, fill=dex)) +
 scale_y_log10() +
 geom_dotplot(binaxis="y", stackdir="center")
ggplot(data, aes(x=dex, y=count, color=cell, group=cell)) +
 scale_y_log10() + geom_point(size=3) + geom_line()

(5)MA图

plotMA(res, ylim=c(-5,5))
#"M" for minus(减), because a log ratio is equal to log minus log, and "A" for average(均值)
#M对应差异对比组之间基因表达变化log2 fold changes (Y轴)
#A对应差异对比组基因表达量均值the mean of normalized counts (X轴)
plotMA(res, alpha = 0.1, main = "", xlab = "mean of normalized counts", ylim=c(-5,5))
#alpha为padj显著性水平阈值,默认alpha=0.1
#Each gene is represented with a dot. Genes with an adjusted p value below a threshold (here 0.1, the default) are shown in red.
plotMA(resLFC, ylim=c(-5,5))  #提高了log2 fold change阈值
topGene <- rownames(resLFC)[which.min(resLFC$padj)]
with(resLFC[topGene, ], {
 points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
 text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})  #标记出一个特定的基因

(6)Removing hidden batch effects

library("sva")
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
svseq$sv
par(mfrow=c(2,1),mar=c(3,5,3,1))
stripchart(svseq$sv[,1] ~ dds$cell,vertical=TRUE,main="SV1")
abline(h=0)
stripchart(svseq$sv[,2] ~ dds$cell,vertical=TRUE,main="SV2")
abline(h=0)
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + dex
ddssva <- DESeq(ddssva)
上一篇 下一篇

猜你喜欢

热点阅读