芯片数据分析芯片数据分析生物信息学

学习limma: Linear Models for Micro

2020-05-13  本文已影响0人  果蝇饲养员的生信笔记

一、简介

limma应用于RNA-seq数据时,read counts被转换为log2-counts-per-million(logCPM)。可以有两种方式对均值-方差的关系(mean-variance relationship)进行建模:
(1)voom:精确权重(precision weights)
(2)limma-trend:经验贝叶斯先验趋势(empirical Bayes prior trend)
在这两种情况下,RNA-seq数据可以被作为微阵列数据(microarray data)进行分析。这意味着limma包中的任何线性建模或基因集测试方法都可以应用于RNA-seq数据。

limma使用线性模型(linear models)的方法来分析设计的微阵列实验。该方法需要指定一个或两个矩阵:
(1)第一种是设计矩阵(design matrix),它表明了每个阵列使用了哪些RNA样本。
(2)第二种是对比矩阵(contrast matrix),它指定了希望在RNA样本之间进行哪些比较。
对于非常简单的实验,可能不需要指定对比矩阵。

对于统计分析和评估差异表达,limma使用了一种经验贝叶斯方法(empirical Bayes method)来调整估计的倍数变化的标准误差(moderate the standard errors of the estimated log-fold changes)。这导致了更稳定的推理和更高的功效,特别是对于样本量少的实验。

基本过程:

二、RNA-seq数据的前处理

(1)制作计数矩阵

> head(countdata,3) 
          WTF1 WTF2 WTF3 WTM1 WTM2 WTM3 MF1 MF2 MF3 MF4 MF5 MMF1 MMF2
gene17427    0   46  210    0   26    0  40   0   0   0   0    0    0
gene17428   50  322  334   0  345  384  284  199  218  11   1    2    1
gene17429    0    0    0    6    0    0   0   0   0   0   5    0    0

(2)样本分组信息

> Group
 [1] WTF WTF WTF WTM WTM WTM MF  MF  MF  MF  MF  MMF MMF
Levels: WTF WTM MF MMF

(3)基因注释信息

> head(anno)
          chromosome ref_gene_name
gene17427          4       CR32011
gene17428          4       CR32010
gene17429          4       CR32009
gene17490          4       CR44028
gene17501          4       CR33797
gene17507          4       CR43361

(4)使用edgeR,将计数矩阵转换为DGEList对象

> y <- DGEList(counts=countdata) 
> y 
An object of class "DGEList"
$counts
          WTF1 WTF2 WTF3 WTM1 WTM2 WTM3 MF1 MF2 MF3 MF4 MF5 MMF1 MMF2
gene17427    0   46  210    0   26    0  40   0   0   0   0    0    0
gene17428   50  322  334   0  345   384  284  199  218  11   1    2    1
gene17429    0    0    0    6    0    0   0   0   0   0   5    0    0
gene17490    4    0    3    0    0    0   0   3   0   0   0    0    0
gene17501    2    2    3    1    0    3   4   0   3   0   0    0    0
34367 more rows ...
$samples
     group lib.size norm.factors
WTF1     1 26485372            1
WTF2     1 21708533            1
WTF3     1 26038158            1
WTM1     1 14282144            1
WTM2     1 27198208            1
8 more rows ...

> y$samples$group <- Group #加入分组信息
> y$genes <- anno #加入基因注释

(5)移除表达量非常低的基因

> #保留在至少在2个样品中表达量大于min.CPM的基因
> keep <- rowSums(cpm(y)>1) >= 2
> yf <- y[keep,,keep.lib.sizes=FALSE] #重置了库的大小lib.size
> y$samples$lib.size - yf$samples$lib.size
 [1]  26499  39607  80456  25102 125345 105390  63741  24067  46952
[10]  37984  44858  35573  34754
> nrow(y) - nrow(yf) #减少的基因数
[1] 11844

> #另一种过滤方法filterByExpr
> keep <- filterByExpr(y,group=Group)
> yf <- y[keep,,keep.lib.sizes=FALSE]
> y$samples$lib.size - yf$samples$lib.size
 [1] 11532 22890 58275 10496 88993 68069 35694 10642 24187 14905
[11] 22093 16030 19544
> nrow(y) - nrow(yf)
[1] 9521

用法:

filterByExpr(y, design = NULL, group = NULL, lib.size = NULL, ...)
filterByExpr(y, design = NULL, group = NULL, lib.size = NULL, min.count = 10, min.total.count = 15, large.n = 10, min.prop = 0.7, ...)

该方法保留在n个样本中有大于k的CPM的基因。k由min.count 和lib.size决定,n由设计矩阵决定。n是最小的分组的样本量。如果所有的分组样本量都大于large.n,这就会稍微放松,但是n总是大于最小分组的min.prop(70%)。此外,每个保留的基因都需要在所有样本中至少有min.total.count的计数。

(6)计算CPM和logCPM

> #cpmyf <- cpm(yf)
> lcpmyf <- cpm(yf,log=T) 

用法:

cpm(y,normalized.lib.sizes=T,log=F,prior.count=2)

(7)箱线图

> col.group <- Group
> levels(col.group) <- brewer.pal(nlevels(col.group),"Set1")
> col.group <- as.character(col.group)
> col.group
 [1] "#E41A1C" "#E41A1C" "#E41A1C" "#377EB8" "#377EB8" "#377EB8"
 [7] "#4DAF4A" "#4DAF4A" "#4DAF4A" "#4DAF4A" "#4DAF4A" "#984EA3"
[13] "#984EA3"
> boxplot(lcpmyf,las=2,col=col.group)
boxplot.JPG

(8)TMM标准化(trimmed mean of M-values)

> yf <- calcNormFactors(yf,method = "TMM") #计算归一化系数
> #标准化后raw read counts值并没改变,只有norm.factors变了
> yf$samples$norm.factors
 [1] 0.9027210 1.3505878 1.3102042 0.9202737 1.3636074 1.4155932
 [7] 0.9726943 1.0979219 0.9623963 0.9108200 0.7308098 0.7473150
[13] 0.6892828

> ##获得标准化后的数据
> #cpmyf <- cpm(yf) #已经考虑归一化系数了,等于cpmyf/norm.factors
> #t(t(yf$counts)/yf$samples$lib.size/yf$samples$norm.factors)*1000000
> lcpmyf <- cpm(yf,log=T)
> boxplot(lcpmyf,las=2,col=col.group)

三、设计矩阵和对比矩阵

需要两种矩阵:

有两种方法创建设计矩阵:

(1)对照组作为截距

> Group
 [1] WTF WTF WTF WTM WTM WTM MF  MF  MF  MF  MF  MMF MMF
Levels: WTF WTM MF MMF
> design <- model.matrix(~Group)
> colnames(design) <- c("WTF","WTMvsWTF","MFvsWTF","MMFvsWTF")
> design
   WTF WTMvsWTF MFvsWTF MMFvsWTF
1    1        0       0        0
2    1        0       0        0
3    1        0       0        0
4    1        1       0        0
5    1        1       0        0
6    1        1       0        0
7    1        0       1        0
8    1        0       1        0
9    1        0       1        0
10   1        0       1        0
11   1        0       1        0
12   1        0       0        1
13   1        0       0        1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$Group
[1] "contr.treatment"

(2)截距为0,对照组和实验组分别有自己的系数

> design <- model.matrix(~0+Group)
> colnames(design) <- c("WTF","WTM","MF","MMF")
> design
   WTF WTM MF MMF
1    1   0  0   0
2    1   0  0   0
3    1   0  0   0
4    0   1  0   0
5    0   1  0   0
6    0   1  0   0
7    0   0  1   0
8    0   0  1   0
9    0   0  1   0
10   0   0  1   0
11   0   0  1   0
12   0   0  0   1
13   0   0  0   1
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$Group
[1] "contr.treatment"

(3)对比矩阵

> cont.matrix <- makeContrasts(WTMvsWTF = WTM-WTF, MFvsWTF = MF-WTF, 
+                         MMFvsWTF = MMF-WTF, MMFvsMF = MMF-MF,
+                         levels = colnames(design))
> cont.matrix
      Contrasts
Levels WTMvsWTF MFvsWTF MMFvsWTF MMFvsMF
   WTF       -1      -1       -1       0
   WTM        1       0        0       0
   MF         0       1        0      -1
   MMF        0       0        1       1

用法:

makeContrasts(..., contrasts=NULL, levels)

> makeContrasts(B-A,C-B,C-A,levels=c("A","B","C"))
      Contrasts
Levels B - A C - B C - A
     A    -1     0    -1
     B     1    -1     0
     C     0     1     1

(4)成对样本的比较

成对样本的比较出现在我们比较两种处理条件,并且每个样本给予一个处理自然地与一个给予另一种处理的特定的样本配对的情况。与这种情况相关的经典检验是配对t检验。Paired samples occur when we compare two treatments and each sample given one treatment is naturally paired with a particular sample given the other treatment. This is a special case of blocking with blocks of size two. The classical test associated with this situation is the paired t-test.

上述对于成对样本的比较也可以用于存在batch effects或实验从不同的blocks进行的情况。处理条件仅在每一个模块内进行比较。
The above approach used for paired samples can be applied in any situation where there are batch effects or where the experiment has been conducted in blocks. The treatments can be adjusted for differences between the blocks by using a model formula of the form:

> design <- model.matrix(~Block+Treatment) 

In this type of analysis, the treatments are compared only within each block.

四、差异表达分析

limma最初设计用于分析微阵列数据,但目前已扩展到RNA-seq数据、DNA甲基化等。limma应用于RNA-seq数据,也是基于raw count的定量方式,但是它并不提供归一化算法,在官方手册中推荐采用edgeR的TMM归一化算法。read counts被转换为log2-counts-per-million(logCPM),并假设log-CPM值符合正态分布。可以有两种方式调整均值-方差的关系(mean-variance relationship),从而对log-CPM值进行线性建模:

在这两种情况下,RNA-seq数据可以被作为微阵列数据(microarray data)进行分析。这意味着limma包中的任何线性建模或基因集测试方法都可以应用于RNA-seq数据。

如果测序深度在RNA样本之间差异不大(最大library size和最小library size相差3倍以内),那么使用limma-trend是最简单和最稳健的方法。将计数数据转化为logCPM,然后logCPM值可以在任何标准的limma pipeline中进行分析。在运行eBayes或treat时使用trend=TRUE。
当library sizes在样本之间差距较大时,voom方法在理论上比limma-trend更有效。

(1)limma-trend

> lcpmyf <- cpm(yf,log=T)
> design <- model.matrix(~0+Group)
> colnames(design) <- gsub("Group","",colnames(design))
> rownames(design) <- colnames(yf)
> cont.matrix <- makeContrasts(WTMvsWTF = WTM-WTF, MFvsWTF = MF-WTF, 
+                              MMFvsWTF = MMF-WTF, MMFvsMF = MMF-MF,
+                              levels = colnames(design))
> cont.matrix
      Contrasts
Levels WTMvsWTF MFvsWTF MMFvsWTF MMFvsMF
   WTF       -1      -1       -1       0
   WTM        1       0        0       0
   MF         0       1        0      -1
   MMF        0       0        1       1
> fit <- lmFit(lcpmyf, design) 
> fit2 <- contrasts.fit(fit, cont.matrix) 
> fit3 <- eBayes(fit2, trend=TRUE) 
> topTable(fit3, coef=ncol(design))
             logFC   AveExpr         t      P.Value    adj.P.Val        B
rna23260  9.733111 -2.401474  18.96688 6.770586e-10 1.525278e-05 7.506843
rna4289   8.671071 -2.564865  17.44008 1.691543e-09 1.905354e-05 7.248625
rna6574   7.687889 -2.716123  15.91767 4.555986e-09 2.608897e-05 6.933027
> fit4 <- treat(fit2, lfc=log2(1.2),trend=TRUE) 
> topTreat(fit4, coef=ncol(design))
            logFC     AveExpr        t      P.Value    adj.P.Val
rna23260 9.733111 -2.40147385 21.44573 2.000093e-10 4.505809e-06
rna4289  8.671071 -2.56486452 19.03239 7.024639e-10 7.912553e-06
rna34624 8.412221 -2.49050322 17.32005 1.908439e-09 1.433110e-05
> dt <- decideTests(fit3)
> head(dt,3)
TestResults matrix
           Contrasts
            WTMvsWTF MFvsWTF MMFvsWTF MMFvsMF
  gene17427        0       0        0       0
  gene17428        0       0        0       0
  gene17507        0       0        0       0
> summary(dt)
       WTMvsWTF MFvsWTF MMFvsWTF MMFvsMF
Down         90      14       23      16
NotSig    20700   22477    22452   22481
Up         1738      37       53      31
> write.fit(fit3, dt, file="results.txt")

(2)voom

> v <- voom(yf, design, plot=TRUE) 
> v #EList对象
voom
> fit <- lmFit(v, design) 
> fit2 <- contrasts.fit(fit, cont.matrix)
> fit3 <- eBayes(fit2)
> plotSA(fit3, main="Final model: Mean-variance trend")
plotSA
> restop <- topTable(fit3, coef=ncol(design), n=Inf)

五、limma用法

(1)voom——Transform RNA-Seq Data Ready for Linear Modelling

voom(counts, design = NULL, lib.size = NULL, normalize.method = "none", block = NULL, correlation = NULL, weights = NULL, span = 0.5, plot = FALSE, save.plot = FALSE)

(2)lmFit

lmFit(object, design=NULL, ndups=1, spacing=1, block=NULL, correlation, weights=NULL, method="ls", ...)

(3)contrasts.fit

contrasts.fit(fit, contrasts=NULL, coefficients=NULL)

contrasts.fit

(4)eBayes

eBayes(fit, proportion = 0.01, stdev.coef.lim = c(0.1,4), trend = FALSE, robust = FALSE, winsor.tail.p = c(0.05,0.1))
treat(fit, lfc = log2(1.2), trend = FALSE, robust = FALSE, winsor.tail.p = c(0.05,0.1))

eBayes

(5)toptable

topTable(fit, coef=NULL, number=10, genelist=fitgenes, adjust.method="BH", sort.by="B", resort.by=NULL, p.value=1, lfc=0, confint=FALSE) topTableF(fit, number=10, genelist=fitgenes, adjust.method="BH", sort.by="F", p.value=1, lfc=0)
topTreat(fit, coef=1, sort.by="p", resort.by=NULL, ...)

toptable结果

(6)decideTests

decideTests(object, method = "separate", adjust.method = "BH", p.value = 0.05, lfc = 0, ...)

(7)write.fit

write.fit(fit, results = NULL, file, digits = NULL, adjust = "none", method = "separate", F.adjust = "none", quote = FALSE, sep = "\t", row.names = TRUE, ...)

This function writes a tab-delimited text file containing for each gene (1) the average log2-intensity, (2) the coefficients or contrasts (log2-fold-changes), (3) moderated t-statistics, (4) t-statistic P-values, (5) F-statistic if available, (6) F-statistic P-values if available, (7) classification if available and (8) gene names and annotation.

> dt <- decideTests(fit3)
> write.fit(fit3, dt, file="resultsvoom.txt")
> restxt <- read.table("resultsvoom.txt")
> colnames(restxt)
 [1] "A"                   "Coef.WTMvsWTF"      
 [3] "Coef.MFvsWTF"        "Coef.MMFvsWTF"      
 [5] "Coef.MMFvsMF"        "t.WTMvsWTF"         
 [7] "t.MFvsWTF"           "t.MMFvsWTF"         
 [9] "t.MMFvsMF"           "p.value.WTMvsWTF"   
[11] "p.value.MFvsWTF"     "p.value.MMFvsWTF"   
[13] "p.value.MMFvsMF"     "F"                  
[15] "F.p.value"           "Res.WTMvsWTF"       
[17] "Res.MFvsWTF"         "Res.MMFvsWTF"       
[19] "Res.MMFvsMF"         "Genes.chromosome"   
[21] "Genes.ref_gene_name"
上一篇下一篇

猜你喜欢

热点阅读