R语言与科研转录组数据分析

R实战 | 用R也可以完成的RNA-Seq分析-3

2019-10-01  本文已影响0人  尘世中一个迷途小书僮

Differential Expression of RNA-seq data

本文将要介绍的是在R中进行RNA-seq 数据基因表达差异分析的实战代码.

原文地址:https://bioinformatics-core-shared-training.github.io/RNAseq-R/rna-seq-de.nb.html

书接上文(用R也可以完成的RNA-Seq下游分析-2

在预处理生成了表达矩阵且标准化后,接下来我们要做的就是差异分析了。


本次流程需要的包

同时载入上次预处理完的数据preprocessing.Rdata

library(edgeR)
library(limma)
library(Glimma)
library(gplots)
library(org.Mm.eg.db)
load("preprocessing.Rdata")

edgeR包的分析流程中,构建了表达矩阵后,便需要使用model.matrix函数创建保存有分组信息的矩阵design matrix。该矩阵以\color{red}{0/1}的方式存储了分组信息。本次分析中的分组信息包括了:小鼠状态与细胞类型,现在我们先在两种因素并不存在相互作用的假设下拟合线性模型。

> group <- as.character(group)
> group
 [1] "basal.virgin"     "basal.virgin"     "basal.pregnant"   "basal.pregnant"   "basal.lactate"   
 [6] "basal.lactate"    "luminal.virgin"   "luminal.virgin"   "luminal.pregnant" "luminal.pregnant"
[11] "luminal.lactate"  "luminal.lactate" 

group的分组信息包括了状态和细胞类型,因此我们使用strsplit函数,以‘.’为分隔把细胞类型信息和状态信息分别提取出来。

> type <- sapply(strsplit(group, ".", fixed=T), function(x) x[1])
> status <- sapply(strsplit(group, ".", fixed=T), function(x) x[2])
> type
 [1] "basal"   "basal"   "basal"   "basal"   "basal"   "basal"   "luminal" "luminal" "luminal"
[10] "luminal" "luminal" "luminal"
> status
 [1] "virgin"   "virgin"   "pregnant" "pregnant" "lactate"  "lactate"  "virgin"   "virgin"  
 [9] "pregnant" "pregnant" "lactate"  "lactate" 

成功提取后,构建分组信息矩阵

> design <- model.matrix(~ type + status)
> design
   (Intercept) typeluminal statuspregnant statusvirgin
1            1           0              0            1
2            1           0              0            1
3            1           0              1            0
4            1           0              1            0
5            1           0              0            0
6            1           0              0            0
7            1           1              0            1
8            1           1              0            1
9            1           1              1            0
10           1           1              1            0
11           1           1              0            0
12           1           1              0            0
attr(,"assign")
[1] 0 1 2 2
attr(,"contrasts")
attr(,"contrasts")$type
[1] "contr.treatment"

attr(,"contrasts")$status
[1] "contr.treatment"

Estimating the dispersion

> dgeObj <- estimateCommonDisp(dgeObj)
#Then we estimate gene-wise dispersion estimates, allowing a possible trend with averge count size:
> dgeObj <- estimateGLMTrendedDisp(dgeObj)
> dgeObj <- estimateTagwiseDisp(dgeObj)
 #Plot the estimated dispersions
> plotBCV(dgeObj)

Testing for differential expression

> fit <- glmFit(dgeObj, design)
# 看一眼系数
> head(coef(fit))
       (Intercept) typeluminal statuspregnant statusvirgin
497097  -11.187922 -7.58804851     -0.7085514  -0.09305118
20671   -12.715063 -1.85287334      0.2269001   0.49554506
27395   -11.221391  0.56368066     -0.1415910  -0.29221577
18777   -10.146793  0.08280255     -0.1845489  -0.48795441
21399    -9.909825 -0.24195503      0.1753606   0.13494615
58175   -16.310131  3.09936215      1.1975518   0.84742701
# Conduct likelihood ratio tests for luminal vs basal and show the top genes:
> lrt.BvsL <- glmLRT(fit, coef=2)

# 查看前10个最显著的差异表达基因
> topTags(lrt.BvsL)
Coefficient:  typeluminal 
           logFC    logCPM       LR       PValue          FDR
110308 -8.940579 10.264297 24.89789 6.044844e-07 0.0004377961
50916  -8.636503  5.749781 24.80037 6.358512e-07 0.0004377961
12293  -8.362247  6.794788 24.68526 6.749827e-07 0.0004377961
56069  -8.419433  6.124377 24.41532 7.764861e-07 0.0004377961
24117  -9.290691  6.757163 24.32506 8.137331e-07 0.0004377961
12818  -8.216790  8.172247 24.24233 8.494462e-07 0.0004377961
22061  -8.034712  7.255370 24.16987 8.820135e-07 0.0004377961
12797  -9.001419  9.910795 24.12854 9.011487e-07 0.0004377961
50706  -7.697022 10.809629 24.06926 9.293193e-07 0.0004377961
237979 -8.167451  5.215921 24.03528 9.458678e-07 0.0004377961

最后,保存差异分析的结果(实际分析时可不保存,可以直接代码分析)

save(lrt.BvsL,dgeObj,group,file="DE.Rdata")
上一篇下一篇

猜你喜欢

热点阅读