有参转录组学习六：差异表达分析

2019-05-17 本文已影响0人颤抖吧__小虫子

Author：ligc

Date：19/5/18

DEG differential expression genes

input1 input2

GeneA: input1-reads count input2-reads count

fold change = input2-reads count / input1-reads count

问题1. 测序量是否一致：测序量，矫正测序量，normalization

1.1 quantile，1.2 genometry， 1.3 traditional

问题2 怎么去衡量，哪个才是真正显著的基因？

200000 -> 250000 ; 2000 -> 2500； 200 -> 250
哪个才是显著差异的？

基本假设：

1. 大部分的gene表达量不变；
2. 高表达的gene表达量不变；

1.为什么大多数基因表达量不变？

p-value

H0: 两者没有差异；
H1：两者有差异；
统计量x，a=0.05，肯定H0或者否定H1；
input1，2000，input2, 2500，p-value = 0.001
p-value < 0.05 , 小概率事件；单次实验不发生；否定H0；有差异

为什么高表达的gene表达量不变？

因为如果高表达基因表达量发生变化则会影响其他基因表达量也发生变化

input1; geneA,B,C 200 = 150 + 25 + 25
input2; geneA,B,C 200 = 130 + 35 + 35

问题3

如果高表达的基因变了怎么办？

如果大多数表达的基因变了怎么办？

3.1 掺入spike-in（针对以上两种情况）

3.2 house keeping gene进行校正（针对高表达基因）

bioinformatics

核心是为了降噪！而不是提高噪音！
we do not need bioinfor magics

cuffdiff pipeline

FASTQ-reads -> mapping (HISAT2,tophat2) -> BAM -> cuffdiff -> DEGs(FPKM)

DESeq2 pipeline

FASTQ-reads -> mapping (HISAT2,tophat2) -> BAM -> Gene Raw Reads Count -> find DEGs in R (No FPKM)

泊松分布

不太常见的事情发生的次数

比如被雷劈

lambda， E，Var

负二项分布

两盒火柴，A,B，随机取光一盒火柴所用的次数X
sum(A,B) >= X >= min(A,B)
类比于：
reads落在gene1上的概率，即reads_count符合negative binomial distribution
gene1, non-gene1
TPM = 1E6 * FPKM / total FPKM

DESeq2

基本流程：

输入数据：

表达矩阵、样品信息矩阵、差异比较矩阵

步骤：

构建一个dds(DESeqDataSet)的对象
利用DESeq函数进行标准化处理
用result函数来提取差异比较的结果

Import

构建dds矩阵

dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design= ~ batch + condition)

表达矩阵：即countData，就是通过之前的HTSeq-count生成的reads-count计算融合的矩阵。行为基因名，列为样品名，值为reads或fragment的整数。
样品信息矩阵：即代码中的colData，它的类型是一个dataframe（数据框），第一列是样品名称，第二列是样品的处理情况（对照还是处理等）。可以从表达矩阵中导出或是自己单独建立。
差异比较矩阵：即代码中的design，差异比较矩阵就是告诉差异分析函数哪些是对照，哪些是处理。

library(tidyr)
library(DESeq2)
rm(list = ls())
getwd()
list.files()
count_tab = read.table("merge_count.matrix",header =T)# ,row.name =1

count_tab <- separate(count_tab,gene_id,into = "gene_id",sep = "[.]")

rownames(count_tab) = count_tab$gene_id
count_tab = count_tab[,-c(1)]

colData = read.table("SRR_Acc_List_sample.txt",header = T)

colData$condition = factor(colData$condition,c("control","Akap95"))

dds <- DESeqDataSetFromMatrix(countData = count_tab,
                              colData = colData,
                              design= ~ condition)
dds <- DESeq(dds)
resultsNames(dds) # lists the coefficients
res <- results(dds, name="condition_Akap95_vs_control")
res = res[order(res$padj), ]
resDF = as.data.frame(res)
resDF$gene_id = rownames(resDF)
resDF = resDF[ ,c(7,1,2,3,4,5,6)]
##cbind
#resDF <- cbind("gene_id"=rownames(resDF),resDF)
DEG_list <- subset(resDF,padj<0.05 &abs(log2FoldChange)>1)
write.table(DEG_list,file = "mouse_DEG_list",row.names = F,sep = "\t",quote = F)
write.table(resDF,file = "mouse_Akap95_DEG.tab",sep = "\t",quote = FALSE,row.names = FALSE)
# or to shrink log fold changes association with condition:
res <- lfcShrink(dds, coef="condition_Akap95_vs_control", type="apeglm")
## PCA plot
library(ggplot2)
vsd <- vst(dds, blind=FALSE)
pcaData <- plotPCA(vsd, intgroup=c("condition"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
plot_Fig = ggplot(pcaData, aes(PC1, PC2, color=condition)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()
ggsave(plot_Fig,filename = "mouse_DEseq2_PCA.pdf")
# plotPCA(vsd, intgroup=c("condition", "type"))
## MA plot
pdf("mouse_MAplot.pdf")
plotMA(res, ylim=c(-2,2))
dev.off()

MA_plot
mean of normalized counts 表示经过标准化后的平均reads count。
log fold change 基因表达量的倍数变化。

PCA

heatmap

参考文章：

1.https://www.jianshu.com/p/26511d3360c8
2.https://www.jianshu.com/p/5f94ae79f298
3.https://www.bilibili.com/video/av19929350

有参转录组学习六：差异表达分析

Author：ligc

Date：19/5/18

DEG differential expression genes

input1 input2

GeneA: input1-reads count input2-reads count

问题1. 测序量是否一致：测序量，矫正测序量，normalization

1.1 quantile，1.2 genometry， 1.3 traditional

问题2 怎么去衡量，哪个才是真正显著的基因？

基本假设：

1.为什么大多数基因表达量不变？

p-value

为什么高表达的gene表达量不变？

因为如果高表达基因表达量发生变化则会影响其他基因表达量也发生变化

问题3

如果高表达的基因变了怎么办？

如果大多数表达的基因变了怎么办？

3.1 掺入spike-in（针对以上两种情况）

3.2 house keeping gene进行校正（针对高表达基因）

bioinformatics

cuffdiff pipeline

DESeq2 pipeline

泊松分布

不太常见的事情发生的次数

比如被雷劈

负二项分布

DESeq2

基本流程：

输入数据：

步骤：

构建dds矩阵

参考文章：

猜你喜欢

热点阅读

有参转录组学习六：差异表达分析

Author：ligc

Date：19/5/18

DEG differential expression genes

input1 input2

GeneA: input1-reads count input2-reads count

问题1. 测序量是否一致 ：测序量，矫正测序量，normalization

1.1 quantile，1.2 genometry， 1.3 traditional

问题2 怎么去衡量，哪个才是真正显著的基因？

基本假设：

1.为什么大多数基因表达量不变？

p-value

为什么高表达的gene表达量不变？

因为如果高表达基因表达量发生变化则会影响其他基因表达量也发生变化

问题3

如果高表达的基因变了怎么办？

如果大多数表达的基因变了怎么办？

3.1 掺入spike-in（针对以上两种情况）

3.2 house keeping gene进行校正（针对高表达基因）

bioinformatics

cuffdiff pipeline

DESeq2 pipeline

泊松分布

不太常见的事情发生的次数

比如被雷劈

负二项分布

DESeq2

基本流程：

输入数据：

步骤：

构建dds矩阵

参考文章：

猜你喜欢

热点阅读

问题1. 测序量是否一致：测序量，矫正测序量，normalization