RNA-Seq(9):使用GSEA做GO/KEGG富集分析
2022-02-25 本文已影响0人
Z_bioinfo
最广为人知的富集分析做法是把上调、下调基因分别或者合并,拿来做GO和KEGG富集分析。经常有一些数据集,拿差异基因做得不到结果,那是因为确实富集不到任何通路,是正常的。不妨试试GSEA,不是拿差异基因,而是拿全部基因作为输入。
GSEA与GO,KEGG分析区别:GO,KEGG分析更加依赖差异基因,实则是对一部分基因的分析 (忽略差异不显著的基因),而GSEA是从全体基因的表达矩阵中找出具有协同差异 (concordant differences)的基因集,故能兼顾差异较小的基因
GO,KEGG富集是定性的分析,GSEA考虑到了表达或其它度量水平的值的影响。GSEA分析不需要指定阈值(p值或FDR)来筛选差异基因,在没有经验存在的情况下分析我们感兴趣的基因集,而这个基因集不一定是显著差异表达的基因。GSEA分析可以将那些GO/KEGG富集分信息中容易遗漏掉的差异表达不显著却有着重要生物学意义的基因包含在内。
另外,对于时间序列数据或样品有定量属性时,GSEA的优势会更明显,不需要每个分组分别进行富集,直接对整体进行处理。
Gene set enrichment analysis, 基因集富集分析
数据准备:直接用之前的DESeq2差异分析后的数据来演示,只需要ENTREZID和foldchange (或logFC)两列。
开始分析:导入要用到的R包
library(AnnotationDbi)
library(org.Hs.eg.db) 基因注释需要
library(clusterProfiler)
library(stringr)
library(enrichplot)#画图需要
library(stats)
library(dplyr)
数据准备,制作geneList
以0.05为阈值,选取DESeq2获得的差异基因
df <-subset(res, padj < 0.05 & abs(log2FoldChange) > 1)提取差异基因,因为abs(log2FoldChange) > 2在KEGG富集分析中富集不到结果,所以条件降低
df = data.frame(df)
df <- df %>% dplyr::select(log2FoldChange,external_gene_name,entrez) %>% 去掉NA filter(entrez!="NA") %>% 去掉重复 distinct(entrez,.keep_all = T)
制作genelist的三部曲
## 1.获取基因logFC
geneList <- df$log2FoldChange
## 2.命名
names(geneList) = df$entrez
## 3.排序很重要
geneList = sort(geneList, decreasing = TRUE)
看一下这个genelist,增加感性的理解
head(geneList)
247 7704 10218 401613 5625 3000
9.625966 7.180409 5.803614 5.528904 5.394581 5.377482
GSEA分析
1.完成KEGG的GSEA分析
gseaKEGG <- gseKEGG(geneList = geneList,organism = 'hsa',
minGSSize = 20,pvalueCutoff = 0.05,verbose = FALSE)
gseaKEGG@result
ID Description setSize
hsa04060 hsa04060 Cytokine-cytokine receptor interaction 25
hsa05165 hsa05165 Human papillomavirus infection 23
hsa01100 hsa01100 Metabolic pathways 78
hsa04151 hsa04151 PI3K-Akt signaling pathway 23
hsa05200 hsa05200 Pathways in cancer 45
hsa04010 hsa04010 MAPK signaling pathway 22
hsa04360 hsa04360 Axon guidance 22
enrichmentScore NES pvalue p.adjust qvalues rank
hsa04060 -0.4346572 -1.9903532 0.00205739 0.01440173 0.01299404 236
hsa05165 0.3294802 1.3562073 0.11940299 0.41791045 0.37706206 136
hsa01100 0.2029274 1.1263490 0.31649331 0.58913765 0.53155276 59
hsa04151 0.2658833 1.0944297 0.33665008 0.58913765 0.53155276 136
hsa05200 0.1908069 0.9351458 0.55161787 0.77226502 0.69678047 232
hsa04010 0.2025315 0.8116001 0.71140940 0.82997763 0.74885200 69
hsa04360 -0.1579634 -0.6940100 0.83251232 0.83251232 0.75113893 668
leading_edge
hsa04060 tags=72%, list=30%, signal=52%
hsa05165 tags=39%, list=17%, signal=33%
hsa01100 tags=12%, list=7%, signal=12%
hsa04151 tags=30%, list=17%, signal=26%
hsa05200 tags=40%, list=29%, signal=30%
hsa04010 tags=18%, list=9%, signal=17%
hsa04360 tags=100%, list=85%, signal=16%
core_enrichment
hsa04060 8200/4803/51554/130399/4982/7042/10673/8744/1230/9518/83729/6387/3592/146433/3976/9966/85480/6355
hsa05165 8515/7855/8325/2308/1311/3908/1285/1286/7057
hsa01100 247/5625/3000/55301/10840/2878/4128/2752/8707
hsa04151 8515/27006/1311/3908/1285/1286/7057
hsa05200 7704/54567/27006/7855/8325/2308/3908/1285/1286/10235/5336/5295/7296/23604/7048/5881/4616/8202
hsa04010 783/1843/27006/3306
hsa04360 7225/151449/5336/5295/5881/3983/2043/2051/85464/219699/6091/84552/10371/80031/2049/2044/8503/64101/2048/6387/815
ID 代表KEGG中的信号通路
Description 对信号通路的描述
setSize 该信号通路的基因个数
enrichmentScore 富集分数,也就是ES
NES 标准化以后的ES,全称normalized enrichment score、
qvalues ,或者说FDR q-val(false discovery rate)错误发现率
rank 排名
core_enrichment,富集该目的通路的基因列表。
作图展示富集分布图
> library(ggplot2)
>dotplot(gseaKEGG,showCategory=12,split=".sign")+facet_grid(~.sign)
image.png
对Cytokine-cytokine receptor interaction感兴趣,使用gseaplot2把他画出来
library(enrichplot)
pathway.id = "hsa04060"
gseaplot2(gseaKEGG,
color = "red",
geneSetID = pathway.id,
pvalue_table = T)
image.png
pathview 展示
我们现在知道Cytokine-cytokine receptor interaction setSize enrichmentScore是被抑制的,如果还想看一下这个通路里面的基因是如何变化的,应该怎么办呢,pathview 可以帮到我们。
library(pathview)
pathway.id = "hsa04060"
pv.out <- pathview(gene.data = geneList,
pathway.id = pathway.id,
species = "hsa")
改变一下参数,可以得到另外一种构图
pv.out <- pathview(gene.data = geneList,
pathway.id = pathway.id,
species = "hsa",
kegg.native = F)
image.png
2.GO分析
ego <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "BP", pvalueCutoff = 0.05, pAdjustMethod = "BH")
head(ego@result,3)
ID
GO:0051962 GO:0051962
GO:0051965 GO:0051965
GO:1901890 GO:1901890
Description
GO:0051962 positive regulation of nervous system development
GO:0051965 positive regulation of synapse assembly
GO:1901890 positive regulation of cell junction assembly
setSize enrichmentScore NES pvalue
GO:0051962 50 -0.5640653 -2.405584 3.57729e-07
GO:0051965 11 -0.8105886 -2.320797 8.35403e-06
GO:1901890 19 -0.6792836 -2.265343 3.74202e-05
p.adjust qvalues rank
GO:0051962 0.0009179327 0.0008589262 363
GO:0051965 0.0107182200 0.0100292324 320
GO:1901890 0.0320067451 0.0299492905 320
leading_edge
GO:0051962 tags=40%, list=14%, signal=35%
GO:0051965 tags=73%, list=12%, signal=64%
GO:1901890 tags=47%, list=12%, signal=42%
core_enrichment
GO:0051962 22891/54674/6091/7422/4803/26011/2049/1382/2048/6387/146433/57633/40/23767/8153/3976/1959/84189/7472/26045
GO:0051965 54674/2049/2048/57633/40/23767/84189/26045
GO:1901890 54674/7422/2049/2048/57633/40/23767/84189/26045
#可视化
dotplot(ego, showCategory=30)