clusterprofiler
功能富集分析方法
过表征分析 (over representation analysis, ORA)
基因富集分析 (gene set enrichment analysis, GSEA)
……ORA的方法……我們關心的是:有興趣的基因中(genes of interest),與某個基因集(gene set),共同基因有幾個(K值)……我們可以用超幾何分布(Hypergeometric distribution)或二項式分佈(binomial distribution)來計算觀察值k的機率。
……GSEA的概念……首先將高通量實驗所量測到的基因排序,排列的順序是根據實驗量測到的數值決定……GSEA採用一個稱random walk的方法,也就是從基因列表的頭走到尾,如果碰到是基因集的基因就加分,不是則扣分。走完一趟後,回頭看走到哪兒時,分數最高(或最低),這個分數就是所謂的enrichment score (ES)……GSEA利用permutation testing的方法,也就是隨機抓取同等數量的基因當基因集,並計算得到隨機的ES,去估算實際觀察到的ES的P值,如果P值小於所設定的統計條件,就可以確保這ES並不是隨機就會發生。
clusterProfiler 支持多种本体论/通路 (ontology/pathway) 的超几何检验和基因富集分析。并且包内的函数 enricher() 和 GSEA() 能够分别对用户定义的注释信息进行超几何检验及基因富集分析。
输入数据
对于过表征分析 (over representation analysis, ORA), 我们需要的是一个包含基因ID的向量,基因ID可以从差异表达分析获得(例如 DESeq2 包)。
对于基因富集分析 (gene set enrichment analysis, GSEA), 我们需要一个经排序的基因列表,在这里我们调用 DOSE 包中的示例数据 geneList .
The geneList contains three features:
numeric vector: fold change or other type of numerical variable
named vector: every number was named by the corresponding gene ID
sorted vector: number should be sorted in decreasing order
data(geneList, package="DOSE")
head(geneList)
# 4312 8318 10874 55143 55388 991
# 4.572613 4.514594 4.418218 4.144075 3.876258 3.677857
## 假设想得到 |log2(FC)|>2 的 DEGs.
gene <- names(geneList)[abs(geneList) > 2]
head(gene)
# [1] "4312" "8318" "10874" "55143" "55388" "991"
GO分析
支持的物种
GO分析(groupGO(), enrichGO(), gseGO())支持Bioconductor
提供的 OrgDb 中已有的20个物种。也可以通过AnnotationHub
在线检索并抓取 OrgDb.
GO分类
函数 groupGO()
的设计是基于在特定水平的GO分布,从而对基因进行分类。
library(org.Hs.eg.db)
转换ID,参数'gene'可以是OrgDb支持的任何ID形式
gene.df <- bitr(gene, fromType = "ENTREZID",
toType = c("ENSEMBL", "SYMBOL"),
OrgDb = org.Hs.eg.db)
head(gene.df)
# ENTREZID ENSEMBL SYMBOL
# 1 4312 ENSG00000196611 MMP1
# 2 8318 ENSG00000093009 CDC45
# 3 10874 ENSG00000109255 NMU
# 4 55143 ENSG00000134690 CDCA8
# 5 55388 ENSG00000065328 MCM10
# 6 991 ENSG00000117399 CDC20
OrgDb = org.Hs.eg.db,
ont = "CC",
level = 3, ## Specific GO Level.
readable = TRUE) ## the gene IDs will mapping to gene symbols.
head(ggo)
# ID Description Count GeneRatio
# GO:0005886 GO:0005886 plasma membrane 52 52/207
# GO:0005628 GO:0005628 prospore membrane 0 0/207
# GO:0005789 GO:0005789 endoplasmic reticulum membrane 8 8/207
# GO:0019867 GO:0019867 outer membrane 2 2/207
# # # # geneID
# GO:0005886 S100A9/MELK/S100A8/MARCO/ASPM/CXCL10/LAMP3/CEP55/UGT8/UBE2C/SLC7A5/CXCL9/FADS2/MSLN/IL1R2/KIF18A/S100P/GZMB/TRAT1/GABRP/AQP9/GPR19/SLC2A6/KIF20A/LAG3/NUDT1/CACNA1D/VSTM4/CORIN/KCNK15/CA12/KCNE4/HLA-DQA1/ADH1B/PDZK1/C7/ACKR1/COL17A1/PSD3/EMCN/SLC44A4/LRP2/NLGN4X/MAPT/ERBB4/CX3CR1/LAMP5/ABCA8/STEAP4/PTPRT/TMC5/CYBRD1
# GO:0005628 # # #
# GO:0005789 # # # FADS2/CDK1/CHODL/ITPR1/HLA-DQA1/CYP4F8/CYP4B1/FMO5
# GO:0019867 # # # MAOB/PGR
Level 1 provides the highest list coverage with the least amount of term specificity. With each increasing level coverage decreases while specificity increases so that level 5 provides the least amount of coverage with the highest term specificity. (Dennis, Glynn, et al)
GO ORA
ego <- enrichGO(gene = gene,
universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
head(ego)
# ID Description GeneRatio BgRatio pvalue p.adjust qvalue
# GO:0000779 GO:0000779 condensed chromosome, centromeric region 14/197 85/11434 1.662431e-10 3.802012e-08 3.572360e-08
# GO:0000776 GO:0000776 kinetochore 15/197 104/11434 2.551686e-10 3.802012e-08 3.572360e-08
# GO:0000775 GO:0000775 chromosome, centromeric region 17/197 147/11434 5.495085e-10 5.458451e-08 5.128746e-08
# GO:0000793 GO:0000793 condensed chromosome 17/197 154/11434 1.141189e-09 8.501857e-08 7.988322e-08
# # geneID Count
# GO:0000779 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/CDT1/BIRC5/NCAPG/AURKB/CCNB1 14
# GO:0000776 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/AURKB/CCNB1 15
# GO:0000775 CDCA8/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/CCNB1 17
# GO:0000793 CENPE/NDC80/TOP2A/NCAPH/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/CDT1/BIRC5/NCAPG/AURKB/CHEK1/CCNB1 17
如果直接调用OrgDb
中的 ID,则需要在参数中确定 'KeyType'.
ego2 <- enrichGO(gene = gene.df$ENSEMBL,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
通过设置参数readable=Ture
或 函数 setReadable()
可以将 Gene ID
转换为 symbol
.
ego2 <- setReadable(ego2, OrgDb = org.Hs.eg.db)
before:
图片.png
after:
图片.png
5.3.1 移除特定的 GO term 或 GO level
函数 dropGO 可以在由 enrichGO 得到的结果中移除特定的 GO term 或 GO level.
5.3.2 将结果限定在特定的 Go level
enrichGO() 不含设置 GO level 的参数,而函数 gofilter() 可以将结果限定在特定的 GO level.
5.3.3 GO term 去冗余
rmredunego <- simplify(ego, cutoff=0.7, by="p.adjust", select_fun=min)
before:
after:
GO GSEA
ego3 <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "CC",
nPerm = 1000, ## 排列数
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE) ## 不输出结果
KEGG分析
函数 search_kegg_organism()
可以帮助搜索 KEGG 数据库支持的物种。
物种缩写戳这里:https://www.genome.jp/kegg/catalog/org_list.html
earch_kegg_organism('ece', by='kegg_code')
# kegg_code scientific_name common_name
# 366 ece Escherichia coli O157:H7 EDL933 (EHEC) <NA>
ecoli <- search_kegg_organism('Escherichia coli', by='scientific_name')
dim(ecoli)
# [1] 65 3
head(ecoli)
# kegg_code scientific_name common_name
# 361 eco Escherichia coli K-12 MG1655 <NA>
# 362 ecj Escherichia coli K-12 W3110 <NA>
# 363 ecd Escherichia coli K-12 DH10B <NA>
# 364 ebw Escherichia coli BW2952 <NA>
# 365 ecok Escherichia coli K-12 MDS42 <NA>
# 366 ece Escherichia coli O157:H7 EDL933 (EHEC) <NA>
search_kegg_organism('hsa', by='kegg_code')
# kegg_code scientific_name common_name
# 1 hsa Homo sapiens human
dim(hsapiens)
[1] 1 3
hsapiens <- search_kegg_organism('Homo sapiens', by='scientific_name')
hsapiens
# kegg_code scientific_name common_name
# 1 hsa Homo sapiens human
KEGG ORA
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
kk <- enrichKEGG(gene = gene,
organism = 'hsa',
pvalueCutoff = 0.05)
head(kk)
# ID
# hsa04110 hsa04110
# hsa04114 hsa04114
# hsa04218 hsa04218
# hsa04061 hsa04061
# hsa03320 hsa03320
# hsa04914 hsa04914
# Description
# hsa04110 Cell cycle
# hsa04114 Oocyte meiosis
# hsa04218 Cellular senescence
# hsa04061 Viral protein interaction with cytokine and cytokine receptor
# hsa03320 PPAR signaling pathway
# hsa04914 Progesterone-mediated oocyte maturation
# GeneRatio BgRatio pvalue p.adjust qvalue
# hsa04110 11/93 124/7861 1.966898e-07 3.815781e-05 3.726753e-05
# hsa04114 10/93 125/7861 1.907932e-06 1.850694e-04 1.807515e-04
# hsa04218 10/93 160/7861 1.742571e-05 1.048593e-03 1.024128e-03
# hsa04061 8/93 100/7861 2.162048e-05 1.048593e-03 1.024128e-03
# hsa03320 7/93 76/7861 2.906892e-05 1.127874e-03 1.101559e-03
# hsa04914 7/93 99/7861 1.587827e-04 5.133975e-03 5.014191e-03
# geneID Count
# hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232 11
# hsa04114 991/9133/983/4085/51806/6790/891/9232/3708/5241 10
# hsa04218 2305/4605/9133/890/983/51806/1111/891/776/3708 10
# hsa04061 3627/10563/6373/4283/6362/6355/9547/1524 8
# hsa03320 4312/9415/9370/5105/2167/3158/5346 7
# hsa04914 9133/890/983/4085/6790/891/5241 7
KEGG GSEA
kk2 <- gseKEGG(geneList = geneList,
organism = 'hsa',
nPerm = 1000,
minGSSize = 120,
pvalueCutoff = 0.05,
verbose = FALSE)
head(kk2,n=3)
# ID Description setSize enrichmentScore
# hsa04151 hsa04151 PI3K-Akt signaling pathway 322 -0.3482755
# hsa04510 hsa04510 Focal adhesion 188 -0.4188582
# hsa03013 hsa03013 RNA transport 131 0.4116488
# NES pvalue p.adjust qvalues rank
# hsa04151 -1.498250 0.001278772 0.01648352 0.01098901 1997
# hsa04510 -1.712296 0.001390821 0.01648352 0.01098901 2183
# hsa03013 1.750526 0.003067485 0.01648352 0.01098901 3383
# leading_edge
# hsa04151 tags=23%, list=16%, signal=20%
# hsa04510 tags=27%, list=17%, signal=23%
# hsa03013 tags=40%, list=27%, signal=29%
# core_enrichment
# hsa04151 2252/7059/92579/5563/5295/6794/1288/7010/3910/3371/3082/1291/4602/3791/1027/90993/3441/3643/1129/2322/1975/7450/596/3685/1942/2149/1280/4804/3675/595/2261/7248/2246/4803/3912/1902/1278/1277/2846/2057/1293/2247/55970/5618/7058/10161/56034/3693/4254/3480/4908/5159/1292/3908/2690/3909/8817/9223/4915/3551/2791/63923/3913/9863/3667/1287/3679/7060/3479/80310/1311/5105/2066/1101
# hsa04510 5228/7424/1499/4636/83660/7059/5295/1288/23396/3910/3371/3082/1291/394/3791/7450/596/3685/1280/3675/595/2318/3912/1793/1278/1277/1293/10398/55742/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/1287/3679/7060/3479/10451/80310/1311/1101
# hsa03013 10460/1978/55110/54913/9688/8894/11260/10799/9631/4116/5042/8761/6396/23165/8662/10248/55706/79833/9775/29107/23636/5905/9513/5901/10775/10557/4927/79902/1981/26986/11171/10762/8480/8891/11097/26019/10940/4686/9972/81929/10556/3646/9470/387082/1977/57122/8563/7514/79023/3837/9818/56000
重头戏——功能富集结果的可视化
这时要用到另一个包:enrichplot
,它可以实现多种可视化,更好地说明富集分析结果。
library(enrichplot)
Bar plot
oragene <- enrichDGN(gene)
barplot(oragene,showCategory = 20)
## 该函数默认参数为:
enrichDGN(gene, pvalueCutoff = 0.05, pAdjustMethod = "BH", universe,
minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2,
readable = FALSE)
图片.png
Dot Plot
gseagene <- gseNCG(geneList, nPerm=10000)
p1 <- dotplot(oragene, showCategory=30) + ggtitle("dotplot for ORA")
p2 <- dotplot(gseagene, showCategory=30) + ggtitle("dotplot for GSEA")
plot_grid(p1, p2, ncol=2)
## 一个瘦一个胖好丑啊
图片.png
Gene-Concept Network
前面的 两款神器 两个函数,都只能展示富集最显著的 GO term,而函数 cnetplot()
可以将基因与生物学概念 (e.g.* GO terms or KEGG pathways) 的关系绘制成网状图。
oragnx <- setReadable(oragene, 'org.Hs.eg.db', 'ENTREZID') ## 将 Gene ID 转换为 symbol
cnetplot(orangx, foldChange=geneList)
图片.png
cnetplot(oragnx, categorySize="pvalue", foldChange=geneList) ## categorySize 可以是 "pvalue" 或 "geneNum"
图片.png
cnetplot(oragnx, foldChange=geneList, circular = TRUE, colorEdge = TRUE) ## 圆形布局,给线条上色
图片.png
Heatmap
热图能够简化结果,更容易分辨表达模式 (expression patterns) 。
heatplot(oragnx, foldChange=geneList)
Enrichment Map
Enrichment Map
可以将富集条目和重叠的基因集整合为一个网络图,相互重叠的基因集则趋向于成簇,从而易于分辨功能模型。
emapplot(oragene)
图片.png
UpSet Plot
函数 upsetplot()
是 cneplot()
的一种替代方案,用于可视化基因与基因集间的复杂关联,而upsetplot()
更着重于不同基因集间基因的重叠情况。
upsetplot(oragene)
图片.png
GSEA结果的表达分布叠嶂图 (ridgeline plot for expression distribution of GSEA result)
更直观地展示上调/下调的通路。
ridgeplot(gseagene)
图片.png
富集条目在 PubMed 上的趋势
函数pmcplot()
可以就某些富集 条目/通路 绘制其在 PubMed 上的 数量/比例 折线图。
terms <- oragene$Description[1:3]
p <- pmcplot(terms, 2012:2019) ## 默认为proportion=TRUE
p2 <- pmcplot(terms, 2012:2019, proportion=FALSE)
plot_grid(p, p2, ncol=2)
图片.png
goplot
函数goplot()
接受enrichGO()
的输出,并将其可视化。
browseKEGG
函数browseKEGG
可以帮你打开浏览器,嗯。
browseKEGG(kk, 'hsa04110')
pathview 包里的 上帝视角 PATHVIEW!
library(pathview)
hsa04110 <- pathview(gene.data = geneList,
pathway.id = "hsa04110",
species = "hsa",
limit = list(gene=max(abs(geneList)), cpd=1)) ## cpd, compound
# Info: Downloading xml files for hsa04110, 1/1 pathways..
# Info: Downloading png files for hsa04110, 1/1 pathways..
# 'select()' returned 1:1 mapping between keys and columns
# Info: Working in directory /YOUR PATH/Project/clusterProfiler
# Info: Writing image file hsa04110.pathview.png
图片.png
作者:鬼笔环钛汰太C_137
链接:https://www.jianshu.com/p/e133ab3169fa
来源:简书
著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。