R学习与可视化注释和富集单细胞

clusterprofiler

2021-05-22  本文已影响0人  MLD_TRNA

功能富集分析方法

过表征分析 (over representation analysis, ORA)

基因富集分析 (gene set enrichment analysis, GSEA)

        ……ORA的方法……我們關心的是:有興趣的基因中(genes of interest),與某個基因集(gene set),共同基因有幾個(K值)……我們可以用超幾何分布(Hypergeometric distribution)或二項式分佈(binomial distribution)來計算觀察值k的機率。
        ……GSEA的概念……首先將高通量實驗所量測到的基因排序,排列的順序是根據實驗量測到的數值決定……GSEA採用一個稱random walk的方法,也就是從基因列表的頭走到尾,如果碰到是基因集的基因就加分,不是則扣分。走完一趟後,回頭看走到哪兒時,分數最高(或最低),這個分數就是所謂的enrichment score (ES)……GSEA利用permutation testing的方法,也就是隨機抓取同等數量的基因當基因集,並計算得到隨機的ES,去估算實際觀察到的ES的P值,如果P值小於所設定的統計條件,就可以確保這ES並不是隨機就會發生。

clusterProfiler 支持多种本体论/通路 (ontology/pathway) 的超几何检验和基因富集分析。并且包内的函数 enricher() 和 GSEA() 能够分别对用户定义的注释信息进行超几何检验及基因富集分析。

输入数据

对于过表征分析 (over representation analysis, ORA), 我们需要的是一个包含基因ID的向量,基因ID可以从差异表达分析获得(例如 DESeq2 包)。

对于基因富集分析 (gene set enrichment analysis, GSEA), 我们需要一个经排序的基因列表,在这里我们调用 DOSE 包中的示例数据 geneList .

The geneList contains three features:

    numeric vector: fold change or other type of numerical variable
    named vector: every number was named by the corresponding gene ID
    sorted vector: number should be sorted in decreasing order
data(geneList, package="DOSE")
head(geneList)
# 4312     8318    10874    55143    55388      991 
# 4.572613 4.514594 4.418218 4.144075 3.876258 3.677857 

## 假设想得到 |log2(FC)|>2 的 DEGs.
gene <- names(geneList)[abs(geneList) > 2]
head(gene)
# [1] "4312"  "8318"  "10874" "55143" "55388" "991"

GO分析

支持的物种

GO分析(groupGO(), enrichGO(), gseGO())支持Bioconductor提供的 OrgDb 中已有的20个物种。也可以通过AnnotationHub在线检索并抓取 OrgDb.

GO分类

函数 groupGO()的设计是基于在特定水平的GO分布,从而对基因进行分类。
library(org.Hs.eg.db)

转换ID,参数'gene'可以是OrgDb支持的任何ID形式

gene.df <- bitr(gene, fromType = "ENTREZID", 
                toType = c("ENSEMBL", "SYMBOL"),
                OrgDb = org.Hs.eg.db) 
head(gene.df)
#   ENTREZID         ENSEMBL SYMBOL
# 1     4312 ENSG00000196611   MMP1
# 2     8318 ENSG00000093009  CDC45
# 3    10874 ENSG00000109255    NMU
# 4    55143 ENSG00000134690  CDCA8
# 5    55388 ENSG00000065328  MCM10
# 6      991 ENSG00000117399  CDC20
               OrgDb    = org.Hs.eg.db,
               ont      = "CC", 
               level    = 3,     ## Specific GO Level.
               readable = TRUE)  ## the gene IDs will mapping to gene symbols.
head(ggo)
#                    ID                    Description Count GeneRatio
# GO:0005886 GO:0005886                plasma membrane    52    52/207
# GO:0005628 GO:0005628              prospore membrane     0     0/207
# GO:0005789 GO:0005789 endoplasmic reticulum membrane     8     8/207
# GO:0019867 GO:0019867                 outer membrane     2     2/207
#                                                                                         #                                                                                         #                                                                                         #                                                          geneID
# GO:0005886 S100A9/MELK/S100A8/MARCO/ASPM/CXCL10/LAMP3/CEP55/UGT8/UBE2C/SLC7A5/CXCL9/FADS2/MSLN/IL1R2/KIF18A/S100P/GZMB/TRAT1/GABRP/AQP9/GPR19/SLC2A6/KIF20A/LAG3/NUDT1/CACNA1D/VSTM4/CORIN/KCNK15/CA12/KCNE4/HLA-DQA1/ADH1B/PDZK1/C7/ACKR1/COL17A1/PSD3/EMCN/SLC44A4/LRP2/NLGN4X/MAPT/ERBB4/CX3CR1/LAMP5/ABCA8/STEAP4/PTPRT/TMC5/CYBRD1
# GO:0005628                                                                             #                                                                                         #                                                                                         #                                                                 
# GO:0005789                                                                             #                                                                                         #                                                                                         #               FADS2/CDK1/CHODL/ITPR1/HLA-DQA1/CYP4F8/CYP4B1/FMO5
# GO:0019867                                                                             #                                                                                         #                                                                                         #                                                         MAOB/PGR
Level 1 provides the highest list coverage with the least amount of term specificity. With each increasing level coverage decreases while specificity increases so that level 5 provides the least amount of coverage with the highest term specificity. (Dennis, Glynn, et al)

GO ORA

ego <- enrichGO(gene          = gene,
                universe      = names(geneList),
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
        readable      = TRUE)
head(ego)
#                    ID                              Description GeneRatio   BgRatio       pvalue     p.adjust       qvalue
# GO:0000779 GO:0000779 condensed chromosome, centromeric region    14/197  85/11434 1.662431e-10 3.802012e-08 3.572360e-08
# GO:0000776 GO:0000776                              kinetochore    15/197 104/11434 2.551686e-10 3.802012e-08 3.572360e-08
# GO:0000775 GO:0000775           chromosome, centromeric region    17/197 147/11434 5.495085e-10 5.458451e-08 5.128746e-08
# GO:0000793 GO:0000793                     condensed chromosome    17/197 154/11434 1.141189e-09 8.501857e-08 7.988322e-08
#                                                                                         #                  geneID Count
# GO:0000779                   CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/CDT1/BIRC5/NCAPG/AURKB/CCNB1    14
# GO:0000776              CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/AURKB/CCNB1    15
# GO:0000775  CDCA8/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/CCNB1    17
# GO:0000793 CENPE/NDC80/TOP2A/NCAPH/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/CDT1/BIRC5/NCAPG/AURKB/CHEK1/CCNB1    17

如果直接调用OrgDb中的 ID,则需要在参数中确定 'KeyType'.

ego2 <- enrichGO(gene         = gene.df$ENSEMBL,
                OrgDb         = org.Hs.eg.db,
                keyType       = 'ENSEMBL',
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)

通过设置参数readable=Ture 或 函数 setReadable()可以将 Gene ID转换为 symbol.

ego2 <- setReadable(ego2, OrgDb = org.Hs.eg.db)

before:


图片.png

after:


图片.png

5.3.1 移除特定的 GO term 或 GO level

函数 dropGO 可以在由 enrichGO 得到的结果中移除特定的 GO term 或 GO level.
5.3.2 将结果限定在特定的 Go level

enrichGO() 不含设置 GO level 的参数,而函数 gofilter() 可以将结果限定在特定的 GO level.
5.3.3 GO term 去冗余
rmredunego <- simplify(ego, cutoff=0.7, by="p.adjust", select_fun=min)

before:

after:

GO GSEA

ego3 <- gseGO(geneList     = geneList,
              OrgDb        = org.Hs.eg.db,
              ont          = "CC",
              nPerm        = 1000,   ## 排列数
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.05,
              verbose      = FALSE)  ## 不输出结果

KEGG分析

函数 search_kegg_organism()可以帮助搜索 KEGG 数据库支持的物种。

物种缩写戳这里:https://www.genome.jp/kegg/catalog/org_list.html

earch_kegg_organism('ece', by='kegg_code')   
#     kegg_code                        scientific_name common_name
# 366       ece Escherichia coli O157:H7 EDL933 (EHEC)        <NA>
ecoli <- search_kegg_organism('Escherichia coli', by='scientific_name')
dim(ecoli)
# [1] 65  3
head(ecoli)
#     kegg_code                        scientific_name common_name
# 361       eco           Escherichia coli K-12 MG1655        <NA>
# 362       ecj            Escherichia coli K-12 W3110        <NA>
# 363       ecd            Escherichia coli K-12 DH10B        <NA>
# 364       ebw                Escherichia coli BW2952        <NA>
# 365      ecok            Escherichia coli K-12 MDS42        <NA>
# 366       ece Escherichia coli O157:H7 EDL933 (EHEC)        <NA>
search_kegg_organism('hsa', by='kegg_code')
#   kegg_code scientific_name common_name
# 1       hsa    Homo sapiens       human
dim(hsapiens)   
[1] 1 3
hsapiens <- search_kegg_organism('Homo sapiens', by='scientific_name')
hsapiens
#   kegg_code scientific_name common_name
# 1       hsa    Homo sapiens       human
KEGG ORA
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]

kk <- enrichKEGG(gene         = gene,
                 organism     = 'hsa',
                 pvalueCutoff = 0.05)
head(kk)
#                ID
# hsa04110 hsa04110
# hsa04114 hsa04114
# hsa04218 hsa04218
# hsa04061 hsa04061
# hsa03320 hsa03320
# hsa04914 hsa04914
#                                                            Description
# hsa04110                                                    Cell cycle
# hsa04114                                                Oocyte meiosis
# hsa04218                                           Cellular senescence
# hsa04061 Viral protein interaction with cytokine and cytokine receptor
# hsa03320                                        PPAR signaling pathway
# hsa04914                       Progesterone-mediated oocyte maturation
#          GeneRatio  BgRatio       pvalue     p.adjust       qvalue
# hsa04110     11/93 124/7861 1.966898e-07 3.815781e-05 3.726753e-05
# hsa04114     10/93 125/7861 1.907932e-06 1.850694e-04 1.807515e-04
# hsa04218     10/93 160/7861 1.742571e-05 1.048593e-03 1.024128e-03
# hsa04061      8/93 100/7861 2.162048e-05 1.048593e-03 1.024128e-03
# hsa03320      7/93  76/7861 2.906892e-05 1.127874e-03 1.101559e-03
# hsa04914      7/93  99/7861 1.587827e-04 5.133975e-03 5.014191e-03
#                                                      geneID Count
# hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232    11
# hsa04114    991/9133/983/4085/51806/6790/891/9232/3708/5241    10
# hsa04218     2305/4605/9133/890/983/51806/1111/891/776/3708    10
# hsa04061           3627/10563/6373/4283/6362/6355/9547/1524     8
# hsa03320                 4312/9415/9370/5105/2167/3158/5346     7
# hsa04914                    9133/890/983/4085/6790/891/5241     7

KEGG GSEA

kk2 <- gseKEGG(geneList     = geneList,
               organism     = 'hsa',
               nPerm        = 1000,
               minGSSize    = 120,
               pvalueCutoff = 0.05,
               verbose      = FALSE)
head(kk2,n=3)
#                ID                Description setSize enrichmentScore
# hsa04151 hsa04151 PI3K-Akt signaling pathway     322      -0.3482755
# hsa04510 hsa04510             Focal adhesion     188      -0.4188582
# hsa03013 hsa03013              RNA transport     131       0.4116488
#                NES      pvalue   p.adjust    qvalues rank
# hsa04151 -1.498250 0.001278772 0.01648352 0.01098901 1997
# hsa04510 -1.712296 0.001390821 0.01648352 0.01098901 2183
# hsa03013  1.750526 0.003067485 0.01648352 0.01098901 3383
#                            leading_edge
# hsa04151 tags=23%, list=16%, signal=20%
# hsa04510 tags=27%, list=17%, signal=23%
# hsa03013 tags=40%, list=27%, signal=29%
#                                                              core_enrichment
# hsa04151 2252/7059/92579/5563/5295/6794/1288/7010/3910/3371/3082/1291/4602/3791/1027/90993/3441/3643/1129/2322/1975/7450/596/3685/1942/2149/1280/4804/3675/595/2261/7248/2246/4803/3912/1902/1278/1277/2846/2057/1293/2247/55970/5618/7058/10161/56034/3693/4254/3480/4908/5159/1292/3908/2690/3909/8817/9223/4915/3551/2791/63923/3913/9863/3667/1287/3679/7060/3479/80310/1311/5105/2066/1101
# hsa04510                                                                                                                         5228/7424/1499/4636/83660/7059/5295/1288/23396/3910/3371/3082/1291/394/3791/7450/596/3685/1280/3675/595/2318/3912/1793/1278/1277/1293/10398/55742/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/1287/3679/7060/3479/10451/80310/1311/1101
# hsa03013                                                                                         10460/1978/55110/54913/9688/8894/11260/10799/9631/4116/5042/8761/6396/23165/8662/10248/55706/79833/9775/29107/23636/5905/9513/5901/10775/10557/4927/79902/1981/26986/11171/10762/8480/8891/11097/26019/10940/4686/9972/81929/10556/3646/9470/387082/1977/57122/8563/7514/79023/3837/9818/56000

重头戏——功能富集结果的可视化

这时要用到另一个包:enrichplot,它可以实现多种可视化,更好地说明富集分析结果。

library(enrichplot)

Bar plot

oragene <- enrichDGN(gene)
barplot(oragene,showCategory = 20)
## 该函数默认参数为:
   enrichDGN(gene, pvalueCutoff = 0.05, pAdjustMethod = "BH", universe,
   minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2,
   readable = FALSE)
图片.png

Dot Plot

gseagene <- gseNCG(geneList, nPerm=10000)
p1 <- dotplot(oragene, showCategory=30) + ggtitle("dotplot for ORA")
p2 <- dotplot(gseagene, showCategory=30) + ggtitle("dotplot for GSEA")
plot_grid(p1, p2, ncol=2)
## 一个瘦一个胖好丑啊
图片.png

Gene-Concept Network

前面的 两款神器 两个函数,都只能展示富集最显著的 GO term,而函数 cnetplot()可以将基因与生物学概念 (e.g.* GO terms or KEGG pathways) 的关系绘制成网状图。

oragnx <- setReadable(oragene, 'org.Hs.eg.db', 'ENTREZID')  ## 将 Gene ID 转换为 symbol
cnetplot(orangx, foldChange=geneList)
图片.png
cnetplot(oragnx, categorySize="pvalue", foldChange=geneList)  ## categorySize 可以是 "pvalue" 或 "geneNum" 
图片.png
cnetplot(oragnx, foldChange=geneList, circular = TRUE, colorEdge = TRUE)  ## 圆形布局,给线条上色
图片.png

Heatmap

热图能够简化结果,更容易分辨表达模式 (expression patterns) 。
heatplot(oragnx, foldChange=geneList)

Enrichment Map

Enrichment Map 可以将富集条目和重叠的基因集整合为一个网络图,相互重叠的基因集则趋向于成簇,从而易于分辨功能模型。

emapplot(oragene)
图片.png

UpSet Plot

函数 upsetplot()cneplot()的一种替代方案,用于可视化基因与基因集间的复杂关联,而upsetplot()更着重于不同基因集间基因的重叠情况。

upsetplot(oragene)
图片.png

GSEA结果的表达分布叠嶂图 (ridgeline plot for expression distribution of GSEA result)

更直观地展示上调/下调的通路。

ridgeplot(gseagene)
图片.png

富集条目在 PubMed 上的趋势

函数pmcplot()可以就某些富集 条目/通路 绘制其在 PubMed 上的 数量/比例 折线图。

terms <- oragene$Description[1:3]
p <- pmcplot(terms, 2012:2019) ## 默认为proportion=TRUE
p2 <- pmcplot(terms, 2012:2019, proportion=FALSE)
plot_grid(p, p2, ncol=2)
图片.png

goplot

函数goplot() 接受enrichGO()的输出,并将其可视化。

图片.png

browseKEGG

函数browseKEGG 可以帮你打开浏览器,嗯。

browseKEGG(kk, 'hsa04110')

pathview 包里的 上帝视角 PATHVIEW!

library(pathview)
hsa04110 <- pathview(gene.data  = geneList,
                     pathway.id = "hsa04110",
                     species    = "hsa",
                     limit      = list(gene=max(abs(geneList)), cpd=1)) ## cpd, compound
# Info: Downloading xml files for hsa04110, 1/1 pathways..
# Info: Downloading png files for hsa04110, 1/1 pathways..
# 'select()' returned 1:1 mapping between keys and columns
# Info: Working in directory /YOUR PATH/Project/clusterProfiler
# Info: Writing image file hsa04110.pathview.png
图片.png

作者:鬼笔环钛汰太C_137
链接:https://www.jianshu.com/p/e133ab3169fa
来源:简书
著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。

上一篇下一篇

猜你喜欢

热点阅读