Clusterprofiler的一些技巧

2021-12-31 本文已影响0人一只烟酒僧

一、获得注释包中的GO信息

library(GOSemSim)
library(org.Hs.eg.db)
hsGO_BP<-godata("org.Hs.eg.db",keytype = "ENTREZID",ont = "BP",computeIC = T)

二、下载最新的KEGG数据库

download_KEGG("mmu",keyType = "ncbi-geneid")  ->res

三、将GOid转化为term

go2term("GO:0002576")

四、计算词条相似性

计算相似性的原理可以参考：https://yulab-smu.top/biomedical-knowledge-mining-book/semantic-similarity-overview.html
原文中会有词条、基因和基因集合的语义相似性分析，这里只说词条相似性

library(GOSemSim)
library(org.Hs.eg.db)


hsGO_BP<-godata("org.Hs.eg.db",keytype = "ENTREZID",ont = "BP",computeIC = T)

#随机取1000个term
set.seed(2022)
gobp_sample<-sample(hsGO_BP@geneAnno$GO%>%unique(),1000)

#-----------------------------------------------------------------------------
system.time(
  {
    gobp_similarity_wang<-mgoSim(gobp_sample,gobp_sample,hsGO_BP,measure = "Wang",combine = NULL)
  }
)

#-----------------------------------------------------------------------------
system.time(
  {
    gobp_similarity_lin<-mgoSim(hsGO_BP@geneAnno$GO%>%unique(),hsGO_BP@geneAnno$GO%>%unique(),hsGO_BP,measure = "Lin",combine = NULL)
  }
)

#-----------------------------------------------------------------------------
system.time(
  {
    gobp_similarity_rel<-mgoSim(gobp_sample,gobp_sample,hsGO_BP,measure = "Rel",combine = NULL)
  }
)

#-----------------------------------------------------------------------------

system.time(
  {
    gobp_similarity_Resnik<-mgoSim(gobp_sample,gobp_sample,hsGO_BP,measure = "Resnik",combine = NULL)
  }
)

#-----------------------------------------------------------------------------

system.time(
  {
    gobp_similarity_jiang<-mgoSim(gobp_sample,gobp_sample,hsGO_BP,measure = "Jiang",combine = NULL)
  }
)

微信截图_20211231145540.png

这些结果可以用来对GO term去冗余

#-----------------------------------------------------------------------------
#对前100做统计

gobp_similarity_wang_filter<-gobp_similarity_wang[1:100,1:100]

gobp_similarity_wang_filter[gobp_similarity_wang_filter<0.3]=0
rownames(gobp_similarity_wang_filter)<-rownames(gobp_similarity_wang_filter)%>%sapply(.,function(x){go2term(x)%>%pull(Term)})

gobp_similarity_wang_filter%>%pheatmap()

image.png

当然在这个包内本身存在一个函数simplify专门用来去冗余，其原理为：

simplify <- function(enrichResult, cutoff=0.7, by="p.adjust", select_fun=min) {
     ## GO terms that have semantic similarity higher than `cutoff` are treated as redundant terms
     ## select one representative term by applying `select_fun` to feature specifying by `by`.
     ## user can defined their own `select_fun` function.

     ## return an updated `enrichResult` object.
}

但是这只能用来对富集分析的结果用，同时会删除那些冗余的term。而像david和metascape在进行富集分析的时候，尤其是metascape，会保留冗余的term，同时把代表性的term(p最小)定义为summary，而其它term定义为member。

五、做注释分析(为每个基因找到它们的GO term)

有点慢，建议用godata()的返回值自己检索

groupGO("Hmgcs2",org.Mm.eg.db,"SYMBOL","BP",level = 5) ->res
res@result%>%filter(Count>0)

                   ID                                            Description Count GeneRatio geneID
GO:0006139 GO:0006139       nucleobase-containing compound metabolic process     1       1/1 Hmgcs2
GO:0043603 GO:0043603                       cellular amide metabolic process     1       1/1 Hmgcs2
GO:0072521 GO:0072521           purine-containing compound metabolic process     1       1/1 Hmgcs2
GO:0008299 GO:0008299                        isoprenoid biosynthetic process     1       1/1 Hmgcs2
GO:0008654 GO:0008654                      phospholipid biosynthetic process     1       1/1 Hmgcs2
GO:0046951 GO:0046951                       ketone body biosynthetic process     1       1/1 Hmgcs2
GO:0046165 GO:0046165                           alcohol biosynthetic process     1       1/1 Hmgcs2
GO:0008610 GO:0008610                             lipid biosynthetic process     1       1/1 Hmgcs2
GO:0090407 GO:0090407                   organophosphate biosynthetic process     1       1/1 Hmgcs2
GO:1901362 GO:1901362           organic cyclic compound biosynthetic process     1       1/1 Hmgcs2
GO:1901570 GO:1901570             fatty acid derivative biosynthetic process     1       1/1 Hmgcs2
GO:1901617 GO:1901617          organic hydroxy compound biosynthetic process     1       1/1 Hmgcs2
GO:1902224 GO:1902224                          ketone body metabolic process     1       1/1 Hmgcs2
GO:0035383 GO:0035383                            thioester metabolic process     1       1/1 Hmgcs2
GO:0006796 GO:0006796        phosphate-containing compound metabolic process     1       1/1 Hmgcs2
GO:0019637 GO:0019637                      organophosphate metabolic process     1       1/1 Hmgcs2
GO:0006644 GO:0006644                         phospholipid metabolic process     1       1/1 Hmgcs2
GO:0006720 GO:0006720                           isoprenoid metabolic process     1       1/1 Hmgcs2
GO:0055086 GO:0055086 nucleobase-containing small molecule metabolic process     1       1/1 Hmgcs2
GO:0008202 GO:0008202                              steroid metabolic process     1       1/1 Hmgcs2
GO:0044255 GO:0044255                       cellular lipid metabolic process     1       1/1 Hmgcs2
GO:1902652 GO:1902652                    secondary alcohol metabolic process     1       1/1 Hmgcs2
GO:0006753 GO:0006753                 nucleoside phosphate metabolic process     1       1/1 Hmgcs2
GO:0019693 GO:0019693                     ribose phosphate metabolic process     1       1/1 Hmgcs2
GO:0006637 GO:0006637                             acyl-CoA metabolic process     1       1/1 Hmgcs2
GO:0006066 GO:0006066                              alcohol metabolic process     1       1/1 Hmgcs2
GO:0016125 GO:0016125                               sterol metabolic process     1       1/1 Hmgcs2

六、对KEGG的module进行富集

相比于kegg通路，module反映了通路内局部的反应过程，代表了更准确的信息

mkk <- enrichMKEGG(gene = gene,
                   organism = 'hsa',
                   pvalueCutoff = 1,
                   qvalueCutoff = 1)
head(mkk)



##            ID                                             Description GeneRatio
## M00912 M00912      NAD biosynthesis, tryptophan => quinolinate => NAD       2/9
## M00095 M00095          C5 isoprenoid biosynthesis, mevalonate pathway       1/9
## M00053 M00053 Pyrimidine deoxyribonuleotide biosynthesis, CDP => dCTP       1/9
## M00938 M00938 Pyrimidine deoxyribonuleotide biosynthesis, UDP => dTTP       1/9
## M00003 M00003            Gluconeogenesis, oxaloacetate => fructose-6P       1/9
## M00049 M00049     Adenine ribonucleotide biosynthesis, IMP => ADP,ATP       1/9
##        BgRatio      pvalue   p.adjust     qvalue     geneID Count
## M00912  12/829 0.006541756 0.03925053 0.03443029 23475/3620     2
## M00095  10/829 0.103949536 0.20710097 0.18166751       3158     1
## M00053  11/829 0.113796244 0.20710097 0.18166751       6241     1
## M00938  14/829 0.142761862 0.20710097 0.18166751       6241     1
## M00003  18/829 0.180072577 0.20710097 0.18166751       5105     1
## M00049  21/829 0.207100966 0.20710097 0.18166751      26289     1

七、对KEGG通路可视化

相当于生成一条kegg库书写规则的url，至于进一步可视化，可以使用pathwayview完成

browseKEGG(kk, 'hsa04110')