生物信息学从零开始学菜🐣日记——走R包数据科学与R语言

topGO:大概是致力于GO分析方法论的R包

2019-06-11  本文已影响20人  美式永不加糖

Gene set enrichment analysis with topGO

Alexa A, Rahnenfuhrer J (2019). topGO: Enrichment Analysis for Gene Ontology. R package version 2.36.0.

DOI: 10.18129/B9.bioc.topGO

1. 关于 topGO & 安装三连

topGO 的设计旨在实现 GO term 的半自动富集分析。一个主要优点是它提供的统一化的基因集检测方案 (the unified gene set testing framework) 。接下来是对这个主要优点的展开,Besides, also 句型——轻松地应用函数完成GO富集分析、容易地执行新的统计学检测或新算法、对比不同的GO富集研究方法。

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("topGO")
library(topGO)
ls("package:topGO")
# [1] "algorithm"           "algorithm<-"        
# [3] "allGenes"            "allMembers"         
# [5] "allMembers<-"        "allParents"         
# [7] "allScore"            "annFUN"             
# [9] "annFUN.db"           "annFUN.file"        
# [11] "annFUN.gene2GO"      "annFUN.GO2genes"    
# [13] "annFUN.org"          "attrInTerm"         
# [15] "buildLevels"         "combineResults"     
# [17] "contTable"           "countGenesInTerm"   
# [19] "cutOff"              "cutOff<-"           
# [21] "depth"               "depth<-"            
# [23] "description"         "description<-"      
# [25] "elim"                "elim<-"             
# [27] "expressionMatrix"    "feasible"           
# [29] "feasible<-"          "geneData"           
# [31] "geneData<-"          "genes"              
# [33] "geneScore"           "geneSelectionFun"   
# [35] "geneSelectionFun<-"  "genesInTerm"        
# [37] "GenTable"            "getGraphRoot"       
# [39] "getNoOfLevels"       "getPvalues"         
# [41] "getSigGroups"        "getSigRatio"        
# [43] "GOBPTerm"            "GOCCTerm"           
# [45] "GOFisherTest"        "GOglobalTest"       
# [47] "GOKSTest"            "GOKSTiesTest"       
# [49] "GOMFTerm"            "GOplot"             
# [51] "GOSumTest"           "GOtTest"            
# [53] "graph"               "graph<-"            
# [55] "groupGOTerms"        "inducedGraph"       
# [57] "initialize"          "inverseList"        
# [59] "joinFun"             "members"            
# [61] "members<-"           "membersExpr"        
# [63] "membersScore"        "Name"               
# [65] "Name<-"              "nodesInInducedGraph"
# [67] "numAllMembers"       "numGenes"           
# [69] "numMembers"          "numSigAll"          
# [71] "numSigGenes"         "numSigMembers"      
# [73] "ontology"            "ontology<-"         
# [75] "penalise"            "permSumStats"       
# [77] "permSumStats.all"    "phenotype"          
# [79] "print"               "printGenes"         
# [81] "printGraph"          "pType"              
# [83] "pType<-"             "rankMembers"        
# [85] "readMappings"        "reverseArch"        
# [87] "runTest"             "score"              
# [89] "score<-"             "scoreOrder"         
# [91] "scoresInTerm"        "show"               
# [93] "showGroupDensity"    "showSigOfNodes"     
# [95] "sigAllMembers"       "sigGenes"           
# [97] "sigMembers"          "sigMembers<-"       
# [99] "sigRatio<-"          "termStat"           
# [101] "testName"            "testName<-"         
# [103] "testStatistic"       "testStatPar"        
# [105] "updateGenes"         "updateGroup"        
# [107] "updateTerm<-"        "usedGO"             
# [109] "Weights"             "Weights<-"          
# [111] "whichAlgorithms"     "whichTests"   

2. 数据准备

调用 ALL 包的内置数据。

library(ALL)
data("ALL")
data(geneList) ## geneList, a list of genes and the respective p-values for differential expression
affyLib <- paste(annotation(ALL), "db", sep = ".") 
library(package = affyLib, character.only = TRUE)
# 载入需要的程辑包:org.Hs.eg.db
## 载入多个包时,加上 character.only = TRUE 参数,可以确保正确载入

函数 topDiffGenes() 可以在 p-values=0.01 的显著水平上筛选 geneList 中的差异表达基因。

sum(topDiffGenes(geneList)) 
# [1] 50

万事俱备,开始 构建 topGOdata ,这个对象包含全部的基因指标 (gene

identifiers) 和其数值、GO注释、GO 层次结构 (GO hierarchical structure) 以及其他用于富集分析的信息。

sampleGOdata <- new("topGOdata",
                    description = "Simple session",
                    ontology = "BP",
                    allGenes = geneList,
                    geneSel = topDiffGenes, ## 设定筛选函数
                    nodeSize = 10, ## 删去少于10个注释基因的GO term
                    annot = annFUN.db, ## maps genes identifiers to GO terms from the affyLib object
                    affyLib = affyLib)
sampleGOdata
# 
# ------------------------- topGOdata object -------------------------
# 
#  Description:
#    -  Simple session 
# 
#  Ontology:
#    -  BP 
# 
#  323 available genes (all genes from the array):
#    - symbol:  1095_s_at 1130_at 1196_at 1329_s_at 1340_s_at  ...
#    - score :  1 1 0.62238 0.541224 1  ...
#    - 50  significant genes. 
# 
#  309 feasible genes (genes that can be used in the analysis):
#    - symbol:  1095_s_at 1130_at 1196_at 1329_s_at 1340_s_at  ...
#    - score :  1 1 0.62238 0.541224 1  ...
#    - 45  significant genes. 
# 
#  GO graph (nodes with at least  10  genes):
#    - a graph with directed edges
#    - number of nodes = 1072 
#    - number of edges = 2319 
# 
# ------------------------- topGOdata object -------------------------

3. 富集检测

3.1 经典 over-representation GO富集分析

resultFisher <- runTest(sampleGOdata, algorithm = "classic", statistic = "fisher") 
# 
#            -- Classic Algorithm -- 
# 
#        the algorithm is scoring 964 nontrivial nodes
#        parameters: 
#            test statistic: fisher
resultFisher
# 
# Description: Simple session 
# Ontology: BP 
# 'classic' algorithm with the 'fisher' test
# 1072 GO terms scored: 45 terms with p < 0.01
# Annotation data:
#     Annotated genes: 309 
#     Significant genes: 45 
#     Min. no. of genes annotated to a GO: 10 
#     Nontrivial nodes: 964 
class(resultFisher)
# [1] "topGOresult"
# attr(,"package")
# [1] "topGO"

3.2 利用 Kolmogorov-Smirnov like test 的富集分析

两种方法:classic, elim

'elim' : this method processes the GO terms by traversing the GO hierarchy from bottom to top.

'classic': each GO term is tested independently, not taking the GO hierarchy into account.

resultKS.classic <- runTest(sampleGOdata, algorithm = "classic", statistic = "ks") 
# 
#            -- Classic Algorithm -- 
# 
#        the algorithm is scoring 1072 nontrivial nodes
#        parameters: 
#            test statistic: ks
#            score order: increasing
resultKS.classic
# 
# Description: Simple session 
# Ontology: BP 
# 'classic' algorithm with the 'ks' test
# 1072 GO terms scored: 83 terms with p < 0.01
# Annotation data:
#     Annotated genes: 309 
#     Significant genes: 45 
#     Min. no. of genes annotated to a GO: 10 
#     Nontrivial nodes: 1072 
resultKS.elim <- runTest(sampleGOdata, algorithm = "elim", statistic = "ks")
# 
#            -- Elim Algorithm -- 
# 
#        the algorithm is scoring 1072 nontrivial nodes
#        parameters: 
#            test statistic: ks
#            cutOff: 0.01
#            score order: increasing
resultKS.elim
# 
# Description: Simple session 
# Ontology: BP 
# 'elim' algorithm with the 'ks : 0.01' test
# 1072 GO terms scored: 24 terms with p < 0.01
# Annotation data:
#     Annotated genes: 309 
#     Significant genes: 45 
#     Min. no. of genes annotated to a GO: 10 
#     Nontrivial nodes: 1072 

ps. 查看所有允许的算法和T检验方法

whichAlgorithms() 
# [1] "classic"     "elim"        "weight"      "weight01"   
# [5] "lea"         "parentchild"
whichTests() 
# [1] "fisher"     "ks"         "t"          "globaltest"
# [5] "sum"        "ks.ties"   

4. 结果分析

4.1 得到富集显著 GO term 的 data frame

函数 GenTable() 可用于分析富集最为显著的 GO term 和相对应的p值。

allRes <- GenTable(sampleGOdata,classicFisher = resultFisher,
                   classicKS = resultKS.classic,
                   elimKS = resultKS.elim,
                   orderBy = "elimKS",ranksOf = "classicFisher",
                   topNodes = 10)

4.2 不同算法所得p值的对比及可视化

函数 score() 可以得到 topGOresult 对象中 GO term 的p值。

pValue.classic <- score(resultKS.classic) 
pValue.elim <- score(resultKS.elim)[names(pValue.classic)] 
gstat <- termStat(sampleGOdata, names(pValue.classic))  ## to summarise the statistics
colMap <- function(x) {
  .col <- rep(rev(heat.colors(length(unique(x)))), time = table(x))
  return(.col[match(1:length(x), order(x))])
}
gCol <- colMap(gstat$Significant)
plot(pValue.classic, pValue.elim, 
     xlab = "p-value classic", 
     ylab = "p-value elim", 
     pch = 19, cex = gSize, col = gCol)

plot(log10(pValue.classic), log10(pValue.elim), 
     xlab = "lg(p-value) classic", 
     ylab = "lg(p-value) elim", 
     pch = 19, cex = gSize, col = gCol)

可以看出两种方法得到的p值确实是有差异的,所以显然存在一些被A方法认定显著——而在B方法下不显著的 GO term. 可以通过下面的方法找到它们:

sel.go <- names(pValue.classic)[pValue.elim < pValue.classic] 
cbind(termStat(sampleGOdata, sel.go), 
      elim = pValue.elim[sel.go],
      classic = pValue.classic[sel.go])
#            Annotated Significant Expected       elim    classic
# GO:0006807       224          36    32.62 0.44604086 0.53079040
# GO:0019538       164          23    23.88 0.37042531 0.47406305
# GO:0043170       206          35    30.00 0.05327402 0.06576399
# GO:1901564       189          26    27.52 0.73230230 0.90604207

4.3 GO Graph

方形为显著性前五的 GO term.

showSigOfNodes(sampleGOdata, score(resultKS.elim), 
               firstSigNodes = 5, useInfo = 'all')

5. 其他的可视化技能

5.1 p值的直方图

hist(pValue.classic, 50, xlab = "p-values_KSclassic")

5.2 分析单个GO term

函数 showGroupDensity() 可以解释显著富集到某个GO term的基因是否比其他基因有更高的分值。

goID <- allRes[1, "GO.ID"]
showGroupDensity(sampleGOdata, goID, ranks = TRUE)

References

悄咪咪吐槽一下作者

最后,向大家隆重推荐生信技能树的一系列干货!

  1. 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
上一篇下一篇

猜你喜欢

热点阅读