
BiocPkgTools: Analyzing R Package Relationships in Bioconductor

2019-10-29  周运来就是我

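Bioconductor has a rich ecosystem of metadata about its packages, their usage, and their build status. BiocPkgTools is a simple collection of functions for accessing that metadata from R in a tidy data format. The goal is to expose the metadata for data mining and value-added functionality such as package search, text mining, and package analysis.

Functionality includes access to: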

Download statistics
General package listing
Build reports
Package dependency graphs
Vignettes

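The Bioconductor build reports are available online as HTML pages. However, they are not computation-friendly. The biocBuildReport function parses the HTML and returns a tidy data frame, which makes it easier to analyze how R packages in Bioconductor relate to one another and helps users find and explore existing packages.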

library(BiocPkgTools)
head(biocBuildReport())

## # A tibble: 6 x 9
##   pkg   version author commit last_changed_date   node  stage result
##   <chr> <chr>   <chr>  <chr>  <dttm>              <chr> <chr> <chr> 
## 1 a4     1.31.0 Tobia…  a53c… 2018-10-30 00:00:00 malb… inst… OK    
## 2 a4     1.31.0 Tobia…  a53c… 2018-10-30 00:00:00 malb… buil… OK    
## 3 a4     1.31.0 Tobia…  a53c… 2018-10-30 00:00:00 malb… chec… OK    
## 4 a4     1.31.0 Tobia…  a53c… 2018-10-30 00:00:00 toka… inst… OK    
## 5 a4     1.31.0 Tobia…  a53c… 2018-10-30 00:00:00 toka… buil… OK    
## 6 a4     1.31.0 Tobia…  a53c… 2018-10-30 00:00:00 toka… chec… OK    
## # … with 1 more variable: bioc_version <chr>
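
Once the report is a tidy data frame, ordinary dplyr verbs can summarize it. The sketch below is only an illustration: it does not call biocBuildReport() (which needs network access) and instead uses a small hand-made table. The package names, node name, stage labels, and the non-OK result labels are all hypothetical; only the column names come from the output above.

```r
library(dplyr)

# Hypothetical stand-in for biocBuildReport() output. Every value here is
# invented for illustration; only the column names match the real report.
toy_report <- data.frame(
  pkg    = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB", "pkgB"),
  node   = "node1",
  stage  = c("install", "build", "check", "install", "build", "check"),
  result = c("OK", "OK", "WARNINGS", "OK", "ERROR", "ERROR"),
  stringsAsFactors = FALSE
)

# Keep only the stages that did not finish with "OK" and count them
# per package and result type.
problems <- toy_report %>%
  filter(result != "OK") %>%
  count(pkg, result, name = "n_stages")

problems
```

The same pipeline should apply to the real report as long as its `result` column marks successful stages with "OK", as the printed rows suggest.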

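Because developers may want a quick view of their own packages, there is a simple function, problemPage, that generates an HTML report of the build status of packages matching a given author regex. By default, only "problem" build statuses (errors and warnings) are reported.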

problemPage()

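Bioconductor provides download statistics for all of its packages. The biocDownloadStats function fetches all available download statistics for every package in the experiment data, annotation data, and software repositories. The result is returned as a tidy data frame for further analysis.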

head(biocDownloadStats())

## # A tibble: 6 x 7
##   Package  Year Month Nb_of_distinct_IPs Nb_of_downloads repo     Date      
##   <chr>   <int> <chr>              <int>           <int> <chr>    <date>    
## 1 ABarray  2019 Jan                  104             210 Software 2019-01-01
## 2 ABarray  2019 Feb                   80             164 Software 2019-02-01
## 3 ABarray  2019 Mar                  144             192 Software 2019-03-01
## 4 ABarray  2019 Apr                  140             259 Software 2019-04-01
## 5 ABarray  2019 May                    0               0 Software 2019-05-01
## 6 ABarray  2019 Jun                    0               0 Software 2019-06-01
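
As a small example of the "further analysis" this format allows, the sketch below totals downloads per package and year. To stay runnable offline it retypes the six ABarray rows printed above instead of calling biocDownloadStats(); the object names `stats` and `yearly` are chosen here for illustration.

```r
library(dplyr)

# The six ABarray rows printed above, retyped by hand as a stand-in for
# biocDownloadStats() output (the repo and Date columns are left out).
stats <- data.frame(
  Package            = "ABarray",
  Year               = 2019L,
  Month              = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"),
  Nb_of_distinct_IPs = c(104L, 80L, 144L, 140L, 0L, 0L),
  Nb_of_downloads    = c(210L, 164L, 192L, 259L, 0L, 0L),
  stringsAsFactors   = FALSE
)

# One row per package and year, with the monthly downloads summed.
yearly <- stats %>%
  group_by(Package, Year) %>%
  summarise(total_downloads = sum(Nb_of_downloads), .groups = "drop")

yearly
```

The `.groups` argument assumes dplyr 1.0.0 or later.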

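The DESCRIPTION file of every R package contains a wealth of information about the package's authors, dependencies, version, and so on. In a repository such as Bioconductor, these details are available for all included packages. biocPkgList returns a data frame with a row for each package. A large amount of information is available, as the column names of the result show.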

bpi = biocPkgList()
colnames(bpi)

##  [1] "Package"                   "Version"                  
##  [3] "Depends"                   "Suggests"                 
##  [5] "License"                   "MD5sum"                   
##  [7] "NeedsCompilation"          "Title"                    
##  [9] "Description"               "biocViews"                
## [11] "Author"                    "Maintainer"               
## [13] "git_url"                   "git_branch"               
## [15] "git_last_commit"           "git_last_commit_date"     
## [17] "Date/Publication"          "source.ver"               
## [19] "win.binary.ver"            "mac.binary.el-capitan.ver"
## [21] "vignettes"                 "vignetteTitles"           
## [23] "hasREADME"                 "hasNEWS"                  
## [25] "hasINSTALL"                "hasLICENSE"               
## [27] "Rfiles"                    "Enhances"                 
## [29] "dependsOnMe"               "Imports"                  
## [31] "importsMe"                 "suggestsMe"               
## [33] "LinkingTo"                 "Archs"                    
## [35] "VignetteBuilder"           "URL"                      
## [37] "SystemRequirements"        "BugReports"               
## [39] "Video"                     "linksToMe"                
## [41] "OS_type"                   "License_restricts_use"    
## [43] "PackageStatus"             "License_is_FOSS"          
## [45] "organism"

head(bpi)

## # A tibble: 6 x 45
##   Package Version Depends Suggests License MD5sum NeedsCompilation Title
##   <chr>   <chr>   <list>  <list>   <chr>   <chr>  <chr>            <chr>
## 1 a4      1.31.0  <chr [… <chr [4… GPL-3   31072… no               Auto…
## 2 a4Base  1.31.0  <chr [… <chr [2… GPL-3   2dec7… no               Auto…
## 3 a4Clas… 1.31.0  <chr [… <chr [1… GPL-3   4bbcd… no               Auto…
## 4 a4Core  1.31.0  <chr [… <chr [1… GPL-3   a2c0c… no               Auto…
## 5 a4Prep… 1.31.0  <chr [… <chr [2… GPL-3   087b7… no               Auto…
## 6 a4Repo… 1.31.0  <chr [… <chr [1… GPL-3   1635a… no               Auto…
## # … with 37 more variables: Description <chr>, biocViews <list>,
## #   Author <list>, Maintainer <list>, git_url <chr>, git_branch <chr>,
## #   git_last_commit <chr>, git_last_commit_date <chr>,
## #   `Date/Publication` <chr>, source.ver <chr>, win.binary.ver <chr>,
## #   `mac.binary.el-capitan.ver` <chr>, vignettes <list>,
## #   vignetteTitles <list>, hasREADME <chr>, hasNEWS <chr>, hasINSTALL <chr>,
## #   hasLICENSE <chr>, Rfiles <list>, Enhances <list>, dependsOnMe <list>,
## #   Imports <list>, importsMe <list>, suggestsMe <list>, LinkingTo <list>,
## #   Archs <list>, VignetteBuilder <chr>, URL <chr>,
## #   SystemRequirements <chr>, BugReports <chr>, Video <chr>,
## #   linksToMe <list>, OS_type <chr>, License_restricts_use <chr>,
## #   PackageStatus <chr>, License_is_FOSS <chr>, organism <chr>

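As a simple example of how these columns can be used, extract the importsMe column to find the packages that import the GEOquery package.

library(dplyr)
bpi = biocPkgList()
bpi %>%
    filter(Package == "GEOquery") %>%   # keep only the GEOquery row
    pull(importsMe) %>%                 # list column of reverse imports
    unlist()                            # flatten to a character vector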

##  [1] "bigmelon"        "ChIPXpress"      "coexnet"         "crossmeta"      
##  [5] "EGAD"            "GAPGOM"          "GSEABenchmarkeR" "MACPET"         
##  [9] "minfi"           "MoonlightR"      "phantasus"       "recount"        
## [13] "SRAdb"
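
Because importsMe is a list column, it can also be summarized for all packages at once. The sketch below counts reverse imports per package on a toy table. The GEOquery entry copies the output above, while "toyPkg" is made up, and how the real table encodes a package with no reverse imports (empty vector or NA) is an assumption that is not checked here.

```r
library(dplyr)

# Toy stand-in for biocPkgList(): one row per package, importsMe as a list
# column. "toyPkg" is a hypothetical package with no reverse imports.
toy_bpi <- tibble(
  Package   = c("toyPkg", "GEOquery"),
  importsMe = list(
    character(0),
    c("bigmelon", "ChIPXpress", "coexnet", "crossmeta", "EGAD", "GAPGOM",
      "GSEABenchmarkeR", "MACPET", "minfi", "MoonlightR", "phantasus",
      "recount", "SRAdb")
  )
)

# lengths() returns the number of elements in each list entry, which here
# is the number of packages importing each row's package.
ranked <- toy_bpi %>%
  mutate(n_importers = lengths(importsMe)) %>%
  arrange(desc(n_importers))

ranked
```
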

Package Explorer

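For end users of Bioconductor, an analysis often starts with finding a package, or a set of packages, that performs the required task or is tailored to a particular operation or data type. The biocExplore() function implements an interactive bubble visualization with filtering based on biocViews terms. Bubbles are sized according to download statistics. Tooltips and detail-on-click capabilities are also included. To start a local session, call biocExplore() from an interactive R session.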

Dependency graphs

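The Bioconductor ecosystem is built around the concepts of interoperability and dependencies. These interdependencies are available as part of the biocPkgList() output. BiocPkgTools provides some convenience functions for converting package dependencies into R graphs. The basic workflow is: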

  1. Create a data.frame of dependencies using buildPkgDependencyDataFrame.
  2. Create an igraph object from the dependency data frame using buildPkgDependencyIgraph.
  3. Use native igraph functionality to perform arbitrary network operations. The convenience functions inducedSubgraphByPkgs and subgraphByDegree are also available.
  4. Visualize with packages such as visNetwork.

Working with dependency graphs

library(BiocPkgTools)
dep_df = buildPkgDependencyDataFrame()
g = buildPkgDependencyIgraph(dep_df)
g

## IGRAPH a244fce DN-- 3113 25939 -- 
## + attr: name (v/c), edgetype (e/c)
## + edges from a244fce (vertex names):
##  [1] a4       ->a4Base        a4       ->a4Preproc    
##  [3] a4       ->a4Classif     a4       ->a4Core       
##  [5] a4       ->a4Reporting   a4Base   ->methods      
##  [7] a4Base   ->graphics      a4Base   ->grid         
##  [9] a4Base   ->Biobase       a4Base   ->AnnotationDbi
## [11] a4Base   ->annaffy       a4Base   ->mpm          
## [13] a4Base   ->genefilter    a4Base   ->limma        
## [15] a4Base   ->multtest      a4Base   ->glmnet       
## + ... omitted several edges

library(igraph)
head(V(g))

## + 6/3113 vertices, named, from a244fce:
## [1] a4          a4Base      a4Classif   a4Core      a4Preproc   a4Reporting

head(E(g))

## + 6/25939 edges from a244fce (vertex names):
## [1] a4    ->a4Base      a4    ->a4Preproc   a4    ->a4Classif  
## [4] a4    ->a4Core      a4    ->a4Reporting a4Base->methods

For more details on graph analysis, setting vertex and edge attributes, and advanced subsetting, see the igraph documentation.
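
As a minimal sketch of the kind of native igraph operation meant here, the example below builds a small graph from a few of the edges printed above and queries it. The edge list is typed by hand rather than produced by buildPkgDependencyDataFrame(), the column names `from` and `to` are chosen for illustration, and it assumes that an edge points from a package to something it depends on, as the printed edges suggest.

```r
library(igraph)

# A handful of the edges shown above, retyped by hand.
edges <- data.frame(
  from = c("a4", "a4", "a4", "a4", "a4", "a4Base", "a4Base", "a4Base"),
  to   = c("a4Base", "a4Preproc", "a4Classif", "a4Core", "a4Reporting",
           "methods", "graphics", "Biobase"),
  stringsAsFactors = FALSE
)
toy_g <- graph_from_data_frame(edges, directed = TRUE)

# Out-degree: how many packages each vertex points to (its dependencies,
# under the direction assumption stated above).
out_deg <- degree(toy_g, mode = "out")

# In-degree: how many packages point at each vertex.
in_deg <- degree(toy_g, mode = "in")

# Names of the vertices that a4Base points to.
a4base_deps <- neighbors(toy_g, "a4Base", mode = "out")$name

out_deg[c("a4", "a4Base")]
a4base_deps
```

On the full graph `g`, the same calls would rank packages by how many dependencies they have or by how many other packages rely on them.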

Graph visualization

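The visNetwork package is a nice interactive visualization tool that draws graphs in the browser. It can be integrated into Shiny applications, and interactive graphs can also be embedded in R Markdown documents (see the vignette).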

igraph_network = buildPkgDependencyIgraph(buildPkgDependencyDataFrame())

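Although it is possible to plot it, the full dependency graph is not very informative to look at. A common use case is to "focus" the dependency graph on a package of interest. In this example, I focus on the GEOquery package.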

igraph_geoquery_network = subgraphByDegree(igraph_network, "GEOquery")

The subgraphByDegree() function returns all nodes and connections within `degree` steps of the named package; the default degree is 1.

The visNetwork package can plot igraph objects directly, but converting the graph to visNetwork form first offers more flexibility.
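
To make the idea of a degree-limited neighborhood concrete, the sketch below uses igraph's make_ego_graph() on a toy graph. It is an analogy only, not a claim about how subgraphByDegree() is implemented, and the toy edge list is a hand-typed excerpt of the edges printed earlier.

```r
library(igraph)

# Toy dependency graph, retyped from a few of the edges shown earlier.
edges <- data.frame(
  from = c("a4", "a4", "a4", "a4", "a4", "a4Base", "a4Base", "a4Base"),
  to   = c("a4Base", "a4Preproc", "a4Classif", "a4Core", "a4Reporting",
           "methods", "graphics", "Biobase"),
  stringsAsFactors = FALSE
)
toy_g <- graph_from_data_frame(edges, directed = TRUE)

# Neighborhood of a4Base within one step, ignoring edge direction.
# make_ego_graph() returns a list with one graph per requested node.
ego1 <- make_ego_graph(toy_g, order = 1, nodes = "a4Base", mode = "all")[[1]]

# Widening to two steps also pulls in the other packages connected to a4.
ego2 <- make_ego_graph(toy_g, order = 2, nodes = "a4Base", mode = "all")[[1]]

V(ego1)$name
vcount(ego2)
```

Keeping the radius small is what keeps the focused graph readable when it is drawn.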

library(visNetwork)
data <- toVisNetworkData(igraph_geoquery_network)
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px")

For fun, we can watch the graph stabilize during drawing; this is best viewed interactively.

visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visPhysics(stabilization=FALSE)

# Color the edges by dependency type: light blue by default,
# then red for Imports and green for Depends
data$edges$color = 'lightblue'
data$edges[data$edges$edgetype == 'Imports', 'color'] = 'red'
data$edges[data$edges$edgetype == 'Depends', 'color'] = 'green'

# Redraw with arrows on the edges
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visEdges(arrows='from')

# Add a legend that maps edge colors to dependency types
ledges <- data.frame(color = c("green", "lightblue", "red"),
  label = c("Depends", "Suggests", "Imports"), arrows = c("from", "from", "from"))
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
  visEdges(arrows='from') %>%
  visLegend(addEdges=ledges)
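
The edge coloring used here can also be written as a single lookup in a named vector, which scales better if more edge types need their own color. This is a base R sketch on a made-up edge table; the package names and the "LinkingTo" row are hypothetical, and the colors follow the ones used in this section.

```r
# Map each dependency type to a color; names are the edgetype values.
edge_colors <- c(Depends = "green", Imports = "red", Suggests = "lightblue")

# Hypothetical edge table with the same edgetype column as data$edges.
toy_edges <- data.frame(
  from     = c("pkgA", "pkgA", "pkgB", "pkgC"),
  to       = c("pkgB", "pkgC", "pkgC", "pkgD"),
  edgetype = c("Depends", "Imports", "Suggests", "LinkingTo"),
  stringsAsFactors = FALSE
)

# Look up the color for every edge by name; types missing from the
# palette come back as NA and fall back to the default color.
toy_edges$color <- unname(edge_colors[toy_edges$edgetype])
toy_edges$color[is.na(toy_edges$color)] <- "lightblue"

toy_edges
```

The lookup relies on `edgetype` being a character column; with a factor column, convert it with as.character() first.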

Integration with BiocViews

library(biocViews)
data(biocViewsVocab)
biocViewsVocab

## A graphNEL graph with directed edges
## Number of Nodes = 476 
## Number of Edges = 475

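library(igraph)
# Convert the graphNEL vocabulary to an igraph object. graph_from_graphnel()
# is the current name of igraph.from.graphNEL(), which newer igraph releases
# deprecate.
g = graph_from_graphnel(biocViewsVocab)
library(visNetwork)
gv = toVisNetworkData(g)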
visNetwork(gv$nodes, gv$edges, width="100%") %>%
    visIgraphLayout(layout = "layout_as_tree", circular=TRUE) %>%
    visNodes(size=20) %>%
    visPhysics(stabilization=FALSE)

