数据科学与R语言菜🐣日记——走R包生物信息学从零开始学

annotate: ID 转换终结者+数据包&NCBI查

2019-06-22  本文已影响6人  美式永不加糖

Annotation for microarrays

Gentleman R (2019). annotate: Annotation for microarrays. R package version 1.62.0.

1. 关于 annotate

作者是 R. Gentleman, 不知道是不是这位真の大佬👇 (似乎是的,大佬的姓氏一看就是大佬👀


包内函数主要分为四类:

简单 并不简单地分别试跑一下前三项的示例代码,再尝试一下 Boss Jimmy 博客提到过的 “ID转换终结者” ┗|`O′|┛ 、

2. 从芯片元数据库 (meta-data libraries) 提取数据

以 Affymetrix HGu95av2 芯片为例,比如想得到 GO 相关数据。

affyGO <- eapply(hgu95av2GO, getOntology)
## 查看各探针对应的 GO ID 数目
table(sapply(affyGO, length)) 
# 
#    0    1    2    3    4    5    6    7    8    9   10   11   12 
# 1901 1577 1866 1768 1440 1136  744  575  367  322  239  154  137 
#   13   14   15   16   17   18   19   20   21   22   23   24   25 
#   89   57   68   27   30   19   17   14   11   11    7    5    2 
#   26   27   28   29   30   32   37 
#   16    3    3    8    1    6    5 

查看各探针对应的 GO evidence codes 及数目:

affyEv <- eapply(hgu95av2GO, getEvidence)
table(unlist(affyEv, use.names=FALSE))
# 
#  EXP   HDA   HEP   HMP    IC   IDA   IEA   IEP   IGI   IMP   IPI 
#  457  6758    80   114  1302 59621 65824  1073  1759 19887 14319 
#  ISA   ISM   ISS   NAS    ND   RCA   TAS 
#  979   743 23946  7448   675   146 43804 

A GO annotation is a statement about the function of a particular gene. Each annotation includes an evidence code to indicate how the annotation to a particular term is supported.

Evidence codes fall into six general categories:

  • experimental evidence
  • phylogenetic evidence
  • computational evidence
  • author statements
  • curatorial statements
  • automatically generated annotations

删除特定的 GO evidence codes:

affyEv_drop <- eapply(hgu95av2GO, dropECode, c("IEA", "ND"))
table(unlist(sapply(affyEv_drop, getEvidence),
             use.names = FALSE))
# 
#   EXP   HDA   HEP   HMP    IC   IDA   IEP   IGI   IMP   IPI 
#   457  6758    80   114  1302 59621  1073  1759 19887 14319 
#   ISA   ISM   ISS   NAS   RCA   TAS 
#   979   743 23946  7448   146 43804 

3. 在线查询信息

需要和 BiobaseXML 这两个包配合使用。

3.1 四个直截了当的函数

genbankpubmed 可通过参数选择返回 XML 格式数据或打开浏览器,entrezGeneByIDentrezGeneQuery 则可返回 URL 链接。

library(annotate)
entrezGeneQuery(c("leukemia", "Homo sapiens"))
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=leukemia%20Homo sapiens"
## emmmmm这样是会报错的
entrezGeneQuery("leukemia", "Homo sapiens") 
# Error in entrezGeneQuery("leukemia", "Homo sapiens") : 
#   unused argument ("Homo sapiens")

当直接输入 UniGene ID 时, entrezGeneByIDentrezGeneQuery 可以说是等同的:

entrezGeneByID(c("100", "1002"))
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=100" 
# [2] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=1002"
entrezGeneQuery(100)
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=100"
entrezGeneByID(100)
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=100"

3.2 获取 PubMed 信息

要用到 Biobase 包内的数据:sample.ExpressionSet .

data(sample.ExpressionSet) 
## 取其中11个基因作为示例
affys <- featureNames(sample.ExpressionSet)[490:500] 
affys
# [1] "31729_at" "31730_at" "31731_at" "31732_at" "31733_at" "31734_at"
# [7] "31735_at" "31736_at" "31737_at" "31738_at" "31739_at"

这时需要把 Affymetrix id (identifiers) 转换成 PubMed ID,以便接下来 pubmed 的愉快操作。

library(hgu95av2.db) 
ids <- getPMID(affys,"hgu95av2") 

得到了一个有11个元素的 list,接下来对它进行 unlist.

ids <- unlist(ids,use.names=FALSE) 
## 去重
ids <- unique(ids[!is.na(as.numeric(ids))]) 
length(ids) 
# [1] 731
ids[1:10]
# [1] "11438666" "12477932" "12878157" "15489334" "16710414" "17375202"
# [7] "17643375" "17884155" "18029348" "19240132"

pubmed 返回了 XMLDocument 对象,即用11个 Affymetrix id

得到了731个 PMID, 再取前10个获取其他信息。

x <- pubmed(ids[1:10]) 
class(x)
[1] "XMLDocument"         "XMLAbstractDocument"
a <- xmlRoot(x) 
numAbst <- length(xmlChildren(a)) 
numAbst 
# [1] 10

构建 PubMedAbst 对象,提取摘要文本:

arts <- vector("list", length=numAbst) 
absts <- rep(NA, numAbst) 
for (i in 1:numAbst) {
  arts[[i]] <- buildPubMedAbst(a[[i]])
  absts[i] <- abstText(arts[[i]])
}
arts[[3]] 
# An object of class 'pubMedAbst':
# Title: Redifferentiation of dedifferentiated
#      chondrocytes and chondrogenesis of human
#      bone marrow stromal cells via chondrosphere
#      formation with expression profiling by
#      large-scale cDNA analysis.
# PMID: 12878157
# Authors: H Imabayashi, T Mori, S Gojo, T Kiyono,
#      T Sugiyama, R Irie, T Isogai, J Hata, Y
#      Toyama, A Umezawa
# Journal: Exp Cell Res
# Date: Aug 2003
absts[3]
# [1] "Characterization of dedifferentiated chondrocytes (DECs) and mesenchymal stem cells capable of differentiating into chondrocytes is of biological and clinical interest. We isolated DECs and bone marrow stromal cells (BMSCs), H4-1 and H3-4, and demonstrated that the cells started to produce extracellular matrices, such as type II collagen and aggrecan, at an early stage of chondrosphere formation. Furthermore, cDNA sequencing of cDNA libraries constricted by the oligocapping method was performed to analyze difference in mRNA expression profiling between DECs and marrow stromal cells. Upon redifferentiation of DECs, cartilage-related extracellular matrix genes, such as those encoding leucine-rich small proteoglycans, cartilage oligomeric matrix protein, and chitinase 3-like 1 (cartilage glycoprotein-39), were highly expressed. Growth factors such as FGF7 and CTGF were detected at a high frequency in the growth stage of monolayer stromal cultures. By combining the expression profile and flow cytometry, we demonstrated that isolated stromal cells, defined by CD34(-), c-kit(-), and CD140alpha(- or low), have chondrogenic potential. The newly established human mesenchymal cells with expression profiling provide a powerful model for a study of chondrogenic differentiation and further understanding of cartilage regeneration in the means of redifferentiated DECs and BMSCs."

4. 染色体信息的构建及利用

利用函数 buildChromLocation() 构建 chromLocation 对象:

z <- buildChromLocation("hgu95av2") 
z
# Instance of a chromLocation class with the following fields:
#   Organism:  Homo sapiens 
#   Data source:  hgu95av2 
#   Number of chromosomes for this organism:  595 
#   Chromosomes of this organism and their lengths in base pairs:
#        1 : 248956422
#        2 : 242193529
#        3 : 198295559
#        4 : 190214555
#        5 : 181538259
#        6 : 170805979
#        7 : 159345973
#        X : 156040895
#        8 : 145138636
#        9 : 138394717
#        11 : 135086622
#        10 : 133797422
#        12 : 133275309
#        13 : 114364328
#        14 : 107043718
#        15 : 101991189
#        16 : 90338345
#        17 : 83257441
#        18 : 80373285
#        20 : 64444167
#        19 : 58617616
#        Y : 57227415
#        22 : 50818468
#        21 : 46709983
#        8_KZ208915v1_fix : 6367528
#        15_KI270905v1_alt : 5161414
#        15_KN538374v1_fix : 4998962
#        6_GL000256v2_alt : 4929269
#        6_GL000254v2_alt : 4827813
#        6_GL000251v2_alt : 4795265
#        6_GL000253v2_alt : 4677643
#        6_GL000250v2_alt : 4672374
#        6_GL000255v2_alt : 4606388
#        6_GL000252v2_alt : 4604811
#        17_KI270857v1_alt : 2877074
#        16_KI270853v1_alt : 2659700
# ...
## 后面太长不贴出来了

这个 S4 对象有6个 slot.

相关信息获取:

## 物种信息
organism(z)
# [1] "Homo sapiens"

## 数据来源
dataSource(z)
# [1] "hgu95av2"

## 查看染色体位点
head(names(chromLocs(z)), n = 10) 
#  [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18"
tail(names(chromLocs(z)), n = 10) 
#  [1] "19_KV575249v1_alt" "19_KV575251v1_alt"
#  [3] "19_KV575252v1_alt" "19_KV575253v1_alt"
#  [5] "19_KV575254v1_alt" "19_KV575255v1_alt"
#  [7] "19_KV575256v1_alt" "19_KV575257v1_alt"
#  [9] "19_KV575259v1_alt" "19_KV575260v1_alt"

## 定位探针所在染色体
get("32972_at", probesToChrom(z))
# [1] "X"

## 查看染色体名称及其长度
head(chromInfo(z))
#         1         2         3         4         5 
# 248956422 242193529 198295559 190214555 181538259 
#         6 
# 170805979 
tail(chromInfo(z))
# Un_KI270312v1 Un_KI270539v1 Un_KI270385v1 Un_KI270423v1 
#           998           993           990           981 
# Un_KI270392v1 Un_KI270394v1 
#           971           970 

## 查看探针所对应的 gene symbol
get("32972_at", geneSymbols(z))
# [1] "NOX1"

## 染色体数
nChrom(z)
# [1] 595

5. ID 转换终结者

5.1 getSymbol()

getSymbol() 为例的一系列函数:

getSYMBOL(x, data)
getEG(x, data)
getGO(x, data)
getPMID(x, data)
getGOdesc(x, which)
lookUp(x, data, what, load = FALSE)

probes <- featureNames(sample.ExpressionSet)[246:260]
getSYMBOL(probes, "hgu95av2.db")
#   31490_at 31491_s_at   31492_at 31493_s_at   31494_at 
#    "SCN5A"    "CASP8"    "EIF3K"         NA         NA 
#   31495_at 31496_g_at   31497_at 31498_f_at 31499_s_at 
#     "XCL2"         NA         NA         NA   "FCGR3B" 
getEG(probes, "hgu95av2.db")
#   31490_at 31491_s_at   31492_at 31493_s_at   31494_at 
#     "6331"      "841"    "27335"         NA         NA 
#   31495_at 31496_g_at   31497_at 31498_f_at 31499_s_at 
#     "6846"         NA         NA         NA     "2215" 
go <- getGO(probes, "hgu95av2.db")
getGOdesc(go[[1]][[1]][["GOID"]], "ANY")
# $`GO:0002027`
# GOID: GO:0002027
# Term: regulation of heart rate
# Ontology: BP
# Definition: Any process that modulates the frequency
#     or rate of heart contraction.
# Synonym: cardiac chronotropy
# Synonym: regulation of heart contraction rate
# Synonym: regulation of rate of heart contraction
getGOdesc(go[[1]][[1]][["GOID"]], "BP")
# $`GO:0002027`
# GOID: GO:0002027
# Term: regulation of heart rate
# Ontology: BP
# Definition: Any process that modulates the frequency
#     or rate of heart contraction.
# Synonym: cardiac chronotropy
# Synonym: regulation of heart contraction rate
# Synonym: regulation of rate of heart contraction

## 试错
getGOdesc(go[[1]][[1]][["GOID"]], "MF")
# NULL
lookUp(probes, "hgu95av2", "ENTREZID")
# $`31490_at`
# [1] "6331"
# 
# $`31491_s_at`
# [1] "841"
# 
# $`31492_at`
# [1] "27335"
# 
# $`31493_s_at`
# [1] NA
# 
# $`31494_at`
# [1] NA
# 
# $`31495_at`
# [1] "6846"
# 
# $`31496_g_at`
# [1] NA
# 
# $`31497_at`
# [1] NA
# 
# $`31498_f_at`
# [1] NA
# 
# $`31499_s_at`
# [1] "2215"

lookUp(go[[2]][[1]][["GOID"]], "GO", "ONTOLOGY")
# 'select()' returned 1:1 mapping between keys and columns
# $`GO:0006508`
# [1] "BP"

5.2 select()

先用 colums() 查看可以转换的项目:

columns(hgu95av2.db)
#  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"     
#  [4] "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
#  [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL" 
# [10] "GENENAME"     "GO"           "GOALL"       
# [13] "IPI"          "MAP"          "OMIM"        
# [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
# [19] "PFAM"         "PMID"         "PROBEID"     
# [22] "PROSITE"      "REFSEQ"       "SYMBOL"      
# [25] "UCSCKG"       "UNIGENE"      "UNIPROT"     
out <- select(hgu95av2.db, probes,  c("SYMBOL","ENTREZID", "GENENAME"))

References

上一篇 下一篇

猜你喜欢

热点阅读