One R function to solve biological ID conversion
2020-01-19
小光amateur
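Introduction:
Biological ID conversion comes up constantly when working with biological data. There are usually two approaches: use an online tool, the best known being biomart and db2db, or use local software such as clusterProfiler::bitr.
Online conversion is tedious: files have to be uploaded and downloaded, and the results need a second round of processing. It also becomes impractical when many conversions are needed. Local conversion relies on databases that are updated slowly, so many IDs cannot be converted and few conversions succeed.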
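As a simple example, this project contains a sample file, test_name.txt, which holds 100 Ensembl Transcript IDs. For downstream analysis they have to be converted to Gene Symbols.
With the bitr function we get only a small share of the mappings: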
library(clusterProfiler)
library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
# [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID"
# [7] "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" "GO" "GOALL"
#[13] "IPI" "MAP" "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH"
#[19] "PFAM" "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
#[25] "UNIGENE" "UNIPROT"
# `data` holds the 100 IDs from test_name.txt in a column named "gene"
result<-bitr(data$gene,fromType = "ENSEMBLTRANS",toType = "SYMBOL",OrgDb = org.Hs.eg.db)
#'select()' returned 1:1 mapping between keys and columns
#Warning message:
#In bitr(data$gene, fromType = "ENSEMBLTRANS", toType = "SYMBOL", :
# 84% of input gene IDs are fail to map...
head(result)
# ENSEMBLTRANS SYMBOL
#7 ENST00000418724 ZBTB22
#17 ENST00000374458 GGNBP1
#21 ENST00000588265 FXYD7
#25 ENST00000458629 CXCR6
#34 ENST00000595168 LOC400499
#38 ENST00000368547 ECHS1
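The failure rate reported in the warning can also be checked by hand, which gives the list of IDs that did not map. The following is a minimal sketch with toy stand-ins for `data$gene` and the `bitr` result; the two `ENST0000000000x` IDs are made up for illustration.

```r
# Toy stand-ins for data$gene and the bitr() result.
# The last two input IDs are made-up placeholders, not real transcripts.
input_ids <- c("ENST00000418724", "ENST00000374458",
               "ENST00000000001", "ENST00000000002")
result <- data.frame(
  ENSEMBLTRANS = c("ENST00000418724", "ENST00000374458"),
  SYMBOL       = c("ZBTB22", "GGNBP1"),
  stringsAsFactors = FALSE
)

# IDs that were given as input but are absent from the mapping table
unmapped <- setdiff(input_ids, result$ENSEMBLTRANS)

# Share of distinct input IDs that failed to map
fail_rate <- length(unmapped) / length(unique(input_ids))

print(unmapped)
print(fail_rate)
```

With the real data, replacing `input_ids` by `data$gene` should reproduce the percentage from the warning.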
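However, if we look the same IDs up on the bioDBnet website, only 2 IDs remain unmatched. I therefore wrapped the website's API in an R function, to avoid the drawbacks of online conversion and make conversion more efficient.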
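To give a rough idea of what wrapping a web API involves, here is a minimal sketch that only builds a request URL and makes no network call. It is not the code from bitr_db2db.R: the function name `build_db2db_url` is hypothetical, and the endpoint and the parameter names (`method`, `format`, `input`, `inputValues`, `outputs`, `taxonId`) are assumptions that should be checked against bioDBnet's own web service documentation before use.

```r
# Sketch only. The endpoint and parameter names are ASSUMPTIONS, not taken
# from bitr_db2db.R; verify them against bioDBnet's REST documentation.
build_db2db_url <- function(input, ids, outputs, taxon = NULL,
    base = "https://biodbnet-abcc.ncifcrf.gov/webServices/rest.php/biodbnetRestApi.json") {
  params <- c(
    method      = "db2db",
    format      = "row",
    input       = input,
    inputValues = paste(ids, collapse = ","),
    outputs     = paste(outputs, collapse = ",")
  )
  # Optionally restrict the query to one species (assumed parameter name)
  if (!is.null(taxon)) params <- c(params, taxonId = as.character(taxon))
  # Percent-encode every value so spaces and commas are safe in a URL
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste(names(encoded), encoded, sep = "=", collapse = "&"))
}

url <- build_db2db_url("Ensembl Transcript ID",
                       c("ENST00000532435", "ENST00000513185"),
                       "Gene Symbol")
print(url)
```

A real wrapper would then fetch this URL (for example with RCurl or httr), parse the JSON response, and reshape it into a two-column table.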
Usage
library(RCurl)
#library(httr)
## if your computer runs Windows, use httr instead of RCurl
library(rjson)
library(tidyr)
###read example data
data<-read.table("test_name.txt",header = FALSE,stringsAsFactors = FALSE)
colnames(data)<-"gene"
## to list every supported input type, pass "getinputs" as the first argument
bitr_db2db("getinputs")
# [1] "Affy GeneChip Array" "Affy ID" "Affy Transcript Cluster ID"
# [4] "Agilent ID" "Biocarta Pathway Name" "CodeLink ID"
# [7] "dbSNP ID" "DrugBank Drug ID" "DrugBank Drug Name"
#[10] "EC Number" "Ensembl Gene ID" "Ensembl Protein ID"
# ........
## to list every output type available for a given input type, pass "getoutputsforinput" as the first argument
bitr_db2db("getoutputsforinput","Ensembl Transcript ID")
# [1] "Affy ID" "Agilent ID" "Allergome Code"
# [4] "ApiDB_CryptoDB ID" "Biocarta Pathway Name" "BioCyc ID"
# [7] "CCDS ID" "Chromosomal Location" "CleanEx ID"
# [10] "CodeLink ID" "COSMIC ID" "CPDB Protein Interactor"
# ....
## to convert Ensembl Transcript IDs to Gene Symbols, run the following command
haha<-bitr_db2db("","Ensembl Transcript ID",data$gene,"Gene Symbol")
#[1] "your id have 1:1 mapping!"
head(haha)
# from to
#1 ENST00000532435 GDPD5
#2 ENST00000513185 RGMB
#3 ENST00000569370 CIAPIN1
#4 ENST00000451562 PPIA
#5 ENST00000289865 USP21
#6 ENST00000409411 PREPL
# a one-to-many conversion, for example the gene HTT
haha2<-bitr_db2db("","Gene Symbol","HTT","Ensembl Transcript ID")
#[1] "waring:your id have more than one mapping!"
head(haha2)
# from to
#1 HTT ENSCJAT00000027377
#2 HTT ENSBMUT00000040493
#3 HTT ENSBMUT00000040501
#4 HTT ENSBMUT00000040486
#5 HTT ENSBMUT00000040494
#6 HTT ENSBMUT00000040497
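The HTT lookup above returns transcripts from several species: prefixes such as ENSCJAT and ENSBMUT carry a species code, whereas human Ensembl transcript IDs start with plain ENST. If only human transcripts are wanted, one option is to filter the result afterwards. This is a minimal sketch, assuming the result is a data frame with `from` and `to` columns as printed above; the first two rows are copied from that output and the third is a made-up human-style ID added for the demonstration.

```r
# Toy version of the one-to-many result. The last ID is a made-up
# placeholder in human format, not a real HTT transcript.
haha2 <- data.frame(
  from = "HTT",
  to   = c("ENSCJAT00000027377", "ENSBMUT00000040493", "ENST00000000000"),
  stringsAsFactors = FALSE
)

# Keep only IDs of the form ENST + 11 digits (the human transcript pattern)
human_only <- haha2[grepl("^ENST[0-9]{11}$", haha2$to), ]
print(human_only)
```

Filtering after the fact is the simplest approach; restricting the query to one species on the server side would transfer less data, if the wrapper exposes such an option.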
Finally, the code is in bitr_db2db.R
Note: if you are on Windows and run into errors, try bitr_db2db_forwindows.R instead