ID转换那点事
在进行测序数据下游分析的时候常常需要用到不同的数据库,而这些数据库的分析的输入文件经常是各有区别,因此我们常常需要将各类ID(ensamble ID、Entrez ID、gene symbol等)进行转换。这里我们将简单介绍使用编程进行ID转换的方法。
芯片ID注释
芯片注释主要有两种方法:使用R包进行注释或在数据库中下载芯片平台信息然后手动注释,这两种方法其实基本都是相同的,都是利用芯片平台信息进行注释,区别只在于一个是利用R包API而另一个则是自己写代码。
载入数据:
> library('GEOquery')
> gse20986 <- getGEO("GSE20986", destdir=".") # 下载处理好的表达矩阵
> exprmt <- exprs(gse20986[[1]]);head(exprmt);class(exprmt)
GSM524662 GSM524663 GSM524664 GSM524665 GSM524666 GSM524667 GSM524668 GSM524669 GSM524670 GSM524671 GSM524672 GSM524673
1007_s_at 2.867331 3.941245 4.003509 3.230621 3.546867 2.507474 3.040758 2.456892 2.404267 3.568920 3.307409 3.225964
1053_at 10.503022 8.320770 9.523183 10.645904 8.140238 10.228986 11.019461 10.447158 10.799832 9.765222 10.069229 10.089496
117_at 2.702517 2.749144 3.226281 2.788124 4.664960 2.722543 2.745330 4.500872 2.702614 3.015139 2.768683 2.853809
121_at 3.052316 3.057630 3.056844 3.697705 3.057630 3.057630 3.057630 3.057630 3.072568 3.057630 3.057630 3.057630
1255_g_at 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998
1294_at 5.360226 4.465648 5.914568 5.558741 5.241448 5.037804 4.784192 4.978118 4.713441 4.303196 4.452989 4.766019
[1] "matrix"
# 将矩阵转换为数据框方便后面注释操作,并将行名称装换为ID
> exprmt <- as.data.frame(exprmt); exprmt$ID <- rownames(exprmt); head(exprmt)
GSM524662 GSM524663 GSM524664 GSM524665 GSM524666 GSM524667 GSM524668 GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 ID
1007_s_at 2.867331 3.941245 4.003509 3.230621 3.546867 2.507474 3.040758 2.456892 2.404267 3.568920 3.307409 3.225964 1007_s_at
1053_at 10.503022 8.320770 9.523183 10.645904 8.140238 10.228986 11.019461 10.447158 10.799832 9.765222 10.069229 10.089496 1053_at
117_at 2.702517 2.749144 3.226281 2.788124 4.664960 2.722543 2.745330 4.500872 2.702614 3.015139 2.768683 2.853809 117_at
121_at 3.052316 3.057630 3.056844 3.697705 3.057630 3.057630 3.057630 3.057630 3.072568 3.057630 3.057630 3.057630 121_at
1255_g_at 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 2.278998 1255_g_at
1294_at 5.360226 4.465648 5.914568 5.558741 5.241448 5.037804 4.784192 4.978118 4.713441 4.303196 4.452989 4.766019 1294_at
使用annotate包注释芯片
在数据集GSE20986页面可以看到芯片平台信息:
当然也可以直接通过代码获得芯片平台信息:
> gse20986
$GSE20986_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 12 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM524662 GSM524663 ... GSM524673 (12 total)
varLabels: title geo_accession ... tissue:ch1 (34 total)
varMetadata: labelDescription
featureData
featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL570
知道了芯片平台信息,接下来我们就可以使用annotate包进行注释了。
先查看一下有没有重复的芯片ID:
> dupID <- exprmt[duplicated(exprmt$ID),]; head(dupID); dim(dupID)
[1] GSM524662 GSM524663 GSM524664 GSM524665 GSM524666 GSM524667 GSM524668 GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 ID
<0 行> (或0-长度的row.names)
[1] 0 13
没有重复,直接进行注释
> biocLite("hgu133plus2.db")
> library(hgu133plus2.db)
> library(annotate)
# 利用getSYMBOL函数进行注释,这里只选择后几列额,方便观察
> exprmt$symbol <- getSYMBOL(exprmt$ID, 'hgu133plus2');exprmt <- exprmt[,8:14]; head(exprmt)
GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 ID symbol
1007_s_at 2.456892 2.404267 3.568920 3.307409 3.225964 1007_s_at <NA>
1053_at 10.447158 10.799832 9.765222 10.069229 10.089496 1053_at RFC2
117_at 4.500872 2.702614 3.015139 2.768683 2.853809 117_at HSPA6
121_at 3.057630 3.072568 3.057630 3.057630 3.057630 121_at PAX8
1255_g_at 2.278998 2.278998 2.278998 2.278998 2.278998 1255_g_at GUCA1A
1294_at 4.978118 4.713441 4.303196 4.452989 4.766019 1294_at <NA>
可以看到有一些芯片ID是没有注释上的,接下来对注释情况进行检查,主要包括三个方面:(1)未注释上--NA;(2)未注释上--空字符串;(3)注释上了,但有重复。
# 查看一共多少行
> dim(exprmt)
[1] 54675 7
# symbol非NA的行
> exprmt.nona <- exprmt[!is.na(exprmt$symbol),]; dim(exprmt.nona)
[1] 41941 7
> dim(exprmt[is.na(exprmt$symbol),])
[1] 12734 7
# symbol重复的行数
> symbol.dup <- exprmt.nona[duplicated(exprmt.nona$symbol),]; head(symbol.dup)
GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 ID symbol
1552264_a_at 8.016109 7.672156 7.661139 7.835451 7.225329 1552264_a_at MAPK1
1552272_a_at 2.279247 2.279247 2.279247 2.279247 2.279247 1552272_a_at PRR22
1552275_s_at 4.862782 5.539661 4.672476 4.709687 3.931607 1552275_s_at PXK
1552279_a_at 2.436048 2.731219 2.436048 2.436048 2.464805 1552279_a_at SLC46A1
1552289_a_at 2.278998 2.278998 2.278998 2.278998 2.278998 1552289_a_at CILP2
1552303_a_at 4.421032 4.601013 4.614977 4.628660 4.628942 1552303_a_at TMEM106A
> dim(symbol.dup)
[1] 21749 7
> exprmt.nona[exprmt.nona$symbol=='PXK',]
GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 ID symbol
1552274_at 5.290347 5.539950 5.081292 5.016670 5.032546 1552274_at PXK
1552275_s_at 4.862782 5.539661 4.672476 4.709687 3.931607 1552275_s_at PXK
225796_at 7.140168 7.106552 7.232615 6.650670 7.167075 225796_at PXK
238223_at 2.281464 2.281464 2.394397 2.281464 2.281464 238223_at PXK
# 最后看看有没有空字符串
> exprmt.nona[exprmt.nona$symbol=='',]
[1] GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 ID symbol
<0 行> (或0-长度的row.names)
# 最后得到非重复的,无NA的数据
> exprmt.final <- exprmt.nona[!duplicated(exprmt.nona$symbol),]; dim(exprmt.final)
[1] 20192 7
> exprmt.final[exprmt.final$symbol=='PXK',]
GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 ID symbol
1552274_at 5.290347 5.53995 5.081292 5.01667 5.032546 1552274_at PXK
从上面可以看到,我们的芯片数据共有54675个features,其中注释到gene symbol的为41941个,在这些注释到的features中,有21749个是重复的。无空字符串gene symbol。最后得到非重复数据20192个(重复的则取其第一个出现的)。
下面我们下载芯片平台对应信息文件手动进行注释:
# 下载芯片对应信息
> gpl570 <- getGEO('GPL570', destdir=".")
> names(Meta(gpl570))
[1] "contact_address" "contact_city" "contact_country" "contact_email" "contact_institute" "contact_name"
[7] "contact_phone" "contact_state" "contact_web_link" "contact_zip/postal_code" "data_row_count" "description"
[13] "distribution" "geo_accession" "last_update_date" "manufacture_protocol" "manufacturer" "organism"
[19] "relation" "sample_id" "series_id" "status" "submission_date" "taxid"
[25] "technology" "title" "web_link"
> colnames(Table(gpl570))
[1] "ID" "GB_ACC" "SPOT_ID" "Species Scientific Name"
[5] "Annotation Date" "Sequence Type" "Sequence Source" "Target Description"
[9] "Representative Public ID" "Gene Title" "Gene Symbol" "ENTREZ_GENE_ID"
[13] "RefSeq Transcript ID" "Gene Ontology Biological Process" "Gene Ontology Cellular Component" "Gene Ontology Molecular Function"
# 选取其中的几列用来转换
> iddata <- Table(gpl570)[c('ID','Gene Symbol')]
> head(iddata)
ID Gene Symbol
1 1007_s_at DDR1 /// MIR4640
2 1053_at RFC2
3 117_at HSPA6
4 121_at PAX8
5 1255_g_at GUCA1A
6 1294_at MIR5193 /// UBA7
我们选取gpl信息中的ID、Gene Symbol用于转换。
> mann.anno <- merge(x=exprmt, y=iddata, by='ID',all.x=T,all.y=F); head(mann.anno); dim(mann.anno)
ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP Gene Symbol
1 1007_s_at 2.456892 2.404267 3.568920 3.307409 3.225964 <NA> DDR1 /// MIR4640
2 1053_at 10.447158 10.799832 9.765222 10.069229 10.089496 RFC2 RFC2
3 117_at 4.500872 2.702614 3.015139 2.768683 2.853809 HSPA6 HSPA6
4 121_at 3.057630 3.072568 3.057630 3.057630 3.057630 PAX8 PAX8
5 1255_g_at 2.278998 2.278998 2.278998 2.278998 2.278998 GUCA1A GUCA1A
6 1294_at 4.978118 4.713441 4.303196 4.452989 4.766019 <NA> MIR5193 /// UBA7
可以看到用R包注释的(symbol.RP列)和直接下载平台信息注释的有些许差别,R包未能注释到的,手动注释结果是对应多个信息。下面我们进一步查看一下其区别:
> library(dplyr)
> consistent <- filter(mann.anno, symbol.RP==`Gene Symbol`); head(consistent)
ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP Gene Symbol
1 1053_at 10.447158 10.799832 9.765222 10.069229 10.089496 RFC2 RFC2
2 117_at 4.500872 2.702614 3.015139 2.768683 2.853809 HSPA6 HSPA6
3 121_at 3.057630 3.072568 3.057630 3.057630 3.057630 PAX8 PAX8
4 1255_g_at 2.278998 2.278998 2.278998 2.278998 2.278998 GUCA1A GUCA1A
5 1316_at 5.246463 5.547843 5.269052 4.794183 4.620719 THRA THRA
6 1320_at 4.578203 4.446357 4.492585 4.502381 4.446357 PTPN21 PTPN21
> dim(mann.anno); dim(consistent)
[1] 54675 8
[1] 39145 8
# 可以看到一共有39145行注释结果都是相同的,那么这些中有多少是都未注释到的呢(即NA)
> consistent.na <- filter(consistent, is.na(symbol.RP));head(consistent.na)
[1] ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP Gene Symbol
<0 行> (或0-长度的row.names)
# 结果是这些相同的中没有NA
# 再来看看注释到重复gene的情况吧
> dup <- consistent[duplicated(consistent$symbol.RP),]
> head(dup);dim(dup)
ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP Gene Symbol
16 1552264_a_at 8.016109 7.672156 7.661139 7.835451 7.225329 MAPK1 MAPK1
20 1552272_a_at 2.279247 2.279247 2.279247 2.279247 2.279247 PRR22 PRR22
22 1552275_s_at 4.862782 5.539661 4.672476 4.709687 3.931607 PXK PXK
26 1552279_a_at 2.436048 2.731219 2.436048 2.436048 2.464805 SLC46A1 SLC46A1
32 1552289_a_at 2.278998 2.278998 2.278998 2.278998 2.278998 CILP2 CILP2
40 1552303_a_at 4.421032 4.601013 4.614977 4.628660 4.628942 TMEM106A TMEM106A
[1] 20267 8
# 因此最后非重复的
> dim(consistent[!duplicated(consistent$symbol.RP),])
[1] 18878 8
对两种注释结果相同的 gene symbol 进行分析,一共得到了39145个注释上的gene,这其中有20267个是重复的,因此最后得到了18878个非重复注释的gene。
我们最后再对手动注释的结果进行分析一下。
# 看看为注释到的情况(NA)
> mann.anno[is.na(mann.anno$`Gene Symbol`),]
[1] ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP Gene Symbol
<0 行> (或0-长度的row.names)
# 没有注释不上的,看来R包注释不上的这里都注释到多个gene symbol了(如DDR1 /// MIR4640)
> library(stringr)
> multi.anno <- mann.anno[str_detect(mann.anno$`Gene Symbol`, '///'),]; head(multi.anno); dim(multi.anno)
ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP Gene Symbol
1 1007_s_at 2.456892 2.404267 3.568920 3.307409 3.225964 <NA> DDR1 /// MIR4640
6 1294_at 4.978118 4.713441 4.303196 4.452989 4.766019 <NA> MIR5193 /// UBA7
16 1552258_at 2.437344 2.458223 2.437344 2.437344 2.429044 CYTOR LINC00152 /// LOC101930489
32 1552283_s_at 2.278998 2.278998 2.278998 2.281762 2.278998 <NA> ZDHHC11 /// ZDHHC11B
97 1552388_at 2.550713 2.389557 2.389557 2.389557 2.389557 <NA> FLJ30901 /// SCUBE1
108 1552401_a_at 2.278998 2.278998 2.278998 2.294001 2.278998 <NA> BACH1 /// GRIK1-AS2
[1] 2796 8
# 这里2796个features注释到多gene symbol,但意外的是,有的多gene symbol位点R包也注释到了
# 看看dup情况怎么样
> mann.anno.rmdup <- mann.anno[!duplicated(mann.anno$`Gene Symbol`),]; head(mann.anno.rmdup); dim(mann.anno.rmdup)
ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP Gene Symbol
1 1007_s_at 2.456892 2.404267 3.568920 3.307409 3.225964 <NA> DDR1 /// MIR4640
2 1053_at 10.447158 10.799832 9.765222 10.069229 10.089496 RFC2 RFC2
3 117_at 4.500872 2.702614 3.015139 2.768683 2.853809 HSPA6 HSPA6
4 121_at 3.057630 3.072568 3.057630 3.057630 3.057630 PAX8 PAX8
5 1255_g_at 2.278998 2.278998 2.278998 2.278998 2.278998 GUCA1A GUCA1A
6 1294_at 4.978118 4.713441 4.303196 4.452989 4.766019 <NA> MIR5193 /// UBA7
[1] 23521 8
从上面可以看到所有芯片ID都注释到了,没有NA,其中没有重复的有23521个(其中2796个是多gene symbol,因此单个gene symbol 20725个),较R包注释到的(20276)要多一些。
ID转换
接下来我们将上面注释到的Gene symbol转换为Entrez ID,使用的是HGNC提供的ID对应文件。
> hgnc <- read.csv('D:/TCGA/hgnc_complete_set.txt',header = T,stringsAsFactors = F,sep='\t');colnames(hgnc)
[1] "hgnc_id" "symbol" "name" "locus_group" "locus_type" "status"
[7] "location" "location_sortable" "alias_symbol" "alias_name" "prev_symbol" "prev_name"
[13] "gene_family" "gene_family_id" "date_approved_reserved" "date_symbol_changed" "date_name_changed" "date_modified"
[19] "entrez_id" "ensembl_gene_id" "vega_id" "ucsc_id" "ena" "refseq_accession"
[25] "ccds_id" "uniprot_ids" "pubmed_id" "mgd_id" "rgd_id" "lsdb"
[31] "cosmic" "omim_id" "mirbase" "homeodb" "snornabase" "bioparadigms_slc"
[37] "orphanet" "pseudogene.org" "horde_id" "merops" "imgt" "iuphar"
[43] "kznf_gene_catalog" "mamit.trnadb" "cd" "lncrnadb" "enzyme_id" "intermediate_filament_db"
[49] "rna_central_ids"
> destid <- hgnc[c('symbol','entrez_id','ensembl_gene_id')];head(destid)
symbol entrez_id ensembl_gene_id
1 A1BG 1 ENSG00000121410
2 A1BG-AS1 503538 ENSG00000268895
3 A1CF 29974 ENSG00000148584
4 A2M 2 ENSG00000175899
5 A2M-AS1 144571 ENSG00000245105
6 A2ML1 144568 ENSG00000166535
# ID 转换
> trans <- merge(x=exprmt[!duplicated(exprmt$symbol.RP),], y=destid, by.x='symbol.RP',by.y='symbol');cat('exprmt.rmdup\n');dim(exprmt[!duplicated(exprmt$symbol.RP),]);cat('destid\n');dim(destid);cat('trans\n');dim(trans)
exprmt.rmdup
[1] 20193 7
destid
[1] 42508 3
trans
[1] 19062 9
> trans.rmdup <- trans[!duplicated(trans$symbol.RP),]; dim(trans.rmdup)
[1] 19061 9
> trans.rmdup <- trans[!duplicated(trans$entrez_id),]; dim(trans.rmdup)
[1] 19060 9
可以看到,使用HGNC提供的ID转换文件,我们共得到了19062个无重复ID,比上面注释到的少了快2000个,对于这些差异基因有机会在探究一下。