单细胞分析tips

跨物种同源基因转换(biomaRt和homologene)202

2022-06-17  本文已影响0人  黄甫一

适用背景

当我们进行跨物种比较分析时会发现基因名对不上,或者找不到基因,这时候就需要进行物种间的同源基因转换。最常用的工具就是Ensembl的biomaRt,另外还查到NCBI的一个工具homologene,因此本文将详细讲解这两个包。

方法一 biomaRt

biomaRt是基于Ensembl数据库的R包,当然也可以在网页查询同源基因,但如果基因较多,物种较多,网页版就比较麻烦。

> listMarts()
               biomart                version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 106
2   ENSEMBL_MART_MOUSE      Mouse strains 106
3     ENSEMBL_MART_SNP  Ensembl Variation 106
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 106
> umart = useMart('ensembl')
> datalist = listDatasets(umart)
> head(datalist)
                       dataset                           description
1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
2     acalliptera_gene_ensembl      Eastern happy genes (fAstCal1.2)
3   acarolinensis_gene_ensembl       Green anole genes (AnoCar2.0v2)
4    acchrysaetos_gene_ensembl       Golden eagle genes (bAquChr1.2)
5    acitrinellus_gene_ensembl        Midas cichlid genes (Midas_v5)
6    amelanoleuca_gene_ensembl       Giant panda genes (ASM200744v2)
      version
1 ASM259213v1
2  fAstCal1.2
3 AnoCar2.0v2
4  bAquChr1.2
5    Midas_v5
6 ASM200744v2
> dim(datalist)
[1] 215   3
> searchDatasets(mart = umart, pattern = "Mouse")
                   dataset                  description  version
107  mmurinus_gene_ensembl Mouse Lemur genes (Mmur_3.0) Mmur_3.0
108 mmusculus_gene_ensembl         Mouse genes (GRCm39)   GRCm39
mouse <- useMart('ensembl',dataset = "mmusculus_gene_ensembl")
macaque <- useMart('ensembl',dataset = "mfascicularis_gene_ensembl")

如果需要获取单个物种的某些基因信息,getBM可以实现,但需要指定输入的格式和输出的格式。

getBM(attributes, filters = "", values = "", mart, curl = NULL, 
checkFilters = TRUE, verbose = FALSE, uniqueRows = TRUE, bmHeader = FALSE,
quote = "\"", useCache = TRUE)
> head(listFilters(mouse))
             name              description
1 chromosome_name Chromosome/scaffold name
2           start                    Start
3             end                      End
4      band_start               Band Start
5        band_end                 Band End
6          strand                   Strand
> dim(listFilters(mouse))
[1] 400   2
> searchFilters(mart = mouse, 'gene_name')
                 name                           description
53 external_gene_name             Gene Name(s) [e.g. mt-Tf]
98      wikigene_name WikiGene name(s) [e.g. 0610005C13Rik]
> dim(listAttributes(mouse))
[1] 2987    3
> head(listAttributes(mouse))
                           name                  description         page
1               ensembl_gene_id               Gene stable ID feature_page
2       ensembl_gene_id_version       Gene stable ID version feature_page
3         ensembl_transcript_id         Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5            ensembl_peptide_id            Protein stable ID feature_page
6    ensembl_peptide_id_version    Protein stable ID version feature_page
> searchAttributes(mart = mouse, 'ensembl_gene_id')
                        name            description         page
1            ensembl_gene_id         Gene stable ID feature_page
2    ensembl_gene_id_version Gene stable ID version feature_page
183          ensembl_gene_id         Gene stable ID    structure
184  ensembl_gene_id_version Gene stable ID version    structure
223          ensembl_gene_id         Gene stable ID     homologs
224  ensembl_gene_id_version Gene stable ID version     homologs
2884         ensembl_gene_id         Gene stable ID          snp
2885 ensembl_gene_id_version Gene stable ID version          snp
2942         ensembl_gene_id         Gene stable ID    sequences
2943 ensembl_gene_id_version Gene stable ID version    sequences
> getBM(attributes = c("ensembl_gene_id","external_gene_name"),filters = "external_gene_name", values = c("Gad1","Sst"),mart = mouse,uniqueRows = T)
     ensembl_gene_id external_gene_name
1 ENSMUSG00000070880               Gad1
2 ENSMUSG00000004366                Sst

以下来自另一篇博主简书的用法

### 根据entrez ID号来找
entrzID=c("672","1") ##定义entrez ID
getBM(attributes=c("entrezgene","external_gene_name","gene_biotype"), filters = "ensembl_gene_id_version", values =entrzID, mart=human,uniqueRows=TRUE)
### 通过染色体及起始终止坐标来挑选基因
getBM(c('affy_hg_u133_plus_2','ensembl_gene_id'), filters = c('chromosome_name','start','end'), values=list(16,1100000,1250000), mart=human)
###根据染色体位置(6p21.1)查找所有位于其上的基因
getBM(c('entrezgene','hgnc_symbol', 'transcript_biotype', 'chromosome_name','start_position','end_position', 'band'), filters = c('chromosome_name','band_start','band_end'), values=list(6,'p21.1','p21.1'), mart=human)

获取不同物种的同源基因

getLDS(attributes, filters = "", values = "", mart, attributesL,
filtersL = "", valuesL = "", martL, verbose = FALSE, uniqueRows = TRUE, 
bmHeader=TRUE)

getBM类似,参考物种还是原来的参数,而查询物种则需要在原参数名加上L,例如参考物种是attributes,则查询物种是attributesL
使用示例

gene.mo2ma <- getLDS(attributes = c("external_gene_name"),filters = "external_gene_name",values = c("Gad1","Sst"),mart = mouse,attributesL = c("external_gene_name","chromosome_name"),martL = macaque,uniqueRows = T)
> gene.mo2ma <- getLDS(attributes = c("external_gene_name"),filters = "external_gene_name",values = c("Gad1","Sst"),mart = mouse,attributesL = c("external_gene_name","chromosome_name"),martL = macaque,uniqueRows = T)
> head(gene.mo2ma)
  Gene.name Gene.name.1 Chromosome.scaffold.name
1       Sst         SST                        2
2      Gad1        GAD1                       12
> gene.mo2ma <- getLDS(attributes = c("external_gene_name"),filters = "external_gene_name",values = c("Gad1","Sst"),mart = mouse,attributesL = c("external_gene_name","chromosome_name"),martL = macaque,uniqueRows = T)
错误: biomaRt has encountered an unexpected server error.
Consider trying one of the Ensembl mirrors (for more details look at ?useEnsembl)

查了谷歌后发现是网页自身的问题,在构建数据集的时候需更换2021年版本的一个网页才能正常运行,估计是2022年版本的bug。
解决办法是是用host参数指定2021年版本网页。

macaque <- useMart("ensembl", dataset = "mfascicularis_gene_ensembl", host = "https://dec2021.archive.ensembl.org/") 

方法二 homologene

biomaRt不同,homologene是基于NCBI HomoloGene数据库的一个包,感兴趣可以进官网了解一下此数据库,因此,两者做出来的结果会有差异,建议可以互为补充。
HomoloGene最新更新时间是2014-04-09,只有21个物种,可能还是biomaRt更完善一些。

> genelist=c('Gad','Sst')
> homologene(genelist, inTax = 10090, outTax = 9606)
  10090 9606 10090_ID 9606_ID
1   Sst  SST    20604    6750
10090   Mus musculus
10116   Rattus norvegicus
28985   Kluyveromyces lactis
318829  Magnaporthe oryzae
33169   Eremothecium gossypii
3702    Arabidopsis thaliana
4530    Oryza sativa
4896    Schizosaccharomyces pombe
4932    Saccharomyces cerevisiae
5141    Neurospora crassa
6239    Caenorhabditis elegans
7165    Anopheles gambiae
7227    Drosophila melanogaster
7955    Danio rerio
8364    Xenopus (Silurana) tropicalis
9031    Gallus gallus
9544    Macaca mulatta
9598    Pan troglodytes
9606    Homo sapiens
9615    Canis lupus familiaris
9913    Bos taurus   

小结与补充

总的来说biomaRt更加完善,功能更强大,所以更推荐biomaRt。但是只能对两个物种进行转换,如果是多个物种进行转换,该怎么做呢?下一篇文章试试写一下2个及以上物种的转换函数吧。

上一篇 下一篇

猜你喜欢

热点阅读