biomaRt: Using the R Package for Gene ID Conversion

2019-03-16 drlee_fc74

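A gene carries a lot of information, including many different IDs. Take the most basic one, the gene ID: the same gene has a different ID in each database. For example, the gene PTEN has the Entrez ID 5728, while its Ensembl ID is ENSG00000171862.

Before we start:

There are many tools for converting gene IDs; DAVID is a commonly used one. This post covers BioMart, Ensembl's tool for converting between all kinds of gene identifiers. It does more than convert gene IDs: given a gene identifier, it can also return other information about the gene, such as its sequence or its biotype. BioMart comes in two forms, a web interface and an R package. This post introduces the R package.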

Installing biomaRt

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("biomaRt")  # installs the release that matches your Bioconductor version

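Basic usage

Choosing a database

The first step in using biomaRt is to choose the database you want to query. This is done with useMart, whose signature (with default values) is: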

    useMart(biomart, dataset, host="www.ensembl.org",
    path="/biomart/martservice", port=80, archive=FALSE, ssl.verifypeer =
    TRUE, ensemblRedirect = NULL, version, verbose = FALSE)

The main arguments you need to set are:

biomart: the name of the BioMart database to connect to. The listMarts function shows all available databases.

listMarts()
               biomart               version
 ENSEMBL_MART_ENSEMBL      Ensembl Genes 95
   ENSEMBL_MART_MOUSE      Mouse strains 95
     ENSEMBL_MART_SNP  Ensembl Variation 95
 ENSEMBL_MART_FUNCGEN Ensembl Regulation 95

dataset: selects a dataset within the database chosen above. All available values can be viewed with listDatasets, like this:
```
mart = useMart("ensembl")  ### choose the database
listDatasets(mart)  ### list all of its datasets
```
Going through all the datasets one by one is tedious, so we can also use searchDatasets to find the datasets that match a keyword. PS: this function supports regular expressions.
```
#### Find the human dataset
searchDatasets(mart = mart, pattern = "hsapiens")
                 dataset              description    version
58 hsapiens_gene_ensembl Human genes (GRCh38.p12) GRCh38.p12
```
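
To make the regular-expression remark concrete, here is a rough base-R imitation of that kind of search, run on a small hand-made table so that it works without a connection. The helper search_table and the three-row table are assumptions made up for this post; this is not biomaRt's own implementation, and the real listDatasets table is much larger.

```r
# Hand-made stand-in for the table returned by listDatasets(); the real table
# has many more rows and its contents differ between Ensembl releases.
datasets <- data.frame(
  dataset = c("hsapiens_gene_ensembl", "mmusculus_gene_ensembl", "drerio_gene_ensembl"),
  description = c("Human genes", "Mouse genes", "Zebrafish genes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows in which any column matches the
# regular expression (case-insensitive).
search_table <- function(tbl, pattern) {
  hit <- Reduce(`|`, lapply(tbl, function(col) grepl(pattern, col, ignore.case = TRUE)))
  tbl[hit, , drop = FALSE]
}

# One pattern with alternation selects the human and mouse rows together.
human_or_mouse <- search_table(datasets, "^(hsapiens|mmusculus)_")
print(human_or_mouse$dataset)
```

With a live connection, the same pattern can be passed straight to the package function, for example searchDatasets(mart = mart, pattern = "^(hsapiens|mmusculus)_").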

Once we know which database and dataset we want, for example to build a mart for human, we can create it like this:

ensembl = useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl")
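
useMart talks to a remote server (the host argument shown earlier), so the call can fail for reasons that have nothing to do with your code. Below is a minimal sketch of a retry wrapper in base R. The names with_retry and flaky_connect are hypothetical and made up for this post, not part of biomaRt, and the demonstration uses a stand-in function so that it runs offline.

```r
# Hypothetical helper (not part of biomaRt): call f() until it succeeds
# or the attempts run out, pausing `wait` seconds between tries.
with_retry <- function(f, attempts = 3, wait = 1) {
  stopifnot(attempts >= 1)
  last_error <- NULL
  for (i in seq_len(attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    message("Attempt ", i, " failed: ", conditionMessage(result))
    if (i < attempts) Sys.sleep(wait)
  }
  stop("All ", attempts, " attempts failed: ", conditionMessage(last_error))
}

# Offline demonstration with a stand-in that fails twice, then succeeds.
state <- new.env()
state$calls <- 0
flaky_connect <- function() {
  state$calls <- state$calls + 1
  if (state$calls < 3) stop("temporary failure")
  "connected"
}
outcome <- with_retry(flaky_connect, attempts = 5, wait = 0)
print(outcome)
```

In real use the wrapped function would be the connection itself, for example with_retry(function() useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")).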

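Choosing input and output types

Once the database is chosen, we use the get* functions to retrieve results. To get what we want, we have to choose the type of the input data (the filters) and the types of output data (the attributes). listFilters lists all available input types and listAttributes lists all available output types. In the same way, searchFilters and searchAttributes search them by keyword.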

### Available input types (filters)
listFilters(ensembl)[1:5,]
             name              description
1 chromosome_name Chromosome/scaffold name
2           start                    Start
3             end                      End
4      band_start               Band Start
5        band_end                 Band End
### Search the filters by keyword
searchFilters(mart = ensembl, 'entrez|hgnc')
                         name                                      description
17 with_entrezgene_trans_name            With EntrezGene transcript name ID(s)
23                  with_hgnc                           With HGNC Symbol ID(s)
24       with_hgnc_trans_name                  With HGNC transcript name ID(s)
34            with_entrezgene                             With NCBI gene ID(s)
70      entrezgene_trans_name EntrezGene transcript name ID(s) [e.g. AA06-201]
76                    hgnc_id                       HGNC ID(s) [e.g. HGNC:100]
77                hgnc_symbol                       HGNC symbol(s) [e.g. A1BG]
78            hgnc_trans_name       HGNC transcript name ID(s) [e.g. A1BG-201]
89                 entrezgene                         NCBI gene ID(s) [e.g. 1]
### Available output types (attributes)
listAttributes(ensembl)[1:5,]
                           name                  description         page
1               ensembl_gene_id               Gene stable ID feature_page
2       ensembl_gene_id_version       Gene stable ID version feature_page
3         ensembl_transcript_id         Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5            ensembl_peptide_id            Protein stable ID feature_page
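
Since the filter and attribute names have to be spelled exactly as they appear in these listings, it can help to check them before sending a query. The sketch below is one way to do that in base R. The helper check_names is a hypothetical name made up for this post, and the available vector is a short stand-in for listAttributes(ensembl)$name built from the names in the listing above.

```r
# Hypothetical helper (not part of biomaRt): stop early if any requested
# filter or attribute name is not among the available ones.
check_names <- function(requested, available, what = "attribute") {
  unknown <- setdiff(requested, available)
  if (length(unknown) > 0) {
    stop("Unknown ", what, "(s): ", paste(unknown, collapse = ", "))
  }
  invisible(TRUE)
}

# Stand-in for listAttributes(ensembl)$name, taken from the listing above.
available <- c("ensembl_gene_id", "ensembl_gene_id_version", "ensembl_transcript_id")

# Correctly spelled names pass silently.
ok <- check_names(c("ensembl_gene_id", "ensembl_transcript_id"), available)

# A misspelled name is reported before any query is sent.
typo <- tryCatch(check_names("ensembl_geen_id", available),
                 error = function(e) conditionMessage(e))
print(typo)
```

With a live mart, the second argument would be listAttributes(ensembl)$name for attributes or listFilters(ensembl)$name for filters.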

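Retrieving results (getBM)

This is the main function. You set the input type and the output types, supply the actual input values, and get the results back. Its signature is: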

 getBM(attributes, filters = "", values = "", mart, curl = NULL,
    checkFilters = TRUE, verbose = FALSE, uniqueRows = TRUE, bmHeader = FALSE,
    quote = "\"")

For example:

### Look up genes by Entrez ID
entrezID <- c("672", "1")  ## the Entrez IDs to query
### The filter must match the ID type of the values; newer Ensembl releases may name it "entrezgene_id" (check with searchFilters)
getBM(attributes = c("entrezgene", "external_gene_name", "gene_biotype"), filters = "entrezgene", values = entrezID, mart = ensembl)
### Select genes by chromosome and start/end coordinates
getBM(c('affy_hg_u133_plus_2', 'ensembl_gene_id'), filters = c('chromosome_name', 'start', 'end'), values = list(16, 1100000, 1250000), mart = ensembl)
### Find all genes located in cytogenetic band 6p21.1
getBM(c('entrezgene', 'hgnc_symbol', 'transcript_biotype', 'chromosome_name', 'start_position', 'end_position', 'band'), filters = c('chromosome_name', 'band_start', 'band_end'), values = list(6, 'p21.1', 'p21.1'), mart = ensembl)
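
getBM returns one row per match, so a single input ID can come back as several rows, or not at all if nothing matched. The sketch below shows one way to line the results up with the original input, using a hand-made data frame in place of a real getBM result so that it runs offline. Only the PTEN pair (5728, ENSG00000171862) comes from the text above. The other rows are made-up placeholders, and collapse_mapping is a hypothetical helper, not a biomaRt function.

```r
# Hand-made stand-in for a getBM() result. Only the PTEN row is real;
# the "111" rows are made-up placeholders that imitate a one-to-many mapping.
mapping <- data.frame(
  entrezgene      = c("5728", "111", "111"),
  ensembl_gene_id = c("ENSG00000171862", "ENSG_PLACEHOLDER_A", "ENSG_PLACEHOLDER_B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each input ID, join all distinct matches with ";"
# and return NA when the ID has no match.
collapse_mapping <- function(ids, mapping, from, to) {
  vapply(ids, function(id) {
    hits <- unique(mapping[[to]][mapping[[from]] == id])
    hits <- hits[!is.na(hits) & hits != ""]
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = ";")
  }, character(1))
}

query <- c("5728", "111", "999")  # "999" is deliberately absent from the table
result <- collapse_mapping(query, mapping, from = "entrezgene", to = "ensembl_gene_id")
print(result)
```

The result is a named character vector in the same order as the query, which makes it easy to attach as a new column to the table the IDs came from.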

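Retrieving sequences

Besides converting gene IDs, biomaRt can also retrieve the sequence for a given position or ID. getSequence returns the sequences, and exportFASTA then writes them out in FASTA format.

The main arguments of getSequence are: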

getSequence(chromosome, start, end, id, type, seqType,
                       upstream, downstream, mart, verbose = FALSE)

type: the type of the input identifier. The main options are hugo, ensembl, embl, entrezgene, refseq, ensemblTrans and unigene. Further options for type can be found with the listFilters function.
seqType: the main options are:
- 'cdna': for nucleotide sequences
- 'peptide': for protein sequences
- '3utr': for 3' UTR sequences
- '5utr': for 5' UTR sequences
- 'gene_exon': for exon sequences only
- 'transcript_exon_intron': gives the full unspliced transcript, that is exons + introns
- 'gene_exon_intron': gives the exons + introns of a gene
- 'coding': gives the coding sequence only
- 'coding_transcript_flank': gives the flanking region of the transcript including the UTRs, this must be accompanied with a given value for the upstream or downstream attribute
- 'coding_gene_flank': gives the flanking region of the gene including the UTRs, this must be accompanied with a given value for the upstream or downstream attribute
- 'transcript_flank': gives the flanking region of the transcript excluding the UTRs, this must be accompanied with a given value for the upstream or downstream attribute
- 'gene_flank': gives the flanking region of the gene excluding the UTRs, this must be accompanied with a given value for the upstream or downstream attribute

For example:

mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
seq = getSequence(id = "BRCA1",
                  type = "hgnc_symbol",
                  seqType = "peptide",
                  mart = mart)
show(seq)
exportFASTA(seq, file = "test.fasta")
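
The flank options in the seqType list only work together with an upstream or downstream value. Here is a minimal sketch of a wrapper that enforces this before contacting the server. The function name fetch_upstream_flank and the default of 100 bases are assumptions made up for this post. The query itself needs the biomaRt package and a live connection, so only the function definition runs here.

```r
# Hypothetical wrapper (not part of biomaRt): fetch the region upstream of a
# gene, including the UTRs, and refuse to run without a valid upstream length.
fetch_upstream_flank <- function(symbol, mart, upstream = 100) {
  stopifnot(is.character(symbol), length(symbol) >= 1,
            is.numeric(upstream), length(upstream) == 1, upstream > 0)
  biomaRt::getSequence(id = symbol,
                       type = "hgnc_symbol",
                       seqType = "coding_gene_flank",
                       upstream = upstream,
                       mart = mart)
}

# Usage with a live connection (not run here):
# mart  <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# flank <- fetch_upstream_flank("BRCA1", mart, upstream = 100)
# biomaRt::exportFASTA(flank, file = "brca1_upstream.fasta")
```

Checking the arguments up front means a mistake such as a zero or missing upstream length is caught locally, before any request is sent.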

Results across multiple species (getLDS)

getLDS links two datasets, so it can return results across species.

human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
getLDS(attributes = c("hgnc_symbol","chromosome_name", "start_position"),
    filters = "hgnc_symbol", values = "TP53", mart = human,
    attributesL = c("chromosome_name","start_position"), martL = mouse)