GenomicFeatures: 用R愉快检索数据库PLUS
😑
导入及检索 Gene Model 的便捷工具
Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan M, Carey V (2013). “Software for Computing and Annotating Genomic Ranges.” PLoS Computational Biology, 9. doi: 10.1371/journal.pcbi.1003118,
A Gene Model is defined as any description of a gene product from a variety of sources including computational prediction, mRNA sequencing, or genetic characterization.
1. 关于 GenomicFeatures
GenomicFeatures
可以从 UCSC Genome Bioinformatics 及 BioMart 等数据库检索并获取转录本相关信息并对其进行操作,有助于 ChIP-chip, ChIP-seq, RNA-seq 结果分析。
GenomicFeatures
利用 TxDb
对象储存转录本元数据,包括 5', 3' UTRs, CDS, 外显子等。同时 TxDb
对象有大量访问函数可获取这些信息。
library(GenomicFeatures)
ls("package:GenomicFeatures")
# [1] "as.list" "asBED"
# [3] "asGFF" "browseUCSCtrack"
# [5] "cds" "cdsBy"
# [7] "cdsByOverlaps" "coverageByTranscript"
# [9] "DEFAULT_CIRC_SEQS" "disjointExons"
# [11] "distance" "exonicParts"
# [13] "exons" "exonsBy"
# [15] "exonsByOverlaps" "extractTranscriptSeqs"
# [17] "extractUpstreamSeqs" "features"
# [19] "fiveUTRsByTranscript" "genes"
# [21] "getChromInfoFromBiomart" "getChromInfoFromUCSC"
# [23] "getPromoterSeq" "id2name"
# [25] "intronicParts" "intronsByTranscript"
# [27] "isActiveSeq" "isActiveSeq<-"
# [29] "makeFDbPackageFromUCSC" "makeFeatureDbFromUCSC"
# [31] "makePackageName" "makeTxDb"
# [33] "makeTxDbFromBiomart" "makeTxDbFromEnsembl"
# [35] "makeTxDbFromGFF" "makeTxDbFromGRanges"
# [37] "makeTxDbFromUCSC" "makeTxDbPackage"
# [39] "makeTxDbPackageFromBiomart" "makeTxDbPackageFromUCSC"
# [41] "mapFromTranscripts" "mapIdsToRanges"
# [43] "mapRangesToIds" "mapToTranscripts"
# [45] "microRNAs" "organism"
# [47] "pcoverageByTranscript" "pmapFromTranscripts"
# [49] "pmapToTranscripts" "promoters"
# [51] "seqinfo" "seqlevels<-"
# [53] "seqlevels0" "show"
# [55] "species" "supportedMiRBaseBuildValues"
# [57] "supportedUCSCFeatureDbTables" "supportedUCSCFeatureDbTracks"
# [59] "supportedUCSCtables" "threeUTRsByTranscript"
# [61] "tidyExons" "tidyIntrons"
# [63] "tidyTranscripts" "transcriptLengths"
# [65] "transcriptLocs2refLocs" "transcripts"
# [67] "transcriptsBy" "transcriptsByOverlaps"
# [69] "transcriptWidths" "tRNAs"
# [71] "UCSCFeatureDbTableSchema"
2. 从 TxDb
获取数据
2.1 加载转录本数据
loadDb()
- 直接加载 TxDb 注释包
samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb_1 <- loadDb(samplefile)
txdb_1
# TxDb object:
# # Db type: TxDb
# # Supporting package: GenomicFeatures
# # Data source: UCSC
# # Genome: hg19
# # Organism: Homo sapiens
# # UCSC Table: knownGene
# # Resource URL: http://genome.ucsc.edu/
# # Type of Gene ID: Entrez Gene ID
# # Full dataset: no
# # miRBase build ID: NA
# # transcript_nrow: 178
# # exon_nrow: 620
# # cds_nrow: 523
# # Db created by: GenomicFeatures package from Bioconductor
# # Creation time: 2014-10-08 10:31:15 -0700 (Wed, 08 Oct 2014)
# # GenomicFeatures version at creation time: 1.17.21
# # RSQLite version at creation time: 0.11.4
# # DBSCHEMAVERSION: 1.0
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb
# TxDb object:
# # Db type: TxDb
# # Supporting package: GenomicFeatures
# # Data source: UCSC
# # Genome: hg19
# # Organism: Homo sapiens
# # Taxonomy ID: 9606
# # UCSC Table: knownGene
# # Resource URL: http://genome.ucsc.edu/
# # Type of Gene ID: Entrez Gene ID
# # Full dataset: yes
# # miRBase build ID: GRCh37
# # transcript_nrow: 82960
# # exon_nrow: 289969
# # cds_nrow: 237533
# # Db created by: GenomicFeatures package from Bioconductor
# # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
# # GenomicFeatures version at creation time: 1.21.30
# # RSQLite version at creation time: 1.0.0
# # DBSCHEMAVERSION: 1.1
2.2 基于染色体的数据预筛选
seqlevels()
查看染色体:
head(seqlevels(txdb))
# [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
通过赋值选择染色体:
seqlevels(txdb) <- "chr1"
如果需要设置回全部染色体的状态,使用函数 seqlevels0()
:
seqlevels(txdb) <- seqlevels0(txdb)
2.3 数据筛选
AnnotationDbi
包内的 columns()
, keytypes()
, keys()
, select()
同样适用于 TxDb
对象。
## GENEID → TXNAME
keys <- c("100033416", "100033417", "100033420")
columns(txdb)
# [1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME"
# [5] "CDSSTART" "CDSSTRAND" "EXONCHROM" "EXONEND"
# [9] "EXONID" "EXONNAME" "EXONRANK" "EXONSTART"
# [13] "EXONSTRAND" "GENEID" "TXCHROM" "TXEND"
# [17] "TXID" "TXNAME" "TXSTART" "TXSTRAND"
# [21] "TXTYPE"
keytypes(txdb)
# [1] "CDSID" "CDSNAME" "EXONID" "EXONNAME" "GENEID"
# [6] "TXID" "TXNAME"
select(txdb,
keys = keys,
columns = "TXNAME",
keytype = "GENEID")
# 'select()' returned 1:1 mapping between keys and columns
# GENEID TXNAME
# 1 100033416 uc001yxl.4
# 2 100033417 uc001yxo.3
# 3 100033420 uc001yxr.3
## GENEID → TXNAME + TXTSTRAND + TXCHROM
cols <- c("TXNAME", "TXSTRAND", "TXCHROM")
select(txdb,
keys = keys,
columns = cols,
keytype = "GENEID")
# 'select()' returned 1:1 mapping between keys and columns
# GENEID TXNAME TXCHROM TXSTRAND
# 1 100033416 uc001yxl.4 chr15 +
# 2 100033417 uc001yxo.3 chr15 +
# 3 100033420 uc001yxr.3 chr15 +
3. 主要函数及示例
Extract basic quantities
genes()
transcripts()
cds()
exons()
microRNAs()
tRNAs()
promoters()
Extract quantities and group
transcriptsBy(by = c("gene", "exon", "cds"))
cdsBy(by = c("tx", "gene"))
exonsBy(by = c("tx", "gene"))
intronsByTranscript()
fiveUTRsByTranscript()
threeUTRsByTranscript()
(Note: there are grouping functions without the non-grouping function and vice versa; there is for example no
introns()
function.Other functions
transcriptLengths()
(optionally include CDS length etc).XXByOverlaps()
(select features based on overlaps withXX
beingtranscript
,cds
orexon
).Mapping between genome and transcript coordinates
extractTranscriptSeqs()
(getting RNA sequencing of the transcripts).
3.1 返回 GRanges
对象
除了2.3 的数据筛选方法,直接提取出 GRanges
对象的方法可能更方便。"Extract basic quantities" 下的所有函数都可返回含有基因组坐标、外显子区间、转录本、编码序列等信息的 GRanges
对象。
3.1.1 genes()
GE <- genes(txdb)
# [1] "GRanges"
# attr(,"package")
# [1] "GenomicRanges"
GE[1:5]
# GRanges object with 5 ranges and 1 metadata column:
# seqnames ranges strand | gene_id
# <Rle> <IRanges> <Rle> | <character>
# 1 chr19 58858172-58874214 - | 1
# 10 chr8 18248755-18258723 + | 10
# 100 chr20 43248163-43280376 - | 100
# 1000 chr18 25530930-25757445 - | 1000
# 10000 chr1 243651535-244006886 - | 10000
# -------
# seqinfo: 93 sequences (1 circular) from hg19 genome
3.1.2 transcripts()
TC <- transcripts(txdb)
TC[1:3]
# GRanges object with 3 ranges and 2 metadata columns:
# seqnames ranges strand | tx_id tx_name
# <Rle> <IRanges> <Rle> | <integer> <character>
# [1] chr1 11874-14409 + | 1 uc001aaa.3
# [2] chr1 11874-14409 + | 2 uc010nxq.1
# [3] chr1 11874-14409 + | 3 uc010nxr.1
# -------
# seqinfo: 93 sequences (1 circular) from hg19 genome
## strand information
strand(TC)
# factor-Rle of length 82960 with 94 runs
# Lengths: 4073 3894 2642 2450 2179 ... 2 1 2 2 1
# Values : + - + - + ... - + - + -
# Levels(3): + - *
length(TC)
# 82960
用 filter=
进一步筛选:
TC_select <- transcripts(txdb,filter = list(tx_chrom = "chr15",
tx_strand = "+"))
length(TC_select)
# [1] 1732
3.1.3 promoters()
PR <- promoters(txdb, upstream = 2000, downstream = 400)
PR[1:3]
# GRanges object with 3 ranges and 2 metadata columns:
# seqnames ranges strand | tx_id tx_name
# <Rle> <IRanges> <Rle> | <integer> <character>
# uc001aaa.3 chr1 9874-12273 + | 1 uc001aaa.3
# uc010nxq.1 chr1 9874-12273 + | 2 uc010nxq.1
# uc010nxr.1 chr1 9874-12273 + | 3 uc010nxr.1
# -------
# seqinfo: 93 sequences (1 circular) from hg19 genome
3.1.4 另一种方法
gr <- GRanges(seqnames = "chr1",
strand = "+",
ranges = IRanges(start = 11874, end = 14409))
subsetByOverlaps(exons(txdb), gr)
# GRanges object with 6 ranges and 1 metadata column:
# seqnames ranges strand | exon_id
# <Rle> <IRanges> <Rle> | <integer>
# [1] chr1 11874-12227 + | 1
# [2] chr1 12595-12721 + | 2
# [3] chr1 12613-12721 + | 3
# [4] chr1 12646-12697 + | 4
# [5] chr1 13221-14409 + | 5
# [6] chr1 13403-14409 + | 6
# -------
# seqinfo: 93 sequences (1 circular) from hg19 genome
当然中间的 exons()
可以是任意一个"Extract basic quantities" 下的函数。
3.2 特征分组
Extract quantities and group 下的函数可返回 GRangesList
对象,每个函数都有特定的 by=
参数选项。
TC_bygene <- transcriptsBy(txdb, by = "gene")
class(TC_bygene)
# [1] "CompressedGRangesList"
# attr(,"package")
# [1] "GenomicRanges"
length(TC_bygene)
# [1] 23459
names(TC_bygene)[10:13]
# [1] "10003" "100033413" "100033414" "100033415"
TC_bycds <- transcriptsBy(txdb, by = "cds")
length(TC_bycds)
# [1] 237533
names(TC_bycds)[10:13]
# [1] "10" "11" "12" "13"
TC_bycds[1:3]
# GRangesList object of length 3:
# $1
# GRanges object with 1 range and 3 metadata columns:
# seqnames ranges strand | tx_id tx_name exon_rank
# <Rle> <IRanges> <Rle> | <integer> <character> <integer>
# [1] chr1 11874-14409 + | 2 uc010nxq.1 1
#
# $2
# GRanges object with 1 range and 3 metadata columns:
# seqnames ranges strand | tx_id tx_name exon_rank
# [1] chr1 11874-14409 + | 2 uc010nxq.1 2
#
# $3
# GRanges object with 1 range and 3 metadata columns:
# seqnames ranges strand | tx_id tx_name exon_rank
# [1] chr1 11874-14409 + | 2 uc010nxq.1 3
#
# -------
# seqinfo: 93 sequences (1 circular) from hg19 genome
同样的,另一种方法:
gr <- GRanges(seqnames = "chr1",
strand = "+",
ranges = IRanges(start = 11874, end = 14409))
subsetByOverlaps(cdsBy(txdb, by = "tx"), gr)
# GRangesList object of length 1:
# $2
# GRanges object with 3 ranges and 3 metadata columns:
# seqnames ranges strand | cds_id cds_name exon_rank
# <Rle> <IRanges> <Rle> | <integer> <character> <integer>
# [1] chr1 12190-12227 + | 1 <NA> 1
# [2] chr1 12595-12721 + | 2 <NA> 2
# [3] chr1 13403-13639 + | 3 <NA> 3
#
# -------
# seqinfo: 93 sequences (1 circular) from hg19 genome
3.3 一步获得非编码区及内含子的长度
intronsByTranscript()
, fiveUTRsByTranscript()
, threeUTRsByTranscript()
是分组函数的“预先设定”版。
intronsByTranscript(txdb)
# GRangesList object of length 82960:
# $1
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 12228-12612 +
# [2] chr1 12722-13220 +
#
# $2
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# [1] chr1 12228-12594 +
# [2] chr1 12722-13402 +
#
# $3
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# [1] chr1 12228-12645 +
# [2] chr1 12698-13220 +
#
# ...
# <82957 more elements>
# -------
# seqinfo: 93 sequences (1 circular) from hg19 genome
length(intronsByTranscript(txdb))
# [1] 82960
length(fiveUTRsByTranscript(txdb))
# [1] 61495
length(threeUTRsByTranscript(txdb))
# [1] 60740
4. 获取序列
当和合适的 Bsgenome
包配合使用,GenomicFeatures
也可以直接获取到序列数据。
library(BSgenome.Hsapiens.UCSC.hg19)
获取转录本序列:
tx_seqs1 <- extractTranscriptSeqs(Hsapiens,
TxDb.Hsapiens.UCSC.hg19.knownGene,
use.names=TRUE)
tx_seqs1
# A DNAStringSet instance of length 82960
# width seq names
# [1] 1652 CTTGCCGTCAGCCTTTT...ACACTGTTGGTTTCTG uc001aaa.3
# [2] 1488 CTTGCCGTCAGCCTTTT...ACACTGTTGGTTTCTG uc010nxq.1
# [3] 1595 CTTGCCGTCAGCCTTTT...ACACTGTTGGTTTCTG uc010nxr.1
# [4] 918 ATGGTGACTGAATTCAT...TAGTGTAAAGTTTTAG uc001aal.1
# [5] 32 TACAGACCAAGCTCATGACTCACAATGGCCTA uc001aaq.2
# ... ... ...
# [82956] 1217 GCCAGTTTAGGGTCTCT...TCCGCCTCCTGAGATC uc011mgu.1
# [82957] 737 CTCCACTTCTGATCCTC...TTTATTAGATGCAGTG uc011mgv.2
# [82958] 30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA uc011mgw.1
# [82959] 30 TTGCAGAGGTGGCTGGTTGCTCTTTGAGCC uc022brq.1
# [82960] 30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA uc022brr.1
还可以翻译:
translate(tx_seqs1)
# A AAStringSet instance of length 82960
# width seq names
# [1] 550 LAVSLFFDLFFLFMCIC...LHPAQLEILY*KHTVGF uc001aaa.3
# [2] 496 LAVSLFFDLFFLFMCIC...ASCTARDPLLKAHCWFL uc010nxq.1
# [3] 531 LAVSLFFDLFFLFMCIC...LHPAQLEILY*KHTVGF uc010nxr.1
# [4] 306 MVTEFIFLGLSDSQELQ...AIRQLRKWDAHSSVKF* uc001aal.1
# [5] 10 YRPSS*LTMA uc001aaq.2
# ... ... ...
# [82956] 405 ASLGSLVSPAELLCSRL...LRRASMATDHCNLRLLR uc011mgu.1
# [82957] 245 LHF*SSPPWCCSGSHTR...KWENGFLGLK*PLY*MQ uc011mgv.2
# [82958] 10 W*ISAKVAKE uc011mgw.1
# [82959] 10 MQRWLVAL*A uc022brq.1
# [82960] 10 W*ISAKVAKE uc022brr.1
References
-
Making and Utilizing TxDb Objects https://bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf
-
GenomicFeatures Man Pages https://kasperdanielhansen.github.io/genbioconductor/html/GenomicFeatures.html#examples
-
Using the Gene Search Results https://www.arabidopsis.org/help/helppages/generesu.jsp
最后,向大家隆重推荐生信技能树的一系列干货!
- 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
- B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
- 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw