47.《Bioinformatics Data Skills》之
2021-07-16 本文已影响0人
DataScience
GRangesList对象
与R的list
概念类似,GRangesList
是GRanges
对象集合的容器:
> gr1 <- GRanges(c("chr1", "chr2"), IRanges(start=c(32, 95), width=c(24, 123)))
> gr2 <- GRanges(c("chr8", "chr2"), IRanges(start=c(27, 12), width=c(42, 34)))
> grl <- GRangesList(gr1, gr2)
> grl
GRangesList object of length 2:
[[1]]
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 32-55 *
[2] chr2 95-217 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
[[2]]
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr8 27-68 *
[2] chr2 12-45 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
多个GRangesList
可以通过c()
函数合并:
> double_grl <- c(grl, grl)
> double_grl
GRangesList object of length 4:
[[1]]
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 32-55 *
[2] chr2 95-217 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
[[2]]
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr8 27-68 *
[2] chr2 12-45 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
[[3]]
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 32-55 *
[2] chr2 95-217 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
[[4]]
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr8 27-68 *
[2] chr2 12-45 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
使用unlist
可以合并并返回GRanges
对象:
> unlist(double_grl)
GRanges object with 8 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 32-55 *
[2] chr2 95-217 *
[3] chr8 27-68 *
[4] chr2 12-45 *
[5] chr1 32-55 *
[6] chr2 95-217 *
[7] chr8 27-68 *
[8] chr2 12-45 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
GRangesList提取
类似地,GRangesList
使用单方括号返回GRangesList
对象,双方括号返回GRanges
对象:
> grl[[2]]
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr8 27-68 *
[2] chr2 12-45 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
提取函数(例如seqnames
, start
, width
等)可以直接用在GRangesList
对象上:
> seqnames(grl)
RleList of length 2
[[1]]
factor-Rle of length 2 with 2 runs
Lengths: 1 1
Values : chr1 chr2
Levels(3): chr1 chr2 chr8
[[2]]
factor-Rle of length 2 with 2 runs
Lengths: 1 1
Values : chr8 chr2
Levels(3): chr1 chr2 chr8
> start(grl)
IntegerList of length 2
[[1]] 32 95
[[2]] 27 12
返回值为Rle
对象与整数的list
(RleList
和IntegerList
),这种特殊的数据结构存在补充的操作方式,可以自行了解一下。
split
split
函数用于分组,通过一个例子来说明,例如我们创建如下的GRanges
对象:
> chrs <- c("chr3", "chr1", "chr2", "chr2", "chr3", "chr1")
> gr <- GRanges(chrs, IRanges(sample(1:100, 6, replace=TRUE),
+ width=sample(3:30, 6, replace=TRUE)))
> head(gr)
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr3 23-45 *
[2] chr1 44-53 *
[3] chr2 44-52 *
[4] chr2 41-55 *
[5] chr3 73-98 *
[6] chr1 13-20 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
我们知道使用split
函数可以对data.frame
对象根据某一列进行分组,同样可以用于GRanges
分组:
> gr_split <- split(gr, seqnames(gr))
> gr_split
GRangesList object of length 3:
$chr3
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr3 23-45 *
[2] chr3 73-98 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
$chr1
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 44-53 *
[2] chr1 13-20 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
$chr2
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 44-52 *
[2] chr2 41-55 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
> names(gr_split)
[1] "chr3" "chr1" "chr2"
补充:已分组的对象,可以使用unsplit(gr_split, seqnames(gr))
逆向操作。
分组的意义在于:
- 方便地提取自己感兴趣的对象,例如通过基因分组, 可以直接获取某个基因的所有外显子区域;
- 分组还可以方便我们使用
apply
函数对每组对象进行操作,例如:
返回每个染色体上的范围的排序值(从小到大):
> lapply(gr_split, function(x) order(width(x)))
$chr3
[1] 1 2
$chr1
[1] 2 1
$chr2
[1] 1 2
确定每条染色体上范围的起始位点:
> sapply(gr_split, function(x) min(start(x)))
chr3 chr1 chr2
23 13 41
确定每条染色体上的范围数目:
> sapply(gr_split, function(x) length(x))
chr3 chr1 chr2
2 2 2
此外,一些函数直接对GRangesList
进行操作(包括reduce
, flank
, coverage
, findOverlaps
等),例如使用reduce
合并每个分组(即每条染色体)上重叠的范围:
> reduce(gr_split)
GRangesList object of length 3:
$chr3
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr3 23-45 *
[2] chr3 73-98 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
$chr1
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 13-20 *
[2] chr1 44-53 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
$chr2
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 41-55 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
类似地,使用findOverlaps
可以返回每个分组内的重叠区域,后面通过注释数据更具体的说明。