生物信息数据科学

47.《Bioinformatics Data Skills》之

2021-07-16  本文已影响0人  DataScience

GRangesList对象

与R的list概念类似,GRangesListGRanges对象集合的容器:

> gr1 <- GRanges(c("chr1", "chr2"), IRanges(start=c(32, 95), width=c(24, 123)))
> gr2 <- GRanges(c("chr8", "chr2"), IRanges(start=c(27, 12), width=c(42, 34)))
> grl <- GRangesList(gr1, gr2)
> grl
GRangesList object of length 2:
[[1]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     32-55      *
  [2]     chr2    95-217      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

[[2]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr8     27-68      *
  [2]     chr2     12-45      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

多个GRangesList可以通过c()函数合并:

> double_grl <- c(grl, grl)
> double_grl
GRangesList object of length 4:
[[1]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     32-55      *
  [2]     chr2    95-217      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

[[2]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr8     27-68      *
  [2]     chr2     12-45      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

[[3]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     32-55      *
  [2]     chr2    95-217      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

[[4]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr8     27-68      *
  [2]     chr2     12-45      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

使用unlist可以合并并返回GRanges对象:

> unlist(double_grl)
GRanges object with 8 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     32-55      *
  [2]     chr2    95-217      *
  [3]     chr8     27-68      *
  [4]     chr2     12-45      *
  [5]     chr1     32-55      *
  [6]     chr2    95-217      *
  [7]     chr8     27-68      *
  [8]     chr2     12-45      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

GRangesList提取

类似地,GRangesList使用单方括号返回GRangesList对象,双方括号返回GRanges对象:

> grl[[2]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr8     27-68      *
  [2]     chr2     12-45      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

提取函数(例如seqnames, start, width等)可以直接用在GRangesList对象上:

> seqnames(grl)
RleList of length 2
[[1]]
factor-Rle of length 2 with 2 runs
  Lengths:    1    1
  Values : chr1 chr2
Levels(3): chr1 chr2 chr8

[[2]]
factor-Rle of length 2 with 2 runs
  Lengths:    1    1
  Values : chr8 chr2
Levels(3): chr1 chr2 chr8

> start(grl)
IntegerList of length 2
[[1]] 32 95
[[2]] 27 12

返回值为Rle对象与整数的listRleListIntegerList),这种特殊的数据结构存在补充的操作方式,可以自行了解一下。

split

split函数用于分组,通过一个例子来说明,例如我们创建如下的GRanges对象:

> chrs <- c("chr3", "chr1", "chr2", "chr2", "chr3", "chr1")
> gr <- GRanges(chrs, IRanges(sample(1:100, 6, replace=TRUE),
+ width=sample(3:30, 6, replace=TRUE)))
> head(gr)
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr3     23-45      *
  [2]     chr1     44-53      *
  [3]     chr2     44-52      *
  [4]     chr2     41-55      *
  [5]     chr3     73-98      *
  [6]     chr1     13-20      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

我们知道使用split函数可以对data.frame对象根据某一列进行分组,同样可以用于GRanges分组:

> gr_split <- split(gr, seqnames(gr))
> gr_split
GRangesList object of length 3:
$chr3
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr3     23-45      *
  [2]     chr3     73-98      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

$chr1
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     44-53      *
  [2]     chr1     13-20      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

$chr2
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2     44-52      *
  [2]     chr2     41-55      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
  
> names(gr_split)
[1] "chr3" "chr1" "chr2"

补充:已分组的对象,可以使用unsplit(gr_split, seqnames(gr))逆向操作。

分组的意义在于:

  1. 方便地提取自己感兴趣的对象,例如通过基因分组, 可以直接获取某个基因的所有外显子区域;
  2. 分组还可以方便我们使用apply函数对每组对象进行操作,例如:

返回每个染色体上的范围的排序值(从小到大):

> lapply(gr_split, function(x) order(width(x)))
$chr3
[1] 1 2

$chr1
[1] 2 1

$chr2
[1] 1 2

确定每条染色体上范围的起始位点:

> sapply(gr_split, function(x) min(start(x)))
chr3 chr1 chr2
  23   13   41

确定每条染色体上的范围数目:

> sapply(gr_split, function(x) length(x))
chr3 chr1 chr2
   2    2    2

此外,一些函数直接对GRangesList进行操作(包括reduce, flank, coverage, findOverlaps等),例如使用reduce合并每个分组(即每条染色体)上重叠的范围:

> reduce(gr_split)
GRangesList object of length 3:
$chr3
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr3     23-45      *
  [2]     chr3     73-98      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

$chr1
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     13-20      *
  [2]     chr1     44-53      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

$chr2
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2     41-55      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

类似地,使用findOverlaps可以返回每个分组内的重叠区域,后面通过注释数据更具体的说明。

上一篇 下一篇

猜你喜欢

热点阅读