生物信息学R语言源码

pyclone的输入格式:通过GenomicRanges准备输入

2020-07-29  本文已影响0人  果蝇的小翅膀

问题:

在准备pyclone的输入的时候,需要点突变和对应的拷贝数信息。点突变的信息可以通过varscan获得,而拷贝数的信息可以通过sequenza获得,但是如何整合点突变和拷贝数的信息?

附pyclone的输入格式:

Tsv files for pyclone:
-mutation_id - A unique ID to identify the mutation.
-ref_counts - The number of reads covering the mutation which contain the reference (genome) allele.
-var_counts - The number of reads covering the mutation which contain the variant allele.
-normal_cn - The copy number of the cells in the normal population. For autosomal chromosomes this will be 2 and for sex chromosomes it could be either 1 or 2.
-minor_cn - The minor copy number of the cancer cells. Usually
-major_cn - The major copy number of the cancer cells.

实例:

在Bioconductor中的 GenomicRanges这个包主要用来处理这样的问题,可以参考其中的说明。

  1. 首先准备实例数据
library(GenomicRanges)
library(magrittr)
#准备实例数据
con<- textConnection("chr     start    end   rs   ref_counts  var_counts
1           100          100      rs01   100      10
1           200          200      rs02   200      20
1           300          300      rs03   300      30")
snps <- read.delim(con, head=TRUE, sep="")

con <- textConnection("chr     start    end      minor_cn  major_cn
1           10          150      0     2
1           250          450      1     2
1           400          600      2     2")  
cnvs <- read.delim(con, head=TRUE, sep="")
准备数据
  1. 通过GRanges函数将数据读取到GRanges中
gsnps = GRanges(seqnames = snps$chr ,
               ranges = IRanges(snps$start , snps$end ),
               strand = "+" )
#metadata columns can be added to a GRanges object,将表格中的其他信息添加到gsnps中
mcols(gsnps) = snps

gcnvs = GRanges(seqnames = cnvs$chr,
               ranges = IRanges(cnvs$start , cnvs$end ),
               strand = "+" )
#metadata columns can be added to a GRanges object
mcols(gcnvs) = cnvs
  1. 通过findOverlaps寻找它们之间的交集。根据overlaps可以看到它们的交集情况,最后得到的merge就是pyclone的输入了。
overlaps =findOverlaps(gsnps, gcnvs )
overlaps
#Hits object with 2 hits and 0 metadata columns:
#      queryHits subjectHits
#     <integer>   <integer>
# [1]         1           1
# [2]         3           2
  -------
#queryLength: 3 / subjectLength: 3

#合并信息。
merge = cbind( mcols(gsnps[queryHits(overlaps), ]) , mcols(gcnvs[subjectHits(overlaps) ,]) )

#整合信息
merge = as.data.frame(merge) %>%
     dplyr::mutate(mutation_id = rs,
                normal_cn = 2,
                ) %>%
  dplyr::select(mutation_id, ref_counts, var_counts, normal_cn, minor_cn, major_cn)

merge
# mutation_id ref_counts var_counts normal_cn minor_cn major_cn
#1        rs01        100         10         2        0        2
#2        rs03        300         30         2        1        2

参考文献:

  1. pyclone的使用帮助
  2. GenomicRanges的帮助文档,主要参见 part4: Interval overlaps involving GRanges and GRangesList objects
  3. https://stackoverflow.com/questions/11892241/merge-by-range-in-r-applying-loops
上一篇下一篇

猜你喜欢

热点阅读