
【preseq】 Estimating library complexity

2020-07-08  caokai001

Problem:

I previously saw a figure showing that as sequencing depth increases, the duplication rate steadily rises. Subsampling the library myself to estimate library size is cumbersome, and counting distinct reads with samtools flagstat also takes time.
Since preseq was mentioned as a tool for estimating library complexity, I decided to give it a try.

https://www.nature.com/articles/ncomms7033
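For intuition about why the distinct-read curve saturates: if a library contains L distinct molecules and reads are drawn uniformly with replacement, the expected number of distinct reads after n draws is L·(1 − (1 − 1/L)^n). A minimal Python sketch of this toy model (the library size of 100,000 is made up for illustration; preseq's actual estimators are more sophisticated):

```python
# Toy model: expected distinct reads when n reads are drawn uniformly,
# with replacement, from a library of `library_size` distinct molecules.

def expected_distinct(n, library_size):
    """E[distinct] = L * (1 - (1 - 1/L)**n) under uniform sampling."""
    L = float(library_size)
    return L * (1.0 - (1.0 - 1.0 / L) ** n)

for n in (10_000, 100_000, 1_000_000):
    d = expected_distinct(n, 100_000)
    dup_rate = 1.0 - d / n          # fraction of reads that are duplicates
    print(f"{n:>9}  distinct={d:>9.0f}  duplication={dup_rate:.3f}")
```

As n grows past the library size, distinct reads plateau near L while the duplication rate climbs toward 1, matching the trend in the figure.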

(Figure from the paper above; the image itself is not reproduced here.)

Installation

conda install -c bioconda preseq

preseq has two main subcommands, lc_extrap and c_curve. Supported input formats include BAM and BED.

$ preseq c_curve 
Usage: c_curve [OPTIONS] <sorted-bed-file>

Options:
  -o, -output   yield output file (default: stdout) 
  -s, -step     step size in extrapolations (default: 1e+06) 
  -v, -verbose  print more information 
  -P, -pe       input is paired end read file 
  -H, -hist     input is a text file containing the observed histogram 
  -V, -vals     input is a text file containing only the observed counts 
  -B, -bam      input is in BAM format 
  -l, -seg_len  maximum segment length when merging paired end bam reads 
                (default: 5000) 
  -r, -seed     seed for random number generator 

Help options:
  -?, -help     print this help message 
      -about    print about message
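Besides sorted BED/BAM input, the help above shows that a pre-computed duplicates histogram can be supplied with -H. My understanding (treat the exact file format as an assumption; check the preseq manual) is that the histogram maps each multiplicity j to the number of distinct reads observed exactly j times. A sketch of building such a histogram from read start positions:

```python
from collections import Counter

def duplicate_histogram(read_positions):
    """Map multiplicity j -> number of distinct positions seen exactly j times."""
    per_position = Counter(read_positions)                 # position -> read count
    return dict(sorted(Counter(per_position.values()).items()))

# hypothetical read start positions on one chromosome
reads = [100, 100, 250, 250, 250, 300, 412]
print(duplicate_histogram(reads))  # {1: 2, 2: 1, 3: 1}
```

Here two positions were seen once, one position twice, and one position three times.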

Test run:

bamToBed -i trim_25.uniq.bam > ./try/trim_25.uniq.bed
preseq c_curve -s 1e+03 -o estimates.txt trim_25.uniq.bed

Plot the resulting curve in R:

library(tidyverse)
data <- read_delim("estimates.txt", delim = "\t")
ggplot(data, aes(x = total_reads, y = distinct_reads)) + geom_line()
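Conceptually, c_curve subsamples the observed reads and counts how many are distinct at each step of the subsample. A minimal Python re-implementation of that idea (toy positions, not preseq's actual code):

```python
import random

def complexity_curve(read_positions, step, seed=0):
    """Shuffle reads once, then record (total, distinct) every `step` reads --
    roughly what c_curve reports from a sorted BED/BAM."""
    reads = list(read_positions)
    random.Random(seed).shuffle(reads)
    seen, curve = set(), []
    for i, r in enumerate(reads, 1):
        seen.add(r)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# toy library: 500 distinct molecules sequenced to 3000 reads
rng = random.Random(42)
positions = [rng.randrange(500) for _ in range(3000)]
for total, distinct in complexity_curve(positions, step=1000):
    print(total, distinct)
```

The distinct count grows quickly at first and then flattens as the 500 distinct molecules are exhausted, the same shape the ggplot curve shows.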

Thoughts:

Running lc_extrap directly fails:

$ preseq lc_extrap -o yield_estimates.txt trim_25.uniq.bed
ERROR:  too many defects in the approximation, consider running in defect mode

Rerunning in defect mode with -D succeeds and writes yield_estimates.txt:

$ preseq lc_extrap -D -o yield_estimates.txt trim_25.uniq.bed
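lc_extrap extrapolates the complexity curve beyond the observed depth. For intuition only, here is a much cruder alternative (this is not preseq's algorithm, which uses rational-function approximations): fit a single library size L to the observed (total, distinct) pair under a uniform/Poisson sampling model, distinct = L·(1 − e^(−total/L)), then predict the yield at deeper sequencing. All numbers below are made up for illustration.

```python
import math

def fit_library_size(total, distinct):
    """Solve distinct = L * (1 - exp(-total/L)) for L by bisection.
    Assumes uniform (Poisson) sampling and distinct < total."""
    f = lambda L: L * (1.0 - math.exp(-total / L)) - distinct
    lo, hi = float(distinct), distinct * 1e6   # the root lies above `distinct`
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:    # predicted yield too high -> library estimate too big
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

def extrapolate_yield(total, distinct, future_total):
    """Predicted distinct reads if sequencing continued to future_total."""
    L = fit_library_size(total, distinct)
    return L * (1.0 - math.exp(-future_total / L))

# toy numbers: 2000 reads with ~865 distinct implies a library of ~1000 molecules
d = 1000 * (1 - math.exp(-2.0))
print(round(fit_library_size(2000, d)))          # -> 1000
print(round(extrapolate_yield(2000, d, 10_000))) # -> 1000 (near saturation)
```

The defect-mode warning above suggests the real data deviate from what preseq's approximation handles cleanly; this simple one-parameter model sidesteps that at the cost of much less accuracy.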

Comments and discussion welcome! 😀
