
Removing haplotypic duplication with purge_haplotigs


When some regions of a genome are highly heterozygous, the two haplotypes of those regions may each be assembled as a separate primary contig, rather than one being kept as the primary contig and the other as its associated haplotig. If your downstream analysis expects a single haplotype, this causes problems. purge_haplotigs is a recently developed tool for third-generation (long-read) assemblies: based on minimap2 alignments, it examines read-depth coverage over each contig to decide which contigs stay in the primary assembly and which are moved out. It is intended for haplotype-aware assemblies such as Canu, FALCON, or FALCON-Unzip primary contigs.


Flowchart of the Purge Haplotigs pipeline
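In lieu of the figure, here is a minimal sketch of the whole workflow, assuming a genome assembly genome.fasta and the raw reads subreads.fasta.gz; the cutoffs 30/80/145 are placeholders that you must read off your own histogram (each step is detailed below):

# map the reads back to the assembly and sort/index the BAM (full command in section 2), then:
purge_haplotigs readhist -b aligned.bam -g genome.fasta                      # step 1: read-depth histogram
purge_haplotigs contigcov -i aligned.bam.gencov -l 30 -m 80 -h 145           # step 2: flag suspect contigs
purge_haplotigs purge -g genome.fasta -c coverage_stats.csv -b aligned.bam   # step 3: purge and reassign haplotigs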

1. Software installation

The tool has quite a few dependencies, so the simplest option is to install it with conda.

conda install -y -c bioconda purge_haplotigs
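If you prefer to keep it isolated, a dedicated conda environment works just as well (the environment name here is arbitrary):

conda create -n purge_haplotigs -c conda-forge -c bioconda purge_haplotigs
conda activate purge_haplotigs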

If you like tinkering you can also install it manually (no compilation is needed; it works straight after cloning). Remember to put it on your PATH.

 # clone the git

$ git clone https://bitbucket.org/mroachawri/purge_haplotigs.git

# create a softlink to ~/bin

$ ln -s ~/purge_haplotigs/bin/purge_haplotigs ~/bin/purge_haplotigs
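If ~/bin is not already on your PATH, add it (for example in your ~/.bashrc):

export PATH="$HOME/bin:$PATH"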

# test Purge Haplotigs

$ purge_haplotigs

USAGE:

purge_haplotigs <command> [options]

COMMANDS:

-- Purge Haplotigs pipeline:

readhist First step, generate a read-depth histogram for the genome

contigcov Second step, get contig coverage stats and flag 'suspect' contigs

purge Third step, identify and reassign haplotigs

-- Other scripts

ncbiplace Generate a placement file for submission to NCBI

test Test everything!

# test the pipeline

$ purge_haplotigs test

# <lots of jargon>

ALL TESTS PASSED

Dependencies:

$ wget https://github.com/arq5x/bedtools2/releases/download/v2.28.0/bedtools-2.28.0.tar.gz

$ tar -zxvf bedtools-2.28.0.tar.gz

$ cd bedtools2

$ make
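The compiled bedtools binary ends up in bedtools2/bin; link it into ~/bin like the other tools (the path below assumes you built it in your home directory):

$ ln -s ~/bedtools2/bin/bedtools ~/bin/bedtools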

# samtools: https://github.com/samtools/samtools (download a release tarball, extract, then:)
$ ./configure           # needed for choosing optional functionality
$ make
$ make install          # may need sudo, or configure with --prefix=$HOME
$ sudo apt install r-base r-base-dev

# on a fresh install the required R package 'ggplot2' won't be there yet
$ sudo su - -c "R -e \"install.packages('ggplot2', repos='http://cran.rstudio.com/')\""
$ wget https://github.com/lh3/minimap2/releases/download/v2.13/minimap2-2.13_x64-linux.tar.bz2
$ tar xf minimap2-2.13_x64-linux.tar.bz2
# this tarball ships a precompiled binary, so there is nothing to compile; just link it into ~/bin
$ ln -s ~/minimap2-2.13_x64-linux/minimap2 ~/bin/minimap2
$ wget https://github.com/mummer4/mummer/releases/download/v4.0.0beta2/mummer-4.0.0beta2.tar.gz
$ tar xf mummer-4.0.0beta2.tar.gz
# compile
$ cd mummer-4.0.0beta2
$ ./configure
$ make
$ cd ../
# install (just softlink to the home bin directory ~/bin)
$ ln -s ~/mummer-4.0.0beta2/delta-filter ~/bin/delta-filter
$ ln -s ~/mummer-4.0.0beta2/nucmer ~/bin/nucmer
$ ln -s ~/mummer-4.0.0beta2/show-coords ~/bin/show-coords

2. Data preparation

  1. The assembled genome (genome.fasta)
  2. The long reads used for the assembly (subreads.fasta.gz); Illumina short-read data also works
  3. An aligned.bam file obtained by mapping those reads back to the genome, for example:
minimap2 -ax map-pb genome.fa subreads.fasta.gz \
    | samtools view -hF 256 - \
    | samtools sort -@ 8 -m 1G -o aligned.bam -T tmp.ali

-h: include the SAM header in the output (samtools view omits it by default)
-F 256: exclude reads with FLAG 256 set, i.e. discard secondary alignments
-m: memory per sorting thread (K/M/G suffixes allowed)
-T: prefix for temporary files
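If you are mapping Illumina paired-end reads instead, use minimap2's short-read preset (the read file names below are placeholders). Also index the sorted BAM, since the purge step expects a samtools-indexed BAM:

minimap2 -ax sr genome.fa reads_R1.fastq.gz reads_R2.fastq.gz \
    | samtools view -hF 256 - \
    | samtools sort -@ 8 -m 1G -o aligned.bam -T tmp.ali
samtools index aligned.bam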

3. Running the pipeline

# USAGE:
% purge_haplotigs  readhist  -b aligned.bam  -g genome.fasta  [-t threads]

REQUIRED:
-b / -bam     BAM file of aligned and sorted reads/subreads to the reference
-g / -genome  Reference genome in fasta format

e.g.
$ purge_haplotigs  readhist  -b aligned.bam  -g genome.fasta  -t 20
# -t: number of threads; there is little point setting this very high.

You should get a bimodal read-depth histogram: one peak at haploid coverage and one at diploid coverage. If you are working with a phased assembly, the diploid peak may be very small.
From this histogram you need to pick three cutoffs: a low-coverage cutoff, the low point between the two peaks, and a high-coverage cutoff.
Example histograms from the documentation: PacBio subreads mapped to a diploid-phased assembly (primary contigs + haplotigs), and Illumina PE reads mapped to a haploid assembly (primary contigs only).
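For example, you might read values like the following off your histogram (purely illustrative; they are reused in the commands below):

low_cutoff=30
mid_cutoff=80
high_cutoff=145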
# USAGE:
% purge_haplotigs  contigcov  -i aligned.bam.gencov  -l 30  -m 80  -h 145  [-o coverage_stats.csv -j 80  -s 80 ]


REQUIRED:
-i / -in        The bedtools genomecov output that was produced from 'purge_haplotigs readhist'
-l / -low       The read depth low cutoff (use the histogram to eyeball these cutoffs)
-h / -high      The read depth high cutoff
-m / -mid       The low point between the haploid and diploid peaks


OPTIONAL:
-o / -out       Choose an output file name (CSV format, DEFAULT = coverage_stats.csv)
-j / -junk      Auto-assign contig as "j" (junk) if this percentage or greater of the contig is
                low/high coverage (default = 80, > 100 = don't junk anything)
-s / -suspect   Auto-assign contig as "s" (suspected haplotig) if this percentage or less of the
                contig is diploid level of coverage (default = 80)

e.g.
$ purge_haplotigs  contigcov  -i aligned.bam.gencov  -o coverage_stats.csv  -l $low_cutoff  -m $mid_cutoff  -h $high_cutoff

This produces a coverage_stats.csv file.
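It is worth a quick look before purging, e.g. to count how many contigs were flagged as suspect ('s') or junk ('j'). The snippet below assumes the flag sits in the second comma-separated column:

head coverage_stats.csv
cut -d, -f2 coverage_stats.csv | sort | uniq -c   # assumed layout: reassignment flag in field 2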

# USAGE:
% purge_haplotigs  purge  -g genome.fasta  -c coverage_stats.csv  -b aligned.bam


REQUIRED:
-g / -genome        Genome assembly in fasta format. Needs to be indexed with samtools faidx.
-c / -coverage      Contig by contig coverage stats csv file from the previous step.
-b / -bam           Samtools-indexed bam file of aligned and sorted reads/subreads to the reference
                    (same one used for generating the read-depth histogram).


OPTIONAL:
-t / -threads       Number of worker threads to use. DEFAULT = 4.
-o / -outprefix     Prefix for the curated assembly. DEFAULT = "curated".
-a / -align_cov     Percent cutoff for identifying a contig as a haplotig. DEFAULT = 70.
-m / -max_match     Percent cutoff for identifying repetitive contigs. DEFAULT = 250.
-wind_len           Length of genome window (in bp) for the coverage track in the dotplots. DEFAULT = 9000.
-wind_step          Step distance for genome windows for coverage track. DEFAULT = 3000.

e.g.
$ purge_haplotigs purge  -g genome.fasta  -c coverage_stats.csv  -b aligned.bam  -t 4  -a 60

All done!
Five output files are produced (all named with the -outprefix value, "curated" by default):

curated.artefacts.fasta: junk contigs, i.e. contigs that lacked adequate read coverage
curated.fasta: the new, purged haploid assembly
curated.haplotigs.fasta: the haplotigs split off from the original assembly
curated.reassignments.tsv: details of how each contig was reassigned
curated.contig_associations.log: a log of the contig association/reassignment process
One record in contig_associations.log, for example, showed that 000000F_003 was a haplotig of 000000F_009, while 000000F_009, 000000F_005 and 000000F_010 were in turn haplotigs of 000000F.
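A quick sanity check on the result is to compare sequence counts before and after (the default "curated" output names are assumed):

grep -c ">" genome.fasta             # contigs in the original assembly
grep -c ">" curated.fasta            # contigs kept as primary
grep -c ">" curated.haplotigs.fasta  # contigs reassigned as haplotigs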

References:

https://github.com/skingan/purge_haplotigs_multiBAM
https://www.jianshu.com/p/8ed5b494b131