Using purge_haplotigs to remove heterozygous duplication
2019-11-07
多啦A梦的时光机_648d
Some regions of a genome can have fairly high heterozygosity, which can cause the two haplotypes of such a region to each be assembled as a primary contig, instead of one becoming the primary contig and the other its associated haplotig. If downstream analyses focus on a haploid representation, this causes problems. purge_haplotigs is a recently developed tool for third-generation (long-read) assemblies: based on minimap2 alignments, it analyses read-depth coverage to decide which contigs to keep and which to reassign. It is intended for assemblies from haplotype-aware assemblers such as Canu, FALCON or FALCON-Unzip primary contigs.

1. Software installation
There are quite a few dependencies, so the simplest option is to install it with conda.
conda install -y -c conda-forge -c bioconda purge_haplotigs
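If you prefer to keep it out of your base environment, a dedicated conda environment works just as well (a sketch; it assumes the bioconda and conda-forge channels are reachable):
$ conda create -n purge_haplotigs -c conda-forge -c bioconda purge_haplotigs
$ conda activate purge_haplotigs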
If you like to tinker, you can also install it by hand (no compilation needed; it works straight after cloning). Remember to add it to your PATH.
# clone the git
$ git clone https://bitbucket.org/mroachawri/purge_haplotigs.git
# create a softlink to ~/bin
$ ln -s ~/purge_haplotigs/bin/purge_haplotigs ~/bin/purge_haplotigs
# test Purge Haplotigs
$ purge_haplotigs
USAGE:
purge_haplotigs <command> [options]
COMMANDS:
-- Purge Haplotigs pipeline:
readhist First step, generate a read-depth histogram for the genome
contigcov Second step, get contig coverage stats and flag 'suspect' contigs
purge Third step, identify and reassign haplotigs
-- Other scripts
ncbiplace Generate a placement file for submission to NCBI
test Test everything!
# test the pipeline
$ purge_haplotigs test
# <lots of jargon>
ALL TESTS PASSED
Dependencies:
- bedtools
$ wget https://github.com/arq5x/bedtools2/releases/download/v2.28.0/bedtools-2.28.0.tar.gz
$ tar -zxvf bedtools-2.28.0.tar.gz
$ cd bedtools2
$ make
- samtools
https://github.com/samtools/samtools
./configure # Needed for choosing optional functionality
make
make install
- Rscript
$ sudo apt install r-base r-base-dev
# on a new install we won't have the required R library 'ggplot2' installed
$ sudo su - -c "R -e \"install.packages('ggplot2', repos='http://cran.rstudio.com/')\""
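A quick way to confirm that Rscript and ggplot2 are both usable (the command simply loads the library and exits silently if everything is in place):
$ Rscript -e 'library(ggplot2)'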
- minimap2
$ wget https://github.com/lh3/minimap2/releases/download/v2.13/minimap2-2.13_x64-linux.tar.bz2
$ tar xf minimap2-2.13_x64-linux.tar.bz2
# the _x64-linux release tarball ships a precompiled binary, so no make step is needed
$ ln -s ~/minimap2-2.13_x64-linux/minimap2 ~/bin/minimap2
- MUMmer
$ wget https://github.com/mummer4/mummer/releases/download/v4.0.0beta2/mummer-4.0.0beta2.tar.gz
$ tar xf mummer-4.0.0beta2.tar.gz
# compile
$ cd mummer-4.0.0beta2
$ ./configure
$ make
$ cd ../
# install (just softlink to the home bin directory ~/bin)
$ ln -s ~/mummer-4.0.0beta2/delta-filter ~/bin/delta-filter
$ ln -s ~/mummer-4.0.0beta2/nucmer ~/bin/nucmer
$ ln -s ~/mummer-4.0.0beta2/show-coords ~/bin/show-coords
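If ~/bin is not already on your PATH (see the note above about adding it to your environment), one way is to append it in ~/.bashrc:
$ echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
$ source ~/.bashrc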
2. Data preparation
- The assembled genome (genome.fasta)
- The long reads used for the assembly (subreads.fasta.gz); Illumina short-read data also works
- An aligned.bam produced by mapping those reads to the genome
minimap2 -ax map-pb genome.fasta subreads.fasta.gz \
| samtools view -hF 256 - \
| samtools sort -@ 8 -m 1G -o aligned.bam -T tmp.ali
-h: include the SAM header in the output (samtools view omits it by default)
-F 256: exclude alignments with flag 256 set, i.e. drop secondary alignments (default filter is 0)
-m: memory per sorting thread (K/M/G suffixes accepted)
-T: prefix for temporary files
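The purge step later on expects the BAM to be samtools-indexed and the genome to be indexed with samtools faidx, so it is convenient to build both indexes right after the alignment:
$ samtools index aligned.bam
$ samtools faidx genome.fasta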
3. Running the pipeline
- step1
Run the readhist step to generate a read-depth histogram. It writes a PNG histogram image for you to inspect and a BEDTools 'genomecov' output file that is used in step 2.
# USAGE:
% purge_haplotigs readhist -b aligned.bam -g genome.fasta [-t threads]
REQUIRED:
-b / -bam BAM file of aligned and sorted reads/subreads to the reference
-g / -genome Reference genome in fasta format
eg.
$ purge_haplotigs readhist -b aligned.bam -g genome.fasta -t 20
# -t: number of threads; there is no point setting this very high, extra threads bring little benefit
You should get a bimodal histogram: one peak at haploid depth and one at diploid depth. If you are working with a phased assembly, the diploid peak may be very small.
From the histogram you need to pick three values: a cutoff for low coverage, the low point between the two peaks, and a cutoff for high coverage.
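For example, the three values read off the histogram can be stored in shell variables and reused in step 2 (the numbers below are purely illustrative; use whatever your own plot shows):
$ low_cutoff=20    # left edge of the haploid peak
$ mid_cutoff=75    # valley between the haploid and diploid peaks
$ high_cutoff=190  # right edge of the diploid peak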
[Figure: read-depth histogram, PacBio subreads on a diploid-phased assembly (primary contigs + haplotigs)]
[Figure: read-depth histogram, Illumina PE reads on a haploid assembly (primary contigs)]
- step2
Run the contigcov step, using the cutoffs from the previous step (the three values read off the histogram), to analyse the contig coverage. It produces a CSV file of per-contig coverage statistics and flags 'suspect' contigs for further analysis or removal.
# USAGE:
% purge_haplotigs contigcov -i aligned.bam.gencov -l 30 -m 80 -h 145 [-o coverage_stats.csv -j 80 -s 80 ]
REQUIRED:
-i / -in The bedtools genomecov output that was produced from 'purge_haplotigs readhist'
-l / -low The read depth low cutoff (use the histogram to eyeball these cutoffs)
-h / -high The read depth high cutoff
-m / -mid The low point between the haploid and diploid peaks
OPTIONAL:
-o / -out Choose an output file name (CSV format, DEFAULT = coverage_stats.csv)
-j / -junk Auto-assign contig as "j" (junk) if this percentage or greater of the contig is
low/high coverage (default = 80, > 100 = don't junk anything)
-s / -suspect Auto-assign contig as "s" (suspected haplotig) if this percentage or less of the
contig is diploid level of coverage (default = 80)
eg.
$ purge_haplotigs contigcov -i aligned.bam.gencov -o coverage_stats.csv -l $low_cutoff -m $mid_cutoff -h $high_cutoff
This produces a coverage_stats.csv file.
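Before moving on, it is worth skimming the CSV to see which contigs were auto-flagged as suspect ("s") or junk ("j"):
$ head coverage_stats.csv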
- step3
Run the purge step. This script automatically runs a BEDTools windowed coverage analysis, a BLAST search, LASTZ alignments and so on, to assess which contigs should be reassigned and which should be kept.
# USAGE:
% purge_haplotigs purge -g genome.fasta -c coverage_stats.csv -b aligned.bam
REQUIRED:
-g / -genome Genome assembly in fasta format. Needs to be indexed with samtools faidx.
-c / -coverage Contig by contig coverage stats csv file from the previous step.
-b / -bam Samtools-indexed bam file of aligned and sorted reads/subreads to the reference
(same one used for generating the read-depth histogram).
OPTIONAL:
-t / -threads Number of worker threads to use. DEFAULT = 4.
-o / -outprefix Prefix for the curated assembly. DEFAULT = "curated".
-a / -align_cov Percent cutoff for identifying a contig as a haplotig. DEFAULT = 70.
-m / -max_match Percent cutoff for identifying repetitive contigs. DEFAULT = 250.
-wind_len Length of genome window (in bp) for the coverage track in the dotplots. DEFAULT = 9000.
-wind_step Step distance for genome windows for coverage track. DEFAULT = 3000.
eg.
$ purge_haplotigs purge -g genome.fasta -c coverage_stats.csv -b aligned.bam -t 4 -a 60
All done!
Five files are produced:
.artefacts.fasta: contigs without sufficient read coverage, i.e. junk
.fasta: the new haploid assembly
.haplotigs.fasta: the haplotigs split off from the original assembly
.reassignments.tsv: the reassignment information for each contig
.contig_associations.log: a log of how contigs were associated and reassigned
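To quickly sanity-check the split, count how many sequences ended up in each FASTA (this assumes the default "curated" output prefix from -outprefix):
$ grep -c ">" curated.fasta curated.haplotigs.fasta curated.artefacts.fasta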
Below is one record from that log. It shows that 000000F_003 is a HAPLOTIG of 000000F_009, while 000000F_009, 000000F_005 and 000000F_010 are in turn HAPLOTIGs of 000000F.
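The original post showed this record as a screenshot; a mock-up consistent with that description (the exact formatting of the real log may differ slightly) looks roughly like:
000000F,PRIMARY -> 000000F_005,HAPLOTIG
                -> 000000F_009,HAPLOTIG -> 000000F_003,HAPLOTIG
                -> 000000F_010,HAPLOTIG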

References:
https://github.com/skingan/purge_haplotigs_multiBAM
https://www.jianshu.com/p/8ed5b494b131