Using purge_haplotigs to remove heterozygous duplication
2019-11-07
多啦A梦的时光机_648d
Some regions of a genome can have fairly high heterozygosity, which can cause the two haplotypes of such a region to each be assembled as a primary contig, instead of one becoming the primary contig and the other its associated haplotig. If downstream analyses focus on a haploid representation, this causes problems. purge_haplotigs is a recently developed tool for third-generation (long-read) assemblies: based on minimap2 alignments, it analyses read-depth coverage to decide which contigs to keep and which to reassign. It is intended for assemblies from haplotype-aware assemblers such as Canu, FALCON or FALCON-Unzip primary contigs.

1. Software installation
There are quite a few dependencies, so the simplest option is to install it with conda.
conda install -y -c conda-forge -c bioconda purge_haplotigs
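If you prefer to keep it out of your base environment, a dedicated conda environment works just as well (a sketch; it assumes the bioconda and conda-forge channels are reachable):
$ conda create -n purge_haplotigs -c conda-forge -c bioconda purge_haplotigs
$ conda activate purge_haplotigs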
If you like to tinker, you can also install it by hand (no compilation needed; it works straight after cloning). Remember to add it to your PATH.
# clone the git
$ git clone https://bitbucket.org/mroachawri/purge_haplotigs.git
# create a softlink to ~/bin
$ ln -s ~/purge_haplotigs/bin/purge_haplotigs ~/bin/purge_haplotigs
# test Purge Haplotigs
$ purge_haplotigs
USAGE:
purge_haplotigs <command> [options]
COMMANDS:
-- Purge Haplotigs pipeline:
readhist First step, generate a read-depth histogram for the genome
contigcov Second step, get contig coverage stats and flag 'suspect' contigs
purge Third step, identify and reassign haplotigs
-- Other scripts
ncbiplace Generate a placement file for submission to NCBI
test Test everything!
# test the pipeline
$ purge_haplotigs test
# <lots of jargon>
ALL TESTS PASSED
Dependencies:
- bedtools
$ wget https://github.com/arq5x/bedtools2/releases/download/v2.28.0/bedtools-2.28.0.tar.gz
$ tar -zxvf bedtools-2.28.0.tar.gz
$ cd bedtools2
$ make
- samtools
https://github.com/samtools/samtools
./configure # Needed for choosing optional functionality
make
make install
- Rscript
$ sudo apt install r-base r-base-dev
# on a new install we won't have the required R library 'ggplot2' installed
$ sudo su - -c "R -e \"install.packages('ggplot2', repos='http://cran.rstudio.com/')\""
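A quick way to confirm that Rscript and ggplot2 are both usable (the command simply loads the library and exits silently if everything is in place):
$ Rscript -e 'library(ggplot2)'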
- minimap2
$ wget https://github.com/lh3/minimap2/releases/download/v2.13/minimap2-2.13_x64-linux.tar.bz2
$ tar xf minimap2-2.13_x64-linux.tar.bz2
# the _x64-linux release tarball ships a precompiled binary, so no make step is needed
$ ln -s ~/minimap2-2.13_x64-linux/minimap2 ~/bin/minimap2
- MUMmer
$ wget https://github.com/mummer4/mummer/releases/download/v4.0.0beta2/mummer-4.0.0beta2.tar.gz
$ tar xf mummer-4.0.0beta2.tar.gz
# compile
$ cd mummer-4.0.0beta2
$ ./configure
$ make
$ cd ../
# install (just softlink to the home bin directory ~/bin)
$ ln -s ~/mummer-4.0.0beta2/delta-filter ~/bin/delta-filter
$ ln -s ~/mummer-4.0.0beta2/nucmer ~/bin/nucmer
$ ln -s ~/mummer-4.0.0beta2/show-coords ~/bin/show-coords
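If ~/bin is not already on your PATH (see the note above about adding it to your environment), one way is to append it in ~/.bashrc:
$ echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
$ source ~/.bashrc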
2. Data preparation
- The assembled genome (genome.fasta)
- The long reads used for the assembly (subreads.fasta.gz); Illumina short-read data also works
- An aligned.bam produced by mapping those reads to the genome
minimap2 -ax map-pb genome.fasta subreads.fasta.gz \
| samtools view -hF 256 - \
| samtools sort -@ 8 -m 1G -o aligned.bam -T tmp.ali
-h: include the SAM header in the output (samtools view omits it by default)
-F 256: exclude alignments with flag 256 set, i.e. drop secondary alignments (default filter is 0)
-m: memory per sorting thread (K/M/G suffixes accepted)
-T: prefix for temporary files
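The purge step later on expects the BAM to be samtools-indexed and the genome to be indexed with samtools faidx, so it is convenient to build both indexes right after the alignment:
$ samtools index aligned.bam
$ samtools faidx genome.fasta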
3. Running the pipeline
- step1
Run the readhist step to generate a read-depth histogram. It writes a PNG histogram image for you to inspect and a BEDTools 'genomecov' output file that is used in step 2.
# USAGE:
% purge_haplotigs readhist -b aligned.bam -g genome.fasta [-t threads]
REQUIRED:
-b / -bam BAM file of aligned and sorted reads/subreads to the reference
-g / -genome Reference genome in fasta format
eg.
$ purge_haplotigs readhist -b aligned.bam -g genome.fasta -t 20
# -t: number of threads; there is no point setting this very high, extra threads bring little benefit
You should get a bimodal histogram: one peak at haploid depth and one at diploid depth. If you are working with a phased assembly, the diploid peak may be very small.
From the histogram you need to pick three values: a cutoff for low coverage, the low point between the two peaks, and a cutoff for high coverage.
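For example, the three values read off the histogram can be stored in shell variables and reused in step 2 (the numbers below are purely illustrative; use whatever your own plot shows):
$ low_cutoff=20    # left edge of the haploid peak
$ mid_cutoff=75    # valley between the haploid and diploid peaks
$ high_cutoff=190  # right edge of the diploid peak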
[Figure: read-depth histogram, PacBio subreads on a diploid-phased assembly (primary contigs + haplotigs)]
[Figure: read-depth histogram, Illumina PE reads on a haploid assembly (primary contigs)]
- step2
Run the contigcov step, using the cutoffs from the previous step (the three values read off the histogram), to analyse the contig coverage. It produces a CSV file of per-contig coverage statistics and flags 'suspect' contigs for further analysis or removal.
# USAGE:
% purge_haplotigs contigcov -i aligned.bam.gencov -l 30 -m 80 -h 145 [-o coverage_stats.csv -j 80 -s 80 ]
REQUIRED:
-i / -in The bedtools genomecov output that was produced from 'purge_haplotigs readhist'
-l / -low The read depth low cutoff (use the histogram to eyeball these cutoffs)
-h / -high The read depth high cutoff
-m / -mid The low point between the haploid and diploid peaks
OPTIONAL:
-o / -out Choose an output file name (CSV format, DEFAULT = coverage_stats.csv)
-j / -junk Auto-assign contig as "j" (junk) if this percentage or greater of the contig is
low/high coverage (default = 80, > 100 = don't junk anything)
-s / -suspect Auto-assign contig as "s" (suspected haplotig) if this percentage or less of the
contig is diploid level of coverage (default = 80)
eg.
$ purge_haplotigs contigcov -i aligned.bam.gencov -o coverage_stats.csv -l $low_cutoff -m $mid_cutoff -h $high_cutoff
This produces a coverage_stats.csv file.
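Before moving on, it is worth skimming the CSV to see which contigs were auto-flagged as suspect ("s") or junk ("j"):
$ head coverage_stats.csv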
- step3
Run the purge step. This script automatically runs a BEDTools windowed coverage analysis, a BLAST search, LASTZ alignments and so on, to assess which contigs should be reassigned and which should be kept.
# USAGE:
% purge_haplotigs purge -g genome.fasta -c coverage_stats.csv -b aligned.bam
REQUIRED:
-g / -genome Genome assembly in fasta format. Needs to be indexed with samtools faidx.
-c / -coverage Contig by contig coverage stats csv file from the previous step.
-b / -bam Samtools-indexed bam file of aligned and sorted reads/subreads to the reference
(same one used for generating the read-depth histogram).
OPTIONAL:
-t / -threads Number of worker threads to use. DEFAULT = 4.
-o / -outprefix Prefix for the curated assembly. DEFAULT = "curated".
-a / -align_cov Percent cutoff for identifying a contig as a haplotig. DEFAULT = 70.
-m / -max_match Percent cutoff for identifying repetitive contigs. DEFAULT = 250.
-wind_len Length of genome window (in bp) for the coverage track in the dotplots. DEFAULT = 9000.
-wind_step Step distance for genome windows for coverage track. DEFAULT = 3000.
eg.
$ purge_haplotigs purge -g genome.fasta -c coverage_stats.csv -b aligned.bam -t 4 -a 60
All done!
Five files are produced:
.artefacts.fasta: contigs without sufficient read coverage, i.e. junk
.fasta: the new haploid assembly
.haplotigs.fasta: the haplotigs split off from the original assembly
.reassignments.tsv: the reassignment information for each contig
.contig_associations.log: a log of how contigs were associated and reassigned
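To quickly sanity-check the split, count how many sequences ended up in each FASTA (this assumes the default "curated" output prefix from -outprefix):
$ grep -c ">" curated.fasta curated.haplotigs.fasta curated.artefacts.fasta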
Below is one record from that log. It shows that 000000F_003 is a HAPLOTIG of 000000F_009, while 000000F_009, 000000F_005 and 000000F_010 are in turn HAPLOTIGs of 000000F.
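The original post showed this record as a screenshot; a mock-up consistent with that description (the exact formatting of the real log may differ slightly) looks roughly like:
000000F,PRIMARY -> 000000F_005,HAPLOTIG
                -> 000000F_009,HAPLOTIG -> 000000F_003,HAPLOTIG
                -> 000000F_010,HAPLOTIG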

References:
https://github.com/skingan/purge_haplotigs_multiBAM
https://www.jianshu.com/p/8ed5b494b131