HyPo: Super Fast & Accurate Poli
HyPo--全称 Hybrid Polisher.使用short reads或者long reads对组装的基因组序列进行polish. 它利用基因组中唯一的kmers,通过部分排好序的read-segments比对片段来选择性地polish contigs的segments。
请注意,“short reads”不一定非得是NGS short reads; 由PacBio SequelII产生的HiFi基因组reads(例如CCS)也可以。要求:这些reads必须高度准确(>98%的准确度)。
为什么会有一个racon的比较,一个
As demonstrated on human genome assemblies, Hypo generates significantly more accurate polished assembly in about one-third time with about half the memory requirements in comparison to contemporary widely used polishers like Racon.
Hypo 的输入文件需满足如下要求:
- Short reads/HiFi reads (FASTA/FASTQ,可以是压缩文件)
- Draft contigs (in FASTA/FASTQ,可以是压缩文件)
- Alignments between short reads (or HiFi reads) and the draft (in sam/bam format; should contain CIGAR).Long (noisy) reads 也可以, 需和draft基因组文件比对生成sam/bam格式文件.
- 预期 short reads(或HiFi reads)的平均覆盖率和基因组的大致大小。
在下面的内容中,HiFi reads可以替代short reads:
- 理论上,将draft contigs(未纠错的)大致分为两类区域(片段):强区域和弱区域。
- 强区域是那些有足够证据支持,证明其正确性的区域,因此不需要polish。
- 弱区域将使用POA进行polish。每个弱区域将使用short reads或long reads进行polish;short reads优先于long reads。
- 为了确定强区域,我们使用solid kmers(预估的唯一基因组kmers)。强区域在选择read-segments来polish相邻的弱区时也起到了一定的作用。此外,我们的方法也将long reads及他们组装产生的homopolymer error考虑其中。
Installation
Hypo is only available for Unix-like platforms (Linux and MAC OS). We recommend using Option_1 for a convenient installation and in the case where target machine is different from the one on which hypo is installed. On the other hand, Option_2 is more suitable if the binary of hypo is to be run on the same machine on which it is compiled because then a machine-specific optimised (and thus slightly faster) binary can be produced using the flag -Doptimise_for_native=ON
.
Option_1: Conda Package Installation
The convenient way of installation is using the conda package as follows:
conda install -c bioconda hypo
Htslib 1.10 may cause conflicts with some of the already installed packages. In such a case, hypo can be installed in a new environment as follows:
conda create --name hypo_env
conda activate hypo_env
conda install -c bioconda hypo
Option_2: Installation from the source
CmakeLists is provided in the project root folder.
Pre-requisites
For installing from the source, the following requirements are assumed to be installed already (with path to their binaries available in $PATH).
-
Zlib
-
OpenMP
-
GCC (>=7.3)
- Following are the commands to update GCC (say to GCC 8) on an Ubuntu machine (from say GCC 5):
sudo apt-get update; sudo apt-get install build-essential software-properties-common -y; sudo add-apt-repository ppa:ubuntu-toolchain-r/test -y; sudo apt update; sudo apt install gcc-snapshot -y; sudo apt update sudo apt install gcc-8 g++-8 -y; sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 60 --slave /usr/bin/g++ g++ /usr/bin/g++-8 sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 60 --slave /usr/bin/g++ g++ /usr/bin/g++-5;
-
HTSLIB (version >=1.10)
- If htslib version 1.10 or higher is not installed, we recommend using
install_deps.sh
in the project folder to install it locally.
- If htslib version 1.10 or higher is not installed, we recommend using
-
- Required for suk
- Can also be installed using conda as follows:
conda install -c bioconda kmc
Building Executable
Run the following commands to build a binary (executable) hypo
in build/bin
: If htslib version 1.10 or higher is installed:
git clone --recursive https://github.com/kensung-lab/hypo hypo
cd hypo
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -Doptimise_for_native=ON ..
make -j 8
If htslib is not installed or the version is smaller than 1.10:
git clone --recursive https://github.com/kensung-lab/hypo hypo
cd hypo
chmod +x install_deps.sh
./install_deps.sh
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -Doptimise_for_native=ON ..
make -j 8
Notes:
- If
--recursive
was omitted fromgit clone
, please rungit submodule init
andgit submodule update
before compiling. - If target machine is different from the one on which Hypo is being compiled, exclude the flag
-Doptimise_for_native=ON
.
Usage of the tool:
Usage: hypo <args>
** Mandatory args:
-r, --reads-short <str>
Input file name containing reads (in fasta/fastq format; can be compressed). A list of files containing file names in each line can be passed with @ prefix.
-d, --draft <str>
Input file name containing the draft contigs (in fasta/fastq format; can be compressed).
-b, --bam-sr <str>
Input file name containing the alignments of short reads against the draft (in bam/sam format; must have CIGAR information).
-c, --coverage-short <int>
Approximate mean coverage of the short reads.
-s, --size-ref <str>
Approximate size of the genome (a number; could be followed by units k/m/g; e.g. 10m, 2.3g).
** Optional args:
-B, --bam-lr <str>
Input file name containing the alignments of long reads against the draft (in bam/sam format; must have CIGAR information).
[Only Short reads polishing will be performed if this argument is not given]
-o, --output <str>
Output file name.
[Default] hypo_<draft_file_name>.fasta in the working directory.
-t, --threads <int>
Number of threads.
[Default] 1.
-p, --processing-size <int>
Number of contigs to be processed in one batch. Lower value means less memory usage but slower speed.
[Default] All the contigs in the draft.
-k, --kind-sr <str>
Kind of the short reads.
[Valid Values]
sr (Corresponding to NGS reads like Illumina reads)
ccs (Corresponding to HiFi reads like PacBio CCS reads)
[Default] sr.
-m, --match-sr <int>
Score for matching bases for short reads.
[Default] 5.
-x, --mismatch-sr <int>
Score for mismatching bases for short reads.
[Default] -4.
-g, --gap-sr <int>
Gap penalty for short reads (must be negative).
[Default] -8.
-M, --match-lr <int>
Score for matching bases for long reads.
[Default] 3.
-X, --mismatch-lr <int>
Score for mismatching bases for long reads.
[Default] -5.
-G, --gap-lr <int>
Gap penalty for long reads (must be negative).
[Default] -4.
-n, --ned-th <int>
Threshold for Normalised Edit Distance of long arms allowed in a window (in %). Higher number means more arms allowed which may slow down the execution.
[Default] 20.
-q, --qual-map-th <int>
Threshold for mapping quality of reads. The reads with mapping quality below this threshold will not be taken into consideration.
[Default] 2.
-i, --intermed
Store or use (if already exist) the intermediate files.
[Currently, only Solid kmers are stored as an intermediate file.].
-h, --help
Print the usage.
输出文件
如果没有提供参数--output
(-o
) ,则默认在和draft contig相同的目录下生成 hypo_X.fasta
。
中间文件
如果使用参数 --intermed
(-i
) , hypo将在第一次运行时将 solid kmers对应的中间文件存储在名为“aux”的文件夹中。另一个指示运行进度的文件也将在该文件夹中创建。在运行下一轮时(使用“-i”),将使用这些中间文件,而不是运行“suk”模块。建议在使用Hypo进行多轮polish时使用此标志,以便使后续轮更快
资源使用
可选参数 --processing-size
(-p
) 控制每一batch中处理的contigs的数目。默认情况下,所有的 contigs 将会被 single batch处理. 在一个batch中处理的contigs数目越多,内存消耗就越高。另一方面,较低的contigs数目可能无法有效地利用线程数量。作为参考,对于整个人类基因组,我们在48核机器上使用 -p
96,使用了大约380G内存,3小时内完成了运行(仅Illumina polishing)。建议使用 -p
是 -t
的倍数。如果基因组大小对于可用的RAM来说不是太大,建议在一个批次中处理所有的contigs(即避免指定 -p
)。
Normalised Edit Distance
可选参数 --ned-th
(or -n
)仅和long reads的polish相关。它设置了标准化编辑距离(Normalised Edit Distance)的阈值,该距离测量long read与其对齐的draft segment之间的相似性。标准化编辑距离=(对齐的read_segment 和draft /aligned_length of draft之间的1bp编辑距离)x100。比阈值越高,更多的read-segments 将会被考虑。我们建议对整个人类基因组polishing使用默认值。
例1 (using Illumina paired-end reads)
Mapping the short reads to contigs:
$R1
, $R2
:short reads (paired-end) ;
$Draft
:draft contigs;
$NUMTH
:线程数;
minimap2 --secondary=no --MD -ax sr -t $NUMTH $DRAFT $R1 $R2 | samtools view -Sb - > mapped-sr.bam
samtools sort -@$NUMTH -o mapped-sr.sorted.bam mapped-sr.bam
samtools index mapped-sr.sorted.bam
rm mapped-sr.bam
Mapping the long reads to contigs:
$LONGR
:long reads (PacBio or ONT) ;
$Draft
:draft contigs;
$NUMTH
:线程数;
$RTYPE
:map-pb
for PacBio; map-ont
for ONT reads.
minimap2 --secondary=no --MD -ax $RTYPE -t $NUMTH $DRAFT $LONGR | samtools view -Sb - > mapped-lg.bam
samtools sort -@$NUMTH -o mapped-lg.sorted.bam mapped-lg.bam
samtools index mapped-lg.sorted.bam
rm mapped-lg.bam
Running Hypo:
首先,创建一个含有short reads的文本文件。
echo -e "$R1\n$R2" > il_names.txt
基因组大约3Gbp(-s
),Illumina reads的覆盖度大约55x(-c
). 使用48线程(-t
),每个batch运行96个contigs(-p
).
仅仅使用short reads 进行polish:
./bin/hypo -d $DRAFT -r @il_names.txt -s 3g -c 55 -b mapped-sr.sorted.bam -p 96 -t 48 -o whole_genome.h.fa
使用short reads 和long reads进行polish:
./bin/hypo -d $DRAFT -r @il_names.txt -s 3g -c 55 -b mapped-sr.sorted.bam -B mapped-lg.sorted.bam -p 96 -t 48 -o whole_genome.h2.fa
例 2 (using CCS reads)
Mapping the CCS reads to contigs:
$READS
:CCS reads;
$Draft
:draft contigs;
$NUMTH
:线程数;
minimap2 --secondary=no --MD -ax asm20 -t $NUMTH $DRAFT $READS | samtools view -Sb - > mapped-ccs.bam
samtools sort -@$NUMTH -o mapped-ccs.sorted.bam mapped-ccs.bam
samtools index mapped-ccs.sorted.bam
rm mapped-ccs.bam
Running Hypo:
基因组大约3Gbp(-s
),CCS reads的覆盖度大约33x(-c
). 使用48线程(-t
),一个batch运行全部的contigs.
仅仅 CCS polish:
./bin/hypo -d $DRAFT -r $READS -s 3g -c 30 -b mapped-ccs.sorted.bam -t 48 -o whole_genome.h.fa
Method and Results
对于整个人类基因组(HG002)的组装,仅使用Illumina reads,Hypo花了大约3个小时和大约380G RAM来polish(在48核48线程的机器上)。对于使用Illumina和PacBio reads进行抛光,使用约4小时15分钟和大约410G RAM。方法和部分结果可在BioRxiv找到.
HyPo: Super Fast & Accurate Polisher for Long Read Genome Assemblies
Ritu Kundu, Joshua Casey, Wing-Kin Sung
bioRxiv 2019.12.19.882506; doi: https://doi.org/10.1101/2019.12.19.882506
Limitations
- 待polish的contig/scaffold中的N将和碱基一样对待,(即,N将被对应POA图中与之align的最频繁的碱基所取代)。
- contig/scaffold的最大长度被限制在4,294,967,295(32位无符号整数的最大值)