HyPo: Super Fast & Accurate Poli

2020-07-23 本文已影响0人生信师姐

HyPo--全称 Hybrid Polisher．使用short reads或者long reads对组装的基因组序列进行polish. 它利用基因组中唯一的kmers，通过部分排好序的read-segments比对片段来选择性地polish contigs的segments。

请注意，“short reads”不一定非得是NGS short reads; 由PacBio SequelII产生的HiFi基因组reads(例如CCS)也可以。要求：这些reads必须高度准确(>98%的准确度)。

为什么会有一个racon的比较，一个
As demonstrated on human genome assemblies, Hypo generates significantly more accurate polished assembly in about one-third time with about half the memory requirements in comparison to contemporary widely used polishers like Racon.

Hypo 的输入文件需满足如下要求:

Short reads/HiFi reads (FASTA/FASTQ，可以是压缩文件)
Draft contigs (in FASTA/FASTQ，可以是压缩文件)
Alignments between short reads (or HiFi reads) and the draft (in sam/bam format; should contain CIGAR)．Long (noisy) reads 也可以, 需和draft基因组文件比对生成sam/bam格式文件．
预期 short reads（或HiFi reads）的平均覆盖率和基因组的大致大小。

在下面的内容中，HiFi reads可以替代short reads：

理论上，将draft contigs（未纠错的）大致分为两类区域（片段）：强区域和弱区域。
强区域是那些有足够证据支持，证明其正确性的区域，因此不需要polish。
弱区域将使用POA进行polish。每个弱区域将使用short reads或long reads进行polish；short reads优先于long reads。
为了确定强区域，我们使用solid kmers（预估的唯一基因组kmers）。强区域在选择read-segments来polish相邻的弱区时也起到了一定的作用。此外，我们的方法也将long reads及他们组装产生的homopolymer error考虑其中。

Installation

Hypo is only available for Unix-like platforms (Linux and MAC OS). We recommend using Option_1 for a convenient installation and in the case where target machine is different from the one on which hypo is installed. On the other hand, Option_2 is more suitable if the binary of hypo is to be run on the same machine on which it is compiled because then a machine-specific optimised (and thus slightly faster) binary can be produced using the flag -Doptimise_for_native=ON.

Option_1: Conda Package Installation

The convenient way of installation is using the conda package as follows:

conda install -c bioconda hypo

Htslib 1.10 may cause conflicts with some of the already installed packages. In such a case, hypo can be installed in a new environment as follows:

conda create --name hypo_env
conda activate hypo_env  
conda install -c bioconda hypo

Option_2: Installation from the source

CmakeLists is provided in the project root folder.

Pre-requisites

For installing from the source, the following requirements are assumed to be installed already (with path to their binaries available in $PATH).

Zlib
OpenMP

GCC (>=7.3)

Following are the commands to update GCC (say to GCC 8) on an Ubuntu machine (from say GCC 5):

  sudo apt-get update; sudo apt-get install build-essential software-properties-common -y;
  sudo add-apt-repository ppa:ubuntu-toolchain-r/test -y; sudo apt update; 
  sudo apt install gcc-snapshot -y; sudo apt update
  sudo apt install gcc-8 g++-8 -y; 
  sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 60 --slave /usr/bin/g++ g++ /usr/bin/g++-8
  sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 60 --slave /usr/bin/g++ g++ /usr/bin/g++-5;

HTSLIB (version >=1.10)
- If htslib version 1.10 or higher is not installed, we recommend using install_deps.sh in the project folder to install it locally.
KMC3
- Required for suk
- Can also be installed using conda as follows:
```
conda install -c bioconda kmc
```

Building Executable

Run the following commands to build a binary (executable) hypo in build/bin : If htslib version 1.10 or higher is installed:

  git clone --recursive https://github.com/kensung-lab/hypo hypo
  cd hypo
  mkdir build
  cd build
  cmake -DCMAKE_BUILD_TYPE=Release -Doptimise_for_native=ON ..
  make -j 8

If htslib is not installed or the version is smaller than 1.10:

  git clone --recursive https://github.com/kensung-lab/hypo hypo
  cd hypo
  chmod +x install_deps.sh
  ./install_deps.sh
  mkdir build
  cd build
  cmake -DCMAKE_BUILD_TYPE=Release -Doptimise_for_native=ON ..
  make -j 8

Notes:

If --recursive was omitted from git clone, please run git submodule init and git submodule update before compiling.
If target machine is different from the one on which Hypo is being compiled, exclude the flag -Doptimise_for_native=ON.

Usage of the tool:

 Usage: hypo <args>

 ** Mandatory args:
    -r, --reads-short <str>
    Input file name containing reads (in fasta/fastq format; can be compressed). A list of files containing file names in each line can be passed with @ prefix.

    -d, --draft <str>
    Input file name containing the draft contigs (in fasta/fastq format; can be compressed). 

    -b, --bam-sr <str>
    Input file name containing the alignments of short reads against the draft (in bam/sam format; must have CIGAR information). 

    -c, --coverage-short <int>
    Approximate mean coverage of the short reads. 

    -s, --size-ref <str>
    Approximate size of the genome (a number; could be followed by units k/m/g; e.g. 10m, 2.3g). 

** Optional args:
    -B, --bam-lr <str>
    Input file name containing the alignments of long reads against the draft (in bam/sam format; must have CIGAR information). 
    [Only Short reads polishing will be performed if this argument is not given]

    -o, --output <str>
    Output file name. 
    [Default] hypo_<draft_file_name>.fasta in the working directory.

    -t, --threads <int>
    Number of threads. 
    [Default] 1.

    -p, --processing-size <int>
    Number of contigs to be processed in one batch. Lower value means less memory usage but slower speed. 
    [Default] All the contigs in the draft.

    -k, --kind-sr <str>
    Kind of the short reads. 
    [Valid Values] 
        sr  (Corresponding to NGS reads like Illumina reads) 
        ccs (Corresponding to HiFi reads like PacBio CCS reads) 
    [Default] sr.

    -m, --match-sr <int>
    Score for matching bases for short reads. 
    [Default] 5.

    -x, --mismatch-sr <int>
    Score for mismatching bases for short reads. 
    [Default] -4.

    -g, --gap-sr <int>
    Gap penalty for short reads (must be negative). 
    [Default] -8.

    -M, --match-lr <int>
    Score for matching bases for long reads. 
    [Default] 3.

    -X, --mismatch-lr <int>
    Score for mismatching bases for long reads. 
    [Default] -5.

    -G, --gap-lr <int>
    Gap penalty for long reads (must be negative). 
    [Default] -4.

    -n, --ned-th <int>
    Threshold for Normalised Edit Distance of long arms allowed in a window (in %). Higher number means more arms allowed which may slow down the execution.
    [Default] 20.

    -q, --qual-map-th <int>
    Threshold for mapping quality of reads. The reads with mapping quality below this threshold will not be taken into consideration. 
    [Default] 2.

    -i, --intermed
    Store or use (if already exist) the intermediate files. 
    [Currently, only Solid kmers are stored as an intermediate file.].

    -h, --help
    Print the usage.

输出文件

如果没有提供参数--output (-o) ，则默认在和draft contig相同的目录下生成 hypo_X.fasta。

中间文件

如果使用参数 --intermed (-i) , hypo将在第一次运行时将 solid kmers对应的中间文件存储在名为“aux”的文件夹中。另一个指示运行进度的文件也将在该文件夹中创建。在运行下一轮时（使用“-i”），将使用这些中间文件，而不是运行“suk”模块。建议在使用Hypo进行多轮polish时使用此标志，以便使后续轮更快

资源使用

可选参数 --processing-size (-p) 控制每一batch中处理的contigs的数目。默认情况下，所有的 contigs 将会被 single batch处理. 在一个batch中处理的contigs数目越多，内存消耗就越高。另一方面，较低的contigs数目可能无法有效地利用线程数量。作为参考，对于整个人类基因组，我们在48核机器上使用 -p 96，使用了大约380G内存，3小时内完成了运行（仅Illumina polishing）。建议使用 -p是 -t的倍数。如果基因组大小对于可用的RAM来说不是太大，建议在一个批次中处理所有的contigs（即避免指定 -p）。

Normalised Edit Distance

可选参数 --ned-th (or -n)仅和long reads的polish相关。它设置了标准化编辑距离(Normalised Edit Distance)的阈值，该距离测量long read与其对齐的draft segment之间的相似性。标准化编辑距离=（对齐的read_segment 和draft /aligned_length of draft之间的1bp编辑距离）x100。比阈值越高，更多的read-segments 将会被考虑。我们建议对整个人类基因组polishing使用默认值。

例1 (using Illumina paired-end reads)

Mapping the short reads to contigs:

$R1, $R2：short reads (paired-end) ；
$Draft：draft contigs；
$NUMTH ：线程数；

minimap2 --secondary=no --MD -ax sr -t $NUMTH $DRAFT $R1 $R2 | samtools view -Sb - > mapped-sr.bam
samtools sort -@$NUMTH -o mapped-sr.sorted.bam mapped-sr.bam
samtools index mapped-sr.sorted.bam
rm mapped-sr.bam

Mapping the long reads to contigs:

$LONGR：long reads (PacBio or ONT) ;
$Draft：draft contigs；
$NUMTH：线程数；
$RTYPE：map-pb for PacBio； map-ont for ONT reads.

minimap2 --secondary=no --MD -ax $RTYPE -t $NUMTH $DRAFT $LONGR | samtools view -Sb - > mapped-lg.bam
samtools sort -@$NUMTH -o mapped-lg.sorted.bam mapped-lg.bam
samtools index mapped-lg.sorted.bam
rm mapped-lg.bam

Running Hypo:

首先，创建一个含有short reads的文本文件。

echo -e "$R1\n$R2" > il_names.txt

基因组大约3Gbp（-s），Illumina reads的覆盖度大约55x（-c）. 使用48线程（-t），每个batch运行96个contigs（-p）.

仅仅使用short reads 进行polish：

./bin/hypo -d $DRAFT -r @il_names.txt -s 3g -c 55 -b mapped-sr.sorted.bam -p 96 -t 48 -o whole_genome.h.fa

使用short reads 和long reads进行polish：

./bin/hypo -d $DRAFT -r @il_names.txt -s 3g -c 55 -b mapped-sr.sorted.bam -B mapped-lg.sorted.bam -p 96 -t 48 -o whole_genome.h2.fa

例 2 (using CCS reads)

Mapping the CCS reads to contigs:

$READS：CCS reads；
$Draft：draft contigs；
$NUMTH：线程数；

minimap2 --secondary=no --MD -ax asm20 -t $NUMTH $DRAFT $READS | samtools view -Sb - > mapped-ccs.bam
samtools sort -@$NUMTH -o mapped-ccs.sorted.bam mapped-ccs.bam
samtools index mapped-ccs.sorted.bam
rm mapped-ccs.bam

Running Hypo:

基因组大约3Gbp（-s），CCS reads的覆盖度大约33x（-c）. 使用48线程（-t），一个batch运行全部的contigs.

仅仅 CCS polish:

./bin/hypo -d $DRAFT -r $READS -s 3g -c 30 -b mapped-ccs.sorted.bam -t 48 -o whole_genome.h.fa

Method and Results

对于整个人类基因组（HG002）的组装，仅使用Illumina reads，Hypo花了大约3个小时和大约380G RAM来polish（在48核48线程的机器上）。对于使用Illumina和PacBio reads进行抛光，使用约4小时15分钟和大约410G RAM。方法和部分结果可在BioRxiv找到.

HyPo: Super Fast & Accurate Polisher for Long Read Genome Assemblies
Ritu Kundu, Joshua Casey, Wing-Kin Sung
bioRxiv 2019.12.19.882506; doi: https://doi.org/10.1101/2019.12.19.882506

Limitations

待polish的contig/scaffold中的N将和碱基一样对待，（即，N将被对应POA图中与之align的最频繁的碱基所取代）。
contig/scaffold的最大长度被限制在4,294,967,295（32位无符号整数的最大值）

External Libraries

sdsl-lite 用于秩选择和位向量数据结构.
slog 被用来打印时间和内存使用情况。
suk 被用作计算solid (唯一的) kmers的模块。
POA使用了spoa库的改编版本。

官网：
https://github.com/kensung-lab/hypo

HyPo: Super Fast & Accurate Poli

Installation

Option_1: Conda Package Installation

Option_2: Installation from the source

Pre-requisites

Building Executable

Usage of the tool:

输出文件

中间文件

资源使用

Normalised Edit Distance

例1 (using Illumina paired-end reads)

Mapping the short reads to contigs:

Mapping the long reads to contigs:

Running Hypo:

例 2 (using CCS reads)

Mapping the CCS reads to contigs:

Running Hypo:

Method and Results

Limitations

External Libraries

猜你喜欢

热点阅读