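Algorithm Literature Reading 2: DIAMOND (2021 version)
Picking up where the previous post left off, this time we read through the latest version of DIAMOND together. Once again, it was published in Nature Methods.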
Sensitive protein alignments at tree-of-life scale using DIAMOND
We are at the beginning of a genomic revolution in which all known species are planned to be sequenced. Accessing such data for comparative analyses is crucial in this new age of data-driven biology. Here, we introduce an improved version of DIAMOND that greatly exceeds previous search performances and harnesses supercomputing to perform tree-of-life scale protein alignments in hours, while matching the sensitivity of the gold standard BLASTP.
(Impressive. Truly impressive.)
Within the next decade, the Earth BioGenome Project (refs. 1,2) aims to sequence and assemble the reference genomes for ~1.5 million of the 10–15 million known eukaryotic species that inhabit our planet. Current sequence search algorithms and software tools would be impractical for analyzing data of this magnitude when aiming to retain sensitivity similar to the gold standard BLASTP (ref. 3). In an experimental study we estimated that querying the National Center for Biotechnology Information (NCBI) non-redundant (nr) database (~280 million sequences) against the UniRef50 database (~40 million sequences) using BLASTP would require more than 2 months even on a supercomputer equipped with 20,800 cores (Methods). However, the newly developed version of DIAMOND can now accomplish the same task in several hours, with an alignment sensitivity that matches BLAST. We overcome this computational bottleneck and enable sensitive large-scale protein searches on a tree-of-life scale by introducing improved algorithmic procedures and a customized high-performance computing framework, which incorporate optimized distributed computing, double indexing and multiple spaced seeding. This version of DIAMOND is available as Open Source Software under the GPL3 license (http://www.diamondsearch.org).
DIAMOND is a fast and sensitive protein aligner that was initially developed for metagenomics applications to achieve ultra-fast alignments at the cost of alignment sensitivity, compared with the gold standard, BLAST. Although DIAMOND is proven to be practical for many metagenomics studies that also often rely on k-mer information for annotation and classification (ref. 4), most functional and phylogenomic studies rely heavily on high alignment sensitivity to obtain useful insights about the functional conservation of proteins and their evolutionary divergence along phylogenetic lineages. For data-intensive studies in these fields, BLAST remains the tool of choice due to its paramount alignment sensitivity.
Here, we introduce a greatly improved version of DIAMOND that provides two sensitivity modes, --very-sensitive and --ultra-sensitive, which will enable data-intensive comparative genomics research such as tree-of-life scale tracing of protein evolution (ref. 5), gene age inference (refs. 6,7), and functional annotation of genes and gene families (ref. 8) to be carried out with the same accuracy as BLAST, but with an 80–360-fold computational speedup. In --ultra-sensitive mode, DIAMOND (v2.0.7) achieves this BLAST-like sensitivity milestone while reducing the computational run time of BLASTP-heavy studies from months to hours.
This version of DIAMOND is different from other protein aligners and its older versions in that it focuses on ultra-fast but sensitive protein searches that can scale with sequencing efforts; for example, to meet the demands of the large-scale Earth BioGenome Project and analogous bulk sequencing projects. Alternative tools such as BLASTP (ref. 3), USearch (ref. 9), LAST (ref. 10) or MMSeqs2 (ref. 11) are also optimized to run fast protein alignments, but still require longer computation times and, with the exception of BLAST, are less sensitive than DIAMOND when dealing with very large datasets. These tools already experience limitations when they try to handle searches at the scale of the NCBI nr database, which currently contains the largest collection of sequences, representing genomic information for ~12,000 eukaryotic species. Therefore we sought to build a protein search infrastructure that can accommodate the demand of sensitive homology searches on this exponentially growing dataset of sequenced species.
DIAMOND now achieves this goal by providing four different levels of alignment sensitivity and by optimizing two distinct computational paradigms. First, it leverages an ultra-fast integration of algorithmic steps optimized for the latest generation of computer architectures that are designed to function optimally when dealing with massive query and subject databases. Second, it harnesses high-performance computing (HPC) and cloud computing by providing a powerful distributed computing implementation customized for large-scale protein searches, incorporating our new DIAMOND search scheme (Methods). In summary, our method is based upon on-the-fly double indexing (in which both the reference database and the query are indexed for comparison) and hash join on the seed space spanned by up to 64 multiple spaced seeds (seeds that are extracted from the sequence according to a pattern of ‘match’ and ‘don’t care’ positions) to greatly improve the specificity of seeding relative to a baseline strategy. Furthermore, double indexing focuses the comparison operations with respect to a seed and enables the operations to be streamed through the CPU in an efficient, cache-aware manner, avoiding the memory latency bottleneck of a classical single-indexed seed lookup approach. A chain of heuristic filter stages that makes heavy use of vector instructions is designed to gradually eliminate spurious hits, while passing on potentially significant alignments to a vectorized Smith–Waterman extension.
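To make the double-indexing idea concrete for myself, here is a minimal Python sketch of spaced seeds plus a seed-keyed join between a query index and a reference index. It is only a toy under my own assumptions: the seed patterns `"1101"` and `"111"`, the function names and the example sequences are all hypothetical, and a plain dictionary intersection stands in for the cache-aware hash join and the vectorized filter stages that the paper describes.

```python
from collections import defaultdict


def spaced_seeds(seq, pattern):
    """Yield (seed, position) pairs; '1' = match position, '0' = don't care."""
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    for pos in range(len(seq) - len(pattern) + 1):
        window = seq[pos:pos + len(pattern)]
        yield "".join(window[i] for i in keep), pos


def build_index(seqs, pattern):
    """Map each seed to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in seqs.items():
        for seed, pos in spaced_seeds(seq, pattern):
            index[seed].append((seq_id, pos))
    return index


def seed_hits(queries, references, patterns):
    """Index BOTH sides for every pattern, then join the indexes on the seed."""
    hits = set()
    for pattern in patterns:
        q_index = build_index(queries, pattern)
        r_index = build_index(references, pattern)
        # Only seeds present on both sides can produce a hit.
        for seed in q_index.keys() & r_index.keys():
            for q_id, q_pos in q_index[seed]:
                for r_id, r_pos in r_index[seed]:
                    hits.add((q_id, q_pos, r_id, r_pos))
    return hits


# Hypothetical toy data: q1 and r1 differ only at position 2 (T vs S).
queries = {"q1": "MKTAYIAKQR"}
references = {"r1": "MKSAYIAKQR", "r2": "GGGGGGGGGG"}

spaced = seed_hits(queries, references, ["1101"])
contiguous = seed_hits(queries, references, ["111"])
print(("q1", 0, "r1", 0) in spaced)      # spaced seed skips the mismatch
print(("q1", 0, "r1", 0) in contiguous)  # contiguous seed does not
```

In this toy case the spaced pattern still seeds at position 0 because the mismatching residue falls on a "don't care" position, while the contiguous pattern misses it. Candidate hits like these would then go through filtering and Smith–Waterman extension, which the sketch leaves out.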
We demonstrate the search capabilities of DIAMOND (v2.0.7) by systematically comparing its performance against BLASTP (v2.10.0) and MMSeqs2 (release 11), and against an older version of DIAMOND (v0.7.12), all of which are currently the most promising alternatives for sensitive tree-of-life scale protein searches (Fig. 1). To create a benchmark dataset covering annotated protein sequences spanning the full diversity of the tree of life, we downloaded the NCBI nr database (25 October 2019) and annotated each protein sequence in accordance with their SCOPe (structural classification of proteins–extended) domains (ref. 12) (http://scop.berkeley.edu/) (Methods). Establishing a ground truth on the basis of SCOP domains has been considered the gold standard for benchmarking protein aligners (ref. 13). As a result of the annotation, we obtained a query dataset of ~1.7 million protein sequences covering ~1,000 representative sequences for each SCOPe superfamily. Furthermore, we annotated the UniRef50 database (ref. 14) (accessed 14 September 2019) following the same procedure to serve as a reference database for the benchmark.
Fig. 1 | Benchmark of DIAMOND, MMSeqs2 and BLASTP using various sensitivity modes. Computational speedup and alignment sensitivity comparisons were carried out between the new version of DIAMOND, v2.0.7 (using default, --sensitive, --very-sensitive and --ultra-sensitive modes), the old version of DIAMOND, v0.7.12 (using default and --sensitive modes), MMSeqs2 release 11 (using modes s = 1.0, s = 2.5, s = 6.0, s = 7.5, s = 7.5** with --max-seqs 100000), BLASTP v2.10.0 and QuickBLAST v0.0.0. a, Alignment sensitivity (AUC1) measured as the fraction of the query’s protein family covered until the first false positive, averaged over all queries in the benchmark dataset. Dashed vertical line, alignment sensitivity level of BLASTP v2.10.0 (AUC1 = 0.622). b, ROC curves of the same benchmark showing the true average error rate per query versus the average coverage of the protein family, depending on the e-value threshold.
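The AUC1 metric in panel a is easier to grasp with a few lines of code. The sketch below is my simplified reading of the caption, not the authors' benchmark script: I assume each query's hits are already sorted by increasing e-value, each reference sequence appears at most once, and the family size is known from the SCOPe annotation. The function names are my own.

```python
def auc1(ranked_hits, family_size):
    """Fraction of the query's family recovered before the first false positive.

    ranked_hits: booleans sorted by increasing e-value,
                 True = hit belongs to the same family as the query.
    family_size: number of family members present in the reference database.
    """
    true_positives = 0
    for is_true_positive in ranked_hits:
        if not is_true_positive:
            break  # stop counting at the first false positive
        true_positives += 1
    return true_positives / family_size


def mean_auc1(per_query):
    """Average AUC1 over (ranked_hits, family_size) pairs, one per query."""
    return sum(auc1(hits, size) for hits, size in per_query) / len(per_query)


# Hypothetical toy results for three queries.
toy = [
    ([True, True, False, True], 4),  # 2 of 4 members before the first error
    ([False, True], 2),              # first hit is already a false positive
    ([True, True], 2),               # whole family found, no error
]
print(mean_auc1(toy))
```

Read this way, the reported AUC1 of 0.622 for BLASTP would mean that, averaged over queries, roughly 62% of each query's family is recovered before the first false positive.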
It is important to note that some previous performance benchmarks between older versions of DIAMOND and other aligners (ref. 15) used small benchmark datasets for the comparison with DIAMOND. As stated earlier, DIAMOND is optimized for searches using large query and reference databases. Therefore valuable benchmarking insights can only be achieved when comparing DIAMOND and other tools using large benchmark datasets, rather than focusing on small query or reference examples.
We ran DIAMOND (v2.0.7) in all four sensitivity modes using our SCOPe-annotated benchmark dataset as a query against the UniRef50 database, and compared its computational performance and level of sensitivity against analogous runs performed with BLASTP (v2.10.0), MMSeqs2 (release 11) and DIAMOND (v0.7.12). Figure 1 shows the benchmarking results of these alignments against the UniRef50 database. For each tool, we show the performance increase of the respective search algorithm over BLASTP against the average recall of a query’s protein family until the first false positive (Fig. 1a), and the corresponding receiver operating characteristic (ROC) curve (Fig. 1b). We found that DIAMOND (v2.0.7) computed alignments 12–15-fold faster than MMSeqs2 (release 11) while maintaining similar sensitivity levels. When the new DIAMOND was compared with older versions of DIAMOND (ref. 16; v0.7.12) we observed a 6–8-fold speedup, while the old DIAMOND was also far behind the other benchmarked tools in terms of sensitivity. When comparing DIAMOND (v2.0.7) to BLASTP (v2.10.0) we observed an ~8,000-fold speedup when using the least sensitive mode, and still an 80-fold speedup when running DIAMOND (v2.0.7) with a sensitivity level matching that of BLASTP (ultra-sensitive mode). Closer inspection of the trade-off between sensitivity and specificity on the basis of ROC curves (Fig. 1b) shows that DIAMOND (v2.0.7) in both the very-sensitive and ultra-sensitive modes maintained equal or marginally better sensitivity than BLAST at low error rates, while being only slightly surpassed by BLAST at error rates of >1 false positive per query (in which searches at error rates of >1 have only rare practical applications). We also conclude that the more sophisticated repeat masking used by DIAMOND (v2.0.7) (Methods) enables lower true error rates at similar sensitivity levels.
(This paragraph describes the benchmark comparison on the test dataset shown in Fig. 1.)
In addition, we compared older versions of BLASTP (v2.2.31; 2015) to the 2019 version of BLASTP (v2.10.0) and found that the 2019 version of BLASTP was fourfold faster than its 2015 version. Although this speedup is impressive, we are not able to envision a scenario in which this rate of increase will enable tree-of-life scale protein alignments when dealing with sequences from millions of eukaryotic species.
To demonstrate the capabilities of our tool when supported by an HPC infrastructure, we aligned all 281 million protein sequences from the NCBI nr database against the UniRef50 database, which consists of 39 million sequences, using DIAMOND (v2.0.7) in ultra-sensitive mode on the Cobra supercomputer of the Max Planck Society. This comprehensive comparison across all domains of life was computed in less than 18 hours using 520 compute nodes (Fig. 2 and Extended Data Fig. 1), compared with an estimated 2 months with BLAST.
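A quick back-of-the-envelope check of these numbers, as a sketch only. I assume a 30-day month, and since the paper says "more than 2 months" and "less than 18 hours", the result is a lower bound on the speedup. The pair count is simply the size of the full query-by-reference search space, which no aligner enumerates directly.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month

blast_hours = 2 * HOURS_PER_MONTH  # "more than 2 months" (lower bound)
diamond_hours = 18                 # "less than 18 hours" (upper bound)
speedup = blast_hours / diamond_hours

# Size of the all-against-all search space: nr queries x UniRef50 references.
sequence_pairs = 281_000_000 * 39_000_000

# Consistency check: 20,800 cores spread over 520 nodes.
cores_per_node = 20_800 / 520

print(speedup)          # 80.0
print(sequence_pairs)   # 10959000000000000
print(cores_per_node)   # 40.0
```

The factor of about 80 agrees with the 80-fold speedup reported for ultra-sensitive mode in the benchmark, and 20,800 cores over 520 nodes works out to 40 cores per node, so the BLASTP estimate and the DIAMOND run appear to refer to the same amount of hardware.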
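My summary:
All in all, the authors go to great lengths to compare and demonstrate the advantages of DIAMOND version 2: very fast computation with sensitivity comparable to the gold standard BLASTP. The whole paper emphasizes that DIAMOND version 2 excels at ultra-fast comparison of massive datasets without sacrificing sensitivity. My own take: if time allows and you want higher sensitivity, set the mode to --ultra-sensitive (because its sensitivity is comparable to that of BLASTP).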
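For my own use, choosing the mode can be wrapped in a small helper that assembles the `diamond blastp` command line. This is a sketch under assumptions: the database and file names are hypothetical, the helper function is mine rather than part of DIAMOND, and it only builds the command. Running it would need DIAMOND installed and a database built beforehand with `diamond makedb`.

```python
import shlex

# The four sensitivity levels benchmarked in the paper; default needs no flag.
SENSITIVITY_FLAGS = {
    "default": None,
    "sensitive": "--sensitive",
    "very-sensitive": "--very-sensitive",
    "ultra-sensitive": "--ultra-sensitive",
}


def diamond_blastp_command(db, query, out, mode="ultra-sensitive", threads=None):
    """Build (but do not run) a `diamond blastp` command as an argument list."""
    if mode not in SENSITIVITY_FLAGS:
        raise ValueError(f"unknown sensitivity mode: {mode!r}")
    cmd = ["diamond", "blastp", "--db", db, "--query", query, "--out", out]
    flag = SENSITIVITY_FLAGS[mode]
    if flag is not None:
        cmd.append(flag)
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


# Hypothetical file names, for illustration only.
cmd = diamond_blastp_command("uniref50", "queries.fasta", "matches.tsv")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Dropping to `mode="very-sensitive"` or `mode="sensitive"` trades some sensitivity for speed, as the benchmark in Fig. 1 suggests.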
important references
Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).
Finally:
DIAMOND website: https://github.com/bbuchfink/diamond