Literature Reading 3.0 KaKs_Calculator 3.0: Calculating…

2022-08-13, by 龙star180

Journal

Genomics, Proteomics & Bioinformatics (6.409/Q1)

KaKs_Calculator 3.0: Calculating Selective Pressure on Coding and Non-coding Sequences

Abstract

KaKs_Calculator 3.0 is an updated toolkit that is capable of calculating selective pressure on both coding and non-coding sequences. Similar to the nonsynonymous/synonymous substitution rate ratio for coding sequences, selection on non-coding sequences can be quantified as the ratio of the non-coding nucleotide substitution rate to the synonymous substitution rate of adjacent coding sequences. As tested on empirical data, KaKs_Calculator 3.0 is effective in detecting the strength and mode of selection acting on molecular sequences, demonstrating its potential for genome-wide scans of natural selection on diverse sequences and for identifying potentially functional elements at whole-genome scale. The KaKs_Calculator 3.0 package is freely available for academic use only at https://ngdc.cncb.ac.cn/biocode/tools/BT000001.

KEYWORDS: KaKs_Calculator; Selective pressure; Substitution; Coding; Non-coding

Introduction

Detecting natural selection on molecular sequences is of fundamental significance in molecular evolution, comparative genomics, and phylogenetic reconstruction, which can provide profound insights for revealing evolutionary processes of molecular sequences and unveiling complex molecular mechanisms of genome evolution. In principle, estimating selection on DNA sequences requires a reference set of substitutions that is free from selection. As synonymous substitutions do not provoke amino acid changes due to the degeneracy of the genetic code, they are expected to be invisible to selection and thus widely used as a reference that reflects the neutral rate of evolution. Consequently, the ratio of nonsynonymous substitution rate (Ka or dN) to synonymous substitution rate (Ks or dS), namely, ω = Ka/Ks (or dN/dS), is widely adopted to differentiate neutral mutation (ω ≈ 1) from negative (purifying) selection (ω < 1) and positive (adaptive) selection (ω > 1), accordingly providing a powerful tool for illuminating molecular evolution of coding sequences (see a popular package in [3]).

A growing body of evidence shows that non-coding sequences, historically regarded as “junk” because little was known about their function relative to coding sequences, are functional elements that play important regulatory roles in multiple biological processes and are closely associated with various human diseases. Although less conserved than coding sequences, a large number of non-coding sequences have been found to be highly conserved across mammalian genomes. Importantly, more non-coding sequences are subject to positive and negative selection than previously believed, and long non-coding RNA (lncRNA) sequences in particular do experience natural selection. As a result, several computational methods have been proposed for detecting selection acting on non-coding sequences. They differ primarily in how they choose a reference of unconstrained evolution, such as synonymous substitutions of neighboring coding genes, intron sequences, or ancestral repeats. However, an implemented algorithm for detecting the strength and mode of selective pressure on non-coding sequences is still lacking, despite the increasing number of non-coding studies conducted worldwide. More importantly, an integrated toolkit capable of detecting selection on both coding and non-coding sequences is highly desirable, as it would help users perform genome-wide scans of natural selection on diverse sequences.

Toward this end, here we present KaKs_Calculator 3.0, an updated toolkit for calculating selective pressure on both coding and non-coding sequences. Compared with previous versions that focus solely on coding sequences, we implement an algorithm in KaKs_Calculator 3.0 that employs synonymous sites of adjacent coding sequences as a reference to estimate selective pressure acting on non-coding sequences. We test it on empirical data and demonstrate its utility in diagnosing the strength and form of molecular evolution.

Algorithm

The major update in KaKs_Calculator 3.0 is an algorithm that estimates selective pressure on non-coding sequences. Specifically, it uses synonymous substitutions as a reference baseline (similar to [13]); although thought to be under weak selection, they have been widely adopted for determining the strength and type of selection acting on coding sequences. Similar to the Ka/Ks ratio for coding sequences, selective pressure on non-coding sequences (ξ) can be quantified as the ratio of the non-coding nucleotide substitution rate (Kn) to the neutral substitution rate (assumed to be Ks), viz. ξ = Kn/Ks, where Ks is inferred from adjacent coding sequences. As the number of observed substitutions is smaller than the number of real substitutions, a nucleotide substitution model (e.g., JC/K2P/HKY) is adopted to correct for multiple substitutions in non-coding sequences. Taking the HKY model as an example, Kn can be deduced from the observed transitional and transversional substitutions (S and V, respectively) as well as the four nucleotide frequencies (πA, πT, πG, and πC), according to Equation 1 (see Equations 1.27 and 1.28 in [31]).
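The correction step and the ratio ξ = Kn/Ks can be illustrated with the two simpler models named above. The following is a minimal sketch, not the KaKs_Calculator implementation: it uses the standard JC69 and K2P distance formulas rather than the HKY-based Equation 1, and all function names are hypothetical. Here S and V are taken as the proportions of sites showing transitional and transversional differences.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (compared sites, transitions, transversions) for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        # A transition stays within purines (A<->G) or within pyrimidines (C<->T).
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return sites, transitions, transversions


def jc69_distance(p):
    """JC69 correction: d = -3/4 * ln(1 - 4p/3), p = proportion of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(s, v):
    """K2P correction: d = -1/2 * ln(1 - 2S - V) - 1/4 * ln(1 - 2V)."""
    return -0.5 * math.log(1.0 - 2.0 * s - v) - 0.25 * math.log(1.0 - 2.0 * v)


def estimate_kn(seq1, seq2, model="K2P"):
    """Estimate Kn for an aligned non-coding pair (hypothetical helper).

    math.log raises ValueError when the sequences are too divergent
    for the chosen correction to be defined.
    """
    sites, ts, tv = count_differences(seq1, seq2)
    if sites == 0:
        raise ValueError("no comparable sites")
    s, v = ts / sites, tv / sites
    if model == "JC":
        return jc69_distance(s + v)
    if model == "K2P":
        return k2p_distance(s, v)
    raise ValueError(f"unsupported model: {model}")


def xi(kn, ks):
    """Selective pressure on a non-coding sequence: xi = Kn / Ks."""
    if ks <= 0:
        raise ValueError("Ks must be positive")
    return kn / ks
```

With a Ks value obtained separately (for example `ks = 0.5`, a made-up number used only for illustration), `xi(estimate_kn(a, b), ks)` gives the ratio for an aligned pair `a`, `b`.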

To detect and quantify selection on non-coding sequences, KaKs_Calculator 3.0 provides two ways to obtain the neutral mutation rate, or Ks: it is either calculated from adjacent coding sequences uploaded by the user or specified directly by the user (Figure 1). As a consequence, KaKs_Calculator 3.0 is capable of detecting selection on both coding and non-coding sequences.

Figure 1

Figure 1 Graphical user interface of KaKs_Calculator 3.0. It contains two panels, designed for CDS and NCS, respectively. Methods for detecting selection on CDS are classified as: (i) approximate methods: NG by Nei and Gojobori (1986), LWL by Li, Wu, and Luo (1985), LPB by Li (1993) & Pamilo and Bianchi (1993), MLWL (Modified LWL) & MLPB (Modified LPB) by Tzeng, Pan, and Li (2004), YN by Yang and Nielsen (2000), MYN (Modified YN) by Zhang, Li, and Yu (2006); (ii) maximum-likelihood methods: GY by Goldman and Yang (1994), and MS (Model Selection) & MA (Model Averaging) by Zhang et al. (2006). Ka, nonsynonymous substitution rate; Ks, synonymous substitution rate; Kn, non-coding nucleotide substitution rate; Ka/Ks, selective pressure on CDS; Kn/Ks, selective pressure on NCS; CDS, coding sequences; NCS, non-coding sequences; GC, genetic code.

KaKs_Calculator 3.0 is implemented in standard C++, enabling high efficiency and easy compilation on different operating systems (Linux/Windows/Mac). In addition to the new functionality for estimating selection on non-coding sequences described above, this version also fixes bugs and errors. The KaKs_Calculator 3.0 package, including compiled executables, a Windows application with a graphical user interface (GUI), source code, and example data, accompanied by detailed instructions and documentation, is freely available for academic use only at BioCode (https://ngdc.cncb.ac.cn/biocode/tools/BT000001), an open-source platform for archiving bioinformatics tools at the National Genomics Data Center (NGDC), China National Center for Bioinformation.

Application on empirical data

To test KaKs_Calculator 3.0, we choose three lncRNA genes that have been extensively studied according to LncRNAWiki and collect their human-mouse orthologs as well as their adjacent coding orthologs from NGDC LncBook and National Center for Biotechnology Information (NCBI) RefSeq. Specifically, these non-coding and coding gene symbols with accession numbers are: (1) H19 (NR_002196.2 vs. NR_130973.1) and MRPL23 (NM_021134.4 vs. NM_011288.2); (2) Metastasis-associated lung adenocarcinoma transcript 1 (MALAT1) (NR_002819.4 vs. NR_002847.3) and SCYL1 (NM_020680.4 vs. NM_001361921.1); and (3) Hox transcript antisense intergenic RNA (HOTAIR) (NR_003716.3 vs. NR_047528.1) and HOXC12 (NM_173860.3 vs. NM_010463.2). Based on these orthologous genes, we obtain their aligned sequences with MAFFT (using parameters: --maxiterate 1000 --localpair).
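The alignment step can be scripted. The following is a minimal sketch, assuming the `mafft` executable is installed and on the PATH; the helper names and file names are hypothetical, and only the two parameters quoted above are passed. MAFFT writes the alignment to standard output, so the sketch redirects it to a file.

```python
import subprocess


def build_mafft_command(fasta_path):
    """Build the MAFFT command line with the parameters quoted in the text."""
    return ["mafft", "--maxiterate", "1000", "--localpair", fasta_path]


def run_mafft(fasta_path, output_path):
    """Align the sequences in fasta_path and write the alignment to output_path."""
    with open(output_path, "w") as out:
        # check=True raises CalledProcessError if MAFFT exits with an error.
        subprocess.run(build_mafft_command(fasta_path), stdout=out, check=True)
```

For example, `run_mafft("H19_human_mouse.fasta", "H19_aligned.fasta")` would align one ortholog pair, where both file names are placeholders chosen for illustration.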

According to the ratio (ξ) of the non-coding nucleotide substitution rate to the adjacent synonymous substitution rate, we find that, although the coding genes undergo strong purifying selection (ω < 1), the three non-coding genes are under diverse selective pressures (Table 1). Strikingly, HOTAIR exhibits positive selection (ξ > 1), whereas the other two genes experience negative selection (ξ < 1). HOTAIR is a ~2.3-kb intergenic RNA transcribed from the antisense strand of the HOXC gene cluster. The positive selection detected on HOTAIR relative to HOXC12 is consistent with previous findings that HOTAIR evolves faster than its neighbouring genes. In contrast, MALAT1, a ~8.7-kb non-coding RNA flanked by the highly conserved kinase-like gene SCYL1, is ubiquitously expressed in almost all human tissues, evolutionarily conserved across mammalian species, and associated with various cancers. Thus, ξ = 0.464 indicates strong selective constraint on MALAT1, in accordance with its physiological and pathophysiological function and conserved RNA structure as documented by previous studies. Likewise, H19, a ~2.3-kb imprinted, maternally expressed transcript located near MRPL23, is known for its close association with Beckwith-Wiedemann syndrome and is also involved in tumorigenesis. Our result shows that H19 is under even stronger selective constraint, as indicated by ξ = 0.296, in agreement with its conserved sequence and structure. It is worth noting that one non-coding sequence may have multiple adjacent coding genes; because the adjacent gene is specified by the user, different choices can lead to different estimates of Ks and ξ. Taken together, KaKs_Calculator 3.0 is effective in estimating natural selection on non-coding sequences and has the potential to reveal the selective pressures acting on diverse molecular sequences.
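The reading of ξ (and likewise ω) against 1 can be written down directly. The following is a minimal sketch: the function name and the optional `tolerance` band around 1 are illustrative assumptions, not part of KaKs_Calculator. Only the two ξ values quoted in the text are used; the HOTAIR value is not given here and is therefore left out.

```python
def selection_mode(ratio, tolerance=0.0):
    """Classify a Ka/Ks (omega) or Kn/Ks (xi) ratio by comparison with 1.

    tolerance is a hypothetical half-width around 1 within which the
    ratio is treated as consistent with neutrality.
    """
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    if ratio > 1.0 + tolerance:
        return "positive"
    if ratio < 1.0 - tolerance:
        return "negative"
    return "neutral"


# xi values quoted in the text for the human-mouse comparison.
reported_xi = {"MALAT1": 0.464, "H19": 0.296}

for gene, value in reported_xi.items():
    print(gene, value, selection_mode(value))
```

Both quoted values fall below 1 and are classified as negative selection, matching the interpretation in the text.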

Table 1

Table 1 Estimates of selective pressure as well as substitution rates in human-mouse orthologs

Note: Ka, nonsynonymous substitution rate; Ks, synonymous substitution rate; Kn, non-coding nucleotide substitution rate; ω, selective pressure on coding sequence; ξ, selective pressure on non-coding sequence.

In addition, to test the running performance of KaKs_Calculator 3.0, we collect a large empirical dataset of 15,424 human-mouse orthologous genes retrieved from RefSeq and obtain their codon-based alignments with ParaAT, a parallel tool for constructing multiple protein-coding DNA alignments. KaKs_Calculator 3.0 includes ten computational methods for detecting selection on coding sequences, which fall into approximate methods and maximum-likelihood methods. We choose three approximate methods (NG, YN, and MYN) and one maximum-likelihood method (GY), and run the test on a 64-bit x86 Intel Core i7 machine with four 3.40 GHz CPU cores running Windows 10. For this large-scale analysis, NG, YN, and MYN each take ~2 min, whereas GY takes ~11 h, clearly showing that approximate methods are more time-efficient than maximum-likelihood ones. Different users may have different preferences; it should be noted, however, that maximum-likelihood methods are believed to achieve higher accuracy, and that different methods adopt different models and strategies and can therefore produce different estimates (contradictory findings from different methods have been reported).
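The reported run times can be turned into rough per-gene figures. The following is a back-of-the-envelope sketch that uses only the numbers quoted above; since both run times are approximate, the derived values are rough as well.

```python
N_GENES = 15_424             # human-mouse orthologous genes in the dataset
APPROX_SECONDS = 2 * 60      # ~2 min for each of NG, YN, and MYN
GY_SECONDS = 11 * 3600       # ~11 h for GY


def per_gene_seconds(total_seconds, n_genes):
    """Average wall-clock time spent on one gene pair."""
    return total_seconds / n_genes


approx_per_gene = per_gene_seconds(APPROX_SECONDS, N_GENES)  # a few milliseconds
gy_per_gene = per_gene_seconds(GY_SECONDS, N_GENES)          # a few seconds
slowdown = GY_SECONDS / APPROX_SECONDS                       # 330.0

print(f"approximate methods: {approx_per_gene * 1000:.2f} ms per gene")
print(f"GY: {gy_per_gene:.2f} s per gene")
print(f"GY is about {slowdown:.0f} times slower")
```

On these figures, GY is roughly 330 times slower than an approximate method for the same dataset.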

Discussion

KaKs_Calculator 3.0 is a significant update that detects natural selection on non-coding sequences as well as coding sequences. As tested on empirical data, it is useful for estimating natural selection on molecular sequences and thus for identifying potentially functional elements at genome-wide scale. Future developments include the detection of selective pressure on small peptides (less than 300 nucleotides) encoded by small open reading frames within non-coding sequences, as well as a codon-based alignment procedure to help users generate input sequences easily.

Acknowledgments

I would like to extend special thanks to Lina Ma for constructive suggestions and discussions on this work and to Zhao Li for valuable help with data collection and testing. I also thank Zhuojing Fan for designing the logo, as well as Qing Guo and Lin Dai for fixing a bug in the Windows GUI. I am extremely grateful to the many users who have reported bugs and sent comments since the first release of KaKs_Calculator in 2006. This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA19050302), the National Natural Science Foundation of China (Grant Nos. 31871328 and 32030021), the National Key R&D Program of China (Grant No. 2017YFC0907502), and the International Partnership Program of the Chinese Academy of Sciences (Grant No. 153F11KYSB20160008).
