Literature Reading 4.0: Pea (Pisum sativum L.) Gene

2022-08-17  龙star180

Journal

BMC Genomics (4.547/Q2)

Repetitive DNA in the pea (Pisum sativum L.) genome: comprehensive characterization using 454 sequencing and comparison to soybean and Medicago truncatula (2007)

Abstract

Background: Extraordinary size variation of higher plant nuclear genomes is in large part caused by differences in accumulation of repetitive DNA. This makes repetitive DNA of great interest for studying the molecular mechanisms shaping architecture and function of complex plant genomes. However, due to methodological constraints of conventional cloning and sequencing, a global description of repeat composition is available for only a very limited number of higher plants. In order to provide further data required for investigating evolutionary patterns of repeated DNA within and between species, we used a novel approach based on massive parallel sequencing which allowed a comprehensive repeat characterization in our model species, garden pea (Pisum sativum).

Results: Analysis of 33.3 Mb sequence data resulted in quantification and partial sequence reconstruction of major repeat families occurring in the pea genome with at least thousands of copies. Our results showed that the pea genome is dominated by LTR-retrotransposons, estimated at 140,000 copies/1C. Ty3/gypsy elements are less diverse and accumulated to higher copy numbers than Ty1/copia. This is in part due to a large population of Ogre-like retrotransposons which alone make up over 20% of the genome. In addition to numerous types of mobile elements, we have discovered a set of novel satellite repeats and two additional variants of telomeric sequences. Comparative genome analysis revealed that there are only a few repeat sequences conserved between pea and soybean genomes. On the other hand, all major families of pea mobile elements are well represented in M. truncatula. 

Conclusion: We have demonstrated that even in a species with a relatively large genome like pea, where a single 454-sequencing run provided only 0.77% coverage, the generated sequences were sufficient to reconstruct and analyze major repeat families corresponding to a total of 35–48% of the genome. These data provide a starting point for further investigations of legume plant genomes based on their global comparative analysis and for the development of more sophisticated approaches for data mining.

Background

Understanding evolutionary mechanisms shaping complex genomes of eukaryotes is impossible without thorough investigation of repeated genomic sequences [1-4]. This is especially obvious in higher plants, where repetitive sequences comprise up to 97% of nuclear DNA [5,6] and contribute significantly to the extraordinary genome size variation observed between different taxa [7-9]. However, the presence of numerous and sequentially diverse families of repetitive elements makes their analysis a challenging task. Thus, the most widely used approaches to study the contribution of repetitive DNA to genome evolution are based on isolation and characterization of only a single or a small group of elements. These approaches have been valuable in following the fate of various repeats in a wide range of species [10-12]. However, they do not allow for the global comparative analysis of repeat profiles required for elucidating evolutionary trends on the whole genome level. The demand for a comprehensive repeat analysis prompted the development of a DNA microarray-based assay to study the occurrence of hundreds of repeats in twenty plant genomes [13]. Although successful, the microarray-based approach still suffered from several limitations including the need for a priori knowledge of the repeat sequences, the limited capacity of the array, and especially the inability to discover novel repeats for which there were no probes on the array.

The requirement for simultaneous determination of sequence composition and abundance of hundreds of repeat families is best fulfilled by analyzing the complete genome sequence; however, such data is available for only a limited number of model species. Alternatively, low-depth shotgun genomic sequencing can be used to survey the most abundant repeats, as was demonstrated for Gossypium species [9]. However, performing this type of survey using conventional approaches employing Sanger sequencing is still labor-intensive and requires considerable resources. The recent introduction of a massively-parallel pyrosequencing technology developed by 454 Life Sciences ("454-sequencing") has opened new possibilities for high-throughput genome analysis [14]. This approach allows parallel sequencing of hundreds of thousands of individual templates immobilized on microbeads, thus producing megabases of sequence data in a single run. It has been successfully applied to the sequencing of microbial genomes [15], the re-sequencing of mammalian genomes [16], and for transcript profiling [17]. Due to relatively short sequence read lengths (~100 nucleotides on average, or ~250 nucleotides with the improved version of the system), the technology is not yet suitable for de novo sequencing of complex genomes. However, it has a great potential for profiling repetitive sequences in these genomes, as the amount of produced sequence data is sufficient to get a representative overview of the most abundant genomic repeats. For example, a total of 30 Mb determined in a single sequencing run represents only 0.01-fold coverage of a hypothetical 3,000 Mb genome (this is about the haploid genome size of maize or cotton [18]), but provides 10-fold coverage of repeats occurring in the genome in 1,000 copies. The sum of 30 Mb is represented by a set of 300,000 sequence reads which are randomly generated from various genomic loci. 
Theoretically, they should contain fragments of a given repeat randomly sampled from its individual copies, and the frequency of these fragments in the sequence reads should be proportional to the genomic abundance of the repeat. Therefore, this amount of sequence data should be sufficient to reliably detect abundant (at least 500–1000 copies/1C) genomic repeats, and eventually reconstruct their consensus sequences by assembling the reads derived from their individual copies. Recently, this strategy has been successfully applied to repeat analysis in the 1,115 Mb genome of soybean [19].
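
The coverage arithmetic in this paragraph can be checked with a few lines of Python. This is only a minimal sketch: the function names are illustrative and not from the paper, uniform random sampling of reads is assumed, and the 100-nucleotide read length is the approximate average quoted above.

```python
def fold_coverage(sequenced_mb, genome_mb):
    """Genome-wide fold coverage given sequenced and genome sizes in Mb."""
    return sequenced_mb / genome_mb


def repeat_fold_coverage(sequenced_mb, genome_mb, copies):
    """Expected fold coverage of one repeat unit present in `copies` copies.

    Assumes reads are sampled uniformly at random across the genome.
    """
    return fold_coverage(sequenced_mb, genome_mb) * copies


# 300,000 reads of ~100 nt each give about 30 Mb of sequence.
total_mb = 300_000 * 100 / 1_000_000

print(total_mb)                                     # 30.0
print(fold_coverage(total_mb, 3000))                # genome-wide: about 0.01x
print(repeat_fold_coverage(total_mb, 3000, 1000))   # 1,000-copy repeat: about 10x
```

The point of the sketch is that genome-wide coverage can be far below 1x while every position of a high-copy repeat is still sampled several times.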

Based on the theoretical considerations described above, we attempted to adapt parallel sequencing technology for the genome-wide characterization of repetitive elements in garden pea (Pisum sativum L.) and for comparative analysis of its repeat composition with other species. In addition to being a classical model for genetic studies, pea is also one of the model species used in our and other laboratories for studying the impact of repetitive DNA on legume plant genomes. Consequently, a set of well-characterized pea repetitive elements is available [20-24] which could serve as a control in evaluating the accuracy of the developed assays. Pea has a genome of 4,300 Mb/1C [18], which is about 10-fold larger than the genome size of rice or the model legume Medicago truncatula, and about 4-fold larger than the soybean genome. Compared to these species, it is rich in repetitive DNA, which was estimated to comprise 75–97% of its nuclear DNA [5,6].

Our initial experiments were aimed at evaluating the representation of known repeats in 454 sequence reads. Then we focused on the reconstruction of longer segments of repetitive sequences from multiple overlapping sequence reads, which provided a basis for their further characterization. These data were used to determine the genomic abundance and variability of the major repeat families present in the pea genome. Finally, we used the pea sequence data together with available soybean and M. truncatula sequences to perform comparative analysis of the repeat composition and abundance in these three legume species.

Results

A single 454 sequencing reaction with pea nuclear DNA resulted in 319,402 usable reads with an average length of 104 nucleotides, yielding a total of 33.3 Mb of sequence data. This is equivalent to 0.77%, or 1/129, of the haploid pea genome (1C = 4,300 Mb). Thus, in theory, repeats occurring at 1,000 copies or greater in the pea genome should be well represented in these sequences, as they should be covered on average by 7–8 sequence reads (1000/129 = 7.8) over their entire length. In order to test this assumption, we determined the representation of previously characterized pea repeats in the 454 data by sequence similarity searches against a database of individual sequence reads and calculated their average coverage by highly significant hits. As expected, low-copy or moderately repeated sequences, such as Zaba MITEs (50–500 copies/1C [22]), were represented by none or only a few hits. However, all 33 of the tested sequences with an abundance exceeding 1,000 copies/1C were well represented in 454 data, and their coverage by sequence reads was proportional to their abundance in the genome. The copy numbers of individual repeats calculated from the frequency of their occurrence in 454 reads were in good agreement with estimates based on other experimental data (Fig. 1 and Additional file 1). These findings prove that the 454 data are representative of highly repeated sequences and thus can be further used for investigation and comprehensive description of this fraction of the pea genome.
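
The numbers quoted in this paragraph follow from two inputs, the genome size and the amount of sequence data. The sketch below reproduces them; the variable and function names are illustrative, and the rounding of the sampling factor to 129 mirrors the text rather than any stated procedure.

```python
GENOME_MB = 4300.0    # haploid pea genome size (1C), from the text
SEQUENCED_MB = 33.3   # total 454 sequence data, from the text

sampled_fraction = SEQUENCED_MB / GENOME_MB    # share of the genome sampled
dilution = round(GENOME_MB / SEQUENCED_MB)     # the "1/129" factor, rounded as in the text


def expected_reads_per_position(copies_per_1c):
    """Average number of reads expected over each position of a repeat."""
    return copies_per_1c / dilution


print(f"{sampled_fraction:.2%}")                     # 0.77%
print(dilution)                                      # 129
print(round(expected_reads_per_position(1000), 1))   # 7.8
print(round(expected_reads_per_position(50), 1))     # 0.4, low end of the Zaba range
```

A repeat with 50 to 500 copies is expected to be covered by fewer than four reads per position, which is consistent with the Zaba MITEs yielding few or no hits.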

Reconstruction of genomic repeats by assembling 454 reads

The process of extracting repeat sequences from the set of 454 sequence reads was implemented using TGICL, a program package originally designed for clustering large EST datasets [25]. The procedure consisted of two steps: (i) clustering the reads based on their mutual similarities into groups of overlapping sequences, and (ii) assembling the reads within the clusters to get longer fragments (contigs) representing consensus sequences. Various clustering and assembly parameters were evaluated in order to get optimal performance with our dataset, which compared to ESTs differed in the short reading lengths and in their considerable sequence variability, reflecting the divergence between individual copies of repeated elements. While the clustering parameters allowed for grouping of relatively divergent sequences, the assembly process was more stringent, and thus typically generated multiple contigs from a single cluster.
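
The first of the two steps, grouping reads by mutual similarity, can be illustrated with a toy example. The code below is not TGICL and does not reproduce its algorithm or parameters: it is a minimal sketch that uses shared k-mers as a stand-in for sequence similarity and single-linkage clustering via union-find. The names `cluster_reads`, `k` and `min_shared` are hypothetical, and the all-against-all comparison would not scale to hundreds of thousands of reads.

```python
from itertools import combinations


def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_reads(reads, k=4, min_shared=3):
    """Single-linkage clustering: reads sharing >= min_shared k-mers are joined."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kmer_sets = [kmers(read, k) for read in reads]
    for i, j in combinations(range(len(reads)), 2):
        if len(kmer_sets[i] & kmer_sets[j]) >= min_shared:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


reads = ["ACGTACGGTCAATG", "CGGTCAATGCCTAG", "TTTTGGGGCCCCAAAA"]
print(cluster_reads(reads))   # [[0, 1], [2]]
```

The first two reads overlap and end up in one cluster, while the unrelated third read stays on its own. Using a looser criterion for clustering than for the later assembly is what lets one cluster yield several contigs, as described in the paragraph.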

The clustering resulted in 22,445 clusters, which were composed of two to thousands of reads. The assembly phase then yielded 25,384 contigs ranging in length up to 8,214 bp. Individual contigs were assembled from two to 4,327 reads, and the total number of sequence reads assembled into contigs was 233,303 (73%). The contigs were characterized by their read depth (RD), expressing the average number of reads assembled over individual positions within the contig consensus sequence, and by their genome representation (GR), calculated as RD multiplied by the contig length. These values allowed us to rate the contig sequences based on their genomic copy numbers and proportion in the genome, respectively. Most contigs were short and composed of only a few reads, whereas 90% of assembled reads were assigned into a relatively small subset of 1,578 contigs with the highest GR, thus corresponding to highly repetitive sequences. Most repeats were represented as sets of overlapping contigs, the number of which was proportional to the sequence diversity of the repeat copies. Among the most important in terms of their genomic abundance were contigs that included coding sequences and conserved LTR regions of Ogre retrotransposons and other LTR-retroelements and of the satellite repeat PisTR-B (Additional file 2). Compared to the three previously sequenced pea Ogre elements [21], this set of Ogre-like contigs showed much higher sequence diversity, suggesting they represent several distinct Ogre subfamilies.

The longest contig (CL2Contig6) represented an 8,214 bp fragment of the rDNA gene cluster, including the complete 18-5.8-25S gene sequences (5,820 bp) surrounded by 3' and 5' parts of the large intergenic spacer (Fig. 2A). Comparison to previously published partial pea rDNA sequences revealed its identity with a 1,723 bp fragment of the 18S rRNA gene [GenBank: U43011] and 99.5% similarity to a 620 bp sequence region including the 5.8S rRNA gene and both internal transcribed spacers [GenBank: AY839340].

Figure 2 Examples of repeat reconstruction from assembled 454 reads. (A) The complete rDNA coding region including genes for 18S, 5.8S and 25S rRNA and parts of the large intergenic spacer (IGS) was reconstructed as a single contig (CL2Contig6). The graph shows the read depth (number of assembled reads) along the contig sequence. (B) Reconstruction of the Ty3/gypsy retroelement Peabody, including most of its long terminal repeat (LTR) sequence and complete polyprotein-coding region (gag-pol). (C) A novel Ty1/copia element Ps-copia-1/751, reconstructed from ten overlapping contigs. The region devoid of stop codons encoding gag-pol was identified in frame +3. Yellow bars depict length and relative positions of the overlapping contigs, stop codons are represented by red vertical lines.

The contig with the highest genome representation was a 5,097 bp fragment of the LTR-retrotransposon Peabody [GenBank: AF083074]. This contig (CL1Contig2066) had a read depth corresponding to 11,000 copies/1C and was estimated to make up 1.3% of the pea genome. It included a complete internal retrotransposon region surrounded by parts of LTR sequences, which could be further extended by finding and aligning overlapping contigs (Fig. 2B). The internal region contained a gag-pol coding sequence (4,368 bp) devoid of stop codons.

The highest read depth (193 reads, corresponding to about 25,000 copies/1C) was found for a 1,213 bp contig CL1Contig751 representing the LTR sequence of a novel Ty1/copia element designated Ps-copia-1/751. The element reconstruction from ten overlapping contigs resulted in identification of the complete LTR and most of the internal region including open reading frame encoding gag-pol polyprotein (Fig. 2C).
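
The copy numbers and genome proportions quoted for these two contigs are consistent with the RD and GR definitions given earlier. The sketch below shows one way to get from RD and GR to those figures. The conversion formulas are an assumption inferred from the reported numbers (scaling by the ratio of genome size to sequenced bases), not a procedure stated in the text, and all function names are illustrative.

```python
GENOME_BP = 4_300_000_000   # pea 1C genome size, from the text
SEQUENCED_BP = 33_300_000   # total 454 data, from the text


def read_depth(read_spans, contig_length):
    """RD: average number of reads assembled over each contig position.

    `read_spans` holds half-open (start, end) positions of reads on the
    contig, so each read contributes end - start aligned bases.
    """
    aligned_bases = sum(end - start for start, end in read_spans)
    return aligned_bases / contig_length


def genome_representation(rd, contig_length):
    """GR: read depth multiplied by contig length, as defined in the text."""
    return rd * contig_length


def copies_per_1c(rd):
    """Assumed conversion: scale RD by the inverse of the sampled fraction."""
    return rd * GENOME_BP / SEQUENCED_BP


def genome_share(gr):
    """Assumed conversion: GR as a share of all sequenced bases."""
    return gr / SEQUENCED_BP


# Toy contig of 200 bp covered by three 100 bp reads.
print(read_depth([(0, 100), (50, 150), (100, 200)], 200))   # 1.5

# CL1Contig751: RD of 193 reads, reported as about 25,000 copies/1C.
print(round(copies_per_1c(193)))

# CL1Contig2066 (Peabody): 5,097 bp at about 11,000 copies/1C.
rd_peabody = 11_000 * SEQUENCED_BP / GENOME_BP
print(f"{genome_share(genome_representation(rd_peabody, 5097)):.1%}")   # 1.3%
```

Under these assumptions, an RD of 193 gives just under 25,000 copies/1C, and the Peabody contig accounts for about 1.3% of the genome, matching the values in the text.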

In addition to highly repeated sequences it was also possible to at least partially reconstruct less abundant repeats, many of which were novel to the pea genome. For example, an over 7 kb region of a MuDR-like DNA transposon, including 2 putative coding regions, could be reconstructed from 25 overlapping contigs. MuDR elements were estimated to occur in about 2,200 copies in the pea genome, and similar abundance was also found for another DNA transposon family, En/Spm, for which it was possible to reconstruct a 3 kb fragment of the transposase-coding region (not shown).

The clustering and contig building procedure was also found useful for identifying novel tandemly repeated sequences. Assembling overlapping reads into longer contigs facilitated reconstruction of repeats with monomers exceeding the length of single reads and allowed their identification based on the tandem subrepeats present within the contigs. A number of contigs representing potential satellite repeats with monomers from 50 to 867 bp were identified; except for the previously described PisTR-B satellite [20], they all represented novel sequences. Fourteen of the most abundant repeats (Table 1) were used as probes for in situ hybridization on pea mitotic chromosomes in order to test if they have a genomic distribution typical for satellite DNA. Such hybridization patterns, consisting of signals concentrated into a limited number of spots corresponding to long arrays of the satellite sequences, were observed for thirteen repeats, whereas only one produced dispersed signals (Fig. 3A–C and Table 1). The signals occurred mostly in (peri-) centromeric and terminal chromosome regions, and each repeat displayed a specific hybridization pattern. No typical centromeric satellite repeats were found, although the repeat TR-11 strongly labeled central centromeric regions in five out of the seven chromosome pairs (Fig. 3B).
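
To illustrate what "tandem subrepeats present within the contigs" means in practice, the sketch below looks for the shortest shift at which a sequence matches itself almost everywhere. It is a naive stand-in, not the tool used in the paper, which is not named in this section. The function name `best_period` and the thresholds are hypothetical.

```python
def best_period(seq, min_period=2, min_identity=0.9):
    """Return the shortest period p at which seq matches itself shifted by p.

    A tandem repeat with monomer length p satisfies seq[i] == seq[i + p]
    at most positions. Returns None when no shift reaches `min_identity`.
    """
    n = len(seq)
    for p in range(min_period, n // 2 + 1):
        matches = sum(seq[i] == seq[i + p] for i in range(n - p))
        if matches / (n - p) >= min_identity:
            return p
    return None


print(best_period("ACGTTGCA" * 4))    # 8
print(best_period("ACGTTGCATCGGA"))   # None
```

Because the contig must span at least two monomers for the shift to be detected, this kind of check only works for monomers longer than a single read once the reads have been assembled, which is the advantage the paragraph describes.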

Composition of the most repetitive fraction of the pea genome

Using a combination of various tools including sequence similarity searches, conserved protein domain detection, and structure analysis of the contigs, it was possible to assign the most abundant reconstructed sequences into specific classes of repetitive elements (Table 2 and Additional file 3). It was found that the majority of pea repetitive DNA is made up of LTR-retrotransposons, with the most prominent group being the Ty3/gypsy-like Ogre elements, which alone were estimated to constitute 20–33% of the genome. Investigation of the sequence variability of contigs representing overlapping fragments of Ogre sequences revealed the presence of several subfamilies of these elements. Further analysis based on quantification of the sequence reads from contig regions that include the primer binding site (PBS) typical for Ogres [26] and comparison of their surrounding sequences confirmed the occurrence of three distinct subfamilies. It also provided an estimate of their abundance, which was about 30,000 elements for each of the two major subfamilies and 8,000 copies/1C for a minor one.

Although other retroelement families were found in considerably smaller numbers, there were several elements which made up significant proportions of the genome. They included Peabody, which made up 2–3% of the genome and displayed very low sequence variability suggesting its recent amplification. Other important groups of Ty3/gypsy elements were represented by PIGY [24] and Cyclops [27]. Ty1/copia retrotransposons were found to be less frequent, being represented by PDR [28] and a group of SIRE1-like sequences [29]. However, the most abundant was a novel element, Ps-copia-1/751 (Fig. 2C), which made up about 2% of the pea genome. The LTR sequence of this element was estimated to occur in at least 25,000 copies in the pea genome, whereas other regions are less frequent (about 8,000 copies/1C), thus indicating the existence of a large number of solo-LTRs derived from this element (Table 2).
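
The solo-LTR inference for Ps-copia-1/751 can be made explicit with a back-of-the-envelope calculation. This is a rough sketch under a stated assumption that is not in the text: every internal region belongs to a complete element flanked by two LTRs. The function name is illustrative, and the result is an order-of-magnitude figure rather than a reported value.

```python
def solo_ltr_estimate(ltr_copies, internal_copies):
    """Rough number of solo-LTRs, assuming two LTRs per complete element."""
    ltrs_in_complete_elements = 2 * internal_copies
    return max(0, ltr_copies - ltrs_in_complete_elements)


# About 25,000 LTR copies versus about 8,000 copies of the internal region.
print(solo_ltr_estimate(25_000, 8_000))   # 9000
```

Since the text gives 25,000 as a lower bound for the LTR copy number, the figure obtained this way would itself be a lower bound under the same assumption.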

...

Conclusion

This work provided the first detailed survey of repetitive sequences in garden pea. It confirmed the expected high proportion of repeats in the pea genome and revealed that it is mostly attributed to various families of mobile elements. Amplification of a few groups of Ty3/gypsy elements, especially those belonging to Ogre-like retrotransposons, contributed the most to the bulk of pea repeats. Ty1/copia elements were found to be less abundant but more diverse in their sequences, occurring in a number of distinct (sub-)families. Other mobile elements including non-LTR retrotransposons (LINEs) and DNA transposons of the MuDR and En/Spm families were also detected. However, their total abundance did not exceed thousands of copies per haploid genome, thus representing only a minor part of pea nuclear DNA. Tandem repeats identified in the pea genome included microsatellites, three variants of telomeric minisatellites, and exceptional diversity of satellite repeats. Localization of newly identified satellite sequences on mitotic chromosomes revealed their family-specific hybridization patterns, providing novel cytogenetic landmarks for chromosome mapping.

Although the presented analysis yielded a wealth of information about the repeat composition of the pea genome, it was also useful in uncovering various limitations of our analytical approaches, which should be improved in the future. In addition to these improvements, a number of novel ways to utilize 454 data in plant genome analysis can be envisioned. They include, for example, repeat masking in genome sequencing projects, detailed investigation of intra- and intergenomic repeat variability, and identification of conserved non-coding regulatory sequences. Of special interest is the application of this technology to comparative genomics in a wide range of species, which should provide key information for understanding evolutionary patterns of repetitive sequences and their impact on genome evolution. Our results demonstrated the feasibility of this approach and revealed that in spite of differences in abundance of individual families, the repeat composition in pea and M. truncatula is similar, whereas both these species share only a few conserved repeats with soybean.

I'll stop translating here; interested readers can download and read the paper themselves...
