Should repetitive sequences be masked or not?
The case for not masking away repetitive DNA
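Reference article: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-018-0120-9
The author of this article raises a question that comes up often when using alignment software: when I map my reads to a reference genome, should low-complexity regions be masked or not? Masking is usually done for two reasons: it reduces the load on the aligner, and it keeps the analysis focused on coding sequences.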
Main
In fact, quite a few alignment tools mask low-complexity regions by default: alignment tools such as BLAST that mask low complexity and repetitive regions as a default option.
TEs (transposable elements) and ERVs (endogenous retroviruses) can become new potential enhancers, so they should not be masked when exploring regulatory relationships in the genome; masking is more appropriate for RNA-seq (if you only care about gene expression).
if a researcher were performing a ChIP-seq experiment to identify where a transcription factor binds (for example), or a chromosome conformation capture (4C/HiC) experiment to identify enhancer-promoter interactions, they would potentially miss important enhancers if repetitive DNA was filtered from their analysis.
Technology has advanced to better assay the repetitive portion of the genome
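Q: So here is the question: if these repetitive elements are not masked, how do we deal with the difficulty of mapping reads to them?
This problem is most severe for small RNA biology or when only a sequence tag can be obtained, as a large fraction of the reads cannot be mapped to a unique position in the genome because only 15-30 nucleotides of sequence information is available. (The problem is especially severe in small RNA studies.)
In fact, many newer read-mapping tools take this into account: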
read mapping technologies have evolved or have been specifically designed to efficiently assay repetitive DNA from whole genome data. These tools include MELT, RetroSeq, EpiTEome and McClintock for TE insertion site identification [19,20,21,22], T-lex to identify presence/absence of TE copies [23], Clari-TE to resolve nested TE structure [24], TEtoolkit/TEtranscripts for TE enrichment analysis from experiments such as RNA-seq and ChIP-seq [25, 26], and many others
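The author lays out several strategies for mapping and discusses each:
The first is to keep only uniquely mapped reads, but in TE-rich genomes this may discard many reads, so many TEs go undetected.
- First, the location in the genome where the read matches best should be the only one reported; however multiple best-matching regions may exist (i.e. multi-mapping reads). A conservative approach is to report only the reads that perfectly and uniquely match once in the genome (for example, my lab used this method for MethylC-seq of TEs in [27]).
The second is to assign multi-mapping reads at random among their best-matching positions. This works well for TE-rich genomes, but it is not precise enough for studying individual TEs: the resolution is too low to measure enrichment at a particular member of a TE family (the reads are spread thinly across copies).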
- A second approach is to randomly divide the multi-mapping reads to their best positions in the genome (for example, my lab used this method mapping small RNA reads in [29]), although this will dilute the analysis of individual repetitive elements across their entire repeat families. This approach should be used in TE-rich genomes, and when analysis of TE families is performed but resolution down to individual elements is not necessary
The third is to distribute multi-mapping reads hierarchically according to where reads cluster. It relies on the fact that some TEs also yield uniquely mapped reads, which helps reach the resolution of individual elements; the drawback is that some reads will be wrongly assigned to loci that did not actually produce them.
- A third approach is to hierarchically distribute multi-mapping reads guided by the evidence of where the uniquely mapping reads cluster. The idea behind this approach is that we already know that some individual TE elements generate reads (by using the uniquely mapping reads), so it is most likely that these same TEs are also the producers of the multi-mapping reads. This approach should be used when TE regulation requires drilling down beyond the repeat family level to individual elements
The last approach is to map reads directly to a dedicated sequence database.
- Lastly an approach that has been utilized for years within the Mobile DNA community, reads can be mapped to single elements or databases of ‘consensus’ sequences that are created from alignments of many individual repetitive genomic sequences
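To make the difference between the first three strategies concrete, here is a minimal Python sketch on toy data. It is not the paper's implementation or any real tool's API: the locus names (`TE_A`, `TE_B`, `TE_C`), the `alignments` dictionary, and all function names are hypothetical, and the third function uses a simple proportional split as a simplified stand-in for a true hierarchical scheme.

```python
import random
from collections import Counter

# Toy input (hypothetical): read id -> list of equally good best-matching loci.
alignments = {
    "r1": ["TE_A"],
    "r2": ["TE_A"],
    "r3": ["TE_A"],
    "r4": ["TE_B"],
    "r5": ["TE_A", "TE_B", "TE_C"],
    "r6": ["TE_A", "TE_B", "TE_C"],
}


def unique_only(alns):
    """Strategy 1: count only reads that have a single best locus."""
    counts = Counter()
    for loci in alns.values():
        if len(loci) == 1:
            counts[loci[0]] += 1
    return counts


def random_assignment(alns, seed=0):
    """Strategy 2: give each read to one of its best loci chosen at random."""
    rng = random.Random(seed)
    counts = Counter()
    for read_id in sorted(alns):
        counts[rng.choice(alns[read_id])] += 1
    return counts


def unique_guided(alns):
    """Strategy 3 (simplified): split each multi-mapping read across its
    loci in proportion to the unique-read counts at those loci."""
    unique = unique_only(alns)
    counts = Counter({locus: float(n) for locus, n in unique.items()})
    for loci in alns.values():
        if len(loci) == 1:
            continue
        weights = [unique.get(locus, 0) for locus in loci]
        total = sum(weights)
        if total == 0:
            # No unique evidence at any candidate locus: split evenly.
            weights = [1] * len(loci)
            total = len(loci)
        for locus, weight in zip(loci, weights):
            counts[locus] += weight / total
    return counts


print("unique only:  ", dict(unique_only(alignments)))
print("random:       ", dict(random_assignment(alignments)))
print("unique-guided:", dict(unique_guided(alignments)))
```

In this toy case the unique-only strategy drops two of the six reads, the random strategy keeps all six but may place reads on `TE_C`, which has no unique support, and the unique-guided strategy sends most of the multi-mapping signal to `TE_A`, which is exactly where a misassignment would occur if `TE_C` were the true source.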
the ATAC-seq peaks were compared with the locations of annotated repeats (RepeatMasker) downloaded from the UCSC genome browser. As repeats of different classes vary greatly in numbers, a random set of peaks with identical lengths of ATAC-seq peaks was used for the same analysis as a control.
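Quoted from: The landscape of accessible chromatin in mammalian preimplantation embryos, https://doi.org/10.1038/nature18606
These authors chose the first strategy and still discarded multi-mapped reads. The right choice probably depends on what you are interested in, for example the repeats near peaks.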
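The comparison described in that quote (peaks versus annotated repeats, with a random set of peaks of identical lengths as the control) can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the authors' pipeline: it uses a single chromosome, half-open `(start, end)` coordinates as in BED files, made-up intervals, and hypothetical function names.

```python
import random


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def fraction_overlapping(peaks, repeats):
    """Fraction of peaks that overlap at least one repeat."""
    if not peaks:
        return 0.0
    hits = sum(
        1
        for p_start, p_end in peaks
        if any(overlaps(p_start, p_end, r_start, r_end) for r_start, r_end in repeats)
    )
    return hits / len(peaks)


def shuffled_peaks(peaks, chrom_length, rng):
    """Random control: same number of peaks, same lengths, random positions."""
    control = []
    for start, end in peaks:
        length = end - start
        new_start = rng.randrange(0, chrom_length - length + 1)
        control.append((new_start, new_start + length))
    return control


def observed_vs_expected(peaks, repeats, chrom_length, n_shuffles=200, seed=0):
    """Observed overlap fraction and its mean over shuffled control sets."""
    rng = random.Random(seed)
    observed = fraction_overlapping(peaks, repeats)
    control = [
        fraction_overlapping(shuffled_peaks(peaks, chrom_length, rng), repeats)
        for _ in range(n_shuffles)
    ]
    return observed, sum(control) / len(control)


# Made-up example data on one 10 kb "chromosome".
repeats = [(1000, 1500), (4000, 4300)]
peaks = [(1100, 1300), (4100, 4250), (7000, 7200)]
observed, expected = observed_vs_expected(peaks, repeats, chrom_length=10_000)
print(f"observed={observed:.2f} expected={expected:.2f}")
```

Because repeat classes differ greatly in copy number, the ratio of observed to expected overlap is more informative than the raw overlap fraction. For real data an interval-tree or sorted-sweep approach would replace the quadratic scan used here.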
In fact, besides better software, another way to improve is to increase read length.
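A small simulation can show why read length matters. The sketch below builds a toy genome with several exact copies of one repeat and reports what fraction of k-mers (standing in for reads of length k) occur only once. Everything here is an assumption made for illustration: the genome is random, the repeat copies are identical (real TE copies diverge over time), and the function names are hypothetical.

```python
import random
from collections import Counter


def unique_kmer_fraction(genome, k):
    """Fraction of k-mers in the genome that occur exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)


def toy_genome(seed=0, repeat_len=60, copies=5, spacer_len=100):
    """Random spacers interleaved with identical copies of one repeat."""
    rng = random.Random(seed)

    def rand_seq(n):
        return "".join(rng.choice("ACGT") for _ in range(n))

    repeat = rand_seq(repeat_len)
    parts = []
    for _ in range(copies):
        parts.append(rand_seq(spacer_len))
        parts.append(repeat)
    parts.append(rand_seq(spacer_len))
    return "".join(parts)


genome = toy_genome()
for k in (20, 40, 80):
    print(k, round(unique_kmer_fraction(genome, k), 3))
```

Reads shorter than the repeat can fall entirely inside it and are then indistinguishable between copies, while reads longer than the repeat always reach into unique flanking sequence, so the unique fraction rises with k.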
Maximize the data output from your effort (and cost) input
Make the most of the data you have produced.
Two recent examples of TE/ERV analyses done well both represent cases where considering repetitive elements answered questions that would otherwise be unexplainable:
- The structural variation map of 2504 genomes that specifically identified Alu, L1 and SVA mobile element insertions across human populations [39] and
- a ChIP-seq analysis discovered that ERVs have spread interferon-inducible enhancer elements and shaped the evolution of innate immunity [7].
References
https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-018-0120-9
The landscape of accessible chromatin in mammalian preimplantation embryos, https://doi.org/10.1038/nature18606