Errors in high-throughput data and a few sequencing-related concepts

2018-02-11, by 思考问题的熊

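While reading up on genome assembly over the past few days, I came across some material on errors in sequencing data, so I am jotting it down here. The sources of error also involve a few sequencing-specific terms, which I note down as well.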

Sources of artifacts

Random errors - substitutions and insertion-deletions

Illumina sequencing involves incorporation of single nucleotides at the ends of chains, readout of the nucleotides, and reversal of the dye terminator so that synthesis can continue. Substitution is the dominant form of error with this technology.

Homopolymer error

Sequencing-by-synthesis technologies estimate the number of identical nucleotides from signal intensity. For a long stretch of identical nucleotides, that estimate becomes unreliable, which produces homopolymer errors.

Quality drop near 3' ends of reads

In most sequencing technologies, the efficiency of the DNA polymerase falls off after a certain length.

Coverage bias in AT-rich or GC-rich regions

Coverage is non-uniform and does not follow a Poisson distribution. The GC content of the underlying sequence affects coverage.

Distance between the read pairs is variable in mate pairs

Distances vary randomly, and the variation is larger for longer insert sizes.

Mate pair inversion

Duplication due to PCR amplification

Inherent noise of PacBio reads

Diploid genome

Diploid genomes have differences between the two copies of each chromosome. This is not an artifact but an accurate reflection of reality. However, many assembly programs assume that the two copies are identical, so diploid differences look like artifacts to them.
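
Two of the items above, homopolymer stretches and GC-rich or AT-rich regions, are easy to look for in a sequence. The following is only a minimal sketch: the function names `gc_content`, `windowed_gc` and `homopolymer_runs`, the window size and the `min_len` threshold are all my own illustrative choices, not something taken from a particular tool.

```python
from itertools import groupby


def gc_content(seq):
    """Fraction of G/C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def windowed_gc(seq, window=100):
    """GC fraction of consecutive, non-overlapping windows of `window` bases.

    A trailing partial window is ignored.
    """
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for every run of one base >= min_len.

    min_len=4 is an arbitrary illustrative threshold.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


if __name__ == "__main__":
    example = "ACGTAAAAAGGC"
    print(gc_content(example))
    print(homopolymer_runs(example))  # [(4, 'A', 5)]
```

Windows with extreme GC values, or reads covering long runs, are the places where I would expect coverage and base calls to be least trustworthy.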

Sequencing data concepts explained

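As the diagram below shows, the pieces that the DNA is sheared into during library preparation are the fragments; adaptors are then added to each fragment. Single-end (SE) sequencing reads only one end of a fragment, while paired-end (PE) sequencing reads both ends. R1 and R2 normally come from the same DNA fragment, but because the fragment is usually longer than R1 + R2, there is a gap between them. The relative orientation of R1 and R2 and the distance between them are known.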


fragment                  ========================================
fragment + adaptors    ~~~========================================~~~
SE read                   --------->
PE reads                R1--------->                    <---------R2
unknown gap                         ....................

The various sizes

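The insert is defined "as the piece of DNA inserted between the adaptors which enable amplification and sequencing of that piece of DNA". The insert size is therefore the length between the two adaptors, which includes R1, R2 and the gap. The gap on its own should be called the "inner mate distance"; its value depends heavily on how the library was prepared.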

PE reads      R1--------->                    <---------R2
fragment     ~~~========================================~~~
insert          ========================================
inner mate                ....................
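
The relationship between these sizes is simple arithmetic: the inner mate distance is the insert size minus the two read lengths. A minimal sketch follows; the function name `inner_mate_distance` and the example numbers are mine, chosen only for illustration.

```python
def inner_mate_distance(insert_size, read1_len, read2_len):
    """Gap between the 3' ends of R1 and R2, in bases.

    insert_size is the length of the DNA between the adaptors,
    so it already includes both reads plus the gap.
    A negative result means the two reads overlap.
    """
    return insert_size - read1_len - read2_len


if __name__ == "__main__":
    # 500 bp insert sequenced with 2 x 150 bp reads
    print(inner_mate_distance(500, 150, 150))  # 200
    # 400 bp insert sequenced with 2 x 250 bp reads: the reads overlap
    print(inner_mate_distance(400, 250, 250))  # -100
```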

Overlapping reads

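An Illumina MiSeq can produce 250 bp paired-end reads, while library preparation can yield fragments shorter than 500 bp, so the two reads end up overlapping. In that case the two reads can usually be stitched into one longer single-end read.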

fragment          ~~~========================================~~~
insert               ========================================
R1                   ------------------------->
R2                                   <-----------------------
overlap                              ::::::::::
stitched SE read     --------------------------------------->
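
The stitching shown in the diagram can be sketched as follows. This is a deliberately naive version that assumes error-free reads and requires an exact match in the overlap; real merging tools also tolerate mismatches and use base qualities. The names `reverse_complement`, `stitch_pair` and the `min_overlap` default are my own.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def stitch_pair(r1, r2, min_overlap=10):
    """Merge an overlapping read pair into one sequence.

    R2 comes from the opposite strand, so it is reverse-complemented
    first. The longest exact suffix/prefix overlap of at least
    min_overlap bases is used. Returns None if no overlap is found.
    """
    r1 = r1.upper()
    r2_rc = reverse_complement(r2)
    longest = min(len(r1), len(r2_rc))
    for olap in range(longest, min_overlap - 1, -1):
        if r1[-olap:] == r2_rc[:olap]:
            return r1 + r2_rc[olap:]
    return None


if __name__ == "__main__":
    # Made-up 40 bp insert read with 2 x 25 bp, so 10 bp overlap
    insert = "ATGGCGTACCTTAGGCATCGATTCGGAACTGCTAGTCCAT"
    r1 = insert[:25]
    r2 = reverse_complement(insert[-25:])
    print(stitch_pair(r1, r2) == insert)  # True
```

Note that this sketch does not handle the case where the insert is shorter than a read; that situation is the adaptor read-through case.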

Reading into the adaptor

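If the fragment gets shorter still, to the point where a read is longer than the fragment, the read runs through into the adaptor at the far end, which carries no useful sequence. Software is then needed to remove any adaptor sequence the reads may contain.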

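Sometimes, when looking at certain raw sequencing data, you will see a run of Ns at the 5' end of reads; these were padded in after adaptor removal so that all reads have the same length.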

tiny fragment       ~~~~======================~~~~
insert                  ======================
R1                      ------------------------>
R2                   <------------------------
read-through         !!!                      !!!
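
Removing the read-through shown above can be sketched as a search for the adaptor at the 3' end of the read. This is only an exact-match illustration under my own assumptions: real trimmers also allow mismatches, and the `ADAPTER` string below is a commonly quoted Illumina adaptor prefix used purely as an example, so check the sequence for your own library kit. The name `trim_adapter` and the `min_match` default are mine.

```python
# Example adaptor prefix only; substitute the sequence for your kit.
ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(read, adapter=ADAPTER, min_match=3):
    """Remove adaptor sequence from the 3' end of a read.

    1. If the whole adaptor occurs in the read, cut at its first
       occurrence.
    2. Otherwise, if the read ends with a prefix of the adaptor of at
       least min_match bases, cut that prefix off.
    3. Otherwise return the read unchanged.
    """
    read_u = read.upper()
    adapter = adapter.upper()

    idx = read_u.find(adapter)
    if idx != -1:
        return read[:idx]

    longest = min(len(adapter), len(read_u)) - 1
    for k in range(longest, min_match - 1, -1):
        if read_u.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read


if __name__ == "__main__":
    print(trim_adapter("ACGTACGTAC" + ADAPTER))  # ACGTACGTAC
    print(trim_adapter("ACGTACGTAC" + "AGATC"))  # ACGTACGTAC
```

A small `min_match` removes more true adaptor remnants but also clips more genuine bases that happen to look like the adaptor start, so the threshold is a trade-off.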

"insert" refers to the DNA fragment between the adaptors, and not the gap between R1 and R2. Instead we refer to that as the "inner mate distance". In some cases, when reads overlap, the inner mate distance can actually be negative.

Impact on genome assembly

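A library with short fragments gives higher sequencing depth, but during assembly, if the gap between two contigs is longer than the insert size, that gap cannot be filled properly. With longer fragments, as long as one read of a pair maps to a unique position, the expected distance between the two reads can be used to work out where the other read lies. In other words, read pairs give the relative positions of contigs, which allows them to be joined into scaffolds.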

Paired end reads with long insert size can be used to overcome the limitations of read length.
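
To make the scaffolding idea concrete, here is a rough sketch of how read pairs that span two contigs could give an estimate of the gap between them. Everything here is an assumption for illustration: the function name `estimate_gap`, the layout of `links`, and the use of the median are mine and do not describe how any particular scaffolder works. It assumes R1 maps forward on contig A, R2 maps in reverse on contig B, and A lies upstream of B.

```python
from statistics import median


def estimate_gap(links, contig_a_len, mean_insert):
    """Estimate the gap between contig A and contig B from spanning pairs.

    links: list of (r1_start, r2_end) tuples, where
        r1_start is the 0-based start of R1 on contig A, and
        r2_end is the end (exclusive) of R2 on contig B.
    For each pair, the insert covers (contig_a_len - r1_start) bases
    of A and r2_end bases of B; whatever is left of the expected
    insert size is attributed to the gap.
    """
    estimates = [
        mean_insert - (contig_a_len - r1_start) - r2_end
        for r1_start, r2_end in links
    ]
    return median(estimates)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 bp contig A, 3000 bp mean insert
    links = [(800, 500), (900, 650), (700, 420)]
    print(estimate_gap(links, contig_a_len=1000, mean_insert=3000))  # 2280
```

Because insert sizes vary, and vary more for longer inserts, a single pair gives a poor estimate; combining many pairs is what makes the contig distance usable.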
