SNV filtering
Errors are known to accumulate in and around homopolymers, inverted repeats and G-rich motifs such as GGT and GGC, so these would be ideal inclusions in a best-practice filtering design for Illumina data.
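Commonly used filters:
- within-read position (where the variant falls within the read)
- extreme strand bias (for example, the variant is supported mostly by forward-strand or mostly by reverse-strand reads)
- flanking homopolymer (the variant lies next to a homopolymer run)
- low mapping quality
- proximity to indels (not very practical: indel calls have an even higher false-positive rate than SNP calls, so relying on indels to filter SNPs is unreliable)
- nearby SNVs (a high local density of SNPs is itself a sign that the calls are not trustworthy)
- depth (sequencing depth)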
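As an illustration of the flanking homopolymer check, here is a minimal Python sketch. The function name, the 0-based coordinate convention, and the choice to report the longer of the two adjacent runs are assumptions made for this example; the notes do not specify how the flank is measured or what run length should trigger the filter.

```python
def flanking_homopolymer_length(ref: str, pos: int) -> int:
    """Length of the longest homopolymer run directly adjacent to ref[pos].

    `pos` is 0-based; the base at the site itself is not counted.
    """
    def run_length(start: int, step: int) -> int:
        # Walk away from the site while the base stays the same.
        if not 0 <= start < len(ref):
            return 0
        base = ref[start]
        n = 0
        i = start
        while 0 <= i < len(ref) and ref[i] == base:
            n += 1
            i += step
        return n

    left = run_length(pos - 1, -1)   # run ending just before the site
    right = run_length(pos + 1, 1)   # run starting just after the site
    return max(left, right)
```

A caller would compare the returned length against a threshold of its own choosing to decide whether to flag the site.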
Filters used in the paper (filters designed to remove sequencing and mapping errors):
- variant bases emanate exclusively from one strand (the variant is seen on only one strand)
- mean variant base quality is less than 15 (averaged over the variant bases at the site)
- no variant has base quality over 30 (across all variant bases at the site)
- mean variant mapping quality is less than 15 (averaged over the variant reads at the site)
- no variant has mapping quality over 40 (across all variant reads at the site)
- more than two candidate SNVs (identified by any algorithm) are within 50 bp either side (this may be too strict; consider relaxing it depending on the data)
- spanning deletions contribute >20% to the overall depth in either sample; or candidates are immediately adjacent to indels in >20% of reads in either sample
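The thresholds listed above can be sketched as a single check over a per-site summary. This is only a sketch under stated assumptions: the `SiteEvidence` structure, its field names, and the filter labels are hypothetical, and the two "in either sample" fractions are assumed to be precomputed as the larger of the two samples' values.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class SiteEvidence:
    """Hypothetical per-site summary; field names are illustrative only."""
    variant_base_quals: List[int]      # base quality of each variant base
    variant_map_quals: List[int]       # mapping quality of each variant read
    forward_variant_reads: int
    reverse_variant_reads: int
    nearby_candidate_snvs: int         # other candidates within 50 bp either side
    spanning_deletion_fraction: float  # larger value of the two samples
    indel_adjacent_fraction: float     # larger value of the two samples


def failed_filters(site: SiteEvidence) -> List[str]:
    """Return the labels of every filter the site fails (empty list = pass)."""
    if not site.variant_base_quals or not site.variant_map_quals:
        raise ValueError("site has no variant-supporting reads")
    failed = []
    if site.forward_variant_reads == 0 or site.reverse_variant_reads == 0:
        failed.append("single_strand")
    if mean(site.variant_base_quals) < 15:
        failed.append("low_mean_base_quality")
    if max(site.variant_base_quals) <= 30:      # no variant base over 30
        failed.append("no_high_base_quality")
    if mean(site.variant_map_quals) < 15:
        failed.append("low_mean_mapping_quality")
    if max(site.variant_map_quals) <= 40:       # no variant read over 40
        failed.append("no_high_mapping_quality")
    if site.nearby_candidate_snvs > 2:
        failed.append("snv_cluster")
    if site.spanning_deletion_fraction > 0.20:
        failed.append("spanning_deletion")
    if site.indel_adjacent_fraction > 0.20:
        failed.append("indel_adjacent")
    return failed
```

Returning the list of failed labels, rather than a single pass/fail flag, makes it easy to see which filter is responsible for each removal.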
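First, a strand bias filter is applied: every variant site must have at least one variant-supporting read on each strand (that is, the variant must be seen on both the forward and the reverse strand, not on one strand only). The sites that pass the strand bias filter are then put through each of the following five filters separately, to compare what each filter removes.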
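(1) Remove variant sites that have more than two SNVs within 50 bp. This mainly removes phasing sequencing errors and cases where a single read carries several mismatches;
(2) Remove sites with spanning deletions contributing more than 20% to overall depth (my reading: a site is removed if the reads carrying a deletion that spans it make up more than 20% of the depth at that site). This mainly removes regions with suspect alignments, such as those at repetitive or common sequences;
(3) Remove sites immediately adjacent to indels in more than 20% of reads. This filter is designed to remove sites that differ from the reference by virtue of the mapping algorithm's indel placement, though this comes at the cost of missing any real somatic sites genuinely adjacent to an indel event. In other words, the indel call is given priority, even though some real SNVs may be removed as well;
(4) Base quality filter. Remove sites where the mean base quality of the variant bases is below 15, and sites where no variant base has a quality over 30;
(5) Mapping quality filter. Remove sites where the mean mapping quality of the variant reads is below 15, and sites where no variant read has a mapping quality over 40.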
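The staged design described above (strand bias filter first, then each remaining filter applied separately to the survivors) can be sketched as follows, using the 50 bp density filter as the example. The function names and the dict-based site representation are assumptions for illustration. Neighbours are counted over all candidate positions on one chromosome, which is one possible reading of "candidate SNVs identified by any algorithm".

```python
from bisect import bisect_left, bisect_right
from typing import Callable, Dict, List


def neighbours_within(positions: List[int], pos: int, window: int = 50) -> int:
    """Count other candidates within `window` bp either side of `pos`.

    `positions` must be sorted and come from a single chromosome.
    """
    lo = bisect_left(positions, pos - window)
    hi = bisect_right(positions, pos + window)
    # Exclude the site itself if it is present in the list.
    self_hits = bisect_right(positions, pos) - bisect_left(positions, pos)
    return hi - lo - self_hits


def compare_filters(
    sites: List[dict],
    prefilter: Callable[[dict], bool],
    filters: Dict[str, Callable[[dict], bool]],
) -> Dict[str, int]:
    """Apply `prefilter`, then report how many survivors each filter removes.

    Each callable returns True when the site should be kept. The filters
    are applied independently, not chained, so their effects can be compared.
    """
    survivors = [s for s in sites if prefilter(s)]
    return {
        name: sum(1 for s in survivors if not keep(s))
        for name, keep in filters.items()
    }


def both_strands(site: dict) -> bool:
    """Strand bias filter: at least one variant read on each strand."""
    return site["fwd"] >= 1 and site["rev"] >= 1
```

Usage would look like `compare_filters(sites, both_strands, {"snv_cluster": lambda s: neighbours_within(positions, s["pos"]) <= 2})`, where `positions` is the sorted list of all candidate positions and each site is a dict with `pos`, `fwd`, and `rev` keys.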
Reference:
A comparative analysis of algorithms for somatic SNV detection in cancer