
SNV filtering

2018-10-10  井底蛙蛙呱呱呱

Errors are known to accumulate in and around homopolymers, inverted repeats and G-rich motifs such as GGT and GGC, so these would be ideal inclusions in a best-practice filtering design for Illumina data.
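As a rough illustration of how such sequence contexts could be flagged, here is a minimal Python sketch. The function name `context_flags`, the flank size of 5 bp and the minimum homopolymer length of 4 are assumptions made for this example, not values from the paper. Inverted repeats are left out, and only the given strand is searched.

```python
import re

# Motifs named in the quoted sentence above.
G_RICH_MOTIFS = ("GGT", "GGC")


def context_flags(ref_seq, idx, flank=5, min_homopolymer=4):
    """Flag error-prone sequence context around a 0-based reference index.

    flank and min_homopolymer are illustrative defaults (assumptions).
    """
    start = max(0, idx - flank)
    window = ref_seq[start:idx + flank + 1].upper()
    # A base followed by at least (min_homopolymer - 1) copies of itself.
    run = re.search(r"([ACGT])\1{%d,}" % (min_homopolymer - 1), window)
    return {
        "homopolymer": run is not None,
        "g_rich_motif": any(m in window for m in G_RICH_MOTIFS),
    }
```

A site with either flag set could then be treated with extra caution or passed to an additional filter.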

Commonly used filtering methods:

Filtering methods used in the paper (filters designed to remove sequencing and mapping errors):

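First, a strand-bias filter is applied: at each variant site, the variant must be supported by at least one read on each strand (that is, the variant must be seen on both the forward and the reverse strand, not on only one of them). The calls that pass the strand-bias filter are then put through each of the following five filters separately, so that the effects of the five filters can be compared.
(1) Remove variant sites with more than 2 SNVs within 50 bp. This step mainly removes phasing sequencing errors and cases where a single read may carry several mismatches;
(2) Remove sites with spanning deletions contributing more than 20% to overall depth (my reading: if the reads with a deletion spanning the variant site make up more than 20% of the depth at that site, the site is removed). This step mainly removes regions of dubious alignment (such as those existing at repetitive or common sequences);
(3) Remove sites immediately adjacent to indels in more than 20% of reads. This filter is designed to remove sites that differ from reference by virtue of the mapping algorithm for indel placement, though this comes at the cost of missing any real somatic sites genuinely adjacent to an indel event. In other words, the indel is given priority over the SNV, even though some real SNVs may be discarded as a result;
(4) Base quality filter. Remove sites where the average base quality of the variant bases is below 15, or where every variant base has a quality below 30;
(5) Mapping quality filter. Remove sites where the average mapping quality of the reads at the variant site is below 15, or where every read has a mapping quality below 40.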
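The strand-bias filter and the five filters above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the paper's implementation: `SiteSummary` and its field names are hypothetical, the fractions in (2) and (3) use total depth as the denominator, "more than 2 SNVs within 50 bp" is read as any 50 bp window containing the site, and the text does not say whether the mapping qualities in (5) cover all reads or only variant-supporting reads, so the sketch uses whatever list is supplied.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SiteSummary:
    """Hypothetical per-site pileup summary (not a real library type)."""
    pos: int
    depth: int                      # total reads covering the site
    fwd_variant_reads: int          # variant-supporting reads, forward strand
    rev_variant_reads: int          # variant-supporting reads, reverse strand
    spanning_deletion_reads: int    # reads with a deletion spanning the site
    indel_adjacent_reads: int       # reads with an indel right next to the site
    variant_base_quals: List[int] = field(default_factory=list)
    map_quals: List[int] = field(default_factory=list)


def passes_strand_bias(site):
    # At least one variant-supporting read on each strand.
    return site.fwd_variant_reads >= 1 and site.rev_variant_reads >= 1


def in_snv_cluster(pos, snv_positions, window=50, max_snvs=2):
    # True if some `window`-bp stretch containing pos holds > max_snvs SNVs.
    nearby = sorted(p for p in snv_positions if abs(p - pos) < window)
    k = max_snvs + 1
    for i in range(len(nearby) - k + 1):
        chunk = nearby[i:i + k]
        if chunk[0] <= pos <= chunk[-1] and chunk[-1] - chunk[0] < window:
            return True
    return False


def low_quality(quals, min_mean, min_best):
    # Fails if the mean is below min_mean or no value reaches min_best.
    if not quals:
        return True
    return sum(quals) / len(quals) < min_mean or max(quals) < min_best


def failed_filters(site, snv_positions):
    """Return the names of filters (1)-(5) that the site fails."""
    failed = []
    if in_snv_cluster(site.pos, snv_positions):                  # (1)
        failed.append("snv_cluster")
    if site.depth > 0 and site.spanning_deletion_reads / site.depth > 0.20:
        failed.append("spanning_deletion")                       # (2)
    if site.depth > 0 and site.indel_adjacent_reads / site.depth > 0.20:
        failed.append("indel_adjacent")                          # (3)
    if low_quality(site.variant_base_quals, 15, 30):             # (4)
        failed.append("base_quality")
    if low_quality(site.map_quals, 15, 40):                      # (5)
        failed.append("mapping_quality")
    return failed
```

Returning the names of the failed filters, rather than a single pass/fail flag, mirrors the comparison described above, where each filter is evaluated separately on the strand-bias-filtered calls.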


Reference:
A comparative analysis of algorithms for somatic SNV detection in cancer
