A common misuse of BLAST's -max_target_seqs parameter
Today I came across a paper, "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows", which discusses some long-standing mistakes in how BLAST is used, for example with the '-max_target_seqs' parameter.
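Most people used to set this parameter to 1, assuming BLAST would then output the single best match. The authors tested this and found that the usage is wrong: the output is not the best match but the first sufficiently good match. Worse, the output depends on the order in which the sequences appear in the database. For the same query, BLAST returns different results with different versions of the database, even when every version contains the same best hit. Sorting the database differently also changes the "top hit" that BLAST returns when max_target_seqs is set to 1. The original text reads: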
To enable the efficient processing of large data sets, researchers frequently rely on shortcuts aimed at reducing the number of BLAST results that need to be processed. A common strategy involves using the ‘-max_target_seqs’ parameter of the NCBI BLAST+ suite. According to the BLAST documentation itself (2008), this parameter represents the ‘number of aligned sequences to keep’. This statement is commonly interpreted as meaning that BLAST will return the top N database hits for a sequence query if the value of max_target_seqs is set to N. For example, in a recent article (Wang et al., 2016) the authors explicitly state ‘Setting “max target seqs” as “1,” only the best match result was considered.’
To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1.
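The order dependence described in the quoted passage can be illustrated with a toy model. The sketch below is not BLAST's actual search algorithm; it only contrasts "first hit that passes the E-value threshold" with "highest-scoring hit" on a tiny in-memory list. The function names, the hit tuples, and all numbers are hypothetical and made up for illustration.

```python
# Toy model only: each "database" is a list of (subject_id, evalue, bitscore)
# tuples in database order. All values are invented for illustration.

def first_passing_hit(db, max_evalue):
    """Return the first subject whose E-value passes the threshold."""
    for subject_id, evalue, _bitscore in db:
        if evalue <= max_evalue:
            return subject_id
    return None


def best_hit(db, max_evalue):
    """Return the passing subject with the highest bit score."""
    passing = [hit for hit in db if hit[1] <= max_evalue]
    if not passing:
        return None
    return max(passing, key=lambda hit: hit[2])[0]


if __name__ == "__main__":
    db_a = [("s1", 1e-5, 50.0), ("s2", 1e-30, 120.0)]
    db_b = list(reversed(db_a))  # same sequences, different order

    for label, db in (("order A", db_a), ("order B", db_b)):
        print(label, first_passing_hit(db, 1e-3), best_hit(db, 1e-3))
```

In this toy model the "first passing" answer changes when the list is reordered, while the "best" answer does not, which is the kind of discrepancy the paper reports.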
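After repeated testing of my own, I found that no BLAST parameter returns only the best match; the most reliable approach is to write your own script to filter the results!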
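As a minimal sketch of such a filtering script, the Python below keeps one hit per query from BLAST tabular output. It assumes the default `-outfmt 6` column order (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). Ranking by highest bit score with lowest E-value as the tie-breaker is my own assumption about what "best" means here, and the function name `best_hits` and the file name in the usage example are hypothetical.

```python
import csv
import sys


def best_hits(handle):
    """Return {query_id: row} with the top-ranked hit for each query.

    Assumes default -outfmt 6 columns: evalue is column 11 and
    bitscore is column 12 (1-based). "Best" is an assumption here:
    highest bit score, ties broken by lowest E-value.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (e.g. from -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        query = row[0]
        evalue = float(row[10])
        bitscore = float(row[11])
        rank = (bitscore, -evalue)  # larger tuple means better hit
        if query not in best or rank > best[query][0]:
            best[query] = (rank, row)
    return {query: row for query, (rank, row) in best.items()}


if __name__ == "__main__":
    # Usage: python best_hits.py blast_results.tsv > best_hits.tsv
    with open(sys.argv[1], newline="") as fh:
        for row in best_hits(fh).values():
            print("\t".join(row))
```

A script like this can only choose among the hits that BLAST actually reported, so it is probably most useful when the search itself was run without a small -max_target_seqs value.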