Proteomics database searching and FDR control

2020-03-07  MJades

Part 1. How is spectrum database searching implemented in proteomics?

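In proteomics, search software typically matches the spectra acquired by mass spectrometry against a sequence database using one of the following three approaches: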

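  1. Deriving a partial or complete peptide sequence from the mass information (first implemented by PeptideSearch and by graph-theory-based de novo methods);

  2. Correlating the experimental spectrum with the calculated spectrum (first applied in SEQUEST, whose XCorr score is a cross-correlation);

  3. Calculating the probability that the observed number of matches between theoretical and measured fragment masses arose by chance (first used in Mascot).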

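A brief introduction to the Andromeda peptide search engine:
Andromeda, the peptide search engine embedded in MaxQuant, scores peptide-spectrum matches with a probability based on the binomial distribution. The same score drives the downstream analysis, for example ranking candidate peptides and assessing how likely a peptide modification is. Standalone Andromeda can process a small number of spectra; for each spectrum it returns scored lists of peptides and proteins, but without strict FDR control.
Andromeda's strengths are: 1. identifying multiple modifications on the same peptide; 2. resolving mixed spectra.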

[Figure: Schematic of the peptide scoring algorithm. Fragment ions are transferred to charge = 1.]
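As a rough illustration of binomial-probability scoring, the sketch below computes the chance of seeing at least k matches among n theoretical fragment masses when each one matches at random with probability q/100, and reports it as -10·log10. The function name and the interpretation of q as "peaks kept per 100 Th" are assumptions made for this example; the real Andromeda implementation does considerably more (for instance optimising q and handling modifications), so this is not its actual code.

```python
from math import comb, log10


def binomial_match_score(n_theoretical: int, k_matched: int, q_per_100: int) -> float:
    """Return -10*log10 of the chance probability of k or more matches.

    n_theoretical: number of theoretical fragment masses (hypothetical input)
    k_matched:     number of them found in the measured spectrum
    q_per_100:     peaks kept per 100 Th, so p = q/100 is the assumed
                   probability that a single fragment matches at random
    """
    p = q_per_100 / 100.0
    # Upper tail of the binomial distribution: P(X >= k)
    tail = sum(
        comb(n_theoretical, j) * p**j * (1.0 - p) ** (n_theoretical - j)
        for j in range(k_matched, n_theoretical + 1)
    )
    return -10.0 * log10(tail)
```

With the same n and q, more matched fragments give a smaller chance probability and therefore a higher score, which is what allows candidate peptides for one spectrum to be ranked.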

Part 2. FDR control

When multiple independent statistical hypothesis tests are conducted, single-hypothesis significance measures (such as the p-value) are neither sufficient nor amenable to extrapolation for calculating the error rate of the whole population of tests. This is a classic case of the multiple testing problem.
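A small arithmetic sketch of why a per-test p-value threshold does not control the error over many tests. The function names are made up for this example, and it assumes the idealised case in which every test is a true null and the tests are independent.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of false positives if all n_tests are true nulls."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance of one or more false positives among independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests
```

For example, accepting every match with p < 0.01 across 10,000 spectra would be expected to let through about 100 chance matches under these assumptions.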

  1. In the context of proteomics, the FDR is a global estimate of the false positives present in the results obtained by a database search algorithm. There are many strategies for estimating it, such as the nonparametric simple target-decoy (TD) database search and the parametric or semi-parametric mixture modeling approaches used in the Trans-Proteomic Pipeline (TPP).
  2. The q-value of a PSM provides a direct measure of that PSM's significance with respect to the complete dataset: it is the risk accrued to the total set of accepted matches if that hit is deemed significant.

The basic assumption of the target-decoy (TD) approach is that, above a given score threshold, the number of false PSMs in the decoy search equals the number of false PSMs in the target search.

Dividing this estimated number of false positives (the decoy hits) by the total number of target hits above the threshold gives the FDR.
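The calculation can be sketched as follows for a separate target and decoy search. The function name and the plain lists of scores are assumptions for illustration, and higher scores are taken to be better.

```python
def target_decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR at a score threshold from a separate TD search.

    Decoy hits at or above the threshold stand in for the false target
    hits, following the TD assumption described above.
    """
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0  # nothing accepted, so nothing can be false
    return min(1.0, n_decoy / n_target)
```

With five target hits and one decoy hit above the threshold, the estimate is 1/5 = 0.2.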

Posterior error probability (PEP) is the probability that a given PSM is incorrect.

  1. While the q-value conveys the risk (the error introduced) in the whole dataset if we accept the PSM at hand, the PEP tells us whether that individual PSM is likely to be correct or not.
  2. FDR can be calculated from PEPs by summing (integrating) them: the sum of the PEPs of all accepted PSMs is the expected number of false positives, and dividing it by the number of accepted PSMs gives the FDR. PEPs can be calculated accurately by using machine learning to learn the model parameters from labeled (correct and incorrect) training data. For any given score x, the PEP can then be predicted from the model parameters. This strategy is used in PeptideProphet and ProteinProphet.
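A minimal sketch of the PEP-to-FDR relationship, assuming the PEPs have already been estimated by some model. The function names are hypothetical and are not taken from PeptideProphet or any other tool.

```python
def fdr_from_peps(accepted_peps):
    """FDR of an accepted set: expected false positives / number accepted."""
    if not accepted_peps:
        return 0.0
    return sum(accepted_peps) / len(accepted_peps)


def qvalues_from_peps(peps):
    """q-value per PSM: the FDR of the set of PSMs with PEP <= its own.

    PSMs are accepted in order of increasing PEP, so the running mean
    never decreases and can be used directly as the q-value.
    """
    order = sorted(range(len(peps)), key=lambda i: peps[i])
    qvalues = [0.0] * len(peps)
    running_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        running_sum += peps[idx]
        qvalues[idx] = running_sum / rank
    return qvalues
```

Note that a PSM's q-value is never larger than its own PEP here, because the q-value averages over all better-scoring PSMs as well.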
  1. ProteoStats requires the data to be searched with the separate TD approach, because it can perform the TD competition after the search, as suggested by Fitzgibbon et al.
  2. The target and decoy searches are run separately, and the top hits from each are provided as input to ProteoStats. When the searches are conducted separately, all the different FDR methods can be applied a posteriori; if a concatenated search is used, only the concatenated FDR method can be applied, because the correspondence between target and decoy top hits is lost. ProteoStats removes peptides that are identical in the decoy and target sets, treating isoleucine and leucine as identical. The resulting target and decoy sets are sorted separately by score, e-value, or p-value from best to worst, and, depending on the search strategy chosen, the FDR, q-value, and receiver operating characteristic (ROC) curve are calculated.
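The steps described above can be sketched roughly as below. This is not ProteoStats code: the function names, the (peptide, score) tuples, and the choice to drop the shared peptides from the decoy side are assumptions made for illustration, and higher scores are taken to be better.

```python
def normalize_il(peptide: str) -> str:
    """Treat isoleucine and leucine as the same residue."""
    return peptide.replace("I", "L")


def drop_shared_decoys(target_hits, decoy_hits):
    """Remove decoy hits whose peptide also occurs among the target hits.

    Hits are (peptide, score) tuples; comparison ignores I/L differences.
    """
    target_seqs = {normalize_il(pep) for pep, _ in target_hits}
    return [(pep, s) for pep, s in decoy_hits
            if normalize_il(pep) not in target_seqs]


def separate_search_qvalues(target_scores, decoy_scores):
    """Return (score, q-value) pairs for targets, best score first."""
    targets = sorted(target_scores, reverse=True)
    decoys = sorted(decoy_scores, reverse=True)
    fdrs = []
    n_decoy = 0
    for n_target, score in enumerate(targets, start=1):
        # Count decoys scoring at least as well as this target hit
        while n_decoy < len(decoys) and decoys[n_decoy] >= score:
            n_decoy += 1
        fdrs.append(min(1.0, n_decoy / n_target))
    # q-value: lowest FDR at this threshold or any more permissive one
    qvalues = fdrs[:]
    for i in range(len(qvalues) - 2, -1, -1):
        qvalues[i] = min(qvalues[i], qvalues[i + 1])
    return list(zip(targets, qvalues))
```

The backward pass is what turns the raw FDR curve, which can go up and down, into monotone q-values.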

The protein-level FDR is calculated, at any protein score threshold, as the ratio of the expected number of false-positive protein identifications (those that hit decoy database proteins) to the total number of protein identifications mapping to the target database. For protein FDR, the MAYU software can be used, which estimates the protein identification-level FDR on the basis of peptide identifications.

The following notes explain some details with reference to the algorithms used in Proteome Discoverer (PD) 2.2.

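  1. In PD 2.2, the PSM-level FDR is by default calculated with the target and decoy databases handled separately;

  2. When only a few spectra are searched, or the database contains only a few proteins, FDR estimation does not work, because too few database matches are obtained to give meaningful statistics;

  3. By default, PD 2.2 builds the decoy database by directly reversing each protein sequence, but note that this kind of decoy database is not suitable in the following two cases:
    a. peptide mass fingerprinting;
    b. no-enzyme MS/MS searches, especially with dynamic modifications;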


    [Figure: notes on the decoy database in PD 2.2]
  4. PD 2.2 offers two ways to set up FDRs: the Percolator node and the Target Decoy PSM Validator node.

Percolator is a superior validation algorithm that uses a machine learning approach, but it requires a sufficient number of target and decoy matches that are not always available. In these cases, you can use the Target Decoy PSM Validator node. This node triggers a target and decoy search and calculates score thresholds to achieve the specified target false discovery rate (FDR). The derived score thresholds for the strict and relaxed FDR separate the identified PSMs into high-, medium-, and low-confidence identifications.
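To make the description of the Target Decoy PSM Validator node more concrete, here is a hedged sketch of the general idea: build a reversed decoy, find the most permissive score threshold that still meets a target FDR, and use a strict and a relaxed threshold to assign confidence classes. None of this is Proteome Discoverer code. The function names are hypothetical, higher scores are assumed to be better, and any FDR targets passed in (for example 0.01 strict and 0.05 relaxed) are assumed values.

```python
def reverse_decoy(protein_sequence: str) -> str:
    """Build a decoy entry by reversing the protein sequence."""
    return protein_sequence[::-1]


def score_threshold_for_fdr(target_scores, decoy_scores, max_fdr):
    """Lowest target score at which decoys/targets is still <= max_fdr.

    Returns None when no threshold reaches the requested FDR.
    """
    best = None
    for threshold in sorted(set(target_scores), reverse=True):
        n_target = sum(1 for s in target_scores if s >= threshold)
        n_decoy = sum(1 for s in decoy_scores if s >= threshold)
        if n_decoy / n_target <= max_fdr:
            best = threshold
    return best


def confidence(score, strict_threshold, relaxed_threshold):
    """Classify a PSM as high, medium, or low confidence."""
    if strict_threshold is not None and score >= strict_threshold:
        return "high"
    if relaxed_threshold is not None and score >= relaxed_threshold:
        return "medium"
    return "low"
```

A reversed sequence keeps exactly the same residues as the original, so any unspecific stretch of it has a same-composition counterpart in the target; this may be one reason reversed decoys are discouraged for no-enzyme searches, although the source does not state the reason.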

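Limitations of Percolator
  1. The Maximum Delta Cn parameter can be used to reduce the number of PSMs, which in turn affects the PSM-level FDR. Its default value in PD 2.2 is 0.05. Normally the top-ranked score is clearly higher than the scores of the other candidate PSMs, but when dynamic modifications are present, the scores of the well-matching PSMs can lie very close together; when studying phosphorylation, the Maximum Delta Cn value should therefore be increased appropriately.
    [Figure: delta Cn]
    In addition, PSMs can be filtered by setting the Maximum Rank, Maximum Delta Mass, and Score and Threshold parameters.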
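A small sketch of how a maximum delta Cn filter behaves. It assumes that delta Cn is the normalized score difference between the top-ranked candidate for a spectrum and the candidate at hand, (top - score) / top; that definition and the function names are assumptions for this example, not taken from the PD 2.2 documentation.

```python
def delta_cn(top_score: float, score: float) -> float:
    """Normalized score difference to the top-ranked candidate."""
    return (top_score - score) / top_score


def filter_by_max_delta_cn(candidate_scores, max_delta_cn=0.05):
    """Keep the candidates of one spectrum whose delta Cn is small enough.

    candidate_scores: scores of all candidate PSMs for a single spectrum.
    Returns the kept scores, best first.
    """
    if not candidate_scores:
        return []
    ranked = sorted(candidate_scores, reverse=True)
    top = ranked[0]
    if top <= 0:
        return ranked[:1]  # delta Cn is undefined; keep only the top hit
    return [s for s in ranked if delta_cn(top, s) <= max_delta_cn]
```

Raising `max_delta_cn` keeps more near-tied candidates per spectrum, which is the situation described above for positional isomers of phosphopeptides.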