Practical Guide to Significantly Improve Peptide Identification Sensitivity and Accuracy. Bin Ma, CTO, Bioinformatics Solutions Inc. June 5, 2011.
The Sensitivity and Accuracy Dilemma
[Figure: overlapping score distributions of the true and false identifications along the score axis.]
Accuracy and sensitivity are two competing goals in MS/MS data analysis for peptide identification. All existing database search software uses a scoring function to measure the peptide-spectrum matching quality. After the analysis, the identified peptides are sorted by score, and only the ones above a score threshold are reported. However, due to the imperfection of the scoring function and of the data quality, the score distributions of the true and false identifications overlap. This makes it difficult to choose the right score threshold for result reporting. For example, if we pursue sensitivity by using a lower score threshold, more true identifications will be reported, but the number of false positives will increase at the same time. Thus, sensitivity and accuracy must be discussed together in peptide identification with MS/MS. In recent years, the accuracy of the result has often been measured by the false discovery rate (FDR), which is simply defined as the ratio of the number of reported false hits to the total number of reported hits.
Publication Guideline
Earlier experiments paid too much attention to sensitivity and not enough to accuracy. MCP introduced a guideline in 2004 to ensure accuracy.
Earlier research in this field paid too much attention to sensitivity and not enough to accuracy. As noted in an MCP guideline published in 2004, “a significant but undefined number of proteins being reported as identified in proteomics articles are likely to be false positives”. The publication of that guideline was an effort to reduce the number of false discoveries in this type of analysis. Seven years have passed since its publication. Has the goal of the guideline been achieved?
“People are generally over-optimistic about how reliable their results are.” – ABRF iPRG 2011
[Figure: estimated FDR lower bound (red) and upper bound (red + yellow) for each submission; 30 out of 45 submissions have an FDR much higher than the required 1%. iPRG/ABRF 2011 study.]
Earlier this year, ABRF conducted a public study to assess peptide identification capability on mass spectra produced by a high-resolution ETD instrument. They published a common dataset and a protein sequence database, and asked the participants to submit their peptide identification results, using whatever software or combination of software they chose. The only requirement was that each submission should have a false discovery rate (FDR) of no more than 1%. After receiving 45 submissions, ABRF used the consensus of the submissions as the control to estimate the FDR of each submission. The figure summarizes their findings: each vertical bar indicates the estimated FDR of a submission, which should lie between the height of the red bar and the total height of red + yellow. To their surprise, the majority of the submissions had an FDR much higher than 1%. They concluded that “people are generally over-optimistic about how reliable their results are”, even 7 years after the publication of the MCP guideline.
PEAKS Achieved Both Sensitivity and Accuracy
[Figure: the same plot, with submissions sorted by the number of peptides reported and the two PEAKS submissions highlighted.]
Another interesting fact about this figure is that the submissions were sorted by the number of peptides identified, with the largest number at the left. There is a general trend that participants who reported more peptides suffered from a higher FDR. Notably, the two results obtained with PEAKS 5.3 both reported a high number of peptides and kept the FDR well controlled at the required 1% level. How did we achieve that? I will try to reveal some of the reasons in this presentation.
Outline
FDR – pitfalls and solutions
De novo sequencing assisted database search
Three essential examinations to ensure result quality
There are three parts remaining in this talk. In the first part, we will examine the common pitfalls in today’s FDR estimation methods, and how PEAKS successfully avoids those pitfalls. Then, we will explore one of the main reasons for PEAKS 5.3’s increased performance: the PEAKS DB module in PEAKS 5.3 depends heavily on our outstanding de novo sequencing results. In the third part of the talk, I will suggest three essential examinations for every peptide identification analysis, in order to ensure the result quality.
1. FDR – pitfalls and solutions Now the first part: FDR pitfalls and solutions.
FDR Estimation
[Diagram: target and decoy protein DBs → search engine → identified peptides; blue = target hits, orange = decoy hits, squares = false hits, circles = true hits.]
Today’s most widely used method for FDR estimation is the target-decoy strategy. This is a well-established method in statistics that started to be used in proteomics around 2007. In this approach, a decoy database that contains the same number of proteins as the target database is searched together with the target database by the search engine to identify peptides. As illustrated in the figure, blue indicates target hits and orange indicates decoy hits; squares are false hits and circles are true hits. The decoy proteins are randomly generated, so any decoy hit is supposedly a false hit. Since the search engine does not know which sequences are from the target and which are from the decoy, when it makes a mistake, the mistake falls in the target and decoy databases with equal probability. Thus, the total number of false target hits can be approximated by the number of decoy hits in the final result (# false target hits ≈ # decoy hits), and the FDR can be estimated as

FDR = #decoy / #target

The target-decoy strategy is a powerful method for FDR estimation. However, as we will discover shortly, such a powerful method must be used with caution to avoid FDR underestimation.
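As a concrete illustration of this estimate, here is a minimal Python sketch (not PEAKS code; the (score, is_decoy) pair representation and the early-stop threshold scan are simplifying assumptions):

```python
def fdr_at_threshold(psms, threshold):
    """psms: list of (score, is_decoy) pairs. FDR = #decoy / #target above threshold."""
    decoy = sum(1 for score, is_decoy in psms if is_decoy and score >= threshold)
    target = sum(1 for score, is_decoy in psms if not is_decoy and score >= threshold)
    return decoy / target if target else 0.0

def threshold_for_fdr(psms, max_fdr=0.01):
    """Find the lowest score threshold whose estimated FDR stays within max_fdr."""
    best = None
    # Scan candidate thresholds from high scores downward; keep the most
    # permissive one seen before the FDR requirement is first violated.
    for score in sorted({s for s, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, score) <= max_fdr:
            best = score
        else:
            break
    return best
```

In practice, tools compute q-values to handle the fact that the estimated FDR is not strictly monotone in the threshold; the early break above is a conservative shortcut.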
Pitfall 1 – Multiple-Round Search
Round 1: fast search → protein shortlist (more targets than decoys). Round 2: more sensitive search on the shortlist.
The first pitfall in the use of the target-decoy approach for FDR estimation is due to the so-called multiple-round search strategy in today’s database search software. This multi-round search was popularized by the X!Tandem program published in 2004, in order to speed up the computation. The first round uses a fast but less sensitive search method to quickly identify a shortlist of proteins from the large database. Then, the second round uses a more sensitive but slower search method to identify peptides, but only from the shortlisted proteins. This effectively speeds up the search without sacrificing too much sensitivity; indeed, X!Tandem is one of the fastest search algorithms in use today. However, as pointed out by a paper published in JPR in 2010, this multiple-round search strategy breaks the target-decoy estimation of the FDR. The reason is that after the first round, there will be more target proteins than decoy proteins in the shortlist. Thus, if the second-round search makes a mistake, the mistake will more likely fall in the target proteins, and we will end up with fewer decoy hits than actual false target hits: # false target hits > # decoy hits. This causes FDR underestimation. The JPR paper in 2010 provided a fix to this problem. But a year later, in another JPR paper, Bern and Kil pointed out that the fix was wrong, and proposed a different fix that required changing the search engine’s algorithm. This shows that FDR estimation is very tricky; even experts can sometimes get it wrong.
Craig and Beavis 2004. Bioinformatics 20, 1466–67. Everett et al. 2010. J Proteome Res. 9, 700–707. Bern and Kil 2011. J Proteome Res. 10, 2123–27.
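To see the size of the bias, here is a toy simulation with made-up numbers (the 90/10 target/decoy shortlist composition and the false-match count are assumed purely for illustration):

```python
import random

# Illustrative simulation: after round 1, the protein shortlist contains far
# more target than decoy sequence, so a random (false) round-2 match is more
# likely to land on a target protein.
random.seed(0)
target_fraction = 0.9          # assumed shortlist composition after round 1
n_false_matches = 10000        # false matches produced by round 2

decoy_hits = sum(1 for _ in range(n_false_matches)
                 if random.random() >= target_fraction)
false_target_hits = n_false_matches - decoy_hits

# Naive target-decoy logic assumes #false target hits ≈ #decoy hits,
# but here the decoy count is roughly 9x too small.
print(false_target_hits, decoy_hits)   # roughly 9000 vs 1000
```

With nine times as much target as decoy sequence in the shortlist, the decoy count underestimates the false target hits by roughly a factor of nine.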
Our Solution: Decoy Fusion
A decoy sequence is appended to each target protein, so targets and decoys stay equal through both the fast search and the more sensitive search.
In PEAKS, we use a new approach, called decoy fusion, to solve this problem. Instead of mixing the target and decoy databases, we append a decoy sequence to each target protein. So, after the fast search round, the protein shortlist still contains equal lengths of target and decoy sequence, and the false hits of the second round have an equal chance of coming from the target and the decoy sequences. This restores the balance and accurately estimates the FDR in the multiple-round search setting: # false target hits ≈ # decoy hits.
PEAKS DB paper, submitted.
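A minimal sketch of how such a fused database could be built (illustrative only; the shuffle-based decoy generation is my assumption, the talk only says the decoy is randomly generated):

```python
import random

def fuse_decoy(proteins, seed=0):
    """Append a decoy sequence to each target protein (decoy fusion sketch).

    proteins: dict mapping accession -> amino acid sequence.
    A shuffled copy of the target sequence preserves its length and
    amino acid composition.
    """
    rng = random.Random(seed)
    fused = {}
    for acc, seq in proteins.items():
        decoy = list(seq)
        rng.shuffle(decoy)
        # One fused entry per protein: target half + decoy half, so any
        # shortlist built from these entries stays target/decoy balanced.
        fused[acc] = seq + "".join(decoy)
    return fused
```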
Pitfall 2 – Mixing Protein and Peptide ID
Idea: peptides on a multi-hit protein get a bonus on their scores, to increase sensitivity. A weak true hit is “saved” by the bonus; but so is a weak false hit.
Pitfall: more multi-hit proteins come from the target DB ⇒ more false hits are “saved” from the target DB ⇒ FDR underestimation.
The second pitfall of the traditional target-decoy strategy is caused by another popular technique used to increase peptide identification sensitivity. The idea is clever: if a weakly identified peptide happens to be on a highly confident protein, then the peptide is likely to be correct regardless of its low score. So, to increase sensitivity, the software can add a score bonus to each peptide on a multiple-hit protein. Indeed, this protein bonus will save some weak true hits, but it will save some weak false hits at the same time. The bigger problem is that the target database will provide more multiple-hit proteins than the decoy. As a result, more weak false hits will be saved from the target database, causing FDR underestimation.
Our Solution: Decoy Fusion
Weak false hits are “saved” with approximately equal probabilities in target and decoy.
The decoy fusion approach described above also solves this problem effectively. Because the target and decoy sequences are concatenated into a single protein sequence, when a protein bonus is added to a multiple-hit protein, the same bonus is added to its target and decoy hits equally. So, weak false hits are saved with approximately equal probabilities in the target and in the decoy. This restores the balance and provides accurate FDR estimation. By using decoy fusion as the validation method, we can safely apply the protein bonus: we get the sensitivity, but still estimate the FDR correctly.
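To make the mechanism concrete, here is an illustrative sketch of a protein bonus applied on top of decoy fusion (the bonus value and the two-hit criterion are made-up parameters, not PEAKS internals):

```python
def apply_protein_bonus(psms, bonus=1.0, min_hits=2):
    """Sketch of a protein bonus under decoy fusion (values are illustrative).

    psms: list of dicts with keys 'score', 'protein' (fused accession),
    and 'is_decoy' (True if the match falls in the decoy half).
    """
    # Count hits per fused protein, target and decoy together, so the
    # bonus decision cannot favor the target half.
    hits_per_protein = {}
    for psm in psms:
        hits_per_protein[psm["protein"]] = hits_per_protein.get(psm["protein"], 0) + 1

    for psm in psms:
        if hits_per_protein[psm["protein"]] >= min_hits:
            psm["score"] += bonus   # same bonus for target and decoy hits
    return psms
```

Because hits are counted per fused entry, the target and decoy halves of each protein share the same bonus decision.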
Pitfall 3 – Machine Learning with the Decoy
Idea: re-train the coefficients of the scoring function for every search, adjusting the scoring function to remove decoy hits after the search.
Pitfall: risk of overfitting. Machine learning experts only.
The third pitfall is also caused by the over-emphasis on sensitivity. There is another trend in database search software: rescoring the peptide identification results by machine learning. The idea is straightforward: after the search, we know which hits are decoy hits, so the algorithm can take advantage of this and retrain the parameters of the scoring function to get rid of the decoy hits, hopefully removing many of the false target hits as well. The problem is that this may cause FDR underestimation. Because the false target hits are unknown to the machine learning algorithm, there is a risk that the algorithm removes more decoy hits than false target hits: fewer false target hits removed ⇒ FDR underestimation. This overfitting risk is well known in machine learning; an expert can reduce the risk, but can never eliminate it.
Solutions
Don’t use it – judges cannot be players. Or: only use it for very large datasets. Or: train the coefficients once and reuse them; don’t re-train for every search.
The solution to pitfall number 3 is trickier. The first suggestion: don’t use it. The philosophy here is that judges cannot be players; if we want to use the decoy for result validation, the decoy information should never be released to the search algorithm. If this rescoring method must be used due to the low performance of some database search software, it should only be used on very large datasets, to reduce the risk of overfitting. Perhaps the best solution is the third one: retrain the score parameters once for each instrument type, instead of for each dataset. This gains much of the benefit provided by machine learning, without the problem of overfitting. Indeed, this third approach is what we do in the PEAKS DB algorithm.
PEAKS 5.3
PEAKS DB uses all these techniques (and many more) to ensure accuracy while maximizing sensitivity. Reliable FDR estimation is the top priority in the PEAKS DB design.
In PEAKS 5.3, we used all the aforementioned techniques, and many more, to ensure the accuracy while maximizing the sensitivity. Reliable FDR estimation is the top priority in the design of the PEAKS DB algorithm: if a trick can increase the sensitivity but may compromise the result validation, we do not use it. The excellent result of PEAKS DB in the ABRF study demonstrates that our philosophy worked. Sensitivity does not have to come at the price of accuracy, and does not have to compromise the result validation. In the next part of this talk, we will learn some more details of the PEAKS DB algorithm.
2. De novo sequencing assisted database search
A major difference between the PEAKS DB algorithm and other database search algorithms is that PEAKS DB utilizes the excellent PEAKS de novo sequencing results.
An Idea to Improve the Scoring Function
Idea: if the de novo sequence matches a DB peptide, the identification is likely to be correct.
[Figure: overlapping true and false score distributions.]
Recall that a scoring function is usually imperfect and cannot completely separate the true and false identifications. This causes the accuracy and sensitivity dilemma. To improve the accuracy and sensitivity simultaneously, the scoring function has to consider more features that can distinguish the true and false matches. One such feature is the match between the de novo sequencing result and the database search result. The idea is simple: since de novo sequencing does not look into a sequence database, if its result agrees with the database search result, this is a very significant event and both results are likely to be correct. In such a case, even if the database search scoring function gives a low score to the peptide, we should include that peptide in the final result. This idea is exploited in the PEAKS DB algorithm: the similarity between the de novo sequence and the database sequence is used as a feature of the PEAKS DB scoring function.
De Novo Assisted DB Search
[Figure: scatter plot; x-axis: DB search score; y-axis: number of matched amino acids between de novo and DB search; best separation line of the form x + 4y = constant.]
This figure illustrates the idea in more detail. Each data point is a peptide identified by a commonly used database search engine. The blue and orange points are from the target and the decoy, respectively. The x-axis is the database search score, and the y-axis is the similarity between the de novo and database search results, measured by the number of matched amino acids between the two. Clearly, the best separation of the target and decoy is achieved by a combination of the regular database search score and the new de novo matching feature.
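Reading the separation line off the slide, the combined discriminant is roughly of the form x + 4y. A toy sketch of such a combination (the weight 4 and the position-wise matching rule are illustrative only, not the actual PEAKS DB scoring function):

```python
def combined_score(db_score, denovo_sequence, db_peptide):
    """Illustrative combined discriminant: DB search score plus a de novo
    agreement term, echoing the x + 4y separation line in the figure."""
    # Simplistic position-wise agreement count; real matching would use a
    # more careful alignment of the two sequences.
    matched = sum(1 for a, b in zip(denovo_sequence, db_peptide) if a == b)
    return db_score + 4 * matched
```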
Including de novo matching as a feature gives the scoring function better discriminative power.
[Figure: true and false score distributions, before and after adding the feature.]
Thus, by including the de novo matching as a new score feature, the new scoring function has better discriminative power to separate the true and false identifications. Consequently, the accuracy and sensitivity can be increased simultaneously. This is just one example of the many new features in PEAKS 5.3 for improving the scoring function.
“… far better than what I could ever squeeze out of my data” – Stefano Gotta, Siena Biotech
With all these new scoring features, PEAKS DB provides unprecedented accuracy and sensitivity for database search. Many of our users have happily discovered that PEAKS 5.3 can identify many more peptides than their previous software or combination of software, including both older versions of PEAKS and other commonly used software tools. For example, on the ABRF study dataset, at 1% FDR, PEAKS DB identified 60% more peptide-spectrum matches than another commonly used commercial software package.
PEAKS DB Workflow
De novo both helps to improve the DB search and reports novel peptides.
[Flowchart: All Spectra → De Novo → DB Search → Found? Yes → DB peptides; No → confident de novo tag? → De novo only.]
This is the workflow the PEAKS DB algorithm uses. A de novo sequencing analysis is performed for each spectrum in the data. The de novo sequencing result is used to assist the database search and report more database peptides. Additionally, if a spectrum is not confidently assigned a database peptide, but has a confident de novo sequence tag, the de novo tag is reported in a separate table, called the “de novo only” table, in the PEAKS DB result. This table lists the possible novel or mutated peptides in your sample. By combining the de novo sequencing and database search analyses, PEAKS DB not only achieves superior accuracy and sensitivity for database search, but also helps discover novel and mutated peptides. This is only made possible by the acclaimed PEAKS de novo sequencing algorithm.
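In pseudocode, the workflow could look like the following sketch; every helper here is a hypothetical stand-in (not an actual PEAKS API), with deliberately trivial placeholder logic:

```python
# Hypothetical stand-ins for the real PEAKS components (illustration only).
def de_novo(spectrum):            return spectrum.get("denovo_tag", "")
def db_search(spectrum, db, tag): return spectrum.get("db_match")
def is_confident(match):          return match["score"] >= 30   # illustrative cutoff
def has_confident_tag(tag):       return len(tag) >= 6          # illustrative cutoff

def peaks_db_workflow(spectra, database):
    """High-level sketch of the slide's workflow."""
    db_peptides, de_novo_only = [], []
    for spectrum in spectra:
        tag = de_novo(spectrum)                     # sequence every spectrum de novo
        match = db_search(spectrum, database, tag)  # tag assists the database search
        if match is not None and is_confident(match):
            db_peptides.append(match)               # confident database peptide
        elif has_confident_tag(tag):
            de_novo_only.append(tag)                # "de novo only" table entry
    return db_peptides, de_novo_only
```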
3. Three essential examinations to ensure result quality
As we have seen in part 1 of this talk, reliable result validation is an important but difficult task for peptide identification analysis. In part 3, I will suggest three essential but simple examinations that are suitable for every peptide identification analysis. These three simple examinations will help to effectively ensure the result quality.
Don’t Trust Software Blindly!
Googling “Don’t trust software blindly” returned 5,140,000 results. As you quality control your experiments, quality control the software’s results too.
As a software person who has worked in the mass spectrometry area for over ten years, I have noticed a tendency for wet-lab people to trust software more than bioinformaticians do. However, the fact is that you cannot trust software blindly. To illustrate this, a Google search of the phrase returned over 5 million results. Just as you quality control your wet-lab experiments, you should carefully quality control the software’s results too. Ironically, I should not trust Google’s results blindly either, because many of those 5 million results are actually irrelevant.
Essential Examination 1
Low #decoy in the high score region; #decoy ≈ #target in the low score region.
The first suggested examination for a peptide identification result is the score distribution histogram of the results. The x-axis is the score, and the y-axis is the number of peptide-spectrum matches with that score. The target and decoy matches are drawn in different colors. In such a figure, you should see a low number of decoy hits above the score threshold, indicated by the dashed vertical line; this indicates the high accuracy of the result. Additionally, you should see a similar number of target and decoy hits in the low score region; this indicates that the FDR estimation is working properly.
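If your software exports per-PSM scores with target/decoy labels, this check is easy to produce yourself; a minimal matplotlib sketch (the (score, is_decoy) pair representation is an assumption):

```python
import matplotlib.pyplot as plt

def plot_score_histogram(psms, threshold):
    """Examination 1 sketch: overlaid target/decoy score histograms.
    psms: list of (score, is_decoy) pairs; threshold: chosen score cutoff."""
    target = [s for s, d in psms if not d]
    decoy = [s for s, d in psms if d]
    plt.hist(target, bins=50, alpha=0.5, label="target")
    plt.hist(decoy, bins=50, alpha=0.5, label="decoy")
    plt.axvline(threshold, linestyle="--", color="black")  # score threshold
    plt.xlabel("score")
    plt.ylabel("# PSMs")
    plt.legend()
    plt.show()
```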
Essential Examination 2
High-scoring peptides should have low precursor error; the precursor error starts to scatter below the threshold.
The second examination is a scatter plot of the results according to the score and the precursor mass error in parts-per-million (ppm). This is particularly useful for today’s high-resolution instruments. What you should observe in this figure is that the precursor mass error is low above the score threshold, and starts to scatter below the score threshold. The precursor mass error provides a second, independent safeguard for the result quality.
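The precursor mass error in ppm follows from the standard relation between measured m/z, charge, and neutral mass; a small sketch:

```python
def precursor_error_ppm(observed_mz, charge, peptide_mass, proton_mass=1.007276):
    """Precursor mass error in parts-per-million.
    observed_mz: measured precursor m/z; peptide_mass: theoretical
    monoisotopic neutral mass of the identified peptide."""
    observed_mass = observed_mz * charge - charge * proton_mass
    return (observed_mass - peptide_mass) / peptide_mass * 1e6
```

Plotting this error against the PSM score gives exactly the scatter plot described above.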
Essential Examination 3
Spectrum annotation around the score threshold.
Finally, the third examination is to check a few peptide-spectrum matches around the score threshold; they should have reasonable matching quality. You do not need to check every peptide; usually, several randomly sampled peptides serve the purpose very well. This examination is particularly useful when your dataset or the number of proteins is small, in which case FDR estimation using target-decoy or decoy fusion is less effective.
Take Home Message
Another year of dedicated work on PEAKS. Accuracy ensured; sensitivity maximized. Do the three essential examinations; they are simple … at least in PEAKS.
To sum up, it has been another year of dedicated work since PEAKS 5.2. The PEAKS DB module in 5.3 is a major improvement over previous versions, as well as over other database search software. Building on PEAKS’ unique de novo sequencing capability, as well as our newly developed decoy fusion method, we ensured the accuracy and maximized the sensitivity of PEAKS DB. I also highly recommend the three essential examinations for every database search analysis, whether it is done with PEAKS or with other software; they will effectively ensure the quality of your database search results. After all, these three examinations are very simple. At least, we made them very simple in PEAKS 5.3.
“A big step forward” – Christian Schmelzer, Martin Luther University
We have received much positive feedback from early users of 5.3. Since they liked it, we are confident that you will like it too. You can now download a 30-day fully functional trial from BSI’s website. Enjoy!
http://www.bioinfor.com/peaks-download-a-pricing