Slide 1: PepArML: A Model-Free, Result-Combining Peptide Identification Arbiter via Machine Learning
Xue Wu, Chau-Wen Tseng, Nathan Edwards
University of Maryland, College Park, and Georgetown University Medical Center
Slide 2: Comparison of Search Engines
- No single score is comprehensive
- Search engines disagree
- Many spectra lack a confident peptide assignment
- Many spectra lack any peptide assignment
[Venn diagram of spectra identified by X! Tandem, SEQUEST, and Mascot: 38% / 14% / 28% / 14% / 3% / 2% / 1%; Searle et al., JPR 7(1), 2008]
Slide 3: Black-Box Techniques
- Significance re-estimation
  - Target-decoy search
  - Bimodal distribution fit
- Supervised machine learning
  - Train predictors on synthetic datasets
  - Select and/or create (many) good features
- Result combiners
  - Incorrect peptide IDs unlikely to match
  - Significance re-estimation
  - Independence and/or supervised model
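The target-decoy idea above can be sketched in a few lines: search a reversed or shuffled "decoy" database alongside the real one, and use decoy hits to estimate how many target hits above a score threshold are wrong. This is a generic illustration, not PepArML's implementation; the PSM tuple layout is assumed for the example.

```python
# Sketch of target-decoy FDR estimation. Each PSM is a hypothetical
# (score, is_decoy) pair; higher score = better match.

def estimate_fdr(psms, threshold):
    """Estimate FDR at a score threshold as #decoy / #target
    among peptide-spectrum matches scoring at or above it."""
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

psms = [(9.1, False), (8.7, False), (7.9, True), (7.5, False), (6.2, True)]
print(estimate_fdr(psms, 7.0))  # 1 decoy / 3 targets ≈ 0.333
```

In practice the threshold is swept to report the score cutoff that achieves a desired FDR (e.g., 1%).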
Slide 4: PepArML
- Unified machine learning result combiner
  - Significance re-estimation too!
- Model-free feature use and result combination
  - Use agreement and features if useful
- Unsupervised training procedure
  - No loss of classification performance
Slide 5: PepArML Overview
[Diagram: search results from X!Tandem, Mascot, OMSSA, and other engines feed into PepArML]
Slide 6: PepArML Overview
[Diagram: feature extraction from X!Tandem, Mascot, OMSSA, and other engine results, feeding PepArML]
Slide 7: Dataset Construction
[Table: per-spectrum true (T) / false (F) labels for X!Tandem, Mascot, and OMSSA results]
Slide 8: Dataset Construction
- Calibrant 8-protein mix (C8): 4594 MS/MS spectra (LTQ); 618 (11.2%) true positives
- Sashimi 17mix_test2 (S17): 1389 MS/MS spectra (Q-TOF); 354 (25.4%) true positives
- AURUM 1.0 (364 proteins): 7508 MS/MS spectra (MALDI-TOF-TOF); 3775 (50.3%) true positives
Slide 9: PepArML Machine Learning
- Machine learning (generally) helps single search engines
- The PepArML result combiner (C-TMO) improves on single search engines
- Sometimes combining two search engines works as well as, or better than, three
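One way to picture result combination: for each spectrum, collect every engine's top peptide and score, then derive agreement features for a downstream classifier. This is a minimal illustration; the dictionary layout, feature choice, and a learned model in place of the raw vector are all assumptions, not PepArML's actual interface.

```python
# Sketch of merging per-engine results for one spectrum into a
# consensus peptide plus a small feature vector.
from collections import Counter

def combine(results):
    """results: dict engine -> (peptide, score). Returns the consensus
    peptide and a feature vector [n_engines_agreeing, best_score]."""
    votes = Counter(pep for pep, _ in results.values())
    peptide, agree = votes.most_common(1)[0]
    best = max(score for pep, score in results.values() if pep == peptide)
    return peptide, [agree, best]

results = {"X!Tandem": ("PEPTIDER", 42.0),
           "Mascot":   ("PEPTIDER", 35.5),
           "OMSSA":    ("PEPTIDEK", 12.1)}
print(combine(results))  # ('PEPTIDER', [2, 42.0])
```

A classifier trained on such vectors can exploit agreement when it helps and fall back on scores when it does not, which is the "model-free" behavior the slides describe.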
Slide 10: PepArML vs Search Engines (C8)
Slide 11: True vs. Estimated FDR (C-TMO, C8)
Slide 12: PepArML vs Search Engines (C8)
Slide 13: PepArML Pairs vs PepArML (C8)
Slide 14: Sensitivity Comparison
Slide 15: Feature Evaluation
- Tandem: 1. peptide length; 2. hyperscore; 3. precursor mass delta; 4. # of matched y-ions; 5. # of matched b-ions; 6. # of missed cleavages; 7. sum matched intensity; 8. E-value; 9. sentinel
- OMSSA: 10. score; 11. precursor mass delta; 12. # of matched ions; 13. # of matched peaks; 14. # of missed cleavages; 15. E-value; 16. sentinel
- Mascot: 17. p-value; 18. # of matched ions; 19. E-value; 20. sentinel
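The 20-feature layout above can be assembled mechanically: each engine contributes its own features, and a sentinel flag marks spectra an engine did not identify. The feature names follow the slide; the dictionary-based layout and the zero-fill convention are illustrative assumptions, not PepArML's actual format.

```python
# Sketch of building the 20-element feature vector: per-engine features
# plus one sentinel per engine (1.0 when the engine's result is missing).

TANDEM = ["peptide length", "hyperscore", "precursor mass delta",
          "# matched y-ions", "# matched b-ions", "# missed cleavages",
          "sum matched intensity", "E-value"]
OMSSA = ["score", "precursor mass delta", "# matched ions",
         "# matched peaks", "# missed cleavages", "E-value"]
MASCOT = ["p-value", "# matched ions", "E-value"]

def feature_vector(tandem=None, omssa=None, mascot=None):
    """Concatenate per-engine features; sentinel = 1.0 if engine missing."""
    vec = []
    for names, values in ((TANDEM, tandem), (OMSSA, omssa), (MASCOT, mascot)):
        if values is None:
            vec.extend([0.0] * len(names) + [1.0])  # sentinel set
        else:
            vec.extend([values[n] for n in names] + [0.0])
    return vec

v = feature_vector(omssa={n: 1.0 for n in OMSSA})
print(len(v))  # 9 + 7 + 4 = 20
```

The sentinel lets the classifier distinguish "engine scored this poorly" from "engine never saw this spectrum".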
Slide 16: Application to Real Data
- How well do these models generalize?
  - Different instruments: spectral characteristics change scores
  - Different search parameters change score values
- Supervised learning requires:
  - (Synthetic) experimental data from every instrument
  - Search results from available search engines
  - Training/models for all parameters × search-engine sets × instruments
Slide 17: Model Generalization
[Figure: Train C8 / Score S17 vs. Train S17 / Score S17]
Slide 18: Rescuing Machine Learning
- Train a new machine learning model for every dataset!
  - Generalization not required
  - No predetermined search engines, parameters, instruments, or features
- Perhaps we can "guess" the true proteins
  - Most proteins are not in doubt
  - Machine learning can tolerate imperfect labels
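The "guess, train, rescore" idea above resembles a self-training loop: guess labels from the current scores, fit a model on those (imperfect) labels, rescore, and repeat until the labeling stabilizes. The loop below is a generic sketch under that assumption; the label, fit, and score functions are hypothetical stand-ins, not PepArML internals.

```python
# Sketch of unsupervised self-training: iterate label-guessing,
# model fitting, and rescoring for a fixed number of rounds.

def self_train(psms, label_fn, fit_fn, score_fn, rounds=3):
    """psms: list of feature values. label_fn guesses T/F labels from
    current scores; fit_fn trains a model; score_fn rescored PSMs."""
    scores = [0.0] * len(psms)
    for _ in range(rounds):
        labels = label_fn(psms, scores)   # imperfect labels are tolerated
        model = fit_fn(psms, labels)      # retrain on the guessed labels
        scores = [score_fn(model, x) for x in psms]
    return scores

# Toy demo with 1-D features; the "model" is just the mean of the
# positively labeled values.
label_fn = lambda psms, s: [x > sum(psms) / len(psms) for x in psms]
fit_fn = lambda psms, lab: (sum(x for x, l in zip(psms, lab) if l)
                            / max(1, sum(lab)))
score_fn = lambda model, x: x / model
print(self_train([1.0, 2.0, 8.0, 9.0], label_fn, fit_fn, score_fn))
```

Because most high-confidence protein identifications are stable across rounds, a loop like this converges quickly even when some initial labels are wrong.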
Slide 19: Unsupervised Learning
Slide 20: Unsupervised Learning (S17)
Slide 21: Unsupervised Learning (S17)
Slide 22: Protein Selection Heuristic
- Modeled on typical protein identification criteria
  - High-confidence peptide IDs
  - At least 2 non-overlapping peptides
  - At least 10% sequence coverage
- Robust, fast convergence
- Easily enforce additional constraints
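The acceptance criteria above translate directly into a small check: given the protein length and the high-confidence peptides' match positions, require at least 2 non-overlapping peptides and at least 10% coverage. The interval representation and greedy non-overlap count below are a simplified illustration of those criteria, not PepArML's code.

```python
# Sketch of the protein selection heuristic: peptides are (start, end)
# residue intervals on the protein, inclusive.

def accept_protein(length, peptides, min_nonoverlap=2, min_coverage=0.10):
    """Accept if >= min_nonoverlap non-overlapping peptides and
    >= min_coverage fraction of residues covered."""
    # Greedy count of non-overlapping intervals (earliest end first).
    count, last_end = 0, -1
    for start, end in sorted(peptides, key=lambda p: p[1]):
        if start > last_end:
            count, last_end = count + 1, end
    # Sequence coverage from the union of all intervals.
    covered = set()
    for start, end in peptides:
        covered.update(range(start, end + 1))
    return count >= min_nonoverlap and len(covered) / length >= min_coverage

print(accept_protein(100, [(0, 9), (5, 14), (40, 49)]))  # True
```

Because most confidently identified proteins pass such criteria in every round, the heuristic gives the stable "guessed true" label set the unsupervised training loop needs.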
Slide 23: What About Real Data?
- Dr. Rado Goldman (LCCC, GUMC): proteolytic serum peptides from clinical hepatocellular carcinoma samples
- ~200 MALDI MS/MS spectra (TOF-TOF)
- PepArML for non-specific search of IPI-Human
- Increase in confidence & sensitivity
- Observation of "ragged" proteolytic trimming
Slide 24: Protein Identification Example
[Figure: protein identification example with M, T, O annotations]
Slide 25: Future Directions
- Apply to more experimental datasets
- Integrate novel features
  - New search engines, spectral matching
  - Multiple searches with varied parameters, sequence databases
- Construct a meta-search engine
- FDR by bimodal fit instead of decoys
- Release as open source: http://peparml.sourceforge.org
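The decoy-free FDR direction above typically models the observed score distribution as a mixture of an "incorrect" and a "correct" component and reads off the posterior probability that a given score came from the incorrect one. The sketch below assumes two normal components with fixed parameters; in practice the means, spreads, and mixing proportion would be fitted to the data (e.g., by EM), and the specific parameter values here are purely illustrative.

```python
# Sketch of local FDR from a two-component (bimodal) score model.
import math

def normal_pdf(x, mu, sigma):
    return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
            / (sigma * math.sqrt(2 * math.pi)))

def local_fdr(score, pi0, bad=(2.0, 1.0), good=(7.0, 1.5)):
    """pi0: prior fraction of incorrect PSMs; bad/good: (mean, sd)
    of the incorrect and correct score components."""
    f0 = pi0 * normal_pdf(score, *bad)          # incorrect component
    f1 = (1 - pi0) * normal_pdf(score, *good)   # correct component
    return f0 / (f0 + f1)                        # P(incorrect | score)

print(round(local_fdr(1.0, pi0=0.7), 3))  # near 1: low score, likely wrong
print(round(local_fdr(8.0, pi0=0.7), 3))  # near 0: high score, likely right
```

This removes the need to double the search space with a decoy database, at the cost of trusting the parametric fit.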
Slide 26: http://PepArML.SourceForge.Net
Slide 27: Acknowledgements
- Xue Wu* & Dr. Chau-Wen Tseng, Computer Science, University of Maryland, College Park
- Dr. Brian Balgley, Dr. Paul Rudnick, Calibrant Biosystems & NIST
- Dr. Rado Goldman, Dr. Yanming An, Department of Oncology, Georgetown University Medical Center
- Kam Ho To, Biochemistry Masters student, Georgetown University
- Funding: NIH/NCI CPTAC
Slide 29: PepArML vs Search Engines (S17)
Slide 30: PepArML vs Search Engines (S17)
Slide 31: PepArML Pairs vs PepArML (C8)
Slide 32: PepArML Pairs vs PepArML (S17)
Slide 33: PepArML Pairs vs PepArML (S17)
Slide 34: Unsupervised Learning (C8)
Slide 35: Unsupervised Learning (C8)