Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center.

Presentation transcript:

Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

2 Peptide Identification Results
  - Search engines provide an answer for every spectrum...
    - Can we figure out which ones to believe?
  - Why is this hard?
    - Hard to determine “good” scores
    - Significance estimates are unreliable
    - Need more ids from weak spectra
    - Each search engine has its strengths and weaknesses
    - Search engines give different answers

3 Mascot Search Results

4 Translation start-site correction
  - Halobacterium sp. NRC-1
    - Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins
    - Goo, et al. MCP
  - GdhA1 gene:
    - Glutamate dehydrogenase A1
    - Multiple significant peptide identifications
    - Observed start is consistent with Glimmer 3.0 prediction(s)

5 Halobacterium sp. NRC-1 ORF: GdhA1
  - K-score E-value vs 10% FDR
  - Many peptides inconsistent with annotated translation start site of NP_279651

6 Translation start-site correction

7 Search engine scores are inconsistent!
  [Figure: Mascot vs. Tandem scores]

8 Common Algorithmic Framework – Different Results
  - Pre-process experimental spectra
    - Charge state, cleaning, binning
  - Filter peptide candidates
    - Decide which PSMs to evaluate
  - Score peptide-spectrum match
    - Fragmentation modeling, dot product
  - Rank peptides per spectrum
    - Retain statistics per spectrum
  - Estimate E-values
    - Apply empirical or theoretical model
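The "binning" and "dot product" steps named on this slide can be illustrated with a small sketch. It is not any particular search engine's scoring function: the function names, the 1.0 m/z default bin width, and the cosine normalization are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def bin_peaks(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks that share an m/z bin."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def normalized_dot_product(observed, theoretical, bin_width=1.0):
    """Cosine-style similarity between two binned spectra (0.0 to 1.0)."""
    a = bin_peaks(observed, bin_width)
    b = bin_peaks(theoretical, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Different choices at each step (bin width, peak cleaning, normalization) change the resulting score, which is one reason engines sharing this framework can still rank peptides differently.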

9 Comparison of search engines
  - No single score is comprehensive
  - Search engines disagree
  - Many spectra lack confident peptide assignment
  [Figure: Venn diagram for OMSSA, X!Tandem, and Mascot; region labels 4%, 10%, 2%, 5%, 9%, 69%, 2%]

10 Lots of techniques out there
  - Treat search engines as black-boxes
    - Generate PSMs + scores, features
  - Apply supervised machine learning to results
    - Use multiple match metrics
  - Combine/refine using multiple search engines
    - Agreement suggests correctness
  - Use empirical significance estimates
    - "Decoy" databases (FDR)
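A decoy-based FDR estimate can be sketched as follows. This is one common convention (decoy hits divided by target hits above a score threshold) and assumes a separate decoy search against a database of the same size as the target database, with higher scores being better. The `(score, is_decoy)` tuple layout is a hypothetical representation, not a format from the talk.

```python
def decoy_fdr(psms, threshold):
    """Estimate FDR at a score threshold from target and decoy PSMs.

    psms: iterable of (score, is_decoy) pairs; higher score is better.
    Returns decoy hits / target hits among PSMs scoring >= threshold.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    if targets == 0:
        # No accepted target PSMs: report 1.0 if only decoys passed, else 0.0.
        return 1.0 if decoys else 0.0
    return decoys / targets
```

Other estimators exist (for example, for concatenated target-decoy searches), so the ratio above should be read as an illustration of the idea rather than the only valid formula.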

11 Machine Learning
  - Use of multiple metrics of PSM quality:
    - Precursor delta, trypsin digest features, etc.
  - Requires "training" with examples
    - Different examples will change the result
    - Generalization is always the question
  - Scores can be hard to "understand"
  - Difficult to establish statistical significance
  - Peptide Prophet's discriminant function
    - Weighted linear combination of features
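A "weighted linear combination of features" can be written in a few lines. The feature names and weights below are hypothetical placeholders, not PeptideProphet's actual coefficients; in practice the weights would come from training on labeled examples.

```python
def discriminant_score(features, weights, bias=0.0):
    """Return bias + sum(weight * value) over the named PSM features.

    features, weights: dicts keyed by feature name (names are illustrative).
    """
    return bias + sum(weights[name] * value for name, value in features.items())


# Hypothetical feature values and weights, for illustration only.
example_features = {"engine_score": 2.0, "precursor_delta": 0.5}
example_weights = {"engine_score": 1.5, "precursor_delta": -2.0}
example_score = discriminant_score(example_features, example_weights, bias=-1.0)
```

Because the weights depend on the training examples, the same PSM can receive a different score after retraining, which is the generalization concern raised on this slide.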

12 Combine / Merge Results
  - Threshold peptide-spectrum matches from each of two search engines
    - PSMs agree → boost specificity
    - PSMs from one → boost sensitivity
    - PSMs disagree → ?????
  - Sometimes agreement is "lost" due to threshold...
  - How much should agreement increase our confidence?
  - Scores easy to "understand"
  - Difficult to establish statistical significance
  - How to generalize to more engines?
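The three cases on this slide (agree, from one engine only, disagree) can be sketched as a merge over two already-thresholded result sets. The dict layout (spectrum identifier mapped to peptide sequence) and the status labels are assumptions for illustration; the slide leaves the handling of disagreements open, so the sketch simply flags them without picking a winner.

```python
def merge_results(engine_a, engine_b):
    """Merge thresholded PSMs from two engines.

    engine_a, engine_b: dicts mapping spectrum id -> peptide sequence.
    Returns a dict mapping spectrum id -> (peptide or None, status), where
    status is "agree", "single", or "disagree".
    """
    merged = {}
    for spectrum_id in engine_a.keys() | engine_b.keys():
        pep_a = engine_a.get(spectrum_id)
        pep_b = engine_b.get(spectrum_id)
        if pep_a is not None and pep_b is not None:
            if pep_a == pep_b:
                merged[spectrum_id] = (pep_a, "agree")
            else:
                # Unresolved conflict: no peptide is chosen here.
                merged[spectrum_id] = (None, "disagree")
        else:
            merged[spectrum_id] = (pep_a if pep_a is not None else pep_b, "single")
    return merged
```

Note that a PSM falling just below one engine's threshold shows up here as "single" rather than "agree", which is the "lost" agreement the slide mentions.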

13 Consensus and Meta-Search
  - Multiple witnesses increase confidence
    - As long as they are independent
    - Example: Getting the story straight
  - Independent "random" hits unlikely to agree
    - Agreement is indication of biased sampling
    - Example: loaded dice
  - Meta-search is relatively easy
    - Merging and re-ranking is hard
    - Example: Booking a flight to Denver!
  - Scores and E-values are not comparable
    - How to choose the best answer?
    - Example: Best E-value favors Tandem!

14 Searching for Consensus
  - Search engine quirks can destroy consensus
    - Initial methionine loss as tryptic peptide
    - Charge state enumeration or guessing
    - X!Tandem's refinement mode
    - Pyro-Gln, Pyro-Glu modifications
    - Difficulty tracking spectrum identifiers
    - Precursor mass tolerance (Da vs ppm)
  - Decoy searches must be identical!
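The "Da vs ppm" item is easy to get wrong when configuring several engines, because a ppm tolerance scales with mass while a Da tolerance does not. The helper names below are hypothetical; the conversion itself is just the definition of parts per million.

```python
def ppm_to_da(mass, ppm):
    """Convert a ppm tolerance into an absolute Da window at a given mass."""
    return mass * ppm / 1e6


def within_tolerance(observed, theoretical, tolerance, unit="ppm"):
    """Check whether two precursor masses match under a Da or ppm tolerance."""
    if unit == "ppm":
        window = ppm_to_da(theoretical, tolerance)
    elif unit == "Da":
        window = tolerance
    else:
        raise ValueError("unit must be 'ppm' or 'Da'")
    return abs(observed - theoretical) <= window
```

For example, at a precursor mass of 1000 Da a 10 ppm tolerance corresponds to a 0.01 Da window, so an engine configured with 0.02 Da and another configured with 10 ppm will not consider the same candidate peptides.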

15 Configuring for Consensus
  - Search engine configuration can be difficult:
    - Correct spectral format
    - Search parameter files and command-line
    - Pre-processed sequence databases
    - Tracking spectrum identifiers
    - Extracting peptide identifications, especially modifications and protein identifiers

16 Peptide Identification Meta-Search
  - Simple unified search interface for:
    - Mascot, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch, InsPecT
  - Automatic decoy searches
  - Automatic spectrum file "chunking"
  - Automatic scheduling
    - Serial, Multi-Processor, Cluster, Grid
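Spectrum file "chunking" can be pictured as splitting the list of spectra into fixed-size pieces that are searched independently. This is a minimal sketch of that idea only; the function name and the fixed chunk size are assumptions, and the actual meta-search tool may split files differently.

```python
def chunk_spectra(spectra, chunk_size):
    """Split a list of spectra into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [spectra[i:i + chunk_size] for i in range(0, len(spectra), chunk_size)]
```

Each chunk can then be dispatched as its own search job, which is what allows the same request to run serially, on multiple processors, or on a cluster or grid. Spectrum identifiers need to survive the split so that results can be matched up again afterwards.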

17 Peptide Identification Grid-Enabled Meta-Search
  - Secure communication
  - Heterogeneous compute resources
  - Single, simple search request
  - Scales easily to 250+ simultaneous searches
  [Diagram: NSF TeraGrid CPUs; UMIACS 250+ CPUs; Edwards Lab Scheduler & 80+ CPUs. Engine labels: "X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core)"; "X!Tandem, KScore, OMSSA" (twice)]

18 PepArML
  - Peptide identification arbiter by machine learning
  - Unifies these ideas within a model-free, combining machine learning framework
  - Unsupervised training procedure

19 PepArML Overview
  [Diagram labels: X!Tandem, Mascot, OMSSA, Other; PepArML; Feature extraction]

20 Dataset Construction
  [Figure: dataset table with X!Tandem, Mascot, and OMSSA columns and T/F labels ……]

21 Voting Heuristic Combiner
  - Choose PSM with most votes
  - Break ties using FDR
    - Select PSM with min. FDR of tied votes
  - How to apply this to a decoy database?
    - Lots of possibilities – all imperfect
    - Now using: 100 * #votes - min. decoy hits
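The scoring rule on this slide, 100 * #votes - min. decoy hits, can be sketched directly. The interpretation here is an assumption: each candidate peptide for a spectrum carries one decoy-hit count per engine that reported it, so the number of counts is the number of votes. The data layout and function names are hypothetical.

```python
def voting_score(decoy_hits):
    """Score a candidate peptide: 100 * votes - minimum decoy hits.

    decoy_hits: non-empty list with one decoy-hit count per engine
    that reported this peptide for the spectrum.
    """
    votes = len(decoy_hits)
    return 100 * votes - min(decoy_hits)


def choose_psm(candidates):
    """Pick the peptide with the highest voting score for one spectrum.

    candidates: dict mapping peptide sequence -> list of decoy-hit counts.
    """
    return max(candidates, key=lambda peptide: voting_score(candidates[peptide]))
```

Under this reading, votes dominate and the decoy-hit count breaks ties, as long as the minimum decoy-hit count stays below 100; beyond that, a candidate with fewer votes could outrank one with more.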

22 Supervised Learning

23 Feature Evaluation

24 Application to Real Data
  - How well do these models generalize?
    - Different instruments
      - Spectral characteristics change scores
    - Search parameters
      - Different parameters change score values
  - Supervised learning requires
    - (Synthetic) experimental data from every instrument
    - Search results from available search engines
    - Training/models for all parameters x search engine sets x instruments

25 Model Generalization

26 Unsupervised Learning

27 Unsupervised Learning Performance

28 Unsupervised Learning Convergence

29 Peptide Atlas A8_IP – LTQ

30 OMICS 17 Protein Mix – LCQ

31 Feature Selection (InfoGain)

32 Conclusions
  - Combining search results from multiple engines can be very powerful
    - Boost both sensitivity and specificity
  - Running multiple search engines is hard
  - Statistical significance is hard
    - Use empirical FDR estimates...but be careful...lots of subtleties
  - Consensus is powerful, but fragile
    - Search engine quirks can destroy it
    - "Witnesses" are not independent