Improving the Sensitivity of Peptide Identification Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical.


Improving the Sensitivity of Peptide Identification Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center Xue Wu, Chau-Wen Tseng Department of Computer Science University of Maryland, College Park

2 Lost peptide identifications
  Missing from the sequence database
  Search engine strengths, weaknesses, quirks
  Poor score or statistical significance
  Thorough search takes too long

3 Lost peptide identifications
  Missing from the sequence database
    Build exhaustive peptide sequence databases
  Search engine strengths, weaknesses, quirks
    Use multiple search engines and combine results
  Poor score or statistical significance
    Use spectral-matching to identify weak spectra
    Use search-engine consensus to boost confidence
    Use machine-learning to distinguish true from false
  Thorough search takes too long
    Harness the power of heterogeneous computational grids

4 Peptide Sequence Databases
  All peptides at most 30 amino acids long from:
    IPI and all IPI constituent protein sequences
      IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank
    SwissProt variants, conflicts, splices, and signal peptide truncations
    GenBank and RefSeq mRNA sequence
      3-frame translation
    GenBank EST and HTC sequences
      6-frame translation, and found in at least 2 sequences
  Grouped by UniGene cluster and compressed.
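The enumeration rule on this slide (all peptides of at most 30 amino acids, with EST-derived peptides kept only if found in at least 2 sequences) can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the `min_len` parameter, and the use of `*` to mark translation stops are assumptions made for the example, and the UniGene grouping and compression steps are not shown.

```python
from collections import defaultdict

MAX_LEN = 30  # from the slide: peptides at most 30 amino acids long


def substrings_up_to(seq, max_len=MAX_LEN, min_len=1):
    """Yield every contiguous substring of seq with min_len <= length <= max_len."""
    n = len(seq)
    for start in range(n):
        for end in range(start + min_len, min(start + max_len, n) + 1):
            yield seq[start:end]


def peptides_with_support(sequences, min_support=2, max_len=MAX_LEN, min_len=1):
    """Return the peptides seen in at least min_support distinct input sequences.

    Assumption: each input is a translated amino-acid string in which '*'
    marks a stop codon, so no peptide is allowed to span a '*'.
    """
    support = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for orf in seq.split("*"):
            for pep in substrings_up_to(orf, max_len, min_len):
                support[pep].add(idx)  # count each sequence once per peptide
    return {pep for pep, ids in support.items() if len(ids) >= min_support}


if __name__ == "__main__":
    translated = ["ACDE", "CDEF", "XYZ"]  # toy stand-ins for translated ESTs
    print(sorted(peptides_with_support(translated, min_len=3)))
```

Holding every substring in memory is only practical for toy inputs; databases of the sizes reported in these slides would need a more compact representation.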

5 Peptide Sequence Databases
  Formatted as a FASTA sequence database
    Easy integration with search engines.
    One entry per gene/cluster.
    Automated rebuild every few months.

  Organism    Size (AA)  Size (Entries)
  Human       209Mb      75,043
  Mouse       151Mb      55,929
  Rat          67Mb      43,211
  Zebra-fish   90Mb      47,922

6 Spectral Matching with HMMs

7 [Figure: HMM state diagram with insert states I0 to I6 and ion states b1, y1, b2, y2, b3, y3, each labeled with a percentage]

8 Hidden Markov Model
  States: Ion, Delete, Insert
  (m/z, int) pair emitted by ion & insert states

9 Boosting Identification Sensitivity
  Test / Train / Other (High confidence ids only)
  Other / Model / None (Low confidence ids)

10 Spectral Matching of Peptide Variants
  DFLAGGVAAAISK
  DFLAGGIAAAISK

11 Spectral Matching Extrapolation

12 Comparison of search engine results
  No single score is comprehensive
  Search engines disagree
  Many spectra lack confident peptide assignment
  [Figure: Venn diagram of identifications by X!Tandem, SEQUEST, and Mascot; extracted region labels: 14%, 28%, 14%, 3%, 2%, 1%. Source: Searle et al. JPR 7(1)]

13 Combining search engine results – harder than it looks!
  Consensus boosts confidence, but...
    How to assess statistical significance?
    Gain specificity, but lose sensitivity!
    Incorrect identifications are correlated too!
  How to handle weak identifications?
    Consensus vs disagreement vs abstention
    Threshold at some significance?
  We apply unsupervised machine-learning...
    Lots of related work unified in a single framework.
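To make the "consensus vs disagreement vs abstention" distinction concrete, here is a small sketch that tallies, for one spectrum, what each search engine reported. It is not the PepArML method (which the slides describe as unsupervised machine learning) and it assigns no statistical significance; the labels, the function name, and the convention that `None` means an engine abstained are all assumptions made for illustration.

```python
from collections import Counter


def classify_spectrum(assignments):
    """Classify the per-engine results for one spectrum.

    assignments maps an engine name to the peptide it reported, or to None
    if the engine reported nothing (abstention). Returns a tuple
    (label, peptide, n_supporting_engines).
    """
    votes = Counter(p for p in assignments.values() if p is not None)
    if not votes:
        return ("abstention", None, 0)
    if len(votes) > 1:
        # Engines reported different peptides; no winner is chosen here.
        return ("disagreement", None, 0)
    peptide, count = next(iter(votes.items()))
    label = "consensus" if count >= 2 else "single-engine"
    return (label, peptide, count)


if __name__ == "__main__":
    example = {"Mascot": "DFLAGGVAAAISK", "X!Tandem": "DFLAGGVAAAISK", "OMSSA": None}
    print(classify_spectrum(example))
```

A tally like this shows why naive consensus trades sensitivity for specificity: every "single-engine" and "disagreement" spectrum is lost unless something smarter arbitrates between the engines.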

14 Supervised Learning

15 Unsupervised Learning

16 PepArML Combining Results
  [Figure panels: Q-TOF, LTQ, MALDI]

17 Unsupervised Learning
  [Figure: curves labeled H, C-TMO, U-TMO, U*-TMO; axes: False Positive Rate, Iteration]

18 Searching for Consensus
  Search engine quirks can destroy consensus
    Initial methionine loss as tryptic peptide
    Charge state enumeration or guessing
    X!Tandem's refinement mode
    Pyro-Gln, Pyro-Glu modifications
    Difficulty tracking spectrum identifiers
    Precursor mass tolerance (Da vs ppm)
  Decoy searches must be identical!
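The "Precursor mass tolerance (Da vs ppm)" quirk is easy to see with a little arithmetic: a ppm tolerance scales with the value it is applied to, while a Da tolerance is fixed, so the same nominal setting accepts different candidates in different engines. The sketch below uses hypothetical helper names and, as a simplifying assumption, applies the tolerance relative to the theoretical value.

```python
def tolerance_da(reference_mz, tol, unit):
    """Convert a tolerance to an absolute window in Da at reference_mz.

    unit is "Da" (absolute) or "ppm" (parts per million of reference_mz).
    """
    if unit == "Da":
        return tol
    if unit == "ppm":
        return reference_mz * tol * 1e-6
    raise ValueError(f"unknown tolerance unit: {unit!r}")


def within_tolerance(observed_mz, theoretical_mz, tol, unit):
    """True if the observed value lies within the tolerance window."""
    return abs(observed_mz - theoretical_mz) <= tolerance_da(theoretical_mz, tol, unit)


if __name__ == "__main__":
    # 10 ppm at m/z 1000 is a window of about 0.01 Da.
    print(tolerance_da(1000.0, 10, "ppm"))
    # The same 0.5 difference passes a 1 Da window but fails a 10 ppm window.
    print(within_tolerance(1000.5, 1000.0, 1, "Da"))
    print(within_tolerance(1000.5, 1000.0, 10, "ppm"))
```

For consensus, every engine would need to be configured with windows that are equivalent after this conversion, not merely the same number.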

19 Configuring for Consensus
  Search engine configuration can be difficult:
    Correct spectral format
    Search parameter files and command-line
    Pre-processed sequence databases.
    Tracking spectrum identifiers
    Extracting peptide identifications, especially modifications and protein identifiers

20 Peptide Identification Meta-Search
  Parameters: Instrument, Precursor Tolerance, Fragment Tolerance, Max. Charge
  Sequence Database: Target/Decoy
  Modification: Fixed/Variable, Amino-Acids, Position, Delta
  Proteolytic Agent: Motif
  Peptide Candidates: Termini, Specificity, Precursor Tolerance, Missed cleavages, Charge State Handling, # 13C Peaks
  Search Engines: Mascot, X!Tandem, OMSSA, MyriMatch

21 Peptide Identification Meta-Search
  Simple unified search interface for:
    Mascot, X!Tandem, OMSSA, MyriMatch
  Automatic decoy searches
  Automatic spectrum file "chunking"
  Automatic scheduling
    Serial, Multi-Processor, Cluster, Grid
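Spectrum file "chunking" splits one large spectrum file into smaller pieces that can be searched independently and scheduled across processors. The slides do not name the spectrum format or the chunking rule, so the sketch below makes two assumptions: the input is in Mascot Generic Format (MGF), where each spectrum is delimited by `BEGIN IONS` and `END IONS` lines, and each chunk holds a fixed number of spectra. The function name is hypothetical.

```python
def chunk_mgf(lines, spectra_per_chunk):
    """Split MGF text lines into chunks of at most spectra_per_chunk spectra.

    Only lines inside BEGIN IONS ... END IONS blocks are kept, so a spectrum
    is never split across two chunks. Returns a list of lists of lines.
    """
    if spectra_per_chunk < 1:
        raise ValueError("spectra_per_chunk must be at least 1")
    chunks, current, count = [], [], 0
    in_spectrum = False
    for line in lines:
        stripped = line.strip()
        if stripped == "BEGIN IONS":
            in_spectrum = True
        if in_spectrum:
            current.append(line)
        if stripped == "END IONS":
            in_spectrum = False
            count += 1
            if count == spectra_per_chunk:
                chunks.append(current)
                current, count = [], 0
    if current:  # final, possibly smaller, chunk
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    spectrum = ["BEGIN IONS", "TITLE=scan", "PEPMASS=500.25", "100.1 20", "END IONS"]
    mgf_lines = spectrum * 3
    for i, chunk in enumerate(chunk_mgf(mgf_lines, 2)):
        print(i, chunk.count("BEGIN IONS"))
```

Keeping the original `TITLE` lines intact in each chunk is one way to address the spectrum-identifier tracking problem raised earlier, since results from separate chunks can then be mapped back to the original file.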

22 Peptide Identification Meta-Search
  NSF TeraGrid CPUs
  UMIACS 250+ CPUs
  Edwards Lab Scheduler & 48+ CPUs
  Secure communication
  Heterogeneous compute resources
  Simple search request

23 Conclusions
  Improve sensitivity of peptide identification
    Exhaustive peptide sequence databases
    Machine-learning for matching and combining
    Meta-search tools maximize consensus
    Grid-computing to achieve thorough search

24 Acknowledgements
  Catherine Fenselau, University of Maryland Biochemistry
  Funding: NIH/NCI, USDA/ARS