Improving the Reliability of Peptide Identification by Tandem Mass Spectrometry Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology.

Presentation transcript:

Improving the Reliability of Peptide Identification by Tandem Mass Spectrometry Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

2 Mass Spectrometry for Proteomics. Measure mass of many (bio)molecules simultaneously: high bandwidth. Mass is an intrinsic property of all (bio)molecules: no prior knowledge required.

3 Mass Spectrometer. [Diagram: Sample → Ionizer (+/−) → Mass Analyzer → Detector.] Ionizer: MALDI, Electro-Spray Ionization (ESI). Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap. Detector: Electron Multiplier (EM).

4 High Bandwidth

5 Mass is fundamental!

6 Mass Spectrometry for Proteomics. Measure mass of many molecules simultaneously... but not too many (abundance bias). Mass is an intrinsic property of all (bio)molecules... but need a reference to compare to.

7 Mass Spectrometry for Proteomics. Mass spectrometry has been around since the turn of the century... why is MS-based proteomics so new? Ionization methods: MALDI, Electrospray. Protein chemistry & automation: chromatography, gels, computers. Protein sequence databases: a reference for comparison.

8 Sample Preparation for Peptide Identification Enzymatic Digest and Fractionation

9 Single Stage MS. [Figure: MS spectrum; axis: m/z.]

10 Tandem Mass Spectrometry (MS/MS). Precursor selection. [Figure: spectrum; axis: m/z.]

11 Tandem Mass Spectrometry (MS/MS). Precursor selection + collision induced dissociation (CID). [Figure: MS/MS spectrum; axis: m/z.]

12 The big picture... MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. Key concepts: spectrum acquisition is unbiased; direct observation of amino-acid sequence; sensitive to minor sequence variation; observed peptides represent folded proteins.

13 Peptide Identification. For each (likely) peptide sequence: 1. Compute fragment masses; 2. Compare with spectrum; 3. Retain those that match well. Peptide sequences come from protein sequence databases: Swiss-Prot, IPI, NCBI's nr, ... Automated, high-throughput peptide identification in complex mixtures.
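The three steps on slide 13 can be sketched in a few lines of Python. This is only an illustration, not the scoring used by any particular search engine: the function names, the 0.5 Da tolerance, and the simple shared-peak count are assumptions, and only singly charged b- and y-ions are considered.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728
WATER = 18.01056


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # prefixes give b-ions
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # suffixes give y-ions
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def count_matches(peptide, peaks_mz, tol=0.5):
    """Step 2: number of predicted fragments with a peak within tol (hypothetical score)."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(any(abs(f - p) <= tol for p in peaks_mz) for f in b_ions + y_ions)


def best_candidates(candidates, peaks_mz, tol=0.5, keep=1):
    """Step 3: retain the candidates that match the spectrum best."""
    ranked = sorted(candidates, key=lambda pep: count_matches(pep, peaks_mz, tol),
                    reverse=True)
    return ranked[:keep]
```

For example, a noise-free spectrum built from the fragments of DLATVYVDVLK ranks that peptide ahead of DFLAGGVAAAISK. Real search engines also use peak intensities, multiple charge states, and a precursor mass filter.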

14 Peptide Identification, but... What about novel peptides? Search compressed ESTs (C3, PepSeqDB). What about peak intensity? Spectral matching using HMMs (HMMatch). Which identifications are correct? Unsupervised, model-free result combiner with false discovery rate estimation.

15 Why don't we see more novel peptides? Tandem mass spectrometry doesn't discriminate against novel peptides... but protein sequence databases do! Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

16 What goes missing? Known coding SNPs; novel coding mutations; alternative splicing isoforms; alternative translation start-sites; microexons; alternative translation frames.

17 Why should we care? Alternative splicing is the norm! Only 20-25K human genes; each gene makes many proteins. Proteins have clinical implications: biomarker discovery. Evidence for SNPs and alternative splicing stops with transcription: genomic assays, ESTs, mRNA sequence. Little hard evidence for translation start site.

18 Novel Splice Isoform. Human Jurkat leukemia cell-line; lipid-raft extraction protocol, targeting T cells (von Haller, et al. MCP). LIME1 gene: LCK interacting transmembrane adaptor 1. LCK gene: leukocyte-specific protein tyrosine kinase; proto-oncogene; chromosomal aberration involving LCK in leukemias. Multiple significant peptide identifications.

19 Novel Splice Isoform

20 Novel Splice Isoform

21 Novel Mutation. HUPO Plasma Proteome Project: pooled samples from 10 male & 10 female healthy Chinese subjects; plasma/EDTA sample protocol (Li, et al. Proteomics; Lab 29). TTR gene: transthyretin (pre-albumin). Defects in TTR are a cause of amyloidosis. Familial amyloidotic polyneuropathy: late-onset, dominant inheritance.

22 Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy

23 Novel Mutation

24 Searching ESTs. Proposed long ago: Yates, Eng, and McCormack; Anal Chem, '95. Now: protein sequences are sufficient for protein identification; EST searching is computationally expensive/infeasible and difficult to interpret. Goal: make EST searching feasible for routine searching to discover novel peptides.

25 Searching Expressed Sequence Tags (ESTs). Pros: no introns! Primary splicing evidence for annotation pipelines; evidence for dbSNP; often derived from clinical cancer samples. Cons: no frame; large (8Gb); "untrusted" by annotation pipelines; highly redundant; nucleotide error rate ~ 1%.

26 Compressed EST Peptide Sequence Database. For all ESTs mapped to a UniGene gene: six-frame translation; eliminate ORFs < 30 amino-acids; eliminate amino-acid 30-mers observed once; compress to C² FASTA database (complete, correct for amino-acid 30-mers). Gene-centric peptide sequence database. Size: < 3% of naïve enumeration, FASTA entries. Running time: ~ 1% of naïve enumeration search. E-values: ~ 2% of naïve enumeration search results.
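The first three steps of slide 26 (six-frame translation, dropping short ORFs, and dropping 30-mers seen only once) can be sketched as follows. This is a simplified illustration and not the author's compression pipeline: the function names are hypothetical, an "ORF" is taken to be any stop-to-stop stretch, 30-mers are counted across the retained ORF segments, and the C² compression step itself is not shown.

```python
from collections import Counter

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_len=30):
    """Stop-to-stop segments of at least min_len residues from all six frames."""
    dna = dna.upper()
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            protein = translate(strand[frame:])
            orfs.extend(seg for seg in protein.split("*") if len(seg) >= min_len)
    return orfs


def repeated_kmers(orfs, k=30):
    """Amino-acid k-mers observed more than once across the ORF segments."""
    counts = Counter(seg[i:i + k] for seg in orfs
                     for i in range(len(seg) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}
```

Requiring each 30-mer to be seen at least twice is one way to suppress peptides that exist only because of a sequencing error in a single EST, which matters given the ~1% nucleotide error rate noted on slide 25.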

27 PepSeq FASTA Databases. Organisms: HUMAN, MOUSE, RAT, ZEBRA FISH. Peptide evidence: Genbank mRNA, EST, HTC; RefSeq mRNA, proteins; Swiss-Prot/TrEMBL, EMBL, VEGA, H-Inv, IPI proteins; Swiss-Prot variants; Swiss-Prot signal peptide & init. Met removal. Single FASTA entry per gene.

28 Spectral Matching for Peptide Identification. Detection vs. identification: increased sensitivity & specificity, but no novel peptides! NIST GC/MS Spectral Library: identifies small molecules; 100,000's of (consensus) spectra; bundled/sold with many instruments; "dot-product" spectral comparison. Current project: peptide MS/MS.
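A "dot-product" comparison of two spectra can be sketched as a cosine similarity over binned peak intensities. This is a plain, unweighted version for illustration; the function names and the 1.0 m/z bin width are assumptions, and library search tools typically add intensity and m/z weighting that is not shown here.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def dot_product_score(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two binned spectra, in [0, 1]."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Identical spectra score 1 and spectra with no shared bins score 0, so a query spectrum can be ranked against every library entry by this value.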

29 NIST MS Search: Peptides

30 Peptide DLATVYVDVLK

31 Protein Families

32 Protein Families

33 Peptide DLATVYVDVLK

34 Hidden Markov Models for Spectral Matching. Capture statistical variation and consensus in peak intensity: only need 10 spectra to build a model. Capture semantics of peaks: extrapolate model to other peptides. Good specificity with superior sensitivity for peptide detection: assign 1000's of additional spectra (p-value < ).

35 Hidden Markov Model. [Figure: HMM with ion, delete, and insert states.] An (m/z, int) pair is emitted by ion & insert states.

36 The devil in the details. Intensity normalization; discretize (m/z, int) pairs; Viterbi distance as score; compute p-value using "random" spectra.
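The first two details on slide 36 can be illustrated with a small sketch. The slides do not say how intensities are normalized or how the pairs are discretized, so the choices here (scaling to the base peak, unit m/z bins, four intensity levels) and the function names are assumptions made only to show the idea.

```python
def normalize(peaks):
    """Scale intensities so the most intense (base) peak equals 1.0."""
    if not peaks:
        return []
    top = max(intensity for _, intensity in peaks)
    if top <= 0:
        return []
    return [(mz, intensity / top) for mz, intensity in peaks]


def discretize(peaks, mz_step=1.0, levels=4):
    """Map each (m/z, int) pair to a discrete (m/z bin, intensity level) symbol."""
    symbols = []
    for mz, rel in normalize(peaks):
        mz_bin = int(mz // mz_step)
        level = min(int(rel * levels), levels - 1)  # base peak falls in the top level
        symbols.append((mz_bin, level))
    return symbols
```

Discrete symbols of this kind give the ion and insert states a finite alphabet to emit, so emission probabilities can be estimated by counting over the training spectra.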

37 Random Spectra. Uniform sample of (m/z, int). Permutation (m/z) of true spectra peaks. M/z distribution between true spectra and uniform sample (parameter). [Figure: # of spectra vs. Viterbi score; series: Random, True, False.]
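One way to read slides 36 and 37 is that a score is judged against the scores of "random" spectra whose m/z values are drawn partly from true spectra and partly from a uniform distribution. The sketch below follows that reading; the `mix` parameter, the function names, and the assumption that a higher score is better are all hypothetical and not taken from the slides.

```python
import random


def random_spectrum(true_mzs, n_peaks, mz_range, mix, rng):
    """Draw n_peaks (m/z, int) pairs.

    With probability `mix` an m/z is sampled from the pool of true-spectrum
    m/z values, otherwise uniformly from mz_range. Intensities are uniform.
    """
    lo, hi = mz_range
    peaks = []
    for _ in range(n_peaks):
        mz = rng.choice(true_mzs) if rng.random() < mix else rng.uniform(lo, hi)
        peaks.append((mz, rng.random()))
    return sorted(peaks)


def empirical_p_value(observed, null_scores):
    """Fraction of random-spectrum scores at least as good as the observed score.

    The +1 terms keep the estimate above zero for a finite null sample.
    """
    at_least = sum(1 for s in null_scores if s >= observed)
    return (at_least + 1) / (len(null_scores) + 1)
```

In use, each random spectrum would be scored against the peptide's model, and the resulting list passed as `null_scores`. If the score is a distance where lower is better, the comparison in `empirical_p_value` would need to be reversed.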

38 HMM Peptide Identification Results – DLATV

39 Spectral Matching of Peptide Variants DFLAGGVAAAISK DFLAGGIAAAISK

40 HMM model extrapolation

41 Mascot Search Results

42 Peptide Identification Results. Search engines always provide an answer. Current search engines: hard to determine "good" scores; significance estimates are unreliable. Need better methods!

43 Common Algorithmic Framework. Pre-process experimental spectra; filter peptide candidates; score match between peptides and spectra; rank peptides and assign

44 Comparison of search engines. No single score is comprehensive. Search engines disagree. Many spectra lack confident peptide assignment. [Figure: overlap of X!Tandem, Mascot, and OMSSA identifications; region percentages: 4%, 10%, 2%, 5%, 9%, 69%, 2%.]

45 Lots of published solutions! Treat search engines as black-boxes: apply supervised machine learning to results; use multiple match metrics. Combine/refine using multiple search engines: agreement suggests correctness. Use empirical significance estimates: "decoy" databases (FDR).
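The "decoy" database idea on slide 45 can be sketched as follows: matches to decoy sequences above a score threshold estimate how many target matches above that threshold are wrong. The function names are hypothetical, and the decoys/targets ratio is just one common convention (others exist, for example for concatenated target-decoy searches).

```python
def fdr_at_threshold(psms, threshold):
    """Estimated FDR among target matches scoring at or above threshold.

    psms is a list of (score, is_decoy) pairs; higher scores are better.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def threshold_for_fdr(psms, max_fdr=0.05):
    """Lowest observed score whose estimated FDR is within max_fdr, or None."""
    best = None
    for score in sorted({s for s, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, score) <= max_fdr:
            best = score  # keep lowering the threshold while the FDR allows it
    return best
```

Because the estimate comes from the data themselves, it does not depend on the search engine's own significance model, which is the appeal of an empirical estimate.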

46 PepArML: Peptide identification arbiter by machine learning. Unifies these ideas within a model-free, combining machine learning framework. Unsupervised training procedure.

47 PepArML Overview. Unify Tandem, Mascot, and OMSSA results. [Diagram: X!Tandem, Mascot, OMSSA, and other search tools → PepArML → identified / unidentified spectra.]

48 Voting Heuristic Combiner. Choose peptide ID with the most votes; use best FDR as confidence; break ties (single votes) using FDR. Strawman for comparison.
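The voting heuristic on slide 48 is simple enough to write down directly. The sketch below handles one spectrum; the input layout (a dict from engine name to a peptide and its FDR) and the function name are assumptions.

```python
from collections import defaultdict


def vote(engine_ids):
    """Combine one spectrum's identifications from several search engines.

    engine_ids maps engine name -> (peptide, fdr). Returns the winning
    peptide and the best (lowest) FDR any engine reported for it.
    """
    votes = defaultdict(list)
    for peptide, fdr in engine_ids.values():
        votes[peptide].append(fdr)
    # Most votes first; among equal vote counts, the lowest FDR wins.
    peptide, fdrs = min(votes.items(),
                        key=lambda kv: (-len(kv[1]), min(kv[1])))
    return peptide, min(fdrs)
```

For example, if X!Tandem and Mascot both report DFLAGGVAAAISK and OMSSA reports DFLAGGIAAAISK with a better FDR, the two-vote peptide still wins. When all three engines disagree, the peptide with the lowest FDR is chosen.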

49 Dataset construction. [Diagram: spectra → search tools (X!Tandem, Mascot, OMSSA, other) → compare → extract features → machine learning.] Features: matched ions, peak intensity, mass delta, # of missed cleavages, peptide length, Tandem score, Mascot score, OMSSA score.

50 Dataset construction. Build feature vectors. [Figure: table of feature vectors with Tandem, Mascot, and OMSSA columns and true/false (T/F) labels.]

51 Dataset construction. Synthetic protein mixtures provide ground truth. C8: 8 standard proteins (Calibrant Biosystems); 4594 MS/MS spectra (LTQ); 618 (11.2%) true positives. S17: 17 standard proteins (Sashimi Repository); 1389 MS/MS spectra (Q-TOF); 354 (25.4%) true positives. AURUM: 364 standard proteins (AURUM 1.0); 7508 MS/MS spectra (MALDI-TOF-TOF); 3775 (50.3%) true positives.

52 Machine learning improves single search engines (S17)

53 Multiple search engines are better than single search engines (S17)

54 Feature Evaluation

55 Application to Real Data. How well do these models generalize? Different instruments: spectral characteristics change scores. Search parameters: different parameters change score values. Supervised learning requires: (synthetic) experimental data from every instrument; search results from available search engines; training/models for all parameters x search engine sets x instruments.

56 Model Generalization

57 Rescuing Machine Learning. Train a new machine-learning model for every dataset! Generalization not required; no predetermined search engines, parameters, instruments, features. Perhaps we can "guess" the true proteins: most proteins not in doubt; machine learning can tolerate imperfect labels.

58 Unsupervised Learning. Heuristic selection of "true" proteins. Train classifier, predict true peptide IDs. Update "true" proteins: heuristic selection of "true" proteins from classifier predictions. Iterate until convergence.
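The loop on slide 58 can be sketched with a deliberately tiny stand-in classifier. Nothing here is PepArML's actual implementation: the nearest-centroid classifier, the "at least two distinct peptides" protein heuristic, the PSM record layout, and all function names are assumptions chosen to keep the example self-contained.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_proteins(psms, accepted, min_peptides=2):
    """Heuristic: a protein is 'true' if enough distinct accepted peptides map to it."""
    peptides = {}
    for psm, ok in zip(psms, accepted):
        if ok:
            peptides.setdefault(psm["protein"], set()).add(psm["peptide"])
    return {prot for prot, peps in peptides.items() if len(peps) >= min_peptides}


def iterate_labels(psms, initial_accept, max_rounds=20):
    """Alternate between labelling by 'true' proteins and re-training until stable."""
    proteins = select_proteins(psms, initial_accept)
    for _ in range(max_rounds):
        # Label every peptide ID by whether its protein is currently believed true.
        labels = [p["protein"] in proteins for p in psms]
        pos = [p["features"] for p, lab in zip(psms, labels) if lab]
        neg = [p["features"] for p, lab in zip(psms, labels) if not lab]
        if not pos or not neg:
            break
        # "Train" the stand-in classifier and predict true peptide IDs.
        c_pos, c_neg = centroid(pos), centroid(neg)
        predicted = [sq_dist(p["features"], c_pos) < sq_dist(p["features"], c_neg)
                     for p in psms]
        # Update the 'true' proteins from the predictions; stop when unchanged.
        updated = select_proteins(psms, predicted)
        if updated == proteins:
            break
        proteins = updated
    return proteins
```

The labels derived from proteins are imperfect, since a true protein can still have wrong peptide IDs attached to it. The loop relies on the point made on slide 57 that machine learning can tolerate such label noise.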

59 Unsupervised Learning Performance

60 Unsupervised Learning Convergence

61 Conclusions. Proteomics can inform genome annotation: eukaryotic and prokaryotic; functional vs silencing variants. Peptides identify more than just proteins: untapped source of disease biomarkers. Computational inference can make a substantial impact in proteomics.

62 Conclusions. Compressed peptide sequence databases make routine EST searching feasible. HMMatch spectral matching improves identification performance for familiar peptides. The unsupervised, model-free, combining PepArML framework solves the peptide identification interpretation problem.

63 Acknowledgements. Chau-Wen Tseng, Xue Wu (UMCP Computer Science); Catherine Fenselau (UMCP Biochemistry); Cheng Lee (Calibrant Biosystems); PeptideAtlas, HUPO PPP, X!Tandem. Funding: NIH/NCI, USDA/ARS.