Published by Pierce Rose. Modified over 9 years ago.
1 Improving the Reliability of Peptide Identification by Tandem Mass Spectrometry
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
2 Mass Spectrometry for Proteomics
- Measure mass of many (bio)molecules simultaneously
  - High bandwidth
- Mass is an intrinsic property of all (bio)molecules
  - No prior knowledge required
3 Mass Spectrometer
[Figure: Sample → Ionizer → Mass Analyzer → Detector]
- Ionizer: MALDI, Electro-Spray Ionization (ESI)
- Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap
- Detector: Electron Multiplier (EM)
4 High Bandwidth
5 Mass is fundamental!
6 Mass Spectrometry for Proteomics
- Measure mass of many molecules simultaneously
  - ...but not too many, abundance bias
- Mass is an intrinsic property of all (bio)molecules
  - ...but need a reference to compare to
7 Mass Spectrometry for Proteomics
Mass spectrometry has been around since the turn of the century...
...why is MS-based proteomics so new?
- Ionization methods
  - MALDI, Electrospray
- Protein chemistry & automation
  - Chromatography, Gels, Computers
- Protein sequence databases
  - A reference for comparison
8 Sample Preparation for Peptide Identification
Enzymatic Digest and Fractionation
9 Single Stage MS
[Figure: MS spectrum, m/z axis]
10 Tandem Mass Spectrometry (MS/MS)
Precursor selection
[Figure: MS spectrum, m/z axis]
11 Tandem Mass Spectrometry (MS/MS)
Precursor selection + collision induced dissociation (CID)
[Figure: MS/MS spectrum, m/z axis]
12 The big picture...
MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.
Key concepts:
- Spectrum acquisition is unbiased
- Direct observation of amino-acid sequence
- Sensitive to minor sequence variation
- Observed peptides represent folded proteins
13 Peptide Identification
For each (likely) peptide sequence:
1. Compute fragment masses
2. Compare with spectrum
3. Retain those that match well
- Peptide sequences from protein sequence databases
  - Swiss-Prot, IPI, NCBI’s nr, ...
- Automated, high-throughput peptide identification in complex mixtures
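The three steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, monoisotopic residue masses, and a simple count of matched fragments as the score. The function names (`fragment_mz`, `count_matches`, `rank_candidates`) and the tolerance value are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of a proton


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


def count_matches(peptide, peaks_mz, tol=0.5):
    """Step 2: number of theoretical fragments with a peak within tol."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(any(abs(f - p) <= tol for p in peaks_mz)
               for f in b_ions + y_ions)


def rank_candidates(candidates, peaks_mz, tol=0.5):
    """Step 3: candidates sorted from best to worst match count."""
    scored = [(count_matches(p, peaks_mz, tol), p) for p in candidates]
    return sorted(scored, reverse=True)
```

Note that leucine and isoleucine have identical masses, so a mass-only score like this one cannot tell them apart.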
14 Peptide Identification, but...
- What about novel peptides?
  - Search compressed ESTs (C3, PepSeqDB)
- What about peak intensity?
  - Spectral matching using HMMs (HMMatch)
- Which identifications are correct?
  - Unsupervised, model-free, result combiner with false discovery rate estimation
15 Why don’t we see more novel peptides?
Tandem mass spectrometry doesn’t discriminate against novel peptides...
...but protein sequence databases do!
Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
16 What goes missing?
- Known coding SNPs
- Novel coding mutations
- Alternative splicing isoforms
- Alternative translation start-sites
- Microexons
- Alternative translation frames
17 Why should we care?
- Alternative splicing is the norm!
  - Only 20-25K human genes
  - Each gene makes many proteins
- Proteins have clinical implications
  - Biomarker discovery
- Evidence for SNPs and alternative splicing stops with transcription
  - Genomic assays, ESTs, mRNA sequence
  - Little hard evidence for translation start site
18 Novel Splice Isoform
- Human Jurkat leukemia cell-line
  - Lipid-raft extraction protocol, targeting T cells
  - von Haller, et al. MCP 2003.
- LIME1 gene:
  - LCK interacting transmembrane adaptor 1
- LCK gene:
  - Leukocyte-specific protein tyrosine kinase
  - Proto-oncogene
  - Chromosomal aberration involving LCK in leukemias.
- Multiple significant peptide identifications
19 Novel Splice Isoform
20 Novel Splice Isoform
21 Novel Mutation
- HUPO Plasma Proteome Project
  - Pooled samples from 10 male & 10 female healthy Chinese subjects
  - Plasma/EDTA sample protocol
  - Li, et al. Proteomics 2005. (Lab 29)
- TTR gene
  - Transthyretin (pre-albumin)
  - Defects in TTR are a cause of amyloidosis.
  - Familial amyloidotic polyneuropathy: late-onset, dominant inheritance
22 Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy
23 Novel Mutation
24 Searching ESTs
- Proposed long ago:
  - Yates, Eng, and McCormack; Anal Chem, ’95.
- Now:
  - Protein sequences are sufficient for protein identification
  - Computationally expensive/infeasible
  - Difficult to interpret
- Make EST searching feasible for routine searching to discover novel peptides.
25 Searching Expressed Sequence Tags (ESTs)
Pros
- No introns!
- Primary splicing evidence for annotation pipelines
- Evidence for dbSNP
- Often derived from clinical cancer samples
Cons
- No frame
- Large (8Gb)
- “Untrusted” by annotation pipelines
- Highly redundant
- Nucleotide error rate ~ 1%
26 Compressed EST Peptide Sequence Database
- For all ESTs mapped to a UniGene gene:
  - Six-frame translation
  - Eliminate ORFs < 30 amino-acids
  - Eliminate amino-acid 30-mers observed once
  - Compress to C2 FASTA database
    - Complete, Correct for amino-acid 30-mers
- Gene-centric peptide sequence database:
  - Size: < 3% of naïve enumeration, 20774 FASTA entries
  - Running time: ~ 1% of naïve enumeration search
  - E-values: ~ 2% of naïve enumeration search results
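The first three filtering steps can be sketched as follows. This is an illustrative sketch only, not the C2/C3 compression itself. It assumes the standard genetic code, treats an "ORF" as any stop-free stretch of a translated frame, and counts 30-mers per ORF occurrence; the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; ambiguous codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_len=30):
    """Stop-free stretches of at least min_len residues in all six frames."""
    dna = dna.upper()
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            for orf in translate(strand[frame:]).split("*"):
                if len(orf) >= min_len:
                    orfs.append(orf)
    return orfs


def repeated_kmers(orfs, k=30):
    """Amino-acid k-mers observed more than once across the given ORFs."""
    counts = Counter(orf[i:i + k]
                     for orf in orfs
                     for i in range(len(orf) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}
```

Dropping 30-mers seen only once is a natural guard against the ~1% nucleotide error rate noted for ESTs: a sequencing error is unlikely to recur identically in an independent EST.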
27 PepSeq FASTA Databases
- Organisms: HUMAN, MOUSE, RAT, ZEBRA FISH
- Peptide Evidence:
  - Genbank mRNA, EST, HTC
  - RefSeq mRNA, Proteins
  - Swiss-Prot/TrEMBL, EMBL, VEGA, H-Inv, IPI Proteins
  - Swiss-Prot variants
  - Swiss-Prot signal peptide & init. Met removal
- Single FASTA entry per Gene
28 Spectral Matching for Peptide Identification
- Detection vs. identification
  - Increased sensitivity & specificity
  - No novel peptides!
- NIST GC/MS Spectral Library
  - Identifies small molecules, 100,000’s of (consensus) spectra
  - Bundled/Sold with many instruments
  - “Dot-product” spectral comparison
  - Current project: Peptide MS/MS
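A "dot-product" comparison treats each spectrum as a vector of binned intensities and measures the cosine of the angle between two such vectors. The sketch below assumes unit-width m/z bins and no intensity weighting; it does not reproduce the NIST scoring, which may weight peaks differently. The function names and the `bin_width` parameter are hypothetical.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def dot_product_score(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product (cosine similarity) of two binned spectra."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

A score near 1 means the two spectra share peaks with similar relative intensities; a score of 0 means no bins in common.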
29 NIST MS Search: Peptides
30 Peptide DLATVYVDVLK
31 Protein Families
32 Protein Families
33 Peptide DLATVYVDVLK
34 Hidden Markov Models for Spectral Matching
- Capture statistical variation and consensus in peak intensity
  - Only need 10 spectra to build a model
- Capture semantics of peaks
  - Extrapolate model to other peptides
- Good specificity with superior sensitivity for peptide detection
  - Assign 1000’s of additional spectra (p-value < 10^-5)
35 Hidden Markov Model
[Figure: HMM with Ion, Delete, and Insert states]
(m/z,int) pair emitted by ion & insert states
36 The devil in the details
- Intensity normalization
- Discretize (m/z,int) pairs
- Viterbi distance as score
- Compute p-value using “random” spectra
37 Random Spectra
- Uniform sample of (m/z,int)
- Permutation (m/z) of true spectra peaks
- M/z distribution between true spectra and uniform sample (parameter)
[Figure: histogram of # of spectra vs. Viterbi score for Random, True, and False spectra]
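One way to turn scores on "random" spectra into a p-value is the empirical estimate sketched below. This is a generic illustration under stated assumptions, not the HMMatch procedure: it assumes a higher score means a better match, reads "permutation (m/z)" as shuffling which m/z value each intensity is attached to, and uses hypothetical function names throughout.

```python
import random


def uniform_random_spectrum(n_peaks, mz_range, max_intensity, rng):
    """Random spectrum with (m/z, int) pairs drawn uniformly."""
    lo, hi = mz_range
    return [(rng.uniform(lo, hi), rng.uniform(0.0, max_intensity))
            for _ in range(n_peaks)]


def permuted_spectrum(peaks, rng):
    """Keep the intensities in place but shuffle the m/z values among them."""
    mzs = [mz for mz, _ in peaks]
    rng.shuffle(mzs)
    return [(mz, intensity) for mz, (_, intensity) in zip(mzs, peaks)]


def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as good as the observed score.

    The +1 terms keep the estimate away from exactly zero.
    """
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)
```

With this estimator the smallest reportable p-value is 1 / (N + 1) for N random spectra, so very small p-values need either a large null sample or a fitted score distribution.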
38 HMM Peptide Identification Results – DLATV
39 Spectral Matching of Peptide Variants
DFLAGGVAAAISK
DFLAGGIAAAISK
40 HMM model extrapolation
41 Mascot Search Results
42 Peptide Identification Results
- Search engines always provide an answer
- Current search engines:
  - Hard to determine “good” scores
  - Significance estimates are unreliable
- Need better methods!
43 Common Algorithmic Framework
- Pre-process experimental spectra
- Filter peptide candidates
- Score match between peptides and spectra
- Rank peptides and assign
44 Comparison of search engines
- No single score is comprehensive
- Search engines disagree
- Many spectra lack confident peptide assignment
[Figure: Venn diagram of spectra for OMSSA, X!Tandem, and Mascot; region percentages: 4%, 10%, 2%, 5%, 9%, 69%, 2%]
45 Lots of published solutions!
- Treat search engines as black-boxes
  - Apply supervised machine learning to results
  - Use multiple match metrics
- Combine/refine using multiple search engines
  - Agreement suggests correctness
- Use empirical significance estimates
  - “Decoy” databases (FDR)
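The decoy-database idea can be illustrated with a short sketch. It assumes target and decoy databases of equal size searched separately, that a higher score is better, and that the number of decoy hits above a threshold estimates the number of false target hits above it. The function names and the input layout are hypothetical.

```python
def decoy_fdr(psms):
    """Estimate FDR at each score threshold.

    psms: list of (score, is_decoy) pairs, one per peptide-spectrum match.
    Returns (score, is_decoy, fdr) tuples sorted from best to worst score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    results, targets, decoys = [], 0, 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Decoy hits stand in for the unknown number of false target hits.
        fdr = decoys / targets if targets else 1.0
        results.append((score, is_decoy, min(fdr, 1.0)))
    return results


def score_threshold(psms, max_fdr=0.05):
    """Lowest score whose estimated FDR is within max_fdr, or None."""
    best = None
    for score, _, fdr in decoy_fdr(psms):
        if fdr <= max_fdr:
            best = score
    return best
```

For example, with three target hits scoring above the best decoy, the estimated FDR is zero down to the third target score and rises as decoys begin to appear in the ranked list.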
46 PepArML
- Peptide identification arbiter by machine learning
- Unifies these ideas within a model-free, combining machine learning framework
- Unsupervised training procedure
47 PepArML Overview
Unify Tandem, Mascot, and OMSSA results
[Figure: X!Tandem, Mascot, OMSSA, Other → PepArML → Identified / Unidentified]
48 Voting Heuristic Combiner
- Choose peptide ID with the most votes
- Use best FDR as confidence
- Break ties (single votes) using FDR
- Strawman for comparison
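The voting rule above is simple enough to write down directly. The sketch below is one reading of it for a single spectrum: each search engine casts one vote for its top peptide, the peptide with the most votes wins, and the lowest FDR among its supporters serves both as the confidence and as the tie-breaker. The function name and the input layout are hypothetical.

```python
from collections import defaultdict


def vote(ids):
    """Combine per-engine peptide IDs for one spectrum.

    ids: list of (engine, peptide, fdr) tuples, one per search engine.
    Returns (peptide, votes, best_fdr), or None if there are no IDs.
    """
    if not ids:
        return None
    votes = defaultdict(int)
    best_fdr = {}
    for _engine, peptide, fdr in ids:
        votes[peptide] += 1
        best_fdr[peptide] = min(fdr, best_fdr.get(peptide, 1.0))
    # Most votes first; among equal vote counts, the lowest FDR wins.
    winner = min(votes, key=lambda p: (-votes[p], best_fdr[p]))
    return winner, votes[winner], best_fdr[winner]
```

When all three engines disagree, every peptide has a single vote and the result reduces to picking the identification with the best FDR.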
49 Dataset construction
[Figure: Spectra → Search Tools (X!Tandem, Mascot, OMSSA, Other) → Extract Features → Machine Learning]
Features:
- Matched Ions
- Peak_intensity
- Mass delta
- # of missed cleavages
- Peptide length
- Tandem Score
- Mascot Score
- OMSSA Score
50 Dataset construction
Build feature vectors
[Figure: feature-vector table with Tandem, Mascot, and OMSSA columns and T/F entries]
51 Dataset construction
Synthetic protein mixtures provide ground truth
- C8
  - 8 standard proteins (Calibrant Biosystems)
  - 4594 MS/MS spectra (LTQ)
  - 618 (11.2%) true positives
- S17
  - 17 standard proteins (Sashimi Repository)
  - 1389 MS/MS spectra (Q-TOF)
  - 354 (25.4%) true positives
- AURUM
  - 364 standard proteins (AURUM 1.0)
  - 7508 MS/MS spectra (MALDI-TOF-TOF)
  - 3775 (50.3%) true positives
52 Machine learning improves single search engines (S17)
53 Multiple search engines are better than single search engines (S17)
54 Feature Evaluation
55 Application to Real Data
- How well do these models generalize?
  - Different instruments
    - Spectral characteristics change scores
  - Search parameters
    - Different parameters change score values
- Supervised learning requires
  - (Synthetic) experimental data from every instrument
  - Search results from available search engines
  - Training/models for all parameters x search engine sets x instruments
56 Model Generalization
57 Rescuing Machine Learning
- Train a new machine-learning model for every dataset!
  - Generalization not required
  - No predetermined search engines, parameters, instruments, features
- Perhaps we can “guess” the true proteins
  - Most proteins not in doubt
  - Machine learning can tolerate imperfect labels
58 Unsupervised Learning
- Heuristic selection of “true” proteins
- Train classifier, predict true peptide IDs
- Update “true” proteins
  - Heuristic selection of “true” proteins from classifier predictions
- Iterate until convergence
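The loop on this slide can be sketched with a deliberately tiny stand-in for each part. Everything specific here is an assumption made for illustration, not PepArML's actual components: the protein heuristic is "at least two accepted peptide IDs", the classifier is a single score threshold (a decision stump), and each peptide ID carries one `score` feature. All names are hypothetical.

```python
from collections import Counter


def select_proteins(psms, accepted, min_peptides=2):
    """Heuristic: proteins supported by at least min_peptides accepted IDs."""
    counts = Counter(p["protein"] for p, ok in zip(psms, accepted) if ok)
    return {prot for prot, n in counts.items() if n >= min_peptides}


def train_stump(scores, labels):
    """Score threshold that agrees with the most labels (lowest on ties)."""
    best_t, best_acc = None, -1
    for t in sorted(set(scores)):
        acc = sum((s >= t) == y for s, y in zip(scores, labels))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def iterate(psms, initial_threshold, max_iter=20):
    """Alternate protein selection and classifier training until stable.

    psms: non-empty list of dicts with "protein" and "score" keys.
    Returns (accepted flags per peptide ID, set of "true" proteins).
    """
    scores = [p["score"] for p in psms]
    accepted = [s >= initial_threshold for s in scores]
    for _ in range(max_iter):
        proteins = select_proteins(psms, accepted)
        # Imperfect labels: any ID from a "true" protein counts as true.
        labels = [p["protein"] in proteins for p in psms]
        threshold = train_stump(scores, labels)
        new_accepted = [s >= threshold for s in scores]
        if new_accepted == accepted:  # converged
            break
        accepted = new_accepted
    return accepted, select_proteins(psms, accepted)
```

In a toy run, a low-scoring peptide ID from a protein that already has two confident IDs is relabelled as true, the threshold moves down to include it, and the next iteration changes nothing, so the loop stops.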
59 Unsupervised Learning Performance
60 Unsupervised Learning Convergence
61 Conclusions
- Proteomics can inform genome annotation
  - Eukaryotic and prokaryotic
  - Functional vs silencing variants
- Peptides identify more than just proteins
  - Untapped source of disease biomarkers
- Computational inference can make a substantial impact in proteomics
62 Conclusions
- Compressed peptide sequence databases make routine EST searching feasible
- HMMatch spectral matching improves identification performance for familiar peptides
- Unsupervised, model-free, combining PepArML framework solves peptide identification interpretation problem
63 Acknowledgements
- Chau-Wen Tseng, Xue Wu
  - UMCP Computer Science
- Catherine Fenselau
  - UMCP Biochemistry
- Cheng Lee
  - Calibrant Biosystems
- PeptideAtlas, HUPO PPP, X!Tandem
- Funding: NIH/NCI, USDA/ARS