Protein Identification Using Mass Spectrometry Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center
Proteomics
- Proteins are the machines that drive much of biology
  - Genes are merely the recipe
- The direct characterization of a sample’s proteins en masse
  - What proteins are present?
  - How much of each protein is present?
12/8/2009 BIST535 - 2009
Gene / Transcript / Protein
Systems Biology
- Establish relationships by
  - Choosing related samples,
  - Global characterization, and
  - Comparison.
Gene / Transcript / Protein Measurement
                 | Predetermined   | Unknown
  Discrete (DNA) | Genotyping      | Sequencing
  Continuous     | Gene Expression | Proteomics
Samples
- Healthy / Diseased
- Cancerous / Benign
- Drug resistant / Drug susceptible
- Bound / Unbound
- Tissue specific
- Cellular location specific
  - Mitochondria, Membrane
2D Gel-Electrophoresis
- Protein separation
  - Molecular weight (MW)
  - Isoelectric point (pI)
- Staining
- Bird's-eye view of protein abundance
2D Gel-Electrophoresis Bécamel et al., Biol. Proced. Online 2002;4:94-104.
Paradigm Shift
- Traditional protein chemistry assay methods struggle to establish identity.
- Identity requires:
  - Specificity of measurement (Precision)
    - Mass spectrometry
  - A reference for comparison (Measurement → Identity)
    - Protein sequence databases
Mass Spectrometer
Sample → Ionizer → Mass Analyzer → Detector
- Ionizer: MALDI, Electro-Spray Ionization (ESI)
- Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap
- Detector: Electron Multiplier (EM)
Mass Spectrometer (MALDI-TOF)
[Diagram labels: UV (337 nm); analyte/matrix; backing plate (grounded); source, length = s, pulse voltage; extraction grid (source voltage -Vs); field-free drift zone, length = D, Ed = 0; detector grid -Vs; microchannel plate detector]
Mass Spectrum
Mass is fundamental
Sample Preparation for MS/MS Enzymatic Digest and Fractionation
Single Stage MS MS
Tandem Mass Spectrometry (MS/MS) Precursor selection
Tandem Mass Spectrometry (MS/MS) Precursor selection + collision induced dissociation (CID) MS/MS
Peptide Fragmentation
Peptide: S-G-F-L-E-E-D-E-L-K
    MW  ion                           ion  MW
    88  b1   S          GFLEEDELK     y9   1080
   145  b2   SG         FLEEDELK      y8   1022
   292  b3   SGF        LEEDELK       y7    875
   405  b4   SGFL       EEDELK        y6    762
   534  b5   SGFLE      EDELK         y5    633
   663  b6   SGFLEE     DELK          y4    504
   778  b7   SGFLEED    ELK           y3    389
   907  b8   SGFLEEDE   LK            y2    260
  1020  b9   SGFLEEDEL  K             y1    147
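The b- and y-ion ladder above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming singly charged fragments and standard monoisotopic residue masses; the function name `fragment_ions` is illustrative and not taken from the slides. The computed values agree with the slide's integer masses to within about half a dalton (the slide's rounding convention is not stated).

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
WATER = 18.01056   # H2O, added to the C-terminal (y) fragments
PROTON = 1.00728   # charge carrier for a 1+ ion


def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z lists: b1..b(n-1), y1..y(n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)           # N-terminal prefix
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)  # C-terminal suffix
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} {b:8.2f}    y{i} {y:8.2f}")
```

Each b-ion is a prefix sum plus a proton; each y-ion is a suffix sum plus water and a proton, which is why the two ladders read the sequence in opposite directions.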
Peptide Fragmentation
  b ions:    88   145   292   405   534   663   778   907  1020  1166
              S     G     F     L     E     E     D     E     L     K
  y ions:  1166  1080  1022   875   762   633   504   389   260   147
[Figure: spectrum axes, % Intensity (0 to 100) vs. m/z (250 to 1000)]
Peptide Fragmentation
[Figure: the b- and y-ion ladder for S-G-F-L-E-E-D-E-L-K over the spectrum, % Intensity (0 to 100) vs. m/z (250 to 1000), with peaks labeled y2 through y9 and b3 through b9]
Peptide Identification
- Given:
  - The mass of the precursor ion, and
  - The MS/MS spectrum
- Output:
  - The amino-acid sequence of the peptide
Sequence Database Search
- Compares peptides from a protein sequence database with spectra
- Filter peptide candidates by
  - Precursor mass
  - Digest motif
- Score each peptide against spectrum
  - Generate all possible peptide fragments
  - Match putative fragments with peaks
  - Score and rank
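The filter, score, and rank steps above can be sketched in a few lines. This is an illustration only, assuming a shared-peak count as the score (real search engines use more elaborate scoring), singly charged b- and y-ions, and candidates that have already been produced by an in-silico digest. The function names, tolerances, and peak list are hypothetical.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragment_mz(peptide):
    """All singly charged b- and y-ion m/z values for the peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    frags = []
    for i in range(1, len(peptide)):
        frags.append(sum(masses[:i]) + PROTON)          # b_i
        frags.append(sum(masses[i:]) + WATER + PROTON)  # y_(n-i)
    return frags


def shared_peak_count(peptide, peaks, tol=0.5):
    """Number of theoretical fragments with an observed peak within tol."""
    return sum(any(abs(f - p) <= tol for p in peaks) for f in fragment_mz(peptide))


def search(peaks, precursor_mass, candidates, mass_tol=1.0, frag_tol=0.5):
    """Filter candidates by precursor mass, then score and rank them."""
    hits = [
        (shared_peak_count(pep, peaks, frag_tol), pep)
        for pep in candidates
        if abs(peptide_mass(pep) - precursor_mass) <= mass_tol
    ]
    return sorted(hits, reverse=True)


# Hypothetical spectrum: a subset of the SGFLEEDELK fragment peaks.
peaks = [260.20, 292.13, 389.24, 405.21, 504.27, 633.31, 762.35, 875.44]
candidates = ["SGFLEEDELK", "SGFLEEDLEK", "AAAAAAAK"]
results = search(peaks, 1165.55, candidates)
print(results)
```

In this toy example the third candidate is removed by the precursor mass filter, and the permuted sequence scores one peak lower than the correct one because only its y2 ion differs.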
Sequence Database Search [Figure: MS/MS spectrum, % Intensity (0 to 100) vs. m/z (250 to 1000), with the candidate peptide S-G-F-L-E-E-D-E-L-K]
Sequence Database Search [Figure: the same spectrum with the candidate's theoretical fragments. b ions: S 88, G 145, F 292, L 405, E 534, E 663, D 778, E 907, L 1020, K 1166. y ions: 147, 260, 389, 504, 633, 762, 875, 1022, 1080]
Sequence Database Search [Figure: the same spectrum and fragments, with matched peaks labeled y2 through y9 and b3 through b9]
Sequence Database Search
- No need for complete ladders
- Possible to model all known peptide fragments
- Sequence permutations eliminated
- All candidates have some biological relevance
- Practical for high-throughput peptide identification
- Correct peptide might be missing from database!
Peptide Candidate Filtering
- Digestion Enzyme: Trypsin
  - Cuts just after K or R unless followed by a P.
  - Basic residues (K & R) at the C-terminus attract ionizing charge, leading to strong y-ions
  - “Average” peptide length about 10-15 amino-acids
  - Must allow for “missed” cleavage sites
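The trypsin rule above (cut after K or R unless the next residue is P, with an allowance for missed cleavages) can be sketched as follows. The function name `tryptic_peptides`, its parameters, and the example protein sequence are hypothetical.

```python
def tryptic_peptides(protein, missed_cleavages=1):
    """In-silico tryptic digest allowing up to `missed_cleavages` missed sites."""
    # Cleavage positions: after K or R, unless the next residue is P.
    sites = [
        i for i in range(1, len(protein))
        if protein[i - 1] in "KR" and protein[i] != "P"
    ]
    bounds = [0] + sites + [len(protein)]
    pieces = [protein[a:b] for a, b in zip(bounds, bounds[1:])]

    peptides = []
    for i in range(len(pieces)):
        # Join each fully cleaved piece with up to `missed_cleavages` neighbors.
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


protein = "MKSGFLEEDELKRPAGK"  # hypothetical sequence; note the R-P bond is not cut
print(tryptic_peptides(protein, missed_cleavages=0))
print(tryptic_peptides(protein, missed_cleavages=1))
```

Allowing missed cleavages enlarges the candidate list, so the setting trades search time against the chance of excluding the correct peptide.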
Peptide Candidate Filtering
- Peptide molecular weight
  - Only have m/z value
    - Need to determine charge state
  - Ion selection tolerance
- Mass for each amino-acid symbol?
  - Monoisotopic vs. Average
  - “Default” residue mass
  - Depends on sample preparation protocol
  - Cysteine almost always modified
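Because the instrument reports only m/z, the neutral peptide mass depends on the assumed charge state. A minimal sketch, assuming protonated ions; the function names and the example m/z are illustrative.

```python
PROTON = 1.00728  # mass of the charge-carrying proton, Da


def neutral_mass(mz, charge):
    """Neutral mass implied by an observed m/z at a given charge state."""
    return charge * (mz - PROTON)


def candidate_masses(mz, charges=(1, 2, 3)):
    """If the charge is unknown, each plausible charge gives a different mass."""
    return {z: neutral_mass(mz, z) for z in charges}


# Hypothetical precursor observed at m/z 583.7824
for z, mass in candidate_masses(583.7824).items():
    print(f"charge {z}+  ->  neutral mass {mass:.4f} Da")
```

With the charge unknown, a search must either try each plausible mass or infer the charge first, for example from the isotope spacing.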
Peptide Molecular Weight
- Same peptide, i = # of C13 isotopes
[Figure: isotope cluster with peaks labeled i=0, i=1, i=2, i=3, i=4]
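Each additional C13 atom adds about 1.00335 Da to the peptide, so the peaks of the isotope cluster are spaced by roughly 1.00335/z on the m/z axis. A small sketch of how that spacing can suggest the charge state; the function names and example m/z are illustrative, and real spectra need noise-tolerant peak picking that is not shown here.

```python
C13_DELTA = 1.00335  # mass difference between C13 and C12, Da


def isotope_mz(mono_mz, charge, n_peaks=5):
    """m/z of the i = 0..n_peaks-1 isotope peaks for a given charge state."""
    return [mono_mz + i * C13_DELTA / charge for i in range(n_peaks)]


def charge_from_spacing(peak_mz):
    """Estimate the charge state from the mean spacing of isotope peaks."""
    gaps = [b - a for a, b in zip(peak_mz, peak_mz[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return round(C13_DELTA / mean_gap)


cluster = isotope_mz(583.78, charge=2)
print([f"{mz:.3f}" for mz in cluster])
print("estimated charge:", charge_from_spacing(cluster))
```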
Peptide Scoring
- Peptide fragments vary based on
  - The instrument
  - The peptide’s amino-acid sequence
  - The peptide’s charge state
  - Etc…
- Search engines model peptide fragmentation to various degrees.
  - Speed vs. sensitivity tradeoff
  - y-ions & b-ions occur most frequently
Peptide Identification
- High-throughput workflows demand we analyze all spectra, all the time.
- Spectra may not contain enough information to be interpreted correctly
  - …bad static on a cell phone
- Peptides may not match our assumptions
  - …it’s all Greek to me
- “Don’t know” is an acceptable answer!
Peptide Identification
- Rank the best peptide identifications
- Is the top ranked peptide correct?
Peptide Identification
- Incorrect peptide has best score
  - Correct peptide is missing?
  - Potential for incorrect conclusion
  - What score ensures no incorrect peptides?
- Correct peptide has weak score
  - Insufficient fragmentation, poor score
  - Potential for weakened conclusion
  - What score ensures we find all correct peptides?
Statistical Significance
- Can’t prove particular identifications are right or wrong...
  - ...need to know fragmentation in advance!
- A minimal standard for identification scores...
  - ...better than guessing.
  - p-value, E-value, statistical significance
- For each spectrum, compare scores with those of random peptides (p-value, E-value).
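The comparison against random peptides can be made concrete with an empirical p-value and E-value. This is a sketch under assumptions: the p-value is the fraction of random-peptide scores at least as good as the observed score (with a +1 pseudo-count so it is never exactly zero, which is a choice made here, not something stated in the slides), and the E-value is the p-value multiplied by the number of candidates scored. The scores shown are made up.

```python
def empirical_p_value(score, random_scores):
    """Fraction of random-peptide scores at least as good as `score`."""
    at_least_as_good = sum(1 for s in random_scores if s >= score)
    # Pseudo-count keeps the estimate away from an overconfident zero.
    return (at_least_as_good + 1) / (len(random_scores) + 1)


def e_value(p_value, n_candidates):
    """Expected number of random candidates scoring at least this well."""
    return p_value * n_candidates


# Hypothetical scores of random peptides against one spectrum.
random_scores = [1, 2, 3, 2, 1, 4, 3, 2, 9]
p = empirical_p_value(8, random_scores)
print(f"p-value = {p:.2f}, E-value = {e_value(p, n_candidates=50):.1f}")
```

An empirical estimate cannot resolve p-values much smaller than one over the number of random scores, which is one motivation for extrapolating from a fitted distribution.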
Random Peptide Models
- "Generate" random peptides
  - Real looking fragment masses
  - No theoretical model!
  - Must use empirical distribution
  - Usually require they have the correct precursor mass
- Score function can model anything we like!
Random Peptide Models Fenyo & Beavis, Anal. Chem., 2003
Random Peptide Models
- Truly random peptides don’t look much like real peptides
- Just use (incorrect) peptides from the sequence database!
- Caveats:
  - Correct peptide (non-random) may be included
  - Peptides are not independent
  - Reverse sequence avoids only the first problem
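A reversed-sequence database is simple to construct. A minimal sketch, assuming protein sequences are held in a dictionary keyed by accession; the `REV_` prefix, the function name, and the example sequence are hypothetical conventions chosen for illustration.

```python
def reversed_decoys(proteins, prefix="REV_"):
    """Build decoy entries by reversing each protein sequence."""
    return {prefix + acc: seq[::-1] for acc, seq in proteins.items()}


targets = {"PROT_A": "MKSGFLEEDELKRPAGK"}  # hypothetical target database
decoys = reversed_decoys(targets)
search_db = {**targets, **decoys}          # same number of targets and decoys
print(decoys)
```

Reversal keeps the amino-acid composition and length of every protein, so decoy peptides have realistic masses, but as noted above the decoy peptides are still not independent of one another.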
Extrapolating from the Empirical Distribution
- Often, the empirical shape is consistent with a theoretical model
Geer et al., J. Proteome Research, 2004
Fenyo & Beavis, Anal. Chem., 2003
False Positive Rate Estimation
- Each spectrum is a chance to be right, wrong, or inconclusive.
  - At any given threshold, how many peptide identifications are wrong?
- Computed for entire spectral dataset
- Given identification criteria:
  - SEQUEST Xcorr, E-value, Score, etc., plus...
  - ...threshold
- Use “decoy” sequences and repeat search
  - random, reverse, cross-species
  - Identifications must be incorrect!
False Positive Rate Estimation
- # FP in real search = # hits in decoy search
  - Need same size database, or rate conversion
- FP Rate = (# decoy hits with score ≥ thresh) / (# hits with score ≥ thresh)
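The decoy-based estimate above is a ratio of two counts. A minimal sketch, assuming one best score per spectrum from each search and higher scores being better; the `size_ratio` parameter stands in for the "rate conversion" when the decoy database differs in size from the real one, and all names and scores here are hypothetical.

```python
def estimated_fp_rate(target_scores, decoy_scores, threshold, size_ratio=1.0):
    """Estimate the false positive rate among hits scoring >= threshold.

    size_ratio = (real database size) / (decoy database size); 1.0 when equal.
    """
    target_hits = sum(1 for s in target_scores if s >= threshold)
    decoy_hits = sum(1 for s in decoy_scores if s >= threshold)
    if target_hits == 0:
        return 0.0  # nothing accepted, so nothing can be wrong
    return min(1.0, size_ratio * decoy_hits / target_hits)


# Hypothetical best scores per spectrum from the real and decoy searches.
target_scores = [10, 9, 8, 7, 6, 5, 4, 3]
decoy_scores = [6, 4, 3, 2, 2, 1, 1, 1]
for thresh in (7, 5, 3):
    rate = estimated_fp_rate(target_scores, decoy_scores, thresh)
    print(f"threshold {thresh}: estimated FP rate {rate:.2f}")
```

Lowering the threshold accepts more identifications at the cost of a higher estimated error rate, which is the tradeoff the threshold choice controls.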
False Positive Rate Estimation
- A form of statistical significance
- Search engine independent
- Easy to implement
- Assumes a single threshold for all spectra
- Best if E-value or similar is used to compute a spectrum-normalized score
Peptide Prophet
- From the Institute for Systems Biology
  - Keller et al., Anal. Chem. 2002
- Re-analysis of SEQUEST results
- Spectrum dependent scores (XCorr)
- Assumes that many of the spectra are not correctly identified
Peptide Prophet
- Distribution of spectral scores in the results
Keller et al., Anal. Chem. 2002
Peptide Prophet
- Assumes a bimodal distribution of scores, with a particular shape
- Ignores database size
  - …but it is included implicitly
- Like empirical distribution for peptide sampling, can be applied to any score function
- Can be applied to any search engine’s results
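Once a bimodal score distribution has been fitted, each score can be turned into a probability of being correct with Bayes' rule. The sketch below illustrates only that last step and is not the Peptide Prophet implementation: it assumes both components are Gaussian with already known parameters, whereas the actual tool fits its own component shapes to the data. All names and numbers are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def prob_correct(score, prior_correct, correct, incorrect):
    """Posterior probability that an identification with this score is correct.

    `correct` and `incorrect` are (mean, sd) pairs for the two score modes;
    `prior_correct` is the fraction of spectra assumed correctly identified.
    """
    pos = prior_correct * normal_pdf(score, *correct)
    neg = (1.0 - prior_correct) * normal_pdf(score, *incorrect)
    return pos / (pos + neg)


# Hypothetical fitted mixture: incorrect scores near 1.0, correct scores near 3.0.
for score in (1.0, 2.0, 4.0):
    p = prob_correct(score, prior_correct=0.5, correct=(3.0, 1.0), incorrect=(1.0, 1.0))
    print(f"score {score:.1f}: P(correct) = {p:.3f}")
```

The prior matters: when most spectra are assumed incorrectly identified, the same score yields a lower probability of being correct.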
Comparison of search engine results
- No single score is comprehensive
- Search engines disagree
- Many spectra lack confident peptide assignment
[Figure labels: X! Tandem, SEQUEST, Mascot; 38%, 14%, 28%, 3%, 2%, 1%]
This is why no single search engine gives the best results.
Q: After improvement, what percentage of spectra is identified, and how large is the improvement? 25-30%
Searle et al. JPR 7(1), 2008
Combining search engine results: harder than it looks!
- Consensus boosts confidence, but...
  - How to assess statistical significance?
  - Gain specificity, but lose sensitivity!
  - Incorrect identifications are correlated too!
- How to handle weak identifications?
  - Consensus vs disagreement vs abstention
  - Threshold at some significance?
- We apply unsupervised machine-learning....
  - Lots of related work unified in a single framework.
Supervised Learning
Unsupervised Learning
PepArML Combining Results [Figure panels: Q-TOF, MALDI, LTQ] Edwards, et al., Clin. Prot. 5(1), 2009
Unsupervised Learning [Figure legend: U*-TMO, U-TMO, C-TMO, H] Edwards, et al., Clin. Prot. 5(1), 2009
Peptide Atlas A8_IP LTQ Dataset
This moderately sized real dataset contains about 100,000 spectra. The x-axis is the estimated false discovery rate (FDR); the y-axis is the number of spectra, and of peptides, identified at that FDR. Dotted lines represent individual search engines' E-values. The Heuristic is a robust decoy-based combiner that uses only the E-values and generally slightly beats the best individual search engine. PepArML-TKO uses just Tandem, KScore, and OMSSA. PepArML-All uses all five search engines. Stress that the combiner uses only the results from the individual search engines, with no new searches.
Peptides to Proteins Nesvizhskii et al., Anal. Chem. 2003
Peptides to Proteins
Peptides to Proteins
- A peptide sequence may occur in many different protein sequences
  - Variants, paralogues, protein families
- Separation, digestion and ionization are not well understood
- Proteins in sequence database are extremely non-random, and very dependent
- No great tools for assessing statistical confidence of protein identifications.
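The first point above, that one peptide may occur in several proteins, is easy to see by mapping identified peptides back to the database. A minimal sketch using plain substring matching; the accessions, sequences, and function names are hypothetical, and the sketch does not attempt protein-level inference or confidence estimation.

```python
def peptides_to_proteins(peptides, proteins):
    """Map each identified peptide to every protein sequence that contains it."""
    return {
        pep: sorted(acc for acc, seq in proteins.items() if pep in seq)
        for pep in peptides
    }


def split_unique_shared(mapping):
    """Separate peptides found in one protein from those shared by several."""
    unique = {pep: accs[0] for pep, accs in mapping.items() if len(accs) == 1}
    shared = {pep: accs for pep, accs in mapping.items() if len(accs) > 1}
    return unique, shared


# Hypothetical database: PROT_A and its variant differ by a single residue.
proteins = {
    "PROT_A": "MKSGFLEEDELKRPAGK",
    "PROT_A_VARIANT": "MKSGFLEEDELKTPAGK",
    "PROT_B": "MAAVLDNRSTYK",
}
mapping = peptides_to_proteins(["SGFLEEDELK", "STYK", "RPAGK"], proteins)
unique, shared = split_unique_shared(mapping)
print("unique:", unique)
print("shared:", shared)
```

A shared peptide is evidence that at least one of its proteins is present, but it cannot say which; only the unique peptides distinguish the variant from the original here.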
Summary
- Protein identification from tandem mass spectra is a key proteomics technology.
- Protein identifications should be treated with healthy skepticism.
- All peptide / protein lists represent a triage of the data; look for ways to estimate significance.
- Lots of open "applied statistics" problems!
- The devil is in the details. There is no moral high ground here: whatever is most effective wins.