Protein Identification Using Mass Spectrometry

Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center

Proteomics
Proteins are the machines that drive much of biology
  Genes are merely the recipe
The direct characterization of a sample’s proteins en masse.
  What proteins are present?
  How much of each protein is present?
12/8/2009 BIST

Gene / Transcript / Protein
Systems Biology
Establish relationships by choosing related samples, global characterization, and comparison.
Gene / Transcript / Protein Measurement
                  Predetermined    Unknown
Discrete (DNA)    Genotyping       Sequencing
Continuous        Gene Expression  Proteomics

Samples
Healthy / Diseased
Cancerous / Benign
Drug resistant / Drug susceptible
Bound / Unbound
Tissue specific
Cellular location specific
  Mitochondria, Membrane

2D Gel-Electrophoresis
Protein separation
  Molecular weight (MW)
  Isoelectric point (pI)
Staining
Bird's-eye view of protein abundance

2D Gel-Electrophoresis
Bécamel et al., Biol. Proced. Online 2002;4:

Paradigm Shift
Traditional protein chemistry assay methods struggle to establish identity.
Identity requires:
  Specificity of measurement (precision): mass spectrometry
  A reference for comparison (measurement → identity): protein sequence databases

Mass Spectrometer
Sample → Ionizer → Mass Analyzer → Detector
Ionizer: MALDI, Electro-Spray Ionization (ESI)
Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap
Detector: Electron Multiplier (EM)

Mass Spectrometer (MALDI-TOF)
[Figure: MALDI-TOF schematic. Labels: UV (337 nm); analyte/matrix on the backing plate (grounded); source region, length = s, pulse voltage; extraction grid (source voltage -Vs); field-free drift zone, length = D, Ed = 0; detector grid (-Vs); microchannel plate detector.]

Mass Spectrum

Mass is fundamental

Sample Preparation for MS/MS
Enzymatic Digest and Fractionation

Single Stage MS

Tandem Mass Spectrometry (MS/MS)
Precursor selection

Tandem Mass Spectrometry (MS/MS)
Precursor selection + collision induced dissociation (CID): MS/MS

Peptide Fragmentation
Peptide: S-G-F-L-E-E-D-E-L-K
MW    ion  b fragment  y fragment  ion  MW
88    b1   S           GFLEEDELK   y9   1080
145   b2   SG          FLEEDELK    y8   1022
292   b3   SGF         LEEDELK     y7   875
405   b4   SGFL        EEDELK      y6   762
534   b5   SGFLE       EDELK       y5   633
663   b6   SGFLEE      DELK        y4   504
778   b7   SGFLEED     ELK         y3   389
907   b8   SGFLEEDE    LK          y2   260
1020  b9   SGFLEEDEL   K           y1   147
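
The masses in this table can be reproduced from standard monoisotopic residue masses. The sketch below is illustrative only: the function name `fragment_ions` and the constant names are assumptions, and it models just singly charged b and y ions. A b ion is the sum of the N-terminal residues plus one proton; a y ion is the sum of the C-terminal residues plus water plus one proton. The values agree with the table to within rounding.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass of the ionizing proton
WATER = 18.01056   # H2O carried by the C-terminal (y) fragments


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)           # b_i: first i residues
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)  # y_i: last i residues
    return b_ions, y_ions


b, y = fragment_ions("SGFLEEDELK")
for i, (bi, yi) in enumerate(zip(b, y), start=1):
    print(f"b{i} {bi:8.2f}    y{i} {yi:8.2f}")
```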

[Figure: fragment ladder for S-G-F-L-E-E-D-E-L-K above an empty spectrum. b ions: 88 145 292 405 534 663 778 907 1020 1166; y ions: 1166 1080 1022 875 762 633 504 389 260 147. Axes: m/z (250 to 1000), % Intensity (to 100).]

[Figure: fragment ladder for S-G-F-L-E-E-D-E-L-K above the MS/MS spectrum. b ions: 88 145 292 405 534 663 778 907 1020 1166; y ions: 1166 1080 1022 875 762 633 504 389 260 147. Labeled peaks: y2 y3 y4 y5 y6 y7 y8 y9 and b3 b4 b5 b6 b7 b8 b9. Axes: m/z (250 to 1000), % Intensity (to 100).]

Peptide Identification
Given:
  The mass of the precursor ion, and
  The MS/MS spectrum
Output:
  The amino-acid sequence of the peptide

Sequence Database Search
Compares peptides from a protein sequence database with spectra
Filter peptide candidates by
  Precursor mass
  Digest motif
Score each peptide against spectrum
  Generate all possible peptide fragments
  Match putative fragments with peaks
  Score and rank
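
The filter-then-score loop can be sketched as follows. This is a toy illustration, not how any particular search engine works: the score is a plain shared-peak count, the function names and tolerances are assumptions, and the peak list in the usage lines is synthetic, placed at the fragment masses of the example peptide.

```python
# Monoisotopic residue masses (Da).
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
        "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
        "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
        "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
PROTON, WATER = 1.00728, 18.01056


def peptide_mass(pep):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[a] for a in pep) + WATER


def theoretical_fragments(pep):
    """Singly charged b and y ion m/z values for every backbone position."""
    frags = []
    for i in range(1, len(pep)):
        frags.append(sum(MONO[a] for a in pep[:i]) + PROTON)          # b_i
        frags.append(sum(MONO[a] for a in pep[i:]) + WATER + PROTON)  # y_(n-i)
    return frags


def shared_peak_count(pep, peaks, tol=0.5):
    """Number of theoretical fragments with an observed peak within tol Da."""
    return sum(any(abs(f - p) <= tol for p in peaks)
               for f in theoretical_fragments(pep))


def search(candidates, precursor_mass, peaks, precursor_tol=2.0):
    """Filter candidates by precursor mass, then score and rank (best first)."""
    hits = [(shared_peak_count(p, peaks), p) for p in candidates
            if abs(peptide_mass(p) - precursor_mass) <= precursor_tol]
    return sorted(hits, reverse=True)


# Synthetic peaks (not measured data) at y2..y9 and b3..b9 of SGFLEEDELK.
peaks = [260.2, 389.2, 504.3, 633.3, 762.4, 875.4, 1022.5, 1079.5,
         292.1, 405.2, 534.3, 663.3, 778.3, 907.4, 1020.5]
candidates = ["SGFLEEDELK", "GSFLEEDELK", "SGFLEDEELK", "AAAAAAK"]
for score, pep in search(candidates, 1165.55, peaks):
    print(score, pep)
```

Sequence permutations with the same precursor mass pass the filter, so it is the fragment matching that separates them.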

[Figure: MS/MS spectrum with the candidate peptide laid over it; residues shown: K L E D F G S. Axes: m/z (250 to 1000), % Intensity (to 100).]

[Figure: the spectrum with the candidate's theoretical fragments. b ions: S 88, G 145, F 292, 405, 534, 663, D 778, E 907, L 1020, K 1166; y ions: 147 260 389 504 633 762 875 1022 1080.]

[Figure: the spectrum with theoretical fragments matched to peaks. Labeled peaks: y2 y3 y4 y5 y6 y7 y8 y9 and b3 b4 b5 b6 b7 b8 b9.]

No need for complete ladders
Possible to model all known peptide fragments
Sequence permutations eliminated
All candidates have some biological relevance
Practical for high-throughput peptide identification
Correct peptide might be missing from database!

Peptide Candidate Filtering
Digestion Enzyme: Trypsin
  Cuts just after K or R unless followed by a P.
  Basic residues (K & R) at C-terminal attract ionizing charge, leading to strong y-ions
  “Average” peptide length about amino-acids
  Must allow for “missed” cleavage sites
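
A minimal in-silico digest following the rule on this slide, as a sketch only: `tryptic_digest` and its `missed_cleavages` parameter are hypothetical names, and real tools add length and mass limits on top of this.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Return tryptic peptides, allowing up to `missed_cleavages` skipped sites."""
    # Trypsin cuts after K or R unless the next residue is P.
    sites = [i + 1 for i in range(len(protein) - 1)
             if protein[i] in "KR" and protein[i + 1] != "P"]
    bounds = [0] + sites + [len(protein)]
    peptides = []
    for start in range(len(bounds) - 1):
        # Join 1 .. missed_cleavages + 1 consecutive fully cleaved pieces.
        last = min(start + 1 + missed_cleavages, len(bounds) - 1)
        for end in range(start + 1, last + 1):
            peptides.append(protein[bounds[start]:bounds[end]])
    return peptides


print(tryptic_digest("AKRPGKSR"))
print(tryptic_digest("AKRPGKSR", missed_cleavages=1))
```

In the toy sequence the R followed by P is not a cleavage site, so RPGK stays intact even with no missed cleavages allowed.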

Peptide Candidate Filtering
Peptide molecular weight
  Only have m/z value
    Need to determine charge state
  Ion selection tolerance
Mass for each amino-acid symbol?
  Monoisotopic vs. Average
  “Default” residue mass
  Depends on sample preparation protocol
    Cysteine almost always modified
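
One common way to get from an observed m/z to a peptide mass is to read the charge state off the isotope peak spacing (about 1.00335/z) and then remove the protons. The sketch below assumes protonated positive ions; all function names are hypothetical.

```python
PROTON = 1.00728        # Da
C13_C12_DIFF = 1.00335  # Da, mass difference between 13C and 12C


def charge_from_isotope_spacing(spacing):
    """Estimate charge z from the m/z gap between adjacent isotope peaks."""
    return max(1, round(C13_C12_DIFF / spacing))


def neutral_mass(mz, charge):
    """Neutral peptide mass from the m/z of an [M + zH]z+ ion."""
    return charge * (mz - PROTON)


def within_tolerance(observed_mass, candidate_mass, tol_da):
    """True if a candidate peptide mass passes the precursor mass filter."""
    return abs(observed_mass - candidate_mass) <= tol_da


z = charge_from_isotope_spacing(0.5017)  # peaks about 0.5 apart suggest 2+
mass = neutral_mass(583.7824, z)
print(z, round(mass, 2), within_tolerance(mass, 1165.55, 0.5))
```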

Peptide Molecular Weight
Same peptide, i = # of C13 isotope
[Figure: isotope cluster with peaks labeled i=0, i=1, i=2, i=3, i=4]

Peptide Scoring
Peptide fragments vary based on
  The instrument
  The peptide’s amino-acid sequence
  The peptide’s charge state
  Etc…
Search engines model peptide fragmentation to various degrees.
  Speed vs. sensitivity tradeoff
  y-ions & b-ions occur most frequently

High-throughput workflows demand we analyze all spectra, all the time.
Spectra may not contain enough information to be interpreted correctly
  …bad static on a cell phone
Peptides may not match our assumptions
  …it's all Greek to me
“Don’t know” is an acceptable answer!

Rank the best peptide identifications
Is the top ranked peptide correct?

Incorrect peptide has best score
  Correct peptide is missing?
  Potential for incorrect conclusion
  What score ensures no incorrect peptides?
Correct peptide has weak score
  Insufficient fragmentation, poor score
  Potential for weakened conclusion
  What score ensures we find all correct peptides?

Statistical Significance
Can’t prove particular identifications are right or wrong...
  ...need to know fragmentation in advance!
A minimal standard for identification scores...
  ...better than guessing.
  p-value, E-value, statistical significance
For each spectrum, compare scores with those of random peptides (p-value, E-value).
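
Comparing a score against random-peptide scores can be sketched as an empirical p-value, with the E-value taken as the p-value times the number of candidates scored for that spectrum. The function names and the add-one correction are choices made for this illustration, not taken from any particular search engine.

```python
def empirical_p_value(score, random_scores):
    """Fraction of random-peptide scores at least as good as `score`."""
    at_least = sum(1 for s in random_scores if s >= score)
    # Add-one correction: a finite sample never justifies a p-value of zero.
    return (at_least + 1) / (len(random_scores) + 1)


def e_value(p_value, n_candidates):
    """Expected number of random candidates scoring at least this well."""
    return p_value * n_candidates


random_scores = [1, 2, 3, 12]  # toy numbers, not real search output
p = empirical_p_value(10, random_scores)
print(p, e_value(p, n_candidates=10))
```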

Random Peptide Models
"Generate" random peptides
  Real looking fragment masses
  No theoretical model!
  Must use empirical distribution
  Usually require they have the correct precursor mass
Score function can model anything we like!

Random Peptide Models
Fenyo & Beavis, Anal. Chem., 2003

Random Peptide Models
Truly random peptides don’t look much like real peptides
Just use (incorrect) peptides from the sequence database!
Caveats:
  Correct peptide (non-random) may be included
  Peptides are not independent
  Reverse sequence avoids only the first problem

Extrapolating from the Empirical Distribution
Often, the empirical shape is consistent with a theoretical model
Geer et al., J. Proteome Research, 2004
Fenyo & Beavis, Anal. Chem., 2003

False Positive Rate Estimation
Each spectrum is a chance to be right, wrong, or inconclusive.
  At any given threshold, how many peptide identifications are wrong?
  Computed for entire spectral dataset
Given identification criteria:
  SEQUEST Xcorr, E-value, Score, etc., plus...
  ...threshold
Use “decoy” sequences and repeat search
  random, reverse, cross-species
  Identifications must be incorrect!

# FP in real search = # hits in decoy search
  Need same size database, or rate conversion
FP Rate = (# decoy hits with score ≥ thresh) / (# hits with score ≥ thresh)
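
The ratio on this slide translates directly into code. In this sketch `decoy_fp_rate` is a hypothetical name, and `size_ratio` (target database size divided by decoy database size) is an assumed way of expressing the "rate conversion" needed when the two databases differ in size.

```python
def decoy_fp_rate(target_scores, decoy_scores, threshold, size_ratio=1.0):
    """Estimated FP rate: decoy hits over target hits at a score threshold."""
    target_hits = sum(1 for s in target_scores if s >= threshold)
    decoy_hits = sum(1 for s in decoy_scores if s >= threshold)
    if target_hits == 0:
        return 0.0  # nothing reported, so nothing can be wrong
    return min(1.0, size_ratio * decoy_hits / target_hits)


# Toy best-hit scores per spectrum (higher is better), not real results.
target = [9, 8, 7, 6, 5, 4, 3]
decoy = [6, 3, 2, 2, 1, 1, 1]
for thresh in (3, 5, 7):
    print(thresh, decoy_fp_rate(target, decoy, thresh))
```

Raising the threshold trades fewer reported identifications for a lower estimated error rate.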

A form of statistical significance
Search engine independent
Easy to implement
Assumes a single threshold for all spectra
Best if E-value or similar is used to compute a spectrum normalized score

Peptide Prophet
From the Institute for Systems Biology
  Keller et al., Anal. Chem. 2002
Re-analysis of SEQUEST results
  Spectrum dependent scores (XCorr)
Assumes that many of the spectra are not correctly identified

Peptide Prophet
Distribution of spectral scores in the results
Keller et al., Anal. Chem. 2002

Peptide Prophet
Assumes a bimodal distribution of scores, with a particular shape
Ignores database size
  …but it is included implicitly
Like empirical distribution for peptide sampling, can be applied to any score function
Can be applied to any search engine’s results
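
The bimodal idea can be illustrated with a two-component mixture fitted by expectation-maximization. This is not the PeptideProphet model: the sketch swaps in two Gaussians purely for simplicity, all names are hypothetical, and the scores are toy numbers. The higher-mean component stands for correct identifications, and the posterior gives a per-spectrum probability of being correct.

```python
import math


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_mixture(scores, n_iter=100, min_sd=0.1):
    """EM for two Gaussians. Returns (w1, mean0, sd0, mean1, sd1)."""
    lo, hi = min(scores), max(scores)
    mean0, mean1 = lo, hi              # start the components at the extremes
    sd0 = sd1 = max((hi - lo) / 4, min_sd)
    w1 = 0.5                           # prior weight of the "correct" component
    n = len(scores)
    for _ in range(n_iter):
        # E-step: responsibility of the "correct" component for each score.
        resp = []
        for x in scores:
            p0 = (1 - w1) * normal_pdf(x, mean0, sd0)
            p1 = w1 * normal_pdf(x, mean1, sd1)
            resp.append(p1 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: re-estimate weights, means, and standard deviations.
        n1 = max(sum(resp), 1e-9)
        n0 = max(n - sum(resp), 1e-9)
        mean1 = sum(r * x for r, x in zip(resp, scores)) / n1
        mean0 = sum((1 - r) * x for r, x in zip(resp, scores)) / n0
        var1 = sum(r * (x - mean1) ** 2 for r, x in zip(resp, scores)) / n1
        var0 = sum((1 - r) * (x - mean0) ** 2 for r, x in zip(resp, scores)) / n0
        sd1, sd0 = max(math.sqrt(var1), min_sd), max(math.sqrt(var0), min_sd)
        w1 = n1 / n
    return w1, mean0, sd0, mean1, sd1


def prob_correct(x, model):
    """Posterior probability that a score came from the 'correct' component."""
    w1, mean0, sd0, mean1, sd1 = model
    p0 = (1 - w1) * normal_pdf(x, mean0, sd0)
    p1 = w1 * normal_pdf(x, mean1, sd1)
    return p1 / (p0 + p1) if p0 + p1 > 0 else 0.5


scores = [0.1, 0.4, 0.6, 0.8, 1.0, 1.1, 1.3, 1.5, 1.7, 2.0, 4.5, 5.0, 5.2, 5.5, 6.0]
model = fit_mixture(scores)
print(prob_correct(5.5, model), prob_correct(0.5, model))
```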

Comparison of search engine results
No single score is comprehensive
Search engines disagree
Many spectra lack confident peptide assignment
[Figure: comparison of X! Tandem, SEQUEST, and Mascot identifications; percentages shown: 38%, 14%, 28%, 3%, 2%, 1%. Searle et al. JPR 7(1), 2008]
Notes: This is why no single engine gives the best results. Q: After improvement, what percentage of spectra is identified, and how large is the improvement? 25 – 30%

Combining search engine results – harder than it looks!
Consensus boosts confidence, but...
  How to assess statistical significance?
  Gain specificity, but lose sensitivity!
  Incorrect identifications are correlated too!
How to handle weak identifications?
  Consensus vs disagreement vs abstention
  Threshold at some significance?
We apply unsupervised machine-learning....
  Lots of related work unified in a single framework.

Supervised Learning

Unsupervised Learning

PepArML Combining Results
[Figure panels: Q-TOF, MALDI, LTQ]
Edwards, et al., Clin. Prot. 5(1), 2009

Unsupervised Learning
[Figure labels: U*-TMO, U-TMO, C-TMO, H]
Edwards, et al., Clin. Prot. 5(1), 2009

Peptide Atlas A8_IP LTQ Dataset
This moderately sized, real dataset contains about spectra. The x-axis is the estimated false discovery rate; the y-axis is the number of spectra, and peptides, at that FDR. Dotted lines represent individual search engines' E-values. The Heuristic is a robust decoy-based combiner that uses only the E-values and generally slightly beats the best individual search engine. PepArML-TKO uses just Tandem, KScore, and OMSSA. PepArML-All uses all five search engines. Stress that the combiner is using only the results from the individual search engines, no new searches.

Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003

Peptides to Proteins

Peptides to Proteins
A peptide sequence may occur in many different protein sequences
  Variants, paralogues, protein families
Separation, digestion and ionization are not well understood
Proteins in sequence database are extremely non-random, and very dependent
No great tools for assessing statistical confidence of protein identifications.
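
The shared-peptide problem is easy to see with a small mapping from identified peptides back to the proteins that contain them. This sketch uses plain substring matching on made-up sequences; the function names and protein names are hypothetical, and it ignores details such as I/L ambiguity and cleavage context.

```python
from collections import defaultdict


def map_peptides_to_proteins(peptides, proteins):
    """proteins: dict of name -> sequence. Returns peptide -> sorted protein names."""
    mapping = defaultdict(list)
    for pep in peptides:
        for name, seq in proteins.items():
            if pep in seq:
                mapping[pep].append(name)
    return {pep: sorted(names) for pep, names in mapping.items()}


def unique_peptide_evidence(mapping):
    """Proteins supported by at least one peptide found in no other protein."""
    evidence = defaultdict(list)
    for pep, names in mapping.items():
        if len(names) == 1:
            evidence[names[0]].append(pep)
    return dict(evidence)


# Made-up sequences for illustration only.
proteins = {
    "P1": "MSGFLEEDELKAAR",
    "P1_variant": "MSGFLEEDELKGGR",
    "P2": "MTTTKLLLR",
}
mapping = map_peptides_to_proteins(["SGFLEEDELK", "LLLR"], proteins)
print(mapping)
print(unique_peptide_evidence(mapping))
```

A peptide shared by a protein and its variant supports the pair but cannot tell them apart, which is one reason protein-level confidence is harder to assess than peptide-level confidence.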

Summary
Protein identification from tandem mass spectra is a key proteomics technology.
Protein identifications should be treated with healthy skepticism.
All peptide / protein lists represent a triage of the data: look for ways to estimate significance.
Lots of open "applied statistics" problems!
The devil is in the details: there is no moral high ground here; whatever is most effective wins.
