Protein Identification Using Mass Spectrometry

Presentation transcript:

Protein Identification Using Mass Spectrometry Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

Proteomics: Proteins are the machines that drive much of biology; genes are merely the recipe. Proteomics is the direct characterization of a sample’s proteins en masse: What proteins are present? How much of each protein is present? 12/8/2009 BIST535 - 2009

Systems Biology (Gene / Transcript / Protein): Establish relationships by choosing related samples, global characterization, and comparison.

Gene / Transcript / Protein Measurement:

  Measurement      Predetermined     Unknown
  Discrete (DNA)   Genotyping        Sequencing
  Continuous       Gene Expression   Proteomics

Samples: Healthy / Diseased; Cancerous / Benign; Drug resistant / Drug susceptible; Bound / Unbound; Tissue specific; Cellular location specific (Mitochondria, Membrane).

2D Gel-Electrophoresis: Protein separation by molecular weight (MW) and isoelectric point (pI). Staining. Bird's-eye view of protein abundance.

2D Gel-Electrophoresis. [Figure: 2D gel image. Bécamel et al., Biol. Proced. Online 2002;4:94-104.]

Paradigm Shift: Traditional protein chemistry assay methods struggle to establish identity. Identity requires: specificity of measurement (precision), which mass spectrometry provides; and a reference for comparison (measurement → identity), which protein sequence databases provide.

Mass Spectrometer: Components are the ionizer (which receives the sample), the mass analyzer, and the detector. Ionizers: MALDI, Electro-Spray Ionization (ESI). Mass analyzers: Time-Of-Flight (TOF), Quadrupole, Ion-Trap. Detector: Electron Multiplier (EM).

Mass Spectrometer (MALDI-TOF). [Figure: instrument schematic. Labels: UV (337 nm); analyte/matrix; backing plate (grounded); source, length = s; pulse voltage; extraction grid (source voltage -Vs); field-free drift zone, length = D, Ed = 0; detector grid (-Vs); microchannel plate detector.]

Mass Spectrum

Mass is fundamental

Sample Preparation for MS/MS: Enzymatic Digest and Fractionation

Single Stage MS

Tandem Mass Spectrometry (MS/MS): Precursor selection

Tandem Mass Spectrometry (MS/MS): Precursor selection + collision induced dissociation (CID)

Peptide Fragmentation. Peptide: S-G-F-L-E-E-D-E-L-K

  MW     ion   N-terminal fragment   C-terminal fragment   ion   MW
  88     b1    S                     GFLEEDELK             y9    1080
  145    b2    SG                    FLEEDELK              y8    1022
  292    b3    SGF                   LEEDELK               y7    875
  405    b4    SGFL                  EEDELK                y6    762
  534    b5    SGFLE                 EDELK                 y5    633
  663    b6    SGFLEE                DELK                  y4    504
  778    b7    SGFLEED               ELK                   y3    389
  907    b8    SGFLEEDE              LK                    y2    260
  1020   b9    SGFLEEDEL             K                     y1    147

Peptide Fragmentation. [Figure: MS/MS spectrum of S-G-F-L-E-E-D-E-L-K, % Intensity (up to 100) vs. m/z (250 to 1000). b ions: 88, 145, 292, 405, 534, 663, 778, 907, 1020, 1166. y ions: 1166, 1080, 1022, 875, 762, 633, 504, 389, 260, 147. Labeled peaks: b3 to b9 and y2 to y9.]
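The b- and y-ion ladders for S-G-F-L-E-E-D-E-L-K can be recomputed with a short sketch. It assumes singly charged fragments and standard monoisotopic residue masses; the function name `fragment_ions` is made up for illustration and is not from the lecture. The results agree with the slide's integer values to within 1 Da; the slide does not say how its values were rounded.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "S": 87.03203, "D": 115.02694, "E": 129.04259,
    "L": 113.08406, "K": 128.09496, "F": 147.06841,
}
WATER = 18.01056   # added to y ions (C-terminal fragments keep the OH and an extra H)
PROTON = 1.00728   # charge carrier for a singly charged ion

def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values, b1..b(n-1) and y1..y(n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)           # N-terminal prefix
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)  # C-terminal suffix
    return b_ions, y_ions

b_ions, y_ions = fragment_ions("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} {b:8.2f}    y{i} {y:8.2f}")
```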

Peptide Identification. Given: the mass of the precursor ion and the MS/MS spectrum. Output: the amino-acid sequence of the peptide.

Sequence Database Search: Compares peptides from a protein sequence database with spectra. Filter peptide candidates by precursor mass and digest motif. Score each peptide against the spectrum: generate all possible peptide fragments, match putative fragments with peaks, then score and rank.
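The filter-then-score loop can be sketched as below. This is a toy illustration, not the scoring function of any real search engine: the score is a simple count of theoretical b/y fragments that fall within a tolerance of an observed peak, and the candidate list, peak list, and tolerances are invented for the example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728

def peptide_mass(peptide):
    """Neutral monoisotopic mass of the intact peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def theoretical_fragments(peptide):
    """Singly charged b and y fragment m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b + y

def shared_peak_count(peptide, peaks, tol=0.5):
    """Toy score: number of theoretical fragments matched by an observed peak."""
    return sum(any(abs(f - p) <= tol for p in peaks)
               for f in theoretical_fragments(peptide))

def search(precursor_mass, peaks, candidates, precursor_tol=1.0):
    """Filter candidates by precursor mass, then score and rank the survivors."""
    hits = [(shared_peak_count(pep, peaks), pep) for pep in candidates
            if abs(peptide_mass(pep) - precursor_mass) <= precursor_tol]
    return sorted(hits, reverse=True)

# Hypothetical observed peaks (a partial ladder) and candidate peptides.
peaks = [260.20, 292.13, 405.21, 633.31, 762.35, 875.44]
candidates = ["SGFLEEDELK", "SGFLEEDLEK", "AAAK"]
ranked = search(1165.55, peaks, candidates)
print(ranked)
```

Note that the permuted candidate SGFLEEDLEK has the same precursor mass and still matches most peaks, so the two top scores differ by a single peak. Close calls like this are why a score alone does not settle whether the top-ranked peptide is correct.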

Sequence Database Search. [Figure: the spectrum (% Intensity vs. m/z, 250 to 1000) overlaid with the fragment ladders of the candidate S-G-F-L-E-E-D-E-L-K. b ions: 88, 145, 292, 405, 534, 663, 778, 907, 1020, 1166. y ions: 147, 260, 389, 504, 633, 762, 875, 1022, 1080. Matched peaks: b3 to b9 and y2 to y9.]

Sequence Database Search: No need for complete ladders. Possible to model all known peptide fragments. Sequence permutations eliminated. All candidates have some biological relevance. Practical for high-throughput peptide identification. Correct peptide might be missing from database!

Peptide Candidate Filtering. Digestion enzyme: Trypsin. Cuts just after K or R unless followed by a P. Basic residues (K & R) at the C-terminus attract ionizing charge, leading to strong y-ions. “Average” peptide length is about 10-15 amino acids. Must allow for “missed” cleavage sites.
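The trypsin rule and the allowance for missed cleavages translate directly into an in-silico digest. The sketch below is a minimal version; the function name `tryptic_digest` and the example protein sequence are invented for illustration.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Cut after K or R unless the next residue is P; allow missed cleavages."""
    # Step 1: split the protein at every cleavage site.
    pieces, start = [], 0
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    pieces.append(protein[start:])

    # Step 2: join up to `missed_cleavages` adjacent pieces to model
    # sites that the enzyme failed to cut.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides

protein = "MKRPSGFLEEDELKAAR"  # hypothetical sequence
print(tryptic_digest(protein))
print(tryptic_digest(protein, missed_cleavages=1))
```

In the example, the R at position 3 is followed by P, so no cut is made there and RPSGFLEEDELK stays in one piece.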

Peptide Candidate Filtering. Peptide molecular weight: we only have the m/z value, so we need to determine the charge state; ion selection tolerance. Mass for each amino-acid symbol? Monoisotopic vs. average. “Default” residue mass depends on the sample preparation protocol; cysteine is almost always modified.
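Because the instrument reports m/z rather than mass, a candidate peptide has to be tested against each plausible charge state. The sketch below assumes protonated ions, so neutral mass = z × (m/z − proton mass); the charge range, tolerance, and function names are illustrative choices rather than values from the lecture.

```python
PROTON = 1.00728  # mass of a proton in Da

def neutral_mass(mz, charge):
    """Neutral mass implied by an observed m/z at the given charge state."""
    return charge * (mz - PROTON)

def matches_precursor(candidate_mass, mz, charges=(1, 2, 3), tol=0.5):
    """Charge states for which the candidate mass explains the observed m/z."""
    return [z for z in charges
            if abs(neutral_mass(mz, z) - candidate_mass) <= tol]

observed_mz = 583.78       # hypothetical precursor m/z
candidate_mass = 1165.55   # neutral monoisotopic mass of SGFLEEDELK
print(matches_precursor(candidate_mass, observed_mz))
```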

Peptide Molecular Weight. [Figure: isotope peaks of the same peptide, where i = number of C13 atoms; peaks shown for i = 0, 1, 2, 3, 4.]
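The isotope series also offers one way to recover the charge state: successive C13 peaks are separated by roughly 1.00335 Da in mass, so by about 1.00335 / z in m/z. The sketch below assumes a clean, evenly spaced series; real spectra need peak picking and tolerance handling that are not shown here.

```python
C13_SHIFT = 1.00335  # mass difference between C13 and C12 in Da

def isotope_peaks(mono_mz, charge, n=5):
    """m/z positions of the first n isotope peaks (i = 0 .. n-1 C13 atoms)."""
    return [mono_mz + i * C13_SHIFT / charge for i in range(n)]

def charge_from_spacing(peaks):
    """Infer the charge state from the average spacing of an isotope series."""
    spacing = (peaks[-1] - peaks[0]) / (len(peaks) - 1)
    return round(C13_SHIFT / spacing)

series = isotope_peaks(583.78, charge=2)
print([round(mz, 3) for mz in series])
print(charge_from_spacing(series))
```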

Peptide Scoring: Peptide fragments vary based on the instrument, the peptide’s amino-acid sequence, the peptide’s charge state, etc. Search engines model peptide fragmentation to various degrees (speed vs. sensitivity tradeoff). y-ions & b-ions occur most frequently.

Peptide Identification: High-throughput workflows demand we analyze all spectra, all the time. Spectra may not contain enough information to be interpreted correctly (…bad static on a cell phone). Peptides may not match our assumptions (…it’s all Greek to me). “Don’t know” is an acceptable answer!

Peptide Identification: Rank the best peptide identifications. Is the top ranked peptide correct?

Peptide Identification. Incorrect peptide has best score: correct peptide is missing? Potential for incorrect conclusion. What score ensures no incorrect peptides? Correct peptide has weak score: insufficient fragmentation, poor score. Potential for weakened conclusion. What score ensures we find all correct peptides?

Statistical Significance: Can’t prove particular identifications are right or wrong... ...need to know fragmentation in advance! A minimal standard for identification scores... ...better than guessing. p-value, E-value, statistical significance: for each spectrum, compare scores with those of random peptides.
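Comparing a score against those of random peptides can be sketched as an empirical p-value. The add-one correction (so the estimate is never exactly zero) and the definition of the E-value as p-value times the number of candidates scored are assumptions made for this illustration; individual search engines define and extrapolate these quantities in their own ways.

```python
def empirical_p_value(score, random_scores):
    """Fraction of random-peptide scores at least as good as `score`."""
    n_better = sum(1 for s in random_scores if s >= score)
    return (n_better + 1) / (len(random_scores) + 1)

def e_value(score, random_scores, n_candidates):
    """Expected number of random candidates scoring at least `score`."""
    return empirical_p_value(score, random_scores) * n_candidates

# Hypothetical scores of random peptides against one spectrum.
random_scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
print(empirical_p_value(9, random_scores))   # better than every random score
print(empirical_p_value(3, random_scores))   # unremarkable score
```

With only ten random scores the smallest attainable p-value is 1/11, which is why the tail of the empirical distribution is often extrapolated from a fitted model.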

Random Peptide Models: “Generate” random peptides with real-looking fragment masses. No theoretical model! Must use empirical distribution. Usually require they have the correct precursor mass. Score function can model anything we like!

Random Peptide Models. [Figure: Fenyo & Beavis, Anal. Chem., 2003]

Random Peptide Models: Truly random peptides don’t look much like real peptides. Just use (incorrect) peptides from the sequence database! Caveats: the correct peptide (non-random) may be included; peptides are not independent. Reverse sequence avoids only the first problem.

Extrapolating from the Empirical Distribution: Often, the empirical shape is consistent with a theoretical model. Geer et al., J. Proteome Research, 2004; Fenyo & Beavis, Anal. Chem., 2003.

False Positive Rate Estimation: Each spectrum is a chance to be right, wrong, or inconclusive. At any given threshold, how many peptide identifications are wrong? Computed for the entire spectral dataset, given identification criteria (SEQUEST Xcorr, E-value, Score, etc.) plus a threshold. Use “decoy” sequences (random, reverse, cross-species) and repeat the search: identifications from the decoy search must be incorrect!
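Of the decoy options listed, reversed sequences are the simplest to generate. A minimal sketch follows; the `DECOY_` name prefix and the dictionary layout are illustrative conventions, not something the lecture specifies.

```python
def make_decoys(database, prefix="DECOY_"):
    """Build a same-size decoy database by reversing each protein sequence."""
    return {prefix + name: seq[::-1] for name, seq in database.items()}

# Hypothetical two-protein target database.
targets = {"P1": "MKSGFLEEDELKAAR", "P2": "MKVVLLR"}
decoys = make_decoys(targets)
print(decoys)
```

Reversal keeps the database size, protein lengths, and amino-acid composition the same, which is what allows decoy hit counts to be compared directly with target hit counts.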

False Positive Rate Estimation: # FP in real search = # hits in decoy search. Need same size database, or rate conversion. FP Rate = (# decoy hits with score ≥ thresh) / (# hits with score ≥ thresh).
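The ratio on this slide can be computed directly from the two score lists. In the sketch below, `size_ratio` stands in for the "rate conversion" needed when the decoy database is not the same size as the target database; the parameter name and the example scores are made up for illustration.

```python
def false_positive_rate(target_scores, decoy_scores, threshold, size_ratio=1.0):
    """Estimated FP rate: decoy hits over target hits at or above the threshold.

    size_ratio = (target database size) / (decoy database size); 1.0 when
    the two databases are the same size.
    """
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    if n_target == 0:
        return 0.0
    return min(1.0, size_ratio * n_decoy / n_target)

# Hypothetical best-hit scores, one per spectrum, from each search.
target_scores = [9.1, 8.7, 8.2, 7.5, 6.9, 6.1, 5.4, 4.8, 4.1, 3.3]
decoy_scores = [5.9, 4.5, 3.9, 3.1, 2.8, 2.5, 2.2, 2.0, 1.7, 1.1]

for thresh in (6.0, 4.0, 3.0):
    rate = false_positive_rate(target_scores, decoy_scores, thresh)
    print(f"threshold {thresh}: estimated FP rate {rate:.2f}")
```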

False Positive Rate Estimation: A form of statistical significance. Search engine independent. Easy to implement. Assumes a single threshold for all spectra. Best if E-value or similar is used to compute a spectrum-normalized score.

Peptide Prophet: From the Institute for Systems Biology; Keller et al., Anal. Chem. 2002. Re-analysis of SEQUEST results. Spectrum-dependent scores (XCorr). Assumes that many of the spectra are not correctly identified.

Peptide Prophet. [Figure: distribution of spectral scores in the results. Keller et al., Anal. Chem. 2002]

Peptide Prophet: Assumes a bimodal distribution of scores, with a particular shape. Ignores database size …but it is included implicitly. Like the empirical distribution for peptide sampling, can be applied to any score function. Can be applied to any search engine’s results.
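The bimodal-mixture idea can be illustrated with a two-component model fitted by expectation-maximization. This is a deliberately simplified stand-in that uses two Gaussians on synthetic scores; it is not the model or the code of Keller et al., and every name and parameter below is an assumption made for the sketch.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_two_component_mixture(scores, iterations=100):
    """EM for a 1-D mixture of two Gaussians; component 1 is the high-score mode."""
    mu = [min(scores), max(scores)]
    spread = (max(scores) - min(scores)) / 4 or 1.0
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in scores:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in (0, 1)]
            total = sum(p) or 1e-300
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate weight, mean, and spread of each component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            mu[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, scores)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)
            weight[k] = nk / len(scores)
    return mu, sigma, weight

def probability_correct(x, mu, sigma, weight):
    """Posterior probability that a score belongs to the high-score component."""
    p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in (0, 1)]
    return p[1] / (p[0] + p[1])

# Synthetic scores: 700 "incorrect" around 0 and 300 "correct" around 5.
rng = random.Random(0)
scores = [rng.gauss(0, 1) for _ in range(700)] + [rng.gauss(5, 1) for _ in range(300)]
mu, sigma, weight = fit_two_component_mixture(scores)
print("means:", [round(m, 2) for m in mu], "weights:", [round(w, 2) for w in weight])
print("P(correct | score=6):", round(probability_correct(6.0, mu, sigma, weight), 3))
```

No database size or decoy search enters the calculation; the mixture weights are learned from the score distribution itself, which is the sense in which database size is included only implicitly.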

Comparison of search engine results: No single score is comprehensive. Search engines disagree. Many spectra lack confident peptide assignment. [Figure: overlap of spectra identified by X! Tandem, SEQUEST, and Mascot; percentages shown: 38%, 14%, 28%, 3%, 2%, 1%. Searle et al. JPR 7(1), 2008] Speaker notes: no single search engine gives the best results. Q: after improvement, what percentage of spectra is identified, and how large is the improvement? 25 – 30%.

Combining search engine results: harder than it looks! Consensus boosts confidence, but... How to assess statistical significance? Gain specificity, but lose sensitivity! Incorrect identifications are correlated too! How to handle weak identifications? Consensus vs. disagreement vs. abstention. Threshold at some significance? We apply unsupervised machine-learning... Lots of related work unified in a single framework.

Supervised Learning

Unsupervised Learning

PepArML: Combining Results. [Figure: panels for Q-TOF, MALDI, and LTQ. Edwards, et al., Clin. Prot. 5(1), 2009]

Unsupervised Learning. [Figure: curves labeled U*-TMO, U-TMO, C-TMO, and H. Edwards, et al., Clin. Prot. 5(1), 2009]

Peptide Atlas A8_IP LTQ Dataset: This moderately sized, real dataset contains about 100000 spectra. The x-axis is the estimated false discovery rate; the y-axis is the number of spectra and peptides at that FDR. Dotted lines represent individual search engines' E-values. The Heuristic is a robust decoy-based combiner that uses only the E-values and generally slightly beats the best individual search engine. PepArML-TKO uses just Tandem, KScore, and OMSSA. PepArML-All uses all five search engines. Stress that the combiner uses only the results from the individual search engines; no new searches are run.

Peptides to Proteins. [Figure: Nesvizhskii et al., Anal. Chem. 2003]

Peptides to Proteins

Peptides to Proteins: A peptide sequence may occur in many different protein sequences (variants, paralogues, protein families). Separation, digestion, and ionization are not well understood. Proteins in a sequence database are extremely non-random and very dependent. No great tools for assessing statistical confidence of protein identifications.
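The shared-peptide problem becomes concrete once identified peptides are mapped back to the database. The sketch below uses plain substring matching on a made-up three-protein database; it only separates unique from shared peptides and makes no attempt at the statistical protein inference that the slide says is still lacking.

```python
def map_peptides_to_proteins(peptides, proteins):
    """For each peptide, list the proteins whose sequence contains it."""
    return {pep: sorted(name for name, seq in proteins.items() if pep in seq)
            for pep in peptides}

def split_unique_shared(mapping):
    """Separate peptides that point to one protein from those shared by several."""
    unique = {pep: hits[0] for pep, hits in mapping.items() if len(hits) == 1}
    shared = {pep: hits for pep, hits in mapping.items() if len(hits) > 1}
    return unique, shared

# Hypothetical database: P1 and P2 share the peptide SGFLEEDELK.
proteins = {
    "P1": "MKSGFLEEDELKAAR",
    "P2": "MRSGFLEEDELKTTK",
    "P3": "MKVVLLR",
}
identified = ["SGFLEEDELK", "VVLLR", "AAR"]
mapping = map_peptides_to_proteins(identified, proteins)
unique, shared = split_unique_shared(mapping)
print("unique:", unique)
print("shared:", shared)
```

Here SGFLEEDELK alone cannot distinguish P1 from P2; only the unique peptide AAR supports P1 specifically, and nothing identified supports P2 on its own.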

Summary: Protein identification from tandem mass spectra is a key proteomics technology. Protein identifications should be treated with healthy skepticism. All peptide / protein lists represent a triage of the data; look for ways to estimate significance. Lots of open "applied statistics" problems! The devil is in the details: there is no moral high ground here, and whatever is most effective wins.