Proteomics Informatics David Fenyő
Course Information http://fenyolab.org/pi2018
Protein Identification and Quantitation: Samples → Peptides → Mass Spectrometry (intensity vs. m/z) → Identity and Quantity
Central Dogma of Molecular Biology: Replication, Transcription, Translation, Modification (P)
Central Dogma of Molecular Biology: Replication, Transcription, Translation, Modification (P) and Degradation (X); individual steps are labeled Slow or Fast
Motivating Example: Protein Regulation [Figure: ERBB2, GRB7 and ERBB4 in Breast Cancer]
Motivating Example: Protein Complexes Alber et al., Nature 2007
Motivating Example: Signaling Choudhary & Mann, Nature Reviews Molecular Cell Biology 2010
Mass Spectrometry Based Proteomics: Lysis → Fractionation → Digestion → Mass spectrometry → Peak Finding → Charge determination → De-isotoping → Integrating Peaks → Searching → Identified and Quantified Proteins
Mass Spectrometry: Ion Source → Mass Analyzer → Detector (output: intensity vs. mass/charge)
Mass Spectrometry: Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector (b and y fragment ions)
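Example data – ESI-LC-MS/MS [Figure: MS spectra over LC time and the MS/MS spectrum of an [M+2H]2+ precursor; axes: m/z (100 to 1000) vs. % Relative Abundance; labeled MS/MS peaks at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080]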
Information Content in a Single Mass Measurement [Figure: two panels, Human and S. cerevisiae; axes: Tryptic peptide mass [Da] (1000 to 3000) vs. Avg. # of matching peptides (1 to 10)]
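The idea behind this slide can be sketched in a few lines: digest a set of proteins in silico and count how many tryptic peptides fall within a tolerance of one measured mass. This is a minimal sketch, not the analysis used for the figure. The two protein sequences, the function names and the tolerance are hypothetical; the residue masses are standard monoisotopic values.

```python
import re

# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages).

    Splitting on a zero-width match requires Python 3.7 or later.
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def count_matches(proteins, mass, tol_da):
    """Number of distinct tryptic peptides within tol_da of the measured mass."""
    peptides = {p for prot in proteins for p in tryptic_peptides(prot)}
    return sum(1 for p in peptides if abs(peptide_mass(p) - mass) <= tol_da)


# Hypothetical toy "proteome" of two sequences.
proteins = ["SGFLEEDELKAAARPGK", "MKSGFLEEDEIKR"]
n_matching = count_matches(proteins, mass=1165.55, tol_da=0.05)
print(n_matching)
```

Even in this toy case two different peptides match the same mass, because leucine and isoleucine are isobaric. A single mass measurement is therefore often not enough to identify a peptide.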
Protein Identification by Mass Spectrometry: Samples → Peptides → MS/MS; MS/MS + Protein DB → Compare, score, test significance → Identified peptides and proteins
Tandem MS – Database Search: Lysis → Fractionation → Digestion → LC-MS → MS/MS. From the Sequence DB: Pick Protein → Pick Peptide → All Fragment Masses → Compare, Score, Test Significance against the MS/MS spectrum. Repeat for all peptides, all proteins and all MS/MS spectra.
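The loop on this slide can be sketched as follows: filter candidate peptides by precursor mass, compute all b and y fragment masses for each, and score each candidate against the observed peaks. This is a sketch under stated assumptions, not a real search engine. The shared-peak-count score, the tolerances, the candidate list and the precursor mass are all hypothetical; only the observed peak list is taken from the example spectrum earlier in the slides.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER = 1.00728, 18.01056


def fragment_mzs(peptide):
    """Singly charged b and y ion m/z values for an unmodified peptide."""
    masses = [MONO[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b + y


def shared_peak_count(observed, peptide, tol):
    """Number of observed peaks explained by a theoretical fragment."""
    theo = fragment_mzs(peptide)
    return sum(1 for mz in observed if any(abs(mz - t) <= tol for t in theo))


def search(observed, precursor_mass, candidates, precursor_tol=1.0, fragment_tol=0.7):
    """Score every candidate with a matching precursor mass; best score first."""
    hits = []
    for pep in candidates:
        mass = sum(MONO[aa] for aa in pep) + WATER
        if abs(mass - precursor_mass) <= precursor_tol:
            hits.append((shared_peak_count(observed, pep, fragment_tol), pep))
    return sorted(hits, reverse=True)


# Peak labels from the example MS/MS spectrum (integer m/z as shown on the slide).
observed = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778,
            875, 907, 1020, 1022, 1080]
# Hypothetical candidates and precursor mass, chosen for illustration only.
candidates = ["SGFLEEDELK", "SGFLEDEELK", "AAARPGK"]
results = search(observed, precursor_mass=1165.55, candidates=candidates)
for score, pep in results:
    print(pep, score)
```

The fragment tolerance is wide because the slide labels are integers. A score on its own says little, which is why the workflow ends with a significance test.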
Search Results
Search Results: Most proteins show very reproducible peptide patterns
Spectrum Library Search: Lysis → Fractionation → Digestion → LC-MS/MS → MS/MS. From the Spectrum Library: Pick Spectrum → Compare, Score, Test Significance against the MS/MS spectrum. Repeat for all spectra and all MS/MS → Identified Proteins
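One common way to compare a query spectrum with a library spectrum is a normalized dot product (cosine similarity) on binned intensities. The slide does not name a scoring function, so the choice of cosine similarity, the 1 Da bin width, and the spectra below are assumptions made for illustration.

```python
import math
from collections import defaultdict


def binned(spectrum, bin_width=1.0):
    """Sum intensities of (m/z, intensity) pairs into fixed-width m/z bins."""
    vec = defaultdict(float)
    for mz, intensity in spectrum:
        vec[int(mz // bin_width)] += intensity
    return vec


def cosine(spec_a, spec_b):
    """Cosine similarity of two binned spectra (0 = no shared bins, 1 = identical)."""
    a, b = binned(spec_a), binned(spec_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_match(query, library):
    """Name of the library spectrum most similar to the query."""
    return max(library, key=lambda name: cosine(query, library[name]))


# Hypothetical query and library spectra as (m/z, intensity) pairs.
query = [(260.2, 50.0), (389.2, 100.0), (504.3, 80.0)]
library = {
    "pep_A": [(260.1, 40.0), (389.3, 100.0), (504.2, 70.0)],
    "pep_B": [(175.1, 100.0), (389.2, 20.0), (650.4, 90.0)],
}
print(best_match(query, library))
```

Fixed bins are the simplest option, but two peaks that straddle a bin boundary will not be matched. Real tools usually handle this with a mass tolerance instead.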
Interpretation of Mass Spectra [Figure, built up over ten slides: MS/MS spectrum of the [M+2H]2+ precursor; axes: m/z (100 to 1000) vs. % Relative Abundance; labeled peaks at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080]
Residues (top to bottom): K, L, E, D, ?, ?, ?, F, G, S (three residues are not labeled)
y ions (m/z, aligned with the residues, top to bottom): 147, 260, 389, 504, 633, 762, 875, 1022, 1080
b ions (m/z, aligned with the residues, top to bottom): 1166, 1020, 907, 778, 663, 534, 405, 292, 145, 88
Mass differences highlighted between peaks: 113 and 129
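The b and y ion labels can be reproduced from residue masses. The sketch below assumes the peptide is SGFLEEDELK, which is consistent with the labeled m/z values and the highlighted mass differences (113 for L or I, 129 for E). The slides leave three residues unlabeled, so treat this sequence as an assumption. The residue masses are standard monoisotopic values, which is why the results differ from the integer slide labels by up to about 0.5.

```python
# Monoisotopic residue masses in Da (only the residues needed here).
MONO = {"S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
        "E": 129.04259, "D": 115.02694, "K": 128.09496}
PROTON, WATER = 1.00728, 18.01056


def b_ions(peptide):
    """Singly charged b ions: N-terminal fragments, b1 .. b(n-1)."""
    masses = [MONO[aa] for aa in peptide]
    return [sum(masses[:i]) + PROTON for i in range(1, len(masses))]


def y_ions(peptide):
    """Singly charged y ions: C-terminal fragments, y1 .. y(n-1)."""
    masses = [MONO[aa] for aa in peptide]
    return [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]


peptide = "SGFLEEDELK"  # assumed sequence, see the note above
for i, (b, y) in enumerate(zip(b_ions(peptide), y_ions(peptide)), start=1):
    print(f"b{i} = {b:8.2f}    y{i} = {y:8.2f}")
```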
De Novo Sequencing: MS/MS spectrum → Mass Differences → compare with Amino acid masses → Sequences consistent with spectrum [Figure: MS/MS spectrum of the [M+2H]2+ precursor; axes: m/z (250 to 1000) vs. % Relative Abundance; labeled peaks at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080]
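A minimal sketch of the mass-difference step: take consecutive peaks of one ion series, look up each difference in a table of amino acid masses, and enumerate the sequences consistent with the ladder. It uses the b ion labels from the preceding slides and nominal (integer) residue masses, since the slide labels are integers. The function names are hypothetical. A real de novo algorithm must also decide which peaks belong to which series, which this sketch takes as given.

```python
from itertools import product

# Nominal residue masses; isobaric residues share one entry.
NOMINAL = {57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
           113: "LI", 114: "N", 115: "D", 128: "QK", 129: "E", 131: "M",
           137: "H", 147: "F", 156: "R", 163: "Y", 186: "W"}


def read_ladder(peaks):
    """Candidate residues for each gap between consecutive ladder peaks."""
    peaks = sorted(peaks)
    return [NOMINAL.get(round(hi - lo), "?") for lo, hi in zip(peaks, peaks[1:])]


def consistent_sequences(peaks):
    """All residue strings consistent with the ladder (one choice per gap)."""
    return ["".join(choice) for choice in product(*read_ladder(peaks))]


# b ion labels from the slides (b1 .. b9).
b_ladder = [88, 145, 292, 405, 534, 663, 778, 907, 1020]
print(read_ladder(b_ladder))
print(consistent_sequences(b_ladder))
```

The ladder gives residues 2 to 9 of the peptide. Because L and I have the same mass, four sequences remain consistent with the spectrum.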
Significance Testing: False protein identifications are caused by random matching, so an objective criterion for testing the significance of protein identification results is necessary. The significance of protein identifications can be tested once the distribution of scores for false results is known.
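Given a sample of scores from matches known to be false, one simple test is an empirical p-value: the fraction of false scores at least as high as the observed score. The sketch below assumes such a sample is available (for example from searching randomized sequences, which is an assumption and not something the slide specifies). The scores are made up for illustration.

```python
def empirical_p_value(score, null_scores):
    """Estimated probability that a false match scores at least `score`.

    The +1 in numerator and denominator keeps the estimate above zero
    when no false score reaches the observed one.
    """
    n_at_least = sum(1 for s in null_scores if s >= score)
    return (n_at_least + 1) / (len(null_scores) + 1)


# Hypothetical scores of false matches.
null_scores = [3, 5, 4, 6, 2, 5, 7, 4, 3, 5, 6, 4, 5, 3, 4, 6, 5, 4, 8]

for observed_score in (15, 6):
    print(observed_score, empirical_p_value(observed_score, null_scores))
```

The smallest p-value this estimate can report is 1 / (N + 1) for N false scores, so a large null sample is needed to support strong claims.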
Protein Quantitation by Mass Spectrometry: Sample i, Protein j (C_ij) → Lysis → Fractionation → Digestion → Peptide k → LC-MS → MS (I_ik)
Protein Quantitation by Mass Spectrometry: Light + Heavy → Lysis and Fractionation → Digestion → LC-MS → MS (H and L peaks for Sample i, Protein j, Peptide k). Assumption: All losses after mixing are identical for the heavy and light isotopes. Oda et al. PNAS 96 (1999) 6591; Ong et al. MCP 1 (2002) 376
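Under the stated assumption, the light/heavy intensity ratio of a peptide reflects the relative amount of its protein in the two samples. A minimal sketch follows, with hypothetical intensities. Summarizing a protein by the median of its peptide ratios is one common choice and is an assumption here, not something the slide prescribes.

```python
from statistics import median


def protein_ratio(peptide_intensities):
    """Median light/heavy ratio over a protein's (light, heavy) peptide pairs.

    Pairs with a zero heavy intensity are skipped; statistics.median raises
    StatisticsError if no usable pair remains.
    """
    ratios = [light / heavy for light, heavy in peptide_intensities if heavy > 0]
    return median(ratios)


# Hypothetical (light, heavy) peak intensities for three peptides of one protein.
pairs = [(2.0e6, 1.0e6), (4.2e5, 2.0e5), (9.0e5, 5.0e5)]
print(protein_ratio(pairs))
```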
Protein Quantitation: Lysis → Fractionation → Digestion → LC-MS → MS and MS/MS
Shotgun proteomics, Data Dependent Acquisition (DDA): 1. Records m/z. 2. Selects peptides based on abundance and fragments them. 3. Protein database search for peptide identification.
Targeted MS, uses a predefined set of peptides: 1. Select precursor ion. 2. Precursor fragmentation. 3. Use precursor-fragment pairs for identification.
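Step 2 of the data dependent workflow, selecting peptides by abundance, is often implemented as a "top N" rule: after each MS scan, the N most intense precursors are fragmented. The sketch below illustrates that rule only. The function name, the value of N and the peak list are hypothetical, and real instruments apply further criteria that are not modeled here.

```python
def select_precursors(ms1_peaks, top_n=3):
    """Return the m/z values of the top_n most intense (m/z, intensity) peaks."""
    ranked = sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in ranked[:top_n]]


# Hypothetical MS scan as (m/z, intensity) pairs.
ms1_peaks = [(400.2, 1.0e4), (583.8, 9.0e5), (622.3, 3.0e5),
             (701.4, 5.0e4), (815.9, 7.0e5)]
print(select_precursors(ms1_peaks))
```

Because selection follows abundance, low-abundance peptides may never be fragmented. Targeted methods, which work from a predefined peptide list, address that limitation.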
Proteogenomics: Samples → Peptides → MS/MS; MS/MS + Protein DB → Compare, score, test significance → Identified peptides and proteins
Proteogenomics: Samples → Next-generation sequencing of the genome and transcriptome → Sample-specific Protein DB; Samples → Peptides → MS/MS; MS/MS + Sample-specific Protein DB → Compare, score, test significance → Identified peptides and proteins
Proteogenomics: Non-Tumor Sample → Genome sequencing → Identify germline variants. Tumor Sample → Genome sequencing and RNA-Seq → Identify alternative splicing, somatic variants and novel expression. Variants (TCGAGAGCTG vs. TCGATAGCTG), Alt. Splicing (Exon 1, Exon 2, Exon 3), Novel Expression (Exon X) and Fusion Genes (Gene X, Gene Y), together with the Reference Human Database (Ensembl), make up the Tumor Specific Protein DB.
Proteogenomics [Figure: ERBB2 in Breast Cancer and Ovarian Cancer]
Posttranslational Modifications: Peptide with two possible modification sites and a matching MS/MS spectrum (intensity vs. m/z). Which assignment does the data support? 1, 1 or 2, or 1 and 2?
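One way to approach the question is to compute the fragment ions for each candidate site and count how many observed peaks each assignment explains. The sketch below does this for a single modification. Every specific in it is hypothetical: the peptide ASGTK, phosphorylation (+79.96633 Da) as the modification, the synthetic peak list and the tolerance. Matched-peak counting is only one of several possible scoring schemes.

```python
# Monoisotopic residue masses in Da (only the residues needed here).
MONO = {"A": 71.03711, "S": 87.03203, "G": 57.02146, "T": 101.04768, "K": 128.09496}
PROTON, WATER, PHOSPHO = 1.00728, 18.01056, 79.96633


def fragment_mzs(peptide, mods):
    """Singly charged b and y ions; `mods` maps 0-based residue index to a mass shift."""
    masses = [MONO[aa] + mods.get(i, 0.0) for i, aa in enumerate(peptide)]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b + y


def matched_peaks(observed, peptide, mods, tol=0.02):
    """Number of observed peaks explained by one site assignment."""
    theo = fragment_mzs(peptide, mods)
    return sum(1 for mz in observed if any(abs(mz - t) <= tol for t in theo))


peptide = "ASGTK"  # hypothetical peptide with two candidate sites, S2 and T4
assignments = {"site 1 (S2)": {1: PHOSPHO}, "site 2 (T4)": {3: PHOSPHO}}
observed = [72.04, 147.11, 239.04, 248.16, 296.06, 472.18]  # synthetic peak list

scores = {name: matched_peaks(observed, peptide, mods)
          for name, mods in assignments.items()}
print(scores)
```

Only fragments that contain exactly one of the two sites differ between the assignments. If none of those fragments is observed, the data support "1 or 2" but cannot tell them apart.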
Protein Interactions: proteins A, B, E and F → Digestion → Mass spectrometry → Identification
Data Analysis - Normalization: Raw Data vs. Normalized (mean=0, std=1)
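Normalizing to mean 0 and standard deviation 1 is the z-score transform: subtract the mean and divide by the standard deviation. A minimal sketch with made-up values follows. It uses the population standard deviation; the slide does not say whether the population or sample form was used.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to mean 0 and (population) standard deviation 1.

    Raises ZeroDivisionError if all values are identical.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


raw = [20.0, 22.0, 24.0, 26.0, 28.0]  # hypothetical log-intensities of one sample
normalized = zscore(raw)
print(normalized)
```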
Data Analysis - Normalization: Normalized 3 replicates vs. Normalized 3 replicates + one more replicate a few months later
Data Analysis
Molecular Markers: A molecular signature is a computational or mathematical model that links high-dimensional molecular information to phenotype or other response variable of interest. FDA calls them “in vitro diagnostic multivariate assays”.