Previous Lecture: Regression and Correlation
This Lecture: Proteomics Informatics (Introduction to Biostatistics and Bioinformatics)
Proteomics Informatics – Learning Objectives: structure of mass spectrometry data; protein identification; protein quantitation
Protein Identification and Quantitation by Mass Spectrometry Samples Peptides Mass Spectrometry Quantity intensity m/z Identity
Sample preparation for protein identification, characterization and quantitation: Lysis → Fractionation → Digestion → Mass spectrometry
Overview of Mass spectrometry: Ion Source → Mass Analyzer → Detector, giving a spectrum of intensity vs mass/charge
Mass Spectrometry (MS)
Example data – MALDI-TOF Peptide intensity vs m/z
Peptide Fragmentation: Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector. Fragment ion types: b and y.
Liquid Chromatography (LC)-MS/MS: Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector. [Figure: a series of spectra (intensity vs mass/charge) acquired over time.]
Example data – ESI-LC-MS/MS. MS: peptide intensity vs m/z vs time. MS/MS: fragment intensity vs m/z. [Figure: MS/MS spectrum, % relative abundance vs m/z (250 to 1000), with the precursor labeled [M+2H]2+ and fragment peaks labeled at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080.]
Charge-State Distributions: $m/z = (M + nH)/n$, where M is the molecular mass, n the number of charges, and H the mass of a proton. [Figure: intensity vs mass/charge for a peptide and for a protein, each measured by MALDI and by ESI; the peptide panels are labeled with charge states 1+ to 4+, and the protein panels with charge-state labels including 1+ to 5+, 27+ and 31+.]
Charge-State Example: $m/z = (M + nH)/n$, where M is the molecular mass, n the number of charges, and H the mass of a proton. Example: a peptide of mass 898 (taking H ≈ 1) carrying 1 H+ appears at (898 + 1) / 1 = 899 m/z; carrying 2 H+ at (898 + 2) / 2 = 450 m/z; carrying 3 H+ at (898 + 3) / 3 = 300.3 m/z.
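The arithmetic in the charge-state example can be checked with a few lines of Python. This is a minimal sketch: the function name `mz` and the constant `PROTON_MASS` are illustrative names, not part of the lecture material. Passing `proton_mass=1.0` reproduces the rounded proton mass used on the slide.

```python
# Proton mass in Da; the slide's worked example rounds this to 1.
PROTON_MASS = 1.007276


def mz(mass, n_charges, proton_mass=PROTON_MASS):
    """Return m/z = (M + n*H) / n for a molecule of mass M carrying n protons."""
    if n_charges < 1:
        raise ValueError("n_charges must be a positive integer")
    return (mass + n_charges * proton_mass) / n_charges


# Reproduce the slide's example: peptide of mass 898, H rounded to 1.
for n in (1, 2, 3):
    print(f"{n}+: {mz(898, n, proton_mass=1.0):.1f} m/z")
```

With the default proton mass the values shift slightly upward, which matters once the mass accuracy is better than about 1 Da.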
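Isotope Distributions. Most abundant isotopes: 12C, 14N, 16O, 1H, 32S. Natural abundance of the heavier isotopes: 2H 0.015%; 13C 1.11%; 15N 0.366%; 17O 0.038%, 18O 0.200%; 33S 0.75%, 34S 4.21%, 36S 0.02%. [Figure: intensity vs m/z, with isotope peaks at +1 Da, +2 Da and +3 Da.] Considering only 12C and 13C: p = 0.0111, n is the number of C atoms in the peptide, m is the number of 13C atoms in the peptide, and $T_m$ is the relative intensity of the peak for the peptide with m 13C atoms: $T_m = \binom{n}{m} p^m (1-p)^{n-m}$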
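The binomial expression for the 13C-only isotope distribution is easy to evaluate directly. The sketch below is illustrative only: the function name `isotope_intensities` and the example carbon counts (50 and 100) are assumptions, and the contributions of 2H, 15N, 17O/18O and the sulfur isotopes are ignored, as in the slide's simplified model.

```python
from math import comb

P_13C = 0.0111  # natural abundance of 13C, as given on the slide


def isotope_intensities(n_carbons, max_m=4, p=P_13C):
    """Return [T_0, ..., T_max_m] with T_m = C(n, m) * p**m * (1 - p)**(n - m)."""
    return [
        comb(n_carbons, m) * p**m * (1 - p) ** (n_carbons - m)
        for m in range(max_m + 1)
    ]


# Hypothetical peptides with 50 and 100 carbon atoms.
for n in (50, 100):
    formatted = ", ".join(f"{t:.3f}" for t in isotope_intensities(n))
    print(f"n = {n}: {formatted}")
```

In this model the ratio of the first two peaks is T_1/T_0 = n p / (1 - p), so the +1 Da peak overtakes the monoisotopic peak once the peptide has roughly 90 or more carbon atoms.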
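Isotope Clusters and Charge State: the spacing between isotope peaks on the m/z axis is 1 for charge 1+, 0.5 for 2+, and 0.33 for 3+. [Figure: intensity vs m/z for each charge state.]
What is the Charge State? First cluster: 713.3225, 713.8239, 714.3251, 714.8263 (spacing between the isotopes is 0.5 Da). Second cluster: 432.8990, 433.2330, 433.5671, 433.9014 (spacing between the isotopes is 0.33 Da).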
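Because neighboring isotope peaks differ by about 1 Da in mass, their spacing on the m/z axis is about 1/z, so the charge state follows from the observed spacing. The sketch below applies this to the two peak clusters listed above; the function name is illustrative, and the isotope mass difference is approximated as 1.0 Da, as on the slides.

```python
def charge_from_isotope_peaks(peaks_mz, isotope_spacing=1.0):
    """Estimate the charge state z from the mean m/z spacing of isotope peaks."""
    if len(peaks_mz) < 2:
        raise ValueError("need at least two isotope peaks")
    ordered = sorted(peaks_mz)
    spacings = [b - a for a, b in zip(ordered, ordered[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    # Spacing on the m/z axis is (isotope mass difference) / z.
    return round(isotope_spacing / mean_spacing)


cluster_1 = [713.3225, 713.8239, 714.3251, 714.8263]
cluster_2 = [432.8990, 433.2330, 433.5671, 433.9014]
print(charge_from_isotope_peaks(cluster_1))
print(charge_from_isotope_peaks(cluster_2))
```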
Protein Identification by Mass Spectrometry Samples Peptides Mass Spectrometry intensity m/z Identity
Protein Identification - Exercise
1. Protein identification: NUP1 was genomically tagged with protein A and affinity purified under two conditions, and the resulting protein mixtures were analyzed by liquid chromatography-mass spectrometry (LC-MS). Search the resulting spectra (NUP1-less-stringent-wash.mgf, NUP1-more-stringent-wash.mgf) using X! Tandem (http://h.thegpm.org/tandem/thegpm_tandem.html). Change the taxon to “S. cerevisiae (budding yeast)” but otherwise keep the default parameter settings.
a. Look at the list of identified proteins and explain why they are found in this sample. More information is available by selecting the “go”, “path”, “ppi”, “doms” and “string” tabs at the top of the page.
b. Select the “mh” display at the top right of the page and zoom in to +/-100 ppm (the default precursor mass accuracy used in the search). What precursor mass accuracy should we have used? Zoom in further and determine what precursor mass accuracy could have been used if the spectra were recalibrated (so that the error distribution is centered at zero).
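For part b it helps to be explicit about what a mass error in ppm is and what recalibration does to the error distribution. The sketch below is only an illustration: the function names are made up, the error values are hypothetical rather than taken from the NUP1 data, and recalibration is modeled as subtracting the median error, which is one simple choice among several.

```python
from statistics import median


def ppm_error(observed_mass, theoretical_mass):
    """Relative mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def recalibrate(errors_ppm):
    """Subtract the median error so that the distribution is centered at zero."""
    center = median(errors_ppm)
    return [e - center for e in errors_ppm]


# Hypothetical precursor mass errors (ppm) with a systematic offset near +20 ppm.
errors = [18.0, 22.0, 20.0, 19.0, 21.0]
print("tolerance needed before recalibration:", max(abs(e) for e in errors))
print("tolerance needed after recalibration:", max(abs(e) for e in recalibrate(errors)))
```

A systematic offset forces a wide search tolerance even when the spread of the errors is narrow; after centering, the tolerance only has to cover the spread.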
Identification – Tandem MS
Tandem MS – Sequence Confirmation
Residue labels on the slide: K L E D F G S
y ions (m/z): 147, 260, 389, 504, 633, 762, 875, 1022, 1080
b-ion column (m/z): 88, 145, 292, 405, 534, 663, 778, 907, 1020, and 1166 for the full peptide
Annotated mass differences between adjacent peaks: 113 and 129
[Figure: MS/MS spectrum, % relative abundance vs m/z (250 to 1000), with the precursor labeled [M+2H]2+ and the fragment peaks listed above.]
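The b- and y-ion ladders on this slide can be recomputed from residue masses. The sketch below assumes the peptide is SGFLEEDELK, which is the sequence the de novo slides arrive at when L is taken for I/L and K for K/Q; it uses standard monoisotopic residue masses and singly charged fragments. The function names are illustrative. The computed values agree with the slide's labels to within 1 Da; the slide labels appear to be rounded or nominal masses.

```python
# Monoisotopic residue masses (Da) for the residues in this peptide only.
MONO = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b ions: N-terminal prefixes plus one proton."""
    total, ions = 0.0, []
    for aa in peptide[:-1]:
        total += MONO[aa]
        ions.append(total + PROTON)
    return ions


def y_ions(peptide):
    """Singly charged y ions: C-terminal suffixes plus water plus one proton."""
    total, ions = 0.0, []
    for aa in reversed(peptide[1:]):
        total += MONO[aa]
        ions.append(total + WATER + PROTON)
    return ions


PEPTIDE = "SGFLEEDELK"  # assumed sequence, see the note above
print("b:", [round(m, 2) for m in b_ions(PEPTIDE)])
print("y:", [round(m, 2) for m in y_ions(PEPTIDE)])
```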
Tandem MS – de novo Sequencing: mass differences between fragment peaks are compared with the amino acid masses to find the sequences consistent with the spectrum. [Figure: MS/MS spectrum, % relative abundance vs m/z (250 to 1000), with the precursor labeled [M+2H]2+ and fragment peaks at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080.]
Tandem MS – de novo Sequencing. The mass differences give the partial sequence …GF(I/L)EEDE(I/L)… or, read in the opposite direction, …(I/L)EDEE(I/L)FG…. Peptide M+H = 1166. 1166 - 1079 = 87 => S, giving SGF(I/L)EEDE(I/L)…. 1166 - 1020 - 18 = 128 => K or Q, giving SGF(I/L)EEDE(I/L)(K/Q).
Tandem MS – de novo Sequencing. Challenges in de novo sequencing: neutral loss (-H2O, -NH3); modifications; background peaks; incomplete information.
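The reasoning on this slide, reading residues off the mass differences between consecutive fragment peaks, can be sketched as follows. The example uses the b-ion masses from the sequence-confirmation slides and nominal (integer) residue masses, which is an assumption that suits the integer peak labels in this deck; real data would need accurate masses and a tolerance. The lookup table and function name are illustrative.

```python
# Nominal residue masses (Da). I/L and K/Q cannot be told apart at this accuracy.
RESIDUE_BY_NOMINAL_MASS = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "(I/L)", 114: "N", 115: "D", 128: "(K/Q)", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}


def read_b_ladder(b_ions, peptide_mh):
    """Read a sequence from a complete series of singly charged b ions.

    b_1 is the first residue plus a proton (1 Da); consecutive differences
    give the following residues; the last residue is M+H minus the last
    b ion minus water (18 Da).
    """
    masses = [b_ions[0] - 1]
    masses += [b - a for a, b in zip(b_ions, b_ions[1:])]
    masses.append(peptide_mh - b_ions[-1] - 18)
    # "X" marks a mass difference that matches no single residue.
    return "".join(RESIDUE_BY_NOMINAL_MASS.get(m, "X") for m in masses)


b_ladder = [88, 145, 292, 405, 534, 663, 778, 907, 1020]
print(read_b_ladder(b_ladder, peptide_mh=1166))
```

The sketch assumes a complete, already identified ion series, so it does not address the difficulties that come with missing peaks, neutral losses, modifications or background peaks.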
Tandem MS – Database Search. Experiment: Lysis → Fractionation → Digestion → LC-MS → MS/MS. Sequence DB: pick a protein, pick a peptide, calculate all fragment masses; repeat for all peptides and for all proteins. Compare with the MS/MS spectrum, score, and test significance; repeat for all MS/MS spectra.
Information Content in a Single Mass Measurement. [Figure: number of matching peptides and average number of matching peptides (1 to 10) vs tryptic peptide mass (1000 to 3000 Da), shown for Human and for S. cerevisiae.]
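The database-search loop outlined above can be sketched in a few functions. Everything here is a simplified illustration under stated assumptions: the two-protein database is made up, digestion cleaves after every K or R with no missed cleavages, the score is just a count of matched fragments, and the significance test is left out. It is not how X! Tandem or any other search engine scores spectra. The observed peaks are the labeled fragment masses from the example spectrum in this lecture.

```python
# Monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def tryptic_peptides(protein):
    """Cleave after every K or R (no missed cleavages, no proline rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mh(peptide):
    """Singly protonated peptide mass, M+H."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def fragment_mz(peptide):
    """All singly charged b and y ion masses of a peptide."""
    residues = [MONO[aa] for aa in peptide]
    b = [sum(residues[:i]) + PROTON for i in range(1, len(residues))]
    y = [sum(residues[-i:]) + WATER + PROTON for i in range(1, len(residues))]
    return b + y


def score(peptide, peaks, tol):
    """Count theoretical fragments lying within tol of any observed peak."""
    return sum(any(abs(f - p) <= tol for p in peaks) for f in fragment_mz(peptide))


def search(peaks, precursor_mh, database, precursor_tol=1.0, fragment_tol=1.0):
    """Score every peptide whose M+H matches the precursor; best hit first."""
    hits = []
    for protein_id, sequence in database.items():
        for peptide in tryptic_peptides(sequence):
            if abs(peptide_mh(peptide) - precursor_mh) <= precursor_tol:
                hits.append((score(peptide, peaks, fragment_tol), protein_id, peptide))
    return sorted(hits, reverse=True)


# Hypothetical two-protein database; the second entry swaps one D and one E.
TOY_DB = {"toy_protein_1": "AAKSGFLEEDELKGGR", "toy_protein_2": "TTRSGFLEDEELKVVK"}
PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080]
for hit in search(PEAKS, 1166, TOY_DB):
    print(hit)
```

The two candidate peptides have the same composition, so the precursor mass alone cannot separate them; only the fragment masses do.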
Protein Identification and Quantitation by Mass Spectrometry Samples Peptides Mass Spectrometry Quantity intensity m/z
Protein Quantitation by Mass Spectrometry Sample i Protein j Peptide k Lysis Fractionation Digestion MS LC-MS
Quantitation – Label-Free (MS): Sample i, Protein j, Peptide k. Workflow: Lysis, Fractionation, Digestion, LC-MS, MS. Assumption: constant for all samples.
Quantitation – Metabolic Labeling: light and heavy samples; Sample i, Protein j, Peptide k. Workflow: Lysis, Fractionation, Digestion, LC-MS, MS (heavy and light peaks, H and L). References: Oda et al. PNAS 96 (1999) 6591; Ong et al. MCP 1 (2002) 376.
Quantitation – Labeled Synthetic Peptides. Slide labels: Lysis, Fractionation, Digestion, Synthetic Peptides (Heavy), Light, Enrichment with Peptide antibody, LC-MS, MS (H and L peaks). Assumption: All losses after mixing are identical for the heavy and light isotopes and … References: Anderson, N.L., et al. Proteomics 3 (2004) 235-44; Gerber et al. PNAS 100 (2003) 6940.
Estimating peptide quantity: peak height, peak area, or curve fitting. [Figure: a peak in intensity vs m/z, annotated with the three measures.]
What is the best way to estimate quantity?
Peak height: resistant to interference; poor statistics.
Peak area: better statistics; more sensitive to interference.
Curve fitting: better statistics; needs to know the peak shape; slow.
Spectrum counting: resistant to interference; easy to implement; poor statistics for low-abundance proteins.
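The difference between the first two estimators is easy to see in code. The sketch below computes peak height and peak area for a sampled peak; the function names and the intensity values are made up for illustration, and the area uses simple trapezoidal integration. Curve fitting is left out because it needs an assumed peak shape.

```python
def peak_height(intensities):
    """Peak height: the single largest intensity value."""
    return max(intensities)


def peak_area(mz_values, intensities):
    """Peak area by trapezoidal integration of intensity over m/z."""
    area = 0.0
    for i in range(1, len(mz_values)):
        width = mz_values[i] - mz_values[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * width
    return area


# Hypothetical sampled peak profile.
mz_values = [500.0, 500.1, 500.2, 500.3, 500.4]
intensities = [0.0, 40.0, 100.0, 40.0, 0.0]
print("height:", peak_height(intensities))
print("area:", round(peak_area(mz_values, intensities), 3))
```

The height depends on a single data point, which is why its statistics are poor. The area sums every point across the peak, so it averages noise better but also picks up any interfering signal that overlaps the integration window.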
Proteomics Informatics - Summary: structure of mass spectrometry data; protein identification; protein quantitation
Next Lecture: Gene Expression
Protein Quantitation - Exercise
2. Protein quantitation: Two breast tumor xenografts (one basal and one luminal) were analyzed by LC-MS, and the spectral counts for the identified peptides in the different analyses are listed in two-sample-three-replicate-comparison.txt.
a. Compare replicate one of Sample 1 with replicate one of Sample 2 using proteomics_no_replicate.py. Which differences are significant?
b. Compare replicates one and two of Sample 1 using proteomics_one_replicate.py. Compare the result with the distribution in 2a. Which differences are significant in 2a?
c. Compare the three replicates of Sample 1 with the three replicates of Sample 2 using proteomics_three_replicates.py. Which differences are significant?
d. When a protein is not observed in one sample, how many spectra do we need to observe in the other sample to say that there is a significant difference?
Phosphorylation Exercise: an unmodified peptide and its theoretical fragment ions (can be given as a hint to show what changes).
Spectrum of the phosphorylated peptide (can be given as a hint to show what changes).
Spectrum of the peptide phosphorylated at a different site (can be given as a hint to show what changes).