Protein Sequencing and Identification by Mass Spectrometry
Masses of Amino Acid Residues
Protein Backbone
[Figure: polypeptide chain H...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-…OH, drawn from the N-terminus to the C-terminus; the side chains R_{i-1}, R_i, R_{i+1} belong to AA residues i-1, i, and i+1.]
Peptide Fragmentation
Peptides tend to fragment along the backbone.
Fragments can also lose neutral chemical groups such as NH3 and H2O.
[Figure: collision-induced dissociation of the protonated (H+) backbone H...-HN-CH(R_{i-1})-CO... NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-…OH into a prefix fragment and a suffix fragment.]
Breaking Protein into Peptides and Peptides into Fragment Ions
Proteases, e.g. trypsin, break a protein into peptides.
A tandem mass spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece.
The mass spectrometer accelerates the fragment ions; heavier ions accelerate more slowly than lighter ones.
The mass spectrometer measures the mass/charge ratio of an ion.
N- and C-terminal Peptides N-terminal peptides C-terminal peptides
Terminal peptides and ion types Peptide Mass (D) = 415 Peptide Mass (D) – 18 = 397 without
N- and C-terminal Peptides
Reconstruct the peptide from the set of masses of its fragment ions (the mass spectrum).
Peptide Fragmentation
[Figure: backbone of a four-residue peptide (side chains R1 to R4, N-terminal NH3+, C-terminal COOH) with cleavage sites labeled by ion type: a2, a3, b2, b3, y1, y2, y3, and the neutral-loss ions b2-H2O, b3-NH3, y3-H2O, y2-NH3.]
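Mass Spectra
[Figure: mass spectrum of the peptide GVDLK on a mass axis starting at 0, with gaps between peaks labeled by residues (57 Da = 'G', 99 Da = 'V', then D, L, K) and an H2O neutral-loss label.]
The peaks in the mass spectrum:
Prefix and suffix fragments.
Fragments with neutral losses (-H2O, -NH3).
Noise and missing peaks.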
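As a rough illustration of where the prefix and suffix peaks of GVDLK would fall, the sketch below sums nominal (integer) residue masses. It is a simplification: the function names are hypothetical, and the water and proton offsets of real ions are deliberately left out, so only the gaps between peaks are meaningful.

```python
from itertools import accumulate

# Nominal (integer) residue masses in Da for the residues of GVDLK.
RESIDUE_MASS = {"G": 57, "V": 99, "D": 115, "L": 113, "K": 128}

def prefix_masses(peptide):
    """Cumulative residue masses read from the N-terminus."""
    return list(accumulate(RESIDUE_MASS[aa] for aa in peptide))

def suffix_masses(peptide):
    """Cumulative residue masses read from the C-terminus."""
    return list(accumulate(RESIDUE_MASS[aa] for aa in reversed(peptide)))

print(prefix_masses("GVDLK"))  # [57, 156, 271, 384, 512]
print(suffix_masses("GVDLK"))  # [128, 241, 356, 455, 512]
```

The first gap (57) reads as G and the second (156 - 57 = 99) as V, matching the labels on the slide.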
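Protein Identification with MS/MS
[Figure: the peptide GVDLK is fragmented by MS/MS into a spectrum (intensity vs. mass, axis starting at 0); peptide identification maps the spectrum back to the peptide.]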
Tandem Mass-Spectrometry
Breaking Proteins into Peptides
[Figure: the protein MPSERGTDIMRPAKID is digested into peptides (MPSER, GTDIMR, PAKID, ……), which are separated by HPLC and passed to MS/MS.]
Mass Spectrometry Matrix-Assisted Laser Desorption/Ionization (MALDI) From lectures by Vineet Bafna (UCSD)
Tandem Mass Spectrometry
[Figure: instrument layout with LC, ion source, MS-1, collision cell, and MS-2, shown with example MS and MS/MS scans (scans 1707 and 1708).]
Protein Identification by Tandem Mass Spectrometry
[Figure: the MS/MS instrument produces a spectrum, from which the sequence is obtained either by database search (Sequest) or by de novo interpretation (Sherenga).]
Tandem Mass Spectrum
Tandem mass spectrometry (MS/MS) mainly generates partial N- and C-terminal peptides.
The spectrum consists of different ion types because peptides can be broken in several places.
Chemical noise often complicates the spectrum.
The spectrum is represented in 2-D: mass/charge axis vs. intensity axis.
De Novo vs. Database Search
[Figure: the same spectrum, annotated with the residues W R A C V G E K D W L P T L T, is interpreted by de novo sequencing and by database search; both report AVGELTK, with candidates compared by mass and score.]
De novo: database of all peptides = 20^n: AAAAAAAA, AAAAAAAC, AAAAAAAD, AAAAAAAE, AAAAAAAG, AAAAAAAF, AAAAAAAH, AAAAAAI, ..., AVGELTI, AVGELTK, AVGELTL, AVGELTM, ..., YYYYYYYS, YYYYYYYT, YYYYYYYV, YYYYYYYY
Database search: database of known peptides: MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..
De Novo vs. Database Search: A Paradox
The database of all peptides is huge: ≈ O(20^n) for peptides of length n.
The database of all known peptides is much smaller: ≈ O(10^8).
However, de novo algorithms can be much faster, even though their search space is much larger!
A database search scans every peptide in the database of known peptides to find the best one.
De novo sequencing eliminates the need to scan the database of all peptides by modeling the problem as a graph search.
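To make the size gap concrete, the short sketch below compares 20^n with the 10^8 order of magnitude quoted for known peptides. The function name and the chosen lengths are illustrative assumptions only.

```python
KNOWN_PEPTIDES = 10 ** 8  # order of magnitude quoted for the database of known peptides

def all_peptides(n):
    """Number of distinct peptides of length n over the 20 amino acids."""
    return 20 ** n

for n in (6, 7, 8, 10):
    # Print the length, the count, and whether it exceeds the known-peptide database.
    print(n, all_peptides(n), all_peptides(n) > KNOWN_PEPTIDES)
```

Already at length 7 the space of all peptides exceeds 10^8, which is why exhaustive enumeration is not an option for de novo methods.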
De novo Peptide Sequencing Sequence
Theoretical Spectrum
Theoretical Spectrum (cont’d)
Building Spectrum Graph
How to create vertices (from masses)
How to create edges (from mass differences)
How to score paths
How to find the best path
[Figure sequence: the MS/MS spectrum of the peptide SEQUENCE built up step by step on mass/charge (M/Z) vs. intensity axes: b ions spelling S E Q U E N C E; a ions (a is an ion type shift in b); y ions spelling the sequence in reverse (E C N E U Q E S); peak intensities; noise peaks; and the resulting MS/MS spectrum.]
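Some Mass Differences between Peaks Correspond to Amino Acids
[Figure: the MS/MS spectrum with mass differences between peaks labeled by the residues s, e, q, u, e, n, c, e; several differences carry the same label.]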
Ion Types
Some masses correspond to fragment ions; others are just random noise.
Knowing the ion types Δ = {δ_1, δ_2, …, δ_k} lets us distinguish fragment ions from noise.
We can learn the ion types δ_i and their probabilities q_i by analyzing a large test sample of annotated spectra.
Example of Ion Type
Δ = {δ_1, δ_2, …, δ_k}
Ion types {b, b-NH3, b-H2O} correspond to Δ = {0, 17, 18}.
*Note: in reality the δ value of ion type b is -1, but we will "hide" it for the sake of simplicity.
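The offsets Δ = {0, 17, 18} can be read as "subtract δ from the N-terminal peptide mass to get the observed peak". A minimal sketch follows, using the simplified convention above (the b offset is taken as 0) and a hypothetical helper name.

```python
# Ion-type offsets from the example, with the b-ion offset simplified to 0.
ION_OFFSETS = {"b": 0, "b-NH3": 17, "b-H2O": 18}

def ion_masses(prefix_mass, offsets=ION_OFFSETS):
    """Peak masses one N-terminal peptide of the given mass may produce."""
    return {name: prefix_mass - delta for name, delta in offsets.items()}

print(ion_masses(156))  # {'b': 156, 'b-NH3': 139, 'b-H2O': 138}
```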
Match between Spectra and the Shared Peak Count
The match between two spectra is the number of masses (peaks) they share (the Shared Peak Count, or SPC).
In practice, mass spectrometrists use a weighted SPC that reflects the intensities of the peaks.
The match between an experimental and a theoretical spectrum is defined similarly.
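A minimal sketch of the unweighted SPC, assuming both spectra are given as lists of integer masses so that "shared" means exactly equal. Real data would need a mass tolerance and, for the weighted variant, intensities; neither is modeled here, and the example spectra are made up.

```python
def shared_peak_count(spectrum_a, spectrum_b):
    """Number of distinct integer masses present in both spectra."""
    return len(set(spectrum_a) & set(spectrum_b))

theoretical = [57, 156, 271, 384, 512]    # prefix masses of GVDLK (nominal)
experimental = [57, 100, 156, 384, 500]   # made-up peaks, two of them noise

print(shared_peak_count(experimental, theoretical))  # 3
```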
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between an experimental and a theoretical spectrum.
Input:
S: experimental spectrum
Δ: set of possible ion types
m: parent mass
Output:
P: peptide with mass m whose theoretical spectrum best matches the experimental spectrum S
Vertices of Spectrum Graph
Vertices are the masses of potential N-terminal peptides.
Vertices are generated by reverse shifts corresponding to the ion types Δ = {δ_1, δ_2, …, δ_k}.
Every N-terminal peptide of mass m can generate up to k ions: m-δ_1, m-δ_2, …, m-δ_k.
Every mass s in an MS/MS spectrum generates k vertices V(s) = {s+δ_1, s+δ_2, …, s+δ_k} corresponding to potential N-terminal peptides.
Vertices of the spectrum graph: {initial vertex} ∪ V(s_1) ∪ V(s_2) ∪ ... ∪ V(s_m) ∪ {terminal vertex}
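The vertex construction can be sketched as follows. This is an illustration under stated assumptions: masses are integers, the initial vertex is mass 0, the terminal vertex is the parent mass, and the function name and example peaks are hypothetical. The counter records how many peaks support each candidate vertex.

```python
from collections import Counter

def candidate_vertices(spectrum, deltas, parent_mass):
    """Apply every reverse shift s + delta and count the supporting peaks."""
    support = Counter()
    for s in spectrum:
        for delta in deltas:
            v = s + delta
            if 0 < v < parent_mass:  # keep only masses strictly inside the peptide
                support[v] += 1
    return support

# Made-up peaks: 138, 139, 156 are the b-H2O, b-NH3, and b ions of one prefix.
support = candidate_vertices([57, 138, 139, 156], [0, 17, 18], parent_mass=271)
vertices = sorted({0, 271} | set(support))  # add initial and terminal vertices

print(vertices)
print(support[156])  # 3
```

Mass 156 is reached from three different peaks, which is the coincidence that the reverse shifts are designed to expose.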
Reverse Shifts
Two peaks, b-H2O and b, are given by the mass spectrum.
After a +H2O shift, if two peaks coincide, that mass is a possible vertex.
[Figure: intensity vs. mass/charge (M/Z). Red: mass spectrum with peaks b-H2O and b. Blue: the same spectrum shifted by +H2O, with peaks b-H2O+H2O and b+H2O; the shifted b-H2O peak coincides with b.]
[Figure: Reverse Shifts, continued: shift in H2O and shift in H2O+NH3.]
Edges of Spectrum Graph
Two vertices whose mass difference corresponds to an amino acid A are connected by an edge labeled A.
Gap edges are added for di- and tri-peptides.
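A minimal sketch of the single-residue edges, assuming integer vertex masses and the standard nominal residue masses. The function name is hypothetical, residues with equal nominal mass (I/L, K/Q) share one label, and gap edges for di- and tri-peptides are not included.

```python
# Nominal (integer) residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "K": 128, "Q": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def spectrum_graph_edges(vertices):
    """Connect u -> v whenever v - u equals a residue mass; label with the residue(s)."""
    by_mass = {}
    for aa, mass in RESIDUE_MASS.items():
        by_mass.setdefault(mass, []).append(aa)
    present = set(vertices)
    edges = []
    for u in present:
        for mass, residues in by_mass.items():
            if u + mass in present:
                edges.append((u, u + mass, "/".join(sorted(residues))))
    return sorted(edges)

edges = spectrum_graph_edges([0, 57, 156, 271, 384, 512])
for edge in edges:
    print(edge)
```

On the prefix masses of GVDLK this also yields an edge from 0 to 156 labeled R, since G + V and R have the same nominal mass; such ambiguities are one reason paths need to be scored.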
Paths
Paths in the labeled graph spell out amino acid sequences.
There are many paths; how do we find the correct one?
We need scoring to evaluate paths.
Path Score
p(P,S) = probability that peptide P produces spectrum S = {s_1, s_2, …, s_q}
p(P,s) = probability that peptide P generates a peak s
Scoring = computing probabilities:
$$p(P,S) = \prod_{s \in S} p(P,s)$$
Peak Score
For a position t that represents ion type δ_j:
$$p(P,s_t) = \begin{cases} q_j, & \text{if a peak is generated at } t \\ 1-q_j, & \text{otherwise} \end{cases}$$
Peak Score (cont'd)
For a position t that is not associated with an ion type:
$$p_R(P,s_t) = \begin{cases} q_R, & \text{if a peak is generated at } t \\ 1-q_R, & \text{otherwise} \end{cases}$$
q_R = the probability of a noisy peak that does not correspond to any ion type
Finding Optimal Paths in the Spectrum Graph
For a given MS/MS spectrum S, find a peptide P' maximizing p(P,S) over all possible peptides P:
$$p(P',S) = \max_{P} p(P,S)$$
Peptides = paths in the spectrum graph.
P' = the optimal path in the spectrum graph.
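Because every edge goes from a smaller mass to a larger one, the spectrum graph is acyclic, and the best path can be found by dynamic programming over vertices in increasing mass order. The sketch below assumes additive vertex scores (for example, log-probabilities); the function name, the scores, and the edge list are made up for illustration.

```python
def best_path(vertex_score, edges, source, sink):
    """Highest-scoring source-to-sink path; returns (score, sequence) or None."""
    incoming = {}
    for u, v, label in edges:
        incoming.setdefault(v, []).append((u, label))
    best = {source: (vertex_score.get(source, 0.0), "")}
    for v in sorted(vertex_score):  # increasing mass is a topological order
        if v == source:
            continue
        candidates = [
            (best[u][0] + vertex_score[v], best[u][1] + label)
            for u, label in incoming.get(v, [])
            if u in best  # skip predecessors unreachable from the source
        ]
        if candidates:
            best[v] = max(candidates)
    return best.get(sink)

scores = {0: 0.0, 57: 1.0, 156: 2.0, 271: 1.5, 384: 1.0, 512: 0.0}
edges = [(0, 57, "G"), (0, 156, "R"), (57, 156, "V"),
         (156, 271, "D"), (271, 384, "L"), (384, 512, "K")]

print(best_path(scores, edges, source=0, sink=512))  # (5.5, 'GVDLK')
```

The competing path RDLK skips vertex 57 and scores 4.5, so the path through more supported vertices wins. The running time is linear in the number of vertices and edges, which is what lets de novo search avoid enumerating peptides.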
Ions and Probabilities
Tandem mass spectrometry is characterized by a set of ion types {δ_1, δ_2, …, δ_k} and their probabilities {q_1, …, q_k}.
δ_i-ions of a partial peptide are produced independently with probabilities q_i.
Ions and Probabilities
A peptide has all k peaks with probability
$$\prod_{i=1}^{k} q_i$$
and no peaks with probability
$$\prod_{i=1}^{k} (1-q_i)$$
A peptide also produces "random noise" with uniform probability q_R in any position.
Ratio Test Scoring for Partial Peptides
Incorporates premiums for observed ions and penalties for missing ions.
Example: for k=4, assume that for a partial peptide P' we only see ions δ_1, δ_2, δ_4.
The score is calculated as:
$$\frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{1-q_3}{1-q_R} \cdot \frac{q_4}{q_R}$$
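A hedged sketch of the ratio test for one partial peptide. It assumes the premium for an observed ion of type δ_j is q_j / q_R and the penalty for a missing one is (1 - q_j) / (1 - q_R), which follows from the peak and noise probabilities defined earlier. The function name and the probability values are invented for illustration.

```python
def partial_peptide_score(q, q_r, observed):
    """Product of per-ion likelihood ratios against the noise model."""
    score = 1.0
    for q_j, seen in zip(q, observed):
        if seen:
            score *= q_j / q_r              # premium for an observed ion
        else:
            score *= (1 - q_j) / (1 - q_r)  # penalty for a missing ion
    return score

# k = 4 ion types; ions 1, 2, and 4 are observed, ion 3 is missing.
q = [0.8, 0.5, 0.4, 0.2]   # hypothetical ion probabilities
q_r = 0.1                  # hypothetical noise probability
print(partial_peptide_score(q, q_r, [True, True, False, True]))
```

A score above 1 means the observed pattern is better explained by a real partial peptide than by noise alone.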
Scoring Peptides
T: set of all positions.
T_i = {t_{δ_1}, t_{δ_2}, …, t_{δ_k}}: set of positions that represent ions of partial peptide P_i.
A peak at position t_{δ_j} is generated with probability q_j.
R = T − ∪_i T_i: set of positions that are not associated with any partial peptide (noise).
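Probabilistic Model
For a position t_{δ_j} ∈ T_i, the probability p(t,P,S) that peptide P produces a peak at position t is:
$$p(t,P,S) = \begin{cases} q_j, & \text{if a peak is generated at position } t_{\delta_j} \\ 1-q_j, & \text{otherwise} \end{cases}$$
Similarly, for t ∈ R, the probability that P produces a random noise peak at t is:
$$p_R(t) = \begin{cases} q_R, & \text{if a peak is generated at position } t \\ 1-q_R, & \text{otherwise} \end{cases}$$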
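Probabilistic Score
For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:
$$\frac{p(P,S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{i\delta_j},P,S)}{p_R(t_{i\delta_j})}$$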
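In practice a product of many ratios is usually computed as a sum of logarithms, which also makes the score additive along a path in the spectrum graph. The sketch below assumes each position contributes the ratio of the ion model to the noise model (q_j / q_R if a peak is present, (1 - q_j) / (1 - q_R) otherwise). The function name, probabilities, and observation matrix are hypothetical.

```python
import math

def log_ratio_score(observed, q, q_r):
    """Sum of log likelihood ratios over all partial peptides and ion types.

    observed[i][j] is True when the ion of type j of partial peptide i
    has a peak in the spectrum.
    """
    total = 0.0
    for row in observed:
        for q_j, seen in zip(q, row):
            ratio = q_j / q_r if seen else (1 - q_j) / (1 - q_r)
            total += math.log(ratio)
    return total

q = [0.8, 0.5]   # hypothetical probabilities for k = 2 ion types
q_r = 0.1        # hypothetical noise probability
observed = [[True, True], [True, False], [False, False]]  # 3 partial peptides

print(log_ratio_score(observed, q, q_r))
```

Since the total is a sum over partial peptides, each vertex of the spectrum graph can carry its own term, and the best peptide is the path with the largest sum.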
De Novo vs. Database Search
[Figure: the same spectrum, annotated with the residues W R A C V G E K D W L P T L T, is interpreted by de novo sequencing and by database search; both report AVGELTK.]
Database search: database of known peptides: MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..