1 Deriving statistical models for predicting MS/MS product ion intensities
Terry Speed & Frédéric Schütz
Division of Genetics & Bioinformatics, The Walter and Eliza Hall Institute of Medical Research
In collaboration with the Joint ProteomicS Laboratory (WEHI/LICR)

2 Introduction
Proteomics is critical to our understanding of cellular biological processes.
Mass Spectrometry (MS) has emerged as a key platform in proteomics for the high-throughput identification of proteins.
Sophisticated algorithms, such as Mascot or Sequest, exist for database searching of MS/MS data.
Major bottleneck: results must often be manually validated.
More robust algorithms are needed before the identification of MS/MS data can be fully automated.

3 What is a Mass Spectrometer?
"An analytical device that determines the molecular weight of chemical compounds by separating molecular ions according to their mass-to-charge ratio (m/z)"
[Figure: ionisation, separation by m/z, detection. Sample components: molecular weight 600 Da (abundance 50 %), 400 Da (abundance 20 %), 300 Da (abundance 30 %). Detected peaks at m/z 601, 401 and 301 with intensities 50, 20 and 30.]

4 [Figure: the same ionisation, separation by m/z and detection diagram as on the previous slide (components of 600 Da, 400 Da and 300 Da with abundances 50 %, 20 % and 30 %), now with an additional peak at m/z 201 next to the peaks at m/z 601, 401 and 301.]

5 Tandem MS (MS/MS)
To gain structural information about the detected masses, one product is selected, collided with a gas, and the resulting fragments are separated and detected in a second MS.
–different molecules of the same substance can split in different ways.
–in each molecule, only the pieces that retain one of the charges will be observed and present in the spectrum; the others are discarded.

6 How to use MS for protein identification
Peptide mass fingerprinting
The exact protein needs to be in the database
Works only when the digest comes from a single protein
Example: peaks at m/z 333, 336, 406, 448, 462, 889. The only protein in the database that would produce these peaks is MALK|CGIR|GGSRPFLR|ATSK|ASR|SDD
[Figure: workflow diagram with the labels Sample, Proteins, 2D-GEL, EXCISE, DIGEST, MS, m/z]
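
The fingerprinting example above can be checked with a minimal sketch. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P), standard monoisotopic residue masses, and singly protonated peptides. The function names are illustrative and do not come from any search engine.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum
PROTON = 1.00728   # mass of the charge-carrying proton


def tryptic_digest(protein):
    """Split after K or R, except when the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    peptides.append(protein[start:])
    return peptides


def mh_plus(peptide):
    """m/z of the singly protonated peptide, [M+H]+."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


if __name__ == "__main__":
    for pep in tryptic_digest("MALKCGIRGGSRPFLRATSKASRSDD"):
        print(pep, round(mh_plus(pep), 2))
```

Under these assumptions the digest returns the six peptides shown on the slide, and truncating each [M+H]+ value to an integer gives the listed peak positions.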

7 Tandem MS for protein identification
Workflow: 2-D Gel (or 1-D Gel) → In-gel Digest (Trypsin) → Capillary Column RP-HPLC (On-line; 60 min Gradient) → MS Analysis (ESI Ion Trap) → MS data → CID of the most intense ion → MS/MS data
CID = Collision Induced Dissociation
[Figure: electrospray ionisation: original droplet; solvent evaporates from droplet; positive ions]

8 Example MS/MS spectrum
[Figure: MS/MS spectrum of a tryptic fragment. x-axis: m/z (150 to 950); y-axis: Relative Abundance (0 to 100). Peaks are annotated with the b2 to b8, y2 to y8 and a2 ions. Residue labels read off the ion series: Glu, Asp, Lxx, Gly, Phe, Val, Phe, Gly, Lxx, Lxx, Asp, Glu, Asp, Lys.]

9 Interpretation of MS/MS data
Direct interpretation ("de novo sequencing")
–spectrum must be of good quality
–the only identification method if the spectrum is not in the database
–can give useful information (partial sequence) for database search
General approach for database searching:
–extract from the database all peptides that have the same mass as the precursor ion of the uninterpreted spectrum
–compare each of them to the uninterpreted spectrum
–select the peptide that is most likely to have produced the observed data
MASCOT:
–simple probabilistic model
–calculate the probability that a peptide could have produced the given spectrum by chance

10 Interpretation of MS/MS data
SEQUEST:
–generate a predicted spectrum for each potential peptide using a simple fragmentation model (all b and y ions have the same intensity; possible losses from b and y have a lower intensity)
–compute a "cross-correlation" score and find the best-matching peptide
–since this operation is very time-consuming, a simpler preliminary score is used to find the 500 peptides in the database that are most likely to be the correct identification
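
The uniform fragmentation model described above can be illustrated with a short sketch. This is not SEQUEST code: it only lists singly charged b and y ions with one shared intensity, omits the lower-intensity loss peaks and the cross-correlation step, and uses standard monoisotopic masses. The function name `uniform_by_spectrum` is hypothetical.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def uniform_by_spectrum(peptide, intensity=1.0):
    """Return (ion name, m/z, intensity) for all singly charged b and y ions.

    Every ion gets the same intensity, which is the simplifying
    assumption the slides attribute to current search engines.
    """
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    peaks = []
    for i in range(1, n):  # bond between residue i and residue i + 1
        b_mz = sum(masses[:i]) + PROTON
        y_mz = sum(masses[i:]) + WATER + PROTON
        peaks.append((f"b{i}", b_mz, intensity))
        peaks.append((f"y{n - i}", y_mz, intensity))
    return peaks


if __name__ == "__main__":
    for name, mz, inten in uniform_by_spectrum("VLSIGDGIAR"):
        print(f"{name:>3} {mz:9.3f} {inten}")
```

A predicted spectrum of this kind is flat by construction, which is why peptides with strongly enhanced or suppressed cleavages match it poorly.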

11 An unusual spectrum
MASCOT: correct sequence is the 2nd scoring peptide
SEQUEST: correct sequence is not in the top 10 scoring peptides

12 Intermediate conclusions
All current MS/MS database search algorithms use a simplified fragmentation model: "peptides fragment in a uniform manner under low-energy collision induced dissociation (CID) conditions"
This approach works well for identifying most peptides
Several peptides exhibit fragment ions that differ greatly from this simple model
Those peptides often yield low or insignificant scores, thus preventing a positive identification
A better understanding of the fragmentation of peptides in the gas phase is required to build more robust search engines.

13 How does a peptide fragment?
Peptides usually fragment at their amide (= peptide) bond, producing b and y ions
‘mobile proton’ hypothesis: cleavage is initiated by migration of the charge from the initial site of protonation
aRginine (a very basic residue) can sequester a proton
Other basic residues (Lys, His) can hinder proton mobility
If no mobile proton is available:
–peptide will usually fragment poorly
–other fragmentation mechanisms take precedence: cleavage at Asp-Xaa = cD (which we saw two slides back); cleavage at Glu-Xaa = cE
[Figure: example peptide VLSIGDGIAR shown with two positive charges]

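14 Fragmentation example
[Figure: annotated spectrum of VFIMDNCEELIPEYLNFIR (ox and Pe modifications, shown with two positive charges); labelled ions include y4, y5, y6, y8, y9, b10 and b11; nP and cP cleavages are marked]
Pe (pyridylethyl cysteine) = loss from C; ox = metox (methionine sulfoxide) = loss from M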

15 Fragmentation example, II
[Figure: annotated spectrum of RVFIMDNCEELIPEYLNFIR (ox and Pe modifications, shown with two positive charges); labelled ions include b6, y6, y8, y10, y11 and y14; marked features: metox loss (-CH3SOH), Pe loss (-Pe), combined loss -(CH3SOH + Pe), the segment MDNCE, and cD, cE and nP cleavages]
Difficult to interpret due to N- and C-terminal aRginines.

16 Factors influencing fragmentation
Some factors have been known for a long time:
–Xaa-Pro (nP) cleavage usually enhanced
–Asp-Xaa (cD) enhanced when no mobile proton is available
Several recent attempts to improve this knowledge concentrated only on small subsets of data:
–Breci et al.: database of 168 Pro-containing peptides; analysed fragmentation at the Xaa-Pro (nP) bond; most abundant ions observed when Xaa is Val, His, Asp, Ile or Leu
–Tabb et al.: determined whether residues are more likely to cleave on their N-terminal rather than their C-terminal side
–Huang et al.: analysis of 505 doubly-charged tryptic peptides; cleavage at Asp-Xaa (cD) is more prominent for peptides that also contain an internal histidine residue

17 Find factors influencing fragmentation
Data:
–about 11,000 spectra from an Ion-Trap mass spectrometer
–identified using SEQUEST
–manually validated to ensure correct identification
5,500 unique sequences
Preliminary calculations: Cleavage Intensity Ratios (CIR)
–CIR < 1: reduced cleavage
–CIR = 1: average cleavage
–CIR > 1: enhanced cleavage

18 Quantifying the Asp-Xaa (cD) bond cleavage
‘Relative Proton Mobility’ Scale:
–if number of Arg residues ≥ number of charges: Non-mobile
–if number of Arg, Lys & His residues < number of charges: Mobile
–otherwise: Partially-mobile
[Table: average CIR (number of peptides) by charge state (rows 1+, 2+, 3+) and mobility class (columns Mobile, Partially-Mobile, Non-Mobile), stratified by number of basic residues (labels such as K1, R1, K1R1, H1K1R2). The cell layout is not recoverable from the transcript. Values in extraction order: 5.10 (126), -, 0.81 (358), 1.04 (316), 4.96 (92), 0.88 (54), 0.91 (37), 3.63 (12), 1.31 (24), 2.37 (238), 1.66 (276), 2.06 (301), 1.63 (79), 2.51 (21), 1.94 (23), 2.71 (10). Stratum labels in extraction order: K1, R1, R1, R2, H1K1, K1R1, R3, R2, K1, K2, K1R1, H1K1R1, K1R2, H1K2R1, H1K1R2.]
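
The three rules of the ‘relative proton mobility’ scale translate directly into a small classifier. The sketch below is illustrative only: the function name is hypothetical, and it assumes an unmodified peptide given in one-letter code.

```python
def proton_mobility(peptide, charge):
    """Classify a peptide ion on the 'relative proton mobility' scale.

    peptide: amino-acid sequence in one-letter code
    charge:  number of charges on the precursor ion (1, 2, 3, ...)
    """
    n_arg = peptide.count("R")
    n_basic = n_arg + peptide.count("K") + peptide.count("H")
    if n_arg >= charge:
        # Every charge can be sequestered by an arginine.
        return "non-mobile"
    if n_basic < charge:
        # At least one proton has no basic residue to sit on.
        return "mobile"
    return "partially-mobile"


if __name__ == "__main__":
    for seq, z in [("LEGLTDEINFLR", 1), ("VLSIGDGIAR", 2), ("RAELEAK", 2)]:
        print(seq, f"{z}+", proton_mobility(seq, z))
```

The first example, LEGLTDEINFLR at charge 1+, comes out as non-mobile, which agrees with how that peptide is described later in the talk.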

19 Influence on scoring
Already known: the charge state has an influence on search scores
Proton mobility also influences search scores
[Figure: dashed line = currently accepted cut-off; spectra scoring below it are not identified without manual intervention]

20 Find factors influencing fragmentation, II
Data categorized into 9 different strata, according to
–charge state (1, 2 or 3+)
–‘relative proton mobility’ scale
Each spectrum was individually normalised

21 Find factors influencing fragmentation, III
Intensity at cleavage Xaa-Yaa is modeled by:
log(intensity of the cleavage) = baseline cleavage intensity + increase/decrease due to the residue cleaved on its C-terminal side (Xaa) + increase/decrease due to the residue cleaved on its N-terminal side (Yaa) + β1·pos + β2·pos² + β3·log2(peptide length)
where
–intensity of the cleavage = sum of intensities of all ions (b, y, etc) produced by cleavage at this bond
–baseline cleavage intensity = average cleavage intensity if no factor has a special effect on fragmentation
–increase/decrease = coefficients of indicator variables, one per residue and side
–pos = relative position of the cleavage inside the peptide (0..1)
–log2(peptide length) = accounts for the lower intensity, due to the normalisation process, of a given cleavage when it occurs in a longer peptide
–β1, β2, β3 = regression coefficients of the positional and length terms
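
To make the model terms concrete, the sketch below builds the regressors for one cleavage site. It is an illustration under stated assumptions: the talk does not define pos exactly, so it is taken here as the bond index divided by the peptide length, and the feature names (`C_D`, `N_G`, and so on) are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cleavage_features(peptide, bond):
    """Regressors for the bond between peptide[bond - 1] (Xaa) and peptide[bond] (Yaa).

    Returns a dict with an intercept, one indicator per residue and side,
    and the positional and length terms of the model.
    """
    if not 1 <= bond < len(peptide):
        raise ValueError("bond must lie strictly inside the peptide")
    xaa, yaa = peptide[bond - 1], peptide[bond]
    feats = {"intercept": 1.0}
    for aa in AMINO_ACIDS:
        feats[f"C_{aa}"] = 1.0 if aa == xaa else 0.0  # cleaved on its C-terminal side
        feats[f"N_{aa}"] = 1.0 if aa == yaa else 0.0  # cleaved on its N-terminal side
    pos = bond / len(peptide)  # assumed definition of the relative position
    feats["pos"] = pos
    feats["pos2"] = pos ** 2
    feats["log2_length"] = math.log2(len(peptide))
    return feats


if __name__ == "__main__":
    row = cleavage_features("VLSIGDGIAR", 6)  # the Asp-Gly bond
    print({name: value for name, value in row.items() if value})
```

Stacking one such row per cleavage site, with the log cleavage intensity as the response, gives the input for an ordinary least-squares fit within each stratum.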

22 Find factors influencing fragmentation, IV
Linear regression is performed to estimate the effect of each of these variables on the fragmentation process
Variable selection: ensure that only variables that have a real effect on the fragmentation process are retained
–for each "side" (C or N), the factor that is the closest to the average intensity is removed from the model. In other words, one of the residues of each side is selected as the reference, the residue that "does nothing"
–backward selection is then performed to remove all variables that are not significantly different from 0 (at the 1% level)

23 How to find factors influencing frag
The regression was always significant (i.e. at least one factor was significant)
In practice:
–the pos and log(length) terms were always retained
–in each regression, several residues were selected

24 Factors influencing fragmentation

25 Predicting ion intensities
Use the same kind of linear model as before
Fit separate models for the different types of ions that we want to predict
Currently, only b and y ions are predicted
Influence of residues and positional factors are taken into account for the prediction
This (and everything before) is valid only on an Ion-Trap mass spectrometer

26 Prediction example: LEGLTDEINFLR, 1+
[Figure: three panels: Observed spectrum; SEQUEST prediction; Prediction with LM]
‘non-mobile’ peptide, which usually gives bad scores
correlation between observed and LM predicted spectrum: 0.97

27 Testing our predictions
Predictions were tested on a set of 283 peptides not used for fitting the model
Correlation between predicted and observed spectrum: median 0.73, interquartile range 0.27
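
The evaluation above can be sketched as follows. The talk does not say how spectra were aligned or which correlation was used, so this example assumes Pearson correlation between two intensity vectors that list the same ions in the same order. The function names and the numbers in the usage section are hypothetical placeholders.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation between two equally long intensity vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def summarise(correlations):
    """Median and interquartile range of per-peptide correlations."""
    q1, _, q3 = statistics.quantiles(correlations, n=4)
    return statistics.median(correlations), q3 - q1


if __name__ == "__main__":
    observed = [0.10, 0.40, 1.00, 0.30, 0.05]   # made-up intensities
    predicted = [0.15, 0.35, 0.90, 0.40, 0.10]  # made-up intensities
    print(round(pearson(observed, predicted), 3))
    print(summarise([0.2, 0.5, 0.7, 0.8, 0.9]))
```

The interquartile range depends on the quantile convention; `statistics.quantiles` uses its default "exclusive" method here, which may differ from the convention used for the figures on the slide.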

28 Testing our predictions, II
Worst scoring peptide (correlation = -0.19): RAELEAK, doubly-charged
Explanation
–Most peptides in the training set are tryptic peptides
–Proton will usually sit at the C-terminus of the peptide (K)
–Under this assumption, y-ions are usually more intense than b-ions
–Because of the miscleavage, the proton actually sits at the N-terminus
–Consequently, b-ions are more intense than y-ions
–The model performs badly
Charge localisation should be taken into account

29 Ongoing work
More known effects (e.g. charge localisation) must be taken into account in the model, plus some interactions
Other effects, still unknown, also have an influence on the fragmentation, and should be looked for
Predict other ion series (neutral losses, etc)
Test if the predictions can help discriminate between correct and incorrect identifications
Build a new search algorithm that takes into account these predictive models

30 Conclusions
Prediction of spectra is becoming feasible
Better search algorithms are expected
The ‘relative proton mobility’ scale helps the interpretation of database search scores
Optimized thresholds can be used for different subsets of the data
It should improve the sensitivity and specificity of the identification process
These are important steps towards fully automated identification of peptide MS/MS data

31 Acknowledgments
JPSL Ludwig Institute, Melbourne
–Eugene Kapp
–James Eddes
–Gavin Reid
–Lisa Connolly
–David Frecklington
–Robert Moritz
–Richard Simpson
Bioinformatics, WEHI
–Frédéric Schütz
Dept. of Chemistry, Melbourne University
–Richard O’Hair
Part of this work will appear in Analytical Chemistry
