Eat Raw & Fresh: Introducing isotopic Mass-to-charge Ratio and Envelope Fingerprinting (iMEF) and ProteinGoggle for Protein Database Search
Zhixin (Michael) Tian
CNCP, 11/15/2012
What is mass? Monoisotopic mass (m/z, z=+1) L. C. Dias, et al. J. Org. Chem. 2012, 77, 4046.
Missing monoisotopic mass in protein
Monoisotopic mass (12C, 1H, 14N, 16O, 32S): most significant & accurate
Mass of the most abundant isotope: error of ±1 Da or more (mis-assignment of the number of contributing heavy isotopes)
Average mass (average of isotopic peak masses weighted by abundance): error of ±1 u at 16,000 u (13C/12C ratio’s variability)
The probability of incorporating multiple heavy isotopes increases with the mass of a molecule, which lowers the relative abundance of the monoisotopic peak. The monoisotopic peak is unlikely to be observed for molecules larger than 15 kDa.
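The three mass definitions and the fading monoisotopic peak can be illustrated with a short calculation. The sketch below is not part of the talk: it uses the elemental composition printed on the protein-database slide (C378H630N105O118S1), standard isotope masses and natural abundances, and hypothetical function names. The last function estimates the fraction of molecules that contain only the lightest isotopes, which is the share of the total signal carried by the monoisotopic peak.

```python
import math

# Masses (u) of the lightest stable isotope of each element.
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048,
        "O": 15.99491461956, "S": 31.97207100}
# Natural abundance of that lightest isotope (12C, 1H, 14N, 16O, 32S).
LIGHT_ABUNDANCE = {"C": 0.9893, "H": 0.999885, "N": 0.99636,
                   "O": 0.99757, "S": 0.9499}
# Standard (abundance-weighted) atomic masses (u).
AVERAGE = {"C": 12.0107, "H": 1.00794, "N": 14.0067,
           "O": 15.9994, "S": 32.065}


def monoisotopic_mass(formula):
    """Mass of the species built only from the lightest isotopes."""
    return sum(MONO[el] * n for el, n in formula.items())


def average_mass(formula):
    """Abundance-weighted average mass."""
    return sum(AVERAGE[el] * n for el, n in formula.items())


def monoisotopic_fraction(formula):
    """Fraction of molecules that carry no heavy isotope at all."""
    log_p = sum(n * math.log(LIGHT_ABUNDANCE[el]) for el, n in formula.items())
    return math.exp(log_p)


# Composition as printed on the protein-database slide of this deck.
ubiquitin = {"C": 378, "H": 630, "N": 105, "O": 118, "S": 1}

if __name__ == "__main__":
    print(f"monoisotopic mass: {monoisotopic_mass(ubiquitin):.4f} u")
    print(f"average mass:      {average_mass(ubiquitin):.4f} u")
    print(f"monoisotopic peak: {monoisotopic_fraction(ubiquitin):.2%} of all molecules")
```

For this composition the monoisotopic species accounts for less than 1% of the molecules, and the fraction keeps shrinking as the atom counts grow, which is why the peak is rarely observed for larger proteins.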
Deisotoping (Deconvolution)
Algorithms: AID-MS, ESI-ISOCONV, LASSO, MapQuant, MasSPIKE, MATCHING, msInspect, Peplist, quadratic deisotoping, RAPID, THRASH, Wang’s method, Zhang’s program, and ZSCORE
Steps:
  Calculate background noise level
  Determine charge state using FT/Patterson technique
  Calculate theoretical profile
  Fit with observed isotopic profile
  Monoisotopic mass
Search Engines: ProSightPC, SEQUEST, Mascot, X!Tandem, InsPecT, OMSSA, Andromeda, pFind
2. C. D. Wenger, M. T. Boyne, J. T. Ferguson, D. E. Robinson, N. L. Kelleher, Versatile Online-Offline Engine for Automated Acquisition of High-Resolution Tandem Mass Spectra. Anal Chem 80, 8055 (Nov 1, 2008).
3. J. K. Eng, A. L. McCormack, J. R. Yates, An Approach to Correlate Tandem Mass-Spectral Data of Peptides with Amino-Acid-Sequences in a Protein Database. J Am Soc Mass Spectr 5, 976 (Nov, 1994).
4. D. N. Perkins, D. J. C. Pappin, D. M. Creasy, J. S. Cottrell, Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551 (Dec, 1999).
5. S. Tanner et al., InsPecT: Identification of posttranslationally modified peptides from tandem mass spectra. Anal Chem 77, 4626 (Jul 15, 2005).
6. L. Y. Geer et al., Open mass spectrometry search algorithm. J Proteome Res 3, 958 (Sep-Oct, 2004).
7. J. Cox et al., Andromeda: A Peptide Search Engine Integrated into the MaxQuant Environment. J Proteome Res 10, 1794 (Apr, 2011).
8. D. Q. Li et al., pFind: a novel database-searching software system for automated peptide and protein identification via tandem mass spectrometry. Bioinformatics 21, 3049 (Jul 1, 2005).
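The charge-state step can be illustrated with a much simpler stand-in for the FT/Patterson technique named on the slide: adjacent isotopic peaks of an ion with charge z are separated by roughly 1.00335/z in m/z, so the mean spacing gives z directly. The sketch below is an illustration under that assumption, not the method used by any of the listed programs; the function names are hypothetical and the three m/z values are taken from the theoretical ubiquitin envelope shown later in the deck.

```python
C13_C12_DELTA = 1.0033548   # mass difference between 13C and 12C (u)
PROTON_MASS = 1.00727646688  # mass of a proton (u)


def charge_from_spacing(mz_peaks):
    """Estimate the charge from the mean spacing of adjacent isotopic peaks."""
    mz = sorted(mz_peaks)
    if len(mz) < 2:
        raise ValueError("need at least two isotopic peaks")
    spacings = [b - a for a, b in zip(mz, mz[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    return round(C13_C12_DELTA / mean_spacing)


def neutral_mass(mz, z):
    """Neutral mass of a protonated ion observed at m/z with charge z."""
    return z * (mz - PROTON_MASS)


if __name__ == "__main__":
    peaks = [857.3701, 857.4703, 857.5706]
    z = charge_from_spacing(peaks)
    print("charge:", z)
    # 856.9690 is the first (lightest) peak of the same theoretical envelope.
    print(f"neutral monoisotopic mass: {neutral_mass(856.9690, z):.3f} u")
```

A spacing-based estimate degrades quickly when envelopes overlap or peaks are missing, which is one reason deisotoping programs use more robust techniques.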
Peptide Mass Fingerprinting (PMF)
[Workflow diagram. Input: Protein Database; RAW File (MS Spectrum (iE), MS/MS Spectra (iE)). Search Engine: Parent (Theo. mass) vs. Parent (Exp. mass); Fragments (Theo. mass) vs. Fragments (Exp. mass); Candidates; Initial IDs. Output: Final IDs. Step labels: A1/P1, A1/P2, A2/P3, A2/P4.]
Ubiquitin - MS spectrum (profile)
Ubiquitin – MS/MS (ETD) Spectrum (Profile)
Database search with PMF using ProSightPC
NMFs = 92
NUMFs = 219
P score = 4.86E-98
Definition of P_Score
f: the total number of observed fragments (NMFs + NUMFs)
n: the number of matching fragments (NMFs)
x: the mean probability that the mass of an observed fragment ion will randomly match one from a generic protein
111.1: the mass of the average amino acid, weighted for its occurrence in proteins
2: the number of fragment ions generated from each bond cleavage (b- and y-type ions, or c- and z•-type ions)
Ma: the mass accuracy (a Ma of ±1 Da translates to a 2 Da window)
Neil L. Kelleher, et al. Nat. Biotechnol. 2001, 19, 952
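The formula itself did not survive on this slide, only the definitions of its symbols. The sketch below assumes the Poisson form in which this score is commonly stated, P = (x·f)^n · e^(−x·f) / n!, with x = 2 · Ma / 111.1 and Ma given as the width of the mass window in Da. Both the form and the reading of Ma are assumptions to be checked against the cited paper, and the function names are hypothetical. The score is computed in log space because values such as 4.86E-98 are reached quickly.

```python
import math

AVG_RESIDUE_MASS = 111.1   # average amino acid mass, from the slide
IONS_PER_CLEAVAGE = 2      # fragment ions per bond cleavage, from the slide


def random_match_probability(window_da):
    """x: chance that one observed fragment matches by accident (assumed form)."""
    return IONS_PER_CLEAVAGE * window_da / AVG_RESIDUE_MASS


def log10_p_score(f, n, window_da):
    """log10 of the assumed Poisson score for n matches out of f fragments."""
    lam = random_match_probability(window_da) * f  # expected random matches
    ln_p = n * math.log(lam) - lam - math.lgamma(n + 1)
    return ln_p / math.log(10)


if __name__ == "__main__":
    # f = NMFs + NUMFs = 311 and n = NMFs = 92 are the ProSightPC numbers above.
    # The mass window behind the reported score is not given on the slides,
    # so the windows below are illustrative only.
    for window in (0.01, 0.02, 0.05):
        print(f"window {window} Da: log10(P) = {log10_p_score(311, 92, window):.1f}")
```

Under this form the score falls steeply as n grows relative to the expected number of random matches x·f, so the number of unmatched fragments (which inflates f) and the mass window both affect the reported confidence.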
Is “MFs” really good?
Is “NUMFs” really good?
RAPID (28+49=77)
THRASH (92+219=311)
PeakPicking: SNRThreshold = 3.0, BackgroundRatio = 5.0, FitType = Lorentzian
DeconvPep: MaxCharge = 25, ThScore = 0.0
AdvDeconv: MaxAbundancePeak = 3, ScanNoModifier = 0, MaxMissPeak = 3, MassErr = 1.0E-05, ThClustExt = 0.0, IntsRangeErr = 0.5
Better “deisotoping”? NO “deisotoping”?
What is a mass spectrum? MS of Ubiquitin
The nature of the iE of an ion
x, y coordinates (Profile; Centroid)

Exp. m/z    Abundance
856.9821    6061
857.0825    21811
857.1826    52841
857.2809    82342
857.3782    93523
857.4746    96019
857.5714    75857
857.6682    60680
857.7663    42420
857.8669    27294
857.9680    14752
858.0681    5685
858.1685    1120
858.2717    919
858.3671    316
858.4594    147
What is in a protein database?
Sequence: MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
Elemental composition: C378H630N105O118S1
x, y coordinates (Centroid)

Theo. m/z   Relative abundance (%)
856.9690    3.95
857.0692    18.83
857.1695    45.88
857.2698    76.13
857.3701    96.65
857.4703    100.00
857.5706    87.76
857.6709    67.12
857.7711    45.63
857.8714    27.99
857.9716    15.67
858.0719    8.09
858.1721    3.87
858.2724    1.73
858.3726    0.73
858.4729    0.29
iMEF (isotopic m/z & Envelope Fingerprinting)
[Workflow diagram. Input: Protein Database (Parent: Theo. mass and Theo. iE; Fragments); RAW File (MS Spectrum (iE), MS/MS Spectra (iE); Parent: Exp. mass; Fragments). Search: Candidates; Initial IDs. Output: Final IDs. Step labels: A/P1, A/P2, A1/P1, A1/P2, A2/P3, A2/P4.]
Top-down Screening – MS/MS2 (Targeted Screening - MS2)
iMEF = iMF (A1) + iEF (A2)
[Flowchart. Boxes: DB; 1st, 2nd and 3rd isotopic peak; A1/F1: parent ion exp. iE vs. parent ion theo. iE; protein candidates; A2/F2: fragment ion exp. iEs vs. fragment ion theo. iEs; A2/F3; preliminary protein candidates; preliminary protein IDs; NMFs; PTM_Scores; initial protein ID(s); combined initial protein IDs; remove duplicates; final IDs; isotopic peak exclusion list; norm. isotopic peaks removed. Decision branches are marked Y/N.]
Pre-Step 1: Customized database
MS: precursor ions
MS/MS: fragment ions
Pre-Step 2: Noise level determination
Ubiquitin - MS spectrum (profile)
Ubiquitin – MS/MS (HCD) spectrum (profile)
Step 1: Profile to centroid (MS & MS2)
Step 2: iMF of precursor ion candidates
Isolation window: ±3 m/z units
Top-down Screening: IPMD 15 ppm
857.47461 (4 ppm)
Step 3: iEF of precursor ion candidates
IPACO 5%, IPMD 15 ppm, IPAD 30%
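A minimal sketch of what envelope fingerprinting with these three parameters could look like follows. It is an illustration rather than ProteinGoggle's implementation, and it rests on three assumptions: IPACO is read as a relative-abundance cutoff that selects which theoretical peaks are checked, IPMD as the allowed m/z deviation in ppm, and IPAD as the allowed relative-abundance deviation in percent. The function name and the nearest-peak pairing are hypothetical; the peak lists are the experimental and theoretical ubiquitin envelopes shown earlier in the deck.

```python
EXP = [(856.9821, 6061), (857.0825, 21811), (857.1826, 52841), (857.2809, 82342),
       (857.3782, 93523), (857.4746, 96019), (857.5714, 75857), (857.6682, 60680),
       (857.7663, 42420), (857.8669, 27294), (857.9680, 14752), (858.0681, 5685),
       (858.1685, 1120), (858.2717, 919), (858.3671, 316), (858.4594, 147)]
THEO = [(856.9690, 3.95), (857.0692, 18.83), (857.1695, 45.88), (857.2698, 76.13),
        (857.3701, 96.65), (857.4703, 100.00), (857.5706, 87.76), (857.6709, 67.12),
        (857.7711, 45.63), (857.8714, 27.99), (857.9716, 15.67), (858.0719, 8.09),
        (858.1721, 3.87), (858.2724, 1.73), (858.3726, 0.73), (858.4729, 0.29)]


def envelope_fingerprint(exp_peaks, theo_peaks, ipaco=5.0, ipmd=15.0, ipad=30.0):
    """Compare an experimental envelope with a theoretical one, peak by peak.

    Returns one tuple per checked theoretical peak:
    (theo_mz, mz_deviation_ppm, abundance_deviation_percent, passed).
    """
    exp_max = max(ab for _, ab in exp_peaks)
    theo_max = max(ab for _, ab in theo_peaks)
    results = []
    for theo_mz, theo_ab in theo_peaks:
        theo_rel = theo_ab / theo_max * 100.0
        if theo_rel < ipaco:
            continue  # assumed role of IPACO: skip low-abundance theoretical peaks
        # Pair with the experimental peak closest in m/z (hypothetical pairing rule).
        exp_mz, exp_ab = min(exp_peaks, key=lambda p: abs(p[0] - theo_mz))
        exp_rel = exp_ab / exp_max * 100.0
        mz_dev = (exp_mz - theo_mz) / theo_mz * 1e6
        ab_dev = (exp_rel - theo_rel) / theo_rel * 100.0
        passed = abs(mz_dev) <= ipmd and abs(ab_dev) <= ipad
        results.append((theo_mz, mz_dev, ab_dev, passed))
    return results


if __name__ == "__main__":
    for theo_mz, mz_dev, ab_dev, passed in envelope_fingerprint(EXP, THEO):
        print(f"{theo_mz:.4f}  {mz_dev:+6.1f} ppm  {ab_dev:+6.1f} %  "
              f"{'pass' if passed else 'fail'}")
```

Every isotopic peak above the cutoff contributes its own m/z and abundance check, so tightening any of the three parameters raises the bar for the whole envelope rather than for a single deconvoluted mass.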
Step 4: iMF of fragment ion candidates
Targeted Screening: IPMD 10 ppm
277.13278 (5 ppm)
C1;MAX_MZ=149.07431&C2;MAX_MZ=277.132888&C3;MAX_MZ=390.216952&C4;MAX_MZ=537.285366&C5;MAX_MZ=636.353779&C6;MAX_MZ=764.448743&C7;…
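The database string on this slide can be read as a list of fragment names, each with the m/z of its most abundant isotopic peak. That reading of MAX_MZ is an assumption, as are the function names below. Under it, the m/z fingerprinting step reduces to parsing the entry and keeping every fragment whose stored m/z lies within the IPMD tolerance of an experimental peak. Only the six entries that are fully printed on the slide are used; the elided remainder is left out.

```python
ENTRY = ("C1;MAX_MZ=149.07431&C2;MAX_MZ=277.132888&C3;MAX_MZ=390.216952"
         "&C4;MAX_MZ=537.285366&C5;MAX_MZ=636.353779&C6;MAX_MZ=764.448743")


def parse_fragment_index(entry):
    """Turn 'NAME;MAX_MZ=value&...' into a {name: m/z} dictionary."""
    index = {}
    for field in entry.split("&"):
        if ";MAX_MZ=" not in field:
            continue  # skip truncated or malformed fields
        name, value = field.split(";MAX_MZ=")
        index[name] = float(value)
    return index


def match_fragment(exp_mz, index, ipmd_ppm=10.0):
    """Return (name, deviation in ppm) for fragments within the tolerance."""
    hits = []
    for name, theo_mz in index.items():
        deviation = (exp_mz - theo_mz) / theo_mz * 1e6
        if abs(deviation) <= ipmd_ppm:
            hits.append((name, deviation))
    return hits


if __name__ == "__main__":
    index = parse_fragment_index(ENTRY)
    for name, deviation in match_fragment(277.13278, index):
        print(f"{name}: {deviation:+.2f} ppm")
```

A linear scan is enough for a handful of fragments; for a full database the entries would normally be kept sorted by m/z so that the tolerance window can be located by binary search.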
Step 5: iEF of fragment ion candidates
IPACO 5%, IPMD 10 ppm, IPAD 50%
Exemplary PTM_Score assignment Human histone H4_S1acK16acK20me2
ID of ubiquitin from ETD
NMFs = 91
Parameters: IPACO=10, IPMD=15, IPAD=100, IPMDO=20, IPMDOM=30, IPADO=20, IPADOM=200
[Plots: NMFs vs. IPACO; NMFs vs. IPMD; NMFs vs. IPAD]
Pros and Cons
Pros:
  As-strict-as-you-choose confidence
  Strict quality control (QC)
  Fine discrimination of close iEs
  In-situ unwrapping of overlapped iEs
Cons:
  More complex and bigger database
  More data points for fingerprinting
Pros: As-strict-as-you-choose confidence Comparison with ProSightPC
Layman’s choice of parameters Default values with statistical significance!
Pros: Fine discrimination of close iEs
Candidate ions: b38-53 (3+); b18-33 (3+) or b19-34 (3+); (b6-22-H2O) (3+)

Exp. m/z    Theo. m/z   IPMD    Theo. m/z   IPMD    Theo. m/z   IPMD
599.6575    599.6478    16      599.6511    11      599.6595    -3
599.9919    599.9821            599.9855            599.9939
600.3242    600.3164    13      600.3197    8       600.3281    -6
600.6616    600.6506    18      600.6539            600.6623    -1
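The IPMD values printed on this slide are consistent with the usual ppm deviation, (exp − theo) / theo × 10^6. The sketch below recomputes that deviation for each experimental peak against the three theoretical columns and reports which column is closest. Assigning the columns to the candidate ions in the order they are listed on the slide is an assumption, and the function names are hypothetical.

```python
# Candidate labels in the order they appear on the slide (assumed column order).
CANDIDATES = ["b38-53 (3+)", "b18-33 (3+) or b19-34 (3+)", "(b6-22-H2O) (3+)"]

# (experimental m/z, theoretical m/z for each of the three candidates)
ROWS = [
    (599.6575, (599.6478, 599.6511, 599.6595)),
    (599.9919, (599.9821, 599.9855, 599.9939)),
    (600.3242, (600.3164, 600.3197, 600.3281)),
    (600.6616, (600.6506, 600.6539, 600.6623)),
]


def ipmd_ppm(exp_mz, theo_mz):
    """Isotopic peak m/z deviation in parts per million."""
    return (exp_mz - theo_mz) / theo_mz * 1e6


def best_candidate(exp_mz, theo_mzs):
    """Index of the theoretical m/z with the smallest absolute deviation."""
    deviations = [abs(ipmd_ppm(exp_mz, t)) for t in theo_mzs]
    return deviations.index(min(deviations))


if __name__ == "__main__":
    for exp_mz, theo_mzs in ROWS:
        deviations = ", ".join(f"{ipmd_ppm(exp_mz, t):+.1f}" for t in theo_mzs)
        winner = CANDIDATES[best_candidate(exp_mz, theo_mzs)]
        print(f"{exp_mz:.4f}: [{deviations}] ppm -> {winner}")
```

The three candidate envelopes differ by only a few millidalton per peak, yet every isotopic peak points to the same column, which is the kind of discrimination the slide is illustrating.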
Pros: In-situ unwrapping of overlapped iEs
Proportional partition: the abundance of an overlapped isotopic peak is divided among the individual overlapped isotopic envelopes in proportion to abundances calculated from the experimental abundances and the theoretical relative abundance ratios.
k: number of overlapped isotopic peaks
m: number of isotopic peaks in each iE
n: number of overlapped iEs
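The partition formula is not reproduced on the slide, so the following is only one plausible reading of the description. Each envelope is first scaled to the data using its non-overlapped peaks (experimental over theoretical abundance), the scaled theoretical abundance at the overlapped position gives that envelope's expected contribution, and the observed abundance is then split in proportion to those expectations. All names and numbers are hypothetical.

```python
def expected_contribution(clean_exp, clean_theo, theo_at_overlap):
    """Expected abundance of one envelope at the overlapped peak.

    clean_exp / clean_theo: experimental and theoretical abundances of the
    envelope's peaks that are NOT overlapped; their ratio sets the scale.
    """
    scale = sum(clean_exp) / sum(clean_theo)
    return scale * theo_at_overlap


def partition_overlapped_peak(observed_abundance, expected):
    """Split one observed abundance among envelopes in proportion to `expected`."""
    total = sum(expected)
    if total <= 0:
        raise ValueError("expected contributions must sum to a positive value")
    return [observed_abundance * e / total for e in expected]


if __name__ == "__main__":
    # Two made-up envelopes sharing one isotopic peak observed at abundance 1100.
    env_a = expected_contribution([500.0, 1000.0], [50.0, 100.0], 80.0)
    env_b = expected_contribution([200.0, 300.0], [40.0, 60.0], 40.0)
    shares = partition_overlapped_peak(1100.0, [env_a, env_b])
    print("expected:", env_a, env_b)
    print("shares:  ", shares)
```

The shares always sum to the observed abundance, so no signal is created or lost by the split; how well each share matches its envelope can then be judged by the same abundance-deviation check used for non-overlapped peaks.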
Other improvements and utilities
Bi-section method for fast indexing of candidates
LASSO-like approach to untangle overlapped iEs
Additional utilities:
  A comprehensive confidence score
  False discovery rate (FDR)
  Customized ion types to look for new dissociation channels
  Customized MODs for the search of new modifications or labeled proteins
  MS/MS spectrum annotation with matching fragments
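The bisection idea can be sketched with Python's standard `bisect` module: if the theoretical m/z values are kept sorted, the candidates inside a ppm window around an experimental peak are found with two binary searches instead of a scan over the whole database. This is an illustration of the general technique, not ProteinGoggle's code. The function name is hypothetical, the theoretical values are borrowed from the close-iE table earlier in the deck, and the window is computed relative to the experimental m/z for simplicity.

```python
from bisect import bisect_left, bisect_right


def candidates_in_window(sorted_mz, exp_mz, tol_ppm):
    """Return all theoretical m/z values within tol_ppm of exp_mz.

    sorted_mz must be sorted in ascending order; lookup costs O(log N)
    plus the size of the returned slice.
    """
    delta = exp_mz * tol_ppm * 1e-6
    lo = bisect_left(sorted_mz, exp_mz - delta)
    hi = bisect_right(sorted_mz, exp_mz + delta)
    return sorted_mz[lo:hi]


if __name__ == "__main__":
    theo = sorted([599.6478, 599.6511, 599.6595, 599.9821, 599.9855, 599.9939,
                   600.3164, 600.3197, 600.3281, 600.6506, 600.6539, 600.6623])
    print(candidates_in_window(theo, 599.6575, 15.0))
    print(candidates_in_window(theo, 599.6575, 5.0))
```

Narrowing the tolerance from 15 ppm to 5 ppm shrinks the candidate list without changing the lookup cost, which is what makes strict settings affordable on a large customized database.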
Conclusions
An as-confident-as-you-choose protein database search algorithm, iMEF, has been created and implemented in the search engine ProteinGoggle.
The principle of iMEF with ProteinGoggle is demonstrated with the identification of ubiquitin from its tandem mass spectrum using ETD.
iMEF as implemented in ProteinGoggle has been able to unwrap complex overlapping isotopic envelopes and confidently provide embedded fragment ions.
iMEF could be adapted for peptide and glycan database search with customized databases.
Acknowledgements
DNL2003: Li Li, Bo Wang, Jing Li, Xu Zhao
The KENES. Co. Ltd.: Miao Zhou, Shijin Liu, Bin Yang
Funding: DICP “Research Start”; China “Youth 1000-talents Theme”
Thank you very much!