Analysis of tandem mass spectra - I
Prof. William Stafford Noble
GENOME 541 Intro to Computational Molecular Biology
Outline
Tandem mass spectrometry overview
Peptide identification
–De novo
–Peptide database search
–Spectrum database search
Score calibration
Statistical confidence estimation
Mass spectrometry background
EAMPK GDIFYPGYCPDVK LPLENENQGK ASVYNSFVSNGVK YVMTFK ENQGVVNR
Peptide fragmentation
Peptide fragmentation spectrum
[Figure: fragmentation spectrum of peptide VVVTGLGMLSPVGNTVESTWK; axes: m/z, intensity]
Our first goal is to identify each spectrum
[Figure labels: EAMPK, EAMPK?]
Peptide identification
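Two approaches (Frank et al. JPR)
The peptide can be inferred using pairs of nearby peaks whose mass difference matches an amino acid residue
[Figure: spectrum annotated with the peptide IYEVEGMR]
The spectrum graph considers all possible peptides (Frank et al. JPR)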
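The idea behind the spectrum graph can be sketched in a few lines of Python. This is a toy illustration, not the method of Frank et al.: it assumes a noise-free spectrum containing only singly charged b-ion peaks (including a peak for the full residue sum), uses a hypothetical tolerance of 0.01 Da, and enumerates paths by depth-first search instead of scoring them with dynamic programming. Leucine is omitted because it has the same mass as isoleucine, so "I" stands for either residue.

```python
# Monoisotopic residue masses in daltons (L omitted: identical to I).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728


def b_ion_peaks(peptide):
    """Toy spectrum: the proton mass followed by every cumulative prefix mass."""
    peaks = [PROTON]
    total = PROTON
    for residue in peptide:
        total += RESIDUE_MASS[residue]
        peaks.append(total)
    return peaks


def spectrum_graph(peaks, tol=0.01):
    """Connect two peaks when their mass difference matches a residue mass."""
    peaks = sorted(peaks)
    edges = {i: [] for i in range(len(peaks))}
    for i, low in enumerate(peaks):
        for j in range(i + 1, len(peaks)):
            diff = peaks[j] - low
            for residue, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges[i].append((j, residue))
    return peaks, edges


def decode(peaks, edges):
    """Enumerate every path from the first peak to the last peak."""
    target = len(peaks) - 1
    results = []

    def walk(node, prefix):
        if node == target:
            results.append(prefix)
            return
        for nxt, residue in edges[node]:
            walk(nxt, prefix + residue)

    walk(0, "")
    return results


peaks, edges = spectrum_graph(b_ion_peaks("IYEVEGMR"))
print(decode(peaks, edges))
```

With real spectra, missing and spurious peaks add and remove edges, which is where the trade-offs of the de novo approach come from.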
What are the pros and cons of the de novo approach?
+ Dynamic programming can quickly find peptides that fit a given spectrum graph.
- Many real spectra don’t yield a very good spectrum graph.
+ It is possible to match against a noisy spectrum by allowing missing peaks.
- If you allow lots of missing peaks, then the DP is slow and leads to many false positive matches.
- If instead you search a database, you limit the number of possible false positives.
- In practice, the de novo approach does not identify many spectra.
+ The de novo approach is the only one that can find previously unknown peptides.
The database search approach (Nesvizhskii et al. Nature Methods)
SEQUEST theoretical spectrum
[Figure legend: y-ion, charge 1; b-ion, charge 2; flanks; H2O loss; NH3 loss; a-ion]
[Figure labels: Theoretical Peaks, Observed Peaks]
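As a rough illustration of where the b- and y-ion peaks of a theoretical spectrum come from, the sketch below computes their m/z values from standard monoisotopic residue masses. It is a simplified assumption-laden example, not SEQUEST's generator: it omits the flanking peaks, the H2O and NH3 neutral losses, and the a-ions named in the legend, and the function name `fragment_ions` is hypothetical.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide, charge=1):
    """Return (b_ions, y_ions) m/z lists, one entry per peptide bond.

    b_ions[k-1] is the prefix of length k; y_ions[k-1] is the
    complementary suffix produced by the same cleavage.
    """
    b_ions, y_ions = [], []
    for k in range(1, len(peptide)):
        prefix = sum(RESIDUE_MASS[r] for r in peptide[:k])
        suffix = sum(RESIDUE_MASS[r] for r in peptide[k:])
        b_ions.append((prefix + charge * PROTON) / charge)
        y_ions.append((suffix + WATER + charge * PROTON) / charge)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("EAMPK")
for k, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{k} = {b:.4f}   y{len('EAMPK') - k} = {y:.4f}")
```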
SEQUEST cross-correlation score
Define $R_i$ as the scalar product of the two spectra, with one offset by $i$. The score, called XCorr, is $R_0$ minus the average $R_i$ for $i \in \{-75, \dots, 75\}$:
$$\mathrm{XCorr} = R_0 - \frac{1}{151} \sum_{i=-75}^{75} R_i$$
[Figure: database search components: spectrum, peptide database, theoretical spectrum generator, spectrum comparison function]
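The definition above translates almost directly into code. The following is a minimal sketch, assuming both spectra are already binned into equal-length lists of intensities and that bins shifted past either end count as zero; the function names are illustrative, and the average is taken over all 151 offsets exactly as stated on the slide (some implementations differ in this detail).

```python
def shifted_dot(x, y, offset):
    """Scalar product of x with y shifted by `offset` bins (zero-padded)."""
    n = len(x)
    return sum(x[j] * y[j + offset] for j in range(n) if 0 <= j + offset < n)


def xcorr(observed, theoretical, max_offset=75):
    """R_0 minus the mean of R_i over i = -max_offset .. max_offset."""
    r0 = shifted_dot(theoretical, observed, 0)
    offsets = range(-max_offset, max_offset + 1)
    background = sum(shifted_dot(theoretical, observed, i) for i in offsets)
    return r0 - background / len(offsets)


# Toy binned spectra with unit peaks (hypothetical data).
spectrum = [0.0] * 200
spectrum[50] = 1.0
spectrum[120] = 1.0

other = [0.0] * 200
other[10] = 1.0

print(xcorr(spectrum, spectrum))  # matching peptide: high score
print(xcorr(spectrum, other))     # unrelated peptide: score near zero
```

Subtracting the average over offsets penalizes spectra that would match well at any alignment, which is what makes the score a measure of specific agreement.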
The X!Tandem hyperscore
$$\mathrm{Hyperscore} = \left( \sum_i I_i P_i \right) N_b! \, N_y!$$
where $N_b$ is the number of matched b-ions, $N_y$ is the number of matched y-ions, $P_i$ is a Boolean indicating whether the peak at m/z value $i$ is a b- or y-ion, and $I_i$ is the intensity of the peak at m/z value $i$.
A third approach to identifying spectra
1. De novo identification
2. Search against a database of theoretical spectra
3. Search against a database of previously observed, identified spectra
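A small sketch of the hyperscore, built from the four quantities listed on the slide. It assumes the observed spectrum is a dictionary from integer m/z bin to intensity and that the theoretical b- and y-ions are given as lists of bins; these data structures and the function name are illustrative choices, not X!Tandem's internals.

```python
from math import factorial


def hyperscore(observed, b_bins, y_bins):
    """(sum of matched intensities) * N_b! * N_y!

    observed: dict mapping integer m/z bin -> intensity
    b_bins, y_bins: integer m/z bins of the theoretical b- and y-ions
    """
    b_hits = [m for m in b_bins if m in observed]
    y_hits = [m for m in y_bins if m in observed]
    matched_intensity = sum(observed[m] for m in b_hits + y_hits)
    return matched_intensity * factorial(len(b_hits)) * factorial(len(y_hits))


# Hypothetical toy data: two b-ions and one y-ion are matched.
observed = {130: 2.0, 201: 1.0, 147: 3.0, 500: 9.0}
score = hyperscore(observed, b_bins=[130, 201, 332], y_bins=[147, 244])
print(score)  # (2 + 1 + 3) * 2! * 1! = 12.0
```

The factorial terms reward peptides that explain many peaks of both ion series, so the score grows much faster than a plain dot product as matches accumulate.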
Spectrum Identification
[Figure: MS/MS query spectra are searched by SEQUEST against a fasta database and by BiblioSpec against a library of identified spectra, producing a peptide ID list; ID proteins from peptides…]
Database (fasta file):
>MEKK1 (kinase) MDRILARMKKSTRRGGDKNIT PVRRLERR…
>ATMKK5 (kinase kinase) MKPIQSPSGVASPMKNRLRK RPDLSPPLPHRDVALAVLP…
Peptide ID list:
Scan1 0.7 EGSSDEEVP…
Scan1 0.3 TFAEILNPI…
Scan1 0.2 ARFDLNNHD…
Scan2 0.5 EDEESIRAV…
Scan2 0.2 WLGDDCFMV…
Scan2 0.1 IDRAAWKAV…
Scan3 0.2 EITTRDMGN…
Scan3 0.1 GRNMCTAKL…
Library of identified spectra:
3 NGISLTIVR
3 QWDKEPPR
2 FMACSDEK
1 CGCCLYNT
2 GDTIENFK
Spectrum Identification
[Figure: the same diagram without the fasta database; BiblioSpec compares a query spectrum with a library spectrum along the m/z axis, score = 0.2]
What are the pros and cons of library searching?
+ Because the spectrum library contains peak intensity information, matching can be done accurately.
+ Library searching is faster than database searching.
- Library searching can only identify peptides that have been previously identified.
Landscape of Peptide Identification Software
Database search tools: Greylag, MASCOT, MS-Tag (ProteinProspector), Massmatrix, MyriMatch, OMSSA, Olav, Pepfrag (Prowl), Pepprobe, Pepsplice, Pfind, Phenyx, ProLuCID (YADA), ProbID, RAId_DbS, SEQUEST, SpectrumMill, VEMS, X!Tandem
Spectral matching tools: BiblioSpec, SpectraST, X! P3
De novo sequencing tools: Lutefisk, PEAKS, PepNovo, Sequit
Sequence tag/hybrid approaches: ByOnic/LookupPeaks, GutenTag, Inspect, Paragon, Popitam
Others: Proteinlynx, SonarMS/MS (knexus), Xproteo
Score calibration
Searching many spectra yields a set of peptide-spectrum matches (PSMs)
[Figure: spectra paired with peptides to form PSMs]
Identifying spectra requires solving two distinct problems well
Task 1: Ranking candidate peptides with respect to a single spectrum
Task 2: Ranking PSMs with respect to one another
Different spectra yield different score distributions
Different charge states yield different score distributions
Estimating a p-value can improve calibration
The probability of observing a score > 4 is the area under the curve to the right of 4.
XCorr scores fit a Weibull distribution (Klammer J Proteome Research 2009)
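Once a Weibull distribution has been fit to the null scores of a spectrum, the p-value of an observed score is the area under the fitted density to its right, which has a closed form. The sketch below only evaluates that tail; it does not perform the fit, and the shape, scale, and shift values in the example are hypothetical, not parameters from Klammer et al.

```python
import math


def weibull_pvalue(score, shape, scale, shift=0.0):
    """Upper-tail probability P(X > score) of a shifted Weibull distribution.

    Survival function: exp(-((score - shift) / scale) ** shape) for
    score > shift, and 1 for scores at or below the shift.
    """
    if score <= shift:
        return 1.0
    return math.exp(-(((score - shift) / scale) ** shape))


# Hypothetical fitted parameters for one spectrum.
print(weibull_pvalue(4.0, shape=2.0, scale=2.0))
```

Because each spectrum gets its own fitted parameters, the same raw XCorr can map to very different p-values for different spectra, which is the point of calibration.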
X!Tandem and Comet use a log-linear fit
[Figure: log(count) versus XCorr, annotated "XCorr of 5 corresponds to count of"]
(Eng J Proteome Research 2009)
Calibration improves statistical power to identify spectra
[Figure: identified spectra versus FDR threshold, calibrated and uncalibrated]
Statistical confidence estimation (Elias & Gygi Nat Biotech 2007)
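The log-linear idea can be sketched as an ordinary least-squares line through (score, log10 count) points, extrapolated to the observed score. This is an illustration under simplifying assumptions: the points below are synthetic, the function names are hypothetical, and the actual tools choose which part of the histogram to fit in their own ways.

```python
import math


def fit_log_linear(points):
    """Least-squares fit of log10(count) = intercept + slope * score.

    points: iterable of (score, count) pairs with count > 0.
    """
    xs = [score for score, _ in points]
    ys = [math.log10(count) for _, count in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def expected_count(score, intercept, slope):
    """Extrapolate the fitted line to get the expected count at `score`."""
    return 10 ** (intercept + slope * score)


# Synthetic tail counts: number of candidate peptides scoring at least s.
tail = [(1.0, 1000), (2.0, 100), (3.0, 10)]
intercept, slope = fit_log_linear(tail)
print(expected_count(5.0, intercept, slope))
```

The extrapolated count plays the role of an E-value: the number of candidates expected to reach that score by chance for this spectrum.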
Our second goal is to identify a set of spectra at a given false discovery rate
Spectrum identification must account for two types of multiple testing
[Figure: candidate peptides MSEDEIER, VDPSSWFNN, CSSSTEAEQR, CIVGLTK, QFIDFSTVFQP, ISLSGK, ALNDVGK; labels: minimum p-value, false discovery rate]
Decoy peptides can be used to estimate FDR.
[Figure: target and decoy searches; peptides shown: MSEDEIER, ISLSGK, CSSSTEAEQR, CIVGLTK, ALNDVGK]
Peptide Score
MSEDEIER 2.2
ISLSGK 1.6
CSSSTEAEQR 1.9
CIVGLTK 2.8
ALNDVGK 2.7
VDPSSWFNN 1.2
QFIDFSTVFQP 1.7
FDR = 1/4 = 25%
(Elias & Gygi Nat Biotech 2007)
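The decoy-based estimate is simply the number of decoy matches above a score threshold divided by the number of target matches above it. In the sketch below, the split of the slide's scores into targets and decoys and the threshold of 1.65 are assumptions, chosen because they reproduce the stated FDR of 1/4; the slide text itself does not label them.

```python
def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR as (#decoys >= threshold) / (#targets >= threshold)."""
    targets = sum(score >= threshold for score in target_scores)
    decoys = sum(score >= threshold for score in decoy_scores)
    return decoys / targets if targets else 0.0


# Assumed assignment: the five peptides listed first are targets and
# VDPSSWFNN / QFIDFSTVFQP are decoys.
target_scores = [2.2, 1.6, 1.9, 2.8, 2.7]
decoy_scores = [1.2, 1.7]

print(decoy_fdr(target_scores, decoy_scores, threshold=1.65))  # 1 / 4 = 0.25
```

Raising the threshold to 1.8 would exclude the remaining decoy and give an estimated FDR of zero, at the cost of no additional target identifications lost in this toy example.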
XCorr can be calculated as a dot product between two spectra
Observed spectrum:
stage 1: square root
stage 2: normalize regions
stage 3: cross-correlation pre-processing
Theoretical spectrum (VNIQEELGK), for each peptide bond: b ion, y ion, neutral losses
The dot product of the two gives the XCorr score
(Eng J Proteome Research 2008)
XCorr can be refactored to make the theoretical spectrum binary
[Figure: the stage 3 observed spectrum is combined with the theoretical-spectrum fingerprint of b / y / neutral losses centered at m_i = 347, giving the sum of evidence for cleavage at m_i = 347 in a vector of cleavage evidence. The theoretical spectrum for VNIQEELGK becomes binary markers of backbone cleavage, and the dot product gives the refactored XCorr score.]
(Howbert Molecular & Cellular Proteomics 2014)
The refactored XCorr is very similar to the original XCorr (ρ = 0.995)
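The stage 3 pre-processing can be sketched as subtracting, from each bin of the observed spectrum, the mean intensity in a window of 75 bins on either side; XCorr is then a single dot product with the theoretical spectrum. The example below covers only that stage (the square-root and region-normalization stages are omitted), assumes zero padding at the spectrum edges, and averages over all 151 offsets including zero. Function names are illustrative.

```python
def preprocess(observed, max_offset=75):
    """Subtract from each bin the mean of the bins within max_offset of it."""
    n = len(observed)
    width = 2 * max_offset + 1
    processed = []
    for j in range(n):
        window = sum(
            observed[j + i]
            for i in range(-max_offset, max_offset + 1)
            if 0 <= j + i < n
        )
        processed.append(observed[j] - window / width)
    return processed


def xcorr_dot(observed, theoretical, max_offset=75):
    """XCorr as one dot product with the pre-processed observed spectrum."""
    processed = preprocess(observed, max_offset)
    return sum(t * o for t, o in zip(theoretical, processed))


# Toy binned spectrum with two unit peaks (hypothetical data).
spectrum = [0.0] * 200
spectrum[50] = 1.0
spectrum[120] = 1.0
print(xcorr_dot(spectrum, spectrum))
```

Because the pre-processing depends only on the observed spectrum, it is done once per spectrum, and each candidate peptide then costs just one dot product.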
Distribution of XCorr scores can be computed using dynamic programming
Each column holds a score distribution for all peptides with a given mass
Running time: $O(m \, (s_{\max} - s_{\min}) \, |AA|)$, ~1 sec for $m = 1500$
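The dynamic program can be sketched as follows. This toy version assumes integer residue masses and an integer evidence value for every mass bin, and it defines a peptide's score as the sum of evidence at each of its cumulative prefix masses (including the final one, for simplicity). The two-letter alphabet with masses 1 and 2 is purely hypothetical, and all sequences are weighted equally.

```python
from collections import defaultdict


def score_distributions(max_mass, residue_masses, evidence):
    """dp[m][s] = number of residue sequences with total mass m and score s."""
    dp = [defaultdict(int) for _ in range(max_mass + 1)]
    dp[0][0] = 1  # the empty sequence has mass 0 and score 0
    for mass in range(1, max_mass + 1):
        for residue in residue_masses:
            prev = mass - residue
            if prev < 0:
                continue
            # Extend every sequence of mass `prev` by one residue.
            for score, count in dp[prev].items():
                dp[mass][score + evidence[mass]] += count
    return dp


def p_value(distribution, observed_score):
    """Fraction of sequences scoring at least `observed_score`."""
    total = sum(distribution.values())
    if total == 0:
        raise ValueError("no sequences have this mass")
    tail = sum(c for s, c in distribution.items() if s >= observed_score)
    return tail / total


# Hypothetical toy problem: residues of mass 1 and 2, precursor mass 3.
dp = score_distributions(3, residue_masses=[1, 2], evidence=[0, 1, 2, 0])
print(dict(dp[3]))
print(p_value(dp[3], observed_score=2))
```

The three nested loops run over masses, residues, and scores, which is where the $O(m \, (s_{\max} - s_{\min}) \, |AA|)$ cost comes from.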
Calibration removes charge state dependency of scores
XCorr p-values must be corrected for multiple testing
Sidak correction is similar to Bonferroni
It accounts for the fact that database search considers many peptides for each spectrum
The resulting p-values are distributed uniformly
[Figure: histogram of p-value frequency; log(p-value) versus log(rank p-value)]
Exact calibration improves statistical power to identify spectra
[Figure panel: yeast-01, MS1 / MS2 low resolution]
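For a spectrum compared against n candidate peptides, the Sidak-corrected p-value of the best match is 1 - (1 - p)^n, which is always at most the Bonferroni value n·p. A minimal sketch, assuming the n tests are independent; `log1p` and `expm1` are used only to keep the result accurate when p is very small.

```python
import math


def sidak(p, n):
    """Sidak-corrected p-value: 1 - (1 - p) ** n."""
    if p >= 1.0:
        return 1.0
    return -math.expm1(n * math.log1p(-p))


def bonferroni(p, n):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, n * p)


# Hypothetical example: best-match p-value 1e-6 among 10,000 candidates.
print(sidak(1e-6, 10_000), bonferroni(1e-6, 10_000))
```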