Analysis of tandem mass spectra - I Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology.


1 Analysis of tandem mass spectra - I Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology

2 Outline
Tandem mass spectrometry overview
Peptide identification
 – De novo
 – Peptide database search
 – Spectrum database search
Score calibration
Statistical confidence estimation

3 Mass spectrometry background

4 EAMPK GDIFYPGYCPDVK LPLENENQGK ASVYNSFVSNGVK YVMTFK ENQGVVNR

5 Peptide fragmentation

6 Peptide fragmentation spectrum
[Figure: intensity versus m/z for peptide VVVTGLGMLSPVGNTVESTWK; peak labels: +2, 1304.4, +1, 888.14, +1]

7

8 EAMPK EAMPK? Our first goal is to identify each spectrum

9 Peptide identification

10 Two approaches Frank et al. JPR. 2006.

11 The peptide can be inferred using pairs of nearby peaks IYEVEGMR
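
As a minimal sketch of the idea on this slide, the mass difference between two nearby singly charged peaks from the same ion series can be looked up against amino acid residue masses. The function name `residues_for_gap`, the tolerance, the example peaks, and the small residue table are illustrative assumptions, not taken from the slides.

```python
# Monoisotopic residue masses in daltons for a few amino acids (illustrative subset).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "V": 99.06841,
    "E": 129.04259, "M": 131.04049, "R": 156.10111, "Y": 163.06333,
}

def residues_for_gap(mz_low, mz_high, tol=0.02):
    """Return residues whose mass matches the gap between two +1 peaks."""
    gap = mz_high - mz_low
    return sorted(aa for aa, mass in RESIDUE_MASS.items() if abs(mass - gap) <= tol)

# Hypothetical peaks 129.04 Da apart are consistent with a glutamate (E).
print(residues_for_gap(400.20, 529.24))
```

Note that leucine and isoleucine have the same mass, so a peak gap alone cannot tell them apart.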

12 The spectrum graph considers all possible peptides Frank et al. JPR. 2006.

13 What are the pros and cons of the de novo approach?
+ Dynamic programming can quickly find peptides that fit a given spectrum graph.
- Many real spectra don’t yield a very good spectrum graph.
+ It is possible to match against a noisy spectrum by allowing missing peaks.
- If you allow lots of missing peaks, then the DP is slow and leads to many false positive matches.
- If instead you search a database, you limit the number of possible false positives.
- In practice, the de novo approach does not identify many spectra.
+ The de novo approach is the only one that can find previously unknown peptides.

14 The database search approach Nesvizhskii et al. Nature Methods. 2007.

15 SEQUEST theoretical spectrum
[Figure labels: y-ion, charge 1; b-ion, charge 2; flanks; H2O loss; NH3 loss; a-ion]
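
A theoretical spectrum starts from the b- and y-ion m/z values at each backbone cleavage. The sketch below computes only those two series from standard monoisotopic residue masses; the flanking peaks, neutral losses (H2O, NH3), and a-ions named on the slide are left out, and the function name `by_ions` is a hypothetical choice, not SEQUEST's code.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def by_ions(peptide, charge=1):
    """Return (b_mz, y_mz), both ordered by cleavage position from the N-terminus."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_neutral = sum(masses[:i])            # N-terminal fragment
        y_neutral = sum(masses[i:]) + WATER    # C-terminal fragment keeps H2O
        b_ions.append((b_neutral + charge * PROTON) / charge)
        y_ions.append((y_neutral + charge * PROTON) / charge)
    return b_ions, y_ions

b_mz, y_mz = by_ions("EAMPK")
print([round(x, 3) for x in b_mz])
print([round(x, 3) for x in y_mz])
```

Because the lists follow cleavage position, `y_mz[0]` is the longest y-ion and `y_mz[-1]` is y1.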

16 Theoretical Peaks, Observed Peaks

17 SEQUEST cross-correlation score
Define R_i as the scalar product of the two spectra, with one offset by i. The score, called XCorr, is R_0 minus the average R_i for i in -75, …, 75.
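
The definition above can be sketched directly, assuming both spectra have already been binned into equal-length intensity vectors. This is a naive illustration of the formula as stated on the slide, not SEQUEST's implementation; in particular, including offset 0 in the average is an assumption made here, and the name `xcorr` is illustrative.

```python
def xcorr(observed, theoretical, max_offset=75):
    """R_0 minus the mean of R_i over offsets i = -max_offset..max_offset."""
    n = len(observed)

    def r(offset):
        # Scalar product with the theoretical spectrum shifted by `offset` bins.
        total = 0.0
        for j, t in enumerate(theoretical):
            k = j + offset
            if 0 <= k < n:
                total += observed[k] * t
        return total

    offsets = range(-max_offset, max_offset + 1)
    background = sum(r(i) for i in offsets) / len(offsets)
    return r(0) - background

# Toy spectra with a single shared peak: R_0 = 1 and every other R_i = 0.
obs = [0.0] * 200
theo = [0.0] * 200
obs[100] = 1.0
theo[100] = 1.0
print(xcorr(obs, theo))
```

Subtracting the average over shifted alignments penalizes spectra that match equally well at any offset, which is what a dense, noisy spectrum would do.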

18 [Diagram labels: spectrum; peptide database; theoretical spectrum generator; spectrum comparison function]

19 The X!Tandem hyperscore
[Equation annotations: number of matched b-ions; number of matched y-ions; Boolean: is the peak at m/z value i a b- or y-ion?; intensity of the peak at m/z value i]
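
The equation itself did not survive in this transcript. The four annotated terms match the commonly cited form of the hyperscore, in which the summed intensity of matched peaks is multiplied by the factorials of the matched b- and y-ion counts. The sketch below assumes that form; the function and argument names are hypothetical.

```python
from math import factorial

def hyperscore(intensities, is_by_ion, n_b, n_y):
    """(sum of I_i * P_i) * N_b! * N_y!, assuming the commonly cited form.

    intensities[i] is the peak intensity at m/z value i, is_by_ion[i] is the
    Boolean P_i, and n_b / n_y are the numbers of matched b- and y-ions.
    """
    matched_intensity = sum(i for i, p in zip(intensities, is_by_ion) if p)
    return matched_intensity * factorial(n_b) * factorial(n_y)

# Hypothetical spectrum: three matched peaks (two b-ions, one y-ion).
print(hyperscore([10, 5, 20, 8], [True, False, True, True], n_b=2, n_y=1))
```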

20 A third approach to identifying spectra
1. De novo identification
2. Search against a database of theoretical spectra
3. Search against a database of previously observed, identified spectra

21 Spectrum Identification
[Diagram] MS/MS query spectra are searched by SEQUEST against a sequence database and by BiblioSpec against a library of identified spectra, yielding a peptide ID list; ID proteins from peptides…
Database: fasta file
>MEKK1 (kinase)
MDRILARMKKSTRRGGDKNIT PVRRLERR…
>ATMKK5 (kinase kinase)
MKPIQSPSGVASPMKNRLRK RPDLSPPLPHRDVALAVLP…
…
Library of identified spectra
3 NGISLTIVR
3 QWDKEPPR
2 FMACSDEK
1 CGCCLYNT
2 GDTIENFK
Peptide ID list
Scan1 0.7 EGSSDEEVP…
Scan1 0.3 TFAEILNPI…
Scan1 0.2 ARFDLNNHD…
-------------------
Scan2 0.5 EDEESIRAV…
Scan2 0.2 WLGDDCFMV…
Scan2 0.1 IDRAAWKAV…
-------------------
Scan3 0.2 EITTRDMGN…
Scan3 0.1 GRNMCTAKL…

22 Spectrum Identification
[Diagram: same components as slide 21 (MS/MS query spectra, SEQUEST, BiblioSpec, library of identified spectra, peptide ID list), with the same peptide ID list and library entries. Added spectrum labels: 765.1, 940.4, 593.9, 300.4, 522.3, m/z 594.2, score = 0.2]

23 What are the pros and cons of library searching?
+ Because the spectrum library contains peak intensity information, matching can be done accurately.
+ Library searching is faster than database searching.
- Library searching can only identify peptides that have been previously identified.

24 Landscape of Peptide Identification Software
Database search tools: Greylag, MASCOT, MS-Tag (ProteinProspector), Massmatrix, MyriMatch, OMSSA, Olav, Pepfrag (Prowl), Pepprobe, Pepsplice, Pfind, Phenyx, ProLuCID (YADA), ProbID, RAId_DbS, SEQUEST, SpectrumMill, VEMS, X!Tandem
Spectral matching tools: Bibliospec, SpectraST, X! P3
De novo sequencing tools: Lutefisk, PEAKS, PepNovo, Sequit
Sequence tag/hybrid approaches: ByOnic/LookupPeaks, GutenTag, Inspect, Paragon, Popitam
Others: Proteinlynx, SonarMS/MS (knexus), Xproteo

25 Score calibration

26 Searching many spectra yields a set of peptide-spectrum matches (PSMs)
[Figure labels: Spectra, Peptides, PSMs]

27 Identifying spectra requires solving two distinct problems well
Task 1: Ranking candidate peptides with respect to a single spectrum
Task 2: Ranking PSMs with respect to one another

28 Different spectra yield different score distributions

29

30 Different charge states yield different score distributions

31 Estimating a p-value can improve calibration
The probability of observing a score >4 is the area under the curve to the right of 4.

32 XCorr scores fit a Weibull distribution (Klammer J Proteome Research 2009)
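
Once a Weibull distribution has been fit to the null scores for a spectrum, the p-value of an observed score is the Weibull survival function evaluated at that score. The sketch below shows only that last step for a three-parameter (shape, scale, shift) Weibull; the fitting procedure is not shown, and the parameter values in the example are made up for illustration.

```python
from math import exp

def weibull_pvalue(score, shape, scale, shift=0.0):
    """P(X >= score) for a Weibull with the given shape, scale, and shift."""
    if score <= shift:
        return 1.0
    return exp(-(((score - shift) / scale) ** shape))

# Hypothetical fitted parameters for one spectrum's null XCorr distribution.
print(weibull_pvalue(4.0, shape=1.5, scale=1.0, shift=0.5))
```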

33 X!Tandem and Comet use a log linear fit
[Plot: log(count) versus XCorr] An XCorr of 5 corresponds to a count of 10^-5 (Eng J Proteome Research 2009)
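
A rough sketch of the general idea: fit a straight line to log10(count) as a function of score, then extrapolate the line to a high score. This is an ordinary least-squares illustration, not the exact procedure used by X!Tandem or Comet; the function names are hypothetical and the counts are made up so that the fitted line reproduces the slide's example (XCorr of 5 giving 10^-5).

```python
from math import log10

def fit_log_linear(scores, counts):
    """Least-squares line log10(count) = slope * score + intercept."""
    ys = [log10(c) for c in counts]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def extrapolate(score, slope, intercept):
    """Expected count at `score` according to the fitted line."""
    return 10 ** (slope * score + intercept)

# Hypothetical survival counts at three low scores.
slope, intercept = fit_log_linear([1.0, 1.5, 2.0], [1000, 100, 10])
print(slope, intercept, extrapolate(5.0, slope, intercept))
```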

34 Calibration improves statistical power to identify spectra
[Plot: identified spectra versus FDR threshold, for calibrated and uncalibrated scores]

35 Statistical confidence estimation Elias & Gygi Nat Biotech 2007

36 Our second goal is to identify a set of spectra at a given false discovery rate

37 Spectrum identification must account for two types of multiple testing
[Figure: peptides MSEDEIER, VDPSSWFNN, CSSSTEAEQR, CIVGLTK, QFIDFSTVFQP, ISLSGK, ALNDVGK; labels: minimum p-value, false discovery rate]

38 Decoy peptides can be used to estimate FDR.
[Figure labels: Search, Decoy, Target; peptides MSEDEIER, ISLSGK, CSSSTEAEQR, CIVGLTK, ALNDVGK]
Scored matches: MSEDEIER 2.2, ISLSGK 1.6, CSSSTEAEQR 1.9, CIVGLTK 2.8, ALNDVGK 2.7, VDPSSWFNN 1.2, QFIDFSTVFQP 1.7
FDR = 1/4 = 25%
Elias & Gygi Nat Biotech 2007
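
The arithmetic behind the 25% figure can be sketched as follows. The slide's layout is lost, so two things are assumptions here: that VDPSSWFNN and QFIDFSTVFQP are the decoys (they are absent from the slide's first peptide list), and that the score threshold is 1.7. Under those assumptions one decoy and four targets pass, and the estimate is decoys divided by targets.

```python
def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR as (#decoys >= threshold) / (#targets >= threshold)."""
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    return decoys / targets if targets else 0.0

# Scores from the slide; the target/decoy split and threshold are assumed.
target = {"MSEDEIER": 2.2, "ISLSGK": 1.6, "CSSSTEAEQR": 1.9,
          "CIVGLTK": 2.8, "ALNDVGK": 2.7}
decoy = {"VDPSSWFNN": 1.2, "QFIDFSTVFQP": 1.7}

print(decoy_fdr(target.values(), decoy.values(), threshold=1.7))
```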

39 XCorr can be calculated as a dot product between two spectra
Observed spectrum pre-processing: stage 1, square root; stage 2, normalize regions; stage 3, cross-correlation pre-processing
Theoretical spectrum (VNIQEELGK): for each peptide bond, b ion, y ion, neutral losses
Dot product: XCorr score (Eng J Proteome Research 2008)

40 XCorr can be refactored to make the theoretical spectrum binary
Observed spectrum (stage 3): fingerprint of b / y / neutral losses centered at m_i = 347; sum of evidence for cleavage at m_i = 347; vector of cleavage evidence
Theoretical spectrum (VNIQEELGK): binary markers of backbone cleavage
Dot product: refactored XCorr score (Howbert Molecular & Cellular Proteomics 2014)

41 The refactored XCorr is very similar to the original XCorr ρ=0.995

42 Distribution of XCorr scores can be computed using dynamic programming

43 Each column holds a score distribution for all peptides with a given mass
O(m · (s_max – s_min) · |AA|), ~ 1 sec for m = 1500
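
A toy version of such a dynamic program, under simplifying assumptions: masses and scores are integers, a sequence's score is the sum of the evidence values at each of its prefix masses (including the full mass), and the "amino acids" in the example have made-up masses 1 and 2. The names `score_distributions` and `exact_pvalue` are hypothetical. Each `dist[m]` plays the role of one column, and the three nested loops give the O(m · (s_max – s_min) · |AA|) running time.

```python
from collections import Counter

def score_distributions(max_mass, evidence, aa_masses):
    """dist[m][s] = number of sequences with integer mass m and score s."""
    dist = [Counter() for _ in range(max_mass + 1)]
    dist[0][0] = 1  # the empty sequence has mass 0 and score 0
    for m in range(1, max_mass + 1):
        for aa_mass in aa_masses:
            prev = m - aa_mass
            if prev < 0:
                continue
            # Extend every sequence of mass `prev` by this residue.
            for s, count in dist[prev].items():
                dist[m][s + evidence[m]] += count
    return dist

def exact_pvalue(dist_at_mass, score):
    """Fraction of all sequences at this mass scoring at least `score`."""
    total = sum(dist_at_mass.values())
    hits = sum(c for s, c in dist_at_mass.items() if s >= score)
    return hits / total

# Toy alphabet with residue masses 1 and 2; evidence indexed by mass 0..3.
dist = score_distributions(3, evidence=[0, 1, 0, 2], aa_masses=[1, 2])
print(dict(dist[3]), exact_pvalue(dist[3], 3))
```

In the toy example the three sequences of mass 3 are 1+1+1 (score 3), 1+2 (score 3), and 2+1 (score 2), so the p-value of a score of 3 is 2/3.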

44 Calibration removes charge state dependency of scores

45 XCorr p-values must be corrected for multiple testing
The Sidak correction is similar to Bonferroni.
It accounts for the fact that database search considers many peptides for each spectrum.
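
For reference, a short sketch of both corrections, where `p` is the p-value of the best-scoring peptide for a spectrum and `n` is the number of candidate peptides scored against that spectrum. The function names and the example numbers are illustrative.

```python
def sidak(p, n):
    """Sidak-corrected p-value: chance that at least one of n tests reaches p."""
    return 1.0 - (1.0 - p) ** n

def bonferroni(p, n):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p * n)

# Hypothetical spectrum: best p-value 1e-6 among 10,000 candidate peptides.
print(sidak(1e-6, 10_000), bonferroni(1e-6, 10_000))
```

For small p the two corrections nearly coincide; they diverge as n · p approaches 1, where Bonferroni saturates at 1 and Sidak stays below it.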

46 The resulting p-values are distributed uniformly
[Plots: log(p-value) versus log(rank p-value); frequency versus p-value]

47 Exact calibration improves statistical power to identify spectra
[Figure panel: yeast-01, MS1 / MS2 low resolution]

