Analysis of tandem mass spectra - I
Prof. William Stafford Noble
GENOME 541 Intro to Computational Molecular Biology

Outline
- Tandem mass spectrometry overview
- Peptide identification
  - De novo
  - Peptide database search
  - Spectrum database search
- Score calibration
- Statistical confidence estimation

Mass spectrometry background

[Figure: example peptides EAMPK, GDIFYPGYCPDVK, LPLENENQGK, ASVYNSFVSNGVK, YVMTFK, ENQGVVNR]

Peptide fragmentation

Peptide fragmentation spectrum
[Figure: fragmentation spectrum of VVVTGLGMLSPVGNTVESTWK; x-axis: m/z; y-axis: intensity; annotations: +2, 1304.4, +1, 888.14, +1]

Our first goal is to identify each spectrum
[Figure labels: EAMPK, EAMPK?]

Peptide identification

Two approaches (Frank et al. JPR. 2006.)

The peptide can be inferred using pairs of nearby peaks
[Figure: example peptide IYEVEGMR]

The spectrum graph considers all possible peptides (Frank et al. JPR. 2006.)
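
The idea of a spectrum graph can be sketched in a few lines: peaks become nodes, and two peaks are joined by an edge whenever their m/z difference matches an amino acid residue mass. The function names, the 0.02 Da tolerance, and the reduced residue table below are illustrative assumptions, not taken from Frank et al.; isoleucine is omitted because it has the same mass as leucine.

```python
from itertools import combinations

# Monoisotopic residue masses (Da) for a subset of amino acids (illustration only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333,
}


def build_spectrum_graph(peaks, tol=0.02):
    """Return (sorted peaks, edges); an edge (i, j, aa) means peaks[j] - peaks[i] ~ mass(aa)."""
    peaks = sorted(peaks)
    edges = []
    for i, j in combinations(range(len(peaks)), 2):
        diff = peaks[j] - peaks[i]
        for aa, mass in RESIDUE_MASS.items():
            if abs(diff - mass) <= tol:
                edges.append((i, j, aa))
    return peaks, edges


def read_paths(edges, start, end):
    """Enumerate the residue strings spelled by every path from node start to node end."""
    out_edges = {}
    for i, j, aa in edges:
        out_edges.setdefault(i, []).append((j, aa))
    results = []

    def walk(node, prefix):
        if node == end:
            results.append(prefix)
            return
        for nxt, aa in out_edges.get(node, []):
            walk(nxt, prefix + aa)

    walk(start, "")
    return results


# Hypothetical noise-free ladder of prefix peaks for the peptide EAMPK.
ladder = [1.00728]
for residue in "EAMPK":
    ladder.append(ladder[-1] + RESIDUE_MASS[residue])
nodes, graph_edges = build_spectrum_graph(ladder)
print(read_paths(graph_edges, 0, len(nodes) - 1))
```

Real spectra have missing and spurious peaks, so a practical tool scores paths rather than enumerating them; the exhaustive walk here is only meant to show how a path through the graph spells a peptide.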

What are the pros and cons of the de novo approach?
+ Dynamic programming can quickly find peptides that fit a given spectrum graph.
- Many real spectra don't yield a very good spectrum graph.
+ It is possible to match against a noisy spectrum by allowing missing peaks.
- If you allow lots of missing peaks, then the DP is slow and leads to many false positive matches.
- If instead you search a database, you limit the number of possible false positives.
- In practice, the de novo approach does not identify many spectra.
+ The de novo approach is the only one that can find previously unknown peptides.

The database search approach (Nesvizhskii et al. Nature Methods. 2007.)

SEQUEST theoretical spectrum
[Figure labels: y-ion, charge 1; b-ion, charge 2; flanks; H2O loss; NH3 loss; a-ion]
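
As a rough illustration of how a theoretical spectrum is generated, the sketch below computes only the b- and y-ion m/z values for a peptide. It leaves out the flanking peaks, neutral losses, and a-ions listed on the slide, and the function name and constants table are assumptions made for this example rather than SEQUEST's own code.

```python
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def theoretical_ions(peptide, charge=1):
    """Return (b_ions, y_ions) m/z lists, one entry per peptide bond.

    Entry k of each list describes the bond after residue k+1, so b_ions
    runs b1, b2, ... while y_ions runs y(n-1), y(n-2), ..., y1.
    """
    total = sum(RESIDUE_MASS[aa] for aa in peptide)
    prefix = 0.0
    b_ions, y_ions = [], []
    for aa in peptide[:-1]:
        prefix += RESIDUE_MASS[aa]
        suffix = total - prefix
        b_ions.append((prefix + charge * PROTON) / charge)
        y_ions.append((suffix + WATER + charge * PROTON) / charge)
    return b_ions, y_ions


b, y = theoretical_ions("EAMPK")
for k, (b_mz, y_mz) in enumerate(zip(b, y), start=1):
    print(f"bond {k}: b = {b_mz:.4f}, y = {y_mz:.4f}")
```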

[Figure: theoretical peaks compared with observed peaks]

SEQUEST cross-correlation score
Define R_i as the scalar product of the two spectra, with one offset by i.
The score, called XCorr, is R_0 minus the average of R_i for i in -75, …, 75.
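
The definition above translates almost directly into code. The sketch below assumes both spectra have already been binned into equal-length intensity vectors, and it averages over all 151 offsets including i = 0, as the slide's wording suggests; some implementations treat the zero offset differently. The function name `xcorr` is hypothetical.

```python
def xcorr(observed, theoretical, max_offset=75):
    """XCorr = R_0 minus the mean of R_i over offsets i in [-max_offset, max_offset]."""
    n = len(observed)

    def shifted_dot(offset):
        # Scalar product with the theoretical spectrum shifted by `offset` bins.
        total = 0.0
        for i, t in enumerate(theoretical):
            j = i + offset
            if 0 <= j < n:
                total += observed[j] * t
        return total

    r0 = shifted_dot(0)
    offsets = range(-max_offset, max_offset + 1)
    background = sum(shifted_dot(o) for o in offsets) / len(offsets)
    return r0 - background


# Toy example: a single peak at the same bin in both spectra.
obs = [0.0] * 200
theo = [0.0] * 200
obs[100] = 1.0
theo[100] = 1.0
print(xcorr(obs, theo))
```

Subtracting the averaged shifted products penalizes spectra that would match well at any offset, which is what makes the score more than a plain dot product.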

[Figure: database search components: spectrum, peptide database, theoretical spectrum generator, spectrum comparison function]

The X!Tandem hyperscore
Hyperscore = (Σ_i I_i × P_i) × N_b! × N_y!
where
N_b = number of matched b-ions
N_y = number of matched y-ions
P_i = Boolean: is the peak at m/z value i a b- or y-ion?
I_i = intensity of the peak at m/z value i
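
A minimal sketch of the hyperscore, assuming the commonly cited form in which the summed intensity of matched peaks is multiplied by the factorials of the matched b-ion and y-ion counts. The function signature and the toy inputs are illustrative, not X!Tandem's API.

```python
from math import factorial


def hyperscore(intensities, is_b, is_y):
    """(sum of intensities of peaks matching a b- or y-ion) * N_b! * N_y!.

    intensities[i] is the intensity of peak i; is_b[i] and is_y[i] flag
    whether that peak matches a theoretical b-ion or y-ion.
    """
    matched_intensity = sum(
        inten for inten, b, y in zip(intensities, is_b, is_y) if b or y
    )
    n_b = sum(1 for b in is_b if b)
    n_y = sum(1 for y in is_y if y)
    return matched_intensity * factorial(n_b) * factorial(n_y)


# Toy example: four peaks, two matching b-ions and one matching a y-ion.
score = hyperscore(
    intensities=[10.0, 5.0, 2.0, 8.0],
    is_b=[True, False, True, False],
    is_y=[False, True, False, False],
)
print(score)
```

The factorial terms reward peptides that explain many fragment ions, so two candidates with the same matched intensity are separated by how many b- and y-ions they match.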

A third approach to identifying spectra
1. De novo identification
2. Search against a database of theoretical spectra
3. Search against a database of previously observed, identified spectra

Spectrum identification
[Figure: building a spectrum library. Components shown: MS/MS query spectra; a FASTA database (entries such as >MEKK1 (kinase) and >ATMKK5 (kinase kinase)); SEQUEST; a peptide ID list with ranked, scored candidates per scan (e.g., Scan1 0.7 EGSSDEEVP…, Scan1 0.3 TFAEILNPI…, Scan1 0.2 ARFDLNNHD…); "ID proteins from peptides…"; BiblioSpec; a library of identified spectra (e.g., 3 NGISLTIVR, 3 QWDKEPPR, 2 FMACSDEK, 1 CGCCLYNT, 2 GDTIENFK)]

Spectrum identification
[Figure: the same components, with a query spectrum compared against the library of identified spectra; m/z values shown: 765.1, 940.4, 593.9, 300.4, 522.3, 594.2; score = 0.2]

What are the pros and cons of library searching?
+ Because the spectrum library contains peak intensity information, matching can be done accurately.
+ Library searching is faster than database searching.
- Library searching can only identify peptides that have been previously identified.

Landscape of Peptide Identification Software
Database search tools: Greylag, MASCOT, MS-Tag (ProteinProspector), Massmatrix, MyriMatch, OMSSA, Olav, Pepfrag (Prowl), Pepprobe, Pepsplice, Pfind, Phenyx, ProLuCID (YADA), ProbID, RAId_DbS, SEQUEST, SpectrumMill, VEMS, X!Tandem
Spectral matching tools: Bibliospec, SpectraST, X! P3
De novo sequencing tools: Lutefisk, PEAKS, PepNovo, Sequit
Sequence tag/hybrid approaches: ByOnic/LookupPeaks, GutenTag, Inspect, Paragon, Popitam
Others: Proteinlynx, SonarMS/MS (knexus), Xproteo

Score calibration

Searching many spectra yields a set of peptide-spectrum matches (PSMs)
[Figure: spectra matched to peptides]

Identifying spectra requires solving two distinct problems well
Task 1: Ranking candidate peptides with respect to a single spectrum
Task 2: Ranking PSMs with respect to one another

Different spectra yield different score distributions

Different charge states yield different score distributions

Estimating a p-value can improve calibration
The probability of observing a score >4 is the area under the null score distribution to the right of 4.

XCorr scores fit a Weibull distribution (Klammer J Proteome Research 2009)
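
To make the Weibull idea concrete, the sketch below fits a two-parameter Weibull to a set of null scores by least squares on the linearized survival function, ln(-ln S(x)) = k ln x - k ln λ, and then converts a score into a p-value. This is a generic textbook fitting procedure chosen for brevity, not necessarily the estimator used by Klammer et al.; the optional `shift` parameter and all names are assumptions.

```python
import math
import random


def weibull_pvalue(score, shape, scale, shift=0.0):
    """P(X >= score) under a Weibull with the given shape, scale, and shift."""
    x = score - shift
    if x <= 0:
        return 1.0
    return math.exp(-((x / scale) ** shape))


def fit_weibull(scores):
    """Estimate (shape, scale) from positive scores via a Weibull-plot regression."""
    xs = sorted(scores)
    n = len(xs)
    us, vs = [], []
    for rank, x in enumerate(xs, start=1):
        survival = 1.0 - (rank - 0.5) / n  # plotting position, strictly in (0, 1)
        us.append(math.log(x))
        vs.append(math.log(-math.log(survival)))
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    shape = suv / suu
    intercept = mean_v - shape * mean_u
    scale = math.exp(-intercept / shape)
    return shape, scale


# Simulated null scores (hypothetical parameters: scale 2.0, shape 1.5).
random.seed(0)
null_scores = [random.weibullvariate(2.0, 1.5) for _ in range(5000)]
k, lam = fit_weibull(null_scores)
print(f"shape = {k:.2f}, scale = {lam:.2f}, p(score >= 4) = {weibull_pvalue(4.0, k, lam):.4f}")
```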

X!Tandem and Comet use a log-linear fit (Eng J Proteome Research 2009)
[Figure: log(count) versus XCorr with a linear fit; an XCorr of 5 corresponds to a count of 10^-5]
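
A simplified sketch of a log-linear fit: regress log10 of the number of candidate scores at or above x on x, then extrapolate the line to a high score to obtain an expected count. Real tools fit only part of the score histogram and differ in detail, so the function names and the choice to fit every point are assumptions made for illustration.

```python
import math


def fit_log_survival(scores):
    """Least-squares fit of log10(#scores >= x) = slope * x + intercept."""
    xs = sorted(scores, reverse=True)
    # After sorting in descending order, the score at position r (1-based)
    # has exactly r scores greater than or equal to it (assuming no ties).
    ys = [math.log10(rank) for rank in range(1, len(xs) + 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def expected_count(score, slope, intercept):
    """Extrapolated number of candidates expected to score at least `score`."""
    return 10 ** (slope * score + intercept)


# Synthetic scores whose survival counts are exactly log-linear.
toy_scores = [3.0 - math.log10(r) for r in range(1, 1001)]
a, b = fit_log_survival(toy_scores)
print(a, b, expected_count(5.0, a, b))
```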

Calibration improves statistical power to identify spectra
[Figure: identified spectra versus FDR threshold, calibrated versus uncalibrated]

Statistical confidence estimation (Elias & Gygi Nat Biotech 2007)

Our second goal is to identify a set of spectra at a given false discovery rate

Spectrum identification must account for two types of multiple testing
[Figure: peptides MSEDEIER, VDPSSWFNN, CSSSTEAEQR, CIVGLTK, QFIDFSTVFQP, ISLSGK, ALNDVGK; labels: minimum p-value, false discovery rate]

Decoy peptides can be used to estimate FDR (Elias & Gygi Nat Biotech 2007)
Scores from searching the target and decoy databases:
MSEDEIER 2.2
ISLSGK 1.6
CSSSTEAEQR 1.9
CIVGLTK 2.8
ALNDVGK 2.7
VDPSSWFNN 1.2
QFIDFSTVFQP 1.7
FDR = 1/4 = 25%
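
The decoy-based estimate is a ratio of counts above a score threshold, sketched below. The split of the slide's peptides into targets and decoys, and the threshold of 1.7, are assumptions chosen because they reproduce the slide's FDR of 1/4; the transcript does not state them explicitly.

```python
def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR as (#decoys >= threshold) / (#targets >= threshold)."""
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)


# Assumed split: the first five peptides are targets, the last two are decoys.
targets = {"MSEDEIER": 2.2, "ISLSGK": 1.6, "CSSSTEAEQR": 1.9,
           "CIVGLTK": 2.8, "ALNDVGK": 2.7}
decoys = {"VDPSSWFNN": 1.2, "QFIDFSTVFQP": 1.7}

print(decoy_fdr(targets.values(), decoys.values(), threshold=1.7))
```

With these assumptions, four targets and one decoy score at least 1.7, giving an estimated FDR of 1/4.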

XCorr can be calculated as a dot product between two spectra (Eng J Proteome Research 2008)
Observed spectrum pre-processing:
  stage 1: square root
  stage 2: normalize regions
  stage 3: cross-correlation
Theoretical spectrum (example peptide VNIQEELGK): for each peptide bond, b ion, y ion, neutral losses
XCorr score = dot product of the pre-processed observed spectrum and the theoretical spectrum

XCorr can be refactored to make the theoretical spectrum binary (Howbert Molecular & Cellular Proteomics 2014)
Observed spectrum after stage 3: a fingerprint of b / y / neutral losses centered at m_i = 347 is summed into the evidence for cleavage at m_i = 347, giving a vector of cleavage evidence
Theoretical spectrum (example peptide VNIQEELGK): binary markers of backbone cleavage
Refactored XCorr score = dot product of the cleavage evidence vector and the binary theoretical spectrum

The refactored XCorr is very similar to the original XCorr (ρ = 0.995)

Distribution of XCorr scores can be computed using dynamic programming

Each column holds a score distribution for all peptides with a given mass
O(m × (s_max - s_min) × |AA|), ~1 sec for m = 1500
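
A toy version of such a dynamic program, under strong simplifying assumptions: masses and evidence values are integers, the score of a peptide is the sum of the evidence at each of its prefix masses (including its total mass), and every amino acid sequence is counted as equally likely. The names `score_distributions` and `exact_pvalue` are hypothetical.

```python
from collections import defaultdict


def score_distributions(evidence, aa_masses):
    """dist[m][s] = number of amino acid sequences with total mass m and score s.

    evidence[m] is the integer cleavage evidence at mass bin m; aa_masses is a
    list of positive integer residue masses.
    """
    max_mass = len(evidence) - 1
    dist = [defaultdict(int) for _ in range(max_mass + 1)]
    dist[0][0] = 1  # the empty sequence has mass 0 and score 0
    for m in range(1, max_mass + 1):
        for a in aa_masses:
            prev = m - a
            if prev < 0:
                continue
            # Appending residue a to any sequence of mass prev adds evidence[m].
            for s, count in dist[prev].items():
                dist[m][s + evidence[m]] += count
    return dist


def exact_pvalue(column, score):
    """Fraction of sequences in this mass column scoring at least `score`."""
    total = sum(column.values())
    if total == 0:
        return 1.0
    return sum(c for s, c in column.items() if s >= score) / total


# Toy alphabet with residue masses 1 and 2, and evidence for mass bins 0..3.
columns = score_distributions(evidence=[0, 1, 0, 2], aa_masses=[1, 2])
print(dict(columns[3]), exact_pvalue(columns[3], 3))
```

Each mass column is filled from earlier columns once per amino acid and per score value, which is where the m × (s_max - s_min) × |AA| cost comes from.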

Calibration removes charge state dependency of scores

XCorr p-values must be corrected for multiple testing
- Sidak correction is similar to Bonferroni
- Accounts for the fact that the database search considers many peptides for each spectrum

The resulting p-values are distributed uniformly
[Figure: histogram of p-value frequency; plot of log(rank p-value) versus log(p-value)]

Exact calibration improves statistical power to identify spectra
[Figure: yeast-01 data set, MS1 / MS2 low resolution]
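
The Sidak correction for the best of n candidate peptides is p_adj = 1 - (1 - p)^n, which is slightly less conservative than the Bonferroni bound n × p. The sketch below uses `log1p` and `expm1` for numerical stability with very small p-values; the function names are illustrative.

```python
import math


def sidak(p, n):
    """Sidak-corrected p-value for the minimum of n independent tests."""
    if p >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, computed stably for small p.
    return -math.expm1(n * math.log1p(-p))


def bonferroni(p, n):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, n * p)


# A spectrum compared against 100 candidate peptides (hypothetical numbers).
print(sidak(0.001, 100), bonferroni(0.001, 100))
```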
