Analysis of tandem mass spectra - I Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology.

Outline
Tandem mass spectrometry overview
Peptide identification
  – De novo
  – Peptide database search
  – Spectrum database search
Score calibration
Statistical confidence estimation

Mass spectrometry background

EAMPK GDIFYPGYCPDVK LPLENENQGK ASVYNSFVSNGVK YVMTFK ENQGVVNR

Peptide fragmentation

Peptide fragmentation spectrum
[Figure: intensity versus m/z for the peptide VVVTGLGMLSPVGNTVESTWK]

Our first goal is to identify each spectrum
[Figure: a spectrum annotated with the candidate peptide "EAMPK?"]

Peptide identification

Two approaches (Frank et al., JPR)

The peptide can be inferred using pairs of nearby peaks
[Figure: example spectrum for the peptide IYEVEGMR]
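The idea of reading residues off pairs of peaks can be sketched as follows. This is a minimal illustration, not the method from the cited work: the function name `residues_between_peaks` and the tolerance `tol` are assumptions made here, and the masses are standard monoisotopic residue masses. Isoleucine and leucine share a mass, so they are reported together.

```python
# Standard monoisotopic residue masses in daltons (I and L are indistinguishable).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "I/L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_between_peaks(peaks, tol=0.02):
    """Return (low, high, residue) for every peak pair whose m/z gap
    matches a single residue mass within `tol` (hypothetical tolerance).
    Assumes singly charged fragment ions of the same ion series."""
    peaks = sorted(peaks)
    hits = []
    for i, low in enumerate(peaks):
        for high in peaks[i + 1:]:
            gap = high - low
            # Pick the residue whose mass is closest to the observed gap.
            best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - gap))
            if abs(RESIDUE_MASS[best] - gap) <= tol:
                hits.append((low, high, best))
    return hits


# Three consecutive b-ion peaks of a peptide starting with E, G, M.
ladder = [130.04987, 187.07133, 318.11182]
labels = [aa for _, _, aa in residues_between_peaks(ladder)]
```

Each matching pair becomes an edge in the spectrum graph discussed on the next slide; gaps that match no residue (such as the first-to-third peak here) contribute no edge.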

The spectrum graph considers all possible peptides (Frank et al., JPR)

What are the pros and cons of the de novo approach?
+ Dynamic programming can quickly find peptides that fit a given spectrum graph.
- Many real spectra don’t yield a very good spectrum graph.
+ It is possible to match against a noisy spectrum by allowing missing peaks.
- If you allow lots of missing peaks, then the DP is slow and leads to many false positive matches.
- If instead you search a database, you limit the number of possible false positives.
- In practice, the de novo approach does not identify many spectra.
+ The de novo approach is the only one that can find previously unknown peptides.

The database search approach (Nesvizhskii et al., Nature Methods)

SEQUEST theoretical spectrum
[Figure labels: y-ion, charge 1, b-ion, charge 2, flanks, H2O loss, NH3 loss, a-ion]
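A theoretical spectrum starts from the b- and y-ion masses at each peptide bond. The sketch below computes only singly charged b and y ions; the flanking peaks, neutral losses, a-ions and higher charge states named on the slide are left out. The function name `theoretical_by_ions` is an assumption for illustration, and the masses are standard monoisotopic values.

```python
PROTON = 1.007276   # mass of a proton, Da
WATER = 18.010565   # mass of H2O, Da

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def theoretical_by_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z values.

    Entry k of each list belongs to the cleavage after residue k+1,
    so b_ions runs b1, b2, ... while y_ions runs y(n-1), ..., y1.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):  # one cleavage per peptide bond
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = theoretical_by_ions("GAK")
```

For the toy peptide GAK the last y-ion is the familiar y1 of a C-terminal lysine at about 147.113.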

[Figure: theoretical peaks compared with observed peaks]

SEQUEST cross-correlation score
Define $R_i$ as the scalar product of the two spectra, with one offset by $i$. The score, called XCorr, is $R_0$ minus the average $R_i$ for $i$ in $-75, \ldots, 75$.
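The definition above translates almost directly into code. The following is a minimal sketch under stated assumptions: both spectra are already binned into equal-length intensity lists, bins shifted past either end are dropped rather than wrapped, and the average is taken over all offsets including zero. The function name `xcorr` and the `max_offset` parameter are illustrative and not SEQUEST's actual interface.

```python
def xcorr(observed, theoretical, max_offset=75):
    """R_0 minus the mean of R_i over offsets -max_offset..max_offset.

    `observed` and `theoretical` are equal-length lists of binned
    intensities (an assumption of this sketch).
    """
    n = len(observed)

    def r(offset):
        # Scalar product with the theoretical spectrum shifted by `offset` bins.
        total = 0.0
        for k in range(n):
            j = k + offset
            if 0 <= j < n:
                total += observed[k] * theoretical[j]
        return total

    offsets = range(-max_offset, max_offset + 1)
    background = sum(r(i) for i in offsets) / len(offsets)
    return r(0) - background


# Tiny example with a +/-1 bin window so the arithmetic is easy to follow:
# R_0 = 3, R_1 = R_-1 = 0, background = 1.
toy_score = xcorr([0, 1, 0, 0, 2, 0], [0, 1, 0, 0, 1, 0], max_offset=1)
```

Subtracting the background term penalizes spectra that match the theoretical peaks equally well at any shift, which is what a dense, noisy spectrum would do.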

[Diagram labels: Spectrum; Peptide database; Theoretical spectrum generator; Spectrum comparison function]

The X!Tandem hyperscore
[Formula annotations:]
Number of matched b-ions
Number of matched y-ions
Boolean: Is the peak at m/z value i a b- or y-ion?
Intensity of the peak at m/z value i
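The four quantities annotated above combine in the commonly cited form of the hyperscore: the sum of intensities of peaks that match a b- or y-ion, multiplied by the factorials of the matched b-ion and y-ion counts. The formula itself did not survive in this transcript, so the sketch below is written from that standard definition, and the function and argument names are assumptions.

```python
from math import factorial


def hyperscore(intensities, is_by_ion, n_b, n_y):
    """Commonly cited hyperscore: (sum of I_i * P_i) * N_b! * N_y!.

    intensities[i] is the peak intensity at m/z value i, is_by_ion[i]
    says whether that peak matches a b- or y-ion, and n_b / n_y are the
    numbers of matched b- and y-ions.
    """
    matched_intensity = sum(i for i, p in zip(intensities, is_by_ion) if p)
    return matched_intensity * factorial(n_b) * factorial(n_y)


# Five peaks, four of which match (two b-ions and two y-ions).
toy_hyperscore = hyperscore([10, 5, 20, 1, 4],
                            [True, False, True, True, True],
                            n_b=2, n_y=2)
```

The factorial terms grow quickly, so implementations typically work with the logarithm of this quantity.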

A third approach to identifying spectra
1. De novo identification
2. Search against a database of theoretical spectra
3. Search against a database of previously observed, identified spectra

Spectrum Identification
[Diagram labels: MS/MS query spectra; Database: fasta file; SEQUEST; Peptide ID list; ID proteins from peptides…; BiblioSpec; Library of identified spectra]

Database: fasta file
>MEKK1 (kinase)
MDRILARMKKSTRRGGDKNIT
PVRRLERR…
>ATMKK5 (kinase kinase)
MKPIQSPSGVASPMKNRLRK
RPDLSPPLPHRDVALAVLP…

Peptide ID list
Scan1 0.7 EGSSDEEVP…
Scan1 0.3 TFAEILNPI…
Scan1 0.2 ARFDLNNHD…
Scan2 0.5 EDEESIRAV…
Scan2 0.2 WLGDDCFMV…
Scan2 0.1 IDRAAWKAV…
Scan3 0.2 EITTRDMGN…
Scan3 0.1 GRNMCTAKL…

Library of identified spectra
3 NGISLTIVR
3 QWDKEPPR
2 FMACSDEK
1 CGCCLYNT
2 GDTIENFK

Spectrum Identification (continued)
[Diagram repeated from the previous slide, with a query spectrum compared against a library spectrum along the m/z axis: score = 0.2]

What are the pros and cons of library searching?
+ Because the spectrum library contains peak intensity information, matching can be done accurately.
+ Library searching is faster than database searching.
- Library searching can only identify peptides that have been previously identified.

Landscape of Peptide Identification Software
Database search tools: Greylag, MASCOT, MS-Tag (ProteinProspector), Massmatrix, MyriMatch, OMSSA, Olav, Pepfrag (Prowl), Pepprobe, Pepsplice, Pfind, Phenyx, ProLuCID (YADA), ProbID, RAId_DbS, SEQUEST, SpectrumMill, VEMS, X!Tandem
Spectral matching tools: Bibliospec, SpectraST, X! P3
De novo sequencing tools: Lutefisk, PEAKS, PepNovo, Sequit
Sequence tag/hybrid approaches: ByOnic/LookupPeaks, GutenTag, Inspect, Paragon, Popitam
Others: Proteinlynx, SonarMS/MS (knexus), Xproteo

Score calibration

Searching many spectra yields a set of peptide-spectrum matches (PSMs)
[Diagram: spectra linked to peptides, each link being a PSM]

Identifying spectra requires solving two distinct problems well
Task 1: Ranking candidate peptides with respect to a single spectrum
Task 2: Ranking PSMs with respect to one another

Different spectra yield different score distributions

Different charge states yield different score distributions

Estimating a p-value can improve calibration
The probability of observing a score >4 is the area under the curve to the right of 4.

XCorr scores fit a Weibull distribution (Klammer J Proteome Research 2009)
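Once a Weibull distribution has been fitted to the scores of one spectrum, the p-value of a score is the Weibull survival function evaluated at that score. The sketch below assumes the fit has already been done and uses a shifted (three-parameter) form; the parameter names `shape`, `scale` and `shift`, and the example values, are hypothetical and not taken from the cited paper.

```python
import math


def weibull_p_value(score, shape, scale, shift=0.0):
    """Area under a Weibull density to the right of `score`.

    The parameters are assumed to come from a separate fit to the null
    scores of the same spectrum (fitting is not shown here).
    """
    if score <= shift:
        return 1.0  # every score in the support is at least this large
    return math.exp(-(((score - shift) / scale) ** shape))


# With shape = scale = 1 the Weibull reduces to an exponential distribution.
p_at_4 = weibull_p_value(4.0, shape=1.0, scale=1.0)
```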

X!Tandem and Comet use a log-linear fit
[Plot: log(count) versus XCorr, with the annotation "XCorr of 5 corresponds to count of" (value missing in the source)] (Eng, J Proteome Research 2009)
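A log-linear fit of this kind can be sketched as an ordinary least-squares line through log10 of the number of candidates scoring at or above each score, followed by extrapolation to the top score. This is an illustrative sketch only: the function names are assumptions, real tools restrict the fit to part of the histogram, and the toy scores are made up so that the counts fall exactly on a line.

```python
import math


def log_linear_fit(scores):
    """Fit log10(#scores >= s) = a + b * s by ordinary least squares.

    Needs at least two distinct score values.
    """
    xs = sorted(set(scores))
    ys = [math.log10(sum(1 for v in scores if v >= s)) for s in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def expected_count(score, a, b):
    """Extrapolated number of candidates scoring at least `score`."""
    return 10 ** (a + b * score)


# Toy candidate scores: 100 score >= 0, 10 score >= 1, 1 scores >= 2.
toy_scores = [0] * 90 + [1] * 9 + [2]
a, b = log_linear_fit(toy_scores)
count_at_5 = expected_count(5, a, b)
```

The extrapolated count plays the role of an expectation value: a top-scoring peptide far to the right of the fitted line is unlikely to be a random match.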

Calibration improves statistical power to identify spectra
[Plot: identified spectra versus FDR threshold, for calibrated and uncalibrated scores]

Statistical confidence estimation (Elias & Gygi, Nat Biotech 2007)

Our second goal is to identify a set of spectra at a given false discovery rate

Spectrum identification must account for two types of multiple testing
[Diagram labels: Minimum p-value; False discovery rate]
Example peptides: MSEDEIER, VDPSSWFNN, CSSSTEAEQR, CIVGLTK, QFIDFSTVFQP, ISLSGK, ALNDVGK

Decoy peptides can be used to estimate FDR.
[Diagram labels: Search; Decoy; Target]

Peptide        Score
MSEDEIER       2.2
ISLSGK         1.6
CSSSTEAEQR     1.9
CIVGLTK        2.8
ALNDVGK        2.7
VDPSSWFNN      1.2
QFIDFSTVFQP    1.7

FDR = 1/4 = 25% (Elias & Gygi, Nat Biotech 2007)
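The decoy-based estimate is a ratio of counts above a score threshold. The sketch below reproduces the slide's 1/4 under two assumptions that the transcript does not state outright: that the last two peptides (VDPSSWFNN and QFIDFSTVFQP) are the decoys, and that the threshold lies between 1.6 and 1.7 (1.65 is used here). The function name `decoy_fdr` is illustrative.

```python
def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR as (#decoys >= threshold) / (#targets >= threshold)."""
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0  # nothing accepted, so nothing can be a false discovery
    return n_decoy / n_target


# Scores from the slide; the target/decoy split is an assumption.
targets = [2.2, 1.6, 1.9, 2.8, 2.7]
decoys = [1.2, 1.7]
fdr = decoy_fdr(targets, decoys, threshold=1.65)
```

At this threshold four targets and one decoy are accepted, giving the 25% shown on the slide.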

XCorr can be calculated as a dot product between two spectra
Observed spectrum pre-processing:
  stage 1: square root
  stage 2: normalize regions
  stage 3: cross-correlation
Theoretical spectrum (VNIQEELGK), for each peptide bond: b ion, y ion, neutral losses
Dot product of the two gives the XCorr score (Eng, J Proteome Research 2008)

XCorr can be refactored to make the theoretical spectrum binary
Observed spectrum (stage 3): the fingerprint of b / y / neutral losses centered at m_i = 347 is summed into one entry of a vector of cleavage evidence (sum of evidence for cleavage at m_i = 347)
Theoretical spectrum (VNIQEELGK): binary markers of backbone cleavage
Dot product of the two gives the refactored XCorr score (Howbert, Molecular & Cellular Proteomics 2014)
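With a binary theoretical spectrum, the dot product reduces to summing the evidence vector at the peptide's backbone cleavage positions. A minimal sketch, assuming integer-binned masses and an evidence vector that has already been computed from the observed spectrum; the function name and toy values are hypothetical.

```python
from itertools import accumulate


def refactored_xcorr(evidence, residue_masses):
    """Sum the cleavage evidence at each proper prefix mass of one peptide.

    evidence[m] is the pre-computed evidence for a cleavage at integer
    mass bin m; residue_masses are the peptide's residue masses as
    integer bins, in sequence order.
    """
    # Prefix masses mark the backbone cleavages; the last entry is the
    # whole peptide, which is not a cleavage, so it is dropped.
    cleavage_bins = list(accumulate(residue_masses))[:-1]
    return sum(evidence[m] for m in cleavage_bins if m < len(evidence))


# Toy peptide with residue bins 1, 2, 2: cleavages fall at bins 1 and 3.
toy_refactored = refactored_xcorr([0, 1, 0, 2, 0, 5], [1, 2, 2])
```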

The refactored XCorr is very similar to the original XCorr (ρ = 0.995)

Distribution of XCorr scores can be computed using dynamic programming

Each column holds a score distribution for all peptides with a given mass
$O(m \times (s_{\max} - s_{\min}) \times |AA|)$, ~ 1 sec for $m = 1500$
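The column-by-column computation can be sketched with a small dynamic program. This is a toy version under stated assumptions: masses and evidence values are integers, every residue sequence is counted equally, and evidence is added at every prefix mass including the final one (set the evidence at the precursor mass to 0 if the full peptide should not count as a cleavage). The function names are illustrative.

```python
from collections import Counter


def score_distributions(evidence, residue_masses):
    """dist[m][s] = number of residue sequences of total mass m with score s.

    evidence[m] is the integer cleavage evidence at integer mass m, and
    residue_masses are positive integer residue masses.
    """
    max_mass = len(evidence) - 1
    dist = [Counter() for _ in range(max_mass + 1)]
    dist[0][0] = 1  # the empty sequence has mass 0 and score 0
    for m in range(1, max_mass + 1):          # one column per mass
        for a in residue_masses:              # extend by one residue
            if a <= m:
                for s, n in dist[m - a].items():
                    dist[m][s + evidence[m]] += n
    return dist


def exact_p_value(column, observed_score):
    """Fraction of sequences in one column scoring at least observed_score."""
    total = sum(column.values())
    if total == 0:
        return 1.0  # no sequence has this mass
    return sum(n for s, n in column.items() if s >= observed_score) / total


# Toy alphabet with residue masses 1 and 2, evidence for masses 0..3.
dist = score_distributions([0, 1, 0, 2], [1, 2])
p_toy = exact_p_value(dist[3], 3)
```

The three nested loops correspond to the three factors in the stated running time: masses, distinct scores, and amino acids.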

Calibration removes charge state dependency of scores

XCorr p-values must be corrected for multiple testing
- Sidak correction is similar to Bonferroni
- Accounts for the fact that database search considers many peptides for each spectrum
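The two corrections can be compared directly. The Sidak-corrected p-value for the best of n independent tests is 1 − (1 − p)^n, while Bonferroni uses min(1, n·p). In the sketch below, n stands for the number of candidate peptides scored against a spectrum; the function names and example numbers are illustrative.

```python
def sidak(p, n):
    """Probability that at least one of n independent tests reaches p."""
    return 1.0 - (1.0 - p) ** n


def bonferroni(p, n):
    """Bonferroni upper bound on the same probability."""
    return min(1.0, n * p)


# A p-value of 0.01 for the top peptide among 10 candidates.
sidak_adjusted = sidak(0.01, 10)
bonferroni_adjusted = bonferroni(0.01, 10)
```

For small p the two values are nearly equal, with Sidak always slightly smaller than or equal to Bonferroni.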

The resulting p-values are distributed uniformly
[Plots: frequency versus p-value; log(p-value) versus log(rank p-value)]

Exact calibration improves statistical power to identify spectra
[Plot: yeast-01 data set, MS1 / MS2 low resolution]