De novo interpretation of peptide mass spectra

Presentation transcript:

De novo interpretation of peptide mass spectra Vineet Bafna, UCSD (joint work with Nathan Edwards, ABI and Noah Zaitlen, UCSD)

Nobel Citation 2002

Talk Outline: Tandem MS for peptide identification; earlier work; description of the algorithm; results and applications.

Tandem MS. Tandem mass spectrometry selects one of the intense peaks observed in the single-stage mass spectrum and further fragments all peptides with the selected mass-to-charge ratio. The tandem mass spectrum typically contains mass-to-charge ratio information about fragments of a single peptide. [Figure: ionized parent peptide undergoing secondary fragmentation.]

The peptide backbone. Tandem MS can be used to determine the amino-acid sequence of a peptide because proteins are made up of amino-acid chains. During secondary fragmentation, the peptide backbone breaks, forming fragments with characteristic masses. [Diagram: backbone H...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-...OH, N-terminus on the left and C-terminus on the right, showing amino-acid residues i-1, i and i+1.]

Ionization. The parent peptide, when ionized, has at least one additional proton attached. [Diagram: the same backbone with an H+ attached, labeled "ionized parent peptide".]

Fragment ion generation. When the peptide backbone breaks, the ionizing protons are retained on some of the fragments, which can then have their mass-to-charge ratio measured. Shown here is a suffix fragment, where the ionizing proton is retained on the C-terminus side of the backbone cleavage site. Also possible is a prefix fragment, where the ionizing proton is retained on the N-terminus side. [Diagram: backbone cleaved between CO and NH, with H+ on the C-terminal piece, labeled "ionized peptide fragment".]

Tandem MS for Peptide ID. An assignment of the peptide fragments to the spectral peaks. You can see there is an excellent correspondence between the fragments that are expected and the observed spectral peaks. Note too the presence of the doubly charged, unfragmented parent ion. [Figure: tandem mass spectrum of the peptide S G F L E E D E L K; x-axis m/z (250 to 1000), y-axis % intensity; b ions 88, 145, 292, 405, 534, 663, 778, 907, 1020, 1166; y ions 1166, 1080, 1022, 875, 762, 633, 504, 389, 260, 147; the [M+2H]2+ peak is marked.]

Peak Assignment. Peak assignment implies sequence (residue tag) reconstruction! [Figure: the spectrum of S G F L E E D E L K with individual peaks labeled b3 to b9 and y2 to y9, plus [M+2H]2+; x-axis m/z, y-axis % intensity.]

Database Searching. For every peptide from a database: generate a hypothetical spectrum, then compute a correlation between the hypothetical and the observed (experimental) spectrum. Choose the best-scoring peptide. Database searching is very powerful and is the de facto standard for MS: Sequest, Mascot, and many others.
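The search loop on this slide can be sketched in a few lines. This is a minimal illustration, not how Sequest or Mascot score: it uses nominal integer residue masses, only b and y ions, and a shared-peak count in place of a correlation. The function names and the tolerance are assumptions made for the example.

```python
# Nominal (integer) residue masses; an approximation used for illustration only.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def theoretical_peaks(peptide):
    """Hypothetical spectrum: b and y ions for every internal cleavage site."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    peaks = set()
    prefix = 0
    for m in masses[:-1]:
        prefix += m
        peaks.add(prefix + 1)           # b ion = prefix mass + 1
        peaks.add(total - prefix + 19)  # y ion = suffix mass + 19
    return peaks

def shared_peak_count(observed, peptide, tol=0.5):
    """Stand-in score: number of observed peaks matched by a theoretical peak."""
    theo = theoretical_peaks(peptide)
    return sum(1 for p in observed if any(abs(p - t) <= tol for t in theo))

def best_match(observed, database):
    """Return the database peptide with the highest score."""
    return max(database, key=lambda pep: shared_peak_count(observed, pep))
```

For example, `best_match([88, 145, 147, 276], ["SGEK", "GSEK", "AAAA"])` compares the observed peaks against each candidate in turn and keeps the best.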

A case for de novo sequencing. With current technology, only about 30% of spectra yield a meaningful hit in a database. There are multiple reasons: incomplete databases (e.g., pathogens), modifications/mutations, and poor quality of fragmentation. Can we say anything about such ‘orphan’ spectra? In the absence of a complete interpretation, can we have confidence in peak assignments?

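De Novo Interpretation: Example
Peptide: S G E K
b-ions: 0, 88, 145, 274, 402
y-ions: 420, 333, 276, 147, 0
Ion offsets: b = P + 1, y = S + 19 = M - P + 19 (P: prefix residue mass, S: suffix residue mass, M: parent residue mass)
[Figure: spectrum with peaks labeled y1, b1, b2, y2; x-axis m/z from 100 to 500.]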
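The ion offsets above can be checked with a short computation. The sketch below assumes nominal integer residue masses (which is what the slide's numbers correspond to); the function name is made up for this example.

```python
# Nominal residue masses for the four residues in the example peptide.
RESIDUE_MASS = {"S": 87, "G": 57, "E": 129, "K": 128}

def ion_ladders(peptide):
    """Return (b_ions, y_ions) using b = P + 1 and y = S + 19 = M - P + 19."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)                      # M, the parent residue mass
    prefixes = []
    running = 0
    for m in masses:
        running += m
        prefixes.append(running)             # P after each residue
    b_ions = [p + 1 for p in prefixes]
    # Suffix starting at each position: M minus the prefix that precedes it.
    y_ions = [total - p + 19 for p in [0] + prefixes[:-1]]
    return b_ions, y_ions

print(ion_ladders("SGEK"))
```

The two lists reproduce the non-zero b-ion and y-ion values shown for S G E K.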

Putative Prefix Masses (Dancik et al., Recomb 98)
Parent mass M = 401
Peak   Prefix mass if b   Prefix mass if y
88     87                 332
145    144                275
147    146                273
276    275                144
True prefix masses for S G E K: 0, 87, 144, 273, 401
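Each row of the table follows from inverting the ion offsets b = P + 1 and y = M - P + 19. A minimal sketch, with a hypothetical function name:

```python
def putative_prefix_masses(peaks, parent_mass):
    """Map each peak to (prefix mass if it is a b ion, prefix mass if it is a y ion)."""
    table = {}
    for peak in peaks:
        as_b = peak - 1                     # b = P + 1      =>  P = peak - 1
        as_y = parent_mass - (peak - 19)    # y = M - P + 19 =>  P = M - (peak - 19)
        table[peak] = (as_b, as_y)
    return table

for peak, (as_b, as_y) in putative_prefix_masses([88, 145, 147, 276], 401).items():
    print(peak, as_b, as_y)
```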

Earlier Work: Spectral Graph (Dancik et al., Recomb 98). Each peak, when assigned to a prefix/suffix ion type, generates a unique prefix residue mass. Spectral graph: each node u defines a putative prefix residue mass M(u), and (u,v) is in E if M(v)-M(u) is the residue mass of an amino acid (tag) or 0. Paths in the spectral graph correspond to an interpretation. [Figure: spectral graph for S G E K with nodes at 87, 144, 146, 273, 275, 332 and 401.]
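A small sketch of how such a graph could be built for the running example. It assumes only b and y interpretations, nominal integer masses, and exact mass differences; the node naming scheme and the rule that zero-mass edges point from earlier to later nodes (to keep the graph acyclic) are choices made for this illustration, not part of the cited work.

```python
# Nominal residue masses mapped back to residue labels (L/I and K/Q are isobaric here).
MASS_TO_RESIDUE = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "L/I", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}

def spectral_graph(peaks, parent_mass):
    """Return (nodes, edges); nodes are (name, mass), edges are (from, to, label)."""
    nodes = [("start", 0)]
    for p in peaks:
        nodes.append((f"{p}:b", p - 1))                    # peak read as a b ion
        nodes.append((f"{p}:y", parent_mass - (p - 19)))   # peak read as a y ion
    nodes.append(("end", parent_mass))

    edges = []
    for i, (name_u, mass_u) in enumerate(nodes):
        for j, (name_v, mass_v) in enumerate(nodes):
            diff = mass_v - mass_u
            if diff in MASS_TO_RESIDUE:
                edges.append((name_u, name_v, MASS_TO_RESIDUE[diff]))
            elif diff == 0 and i < j:
                edges.append((name_u, name_v, ""))         # two peaks, same prefix mass
    return nodes, edges

nodes, edges = spectral_graph([88, 145, 147, 276], 401)
for edge in edges:
    print(edge)
```

Note that the graph also contains an edge between the two nodes of peak 145 (144 to 275, a mass difference of 131); using both would assign one peak two interpretations, which is the forbidden-pair issue the talk turns to next.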

Re-defining de novo interpretation. Find a subset of nodes in the spectral graph such that: (1) 0 and M are included; (2) each peak contributes at most one node (interpretation) (*); (3) each adjacent pair (when sorted by mass) is connected by an edge (valid residue mass); (4) an appropriate objective function (e.g., the number of peaks interpreted) is maximized. (*) In general, finding paths using forbidden pairs is NP-hard.
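For a tiny spectrum the problem as stated can simply be solved by exhaustive search, which makes the four conditions concrete. The sketch below is a brute-force illustration (exponential in the number of peaks), not the dynamic program described later in the talk; it assumes b/y interpretations only, nominal masses, and "number of peaks interpreted" as the objective.

```python
MASS_TO_RESIDUE = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "L/I", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}

def best_interpretation(peaks, parent_mass):
    """Return (peaks interpreted, residue labels) for the best path from 0 to M."""
    # Every peak offers two candidate nodes: its prefix mass as b and as y.
    nodes = [(m, p) for p in peaks for m in (p - 1, parent_mass - (p - 19))]
    best = {"score": -1, "residues": None}

    def extend(mass, used, residues):
        gap = parent_mass - mass
        # Condition (1): the path must close at M through a valid edge.
        if gap in MASS_TO_RESIDUE and len(used) > best["score"]:
            best["score"] = len(used)
            best["residues"] = residues + [MASS_TO_RESIDUE[gap]]
        for m, p in nodes:
            step = m - mass
            # Condition (2): one node per peak. Condition (3): valid edge (or 0).
            if p in used or not (step == 0 or step in MASS_TO_RESIDUE):
                continue
            label = [MASS_TO_RESIDUE[step]] if step else []
            extend(m, used | {p}, residues + label)

    extend(0, frozenset(), [])
    return best["score"], best["residues"]

print(best_interpretation([88, 145, 147, 276], 401))
```

On the S G E K example all four peaks are interpreted; K and Q cannot be told apart at nominal mass, so the last residue is reported as "K/Q".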

Earlier Work: Non-Intersecting Forbidden Pairs (Chen et al., SODA 2000). If we consider only b and y ions, the ‘forbidden’ node pairs are non-intersecting, and the de novo problem can be solved efficiently using a dynamic programming technique. [Figure: spectral graph for S G E K with the forbidden pairs drawn as nested arcs.]

Multiple Ion Types: peptide fragmentation possibilities. In reality, many more of the bonds between two amino acids may break. [Diagram: backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH- with cleavage sites labeled a_i, b_i, c_i, d_{i+1}, b_{i+1} on the N-terminal side and x_{n-i}, y_{n-i}, z_{n-i}, v_{n-i}, w_{n-i}, y_{n-i-1} on the C-terminal side, grouped into low-energy and high-energy fragments.]

Spectra have multiple ion types. b-ions and y-ions often constitute a minority of the interpretable ions. a-ions and various neutral losses are very common. High-energy spectra display a larger fraction of ions. Multiple ion types imply overlapping residue assignments, so the method of Chen et al. does not apply.

BE’2003. (1) An efficient algorithm that handles all prefix/suffix fragmentations and their neutral losses using a different d.p. formulation. (2) An algorithm to obtain the core interpretation of a spectrum, i.e., an assignment of confidence values to peak assignments.


Simple Ion Lists. [Figure: the putative prefix residue masses of peak i on either side of M/2, with the span marked.]

Simple Ion Lists. Partition the putative residue masses for each peak around M/2; l_i (r_i) is the smallest (largest) mass for peak i. The span S of an ion list is the maximum difference between the putative residue assignments on either the LHS or the RHS. Define an ion list as simple if span <= minimum residue mass. For a simple ion list, l_i < l_j implies r_i > r_j. Most natural ion lists are simple, e.g., a, b, y, b-NH3, b-H2O, y-NH3, y-H2O. [Figure: peaks i and j with l_i, r_i and the span S marked.]
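The claim that the listed ion types form a simple ion list can be checked numerically. The sketch below rests on several assumptions: nominal mass offsets (b = P + 1, a = b - 28, NH3 = 17, H2O = 18, y = S + 19), glycine (57) as the minimum residue mass, and the reading that the prefix-type interpretations of a peak fall on one side of M/2 and the suffix-type interpretations on the other.

```python
# Offset of each ion type from the prefix (or suffix) residue mass, nominal values.
PREFIX_OFFSETS = {"a": -27, "b": 1, "b-NH3": -16, "b-H2O": -17}
SUFFIX_OFFSETS = {"y": 19, "y-NH3": 2, "y-H2O": 1}
MIN_RESIDUE_MASS = 57  # glycine

def span(offsets):
    """Largest spread between the residue masses implied by one peak on one side."""
    return max(offsets.values()) - min(offsets.values())

def is_simple(prefix_offsets, suffix_offsets, min_residue=MIN_RESIDUE_MASS):
    """An ion list is simple if its span does not exceed the smallest residue mass."""
    return max(span(prefix_offsets), span(suffix_offsets)) <= min_residue

print(span(PREFIX_OFFSETS), span(SUFFIX_OFFSETS))
print(is_simple(PREFIX_OFFSETS, SUFFIX_OFFSETS))
```

Adding an ion type whose offset lies far from the others (a hypothetical prefix ion at -60, say) pushes the span past 57 and the list stops being simple.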

Simple Ion Lists and Spectral Peak Ordering. Order the peaks by increasing right-most putative residue mass. Lemma: if the left-side residue mass r_i is chosen from peak i, and r_j is chosen from peak j > i, then r_i >= r_j.

Ordering spectral peaks. [Figure: peaks i < j, with the nodes of i lying inside those of j; span <= minimum amino-acid residue mass.]

Forward algorithm. Goal: given only peaks 1..i, find the “best” path from node v (M[v] <= M/2) to node w (M[w] > M/2). Denote its score by S(i,v,w). There are many notions of “best”.

Forward algorithm. Since i is the outermost peak, a putative residue mass (PRM) from it is either connected to v, or connected to w, or never used. One option is that none of the PRMs from peak i are used; in that case S[i,v,w] = S[i-1,v,w].

Forward algorithm. Otherwise, we choose a node u from one of the interpretations of peak i. Node u must have an edge from v or to w. If the edge is from v, then S(i,v,w) is S(i-1,u,w) plus the score of assigning peak i to u.

The Forward Algorithm. Number the peaks in order of increasing largest prefix residue mass. Define S[i,v,w] as the score of the best interpretation from M[v] to M[w] using peaks 1..i. Then
\[
S[i,v,w] = \max
\begin{cases}
S[i-1,v,w] + f(i) \\
S[i-1,u,w] + g(i,M[u]) & \text{if valid} \\
S[i-1,v,u] + g(i,M[u]) & \text{if valid}
\end{cases}
\]
where u is a node of peak i.
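A compact sketch of this recurrence, written as memoized recursion. Several things here are assumptions for illustration rather than details from the talk: f(i) and g(i, M[u]) are replaced by the constants `unused_score` and `assign_score`, "valid" is taken to mean that u lies between v and w and the mass difference is a residue mass or 0, nodes are identified by their integer masses, and the base case requires a direct edge from v to w. Every score it returns corresponds to a valid path; finding the optimum relies on the peak ordering argument for simple ion lists.

```python
from functools import lru_cache

RESIDUE_MASSES = {57, 71, 87, 97, 99, 101, 103, 113, 114, 115,
                  128, 129, 131, 137, 147, 156, 163, 186}

def forward_score(peak_prms, parent_mass, unused_score=0.0, assign_score=1.0):
    """peak_prms: one tuple of putative prefix residue masses per peak."""
    # Number the peaks by increasing largest putative prefix residue mass.
    peaks = sorted(peak_prms, key=max)

    def edge(a, b):
        return b - a == 0 or (b - a) in RESIDUE_MASSES

    @lru_cache(maxsize=None)
    def S(i, v, w):
        if i == 0:
            # No peaks left: v and w must be joined directly.
            return 0.0 if edge(v, w) else float("-inf")
        best = S(i - 1, v, w) + unused_score            # peak i not used
        for u in peaks[i - 1]:
            if not (v <= u <= w):
                continue
            if edge(v, u):                               # u hangs off the left end
                best = max(best, S(i - 1, u, w) + assign_score)
            if edge(u, w):                               # u hangs off the right end
                best = max(best, S(i - 1, v, u) + assign_score)
        return best

    return S(len(peaks), 0, parent_mass)

# Putative prefix masses (as b, as y) for peaks 88, 145, 147, 276 with M = 401.
print(forward_score([(87, 332), (144, 275), (146, 273), (275, 144)], 401))
```

With the default scores the result is simply the largest number of peaks that can be placed on one path from 0 to M.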

Forward-Backward Paths. S[i,v,w] is the score of the best (M[v],M[w]) inner interpretation using peaks 1..i. Let T[i,v,w] be the score of the best (M[v],M[w]) outer interpretation using peaks i..m: the highest score of a path from 0 to v together with a path from w to M, using peaks i, i+1, …, m.

Forward-Backward Scoring. S[i,v,w] is the score of the best (M[v],M[w]) inner interpretation using peaks 1..i, and T[i,v,w] is the score of the best (M[v],M[w]) outer interpretation using peaks i..m. Then
\[
T[i,v,w] = \max
\begin{cases}
T[i+1,v,w] + f(i) \\
T[i+1,u,w] + g(i,M[u]) & \text{if valid} \\
T[i+1,v,u] + g(i,M[u]) & \text{if valid}
\end{cases}
\]
where u is a node of peak i.

Core Interpretations. Suppose we can answer the following question: what is the best-scoring interpretation if we fix the interpretation of peak i to something (e.g., b or y-H2O)? This is equivalent to assigning peak i to one of its nodes. Do we care about this?

Core interpretations. If the global score after assigning peak i to a node u is much higher than the score due to any other interpretation of that peak, then the interpretation of peak i is likely to be correct even if the global interpretation is not. This allows us to interpret peaks in spectra with incomplete fragmentation.

Core Interpretations. Define H[i,u] as the score of the highest-scoring interpretation in which peak i is assigned to node u. If H[i,u] = S[m,0,n] for some u, and H[i,v] << S[m,0,n] for every other node v of peak i, then M[u] is probably the correct interpretation for peak i.
\[
H[i,u] = g(i,M[u]) + \max
\begin{cases}
\max_{v<u} \left( S[i-1,v,u] + T[i+1,v,u] \right) \\
\max_{w>u} \left( S[i-1,u,w] + T[i+1,u,w] \right)
\end{cases}
\]
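Once the H values are available, picking out core assignments is a simple comparison. The sketch below assumes H has already been computed and is stored as a nested dictionary; the data layout, the function name and the numeric `margin` standing in for "<<" are all assumptions made for illustration, and the numbers in the usage example are invented.

```python
def core_assignments(H, global_best, margin):
    """Return {peak: node} for peaks whose best node reaches the global optimum
    and beats every alternative node of that peak by at least `margin`."""
    core = {}
    for peak, scores in H.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        top_node, top_score = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
        if top_score >= global_best and top_score - runner_up >= margin:
            core[peak] = top_node
    return core

# Invented scores: peak 1 is clear-cut, peak 2 is ambiguous, peak 3 is not on the optimum.
H = {1: {"b": 10.0, "y": 4.0}, 2: {"b": 10.0, "y": 9.5}, 3: {"b": 7.0, "y": 2.0}}
print(core_assignments(H, global_best=10.0, margin=3.0))
```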

More theory Reduce dimensionality of the recurrence. Generate sub-optimal paths

Some results How good is de novo?

Simulation test data set. Given a peptide, generate an artificial tandem mass spectrum with differing intensities (as in SEQUEST): b and y ions with intensity 1.0, a ions with intensity 0.5, and appropriate neutral losses with intensity 0.2. Parameters g and e are chosen (fragmentation probability and error, respectively). Each fragment is generated with probability min{g·i, 1}, where i is its intensity. An error offset is chosen uniformly at random from [-e, e].
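The generation procedure can be sketched directly from this description. The particular set of neutral losses (water and ammonia from b and y), the nominal mass offsets, and the function and parameter names are assumptions for this example; the talk only says "appropriate neutral losses".

```python
import random

RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

# (name, built on the prefix?, mass offset, intensity)
ION_TYPES = [
    ("b", True, 1, 1.0), ("y", False, 19, 1.0), ("a", True, -27, 0.5),
    ("b-H2O", True, -17, 0.2), ("b-NH3", True, -16, 0.2),
    ("y-H2O", False, 1, 0.2), ("y-NH3", False, 2, 0.2),
]

def simulate_spectrum(peptide, g, e, seed=None):
    """Return a sorted list of (m/z, intensity) pairs for an artificial spectrum."""
    rng = random.Random(seed)
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    peaks = []
    prefix = 0
    for m in masses[:-1]:                     # every internal cleavage site
        prefix += m
        for _name, from_prefix, offset, intensity in ION_TYPES:
            # Each fragment appears with probability min(g * intensity, 1).
            if rng.random() < min(g * intensity, 1.0):
                base = prefix if from_prefix else total - prefix
                error = rng.uniform(-e, e)    # uniform error offset in [-e, e]
                peaks.append((base + offset + error, intensity))
    return sorted(peaks)

print(simulate_spectrum("SGEK", g=0.8, e=0.5, seed=0))
```

With g = 1 and e = 0 every b and y ion is produced at its exact nominal mass, which is a convenient sanity check.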

Results: % positions predicted. [Plot; x-axis: ions/position.]

Results: peptide ID. [Plot; x-axis: ions/position.]

Results: % TIC explained. [Plot; x-axis: ions/position.]

Results: core interpretation. [Plot; x-axis: % score difference.]

Performance on real data sets. Data sets: Zufar et al. (~150 spectra) and ISB (~500 spectra). The best spectral interpretation was chosen for each spectrum. For tags of different lengths, the number of spectra with a correctly predicted sequence tag was reported.

De novo Performance on real data sets

Parameter optimization. Two groups of parameters are tuned: peak selection parameters and scoring parameters (for different interpretations). A simulated annealing step is used to optimize the parameters on a learning data set.

De novo performance (optimization)

De novo analysis. About 50% of spectra have a correct 5-mer prediction. This is worthwhile if databases are incomplete, and very useful as a filter: Mann & Wilm, MS Blast and GutenTag all use generated tags as filters.

De novo interpretation as a filter. Only tags from high-scoring paths are used; fewer tags imply fewer candidate peptides. Experiment: all substrings of lengths 3-6 were chosen as tags, and the Uniprot database (143K proteins, 100 Mb) was searched with these tag filters. A peptide is a candidate if the tag and the flanking masses M1 and M2 are consistent with the interpretation. [Figure: a tag located within the sequence IVLSDFYLDEERVADCVLL, with flanking masses M1 and M2 on either side.]
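The candidate test can be sketched as follows: locate the tag in a protein, then extend left until the residues sum to M1 and right until they sum to M2. The code assumes nominal residue masses, unmodified residues and a fixed tolerance; the function names are made up, and the tag FYL with M1 = 202 and M2 = 529 is an illustrative choice on the slide's sequence, not data from the talk.

```python
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def residues_for_mass(seq, target, tol):
    """Number of leading residues of seq whose masses sum to target, else None."""
    if abs(target) <= tol:
        return 0
    total = 0
    for count, aa in enumerate(seq, 1):
        total += RESIDUE_MASS[aa]
        if abs(total - target) <= tol:
            return count
        if total > target + tol:
            return None
    return None

def candidate_peptides(protein, tag, m1, m2, tol=0.5):
    """Peptides in protein that contain tag with flanking masses m1 (left), m2 (right)."""
    hits = []
    start = protein.find(tag)
    while start != -1:
        end = start + len(tag)
        left = residues_for_mass(protein[:start][::-1], m1, tol)   # walk leftwards
        right = residues_for_mass(protein[end:], m2, tol)          # walk rightwards
        if left is not None and right is not None:
            hits.append(protein[start - left:end + right])
        start = protein.find(tag, start + 1)
    return hits

print(candidate_peptides("IVLSDFYLDEERVADCVLL", "FYL", 202, 529))
```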

Tags as effective filters

Tags as filters. Pros: they eliminate a large portion of the database from consideration; they allow efficient search using keyword trees (tries), a dictionary search that is independent of the size of the tag space; and they allow searching with post-translational modifications. Cons: they filter out some true hits, and they depend on the interpretation, so suboptimal paths must be considered as well. General filters based on interpretations can be built.
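A keyword tree makes the dictionary search concrete: all tags are stored in one trie, and the protein is scanned once per start position, following trie edges for at most the length of the longest tag. This is a plain trie scan rather than a full Aho-Corasick automaton, and the function names and the "$" end marker are choices made for this sketch.

```python
def build_trie(tags):
    """Store every tag in a nested-dict trie; '$' marks the end of a tag."""
    root = {}
    for tag in tags:
        node = root
        for ch in tag:
            node = node.setdefault(ch, {})
        node["$"] = tag
    return root

def find_tags(protein, trie):
    """Return (position, tag) for every occurrence of any tag in the protein."""
    hits = []
    for start in range(len(protein)):
        node = trie
        for ch in protein[start:]:
            if ch not in node:
                break                    # no tag continues with this residue
            node = node[ch]
            if "$" in node:
                hits.append((start, node["$"]))
    return hits

trie = build_trie(["FYL", "DEE", "VLL", "KKK"])
print(find_tags("IVLSDFYLDEERVADCVLL", trie))
```

The work per start position is bounded by the longest tag, however many tags the trie holds.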

Mass spectra with PT modifications. Most database search software allows PT modifications. The search is based on generating modified candidate peptides, so a combinatorial expansion occurs. Searching with PT modifications is computationally intensive.

Efficient handling of PT modifications. Consider the peptide SDFTYLDER: S, T and Y can each be phosphorylated (or not), giving 8 possibilities. Parent mass considerations reduce this, but there are still a large number of possibilities. As the number of PT modification types increases, this number can become very large. De novo interpretation complexity is unchanged in the presence of PT modifications.
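The count of 8 and the effect of the parent mass constraint can be reproduced by enumeration. The sketch assumes nominal residue masses and a nominal phosphorylation shift of +80; names are hypothetical.

```python
from itertools import product

RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
PHOSPHO_SHIFT = 80  # nominal mass added by one phosphorylation

def phospho_variants(peptide, sites="STY"):
    """Enumerate (modified positions, residue mass) for every on/off combination."""
    positions = [i for i, aa in enumerate(peptide) if aa in sites]
    base = sum(RESIDUE_MASS[aa] for aa in peptide)
    variants = []
    for choice in product((False, True), repeat=len(positions)):
        modified = tuple(p for p, on in zip(positions, choice) if on)
        variants.append((modified, base + PHOSPHO_SHIFT * len(modified)))
    return variants

variants = phospho_variants("SDFTYLDER")
print(len(variants))                                   # 2**3 combinations
# Keeping only variants that match an observed parent mass prunes the list.
print([mods for mods, mass in variants if mass == 1126 + 80])
```

Each additional modifiable site doubles the list, and each additional modification type multiplies it further, which is the combinatorial expansion the slide refers to.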

PT modifications

Conclusions. Computational methods for de novo analysis are continually improving, and things should get much better as the technology improves. De novo interpretation has great potential as a filter: searching large genomic databases, searching in the presence of PT modifications, and filtering without tags. De novo interpretation should be revived from the backwaters of MS analysis.

Acknowledgments Nathan Edwards, ABI UCSD: Nuno Bandeira, Ari Frank, Qian Peng, Pavel A. Pevzner, Noah Zaitlen