De novo interpretation of peptide mass spectra

De novo interpretation of peptide mass spectra
Vineet Bafna, UCSD (joint work with Nathan Edwards, ABI and Noah Zaitlen, UCSD)

Nobel Citation 2002

Talk Outline Tandem MS for Peptide Identification Earlier work
Description of algorithm Results and applications

Tandem MS Secondary Fragmentation Ionized parent peptide
Tandem mass spectrometry selects one of the intense peaks observed in the single stage mass spectrum and further fragments all peptides with the selected mass to charge ratio. The tandem mass spectrum typically contains mass to charge ratio information about fragments of a a single peptide. Secondary Fragmentation Ionized parent peptide

The peptide backbone The peptide backbone breaks to form
fragments with characteristic masses. H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH Ri-1 Ri Ri+1 N-terminus C-terminus Tandem MS can be used to determine the amino-acid sequence of a peptide because proteins are made up of amino-acid chains. During secondary fragmentation, the peptide backbone breaks forming fragments with characteristic masses. AA residuei-1 AA residuei AA residuei+1

Ionization The peptide backbone breaks to form
fragments with characteristic masses. H+ H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH Ri-1 Ri Ri+1 N-terminus C-terminus The parent peptide, when ionized, has at least one additional proton attached. AA residuei-1 AA residuei AA residuei+1 Ionized parent peptide

Fragment ion generation
The peptide backbone breaks to form fragments with characteristic masses. H+ H...-HN-CH-CO NH-CH-CO-NH-CH-CO-…OH Ri-1 Ri Ri+1 N-terminus C-terminus When the peptide backbone breaks, the ionizing protons are retained on some of the fragments, which can then have their mass to charge ratio measured. Shown here is a suffix fragment, where the ionizing proton is retained on the C-terminus side of the backbone cleavage site. Also possible is a prefix fragment, where the ionizing proton is retained on the N-terminus side. AA residuei-1 AA residuei AA residuei+1 Ionized peptide fragment

Tandem MS for Peptide ID
88 145 292 405 534 663 778 907 1020 1166 b ions S G F L E E D E L K 1166 1080 1022 875 762 633 504 389 260 147 y ions 100 % Intensity [M+2H]2+ An assignment of the peptide fragments to the spectral peaks. You can see there is an excellent correspondence between the fragments that are expected and the observed spectral peaks. Note too the presence of the doubly charged, unfragmented parent ion. 250 500 750 1000 m/z

Peak Assignment 88 145 292 405 534 663 778 907 1020 1166 b ions S G F
L E E D E L K 1166 1080 1022 875 762 633 504 389 260 147 y ions y6 100 Peak assignment implies Sequence (Residue tag) Reconstruction! y7 % Intensity [M+2H]2+ An assignment of the peptide fragments to the spectral peaks. You can see there is an excellent correspondence between the fragments that are expected and the observed spectral peaks. Note too the presence of the doubly charged, unfragmented parent ion. y5 b3 b4 y2 y3 y4 b5 y8 b6 b8 b9 b7 y9 250 500 750 1000 m/z

Database Searching For every peptide from a database
Generate a hypothetical spectrum Compute a correlation between observed and experimental spectra Choose the best Database searching is very powerful and is the de facto standard for MS. Sequest, Mascot, and many others

A case for de novo sequencing
With current technology, only about 30% of spectra yield a meaningful hit in a database Multiple Reasons: Incomplete databases (ex: pathogens) Modifications/Mutations Poor quality of fragmentation Can we say anything about such ‘Orphan’ Spectra? In the absence of a complete interpretation, can we have confidence in peak assignments?

De Novo Interpretation: Example
b-ions S G E K y-ions Ion Offsets b=P+1 y=S+19=M-P+19 y 2 100 500 400 300 200 y 1 b 1 b 2 M/Z

Putative Prefix Masses
Dancik et al., Recomb 98 Prefix Mass M=401 b y S G E K

Earlier Work: Spectral Graph
Dancik et al., Recomb 98 Each peak, when assigned to a prefix/suffix ion type generates a unique prefix residue mass. Spectral graph: Each node u defines a putative prefix residue M(u). (u,v) in E if M(v)-M(u) is the residue mass of an a.a. (tag) or 0. Paths in the spectral graph correspond to a interpretation 273 332 87 144 146 275 401 100 200 300 S G E K

Re-defining de novo interpretation
Find a subset of nodes in spectral graph s.t. 0, M are included Each peak contributes at most one node (interpretation)(*) Each adjacent pair (when sorted by mass) is connected by an edge (valid residue mass) An appropriate objective function (ex: the number of peaks interpreted) is maximized (*)In general, finding paths using forbidden pairs is NP-hard 100 400 200 S G E K

Earlier Work: Non-Intersecting Forbidden pairs
Chen et al. , SODA 2000 100 400 200 300 332 87 S G E K If we consider only b,y ions, ‘forbidden’ node pairs are non-intersecting, The de novo problem can be solved efficiently using a dynamic programming technique.

Peptide fragmentation possibilities
Multiple Ion Types Peptide fragmentation possibilities (ion types) xn-i yn-i vn-i yn-i-1 wn-i zn-i -HN-CH-CO-NH-CH-CO-NH- Ri CH-R’ In reality, many more of the bonds between two amino-acids may break. i+1 ai R” i+1 bi bi+1 ci di+1 low energy fragments high energy fragments

Spectra have Multiple Ion types
100 400 200 b-ions and y-ions often constitute a minority of interpretable ions. a-ions, and different neutral losses are very common. High energy spectra display a larger fraction of ions. Multiple ions imply overlapping residue assignments and Chen et al. does not apply.

BE’2003 An efficient algorithm that handles all prefix/suffix fragmentations and their neutral losses using a different d.p. formulation An algorithm to obtain the core interpretation of a spectrum, assignment of confidence values to peak assignments

Multiple Ion Types -HN-CH-CO-NH-CH-CO-NH- CH-R’ Ri R” xn-i yn-i
Residue vn-i yn-i-1 wn-i zn-i -HN-CH-CO-NH-CH-CO-NH- Ri CH-R’ In reality, many more of the bonds between two amino-acids may break. i+1 ai R” i+1 bi bi+1 ci di+1 low energy fragments high energy fragments

Simple Ion Lists Peak i Span M/2 Prefix residue masses

Simple Ion Lists Peak j Peak i l_i r_i 100 200 S 400 Partition the putative residue masses for each peak around M/2 l_i (r_i) is the smallest (largest) mass for peak i The span S of an ion-list is the maximum difference between the putative residue assignments on either LHS or RHS. Define an Ion List as simple if span <= minimum residue mass. l_i < l_j implies r_i > r_j Most natural ion-lists are simple. Ex: a,b,y,b-NH3,b-H20,y- NH3,y- H20

Simple Ion Lists and Spectral Peak ordering
j ri rj 400 200 Order the peaks by increasing right-most putative residue Lemma: If the left side residues ri is chosen from i, and rj is chosen from j >i , then ri >= rj

Ordering spectral peaks
i<j i j Span <= aa

Forward algorithm i-1 i M[w] M[v] Goal: We only have peaks 1..i, and want to find the “best” path from node v (M[v]<= M/2) to node u (M[v]> M/2). Denote score S(i,v,w) Best: Many notions of best.

Forward algorithm i-1 i M[w] M[v] M[u] Since i is the outermost peak, a PRM from it is either connected to v, or to w, or NOT used ever. One option is that none of the PRMs from peak i are used. In that case S[i,v,w] = S[i-1,v,w]

Forward algorithm i-1 i M[w] M[v] M[u] Otherwise, we choose a node u from one of the interpretations of i Node u must have an edge from v or w. If it is from v S(i,v,w) = S(i-1,u,w)

The Forward Algorithm i-1 i M[v] M[u] M[w] Number the peaks in the order of increasing largest prefix residue mass Define S[i,v,w] as the score of the best interpretation from M[v] to M[w] using peaks 1..i S[i,v,w] = max S[i-1,v,w] + f(i) S[i-1,u,w] + g(i,M[u]), if valid S[i-1,v,u] + g(i,M[u]), if valid

Forward Backward Paths
i i+1 M[v] M[w] M S[i][v][w] is the best (M[v],M[w]) inner interpretation using peaks 1..i. Let T[i][v][w] be the best (M[v],M[w]) outer-interpretation using peaks i..m. It is the highest score of a path from 0 to v, and from w to M using peaks i, i+1,…,m

Forward Backward Scoring
M[v] M[w] M S[i][v][w] is the best (M[v],M[w]) inner interpretation using peaks 1..i. Let T[i][v][w] be the best (M[v],M[w]) outer-interpretation using peaks i..m. T[i+1,v,w] + f(i) S[i+1,u,w] + g(i,M[u]), if valid S[i+1,v,u] + g(i,M[u]), if valid T[i,v,w] = max

Core Interpretations Do we care about this?
Suppose we can answer the following question: What is the best scoring interpretation if we fix the interpretation of peak i to something (EX: b, y-H2O)? This is equivalent to assigning peak i to one of its nodes Do we care about this?

Core interpretations If global score after assigning peak i to a node u is Much higher than a score due to any other interpretation Then, the interpretation of peak I is likely to be correct even if the global interpretation is not. This allows us to interpret peaks with incomplete fragmentation

Core Interpretations Define H[i,u] as the highest scoring interpretation in which peak i is assigned to g[u] If H[i,u] = S[m,0,n] for some u, and H[i,v] << S[m,0,n] then M[u] is probably the correct interpretation for I H[i,u] = g(i,M[u]) + Max_v (S[i-1,v,u] + T[i+1,v,u]) v<u Max_w (S[i-1,u,w] + T[i+1,u,w]) u<w max

More theory Reduce dimensionality of the recurrence.
Generate sub-optimal paths

Some results How good is de novo?

Simulation test data set
Given a peptide, generate artificial tandem MS with differing intensities (as in SEQUEST) b,y with intensity 1.0, a with intensity 0.5, appropriate neutral losses with intensity 0.2 Parameters g,e are chosen (fragmentation and error probability respectively). Each fragment is generated with probability min{ g i,1}, where i is the intensity. An error offset is chosen uniformly at random from [-e,e].

Results: %Positions predicted
Ions/Position

Results:Peptide Id Ions/Position

Results: % TIC explained
Ions/Position

Results: Core Interpretation
% Score Difference

Performance on real data sets
Zufar et al. (~150 spectra) ISB (~500 spectra) Performance Best spectral interpretation was chosen for each spectra. For tags of different lengths, the number of spectra with a correctly predicted sequence tag was reported.

De novo Performance on real data sets

Parameter optimization
Peak selection parameters Scoring parameters (for different interpretations) A simulated annealing step is used to optimize parameters on a learning data set.

De novo performance (optimization)

De novo analysis ~50% of spectra have a correct 5-mer prediction
Worthwhile if databases are incomplete. Very useful as filters. Mann & Wilm, MS Blast, GutenTag all use generated tags as filters

De novo interpretation as a filter
Only tags from high scoring paths are used. Fewer tags implies fewer candidate peptides Experiment: All substrings of lengths 3-6 were chosen as tags The Uniprot database (143K proteins, 100Mb) was searched with these tag filters. A peptide is a candidate if the tag and flanking masses are consistent with the interpretation M1 M2 IVLSDFYLDEERVADCVLL

Tags as effective filters

Tags as filters Pros: Cons
Eliminate a large portion of the database from consideration Allow for efficient search using keyword trees (tries). This dictionary search is independent of the size of the tag space Allow Searching with Post-translational modifications Cons Filter out some true hits. Dependent on interpretation. Suboptimal paths must be considered as well General filters based on interpretations can be built

Mass spectra with PT modification
Most database search software allow PT modifications Search based on generation of modified candidate peptides A combinatorial expansion occurs. Searching with PT modifications is computationally intensive.

Efficient handling of PT modifications
Consider the peptide SDFTYLDER S,T,Y can all be phosphorylated (or not) giving 8 possibilities Parent mass consideration reduces this, but there are still a large number of possibilities. With an increase in the type of PT modifications, this number can become very large De novo interpretation complexity is unchanged in the presence of PT modifications

PT modifications

Conclusions Computational methods for de novo analysis are continually improving. Things should get much better with technology improvement Great potential as filters Searching large genomic databases Searching in the presence of PT modifications Filtering without tags De novo interpretation should be revived from the backwaters of MS analysis.

Acknowledgments Nathan Edwards, ABI UCSD:
Nuno Bandeira, Ari Frank, Qian Peng, Pavel A. Pevzner, Noah Zaitlen

De novo interpretation of peptide mass spectra

Similar presentations

Presentation on theme: "De novo interpretation of peptide mass spectra"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

De novo interpretation of peptide mass spectra

Similar presentations

Presentation on theme: "De novo interpretation of peptide mass spectra"— Presentation transcript:

Similar presentations

About project

Feedback