(Journal of Computational Biology, 2001) (SODA, 2000)


A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry
Ting Chen, Ming-Yang Kao, Matthew Tepel, John Rush, and George M. Church
(Journal of Computational Biology, 2001) (SODA, 2000)
Speaker: Yao-Ting Huang, Hung-Lung Wang
2019/8/3

Introduction
De novo peptide sequencing identifies a peptide sequence from its mass spectrum without the help of a protein database, and is especially useful for identifying unknown proteins.
A mass spectrometer is an instrument that measures the mass-to-charge ratio of ionized chemical compounds, from which their molecular weight is determined.
Tandem mass spectrometry (MS/MS) plays an important role in protein identification because of its speed and high sensitivity.

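Tandem Mass Spectrometry (MS/MS)
[Figure (Kao's illustration): proteins are digested by an enzyme into peptides, which are ionized by electrospray ionization (ESI). The 1st mass spectrometer separates the ionized peptides by mass/charge and one peptide is selected. After fragmentation and ionization, the 2nd mass spectrometer measures the mass/charge of the resulting b-ions and y-ions. The spectrum is then interpreted by de novo peptide sequencing or by protein database searching.]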

Fragmentation and Ionization
[Figure: a peptide backbone with side chains R_1, …, R_n is cleaved between residues i and i+1. The N-terminal fragment (residues 1 to i) is the b-ion and the C-terminal fragment (residues i+1 to n, ending in OH) is the y-ion; both are drawn with a positive charge.]

Fragmentation and Ionization
Given a peptide sequence: α - S - W - R - β, with α = H = 1 and β = 2H + OH = 19.
Prefix b-ion sequence: b_1 = (α-S)+, b_2 = (α-S-W)+, b_3 = (α-S-W-R)+.
Suffix y-ion sequence: y_1 = (R-β)+, y_2 = (W-R-β)+, y_3 = (S-W-R-β)+.
Complementary ion pairs: b_1/y_2 and b_2/y_1.

Ideal Tandem Mass Spectrometry
Residue masses: S = 87.08, W = 186.21, R = 156.19.
[Figure: ideal spectrum, abundance (100%) versus mass/charge.]
b-ions (ladder S, W, R): 88.033, 274.112, 430.213.
y-ions (ladder R, W, S): 175.113, 361.121, 448.225.
All b-ions form a forward mass ladder.
All y-ions form a reverse mass ladder.
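The two ladders can be sketched in a few lines. This is a minimal illustration, assuming the nominal terminal masses α = 1 and β = 19 from the previous slide and the rounded residue masses listed on this slide; the function name `ion_ladders` is hypothetical. Because the inputs are rounded, the computed values differ slightly from the peak positions drawn in the figure.

```python
ALPHA = 1.0   # N-terminal H (nominal mass, as on the slide)
BETA = 19.0   # C-terminal 2H + OH (nominal mass, as on the slide)

def ion_ladders(residue_masses, alpha=ALPHA, beta=BETA):
    """Return (b_ions, y_ions) for a peptide given its residue masses.

    b_ions[i-1] is the mass of the prefix of length i plus alpha;
    y_ions[i-1] is the mass of the suffix of length i plus beta.
    """
    b_ions, y_ions = [], []
    prefix = alpha
    for mass in residue_masses:             # forward ladder
        prefix += mass
        b_ions.append(prefix)
    suffix = beta
    for mass in reversed(residue_masses):   # reverse ladder
        suffix += mass
        y_ions.append(suffix)
    return b_ions, y_ions

# Peptide S-W-R with the residue masses given on the slide.
swr = [87.08, 186.21, 156.19]
b, y = ion_ladders(swr)
```

In this model every complementary pair b_i / y_(n-i) sums to the same value, the total residue mass plus α + β, which is what lets a single peak be read either as a prefix or as a suffix.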

Problem 1: Ideal De Novo Peptide Sequencing
[Figure: spectrum with peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225; abundance (100%) versus mass/charge.]
We do not know whether an ion is a b-ion or a y-ion.

Problem 2: Ideal De Novo Peptide Sequencing
[Figure: spectrum with only the peaks at 274.112 and 361.121; abundance (100%) versus mass/charge.]
Some ions may be missing.

Noise and Amino Acid Modification
Each ion has multiple isotopic forms, e.g., C12 and C13.
Ions other than b-ions and y-ions may appear, e.g., a-ions and z-ions.
Some ions may lose a water or an ammonia molecule.
Some ions may have multiple charges.
A modified amino acid is an amino acid with slightly different atoms (and thus a different mass).
Amino acid modifications are usually related to protein functions.

Problem 3: De Novo Peptide Sequencing
[Figure: noisy spectrum; abundance (100%) versus mass/charge.]
Noise peaks appear in the mass spectrum.

Related Works
A mass spectrum can be compared against a peptide database: SEQUEST (Eng et al., 1994), Mascot (Perkins et al., 1999), ProteinProspector (Clauser et al., 1999).
De novo peptide sequencing extracts candidate peptide sequences directly from the mass spectrum.
Dancik et al. (1999) create a directed acyclic graph called the spectrum graph.
A mass peak is transformed into several nodes, and each node represents a possible prefix subsequence.
An edge connects two nodes whose masses differ by the total mass of some amino acids.
Cormen's algorithm for finding the longest path in a directed acyclic graph is then applied.

Result of This Paper
Chen et al. observe that Dancik's approach may include multiple nodes associated with the same mass peak in one path, a situation that is rare in practice.
They create a new NC-spectrum graph G = (V, E), where |V| = 2k + 2 and k is the number of mass peaks (ions).
The ideal de novo peptide sequencing problem can be solved in O(|V| + |E|) time and O(|V|) space.
The de novo peptide sequencing problem can be solved in O(|V||E|) time and O(|V|^2) space.
A modified amino acid can be found in O(|V||E|) time.

Ideal De Novo Peptide Sequencing Problem
Input: the parent mass W of an unknown peptide P, k mass peaks (ions) I_1, I_2, …, I_k, and the masses w_1, w_2, …, w_k of these ions.
Output: a peptide sequence Q such that a subset of its prefixes and suffixes gives the same mass peaks.

Construction of the NC-spectrum Graph
Create a pair of nodes, N_j and C_j, for each ion I_j.
Create two auxiliary nodes, N_0 and C_0, to represent the zero mass and the parent mass, respectively.
Let V = {N_0, N_1, …, N_k, C_0, C_1, …, C_k}.
Each node x is placed on the real line and assigned a coordinate cord(x) according to the total mass of its amino acids, that is,
cord(N_0) = 0, cord(C_0) = W - 18, cord(N_j) = w_j - 1, and cord(C_j) = W - w_j + 1 for 1 ≤ j ≤ k.
[Figure: nodes N_0, N_2, C_1, N_1, C_2, C_0 on the real line; coordinates shown: 87.10, 174.11, 273.11, 360.12, 429.22.]
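The node placement can be sketched as follows. This is an illustration under stated assumptions: the offsets cord(N_0) = 0, cord(C_0) = W - 18, cord(N_j) = w_j - 1, and cord(C_j) = W - w_j + 1 are taken as the coordinate rule because they reproduce the coordinates shown on the slides for W = 447.225 and the peaks 274.112 and 361.121; the function name `nc_coordinates` and the tuple node labels are hypothetical.

```python
def nc_coordinates(parent_mass, peak_masses):
    """Map each NC-spectrum node to its coordinate on the real line.

    Nodes are labeled ("N", j) and ("C", j); j = 0 marks the two
    auxiliary nodes. Offsets are the assumed nominal terminal masses.
    """
    coords = {("N", 0): 0.0, ("C", 0): parent_mass - 18.0}
    for j, w in enumerate(peak_masses, start=1):
        coords[("N", j)] = w - 1.0                 # peak read as a b-ion
        coords[("C", j)] = parent_mass - w + 1.0   # peak read as a y-ion
    return coords

# Values from the slides: W = 447.225, peaks at 274.112 and 361.121.
coords = nc_coordinates(447.225, [274.112, 361.121])
ordered = sorted(coords.values())
```

Sorting the coordinates gives the left-to-right order in which the nodes are later renamed.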

Construction of the NC-spectrum Graph
[Figure: spectrum with W = 447.225 and peaks at 274.112 and 361.121; abundance (100%) versus mass/charge. Nodes placed so far: N_0 and C_0 (429.22).]

Construction of the NC-spectrum Graph
[Figure: same spectrum. Nodes placed so far: N_0, C_1 (174.11), N_1 (273.11), C_0 (429.22).]

Construction of the NC-spectrum Graph
[Figure: same spectrum. All nodes placed: N_0, N_2, C_1, N_1, C_2, C_0; coordinates shown: 87.10, 174.11, 273.11, 360.12, 429.22.]
Solutions in Z1 includes that of Z2

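Construction of the NC-spectrum Graph
[Figure: edges between nodes N_0, N_2, C_1, N_1, C_2, C_0 (coordinates shown: 87.10, 174.11, 273.11, 360.12, 429.22), labeled with amino acids.]
Edge labels: S with Mass(S) = 87.08, W with Mass(W) = 186.21, R with Mass(R) = 156.19, and S+W with Mass(S+W) = 273.29.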

Construction of the NC-spectrum Graph
[Figure: the graph on coordinates 87.10, 174.11, 273.11, 360.12, 429.22.]
Each path from N_0 to C_0 represents a possible sequence for the peptide.
A feasible path is a path from N_0 to C_0 that goes through exactly one node of each pair (either N_j or C_j).

Construction of the NC-spectrum Graph
[Figure: a path on coordinates 87.10, 174.11, 273.11, 360.12, 429.22.]
This is not a feasible path: it misses ion I_2.

Construction of the NC-spectrum Graph
[Figure: a path on the same coordinates.]
This is not a feasible path: it uses ion I_1 twice.

Construction of the NC-spectrum Graph
[Figure: a path on the same coordinates.]
This is a feasible path.
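The feasibility condition on a path can be checked directly. The sketch below only tests the "exactly one node per pair" rule and does not verify that consecutive nodes are joined by edges; the function name `is_feasible`, the tuple node labels, and the three example paths are hypothetical stand-ins for the paths drawn in the figures.

```python
def is_feasible(path, k):
    """Check that a node sequence starts at N_0, ends at C_0, and visits
    exactly one of N_j / C_j for every ion j in 1..k.

    Nodes are ("N", j) or ("C", j). Edge existence is not checked here.
    """
    if len(path) < 2 or path[0] != ("N", 0) or path[-1] != ("C", 0):
        return False
    visits = [0] * (k + 1)
    for _kind, j in path[1:-1]:
        if j == 0:              # auxiliary nodes may only be endpoints
            return False
        visits[j] += 1
    return all(visits[j] == 1 for j in range(1, k + 1))

k = 2
misses_ion_2 = [("N", 0), ("C", 1), ("C", 0)]
repeats_ion_1 = [("N", 0), ("C", 1), ("N", 1), ("C", 2), ("C", 0)]
one_per_pair = [("N", 0), ("N", 2), ("N", 1), ("C", 0)]
results = [is_feasible(p, k) for p in (misses_ion_2, repeats_ion_1, one_per_pair)]
```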

Determine the Mass of Amino Acids
Input: the maximum mass h and the mass precision δ.
Output: a mass array A that, given a mass m, returns 1 if m equals the total mass of some amino acids and 0 otherwise.
A is computed from A[0] to A[h/δ] by assigning A[m] = 1 iff m equals one amino acid mass or there exists an amino acid mass r such that A[m - r] = 1.
Since there are only 20 amino acids (a constant number of choices for r), the mass array A can be constructed in O(h/δ) time.
The NC-spectrum graph G can be constructed in O(k^2) time.
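The recurrence for A can be sketched as below. This is a minimal version that assumes masses are first rounded to integer multiples of the precision δ; the function name `mass_array` and its parameters are hypothetical. The example uses the nominal residue masses of glycine (57) and alanine (71) only to keep the numbers small.

```python
def mass_array(max_mass, precision, residue_masses):
    """Return a list A where A[m] is True iff m * precision is the total
    mass of one or more residues drawn from residue_masses."""
    size = int(round(max_mass / precision)) + 1
    # Residue masses expressed in units of the precision.
    units = sorted({int(round(mass / precision)) for mass in residue_masses})
    reachable = [False] * size
    for m in range(1, size):
        for r in units:
            if r > m:
                break
            # m is a single residue, or a reachable mass plus one residue.
            if r == m or reachable[m - r]:
                reachable[m] = True
                break
    return reachable

# Nominal masses of glycine and alanine, precision of one mass unit.
A = mass_array(200, 1, [57, 71])
```

Each entry looks at a constant number of residues, so the table costs time proportional to its length, matching the O(h/δ) bound on the slide.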

Problem Reformulation
Input: an NC-spectrum graph G.
Output: a feasible path from N_0 to C_0.
Difficulty: a longest path does not always go through exactly one node of each pair. Finding a path with this property is NP-hard if the graph is a general directed graph.

Renaming Nodes
Rename the nodes from left to right as X_0, …, X_k, Y_k, …, Y_0.
[Figure: nodes N_0, N_2, C_1, N_1, C_2, C_0 (coordinates shown: 87.10, 174.11, 273.11, 360.12, 429.22) are renamed X_0, X_1, X_2, Y_2, Y_1, Y_0.]
X_i and Y_i form a complementary pair of nodes for ion i.

Problem Reformulation
Node order: X_0, X_1, …, X_k, Y_k, …, Y_1, Y_0.
Let M(i, j) be a two-dimensional matrix with 0 ≤ i, j ≤ k.
Let M(i, j) = 1 if there exist a path L from X_0 to X_i and a path R from Y_j to Y_0 such that L and R together contain exactly one of X_p and Y_p for each p in [1, max{i, j}]; otherwise M(i, j) = 0.
[Figure: path L from X_0 to X_i and path R from Y_j to Y_0.]

Problem Reformulation
There is a feasible path if and only if
for some i, there is an edge e from X_i to Y_k and M(i, k) = 1, or
for some j, there is an edge e from X_k to Y_j and M(k, j) = 1.
[Figure: edge e joining the end of L (at X_i or X_k) to the start of R (at Y_k or Y_j).]

Compute M(i, j) by Dynamic Programming
1. Initialize M(0, 0) = 1 and M(i, j) = 0 for all other i and j.
2. Compute M(1, 0) and M(0, 1).
3. For j = 2 to k
     For i = 0 to j-2
       (a) If M(i, j-1) = 1 and edge(X_i, X_j) = 1, then M(j, j-1) = 1.
       (b) If M(i, j-1) = 1 and edge(Y_j, Y_{j-1}) = 1, then M(i, j) = 1.
       (c) If M(j-1, i) = 1 and edge(X_{j-1}, X_j) = 1, then M(j, i) = 1.
       (d) If M(j-1, i) = 1 and edge(Y_j, Y_i) = 1, then M(j-1, j) = 1.
Each step extends L or R by one edge.
[Figure: paths L and R on nodes X_0, X_i, X_k, Y_k, Y_0.]

Compute M(i, j) by Dynamic Programming
[Figure: cases (a) and (b) of step 3 on nodes X_0, X_i, X_j, Y_j, Y_{j-1}, Y_0. In (a), the edge e from X_i to X_j extends L. In (b), the edge e from Y_j to Y_{j-1} extends R.]

Compute M(i, j) by Dynamic Programming
Example on nodes X_0, X_1, X_2, Y_2, Y_1, Y_0, showing the step i = 0, j = 2.
[Figure: matrix M with rows and columns 0 to 2.]
M is computed in O(|V|^2) time.
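The table-filling procedure and the final feasibility test can be sketched together. Assumptions made here: nodes are labeled ("X", i) and ("Y", j), edges are given as a predicate `has_edge(u, v)`, and step 2 is read as M(1, 0) = E(X_0, X_1) and M(0, 1) = E(Y_1, Y_0), which the slide leaves implicit. The helper names and the three-edge example graph are hypothetical.

```python
def X(i):
    return ("X", i)

def Y(j):
    return ("Y", j)

def compute_M(k, has_edge):
    """Fill M so that M[i][j] is True iff paths L (X_0 to X_i) and
    R (Y_j to Y_0) exist that use exactly one node of every pair
    1..max(i, j)."""
    M = [[False] * (k + 1) for _ in range(k + 1)]
    M[0][0] = True
    if k >= 1:  # step 2, under the stated assumption
        M[1][0] = has_edge(X(0), X(1))
        M[0][1] = has_edge(Y(1), Y(0))
    for j in range(2, k + 1):
        for i in range(0, j - 1):          # i = 0 .. j-2
            if M[i][j - 1] and has_edge(X(i), X(j)):
                M[j][j - 1] = True         # (a) extend L to X_j
            if M[i][j - 1] and has_edge(Y(j), Y(j - 1)):
                M[i][j] = True             # (b) extend R to Y_j
            if M[j - 1][i] and has_edge(X(j - 1), X(j)):
                M[j][i] = True             # (c) extend L to X_j
            if M[j - 1][i] and has_edge(Y(j), Y(i)):
                M[j - 1][j] = True         # (d) extend R to Y_j
    return M

def has_feasible_path(k, has_edge):
    """Join L and R with one edge into pair k, as in the reformulation."""
    M = compute_M(k, has_edge)
    for i in range(k + 1):
        if M[i][k] and has_edge(X(i), Y(k)):
            return True
        if M[k][i] and has_edge(X(k), Y(i)):
            return True
    return False

# Hypothetical graph with k = 2 and the single path X_0 -> X_1 -> Y_2 -> Y_0.
edges = {(X(0), X(1)), (X(1), Y(2)), (Y(2), Y(0))}
M = compute_M(2, lambda u, v: (u, v) in edges)
found = has_feasible_path(2, lambda u, v: (u, v) in edges)
```

Every entry read in iteration j has both indices below j and every entry written has an index equal to j, so the four rules can be applied in any order within an iteration.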

Final Solution
[Figure: matrix M for nodes X_0, X_1, X_2, Y_2, Y_1, Y_0.]
Examine each entry equal to 1 in the last row and the last column of M; each is a candidate feasible solution.
Check whether there is an edge connecting (X_k, Y_i) or (X_i, Y_k), respectively.
Backtrack through M to recover the edges of the feasible solution.
This step takes O(|V|) time.

Speedup
Encode M into two linear arrays, lce() and dia(), such that any entry of M can be computed in O(1) time.
Let lce(z) be the length of the longest run of consecutive inside edges starting from node z (computable in O(|V|) time).
Let dia(z) store the two diagonals of M next to the main diagonal, dia(x_j) = M(j, j-1) and dia(y_j) = M(j-1, j) (computable in O(|E|) time).
dia(x_j) = M(j, j-1) = 1 iff there exists some i < j-1 such that M(i, j-1) = 1 and E(x_i, x_j) = 1.
M(i, j-1) can be computed in constant time since dia(x_0), …, dia(x_{j-1}) and dia(y_{j-1}), …, dia(y_0) have already been computed.

Speedup
Let M(i, j), i < j, be the entry we want to compute.
If i = j-1, then M(i, j) = dia(y_j).
If i < j-1, then M(i, j) = 1 when M(i, i+1) = 1 and E(y_j, y_{j-1}) = … = E(y_{i+2}, y_{i+1}) = 1, that is, the j-i-1 consecutive edges from y_j down to y_{i+1} all exist, so lce(y_j) ≥ j-i-1.
Given G = (V, E), a feasible solution can be found in O(|V| + |E|) time and O(|V|) space.

Peptide sequencing
In practice, a tandem mass spectrum contains noise and other types of ions.
Thus, a solution path does not need to pass through a node of every pair in G.
[Figure: noisy spectrum; abundance (100%) versus mass/charge.]

Algorithm for peptide sequencing
Compute an NC-spectrum graph G.
Construct a two-dimensional matrix Q using a scoring function s(.).
High peaks (i.e., ions with high abundance) and edges labeled with a single amino acid receive higher scores.
Use Q to find a feasible solution.

Matrix Q
Q(i, j) = max over L, R of {s(L) + s(R)} if there is a path L from x_0 to x_i and a path R from y_j to y_0 such that at most one of x_p and y_p is in L ∪ R for every p ∈ [1, i] ∪ [1, j].
Q(i, j) = 0 otherwise.

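[Figure: example on nodes x_0, x_1, x_2, y_3, y_2, y_1, y_0 with scored edges, for which Q(2, 3) = 7.]

[Figure: matrix Q for nodes X_0, X_1, X_2, Y_2, Y_1, Y_0.]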

Dynamic programming
Initial: Q(i, j) = 0 for all 0 ≤ i, j ≤ k.
For j = 1 to k
  If E(y_j, y_0) = 1, then Q(0, j) = max{Q(0, j), s(y_j, y_0)};
  If E(x_0, x_j) = 1, then Q(j, 0) = max{Q(j, 0), s(x_0, x_j)};
  For i = 0 to j-1
    (a) For every E(y_j, y_p) = 1 and Q(i, p) > 0, Q(i, j) = max{Q(i, j), Q(i, p) + s(y_j, y_p)};
    (b) For every E(x_p, x_j) = 1 and Q(p, i) > 0, Q(j, i) = max{Q(j, i), Q(p, i) + s(x_p, x_j)}.
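The scored recurrence can be sketched as follows. Assumptions: edge scores are strictly positive and stored in a dictionary keyed by (u, v) with nodes labeled ("X", i) and ("Y", j), and the final step joins the two paths with an edge from x_i to y_j. For brevity the sketch scans every p instead of walking adjacency lists, so it runs in O(k^3) rather than the O(|V||E|) bound quoted in the talk. The function name `best_score` and the example scores are hypothetical.

```python
def best_score(k, score):
    """Return (Q, best), where Q follows the recurrence above and best is
    the highest total score of a path X_0 .. x_i -> y_j .. Y_0."""
    def X(i):
        return ("X", i)

    def Y(j):
        return ("Y", j)

    Q = [[0] * (k + 1) for _ in range(k + 1)]
    for j in range(1, k + 1):
        if (Y(j), Y(0)) in score:
            Q[0][j] = max(Q[0][j], score[(Y(j), Y(0))])
        if (X(0), X(j)) in score:
            Q[j][0] = max(Q[j][0], score[(X(0), X(j))])
        for i in range(j):
            for p in range(j):
                right = (Y(j), Y(p))       # (a) extend R from y_p to y_j
                if right in score and Q[i][p] > 0:
                    Q[i][j] = max(Q[i][j], Q[i][p] + score[right])
                left = (X(p), X(j))        # (b) extend L from x_p to x_j
                if left in score and Q[p][i] > 0:
                    Q[j][i] = max(Q[j][i], Q[p][i] + score[left])
    best = 0
    for i in range(k + 1):
        for j in range(k + 1):
            bridge = (X(i), Y(j))
            if Q[i][j] > 0 and bridge in score:
                best = max(best, Q[i][j] + score[bridge])
    return Q, best

# Hypothetical scored graph with k = 2.
score = {
    (("X", 0), ("X", 1)): 2, (("Y", 2), ("Y", 0)): 3, (("X", 1), ("Y", 2)): 4,
    (("X", 0), ("X", 2)): 1, (("X", 2), ("Y", 1)): 1, (("Y", 1), ("Y", 0)): 1,
}
Q, best = best_score(2, score)
```

Here the path X_0, X_1, Y_2, Y_0 scores 2 + 4 + 3 and beats X_0, X_2, Y_1, Y_0, which scores 1 + 1 + 1.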

Illustration
[Figure: extending R from y_p to y_j on nodes x_0, x_i, y_j, y_p, y_0.]

Illustration
[Figure: matrix Q with rows and columns 1 to 3.]

Feasible solution
For all i, j: if Q(i, j) > 0 and E(x_i, y_j) = 1, compute max{Q(i, j) + s(x_i, y_j)}.
Backtrack through Q(p, q) to find all edges of the feasible solution.

Complexity
Given G, Q can be computed in O(|V||E|) time.
Given G, a feasible solution of G can be found in O(|V||E|) time and O(|V|^2) space.

Algorithm for one-amino acid modification There are a few hundred known modifications. In most experiments, a protein is digested into multiple peptides, and most peptides have at most one modified amino acid.

Algorithm for one-amino acid modification
The one-amino acid modification problem is equivalent to the following problem: given G = (V, E), find two nodes v_i and v_j such that E(v_i, v_j) = 0 but adding the edge (v_i, v_j) to G creates a feasible solution that contains this edge.

Algorithm for one-amino acid modification
Compute an NC-spectrum graph G.
Construct a two-dimensional matrix N (assuming an ideal spectrum).
Use N and M to examine whether a chosen edge is a possible solution.

Matrix N
N(i, j) = 1 if and only if there is a path from x_i to y_j that contains exactly one of x_p and y_p for every p ∈ [i, k] ∪ [j, k].
N(i, j) = 0 otherwise.

[Figure: matrix N for nodes X_0, X_1, X_2, Y_2, Y_1, Y_0.]

Dynamic programming
Initialize N(i, j) = 0 for all i and j;
Compute N(k, k-1) and N(k-1, k);
For j = k-2 to 0
  For i = k to j+2
    (a) if N(i, j+1) = 1 and E(x_j, x_i) = 1, then N(j, j+1) = 1;
    (b) if N(i, j+1) = 1 and E(y_{j+1}, y_j) = 1, then N(i, j) = 1;
    (c) if N(j+1, i) = 1 and E(x_j, x_{j+1}) = 1, then N(j, i) = 1;
    (d) if N(j+1, i) = 1 and E(y_i, y_j) = 1, then N(j+1, j) = 1.
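The recurrence for N mirrors the one for M, working inward-out from pair k. A minimal sketch, assuming nodes labeled ("X", i) and ("Y", j), an edge predicate `has_edge(u, v)`, and base cases N(k, k-1) = E(x_k, y_{k-1}) and N(k-1, k) = E(x_{k-1}, y_k), which the slide leaves implicit. The function name `compute_N` and the example graph are hypothetical.

```python
def compute_N(k, has_edge):
    """Fill N so that N[i][j] is True iff a path from x_i to y_j exists
    that uses exactly one node of every pair min(i, j)..k."""
    def X(i):
        return ("X", i)

    def Y(j):
        return ("Y", j)

    N = [[False] * (k + 1) for _ in range(k + 1)]
    if k >= 1:  # base cases, under the stated assumption
        N[k][k - 1] = has_edge(X(k), Y(k - 1))
        N[k - 1][k] = has_edge(X(k - 1), Y(k))
    for j in range(k - 2, -1, -1):          # j = k-2 down to 0
        for i in range(k, j + 1, -1):       # i = k down to j+2
            if N[i][j + 1] and has_edge(X(j), X(i)):
                N[j][j + 1] = True          # (a) prepend x_j
            if N[i][j + 1] and has_edge(Y(j + 1), Y(j)):
                N[i][j] = True              # (b) append y_j
            if N[j + 1][i] and has_edge(X(j), X(j + 1)):
                N[j][i] = True              # (c) prepend x_j
            if N[j + 1][i] and has_edge(Y(i), Y(j)):
                N[j + 1][j] = True          # (d) append y_j
    return N

# Hypothetical graph with k = 2 and the single path X_0 -> X_1 -> Y_2 -> Y_0.
edges = {
    (("X", 0), ("X", 1)),
    (("X", 1), ("Y", 2)),
    (("Y", 2), ("Y", 0)),
}
N = compute_N(2, lambda u, v: (u, v) in edges)
```

In this example N(1, 2) is set by the base case, and rules (c) and (d) then grow the middle segment outward to X_0 and to Y_0.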

Illustration
[Figure: one extension step on nodes x_j, x_i, y_{j+1}, y_j.]

Illustration
[Figure: matrix N with rows and columns 3 to 1.]

How to find a feasible solution?
Without loss of generality, suppose that the modification is between two prefix nodes x_i and x_j with 0 ≤ i < j ≤ k and E(x_i, x_j) = 0.
There are five cases:
1. i+1 < j.
2. 1 < i+1 = j < k.
3. 0 = i = j-1.
4. i+1 = j = k.
5. The modification is between x_k and y_j.

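Case i+1 < j
Conditions: M(i, i+1) = 1 and N(j, i+1) = 1.
[Figure: nodes x_0, x_i, x_j, y_{i+1}, y_0.]
O(|V|^2)

Case 1 < i+1 = j < k
Conditions: M(i, p) = 1, N(j, q) = 1, and E(y_q, y_p) = 1.
[Figure: nodes x_0, x_i, x_j, y_q, y_j, y_i, y_p, y_0.]
O(|V||E|)

Case 0 = i = j-1
Conditions: E(y_q, y_0) = 1 and N(1, q) = 1.
[Figure: nodes x_0 (= x_i), x_1 (= x_j), y_q, y_1, y_0.]
O(|V|)

Case i+1 = j = k
Conditions: E(x_k, y_p) = 1 and M(k-1, p) = 1.
[Figure: nodes x_0, x_{k-1} (= x_i), x_k (= x_j), y_k, y_{k-1}, y_p, y_0.]
O(|V|)

Case: the modification is between x_k and y_j
Conditions: E(x_k, y_j) = 0 and M(k, j) = 1.
[Figure: nodes x_0, x_k, y_j, y_0.]
O(|V|)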

Complexity
Given G, N can be computed in O(|V|^2) time.
Given G, all possible amino acid modifications can be found in O(|V||E|) time and O(|V|^2) space.

Experimental results
