(Journal of Computational Biology, 2001) (SODA, 2000)

A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry
Ting Chen, Ming-Yang Kao, Matthew Tepel, John Rush, and George M. Church
(Journal of Computational Biology, 2001) (SODA, 2000)
Speaker: Yao-Ting Huang, Hung-Lung Wang
2019/8/3

Introduction
De novo peptide sequencing identifies a peptide sequence without the help of a protein database, and is especially useful for identifying unknown proteins. A mass spectrometer is an instrument that measures the molecular weight of chemical compounds according to their mass-to-charge ratio. Tandem mass spectrometry (MS/MS) plays an important role in protein identification because of its speed and high sensitivity.

Tandem Mass Spectrometry (MS/MS)
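[Figure (Kao's illustration): proteins are cut by an enzyme into peptides; the peptides are ionized by electrospray ionization (ESI) and separated by mass/charge in the 1st mass spectrometer; one peptide is then fragmented and ionized into b-ions and y-ions, which are measured by mass/charge in the 2nd mass spectrometer. The resulting spectrum is interpreted by de novo peptide sequencing or protein database searching.]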

Fragmentation and Ionization
[Figure: a peptide backbone H–(NH–CHR1–CO)–…–(NH–CHRn–CO)–OH is cleaved between residues Ri and Ri+1, giving a b-ion (the N-terminal fragment ending at Ri) and a y-ion (the C-terminal fragment starting at Ri+1), each carrying a + charge.]

Fragmentation and Ionization
Given a peptide sequence S – W – R with N-terminal group α = H = 1 and C-terminal group β = 2H + OH = 19:
Prefix b-ion sequence: b1 = (α S)+, b2 = (α S – W)+, b3 = (α S – W – R)+
Suffix y-ion sequence: y1 = (R β)+, y2 = (W – R β)+, y3 = (S – W – R β)+
Complementary ion pairs: b1/y2 and b2/y1
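The b-ion and y-ion ladders for S – W – R can be sketched in a few lines. This is only an illustration: α = 1 and β = 19 are taken from the slide, while the residue masses are the standard monoisotopic values (they are not given in the slides), and the function names `b_ions` and `y_ions` are made up for this example.

```python
# Standard monoisotopic residue masses (an assumption; not quoted in the slides).
RESIDUE_MASS = {"S": 87.03203, "W": 186.07931, "R": 156.10111}
ALPHA = 1   # N-terminal H, as on the slide
BETA = 19   # C-terminal 2H + OH, as on the slide

def b_ions(peptide):
    """Masses of b1..bn: mass of each prefix plus ALPHA."""
    masses, total = [], 0.0
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        masses.append(total + ALPHA)
    return masses

def y_ions(peptide):
    """Masses of y1..yn: mass of each suffix plus BETA."""
    masses, total = [], 0.0
    for aa in reversed(peptide):
        total += RESIDUE_MASS[aa]
        masses.append(total + BETA)
    return masses

b = b_ions("SWR")
y = y_ions("SWR")
# A complementary pair b_i / y_(n-i) always sums to (total residue mass + ALPHA + BETA).
pair_sum = sum(RESIDUE_MASS[aa] for aa in "SWR") + ALPHA + BETA
```

Under these assumptions b1 comes out near 88.03, and both b1 + y2 and b2 + y1 equal `pair_sum`, which is what makes the pairs complementary.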

Ideal Tandem Mass Spectrometry
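[Figure: ideal tandem mass spectrum of S – W – R, abundance (100%) against mass/charge, with the y-ions R, W, S and the b-ions S, W, R marked and a peak at 88.033. S = 87.08, W = , R = ]
All b-ions form a forward mass ladder. All y-ions form a reverse mass ladder.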

Problem 1: Ideal De Novo Peptide Sequencing
[Figure: ideal spectrum with unlabeled peaks, abundance (100%) against mass/charge; one peak is at 88.033.]
We do not know whether an ion is a b-ion or a y-ion.

Problem 2: Ideal De Novo Peptide Sequencing
[Figure: ideal spectrum with some peaks absent, abundance (100%) against mass/charge.]
Some ions may be missing.

Noise and Amino acid Modification
Each ion has multiple isotopic forms, e.g., C12 and C13.
Ions other than b and y may appear, e.g., a-ions and z-ions.
Some ions may lose a water or an amino acid.
Some ions may have multiple charges.
A modified amino acid is an amino acid with slightly different atoms (and thus a different mass). Amino acid modifications are usually related to protein functions.

Problem 3: De Novo Peptide Sequencing
[Figure: noisy spectrum, abundance (100%) against mass/charge.]
Noise appears in the mass spectrum.

Related Works
The mass spectrum can be compared against a peptide database: SEQUEST (Eng et al., 1994), Mascot (Perkins et al., 1999), ProteinProspector (Clauser et al., 1999).
De novo peptide sequencing extracts candidate peptide sequences directly from the mass spectrum.
Dancik et al. (1999) create a directed acyclic graph called the spectrum graph. A mass peak is transformed into several nodes, and each node represents a possible prefix subsequence. An edge connects two nodes that differ by the total mass of some amino acids. Cormen's algorithm is then applied to find the longest path in the graph.

Result of This Paper
Chen et al. observe that Dancik's approach may include multiple nodes associated with the same mass peak, which is rare in practice.
They create a new NC-spectrum graph G = (V, E), where |V| = 2k+2 and k is the number of mass peaks (ions).
The ideal de novo peptide sequencing problem can be solved in O(|V|+|E|) time and O(|V|) space.
The de novo peptide sequencing problem can be solved in O(|V||E|) time and O(|V|^2) space.
A modified amino acid can be found in O(|V||E|) time.

Ideal De Novo Peptide Sequencing Problem
Input: the parent mass W of an unknown peptide P, k mass peaks (ions) I1, I2, …, Ik, and masses w1, w2, …, wk of these ions. Output: A peptide sequence Q such that a subset of its prefixes and suffixes gives the same mass peaks.

Construction of the NC-spectrum Graph
Create a pair of nodes, Nj and Cj, for each ion Ij. Create two auxiliary nodes, N0 and C0, to represent the zero mass and the parent mass, respectively. Let V = {N0, N1, …, Nk, C0, C1, …, Ck}. Each node x is placed on a real line and assigned a coordinate cord(x) according to the total mass of its amino acids.
[Figure: the nodes N0, N2, C1, N1, C2, C0 on the real line; the marked coordinates are 87.10, 174.11, 273.11, 360.12 and 429.22.]
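One coordinate rule that reproduces the example figure is cord(N0) = 0, cord(C0) = W - 18, cord(Nj) = wj - 1 and cord(Cj) = W - wj + 1. The sketch below assumes that rule; it is inferred from α = 1, β = 19 and the example coordinates rather than quoted from the slides. The parent mass 447.22 and the ion masses 274.11 and 88.10 are likewise back-computed from the figure, and `nc_coordinates` is a hypothetical name.

```python
def nc_coordinates(parent_mass, ion_masses):
    """Return {node name: coordinate} for N0, C0 and every pair Nj, Cj.

    Assumed rule (inferred from the example, not stated in the slides):
    cord(N0) = 0, cord(C0) = W - 18, cord(Nj) = wj - 1, cord(Cj) = W - wj + 1.
    """
    cord = {"N0": 0.0, "C0": parent_mass - 18}
    for j, w in enumerate(ion_masses, start=1):
        cord[f"N{j}"] = w - 1                # prefix mass if Ij is a b-ion
        cord[f"C{j}"] = parent_mass - w + 1  # prefix mass if Ij is a y-ion
    return cord

# Back-computed example values (assumptions): W = 447.22, w1 = 274.11, w2 = 88.10.
cord = nc_coordinates(447.22, [274.11, 88.10])
left_to_right = sorted(cord, key=cord.get)
```

With these inputs the nodes sort into N0, N2, C1, N1, C2, C0, the left-to-right order shown in the figure, and every pair satisfies cord(Nj) + cord(Cj) = W.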

[Figure sequence: the NC-spectrum graph is built on the mass/charge axis of the spectrum (abundance, 100%). First N0 and C0 (429.22) are placed; then C1 (174.11) and N1 (273.11) are added; finally N2 (87.10) and C2 (360.12) are added, giving N0, N2, C1, N1, C2, C0.]

Solutions in Z1 include those of Z2.

[Figure: the edges of the graph are labeled with amino acids: S (Mass(S) = 87.08), W, R, and S+W.]

Each path from N0 to C0 represents a possible sequence for the peptide. A feasible path is a path from N0 to C0 that goes through exactly one node of each pair (either Nj or Cj).

[Figure: three example paths on the graph with coordinates 87.10, 174.11, 273.11, 360.12 and 429.22. The first is not a feasible path because it misses ion I2. The second is not a feasible path because it repeats ion I1. The third is a feasible path.]
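The feasibility condition itself is easy to state in code. The sketch below only checks the "exactly one node per pair" rule on a path given as a list of node names; it does not check that consecutive nodes are joined by an edge, and the name `is_feasible` and the string encoding of nodes are choices made for this example.

```python
def is_feasible(path, k):
    """path: node names such as ["N0", "N2", "N1", "C0"]; k: number of ions."""
    if len(path) < 2 or path[0] != "N0" or path[-1] != "C0":
        return False
    used = set()
    for name in path[1:-1]:
        j = int(name[1:])              # ion index of Nj or Cj
        if not 1 <= j <= k or j in used:
            return False               # invalid index, or ion j visited twice
        used.add(j)
    return len(used) == k              # no ion may be skipped

ok = is_feasible(["N0", "N2", "N1", "C0"], 2)
misses_i2 = is_feasible(["N0", "N1", "C0"], 2)
repeats_i1 = is_feasible(["N0", "N2", "C1", "N1", "C0"], 2)
```

The three calls correspond to the three kinds of path in the slides: a feasible path, one that misses ion I2, and one that repeats ion I1.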

Determine the Mass of Amino Acids
Input: the maximum mass h and the mass precision δ.
Output: a mass array A that takes an input m and returns 1 if m equals the total mass of some amino acids.
The mass array A can be constructed in O(h/δ) time. A is computed from A[0] to A[h/δ] by assigning A[m] = 1 iff m equals one amino acid mass or there exists an amino acid r such that A[m - r] = 1. Since there are only 20 amino acids (i.e., at most 20 choices of r), the running time is O(h/δ).
The NC-spectrum graph G can be constructed in O(k^2) time.
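A minimal sketch of the mass array, assuming masses are first rounded to integer multiples of δ. The function name `build_mass_array` and the two-residue example (glycine and alanine at nominal masses 57 and 71) are illustrative choices, not part of the slides.

```python
def build_mass_array(residue_masses, h, delta):
    """A[m] = 1 iff m * delta is the total mass of one or more residues."""
    size = int(round(h / delta))
    units = sorted({int(round(r / delta)) for r in residue_masses})
    A = [0] * (size + 1)
    for m in range(1, size + 1):
        for r in units:                # at most 20 residues, so O(1) per entry
            if r == m or (r < m and A[m - r]):
                A[m] = 1
                break
    return A

# Illustrative input: nominal masses of glycine (57) and alanine (71), delta = 1.
A = build_mass_array([57, 71], h=200, delta=1)
```

An edge between two nodes of the NC-spectrum graph can then be tested with a single lookup `A[round(distance / delta)]`, which is consistent with the O(k^2) construction over all node pairs.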

Problem Reformulation
Input: an NC-spectrum graph G. Output: a feasible path from N0 to C0. Difficulty: A longest path does not always go through exactly one of each pair of nodes. It is an NP-hard problem if the graph is a general directed graph.

Renaming Nodes
Rename the nodes from left to right as X0, …, Xk, Yk, …, Y0. Xi and Yi form a complementary pair of nodes for ion i.
[Figure: the example nodes N0, N2, C1, N1, C2, C0 (coordinates 87.10, 174.11, 273.11, 360.12, 429.22) are renamed X0, X1, X2, Y2, Y1, Y0.]

Let M(i, j) be a two-dimensional matrix with 0 ≤ i, j ≤ k. Let M(i, j) = 1 if there exists a path L from X0 to Xi and a path R from Yj to Y0 such that L and R together contain exactly one of Xp and Yp for each p in [1, max{i, j}]; otherwise M(i, j) = 0.
[Figure: the nodes X0, X1, …, Xk, Yk, …, Y1, Y0, with L running from X0 to Xi and R running from Yj to Y0.]

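There is a feasible path if and only if, for some i, there is an edge e from Xi to Yk and M(i, k) = 1, or, for some j, there is an edge e from Xk to Yj and M(k, j) = 1.
[Figure: the two cases. In the first, L runs from X0 to Xi, e joins Xi to Yk, and R runs from Yk to Y0. In the second, L runs from X0 to Xk, e joins Xk to Yj, and R runs from Yj to Y0.]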

Compute M(i, j) by Dynamic Programming
1. Initialize M(0,0) = 1 and M(i,j) = 0 for all other i and j.
2. Compute M(1,0) and M(0,1).
3. For j = 2 to k
     For i = 0 to j-2
       (a) If M(i, j-1) = 1 and edge(Xi, Xj) = 1, then M(j, j-1) = 1.
       (b) If M(i, j-1) = 1 and edge(Yj, Yj-1) = 1, then M(i, j) = 1.
       (c) If M(j-1, i) = 1 and edge(Xj-1, Xj) = 1, then M(j, i) = 1.
       (d) If M(j-1, i) = 1 and edge(Yj, Yi) = 1, then M(j-1, j) = 1.
L and R are extended by one edge at a time.
[Figure: the nodes X0, Xi, Xk, Yk, Y0 with the paths L and R.]

[Figure: cases (a) and (b) of step 3 on the nodes X0, Xi, Xj, Yj, Yj-1, Y0, each showing L, R and the new edge e.]


[Figure: example run on X0, X1, X2, Y2, Y1, Y0, showing the entries of M at (i = 0, j = 2).]
The computation takes O(|V|^2) time.
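The three steps can be transcribed almost line by line. The sketch below makes a few assumptions that the slides do not spell out: nodes are written as ("X", i) and ("Y", j), `edge(u, v)` is a user-supplied predicate, step 2 is read as M(1,0) = E(X0, X1) and M(0,1) = E(Y1, Y0), and the five-edge example graph is invented to mirror the k = 2 figure.

```python
def compute_M(k, edge):
    """Direct transcription of steps 1-3; edge(u, v) says whether u -> v exists."""
    X = lambda i: ("X", i)
    Y = lambda j: ("Y", j)
    M = [[0] * (k + 1) for _ in range(k + 1)]
    M[0][0] = 1                                    # step 1
    if k >= 1:                                     # step 2 (assumed reading)
        M[1][0] = int(edge(X(0), X(1)))
        M[0][1] = int(edge(Y(1), Y(0)))
    for j in range(2, k + 1):                      # step 3
        for i in range(j - 1):                     # i = 0 .. j-2
            if M[i][j - 1] and edge(X(i), X(j)):
                M[j][j - 1] = 1                    # (a)
            if M[i][j - 1] and edge(Y(j), Y(j - 1)):
                M[i][j] = 1                        # (b)
            if M[j - 1][i] and edge(X(j - 1), X(j)):
                M[j][i] = 1                        # (c)
            if M[j - 1][i] and edge(Y(j), Y(i)):
                M[j - 1][j] = 1                    # (d)
    return M

def has_feasible_path(k, edge):
    """Feasible iff some Xi -> Yk with M(i,k) = 1 or some Xk -> Yj with M(k,j) = 1."""
    M = compute_M(k, edge)
    return any(
        (M[i][k] and edge(("X", i), ("Y", k)))
        or (M[k][i] and edge(("X", k), ("Y", i)))
        for i in range(k + 1)
    )

# Invented example for k = 2 (nodes X0 X1 X2 Y2 Y1 Y0).
EDGES = {
    (("X", 0), ("X", 1)), (("X", 1), ("Y", 2)), (("Y", 2), ("Y", 0)),
    (("X", 0), ("Y", 2)), (("X", 2), ("Y", 1)),
}
edge = lambda u, v: (u, v) in EDGES
M = compute_M(2, edge)
feasible = has_feasible_path(2, edge)
```

In this example rule (d) sets M(1, 2) = 1 (L = X0 → X1, R = Y2 → Y0), and the edge X1 → Y2 then closes the feasible path X0 → X1 → Y2 → Y0.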

Final Solution
Check each 1 (i.e., a feasible solution) in the last row and the last column of M: check whether there is an edge connecting (Xk, Yi) or (Xi, Yk). This takes O(|V|) time.
Backtrack through M to recover each edge of the feasible solution.
[Figure: the matrix M for the example graph X0, X1, X2, Y2, Y1, Y0.]

Speedup
Encode M into two linear arrays, lce() and dia(), such that any entry of M can be computed in O(1) time.
Let lce(z) be the length of the longest run of consecutive inside edges starting from node z (computable in O(|V|) time).
Let dia(z) be the two diagonals of M (computable in O(|E|) time): dia(xj) = M(j, j-1) = 1 iff there exists some i < j-1 such that M(i, j-1) = 1 and E(xi, xj) = 1.
M(i, j-1) can be computed in constant time since dia(x0), …, dia(xj-1) and dia(yj-1), …, dia(y0) have already been computed.

Speedup
Let M(i, j), i < j, be the entry we want to compute.
If i = j-1, M(i, j) = dia(yj).
If i < j-1, M(i, j) = 1 when M(i, i+1) = 1 and E(yj, yj-1) = … = E(yi+2, yi+1) = 1, that is, when the j-i-1 consecutive edges from yj down to yi+1 all exist: lce(yj) ≥ j-i-1.
Given G = (V, E), a feasible solution can be found in O(|V|+|E|) time and O(|V|) space.

Peptide sequencing
In practice, a tandem mass spectrum contains noise and other types of ions. Thus, we do not need to traverse all vertices of G.
[Figure: a noisy tandem mass spectrum, abundance (100%) against mass/charge.]

Algorithm for peptide sequencing
Compute an NC-spectrum graph G.
Construct a two-dimensional matrix Q using a scoring function s(.). High peaks (i.e., ions with high abundance) and edges labeled with a single amino acid receive higher scores.
Use Q to find a feasible solution.

Matrix Q
Q(i, j) = max over L, R of {s(L) + s(R)}, if there is a path L from x0 to xi and a path R from yj to y0 such that at most one of xp and yp is in L ∪ R for every p ∈ [1, i] ∪ [1, j].
Q(i, j) = 0, otherwise.

[Figure: example on the nodes x0, x1, x2, y3, y2, y1, y0 with scored edges, in which Q(2,3) = 7.]

[Figure: the matrix Q for the graph X0, X1, X2, Y2, Y1, Y0.]

Dynamic programming
Initial: Q(i, j) = 0 for all 0 ≤ i, j ≤ k.
For j = 1 to k
  If E(yj, y0) = 1, then Q(0, j) = max{Q(0, j), s(yj, y0)};
  If E(x0, xj) = 1, then Q(j, 0) = max{Q(j, 0), s(x0, xj)};
  For i = 0 to j - 1
    (a) For every E(yj, yp) = 1 and Q(i, p) > 0, Q(i, j) = max{Q(i, j), Q(i, p) + s(yj, yp)};
    (b) For every E(xp, xj) = 1 and Q(p, i) > 0, Q(j, i) = max{Q(j, i), Q(p, i) + s(xp, xj)}.

Illustration
[Figure: extending R from yp to yj while L ends at xi, on the nodes x0, xi, yj, yp, y0; and the matrix Q with rows and columns 1 to 3.]

Feasible solution
For all i, j, if Q(i, j) > 0 and E(xi, yj) = 1, compute max{Q(i, j) + s(xi, yj)}.
Backtrack Q(p, q) to find all edges of the feasible solution.
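The scoring recurrence and the final maximization can be sketched together. Assumptions for this example: edges and their scores are given as a dictionary keyed by node pairs written ("X", i) and ("Y", j), all scores are positive (so Q(i, j) > 0 doubles as a reachability test, as in the slides), and the edge scores below are invented. Backtracking is left out.

```python
def best_score(k, score):
    """score: {(u, v): s} over the existing edges. Returns (best total, Q)."""
    X = lambda i: ("X", i)
    Y = lambda j: ("Y", j)
    Q = [[0.0] * (k + 1) for _ in range(k + 1)]
    for j in range(1, k + 1):
        if (Y(j), Y(0)) in score:                  # R is the single edge yj -> y0
            Q[0][j] = max(Q[0][j], score[Y(j), Y(0)])
        if (X(0), X(j)) in score:                  # L is the single edge x0 -> xj
            Q[j][0] = max(Q[j][0], score[X(0), X(j)])
        for i in range(j):
            for p in range(j):
                if (Y(j), Y(p)) in score and Q[i][p] > 0:      # (a) extend R
                    Q[i][j] = max(Q[i][j], Q[i][p] + score[Y(j), Y(p)])
                if (X(p), X(j)) in score and Q[p][i] > 0:      # (b) extend L
                    Q[j][i] = max(Q[j][i], Q[p][i] + score[X(p), X(j)])
    best = 0.0
    for i in range(k + 1):                         # join L and R with xi -> yj
        for j in range(k + 1):
            e = (X(i), Y(j))
            if Q[i][j] > 0 and e in score:
                best = max(best, Q[i][j] + score[e])
    return best, Q

# Invented scores for a k = 2 example; the two-residue edge scores lower.
SCORE = {
    (("X", 0), ("X", 1)): 1.0, (("X", 1), ("Y", 2)): 1.0,
    (("Y", 2), ("Y", 0)): 1.0, (("X", 0), ("Y", 2)): 0.5,
    (("X", 2), ("Y", 1)): 1.0,
}
best, Q = best_score(2, SCORE)
```

Here the path x0 → x1 → y2 → y0 scores 3.0 and beats x0 → y2 → y0 at 1.5. The path through x2 and y1 never gets a positive Q entry because neither node is connected to x0 or y0.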

Complexity
Given G, Q can be computed in O(|V||E|) time.
Given G, a feasible solution of G can be found in O(|V||E|) time and O(|V|^2) space.

Algorithm for one-amino acid modification
There are a few hundred known modifications. In most experiments, a protein is digested into multiple peptides, and most peptides have at most one modified amino acid.

The one-amino acid modification problem is equivalent to the problem which, given G = (V, E), asks for two nodes vi and vj such that E(vi, vj) = 0 but adding the edge (vi, vj) to G creates a feasible solution that contains this edge.

Compute an NC-spectrum graph G.
Construct a two-dimensional matrix N (assuming an ideal spectrum).
Use N and M to examine whether a chosen edge is a possible solution.

Matrix N
N(i, j) = 1 if and only if there is a path from xi to yj that contains exactly one of xp and yp for every p ∈ [i, k] ∪ [j, k]. Let N(i, j) = 0, otherwise.

[Figure: the matrix N for the graph X0, X1, X2, Y2, Y1, Y0.]

Dynamic programming
Initialize N(i, j) = 0 for all i and j;
Compute N(k, k – 1) and N(k – 1, k);
For j = k - 2 to 0
  For i = k to j + 2
    (a) if N(i, j + 1) = 1 and E(xj, xi) = 1, then N(j, j + 1) = 1;
    (b) if N(i, j + 1) = 1 and E(yj+1, yj) = 1, then N(i, j) = 1;
    (c) if N(j + 1, i) = 1 and E(xj, xj+1) = 1, then N(j, i) = 1;
    (d) if N(j + 1, i) = 1 and E(yi, yj) = 1, then N(j + 1, j) = 1.
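The recurrence for N mirrors the one for M, working inward-out from the middle of the graph. The sketch below is a direct transcription under stated assumptions: nodes are written ("X", i) and ("Y", j), the initial step is read as N(k, k-1) = E(xk, yk-1) and N(k-1, k) = E(xk-1, yk) (the slides only say "compute"), and the example edge set is invented.

```python
def compute_N(k, edge):
    """Transcription of the N recurrence; edge(u, v) says whether u -> v exists."""
    X = lambda i: ("X", i)
    Y = lambda j: ("Y", j)
    N = [[0] * (k + 1) for _ in range(k + 1)]
    if k >= 1:                                     # initial step (assumed reading)
        N[k][k - 1] = int(edge(X(k), Y(k - 1)))
        N[k - 1][k] = int(edge(X(k - 1), Y(k)))
    for j in range(k - 2, -1, -1):                 # j = k-2 down to 0
        for i in range(k, j + 1, -1):              # i = k down to j+2
            if N[i][j + 1] and edge(X(j), X(i)):
                N[j][j + 1] = 1                    # (a)
            if N[i][j + 1] and edge(Y(j + 1), Y(j)):
                N[i][j] = 1                        # (b)
            if N[j + 1][i] and edge(X(j), X(j + 1)):
                N[j][i] = 1                        # (c)
            if N[j + 1][i] and edge(Y(i), Y(j)):
                N[j + 1][j] = 1                    # (d)
    return N

# Invented example for k = 2 (nodes X0 X1 X2 Y2 Y1 Y0).
EDGES = {
    (("X", 0), ("X", 1)), (("X", 1), ("Y", 2)), (("Y", 2), ("Y", 0)),
    (("X", 0), ("Y", 2)), (("X", 2), ("Y", 1)),
}
N = compute_N(2, lambda u, v: (u, v) in EDGES)
```

In this example N(0, 2) = 1 corresponds to the path x0 → x1 → y2 and N(1, 0) = 1 to the path x1 → y2 → y0.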

Illustration
[Figure: one step of the recurrence on the nodes xj, xi, yj+1, yj; and the matrix N with rows and columns 3 to 1.]

How to find feasible solution?
Without loss of generality, suppose that the modification is between two prefix nodes xi and xj with 0 ≤ i < j ≤ k and E(xi, xj) = 0. There are five cases:
i+1 < j.
1 < i+1 = j < k.
0 = i = j-1.
i+1 = j = k.
The modification is between xk and yj.

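i+1 < j
M(i, i+1) = 1 and N(j, i+1) = 1. Time: O(|V|^2).
[Figure: the nodes x0, xi, xj, yi+1, y0.]

1 < i+1 = j < k
M(i, p) = 1, N(j, q) = 1, and E(yq, yp) = 1. Time: O(|V||E|).
[Figure: the nodes x0, xi, xj, yq, yj, yi, yp, y0.]

0 = i = j-1
E(yq, y0) = 1 and N(1, q) = 1. Time: O(|V|).
[Figure: the nodes x0 (= xi), x1 (= xj), yq, y1, y0.]

i+1 = j = k
E(xk, yp) = 1 and M(k-1, p) = 1. Time: O(|V|).
[Figure: the nodes x0, xk-1 (= xi), xk (= xj), yk, yk-1, yp, y0.]

The modification is between xk and yj
E(xk, yj) = 0 and M(k, j) = 1. Time: O(|V|).
[Figure: the nodes x0, xk, yj, y0.]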

Complexity
Given G, N can be computed in O(|V|^2) time.
Given G, all possible amino acid modifications can be found in O(|V||E|) time and O(|V|^2) space.

Experimental results
