1 1 An Algorithmic Approach to Peptide Sequencing via Tandem Mass Spectrometry Ming-Yang Kao Department of Computer Science Northwestern University Evanston, Illinois U. S. A.

2 2 Collaborators of This Project University of Southern California: Ting Chen. Harvard Medical School: George M. Church, John Rush, Matthew Tepel.

3 3 Perspectives A key goal of bioinformatics: To study biological systems based on global knowledge of genomes, transcriptomes, and proteomes. Genome: entire sets of materials in the chromosomes. Transcriptome: entire sets of gene transcripts. Proteome: entire sets of proteins. Genome (DNA) → Transcriptome (RNA) → Proteome (Protein)

4 4 Perspectives A key goal of bioinformatics: To study biological systems based on global knowledge of genomes, transcriptomes, and proteomes. Genome: entire sets of materials in the chromosomes. Transcriptome: entire sets of gene transcripts. Proteome: entire sets of proteins. Genome (DNA) → Transcriptome (RNA) → Proteome (Protein) [this talk's focus]

5 5 Proteomics
Proteome: all proteins encoded within a genome
– half a million distinct proteins (temporal, spatial, modifications)
– ~30,000 human genes
– mRNA and protein expressions may not correlate
Proteomics: study of protein expression by biological systems
– relative abundance and stability; post-translational modifications
– fluctuations as a response to environment and altered cellular needs
– correlations between protein expression and disease state
– protein-protein interactions, protein complexes
Technologies:
– 2D gel electrophoresis
– mass spectrometry (this talk's focus)
– yeast two-hybrid system
– protein chips

6 6 A Key Step of Proteomics How to sequence proteins? How to sequence protein peptides? (this talk’s focus)

7 7 Outline of This Talk 1. Problem Formulation (Biology) 2. Problem Formulation (Computer Science) 3. Basic Computational Techniques 4. Improved Computational Complexity and More Robust Algorithms 5. Conclusions

8 8 Outline of This Talk (1) 1. Problem Formulation (Biology) 2. Problem Formulation (Computer Science) 3. Basic Computational Techniques 4. Improved Computational Complexity and More Robust Algorithms 5. Conclusions

9 9 Protein Identification: HPLC-MS-MS [Figure: Proteins → Peptides → One Peptide → B-ions / Y-ions → Tandem Mass Spectrum; horizontal axes: Mass/Charge.]

10 10 Protein Identification: HPLC-MS-MS [Figure: Proteins → Peptides → One Peptide → B-ions / Y-ions → Tandem Mass Spectrum; horizontal axes: Mass/Charge.]

11 11 Peptide Fragmentation and Ionization [Figure: a peptide fragmented into a B-ion and a Y-ion.] Complementary: Mass(B-ion) + Mass(Y-ion) = Mass(peptide) + 4H + O

12 12 B-ions and Y-ions Fragmentation

13 13 Tandem Mass Spectrum [Figure: abundance (%) versus mass/charge, with peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225.]

14 14 Raw Tandem Mass Spectrum

15 15 Prediction from Raw Tandem Mass Spectrum

16 16 Protein Database Search Find the peptide sequences in a protein database that optimally fit the spectrum. It does not work if the target peptide sequence is not in the database. It does not work if there is an unknown modification at some amino acid. It is very slow because it must search the entire database. E.g., SEQUEST, Yates, Univ. of Washington.

17 17 De Novo Peptide Sequencing Problem Input: (1) the mass W of an unknown target peptide, and (2) a set S of the masses of some or all b-ions and y-ions of the peptide. Output: a peptide P such that (1) mass(P) = W and (2) S is a subset of all the ion masses of P. Example: [Figure: abundance (%) versus mass/charge, with peaks at 274.112 and 361.121; peptide mass 429.212 Daltons.] P = SWR, Mass(P) = 429.212, Ions(P) = { 88.033, 175.113, 274.112, 361.121, 430.213, 448.225 }

18 18 Tandem Mass Spectrum [Figure: abundance (%) versus mass/charge, with peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225; peptide mass 429.212 Daltons.]

19 19 Amino Acid Mass Table

20 20 Feature 1 All B-ions form a forward mass ladder. [Figure: spectrum with peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225; peptide mass 429.212 Daltons; ladder S, W, R through b_1, b_2, b_3; offset label 1.]

21 21 Feature 2 All Y-ions form a reverse mass ladder. [Figure: the same spectrum; forward ladder S, W, R and reverse ladder R, W, S through y_1, y_2, y_3; offset label 19.]

22 22 Basic Difficulty #1 It is unknown whether an ion is a B-ion or a Y-ion. [Figure: spectrum with peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225; peptide mass 429.212 Daltons.]

23 23 Basic Difficulty #2 There are missing ions. [Figure: spectrum with only two peaks, 274.112 and 361.121, labeled Ion 1 and Ion 2; peptide mass 429.212 Daltons.]

24 24 Feature 3 (to our Rescue) Complementary Ion Pairs: b_1 / y_2 and b_2 / y_1. [Figure: spectrum with peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225; peptide mass 429.212 Daltons; ladders S, W, R (b_1, b_2, b_3) and R, W, S (y_1, y_2, y_3).]

25 25 Outline of This Talk (2) 1. Problem Formulation (Biology) 2. Problem Formulation (Computer Science) 3. Basic Computational Techniques 4. Improved Computational Complexity and More Robust Algorithms 5. Conclusions

26 26 Formulating the Computational Problem
1. T = an alphabet of 20 characters a_1, a_2, …, a_20.
2. Two special characters: alpha and beta.
3. The mass of alpha = 1, the mass of beta = 19, and the mass of a_i is m_i.
4. A peptide sequence is x_1, x_2, x_3, …, x_{n-1}, x_n, where each x_i is from T.
5. A b-ion is x_0, x_1, x_2, …, x_i for some 1 <= i <= n, where x_0 = alpha.
6. A y-ion is x_i, …, x_{n-2}, x_{n-1}, x_n, x_{n+1} for some 1 <= i <= n, where x_{n+1} = beta.
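
The formulation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the residue masses for S, W, and R are standard monoisotopic values rounded to three decimals (they are not listed on the slide), alpha = 1 and beta = 19 are taken from the slide, and the helper names `ion_masses` and `is_solution` as well as the 0.1 Dalton tolerance are hypothetical. Because alpha and beta are rounded, the computed ions may differ slightly from the values shown in the spectrum slides.

```python
ALPHA = 1.0   # mass of the special character alpha (from the slide)
BETA = 19.0   # mass of the special character beta (from the slide)

# Assumed monoisotopic residue masses in Daltons; only three residues are
# included to keep the sketch short.
RESIDUE_MASS = {"S": 87.032, "W": 186.079, "R": 156.101}


def ion_masses(peptide):
    """Return (peptide mass, b-ion masses, y-ion masses) for a residue string."""
    masses = [RESIDUE_MASS[a] for a in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses:                 # b-ion: alpha followed by a prefix
        prefix += m
        b_ions.append(ALPHA + prefix)
    suffix = 0.0
    for m in reversed(masses):       # y-ion: a suffix followed by beta
        suffix += m
        y_ions.append(suffix + BETA)
    return sum(masses), b_ions, y_ions


def is_solution(peptide, W, S, tol=0.1):
    """Check mass(P) = W and S is a subset of Ions(P), up to a tolerance."""
    total, b_ions, y_ions = ion_masses(peptide)
    ions = b_ions + y_ions
    return abs(total - W) <= tol and all(
        any(abs(s - m) <= tol for m in ions) for s in S
    )
```

For example, `is_solution("SWR", 429.212, [274.112, 361.121])` accepts the peptide from the talk's running example, while the reordered string `"WSR"` has the same total mass but does not explain the second ion. Each b-ion and its complementary y-ion sum to the peptide mass plus alpha + beta = 20, which matches the 4H + O term in the complementarity rule.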

27 27 De Novo Peptide Sequencing Problem Input: (1) the mass W of an unknown target peptide, and (2) a set S of the masses of some or all b-ions and y-ions of the peptide. Output: a peptide P such that (1) mass(P) = W and (2) S is a subset of all the ion masses of P. Example: [Figure: abundance (%) versus mass/charge, with peaks at 274.112 and 361.121; peptide mass 429.212 Daltons.] P = SWR, Mass(P) = 429.212, Ions(P) = { 88.033, 175.113, 274.112, 361.121, 430.213, 448.225 }

28 28 Amino Acid Mass Table

29 29 Outline of This Talk (3) 1. Problem Formulation (Biology) 2. Problem Formulation (Computer Science) 3. Basic Computational Techniques 4. Improved Computational Complexity and More Robust Algorithms 5. Conclusions

30 30 Basic Computing Scheme Peptide mass W and tandem mass spectrum S → NC-spectrum graph → Find feasible paths to order the masses in S to identify all the b-ions and y-ions consistent with S → Convert feasible paths into legal peptide sequences.

31 31 NC-Spectrum Graph: Nodes (1) [Figure: node N_0 at mass 0 and node C_0 at mass 429.22, the mass of this peptide.]

32 32 NC-Spectrum Graph: Nodes (2) [Figure: N_0 at 0 and C_0 at 429.22, the mass of this peptide; Ion #1 (274.11) adds the node pair C_1 at 174.11 and N_1 at 273.11.] mass(N_1) + mass(C_1) = mass(P) + 18. Assumption 1: If Ion 1 is a y-ion, then C_1 is a b-ion node. Assumption 2: If Ion 1 is a b-ion, then N_1 is a b-ion node.

33 33 NC-Spectrum Graph: Nodes (3) [Figure: N_0 at 0 and C_0 at 429.22; C_1 at 174.11 and N_1 at 273.11; Ion #2 (88.10) adds the node pair C_2, N_2 with masses 87.10 and 360.12.] mass(N_2) + mass(C_2) = mass(P) + 18.

34 34 NC-Spectrum Graph: Edges (1) [Figure: nodes N_0 (0), C_0 (429.22), and the pairs C_1, N_1 (174.11, 273.11) and C_2, N_2 (87.10, 360.12); one edge labeled S.] Mass(S) = 87.08.

35 35 NC-Spectrum Graph: Edges (2) [Figure: the same nodes; edges labeled S and W.] Mass(S) = 87.08. Mass(W) = 186.21.

36 36 NC-Spectrum Graph: Edges (3) [Figure: the same nodes; edges labeled S, W, and S+W.] Mass(S) = 87.08. Mass(W) = 186.21. Mass(S+W) = 273.29.

37 37 NC-Spectrum Graph: Edges (4) [Figure: the same nodes; edges labeled S, W, S+W, and R.] Mass(S) = 87.08. Mass(W) = 186.21. Mass(S+W) = 273.29. Mass(R) = 156.19.

38 38 NC-Spectrum Graph [Figure: the complete graph on nodes N_0 (0), C_0 (429.22), C_1, N_1 (174.11, 273.11), and C_2, N_2 (87.10, 360.12).]

39 39 NC-Spectrum Graph: Paths = Sequences [Figure: the same graph with a highlighted path whose edges are labeled S, W, R; the path passes through b-ion nodes.]

40 40 NC-Spectrum Graph: A Feasible Path (1) Definition: A feasible path is a path from N_0 to C_0 that goes through exactly one node for each pair (either N_j or C_j). [Figure: the NC-spectrum graph with a feasible path highlighted; edge labels S, W, R; the path passes through b-ion nodes.]

41 41 NC-Spectrum Graph: A Feasible Path (2) Definition: A feasible path is a path from N_0 to C_0 that goes through exactly one node for each pair (either N_j or C_j). [Figure: the NC-spectrum graph with another feasible path highlighted; edge labels S, S, GVV; the path passes through b-ion and y-ion nodes.]

42 42 NC-Spectrum Graph: Not A Feasible Path (1) Definition: A feasible path is a path from N_0 to C_0 that goes through exactly one node for each pair (either N_j or C_j). [Figure: the NC-spectrum graph with a highlighted path.] Not a feasible path: (1) it misses ion #2.

43 43 NC-Spectrum Graph: Not A Feasible Path (2) Definition: A feasible path is a path from N_0 to C_0 that goes through exactly one node for each pair (either N_j or C_j). [Figure: the NC-spectrum graph with a highlighted path.] Not a feasible path: (2) it repeats ion #1.

44 44 NC-Spectrum Graph: Not A Feasible Path (3) Definition: A feasible path is a path from N_0 to C_0 that goes through exactly one node for each pair (either N_j or C_j). [Figure: the NC-spectrum graph with a highlighted path.] Not a feasible path: (1) it misses ion #2; (2) it repeats ion #1.

45 45 Reformulating the De Novo Peptide Sequencing Problem Input: an NC-spectrum graph G. Output: a feasible path from N_0 to C_0.

46 46 Observations A longest path does not always go through exactly one of each pair of nodes. Finding a feasible path is an NP-hard problem if the spectrum graph is a general directed graph.

47 47 Basic Algorithm Input: a peptide mass W and a tandem mass spectrum S. Output: a feasible peptide sequence. Steps: 1. Compute the nodes of the NC-spectrum graph G. 2. Compute the edges of G. 3. Compute a feasible path P in G. 4. Convert P into a feasible sequence.

48 48 Basic Algorithm (1) Input: a peptide mass W and a tandem mass spectrum S. Output: a feasible peptide sequence. Steps: 1. Compute the nodes of the NC-spectrum graph G. 2. Compute the edges of G. 3. Compute a feasible path P in G. 4. Convert P into a feasible sequence.

49 49 Compute the Nodes of the NC-Spectrum Graph
Step 1. Compute the nodes and place them in the increasing order of masses. [Figure: N_0 (0), C_0 (429.22), and the pairs C_1, N_1 (174.11, 273.11) and C_2, N_2 (87.10, 360.12).]
Step 2. Rename the nodes from left to right as X_0, …, X_k, Y_k, …, Y_0. [Figure: X_0 (0), X_1 (87.10), X_2 (174.11), Y_2 (273.11), Y_1 (360.12), Y_0 (429.22).]
Observation: X_i and Y_i form a complementary pair of nodes N_i and C_i for ion i.
Running Time: O(k), where k = # of masses in the spectrum.
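
A minimal sketch of the node computation, assuming the offsets alpha = 1 and beta = 19 from the problem formulation. The function name `nc_nodes` is hypothetical, and the node formulas are inferred from the example numbers on the slides (an ion of mass m yields nodes m - 1 and W - m + 19, which sum to W + 18). The sketch sorts the nodes for simplicity, so it runs in O(k log k) rather than the O(k) stated on the slide.

```python
def nc_nodes(W, spectrum, alpha=1.0, beta=19.0):
    """Return (pairs, ordered) for peptide mass W and a list of ion masses.

    pairs[j] holds the two candidate nodes of ion j: the prefix mass if the
    ion is a b-ion, and the prefix mass if the ion is a y-ion.
    ordered is the list of all node masses in increasing order, i.e. the
    order in which the slide renames them X_0, ..., X_k, Y_k, ..., Y_0.
    """
    pairs = []
    for m in spectrum:
        n_node = m - alpha           # ion read as a b-ion
        c_node = W - (m - beta)      # ion read as a y-ion
        pairs.append((n_node, c_node))
    ordered = sorted([0.0, W] + [mass for pair in pairs for mass in pair])
    return pairs, ordered
```

With W = 429.22 and the two ions 274.11 and 88.10 from the slides, the ordered node masses come out as approximately 0, 87.10, 174.11, 273.11, 360.12, and 429.22.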

50 50 Basic Algorithm (2) Input: a peptide mass W and a tandem mass spectrum S. Output: a feasible peptide sequence. Steps: 1. Compute the nodes of the NC-spectrum graph G. 2. Compute the edges of G. [slide annotation: inverse of each other] 3. Compute a feasible path P in G. 4. Convert P into a feasible sequence.

51 51 Compute the Edges of the NC-Spectrum Graph [Figure: nodes X_0 (0), X_1 (87.10), X_2 (174.11), Y_2 (273.11), Y_1 (360.12), Y_0 (429.22).]
Basic Question: Given a mass u, is there a protein sequence with that mass?
Solution: dynamic programming via a Boolean array E( ).
1. precision = 0.01.
2. Boolean array length L = peptide mass W / precision.
3. Boolean array E(u/0.01) = 1 if u is the mass of a peptide; otherwise 0.
4. Dynamic programming: E(j) = 1 if and only if E(j – m_i) = 1 for some amino acid mass m_i, with E(0) = 1 as the base case.
5. Running Time: (1) Computing E() takes O(L) time, or O(L/log L) via Four-Russians preprocessing. (2) Computing the edges takes O(k^2) time.
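
The Boolean array E( ) can be sketched as follows. This is an illustration rather than the authors' implementation: the function name `peptide_mass_table` is hypothetical, masses are assumed to be pre-scaled to integer multiples of the precision (so 87.03 Daltons becomes 8703), and the plain loop below does not include the Four-Russians speed-up.

```python
def peptide_mass_table(limit, residue_masses):
    """Return a list E where E[j] is True iff j is a sum of residue masses.

    limit          -- largest scaled mass of interest (W / precision)
    residue_masses -- amino acid masses as positive integers in the same scale
    """
    E = [False] * (limit + 1)
    E[0] = True                      # the empty sequence has mass 0
    for j in range(1, limit + 1):
        # j is reachable if removing one residue leaves a reachable mass
        E[j] = any(m <= j and E[j - m] for m in residue_masses)
    return E
```

With the table in hand, a candidate edge between two nodes whose scaled mass difference is d can be accepted when `E[d]` is true. Real spectra are noisy, so a practical version would probably test a small window of indices around d instead of a single entry.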

52 52 Basic Algorithm (3) Input: a peptide mass W and a tandem mass spectrum S. Output: a feasible peptide sequence. Steps: 1. Compute the nodes of the NC-spectrum graph G. 2. Compute the edges of G. 3. Compute a feasible path P in G. 4. Convert P into a feasible sequence.

53 53 Compute a Feasible Path (1) [Figure: the graph on X_0, X_1, Y_1, Y_0 (masses 0, 87.10, 360.12, 429.22) grows into the graph on X_0, X_1, X_2, Y_2, Y_1, Y_0 (masses 0, 87.10, 174.11, 273.11, 360.12, 429.22).]
Recursion: Use the feasible paths of X_0, …, X_i, Y_j, …, Y_0 to compute the feasible paths of X_0, …, X_i, X_{i+1}, Y_{j+1}, Y_j, …, Y_0.
Dynamic Programming: M(i,j) = 1 if there exist a path PL from X_0 to X_i and a path PR from Y_j to Y_0 such that PL and PR together contain exactly one of X_q and Y_q for each q = 1, …, max{i,j}.
Observation: With k the number of ions, there is a feasible path if and only if (1) for some i, there is an edge e from X_i to Y_k and M(i,k) = 1, or (2) for some j, there is an edge e from X_k to Y_j and M(k,j) = 1.

54 54 Compute a Feasible Path (2)
Dynamic Programming: M(i,j) = 1 if there exist a path PL from X_0 to X_i and a path PR from Y_j to Y_0 such that PL and PR together contain exactly one of X_q and Y_q for each q = 1, …, max{i,j}.
Observation: With k the number of ions, there is a feasible path if and only if (1) for some i, there is an edge e from X_i to Y_k and M(i,k) = 1, or (2) for some j, there is an edge e from X_k to Y_j and M(k,j) = 1.
[Figure: case (1), PL from X_0 to X_i, edge e from X_i to Y_k, PR from Y_k to Y_0; case (2), PL from X_0 to X_k, edge e from X_k to Y_j, PR from Y_j to Y_0.]

55 55 Compute a Feasible Path (3)
Dynamic Programming: M(i,j) = 1 if there exist a path PL from X_0 to X_i and a path PR from Y_j to Y_0 such that PL and PR together contain exactly one of X_q and Y_q for each q = 1, …, max{i,j}.
Base Case: M(0,0), M(0,1), M(1,0).
Recurrence:
(1) If M(i,j-1) = 1 and edge(X_i, X_j) = 1, then M(j,j-1) = 1.
(2) If M(i,j-1) = 1 and edge(Y_j, Y_{j-1}) = 1, then M(i,j) = 1.
(3) If M(j-1,i) = 1 and edge(X_{j-1}, X_j) = 1, then M(j,i) = 1.
(4) If M(j-1,i) = 1 and edge(Y_j, Y_i) = 1, then M(j-1,j) = 1.
Idea: Extend PL and PR by one edge at a time.

56 56 Compute a Feasible Path (4)
Dynamic Programming: M(i,j) = 1 if there exist a path PL from X_0 to X_i and a path PR from Y_j to Y_0 such that PL and PR together contain exactly one of X_q and Y_q for each q = 1, …, max{i,j}.
Recurrence:
(1) If M(i,j-1) = 1 and edge(X_i, X_j) = 1, then M(j,j-1) = 1.
(2) If M(i,j-1) = 1 and edge(Y_j, Y_{j-1}) = 1, then M(i,j) = 1.
(3) If M(j-1,i) = 1 and edge(X_{j-1}, X_j) = 1, then M(j,i) = 1.
(4) If M(j-1,i) = 1 and edge(Y_j, Y_i) = 1, then M(j-1,j) = 1.
[Figure: two diagrams with PL from X_0 to X_i and PR from Y_{j-1} to Y_0, each showing a new edge e that involves X_j or Y_j.]

57 57 Compute a Feasible Path (5)
Dynamic Programming: M(i,j) = 1 if there exist a path PL from X_0 to X_i and a path PR from Y_j to Y_0 such that PL and PR together contain exactly one of X_q and Y_q for each q = 1, …, max{i,j}.
Recurrence:
(1) If M(i,j-1) = 1 and edge(X_i, X_j) = 1, then M(j,j-1) = 1.
(2) If M(i,j-1) = 1 and edge(Y_j, Y_{j-1}) = 1, then M(i,j) = 1.
(3) If M(j-1,i) = 1 and edge(X_{j-1}, X_j) = 1, then M(j,i) = 1.
(4) If M(j-1,i) = 1 and edge(Y_j, Y_i) = 1, then M(j-1,j) = 1.
Computational Complexity: O(k^2).
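
The recurrence on these slides can be sketched as a short O(k^2) routine. Several details are assumptions made for illustration: the function name `has_feasible_path`, the representation of nodes as tuples such as `("X", 2)`, and the `edge(u, v)` callback that reports whether the graph has an edge from u to v. The sketch also reads the base cases as M(0,0) = 1, M(1,0) = edge(X_0, X_1), and M(0,1) = edge(Y_1, Y_0), and it only decides whether a feasible path exists without reconstructing it.

```python
def has_feasible_path(k, edge):
    """Decide whether the NC-spectrum graph with k ion pairs has a feasible path.

    edge(u, v) must return True iff there is an edge from node u to node v,
    where nodes are ("X", i) or ("Y", j) with 0 <= i, j <= k.
    """
    X = lambda i: ("X", i)
    Y = lambda j: ("Y", j)
    if k == 0:
        return bool(edge(X(0), Y(0)))

    M = [[False] * (k + 1) for _ in range(k + 1)]
    M[0][0] = True
    M[1][0] = bool(edge(X(0), X(1)))   # PL takes X_1
    M[0][1] = bool(edge(Y(1), Y(0)))   # PR takes Y_1

    for j in range(2, k + 1):
        for i in range(j - 1):         # i = 0, ..., j-2
            if M[i][j - 1]:
                if edge(X(i), X(j)):           # rule (1): extend PL to X_j
                    M[j][j - 1] = True
                if edge(Y(j), Y(j - 1)):       # rule (2): extend PR to Y_j
                    M[i][j] = True
            if M[j - 1][i]:
                if edge(X(j - 1), X(j)):       # rule (3): extend PL to X_j
                    M[j][i] = True
                if edge(Y(j), Y(i)):           # rule (4): extend PR to Y_j
                    M[j - 1][j] = True

    # Join PL and PR with one final edge once all k pairs are covered.
    return any(
        M[i][j] and edge(X(i), Y(j))
        for i in range(k + 1)
        for j in range(k + 1)
        if max(i, j) == k
    )
```

On the running example with k = 2, supplying the edges X_0→X_1, X_1→X_2, X_1→Y_2, X_0→Y_2, Y_2→Y_0, and X_2→Y_0 yields a feasible path, whereas keeping only X_0→Y_2 and Y_2→Y_0 does not, because that path skips the pair X_1, Y_1.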

58 58 Algorithmic Result #1: Finding a Feasible Path Input: an NC-spectrum graph G=(V,E). Output: a feasible path in G. Computational Complexity: O(|V|^2) time & O(|V|^2) space.

59 59 Outline of This Talk (4) 1. Problem Formulation (Biology) 2. Problem Formulation (Computer Science) 3. Basic Computational Techniques 4. Improved Computational Complexity and More Robust Algorithms 5. Conclusions

60 60 Algorithmic Result #2: Finding a Feasible Path (Improved) Input: an NC-spectrum graph G=(V,E). Output: a feasible path in G. Computational Complexity: O(|V|+|E|) time. Idea: Speed up via pre-processing.

61 61 Amino Acid Modifications A modification is an amino acid with slightly different atoms (and thus a different mass) from the typical molecule. Importance of modifications: Amino acid modifications are related to functions. For example, a protein is active when phosphorylated and inactive when de-phosphorylated.

62 62 Modification in the Tandem Mass Spectrum [Figure: abundance (%) versus mass/charge; ladder labels S, W+d, R and R, S.]

63 63 Spectrum Graph: As Before

64 64 Spectrum Graph: A Modified Feasible Path Idea: One mass change leads to one missing edge.

65 65 Algorithmic Result #3: Finding One Modification A modification is an amino acid with slightly different atoms (and thus a different mass) from the typical molecule. Theorem: Finding the position of the modification takes O(|V|+|E|) space and O(|V| |E|) time.

66 66 Algorithmic Result #4: Noisy Data Define a scoring function s(): – s(edge) = function(mass). – s(node) = function(abundance). Redefine the problem: Find the maximum-score path that goes through at most one node for each ion. Solution: dynamic programming in O(|V|+|E|) space and O(|V| |E|) time.

67 67 Outline of This Talk (5) 1. Problem Formulation (Biology) 2. Problem Formulation (Computer Science) 3. Basic Computational Techniques 4. Improved Computational Complexity and More Robust Algorithms 5. Conclusions

68 68 Further Difficulties for Tandem Mass Spectrum Interpretation Each ion has a couple of isotopic forms. Other ions (a or z) may appear. Some ions may lose a water or an ammonia. Multiple ion charges. Noise. Amino acid modifications.

69 69 Further Research Directions Efficient algorithms to deal with more modifications in conjunction with data noise. Efficient algorithms to combine de novo peptide sequencing with peptide database search. Efficient algorithms to assess statistical significance of feasible peptide sequences. Efficient algorithms to deal with multiple peptides. Practical implementation; speed-up via preprocessing. More …

70 70 Further Research Directions Looking for top-rate graduate students for this project (and other projects). Immediate and expedited admission for the coming fall semester.

