1 A Combinatorial Toolbox for Protein Sequence Design and Landscape Analysis in the Grand Canonical Model. Ming-Yang Kao, Department of Computer Science, Northwestern University, Evanston, Illinois, U.S.A.
2 Acknowledgments This talk is based on joint work with colleagues & students at Yale University. Computer Science: Jim Aspnes, Gauri Shah. Biology: Julia Hartling, Junhyong Kim.
3 Dual Purposes of This Talk 1.Discuss protein folding problems. 2.Emphasize the point that as bioinformatics grows, advanced algorithmic techniques will become useful and crucial.
4 Importance of Protein Folding The 3D structure significantly determines the function.
5 Two Complementary Problems for Protein Folding 1.Protein Folding Prediction --- Given a protein sequence, determine the 3D folding of the sequence. 2.Protein Sequence Design --- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.
6 Complexity for Protein Folding Problems Protein Folding Prediction --- Given a protein sequence, determine the 3D folding of the sequence. NP-hard under various models. Protein Sequence Design --- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure. Solvable in polynomial time under the Grand Canonical model.
7 History of Protein Sequence Design Protein Sequence Design --- Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure. 1.Sun et al, 1995: Heuristic search without optimality guarantee. 2.Hart, 1997: Open question on the computational tractability. 3.Kleinberg, 1999: Polynomial-time algorithms. 4.Aspnes, Hartling, Kao, Kim, Shah, 2001: Improved algorithms and generalized problems (this talk).
8 Outline of Technical Discussions The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions
9 Outline of Technical Discussions (1) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions
10 Grand Canonical Model (Sun et al, 1995) Each amino acid is classified as Hydrophobic (H) or Polar (P). Each amino acid sequence is then considered as a binary sequence of H and P. (For mathematical convenience, set H = 1 and P = 0.) Hydrophobic (H): A, C, F, I, L, M, V, W, Y. Polar (P): the other amino acids. Sun, Brem, Chan, Dill. Designing amino acid sequences to fold with good hydrophobic cores. Protein Engineering, 1995.
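The H/P encoding takes only a few lines of code. A minimal sketch (the function name to_hp and the example sequence are illustrative, not from the talk):

```python
# Hydrophobic residues per the slide above: A, C, F, I, L, M, V, W, Y.
HYDROPHOBIC = set("ACFILMVWY")

def to_hp(sequence: str) -> list[int]:
    """Map an amino acid sequence to its binary H/P pattern (H = 1, P = 0)."""
    return [1 if aa in HYDROPHOBIC else 0 for aa in sequence.upper()]

print(to_hp("MKTAYIAKQR"))  # [1, 0, 0, 1, 1, 1, 1, 0, 0, 0]
```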
11 Representation of a 3D structure (Sun et al, 1995): A 3D folding structure S of an n-amino-acid sequence could be specified by the coordinates of every atom in S; the Grand Canonical model instead uses only 1. the pairwise distances between the centers of the amino acid residues in S, and 2. the solvent-accessible areas of the amino acid residues in S.
12 Goal of Protein Sequence Design: (Sun et al, 1995) Input: A 3D structure S and a sequence length n. Output: a sequence X of n amino acids that, when folded into S, has the following properties: 1.The H-residues in X are as close to each other as possible. 2.The solvent-accessible areas of the H-residues of X are as small as possible.
13 Fitness of a Sequence (Sun et al, 1995)
14 Fitness of a Sequence (Sun et al, 1995): the fitness rewards closeness among the H-residues and small solvent-accessible surface area for the H-residues.
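The formula itself did not survive extraction from the slide. Up to details (such as a distance cutoff built into g and the exclusion of residues adjacent in the sequence), the Grand Canonical fitness of an H/P assignment x in {0,1}^n folded into S has the form used throughout the rest of the talk, with alpha < 0 and beta > 0:

```latex
\Phi(x) \;=\; \alpha \sum_{1 \le i < j \le n} x_i \, x_j \, g(d_{ij}) \;+\; \beta \sum_{i=1}^{n} x_i \, s_i ,
\qquad \alpha < 0,\ \beta > 0,
```

where d_ij is the distance between residues i and j, s_i is the solvent-accessible area of residue i, and g is a decreasing (sigmoidal) function of distance; minimizing Phi therefore rewards H-residues that are close together and have small exposed area.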
15 Outline of Technical Discussions (2) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions
16 Problem #1 Input: 1.the parameters alpha and beta, 2.a protein sequence Y, 3.Y’s 3D structure, 4.the sequence length n of Y. Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta. Applications of this problem: Design the best sequences for novel structures because we don’t really need Y.
17 Problem #2 Input: 1.the parameters alpha and beta, 2.a protein sequence Y, 3.Y’s 3D structure, 4.the sequence length n of Y. Output: a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta. Applications of this problem: tune the alpha and beta of the Grand Canonical model.
18 Basic Computational Scheme (1): 3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP).
19 Problem #1 Input: 1.the parameters alpha and beta, 2.a protein sequence Y, 3.Y’s 3D structure, 4.the sequence length n of Y. Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta. Applications of this problem: Design the best sequences for novel structures because we don’t really need Y. Computational Complexity: 1 network flow.
20 Problem #2 Input: 1.the parameters alpha and beta, 2.a protein sequence Y, 3.Y’s 3D structure, 4.the sequence length n of Y. Output: a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta. Applications of this problem: tune the alpha and beta of the Grand Canonical model. Computational Complexity: O(n) network flows.
21 Outline of Technical Discussions (3) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions
22 Empirical Study: Predictive Ability 1.Computed Fittest Sequence versus Native Sequences (% similarity) 2.Our % Similarity versus Kleinberg’s 3.% Similarity versus Protein Family Size.
23 % similarity --- computed versus native 1.% similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence. 2.The average percentage of the hydrophobic residues is 42% in the native sequences that were studied. 3.The best sequence picked without “domain knowledge” would have a 58% similarity on average.
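As a concrete illustration of the measure in item 1 (a sketch; the function name and the example strings are made up, not data from the study):

```python
def percent_similarity(computed_hp: str, native_hp: str) -> float:
    """Percentage of positions where the computed and native H/P patterns agree."""
    assert len(computed_hp) == len(native_hp)
    matches = sum(c == n for c, n in zip(computed_hp, native_hp))
    return 100.0 * matches / len(native_hp)

print(percent_similarity("HPPPHHPHP", "HPPHHHPPP"))  # ~77.8: 7 of 9 positions agree
```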
24 % similarity --- computed versus native (1)
25 % similarity --- computed versus native (2) Our results versus Kleinberg’s
26 % similarity --- computed versus native (3)
27 % similarity versus PFAM family size (1) 1.% similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence. 2.PFAM family size of a protein = # of proteins in the PFAM database (pfam.wustl.edu) that are related to the given protein; the relatedness is computed via HMM models. Family size is a measure of the success of a protein in Nature.
28 % similarity versus PFAM family size (2) 1.% similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence. 2.PFAM family size of a protein = # of proteins in the PFAM database that are related to the given protein. Intuition/Conjecture: (3A) the more diverse a protein family is, (3B) the more its 3D structures vary, (3C) the smaller the % similarity will be.
29 % similarity versus PFAM family size (3)
30 % similarity versus PFAM family size (4)
31 Outline of Technical Discussions (4) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions
32 Tool #1: Linear Programming. Goal: find a fittest sequence X of n amino acids, i.e., a binary sequence x that minimizes the fitness function. The objective is quadratic in x, which by itself leaves us clueless, so auxiliary variables y are introduced; the resulting program over x and y is: 1. Linear. 2. Totally unimodular. 3. Guaranteed to have an integer optimal solution. 4. Useful for proving theorems. 5. Still too inefficient in practice.
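The slide's own formulation was lost in extraction; one standard linearization of the quadratic objective (a reconstruction, reusing the fitness function Phi above) introduces a variable y_ij standing for the product x_i x_j. Because alpha < 0, the minimizer pushes each y_ij as high as it can, so upper bounds alone keep it honest:

```latex
\begin{aligned}
\text{minimize}\quad & \alpha \sum_{i<j} g(d_{ij})\, y_{ij} \;+\; \beta \sum_{i} s_i\, x_i \\
\text{subject to}\quad & y_{ij} \le x_i, \qquad y_{ij} \le x_j, \qquad 0 \le x_i \le 1, \qquad 0 \le y_{ij} \le 1 .
\end{aligned}
```

At an optimum y_ij = min(x_i, x_j), and by the total unimodularity claimed on the slide the vertices of the feasible region are integral, so an optimal basic solution is a valid H/P sequence.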
33 Tool #2: Network Flow (1). Analogy: a network of oil pipes. 1. source s (origin of oil) 2. sink t (destination of oil) 3. other nodes (midway stations) 4. arcs (pipes) 5. arc capacity (pipe capacity) 6. flow (amount of oil through a pipe). Goal: deliver the max amount of oil from source to sink. Computational goal: a max flow. Computational complexity: O(VE log(V^2/E)). [figure: an example network with source s, sink t, and arc capacities]
34 Tool #2: Network Flow (2). Example of a max flow (same legend and complexity as the previous slide). [figure: the example network with flow values shown in parentheses next to the arc capacities]
35 Tool #2: Network Flow (3). Max flow versus min cut: 1. a min cut is the bottleneck; 2. a min cut is a partition (S,T) of the nodes with s in S and t in T; 3. the total capacity of the arcs from S to T = the max flow value. [figure: the example network with a min cut highlighted]
36 Tool #2: Network Flow (4). Max flow versus min cut (continued); computational complexity: O(VE log(V^2/E)). [figure: the same example network and min cut]
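A minimal max-flow/min-cut sketch in Python with networkx (not the slide's exact pipe network, whose capacities only partially survived extraction; the node names and numbers here are illustrative):

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("s", "a", capacity=14)
G.add_edge("s", "b", capacity=10)
G.add_edge("a", "b", capacity=4)
G.add_edge("a", "t", capacity=8)
G.add_edge("b", "t", capacity=9)

flow_value, flow_dict = nx.maximum_flow(G, "s", "t")
cut_value, (S, T) = nx.minimum_cut(G, "s", "t")
print(flow_value, cut_value)  # max flow value == min cut capacity (17 here)
print(S, T)                   # the node partition realizing the bottleneck
```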
37 Basic Computational Scheme (1): 3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP).
38 Tool #2: 3D Network (1). Example structure with 9 residues. Solvent-accessible areas: s1 = 3, s2 = 18, s3 = 6, s4 = 9, s5 = 3, s6 = 9, s7 = 6, s8 = 24, s9 = 9. Contact pairs: g(d16) = 0.5, g(d25) = 0.75, g(d58) = 0.9, g(d49) = 0.75. Parameters: alpha = -8, beta = 1/3.
39 Tool #2: 3D Network (2). The design network for this example: a source, a sink, a node for each residue i with an arc of capacity beta*s_i to the sink (capacities 1, 6, 2, 3, 1, 3, 2, 8, 3), and a node for each contact pair --- here (1,6), (2,5), (5,8), (4,9) --- with an arc of capacity -alpha*g(d_ij) from the source (capacities 4, 6, 7.2, 6).
40 Tool #2: 3D Network (3). [figure: the same example design network as on the previous slide]
41 Tool #2: 3D Network (4). Theorem (Kleinberg, 1999): the amino acid positions that are on the source side of a min cut are the H's of a fittest sequence. [figure: the example design network with a min cut]
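A sketch of the construction visible on these slides, using the example's numbers. The slide shows the capacities -alpha*g(d_ij) (source side) and beta*s_i (sink side); the uncapacitated arcs from each contact-pair node to its two residues are an assumption following the standard construction. Requires networkx.

```python
import networkx as nx

alpha, beta = -8.0, 1.0 / 3.0
s = {1: 3, 2: 18, 3: 6, 4: 9, 5: 3, 6: 9, 7: 6, 8: 24, 9: 9}   # areas s_i
g = {(1, 6): 0.5, (2, 5): 0.75, (5, 8): 0.9, (4, 9): 0.75}     # g(d_ij)

G = nx.DiGraph()
for (i, j), gij in g.items():
    pair = f"pair{i}-{j}"
    G.add_edge("source", pair, capacity=-alpha * gij)  # 4, 6, 7.2, 6
    G.add_edge(pair, i)   # no capacity attribute = infinite capacity in networkx
    G.add_edge(pair, j)
for i, si in s.items():
    G.add_edge(i, "sink", capacity=beta * si)          # 1, 6, 2, 3, 1, 3, 2, 8, 3

cut_value, (S, T) = nx.minimum_cut(G, "source", "sink")
# Kleinberg's theorem: residues on the source side of a min cut are the H's.
print("".join("H" if i in S else "P" for i in sorted(s)))
```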
42 Basic Computational Scheme (1): 3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP).
43 Problem #1 Input: 1.the parameters alpha and beta, 2.a protein sequence Y, 3.Y’s 3D structure, 4.the sequence length n of Y. Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta. Applications of this problem: Design the best sequences for novel structures because we don’t really need Y.
44 Tool #3: Linear Size Representation of All Min Cuts (1). Step 1: Compute a max flow of G. Step 2: Compute the residual network G'. Step 3: Contract every strongly connected component into a super node; call the new graph G''. Definition: a node subset U of G'' is a closed set if for every node x in U, every descendant of x is also in U. Theorem (Picard and Queyranne, 1980): every closed set not including the sink forms (the source side of) a min cut, and vice versa. [figure: the example flow network over nodes s, v1--v7, t]
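The three steps translate almost directly into code. A sketch (assuming networkx; picard_queyranne_dag is an illustrative name, not the paper's code):

```python
import networkx as nx
from networkx.algorithms.flow import preflow_push

def picard_queyranne_dag(G: nx.DiGraph, s, t) -> nx.DiGraph:
    # Step 1: compute a max flow; preflow_push returns the residual network.
    R = preflow_push(G, s, t)
    # Step 2: keep only the arcs with positive residual capacity (this is G').
    residual = nx.DiGraph(
        (u, v) for u, v, d in R.edges(data=True) if d["capacity"] - d["flow"] > 0
    )
    residual.add_nodes_from(G)
    # Step 3: contract every strongly connected component into a super node (G'').
    # Closed sets of this DAG that avoid the sink's super node <-> min cuts.
    return nx.condensation(residual)
```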
45 Tool #3: Linear Size Representation of All Min Cuts (2). [figure: the residual network G' for the example max flow, over nodes s, v1--v7, t]
46 Tool #3: Linear Size Representation of All Min Cuts (3). [figure: the Picard-Queyranne representation --- the condensation of the residual network into super nodes]
47 Tool #3: Linear Size Representation of All Min Cuts (4). Applications of the Picard-Queyranne representation: 1. Obtain all fittest sequences. 2. Study the landscape of the fittest sequences. 3. Compute fittest sequences with additional optimization objectives. [figure: the Picard-Queyranne representation of the example]
48 Basic Computational Scheme (2): 3D structure → network → a max flow/min cut → Picard-Queyranne representation → the space of all fittest sequences (e.g., HPPPHHPHP).
49 Outline of Technical Discussions (5) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions
50 Problem #3 Input: a 3D structure. Output: all its fittest protein sequences. Computational Complexity: (A) A linear size representation can be computed with 1 network flow. (B) Each individual fittest protein sequence can be generated from this representation in O(n) time.
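For illustration only (this is a brute-force sketch, not the paper's algorithm): once the Picard-Queyranne DAG is available, its closed sets can be enumerated with one binary choice per super node in topological order; each closed set avoiding the sink's component is the source side of a min cut, hence one fittest H/P assignment.

```python
import networkx as nx

def closed_sets(dag: nx.DiGraph):
    """Yield every closed set of a DAG (closed = contains all descendants of its members)."""
    order = list(nx.topological_sort(dag))

    def extend(i, chosen):
        if i == len(order):
            yield frozenset(chosen)
            return
        v = order[i]
        # v may be left out only if none of its (already decided) predecessors is in;
        # otherwise closure under descendants would be violated.
        if not any(p in chosen for p in dag.predecessors(v)):
            yield from extend(i + 1, chosen)
        yield from extend(i + 1, chosen | {v})

    yield from extend(0, frozenset())
```

The number of closed sets, and hence of fittest sequences, can be exponential (counting them is #P-complete, Problem #6 below), but each set is produced quickly, in the spirit of part (B) above.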
51 Problem #4 Input: f 3D structures. Output: the set of all protein sequences that are the fittest simultaneously for all these 3D structures. Computational Complexity: f network flows.
52 Problem #5 Input: a protein sequence Y and its native 3D structure. Output: the set of all fittest protein sequences that are also the most (or least) similar to Y in terms of unweighted (or weighted) Hamming distances. Computational Complexity: 1 network flow.
53 Problem #6 Input: a 3D structure. Output: Count the number of protein sequences in the solution to each of Problems #3, #4, and #5. Computational Complexity: #P-complete.
54 Problem #7 Input: a 3D structure and a bound e. Output: Enumerate the protein sequences whose fitness function values are within an additive factor e of that of the fittest protein sequences. Computational Complexity: polynomial time to generate each desired protein sequence.
55 Problem #8 Input: a 3D structure. Output: the largest possible unweighted (or weighted) Hamming distance between any two fittest protein sequences. Computational Complexity: 1 network flow.
56 Problem #9 Input: a protein sequence Y and its native 3D structure. Output: the average unweighted (or weighted) Hamming distance between Y and the fittest protein sequences for the 3D structure. Computational Complexity: #P-complete.
57 Problem #10 Input: a protein sequence Y, its native 3D structure, and two unweighted Hamming distances d1 and d2. Output: a fittest protein sequence whose distance from Y is also between d1 and d2. Computational Complexity: NP-hard.
58 Problem #11 Input: a protein sequence Y, its native 3D structure, and an unweighted Hamming distance d. Output: the fittest among the protein sequences which are at distance d from Y. Computational Complexity: NP-hard. We have a polynomial-time approximation algorithm.
59 Problem #12 Input: a protein sequence Y and its native 3D structure Output: all the ratios between the scaling factors alpha and beta in the GC model such that the smallest possible unweighted (or weighted) Hamming distance between Y and any fittest protein sequence is minimized over all possible alpha and beta. Computational Complexity: O(n) network flows.
60 Problem #13 Input: a 3D structure. Output: Determine whether the fittest protein sequences are connected, i.e., whether they can mutate into each other through allowable mutations, such as point mutations, while the intermediate protein sequences all remain the fittest. Computational Complexity: 1 network flow.
61 Problem #14 Input: a 3D structure and two fittest protein sequences. Output: Determine whether the two sequences are connected. Computational Complexity: 1 network flow.
62 Problem #15 Input: a 3D structure. Output: the smallest set of allowable mutations with respect to which the fittest protein sequences (or two given fittest protein sequences) for the structure are connected. Computational Complexity: 1 network flow.
63 Outline of Technical Discussions (6) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions
64 Further Research for Protein Sequence Design 1.More sophisticated models (biology). 2.Algorithms and complexity for such models (computer science). 3.Wet lab validation (biology).
65 Further Algorithmic Research for Bioinformatics Current State of Bioinformatics: 1.Biology: mostly very simple heuristics 2.Algorithms: mostly very simple techniques Conjectures: 1.Biology: Nature is not so simple. Most of the biological information is very complicated. 2.Algorithms: Very sophisticated, novel, and fundamental techniques will be needed to unlock Nature’s secrets.