1 A Combinatorial Toolbox for Protein Sequence Design and Landscape Analysis in the Grand Canonical Model Ming-Yang Kao Department of Computer Science.

Presentation transcript:

1 A Combinatorial Toolbox for Protein Sequence Design and Landscape Analysis in the Grand Canonical Model Ming-Yang Kao, Department of Computer Science, Northwestern University, Evanston, Illinois, U.S.A.

2 Acknowledgments This talk is based on joint work with colleagues & students at Yale University. Computer Science: Jim Aspnes, Gauri Shah. Biology: Julia Hartling, Junhyong Kim.

3 Dual Purposes of This Talk 1. Discuss protein folding problems. 2. Emphasize that as bioinformatics grows, advanced algorithmic techniques will become useful and crucial.

4 Importance of Protein Folding The 3D structure significantly determines the function.

5 Two Complementary Problems for Protein Folding 1. Protein Folding Prediction: given a protein sequence, determine the 3D folding of the sequence. 2. Protein Sequence Design: given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.

6 Complexity for Protein Folding Problems Protein Folding Prediction: given a protein sequence, determine the 3D folding of the sequence. Complexity: NP-hard under various models. Protein Sequence Design: given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure. Complexity: solvable in polynomial time under the Grand Canonical model.

7 History of Protein Sequence Design Protein Sequence Design: given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure. 1. Sun et al, 1995: Heuristic search without optimality guarantee. 2. Hart, 1997: Open question on the computational tractability. 3. Kleinberg, 1999: Polynomial-time algorithms. 4. Aspnes, Hartling, Kao, Kim, Shah, 2001: Improved algorithms and generalized problems (this talk).

8 Outline of Technical Discussions The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions

9 Outline of Technical Discussions (1) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions

10 Grand Canonical Model (Sun et al, 1995) Each amino acid is classified as either Hydrophobic (H) or Polar (P). Each amino acid sequence is then treated as a binary sequence of H's and P's. (For mathematical convenience, set H = 1 and P = 0.) Hydrophobic (H): A, C, F, I, L, M, V, W, Y. Polar (P): the other amino acids. Reference: Sun, Brem, Chan, Dill. Designing amino acid sequences to fold with good hydrophobic cores. Protein Engineering, 1995.

11 Representation of a 3D structure (Sun et al, 1995) A 3D folding structure S of a sequence of n amino acids: the coordinates of each atom in S. The model uses two quantities obtained from S: 1. the pairwise distances between the centers of the amino acid residues in S. 2. the solvent-accessible areas of the amino acid residues in S.

12 Goal of Protein Sequence Design (Sun et al, 1995) Input: a 3D structure S and a sequence length n. Output: a sequence X of n amino acids that, when folded into S, has the following properties: 1. The H-residues in X are as close to each other as possible. 2. The solvent-accessible areas of the H-residues of X are as small as possible.

13 Fitness of a Sequence (Sun et al, 1995)

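14 Fitness of a Sequence (Sun et al, 1995) [Formula not preserved in the transcript.] Annotations on the formula: one term rewards closeness among H-residues; the other rewards small surface area.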
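
The fitness formula on slides 13 and 14 did not survive in this transcript. A plausible reconstruction, assuming the two annotated terms correspond to the arc labels alpha*g(d_ij) and beta*s_i that appear on slides 39 to 41, and using the H = 1, P = 0 encoding from slide 10, is sketched below. The index range of the pair sum is an assumption; the original model may restrict which residue pairs are counted.

```latex
\[
  \Phi(X) \;=\;
  \underbrace{\alpha \sum_{i<j} g(d_{ij})\, x_i x_j}_{\text{closeness among H-residues}}
  \;+\;
  \underbrace{\beta \sum_{i=1}^{n} s_i\, x_i}_{\text{small surface area}},
  \qquad x_i \in \{0,1\},\quad \alpha < 0,\quad \beta > 0 .
\]
```

Here x_i = 1 when residue i is H, d_ij is the distance between residues i and j, g is a nonnegative closeness weight, and s_i is the solvent-accessible area of residue i. A fittest sequence minimizes this value; the signs match the example values alpha = -8 and beta = 1/3 given on slide 38.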

15 Outline of Technical Discussions (2) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions

16 Problem #1 Input: 1. the parameters alpha and beta, 2. a protein sequence Y, 3. Y’s 3D structure, 4. the sequence length n of Y. Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta. Applications of this problem: design the best sequences for novel structures, since Y itself is not really needed.

17 Problem #2 Input: 1. the parameters alpha and beta, 2. a protein sequence Y, 3. Y’s 3D structure, 4. the sequence length n of Y. Output: a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta. Applications of this problem: tune the alpha and beta of the Grand Canonical model.

18 Basic Computational Scheme (1) 3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP)

19 Problem #1 Input: 1. the parameters alpha and beta, 2. a protein sequence Y, 3. Y’s 3D structure, 4. the sequence length n of Y. Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta. Applications of this problem: design the best sequences for novel structures, since Y itself is not really needed. Computational Complexity: 1 network flow.

20 Problem #2 Input: 1. the parameters alpha and beta, 2. a protein sequence Y, 3. Y’s 3D structure, 4. the sequence length n of Y. Output: a fittest sequence X for the 3D structure that is the most similar to Y over all possible alpha and beta. Applications of this problem: tune the alpha and beta of the Grand Canonical model. Computational Complexity: O(n) network flows.

21 Outline of Technical Discussions (3) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions

22 Empirical Study: Predictive Ability 1. Computed Fittest Sequence versus Native Sequences (% similarity). 2. Our % Similarity versus Kleinberg’s. 3. % Similarity versus Protein Family Size.

23 % similarity --- computed versus native 1. % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence. 2. The average percentage of hydrophobic residues is 42% in the native sequences that were studied. 3. The best sequence picked without “domain knowledge” would have a 58% similarity on average: with 42% H's, predicting P at every position matches the remaining 58% of positions.
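
As a small illustration of the % similarity measure defined above, the following sketch compares two H/P strings position by position. The function name `hp_similarity` and the 50-residue native sequence are hypothetical, chosen only so that 42% of the residues are H.

```python
def hp_similarity(computed: str, native: str) -> float:
    """Percentage of positions at which two H/P strings agree."""
    if len(computed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(computed, native))
    return 100.0 * matches / len(native)


# Hypothetical native sequence: 21 H's out of 50 residues, i.e. 42% hydrophobic.
native = "H" * 21 + "P" * 29

# Baseline without domain knowledge: predict P everywhere.
all_polar = "P" * len(native)
baseline = hp_similarity(all_polar, native)
print(baseline)  # 58.0
```

Any computed fittest sequence is only informative to the extent that its similarity exceeds this kind of baseline.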

24 % similarity --- computed versus native (1)

25 % similarity --- computed versus native (2) Our results versus Kleinberg’s

26 % similarity --- computed versus native (3)

27 % similarity versus PFAM family size (1) 1. % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence. 2. PFAM family size of a protein = # of proteins in the PFAM database (pfam.wustl.edu) that are related to the given protein. The relatedness is computed via HMM models. The family size serves as a measure of the success of a protein in Nature.

28 % similarity versus PFAM family size (2) 1. % similarity = the percentage of the H/P’s in the computed fittest sequence that are identical to those in the native sequence. 2. PFAM family size of a protein = # of proteins in the PFAM database that are related to the given protein. 3. Intuition/Conjecture: (3A) the more diverse a protein family is, (3B) the more its 3D structures vary, (3C) the smaller the % similarity will be.

29 % similarity versus PFAM family size (3)

30 % similarity versus PFAM family size (4)

31 Outline of Technical Discussions (4) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions

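32 Tool #1: Linear Programming Goal: find a fittest sequence X of n amino acids (as stated: clueless!). Reformulation: find a binary sequence x that minimizes the fitness function [formula not preserved] (a quadratic objective). Linearization: find x and y that minimize a linear objective [formula not preserved]. Properties of the linear program: 1. Linear. 2. Totally unimodular. 3. Integer solution. 4. Useful for proving theorems. 5. Still too inefficient.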
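
The formulas on this slide were lost, so the following is only a sketch of the standard way to linearize such a quadratic 0/1 objective, assuming the fitness has the form alpha * sum of g(d_ij) x_i x_j plus beta * sum of s_i x_i with alpha < 0 and g nonnegative. The auxiliary variables y_ij are an assumption introduced here to stand for the products x_i x_j; the slide's exact formulation may differ.

```latex
\begin{align*}
  \text{minimize}\quad   & \alpha \sum_{i<j} g(d_{ij})\, y_{ij} \;+\; \beta \sum_{i=1}^{n} s_i\, x_i \\
  \text{subject to}\quad & y_{ij} \le x_i, \qquad y_{ij} \le x_j, \\
                         & 0 \le x_i \le 1, \qquad 0 \le y_{ij} \le 1 .
\end{align*}
```

Because alpha < 0, the objective pushes each y_ij up to min(x_i, x_j), which equals x_i x_j when x is binary. Every constraint row of the form y_ij - x_i <= 0 has one +1 and one -1 entry, which is the kind of structure that makes the constraint matrix totally unimodular and hence gives an integer optimal solution, in line with properties 2 and 3 on the slide.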

33 Tool #2: Network Flow (1) Analogy: a network of oil pipes. 1. source s (origin of oil) 2. sink t (destination of oil) 3. other nodes (midway stations) 4. arcs (pipes) 5. arc capacity (pipe capacity) 6. flow (amount of oil through a pipe). Goal: deliver the max amount of oil from source to sink. Computational goal: a max flow. Computational complexity: O(VE log(V^2/E)). [Figure: a network with source s and sink t.]

34 Tool #2: Network Flow (2) Example of max flow. 1. source (origin of oil) 2. sink (destination of oil) 3. other nodes (midway stations) 4. arcs (pipes) 5. arc capacity (pipe capacity) 6. flow (amount of oil through a pipe). Goal: deliver the max amount of oil from source to sink. Computational goal: a max flow. Computational complexity: O(VE log(V^2/E)). [Figure: example network from s to t; each arc is labeled capacity (flow).]

35 Tool #2: Network Flow (3) Max flow versus min cut. 1. A min cut is the bottleneck of the network. 2. It is a partition (S,T) of the nodes with s in S and t in T. 3. Total capacity of the arcs from S to T = max flow. [Figure: example network from s to t; each arc is labeled capacity (flow).]

36 Tool #2: Network Flow (4) Max flow versus min cut. 1. A min cut is the bottleneck of the network. 2. It is a partition (S,T) of the nodes with s in S and t in T. 3. Total capacity of the arcs from S to T = max flow. Computational complexity: O(VE log(V^2/E)). [Figure: example network from s to t; each arc is labeled capacity (flow).]
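
To make the max flow versus min cut relationship concrete, here is a minimal sketch that computes both on a toy network. It uses the simple Edmonds-Karp method (shortest augmenting paths) rather than the faster algorithm behind the O(VE log(V^2/E)) bound quoted on the slide, and the five-arc graph is hypothetical, since the slide's own example figure did not survive.

```python
from collections import deque


def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp on capacity = {u: {v: cap}}.

    Returns (max flow value, set S of nodes on the source side of a min cut).
    """
    # Residual capacities; every arc gets a reverse arc starting at 0.
    residual = {s: {}, t: {}}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path left: the reachable nodes form the source side.
            return flow, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[a][b] for a, b in path)
        for a, b in path:
            residual[a][b] -= push
            residual[b][a] += push
        flow += push


# Hypothetical toy network (not the one from the slides).
capacity = {"s": {"a": 4, "b": 2}, "a": {"t": 1, "b": 2}, "b": {"t": 5}}
flow, side = max_flow_min_cut(capacity, "s", "t")
cut_capacity = sum(c for u in side for v, c in capacity.get(u, {}).items() if v not in side)
print(flow, sorted(side), cut_capacity)  # 5 ['a', 's'] 5
```

The total capacity of the arcs leaving the source side equals the flow value, which is item 3 on the slide.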

37 Basic Computational Scheme (1) 3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP)

38 Tool #2: 3D → Network (1) s_1 = 3, s_2 = 18, s_3 = 6, s_4 = 9, s_5 = 3, s_6 = 9, s_7 = 6, s_8 = 24, s_9 = 9; g(d_16) = 0.5, g(d_25) = 0.75, g(d_58) = 0.9, g(d_49) = 0.75; alpha = -8, beta = 1/3. [Figure: a 3D structure with 9 residues.]

39 Tool #2: 3D → Network (2) s_1 = 3, s_2 = 18, s_3 = 6, s_4 = 9, s_5 = 3, s_6 = 9, s_7 = 6, s_8 = 24, s_9 = 9; g(d_16) = 0.5, g(d_25) = 0.75, g(d_58) = 0.9, g(d_49) = 0.75; alpha = -8, beta = 1/3. [Figure: network with pair nodes 1,6; 2,5; 5,8; 4,9; arcs are labeled alpha*g(d_ij) and beta*s_i.]

40 Tool #2: 3D → Network (3) s_1 = 3, s_2 = 18, s_3 = 6, s_4 = 9, s_5 = 3, s_6 = 9, s_7 = 6, s_8 = 24, s_9 = 9; g(d_16) = 0.5, g(d_25) = 0.75, g(d_58) = 0.9, g(d_49) = 0.75; alpha = -8, beta = 1/3. [Figure: network with pair nodes 1,6; 2,5; 5,8; 4,9; arcs are labeled alpha*g(d_ij) and beta*s_i.]

41 Tool #2: 3D → Network (4) [Figure: network with pair nodes 1,6; 2,5; 5,8; 4,9; arcs are labeled alpha*g(d_ij) and beta*s_i.] Theorem (Kleinberg, 1999): The amino acids that are with the source in a min cut are H’s.
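
The construction on slides 38 to 41 can be sketched in code as follows. This is not the authors' implementation: it assumes the usual layout in which the source feeds one node per residue pair with capacity |alpha|*g(d_ij), each pair node feeds its two residues with unbounded capacity, and each residue feeds the sink with capacity beta*s_i. It also uses only the four pair terms that survive in the transcript. The function names are hypothetical, and exact fractions are used so that ties between cuts are detected reliably.

```python
from collections import deque
from fractions import Fraction

# Values transcribed from slides 38-41; residues are numbered 1..9.
s_area = {1: 3, 2: 18, 3: 6, 4: 9, 5: 3, 6: 9, 7: 6, 8: 24, 9: 9}
g = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
     (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}
alpha, beta = Fraction(-8), Fraction(1, 3)


def build_network():
    """Residual capacities {u: {v: cap}}; every arc gets a zero reverse arc."""
    cap = {}

    def arc(u, v, c):
        cap.setdefault(u, {})[v] = c
        cap.setdefault(v, {}).setdefault(u, Fraction(0))

    big = sum(-alpha * w for w in g.values()) + 1   # finite stand-in for infinity
    for (i, j), w in g.items():
        arc("s", (i, j), -alpha * w)                # reward |alpha| * g(d_ij)
        arc((i, j), i, big)
        arc((i, j), j, big)
    for i, area in s_area.items():
        arc(i, "t", beta * area)                    # penalty beta * s_i
    return cap


def saturate(cap, s="s", t="t"):
    """Edmonds-Karp: augment along shortest paths; returns the max flow value."""
    total = Fraction(0)
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[a][b] for a, b in path)
        for a, b in path:
            cap[a][b] -= push
            cap[b][a] += push
        total += push


def hydrophobic_residues(cap, t="t"):
    """Residues that cannot reach t in the residual network (largest source side)."""
    reaches_t = {t}
    queue = deque([t])
    while queue:
        v = queue.popleft()
        for u in cap[v]:          # reverse arcs are stored, so this lists all neighbours
            if u not in reaches_t and cap[u][v] > 0:
                reaches_t.add(u)
                queue.append(u)
    return sorted(i for i in s_area if i not in reaches_t)


cap = build_network()
flow = saturate(cap)
h_set = hydrophobic_residues(cap)
sequence = "".join("H" if i in h_set else "P" for i in sorted(s_area))
print(flow, h_set, sequence)  # 116/5 [1, 4, 6, 9] HPPHPHPPH
```

On these transcribed numbers several min cuts tie, so the choice of source side matters: taking the largest source side marks residues 1, 4, 6 and 9 as H, while the smallest source side would give the all-P sequence with the same fitness. This kind of tie is what motivates a compact representation of all min cuts.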

42 Basic Computational Scheme (1) 3D structure → network → a min cut → a fittest sequence (e.g., HPPPHHPHP)

43 Problem #1 Input: 1. the parameters alpha and beta, 2. a protein sequence Y, 3. Y’s 3D structure, 4. the sequence length n of Y. Output: a fittest sequence X for the 3D structure with respect to the given alpha and beta. Applications of this problem: design the best sequences for novel structures, since Y itself is not really needed.

44 Tool #3: Linear Size Representation of All Min Cuts (1) Step 1: Compute a max flow of G. Step 2: Compute the residual network G’. Step 3: Contract every strongly connected component into a super node. Call the new graph G”. Def: A node subset U of G” is a closed set if for every node x in U, every descendant of x is also in U. Theorem (Picard and Queyranne, 1980): Every closed set that contains the source and does not include the sink forms a min cut, and vice versa. [Figure: example network on nodes s, v1 to v7, t; each arc is labeled capacity (flow).]

45 Tool #3: Linear Size Representation of All Min Cuts (2) Residual Network [Figure: residual network on nodes s, v1 to v7, t.]

46 Tool #3: Linear Size Representation of All Min Cuts (3) Picard-Queyranne Representation [Figure: contracted graph on nodes s, v1 to v7, t.]

47 Tool #3: Linear Size Representation of All Min Cuts (4) Picard-Queyranne Representation [Figure: contracted graph on nodes s, v1 to v7, t.] Applications: 1. Obtain all fittest sequences. 2. Study the landscape of the fittest sequences. 3. Compute fittest sequences with additional optimization objectives.
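
To show what "all fittest sequences" means on a concrete instance, the sketch below enumerates them by brute force for the 9-residue data of slide 38. This is deliberately not the Picard-Queyranne representation, which avoids trying all 2^n sequences; it is only a check that is feasible for tiny n. The fitness form (alpha times the pair sum plus beta times the area sum, over the four transcribed pairs) is an assumption based on the arc labels on slides 39 to 41.

```python
from fractions import Fraction
from itertools import combinations, product

s_area = [3, 18, 6, 9, 3, 9, 6, 24, 9]            # s_1..s_9 from slide 38
g = {(1, 6): Fraction(1, 2), (2, 5): Fraction(3, 4),
     (5, 8): Fraction(9, 10), (4, 9): Fraction(3, 4)}
alpha, beta = Fraction(-8), Fraction(1, 3)


def fitness(x):
    """x is a tuple of 0/1 values (H = 1, P = 0); x[k] describes residue k + 1."""
    pair_term = sum(w * x[i - 1] * x[j - 1] for (i, j), w in g.items())
    area_term = sum(a * xi for a, xi in zip(s_area, x))
    return alpha * pair_term + beta * area_term


# Score every binary sequence of length 9 and keep the minimizers.
scores = {x: fitness(x) for x in product((0, 1), repeat=len(s_area))}
best = min(scores.values())
fittest = sorted("".join("H" if b else "P" for b in x)
                 for x, value in scores.items() if value == best)

# Largest unweighted Hamming distance between two fittest sequences.
diameter = max((sum(a != b for a, b in zip(p, q))
                for p, q in combinations(fittest, 2)), default=0)
print(best, fittest, diameter)
```

Under these assumptions four sequences tie for the minimum, which gives a small example of the landscape that the applications listed on the slide refer to.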

48 Basic Computational Scheme (2) 3D structure → network → a max flow/min cut → Picard-Queyranne Representation → the space of all fittest sequences (e.g., HPPPHHPHP)

49 Outline of Technical Discussions (5) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions

50 Problem #3 Input: a 3D structure. Output: all its fittest protein sequences. Computational Complexity: (A) A linear size representation can be computed with 1 network flow. (B) Each individual fittest protein sequence can be generated from this representation in O(n) time.

51 Problem #4 Input: f 3D structures. Output: the set of all protein sequences that are the fittest simultaneously for all these 3D structures. Computational Complexity: f network flows.

52 Problem #5 Input: a protein sequence Y and its native 3D structure. Output: the set of all fittest protein sequences that are also the most (or least) similar to Y in terms of unweighted (or weighted) Hamming distances. Computational Complexity: 1 network flow.

53 Problem #6 Input: a 3D structure. Output: Count the number of protein sequences in the solution to each of Problems #3, #4, and #5. Computational Complexity: #P-complete.

54 Problem #7 Input: a 3D structure and a bound e. Output: Enumerate the protein sequences whose fitness function values are within an additive factor e of that of the fittest protein sequences. Computational Complexity: polynomial time to generate each desired protein sequence.

55 Problem #8 Input: a 3D structure. Output: the largest possible unweighted (or weighted) Hamming distance between any two fittest protein sequences. Computational Complexity: 1 network flow.

56 Problem #9 Input: a protein sequence Y and its native 3D structure. Output: the average unweighted (or weighted) Hamming distance between Y and the fittest protein sequences for the 3D structure. Computational Complexity: #P-complete.

57 Problem #10 Input: a protein sequence Y, its native 3D structure, and two unweighted Hamming distances d_1 and d_2. Output: a fittest protein sequence whose distance from Y is also between d_1 and d_2. Computational Complexity: NP-hard.

58 Problem #11 Input: a protein sequence Y, its native 3D structure, and an unweighted Hamming distance d. Output: the fittest among the protein sequences which are at distance d from Y. Computational Complexity: NP-hard. We have a polynomial-time approximation algorithm.

59 Problem #12 Input: a protein sequence Y and its native 3D structure. Output: all the ratios between the scaling factors alpha and beta in the GC model for which the smallest possible unweighted (or weighted) Hamming distance between Y and any fittest protein sequence is minimized over all possible alpha and beta. Computational Complexity: O(n) network flows.

60 Problem #13 Input: a 3D structure. Output: Determine whether the fittest protein sequences are connected, i.e., whether they can mutate into each other through allowable mutations, such as point mutations, while the intermediate protein sequences all remain the fittest. Computational Complexity: 1 network flow.
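
The notion of connectivity in this problem can be illustrated with a brute-force check on an explicitly listed set of sequences. This is not the 1-network-flow algorithm from the talk; it is a plain breadth-first search that assumes the allowable mutations are single-position H/P flips, and the example sets are hypothetical.

```python
from collections import deque


def connected_by_point_mutations(sequences):
    """True if all sequences are mutually reachable by single H/P flips
    without ever leaving the given set."""
    seqs = set(sequences)
    if not seqs:
        return True
    start = next(iter(seqs))
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for k, ch in enumerate(cur):
            # Flip position k and keep the neighbour only if it is in the set.
            nxt = cur[:k] + ("P" if ch == "H" else "H") + cur[k + 1:]
            if nxt in seqs and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == seqs


# Hypothetical sets of fittest sequences.
chain = {"HPP", "HHP", "HHH"}       # each step flips one position
islands = {"PPP", "HHP"}            # two flips apart, no fittest intermediate
print(connected_by_point_mutations(chain), connected_by_point_mutations(islands))  # True False
```

The second set shows how two fittest sequences can be separated when every single-flip intermediate has worse fitness.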

61 Problem #14 Input: a 3D structure and two fittest protein sequences. Output: Determine whether the two sequences are connected. Computational Complexity: 1 network flow.

62 Problem #15 Input: a 3D structure. Output: the smallest set of allowable mutations with respect to which the fittest protein sequences (or two given fittest protein sequences) for the structure are connected. Computational Complexity: 1 network flow.

63 Outline of Technical Discussions (6) The Grand Canonical Model Two Basic Computational Problems Experimental Results Combinatorial Tools (1a) Linear Programming (1b) Network Flow (1c) Compact Representation of All Min Cut (1d) others Further Algorithmic & Computational Hardness Results Conclusions

64 Further Research for Protein Sequence Design 1. More sophisticated models (biology). 2. Algorithms and complexity for such models (computer science). 3. Wet lab validation (biology).

65 Further Algorithmic Research for Bioinformatics Current State of Bioinformatics: 1. Biology: mostly very simple heuristics. 2. Algorithms: mostly very simple techniques. Conjectures: 1. Biology: Nature is not so simple. Most of the biological information is very complicated. 2. Algorithms: Very sophisticated, novel, and fundamental techniques will be needed to unlock Nature’s secrets.