Rapid Protein Side-Chain Packing via Tree Decomposition Jinbo Xu Toyota Technological Institute at Chicago.

Presentation transcript:

Rapid Protein Side-Chain Packing via Tree Decomposition Jinbo Xu Toyota Technological Institute at Chicago

Outline: Background, Method, Results

Biology in One Slide (figure: from organism to protein)

Proteins: Proteins are the building blocks of life. In a cell, 70% is water and 15%-20% are proteins. Examples:
- hormones: regulate metabolism
- structures: hair, wool, muscle, ...
- antibodies: immune response
- enzymes: chemical reactions

Amino Acids: A protein is composed of a central backbone and a collection of (typically) amino acids (a.k.a. residues). There are 20 different kinds of amino acids, each consisting of up to 18 atoms, e.g.,

Name            3-letter code   1-letter code
Leucine         Leu             L
Alanine         Ala             A
Serine          Ser             S
Glycine         Gly             G
Valine          Val             V
Glutamic acid   Glu             E
Threonine       Thr             T

Protein Structure: Protein sequence: DRVYIHPF (Asp Arg Val Tyr Ile His Pro Phe). (Figure: chemical structure of the peptide, showing the repeating backbone structure and the side chain attached to each residue.)

Protein Structure Prediction:
Stage 1: Backbone Prediction (ab initio folding, homology modeling, protein threading)
Stage 2: Loop Modeling
Stage 3: Side-Chain Packing
Stage 4: Structure Refinement
The picture is adapted from

Protein Side-Chain Packing:
Problem: given the backbone coordinates of a protein, predict the coordinates of the side-chain atoms (what are their positions?).
Insight: a protein structure is a geometric object with special features.
Method: decompose a protein structure into some very small blocks.

Torsion Angles: Each amino acid has 0 to 4 torsion angles. The positions of the side-chain atoms are determined once the C-alpha and C-beta positions are known and the torsion angles are fixed. (Figure: torsion angles of lysine.)
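To make the notion of a torsion angle concrete, here is a minimal sketch that computes the signed dihedral angle defined by four consecutive atom positions. The function name `dihedral` and the plain-tuple coordinate format are assumptions made for illustration; they do not come from the slides.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) around the bond p1-p2, for four 3-D points."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1 = cross(b0, b1)          # normal of the plane through p0, p1, p2
    n2 = cross(b1, b2)          # normal of the plane through p1, p2, p3
    x = dot(n1, n2)
    y = math.sqrt(dot(b1, b1)) * dot(b0, n2)
    return math.degrees(math.atan2(y, x))
```

With this convention a cis arrangement gives 0 degrees and a trans arrangement gives 180 degrees (up to sign).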

Conformation Discretization (by clustering): The probabilities can depend on local backbone structures.

Side-Chain Packing: Each residue has many possible side-chain positions. Each possible position is called a rotamer. Atomic clashes must be avoided.

Energy Function: Minimize the energy function to obtain the best side-chain packing. Assume rotamer A(i) is assigned to residue i. The side-chain packing quality is measured by two kinds of terms: an occurring preference for each assigned rotamer (the higher the occurring probability, the smaller the value) and a clash penalty for each pair of assigned rotamers, which depends on the distance between two atoms and on the atom radii.
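As a hedged sketch only, an energy with these two kinds of terms can be written in the generic form below. The symbols $S_i$, $P_{ij}$ and $p_i$ are assumptions introduced here for illustration, and the negative-log form of the singleton term is just one common choice that makes more probable rotamers cheaper; the exact functions used in the talk are not stated in the text.

```latex
E(A) \;=\; \sum_{i} S_i\bigl(A(i)\bigr) \;+\; \sum_{i<j} P_{ij}\bigl(A(i), A(j)\bigr),
\qquad
S_i(r) \;=\; -\log p_i(r)
```

Here $S_i$ is the occurring-preference term of residue $i$, $p_i(r)$ the occurring probability of rotamer $r$ at residue $i$, and $P_{ij}$ the clash penalty between the rotamers chosen for residues $i$ and $j$, which is zero when the two residues do not interact.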

Related Work:
- The problem is NP-hard [Akutsu, 1997; Pierce et al., 2002], and it is NP-complete to achieve an approximation ratio of O(N) [Chazelle et al., 2004]
- Dead-End Elimination: eliminate rotamers one by one
- Linear integer programming [Althaus et al., 2000; Eriksson et al., 2001; Kingsford et al., 2004]
- Semidefinite programming [Chazelle et al., 2004]
- SCWRL: biconnected decomposition of a protein structure [Dunbrack et al., 2003]; one of the most popular side-chain packing programs

Algorithm Overview:
1. Model the potential atomic clash relationship using a residue interaction graph.
2. Decompose the residue interaction graph into many small subgraphs (tree decomposition).
3. Do side-chain packing on each subgraph almost independently.

Residue Interaction Graph: Each residue is a vertex. Two residues interact if there is a potential clash between their rotamer atoms. Add one edge between each pair of residues that interact. (Figure: an example residue interaction graph with vertices labeled by letters.)
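A minimal sketch of building such a graph is shown below. It is a simplification: instead of testing every rotamer atom pair for a potential clash, it joins two residues whenever their centers lie within a cutoff distance, which can only over-estimate the true edge set. The names `interaction_graph`, `centers` and `cutoff` are assumptions made for illustration.

```python
import math
from itertools import combinations

def interaction_graph(centers, cutoff):
    """Adjacency sets of a distance-cutoff graph over residue centers.

    centers: list of (x, y, z) tuples, one per residue (hypothetical input format).
    cutoff:  residues farther apart than this are assumed unable to clash.
    """
    graph = {i: set() for i in range(len(centers))}
    for i, j in combinations(range(len(centers)), 2):
        if math.dist(centers[i], centers[j]) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph
```

The all-pairs loop is quadratic in the number of residues; a spatial grid would be the natural refinement for large proteins.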

Key Observations:
1. A residue interaction graph is a geometric neighborhood graph. Each rotamer is bound to its backbone within a constant distance, so there is no interaction edge between two residues if their distance is beyond D, where D is a constant depending on rotamer diameter.
2. A residue interaction graph is sparse! Any two residue centers cannot be too close: their distance is at least a constant C.
No previous algorithms exploit these features!

Tree Decomposition [Robertson & Seymour, 1986]: Greedy minimum degree heuristic:
1. Choose the vertex with minimal degree.
2. The chosen vertex and its neighbors form a component.
3. Add one edge between any two neighbors of the chosen vertex.
4. Remove the chosen vertex.
5. Repeat the above steps until the graph is empty.
(Figure: the example graph on vertices a to m before and after the first step, which produces the component abd.)
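The five steps of the greedy heuristic can be sketched as follows. The function name, the dict-of-sets graph format and the tie-breaking rule (smallest label among vertices of equal degree) are assumptions made for illustration.

```python
def min_degree_decomposition(graph):
    """Greedy minimum-degree elimination.

    graph: dict mapping each vertex to the set of its neighbors (undirected).
    Returns (components, order): the component formed at each step and the
    order in which vertices were removed.
    """
    g = {v: set(nbrs) for v, nbrs in graph.items()}  # work on a copy
    components, order = [], []
    while g:
        # Step 1: vertex of minimal degree (ties broken by label).
        v = min(g, key=lambda u: (len(g[u]), u))
        nbrs = g[v]
        # Step 2: the vertex and its neighbors form a component.
        components.append({v} | nbrs)
        # Step 3: connect every pair of neighbors; Step 4: remove the vertex.
        for a in nbrs:
            g[a] |= nbrs - {a}
            g[a].discard(v)
        del g[v]
        order.append(v)
    return components, order

def width(components):
    """Tree width of the decomposition: maximal component size minus 1."""
    return max(len(c) for c in components) - 1
```

Components produced late in the run are often subsets of earlier ones and can be dropped before the components are linked into a tree.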

Tree Decomposition (Cont'd): Tree width is the maximal component size minus 1. (Figure: the example graph on vertices a to m and its tree decomposition, with components abd, acd, clk, cdem, defm, fgh and eij.)

Side-Chain Packing Algorithm:
1. Bottom-to-top: calculate the minimal energy function.
2. Top-to-bottom: extract the optimal assignment.
3. Time complexity: exponential in the tree width, linear in the graph size.
(Figure: a tree decomposition rooted at X_r, containing a component X_i with parent X_r and children X_j and X_l, connected through the separators X_ir, X_ji and X_li. The score of the subtree rooted at X_i combines the score of component X_i with the scores of the subtrees rooted at X_j and X_l.)
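The two passes can be sketched as a dynamic program over a rooted tree decomposition. This is an illustrative sketch under assumptions, not the TreePack implementation: the function name `pack`, the input format and the toy energies are all hypothetical. Each energy term is charged to the topmost component that contains all of its residues, so nothing is counted twice.

```python
from itertools import product

def pack(bags, children, root, n_rot, single, pair):
    """Minimal energy and one optimal rotamer assignment.

    bags: component id -> tuple of residues; children: id -> list of child ids;
    n_rot: residue -> rotamer count; single(r, a) and pair(r, a, s, b) are
    energy callables, with pair returning 0 for non-interacting residues.
    """
    parent = {c: p for p, cs in children.items() for c in cs}
    table, choice = {}, {}

    def sep(b):  # residues shared with the parent component
        return tuple(r for r in bags[b] if b in parent and r in bags[parent[b]])

    def local(b, assign):  # terms whose topmost component is b
        up = set(bags[parent[b]]) if b in parent else set()
        e = sum(single(r, assign[r]) for r in bags[b] if r not in up)
        for x, r in enumerate(bags[b]):
            for s in bags[b][x + 1:]:
                if not (r in up and s in up):
                    e += pair(r, assign[r], s, assign[s])
        return e

    def solve(b):  # bottom-to-top: best subtree score per separator assignment
        for c in children.get(b, []):
            solve(c)
        s = sep(b)
        rest = tuple(r for r in bags[b] if r not in s)
        table[b], choice[b] = {}, {}
        for s_val in product(*(range(n_rot[r]) for r in s)):
            best = None
            for r_val in product(*(range(n_rot[r]) for r in rest)):
                assign = dict(zip(s + rest, s_val + r_val))
                e = local(b, assign)
                for c in children.get(b, []):
                    e += table[c][tuple(assign[r] for r in sep(c))]
                if best is None or e < best:
                    best, choice[b][s_val] = e, r_val
            table[b][s_val] = best

    def extract(b, result):  # top-to-bottom: read back the stored choices
        s = sep(b)
        rest = tuple(r for r in bags[b] if r not in s)
        result.update(zip(rest, choice[b][tuple(result[r] for r in s)]))
        for c in children.get(b, []):
            extract(c, result)
        return result

    solve(root)
    return table[root][()], extract(root, {})

# Toy chain of four residues with three rotamers each (all values hypothetical).
def single(r, a):
    return (7 * r + 3 * a) % 5

def pair(r, a, s, b):
    return (r + s + a + b) % 3 if abs(r - s) == 1 else 0

bags = {"A": (0, 1), "B": (1, 2), "C": (2, 3)}
children = {"A": ["B"], "B": ["C"]}
energy, assignment = pack(bags, children, "A", {r: 3 for r in range(4)}, single, pair)
```

Each component enumerates all rotamer combinations of its own residues, so the work per component grows exponentially with its size, while the number of components grows only with the graph size.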

Theoretical Treewidth Bounds: For a general graph, it is NP-hard to determine its optimal treewidth. A residue interaction graph:
- Has a treewidth of $O(N^{2/3} \log N)$. Such a decomposition can be found by a low-degree polynomial-time algorithm, based on the Sphere Separator Theorem [G.L. Miller et al., 1997], a generalization of the Planar Separator Theorem.
- Has a treewidth lower bound of $\Omega(N^{2/3})$. Example: the residue interaction graph is a cube and each residue is a grid point.

Sphere Separator Theorem [G.L. Miller & S.H. Teng et al., 1997]:
- k-ply neighborhood system: a set of balls in three-dimensional space such that no point is within more than k balls.
- Sphere separator theorem: if N balls form a k-ply system, then there is a sphere separator S such that at most 4N/5 balls are totally inside S, at most 4N/5 balls are totally outside S, and at most $O(k^{1/3} N^{2/3})$ balls intersect S.
- S can be calculated in randomized linear time.

Residue Interaction Graph Separator: Construct a ball with radius D/2 centered at each residue. All the balls form a k-ply neighborhood system, where k is a constant depending on D and C. All the residues in the blue circles form a balanced separator with size $O(N^{2/3})$.

Separator-Based Decomposition: Each S_i is a separator with size $O(N^{2/3})$. Each S_i corresponds to a component: all the separators on a path from S_i to S_1 form a tree decomposition component. (Figure: a tree of separators S_1 to S_12 with height $O(\log N)$.)

Empirical Component Size Distribution: Tested on the 180 proteins used by SCWRL 3.0. Components with size ≤ 2 are ignored. DEE is conducted before tree decomposition; otherwise, the component sizes would be bigger.

Result (1): CPU time (seconds). Table columns: protein, size, SCWRL, TreePack, speedup; rows: 1gai, a8i, b0p, bu, xwl. Five times faster on average, tested on the 180 proteins used by SCWRL 3.0. Same prediction accuracy as SCWRL. Theoretical time complexity: exponential in the tree width and linear in the graph size, where the base of the exponential is the average number of rotamers for each residue. TreePack can solve some instances that SCWRL cannot!

Result (2): Chi1 Accuracy. A prediction is judged correct if its deviation from the experimental value is within 40 degrees.
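The correctness criterion can be sketched as below. Because torsion angles are periodic, the deviation is taken as the shorter way around the circle; the function names and the choice to treat a deviation of exactly 40 degrees as correct are assumptions made for illustration.

```python
def chi_correct(predicted, experimental, tolerance=40.0):
    """True if the angular deviation (degrees, wrapped to [0, 180]) is within tolerance."""
    diff = abs(predicted - experimental) % 360.0
    return min(diff, 360.0 - diff) <= tolerance

def chi_accuracy(predicted, experimental, tolerance=40.0):
    """Fraction of residues whose predicted angle is judged correct."""
    hits = sum(chi_correct(p, e, tolerance) for p, e in zip(predicted, experimental))
    return hits / len(predicted)
```

For example, a prediction of 175 degrees against an experimental value of -175 degrees deviates by only 10 degrees and counts as correct.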

Result (3): Non-native Backbones. Table columns: Chi1, Chi1+2; rows: TreePack, SCWRL, SCAP, MODELLER. Tested on 24 CASP6 targets; backbone structures are generated by RAPTOR+MODELLER.

Result (4): The side-chain packing problem has a PTAS if one of the following conditions is satisfied:
- All the energy items are non-positive.
- All the pairwise energy items have the same sign, and the lowest system energy is away from 0 by a certain amount.
An optimization problem admits a PTAS if, given an error ε (0<ε<1), there is a polynomial-time algorithm that obtains a solution within a factor of (1±ε) of the optimal. Chazelle et al. have proved that it is NP-complete to approximate this problem within a factor of O(N), without considering the geometric characteristics of a protein structure.

A PTAS for Side-Chain Packing: Partition the residue interaction graph into two parts and do side-chain assignment on each part separately.

A PTAS (Cont'd): To obtain a good solution, cycle-shift the shaded area by iD (i=1, 2, …, k-1) units to obtain k different partition schemes. At least one partition scheme can generate a good side-chain assignment.

Application to Membrane Proteins: Cryo-EM density map of the gap junction channel, at 5.7 Å resolution in the membrane plane and 19.8 Å resolution in the vertical direction. The alpha-carbon model presented in Fleishman et al., Molecular Cell 15, 879–888 (2004) is superimposed. Red spheres, corresponding to disease-causing mutations, are located at helix-helix interfaces. Half of the connexon model has been cropped to view the side chain packing between the various helices. The coloring is by polarity, as in the CPK figure. Most aromatic side chains are packed between the putative 4-helix bundles, as well as on the perimeter facing the lipid. Pictures are taken from Julio Kovacs. (Figure panel labels: RMSD=5.7Å, RMSD=19.8Å, RMSD=0.6Å; helices numbered 1 to 4.)

Summary: A novel tree-decomposition-based algorithm for protein side-chain prediction:
- Exploits the geometric features of a protein structure
- Theoretical bound on time complexity
- Polynomial-time approximation scheme
- Efficient in practice, good accuracy
- Can be used for sampling-based ab initio protein folding
Work To Do:
- Add more energy items to the energy function
- Apply the algorithm to protein docking and protein interaction prediction
TreePack at

Acknowledgements: Ming Li (Waterloo), Bonnie Berger (MIT)

Thank You

Tree Decomposition [Robertson & Seymour, 1986]: Greedy minimum degree heuristic. (Figure: the original graph on vertices a to m and the graphs after the first elimination steps, which produce the components abd and acd.)

Tree Decomposition [Robertson & Seymour, 1986]: Let G=(V,E) be a graph. A tree decomposition (T, X) satisfies the following conditions:
- T=(I, F) is a tree with node set I and edge set F.
- Each element in X is a subset of V and is also a component in the tree decomposition. The union of all elements is equal to V.
- There is a one-to-one mapping between I and X.
- For any edge (v,w) in E, there is at least one X(i) in X such that v and w are in X(i).
- In tree T, if node j is a node on the path from i to k, then the intersection of X(i) and X(k) is a subset of X(j).
Tree width is defined to be the maximal component size minus 1.
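The conditions above can be checked mechanically. The sketch below uses the equivalent form of the last condition: for every vertex, the tree nodes whose components contain it must form a connected subtree. The function name and the input format (components as a dict from tree node to a set of vertices, tree edges as pairs) are assumptions made for illustration.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check the tree decomposition conditions for graph (vertices, edges).

    bags: tree node -> set of graph vertices; tree_edges: list of node pairs.
    """
    nodes = set(bags)
    adj = {i: set() for i in nodes}
    for i, j in tree_edges:
        adj[i].add(j)
        adj[j].add(i)

    def connected(subset):
        """True if the given tree nodes induce a connected subgraph of T."""
        subset = set(subset)
        if not subset:
            return True
        start = next(iter(subset))
        seen, stack = {start}, [start]
        while stack:
            for n in adj[stack.pop()]:
                if n in subset and n not in seen:
                    seen.add(n)
                    stack.append(n)
        return seen == subset

    # T is a tree: connected with exactly |I| - 1 edges.
    is_tree = len(tree_edges) == len(nodes) - 1 and connected(nodes)
    # The union of all components equals V.
    covers = set().union(*bags.values()) == set(vertices)
    # Every graph edge lies inside at least one component.
    edges_ok = all(any(v in b and w in b for b in bags.values()) for v, w in edges)
    # Path condition: components containing a vertex form a connected subtree.
    running = all(connected(i for i in nodes if v in bags[i]) for v in vertices)
    return is_tree and covers and edges_ok and running
```

A check like this is a cheap safeguard before running any dynamic program that relies on the decomposition being valid.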