Structure and Motion. Jean-Claude Latombe, Computer Science Department, Stanford University. NSF-ITR Meeting, November 14, 2002.
Stanford’s Participants. PIs: L. Guibas, J.C. Latombe, M. Levitt. Research Associate: P. Koehl. Postdocs: F. Schwarzer, A. Zomorodian. Graduate students: S. Apaydin (EE), S. Ieong (CS), R. Kolodny (CS), I. Lotan (CS), A. Nguyen (Sc. Comp.), D. Russel (CS), R. Singh (CS), C. Varma (CS). Undergraduate students: J. Greenberg (CS), E. Berger (CS). Collaborating faculty: A. Brunger (Molecular & Cellular Physiology), D. Brutlag (Biochemistry), D. Donoho (Statistics), J. Milgram (Math), V. Pande (Chemistry).
Problem Domains. Biological functions derive from the structures (shapes) achieved by molecules through motions. Determination, classification, and prediction of 3D protein structures. Modeling of molecular energy and simulation of folding and binding motion.
What’s New/Interesting for Computer Science? Massive amount of experimental data; importance of similarities; multiple representations of structure; continuous energy functions; many objects forming deformable chains; many degrees of freedom; ensemble properties of pathways.
Importance of similarities. Segmentation/matching/scoring techniques: data set → clustered data → small library. E.g.: Libraries of protein fragments [Kolodny, Koehl, Guibas, Levitt, JMB (2002)]
Approximations of 1tim. [Figure: the real protein next to two approximations. Complexity 10 (100 fragments of length 5); complexity 2.26 (50 fragments of length 7). The cRMS values, in Å, are not preserved.]
Alignment of Structural Motifs [Singh and Saha; Kolodny and Linial]. Problem: Determine if two structures share common motifs. Given 2 (labelled) structures in R^3, A = {a_1, a_2, …, a_n} and B = {b_1, b_2, …, b_m}, find subsequences s_a and s_b s.t. the substructures {a_{s_a(1)}, a_{s_a(2)}, …, a_{s_a(l)}} and {b_{s_b(1)}, b_{s_b(2)}, …, b_{s_b(l)}} are similar. Twofold problem: alignment and correspondence. Score; Approximation; Complexity.
Iterative Closest Point (Besl-McKay) for alignment: [R. Singh and M. Saha. Identifying Structural Motifs in Proteins. Pacific Symp. on Biocomputing, Jan ] Score: RMSD distance
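For reference, the RMSD score of two matched substructures of length l is usually taken after the best rigid superposition. The formula below is the standard definition written in the notation of the motif-alignment problem; the rotation R and translation t are symbols introduced here, not taken from the slide. Iterative Closest Point alternates between fixing the correspondence (each point matched to its closest counterpart) and minimizing this expression over R and t.

```latex
\mathrm{RMSD} \;=\; \min_{R,\,t}\;
\sqrt{\frac{1}{l}\sum_{k=1}^{l}
\bigl\lVert\, a_{s_a(k)} - \bigl(R\, b_{s_b(k)} + t\bigr) \bigr\rVert^{2}}
```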
[R. Singh and M. Saha. Identifying Structural Motifs in Proteins. Pacific Symp. on Biocomputing, Jan ] Figure labels: Trypsin; Trypsin active site.
[R. Singh and M. Saha. Identifying Structural Motifs in Proteins. Pacific Symp. on Biocomputing, Jan ] Trypsin active site against 42 trypsin-like proteins.
Multiple representations of structure. ProShape software [Koehl, Levitt (Stanford), Edelsbrunner (Duke)]
Statistical potentials for proteins based on alpha complex [Guibas, Koehl, Zomorodian]. Decoys generated using “physical” potentials. Select best decoys using distance information.
Continuous energy function; many objects in deformable chains. Many pairs of objects, but relatively few are close enough to interact. Data structures that capture proximity, but undergo small or rare changes. During motion simulation: detect steric clashes (self-collisions); find pairs of atoms closer than cutoff; find which energy terms can be reused.
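One of the tasks listed above, finding pairs of atoms closer than a cutoff, can be sketched with a plain uniform grid. This is only the baseline grid approach, not the adaptive hierarchies discussed in the talk; the function name `close_pairs` and the sample coordinates are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor

def close_pairs(points, cutoff):
    """Return sorted index pairs (i, j), i < j, whose distance is below cutoff."""
    # Bucket every point into a cubic cell whose side equals the cutoff.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(floor(c / cutoff) for c in p)].append(idx)

    pairs = set()
    for key, members in cells.items():
        # Partners can only sit in the 27 cells around (and including) this one.
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple(k + o for k, o in zip(key, offset))
            for i in members:
                for j in cells.get(other, ()):
                    if i < j and dist(points[i], points[j]) < cutoff:
                        pairs.add((i, j))
    return sorted(pairs)

# Hypothetical atom centres, chosen only to exercise the function.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0),
         (5.5, 5.0, 5.0), (-0.5, 0.0, 0.0)]
pairs = close_pairs(atoms, cutoff=1.2)
```

The grid is rebuilt from scratch on every call, which is exactly the cost that proximity structures with "small or rare changes" aim to avoid.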
Other application domains: modular reconfigurable robots; reconstructive surgery.
Fixed Bounding-Volume hierarchies don’t work. Instead, exploit what doesn’t change: chain topology. Adaptive BV hierarchies [Guibas, Nguyen, Russel, Zhang] [Lotan, Schwarzer, Halperin, Latombe] (SoCG’02)
Wrapped bounding sphere hierarchies (WBSH) [Guibas, Nguyen, Russel, Zhang] (SoCG 2002). WBSH undergoes a small number of changes. Self-collision: O(n log n) in R^2; O(n^(2-2/d)) in R^d, d ≥ 3.
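The idea of a bounding-volume hierarchy that follows chain topology can be sketched as below. This is a simplified, static sphere tree in which each parent sphere encloses its two child spheres and the tree is split along chain order. It is not the WBSH or ChainTree structure of the cited papers and carries none of their update or complexity guarantees; all names (`Node`, `build`, `self_pairs`) and the sample chain are hypothetical.

```python
from math import dist

class Node:
    """Sphere bounding the chain atoms with indices lo .. hi-1."""
    def __init__(self, lo, hi, center, radius, left=None, right=None):
        self.lo, self.hi = lo, hi
        self.center, self.radius = center, radius
        self.left, self.right = left, right

def enclose(a, b):
    """Smallest sphere containing the spheres of nodes a and b."""
    d = dist(a.center, b.center)
    if d + b.radius <= a.radius:      # b lies inside a
        return a.center, a.radius
    if d + a.radius <= b.radius:      # a lies inside b
        return b.center, b.radius
    radius = (d + a.radius + b.radius) / 2.0
    t = (radius - a.radius) / d       # d > 0 here, otherwise a branch above fired
    center = tuple(ca + t * (cb - ca) for ca, cb in zip(a.center, b.center))
    return center, radius

def build(points, lo, hi):
    """Binary tree over consecutive chain indices; leaves are single atoms."""
    if hi - lo == 1:
        return Node(lo, hi, points[lo], 0.0)
    mid = (lo + hi) // 2
    left, right = build(points, lo, mid), build(points, mid, hi)
    center, radius = enclose(left, right)
    return Node(lo, hi, center, radius, left, right)

def cross(a, b, cutoff, out):
    """Collect atom pairs (one under a, one under b) closer than cutoff."""
    if dist(a.center, b.center) - a.radius - b.radius >= cutoff:
        return                        # spheres too far apart: prune both subtrees
    if a.left is None and b.left is None:
        if b.lo - a.lo > 1:           # skip bonded neighbours along the chain
            out.append((a.lo, b.lo))
        return
    if b.left is None or (a.left is not None and a.radius >= b.radius):
        cross(a.left, b, cutoff, out)
        cross(a.right, b, cutoff, out)
    else:
        cross(a, b.left, cutoff, out)
        cross(a, b.right, cutoff, out)

def self_pairs(node, cutoff, out):
    """Self-collision query: test each subtree against its sibling."""
    if node.left is None:
        return
    self_pairs(node.left, cutoff, out)
    self_pairs(node.right, cutoff, out)
    cross(node.left, node.right, cutoff, out)

# Hypothetical hairpin-shaped chain with unit bond length.
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
         (2.0, 1.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
found = []
self_pairs(build(chain, 0, len(chain)), 1.1, found)
found.sort()
```

Because the tree shape depends only on chain order, a deformation changes sphere positions and radii but never the tree structure, which is the property the adaptive hierarchies exploit.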
ChainTrees [Lotan, Schwarzer, Halperin, Latombe] (SoCG’02)
ChainTrees, cost per motion step. Updating: [bound not preserved]. Finding interacting pairs: [bound not preserved] (in practice, sublinear). Assumption: Few degrees of freedom change at each motion step (e.g., Monte Carlo simulation).
ChainTrees: Application to MC simulation (comparison to grid method). [Plots: two panels, m = 1 and m = 5, each with axis ticks (68), (144), (374), (755).]
Many degrees of freedom. Tools to explore high-dimensional conformational (structure) spaces: structure sampling [Kolodny, Levitt]; finding nearest neighbors [Lotan, Schwarzer].
Sampling structures by combining fragments [Kolodny, Levitt]. Library of protein fragments → discrete set of candidate structures. [Figure labels: a, b, c, d; cabcab; bbc]
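A toy sketch of why a fragment library yields a discrete candidate set: if each candidate is written as a string of fragment labels, the candidates are the strings over the library alphabet. The labels, the helper name `candidate_strings`, and the string length are illustrative assumptions; building 3D coordinates from the fragments is not shown.

```python
from itertools import product

def candidate_strings(labels, n_fragments):
    """Yield every string of n_fragments labels drawn from the library."""
    for combo in product(labels, repeat=n_fragments):
        yield "".join(combo)

library = ["a", "b", "c", "d"]          # hypothetical fragment labels
candidates = list(candidate_strings(library, 3))
```

The set grows as (library size)^(number of fragments), here 4^3 = 64, so it is finite but quickly becomes large.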
Nearest neighbors in high-dimensional space [Lotan, Schwarzer]. Find k nearest neighbors of a given protein conformation in a set of n conformations (cRMS, dRMS). Idea: Cut backbone into m equal subsequences. [Figure: backbone cut at points a_0, a_1, …, a_m]
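A minimal sketch of the idea, under stated assumptions: each of the m subsequences is replaced by its centroid (an "averaged" representation), conformations are compared with dRMS, and neighbors are found by brute force. The function names and the tiny test chains are hypothetical, and the sketch assumes the chain length is a multiple of m with m at least 2.

```python
from itertools import combinations
from math import dist, sqrt

def averaged_rep(conformation, m):
    """Cut the backbone into m equal subsequences and return their centroids."""
    n = len(conformation)
    if n % m != 0:
        raise ValueError("this sketch assumes the chain length is a multiple of m")
    size = n // m
    centroids = []
    for start in range(0, n, size):
        chunk = conformation[start:start + size]
        centroids.append(tuple(sum(axis) / size for axis in zip(*chunk)))
    return centroids

def drms(a, b):
    """Distance RMS: compares intra-structure distances, so no superposition is needed."""
    pairs = list(combinations(range(len(a)), 2))
    total = sum((dist(a[i], a[j]) - dist(b[i], b[j])) ** 2 for i, j in pairs)
    return sqrt(total / len(pairs))

def k_nearest(query, conformations, k, m):
    """Brute-force k nearest neighbours under dRMS on the averaged representation."""
    q = averaged_rep(query, m)
    scored = [(drms(q, averaged_rep(c, m)), idx) for idx, c in enumerate(conformations)]
    return [idx for _, idx in sorted(scored)[:k]]

# Hypothetical 4-atom chains.
straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 2.0, 0.0)]
neighbors = k_nearest(straight, [stretched, bent, straight], k=2, m=2)
```

Averaging shrinks each conformation from n points to m, so every dRMS evaluation touches m(m-1)/2 distances instead of n(n-1)/2.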
Nearest neighbors in high-dimensional space [Lotan and Schwarzer]
100,000 decoys of 1CTF (Park-Levitt set); computation of 100 NN of each conformation
  Representation   Metric   Search        Time
  Full rep.        dRMS     brute force   ~84 h
  Ave. rep.        dRMS     brute force   ~4.8 h
  SVD red. rep.    dRMS     brute force   41 min
  SVD red. rep.    dRMS     kd-tree       19 min
~80% of computed NNs are true NNs. kd-tree software from ANN library (U. Maryland).
Ensemble properties of pathways. Stochastic nature of molecular motion requires characterizing average properties of many pathways. Probabilistic conformational roadmaps. Applications to protein folding and ligand-protein binding [Apaydin, Brutlag, Guestrin, Hsu, Latombe]
Example: Probability of Folding p_fold. [Diagram: from a conformation, pathways reach the folded set with probability p_fold and the unfolded set with probability 1 - p_fold.] “We stress that we do not suggest using p_fold as a transition coordinate for practical purposes as it is very computationally intensive.” Du, Pande, Grosberg, Tanaka, and Shakhnovich, “On the Transition Coordinate for Protein Folding,” Journal of Chemical Physics (1998). HIV integrase [Du et al. ’98]
Probabilistic Roadmap [Apaydin, Brutlag, Hsu, Guestrin, Latombe] (RECOMB’02, ECCB’02). Idea: Capture the stochastic nature of molecular motion by a network of randomly selected conformations and by assigning probabilities to edges. [Figure: nodes v_i and v_j joined by an edge with probability P_ij]
Probabilistic Roadmap. F: Folded set; U: Unfolded set. [Figure: node i with self-transition probability P_ii and edges to nodes j, k, l, m with probabilities P_ij, P_ik, P_il, P_im.] Let f_i = p_fold(i). After one step: f_i = P_ii f_i + P_ij f_j + P_ik f_k + P_il f_l + P_im f_m, where f = 1 for a node in F and f = 0 for a node in U. One linear equation per node. Solution gives p_fold for all nodes. No explicit simulation run. All pathways are taken into account. Sparse linear system.
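The one-equation-per-node system can be sketched on a toy roadmap as follows. Gauss-Seidel iteration is used here only because it is short and handles sparse rows naturally; the slides do not say which solver was used. The function name, the dictionary layout, and the 5-node chain are assumptions, and the sketch assumes P_ii < 1 for every free node.

```python
def p_fold(transitions, folded, unfolded, sweeps=10_000, tol=1e-12):
    """Solve f_i = sum_j P_ij f_j with f = 1 on folded nodes and f = 0 on unfolded nodes."""
    f = {node: 0.0 for node in transitions}
    for node in folded:
        f[node] = 1.0
    free = [n for n in transitions if n not in folded and n not in unfolded]
    for _ in range(sweeps):
        change = 0.0
        for i in free:
            row = transitions[i]
            stay = row.get(i, 0.0)
            # Move P_ii f_i to the left: (1 - P_ii) f_i = sum over j != i of P_ij f_j.
            new = sum(p * f[j] for j, p in row.items() if j != i) / (1.0 - stay)
            change = max(change, abs(new - f[i]))
            f[i] = new
        if change < tol:
            break
    return f

# Hypothetical roadmap: a chain of 5 nodes, node 0 unfolded, node 4 folded.
# Each interior node stays put with probability 0.5 or steps to either neighbour.
roadmap = {
    0: {},
    1: {1: 0.5, 0: 0.25, 2: 0.25},
    2: {2: 0.5, 1: 0.25, 3: 0.25},
    3: {3: 0.5, 2: 0.25, 4: 0.25},
    4: {},
}
f = p_fold(roadmap, folded={4}, unfolded={0})
```

For this symmetric chain the solution is f_i = i/4, which gives a quick sanity check on the solver.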
Correlation with MC Approach. 1ROP (repressor of primer): 2 helices, 6 DOF.
Probabilistic Roadmap Computation Times (1ROP). Monte Carlo: 49 conformations; over 11 days of computer time; over 10^6 energy computations. Roadmap: 5000 conformations; hours of computer time; ~15,000 energy computations. ~4 orders of magnitude speedup!
Summary. Results: interpretation of electron density maps; statistical potential; library of protein fragments; self-collision and energy maintenance; structure alignment; ProShape software; tools for high-dimensional spaces; probabilistic roadmaps. Areas: Biology (structure determination); Modeling (shape representation, hierarchies); Algorithms (deformation, motion planning, shape organization); Software (alpha shapes).
Future Work. Perform more substantial experiments, e.g., more realistic potentials in ChainTree and probabilistic roadmaps. Extend tools to solve more relevant problems, e.g., encode Molecular Dynamics into probabilistic roadmaps. Combine results, e.g., use library of fragments to sample probabilistic roadmaps. Develop new algorithms/data structures, e.g., sparse spanners to capture proximity information.
Our Future: The BioX – Clark Center June 2003