Generating Better Conformations for Roadmaps in Protein Folding

Shawna Thomas, Jeff May, Lydia Tapia, Nancy M. Amato
PARASOL Lab, Department of Computer Science, Texas A&M University

Simulating Protein Folding

Potential Landscape
[Figure: potential landscape; axes are conformation space and potential energy, with the target state marked]
The potential landscape can be very complicated. Different proteins have different landscapes, which yield different folding behaviors.

A Map to Approximate the Landscape
[Figure: roadmap over the landscape; each node is a conformation]
A roadmap is a graph that approximates the protein’s potential landscape. Because it characterizes the main landscape features, it can be used to find folding pathways.

Energy Function
Our method is flexible and allows any potential function to be used. For the experiments shown, a coarse potential based on [Levitt ’83] was used. It includes van der Waals terms, hydrogen bonds, and hydrophobic effects. With this potential, only a few hours are needed to create a roadmap.

Protein Folding Problems
Predict tertiary structure given the amino acid sequence.
-- Protein structure determines function and is critical for drug design.
Find folding pathways to the known tertiary structure (our work).
-- Understand the folding process to design better structure prediction methods and to study diseases caused by misfolding.
[Figure: primary structure (amino acid sequence), e.g. TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN; secondary structure elements (helix + sheet + variable loops) = tertiary structure]

Node Connection
1. Find the k (a small constant) closest neighbors for each roadmap node.
   C-space distance metric: Euclidean, RMSD, rigidity-based, …
2. Assign edge weight w(u,v) to reflect energetic feasibility:
   w(u,v) = f(E(c_1), E(c_2), E(c_3), …, E(c_n))
   where c_1, …, c_n are the intermediate conformations between u and v. Lower weight implies more feasible.

References
Kinetics Analysis Methods for Approximate Folding Landscapes, L. Tapia, X. Tang, S. Thomas, and N. M. Amato, 15th Int. Conf. on Intelligent Systems for Molecular Biology (ISMB) & 6th European Conf. on Computational Biology (ECCB), to appear, July.
Simulating Protein Motions with Rigidity Analysis, S. Thomas, X. Tang, L. Tapia, and N. M. Amato, Journal of Computational Biology (JCB), to appear. Also in Proc. of the 10th Int. Conf. on Computational Molecular Biology (RECOMB).
Tools for Simulating and Analyzing RNA Folding Kinetics, X. Tang, S. Thomas, L. Tapia, and N. M. Amato, Proc. of the 11th Int. Conf. on Computational Molecular Biology (RECOMB), to appear, April.
A Path Planning-Based Study of Protein Folding Pathways with a Case Study of Hairpin Formation in Protein G and L, G. Song, S. Thomas, K. A. Dill, J. M. Scholtz, and N. M. Amato, Proc. of the 7th Pacific Symp. on Biocomputing (PSB).
Using Motion Planning to Map Protein Folding Landscapes and Analyze Folding Kinetics of Known Native Structures, N. M. Amato, K. A. Dill, and G. Song, J. of Computational Biology (JCB), 10(2). Also in Proc. of the 6th Int. Conf. on Computational Molecular Biology (RECOMB), pp. 2-11.
Using Motion Planning to Study Protein Folding Pathways, G. Song and N. M. Amato, J. of Computational Biology (JCB), 9(2). Also in Proc. of the 5th Int. Conf. on Computational Molecular Biology (RECOMB).

* This research supported in part by NSF Grants EIA, ACR, ACR, CCR, ACI, by the DOE, and by Hewlett-Packard. Tapia supported in part by an NIH Molecular Biophysics Training Grant (T32GM065088) and previously supported by a Department of Education GAANN Fellowship. Thomas supported in part by a Department of Education GAANN Fellowship and previously supported by an NSF Graduate Research Fellowship and a P.E.O. Scholarship.

Roadmap Model for Protein Folding

Protein Model
An amino acid is modeled as a pair of phi/psi angles. A protein is a sequence of amino acids, and a conformation is the vector of these angles: (phi_1, psi_1, phi_2, psi_2, …, phi_n, psi_n). Protein folding becomes a problem with hundreds of degrees of freedom!
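To make the protein model concrete, a conformation can be held as a list of (phi, psi) pairs, which makes the degree-of-freedom count explicit. This is a minimal sketch, not the authors' implementation: the names `Conformation`, `random_conformation`, `degrees_of_freedom`, and `angular_distance` are hypothetical, angles are assumed to be in degrees, and the wrapped Euclidean distance is only one simple choice of C-space metric.

```python
import math
import random
from typing import List, Tuple

# Hypothetical representation: one (phi, psi) pair, in degrees, per amino acid.
Conformation = List[Tuple[float, float]]


def random_conformation(n_residues: int, rng: random.Random) -> Conformation:
    """Draw every phi/psi angle uniformly from [-180, 180) degrees."""
    return [
        (rng.uniform(-180.0, 180.0), rng.uniform(-180.0, 180.0))
        for _ in range(n_residues)
    ]


def degrees_of_freedom(conf: Conformation) -> int:
    """Two torsion angles per amino acid."""
    return 2 * len(conf)


def _wrap(delta: float) -> float:
    """Map an angle difference into [-180, 180) so 179 and -179 are 2 apart."""
    return (delta + 180.0) % 360.0 - 180.0


def angular_distance(a: Conformation, b: Conformation) -> float:
    """Euclidean distance over wrapped phi/psi differences (an assumed metric)."""
    if len(a) != len(b):
        raise ValueError("conformations must have the same number of residues")
    total = 0.0
    for (phi_a, psi_a), (phi_b, psi_b) in zip(a, b):
        total += _wrap(phi_a - phi_b) ** 2 + _wrap(psi_a - psi_b) ** 2
    return math.sqrt(total)
```

Under this representation a 150-residue protein has 300 degrees of freedom, which is the scale the roadmap has to sample.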
[Figure: protein backbone with atoms labeled N, CA, C, O, H, and R]

Node Generation
Node generation can be biased to some known target conformation. We sample around it, gradually growing out.
[Figure: sampled nodes around the target state, with a denser distribution around the native state]

Validation
In the past, we have been able to extract low energy pathways, validate secondary structure formation order, and observe general and consistent trends in reaction coordinates such as native contacts and RMSD. We have been able to extend this validation with folding rates, population kinetics, and reaction coordinates.

Summary
We provide a motion planning approach to study protein folding. We show how rigidity analysis can be used to improve the node generation process for building the roadmap. This method can be made more robust by using policy learning to reward or punish the use of parameter values, which can help the researcher identify highly intricate parameter sets.
For more information, please visit parasol.tamu.edu/foldingserver

Rigidity Analysis for Protein Folding
Rigidity analysis determines a structure’s rigid and flexible components. It is fast and efficient; we can apply rigidity analysis to every conformation we sample.
[Figure labels: independent, redundant]

Rigidity-Biased Sampling
To perturb a conformation, we first determine the flexibility of the backbone bonds. Once the regions have been identified as rigid or flexible, we can use different probabilities of bending the angle at each residue and different values for how much we bend it. We accept or reject conformations based on their energy, as before.

The degrees of freedom (dof) can be grouped into 3 categories:
Independently flexible
Rigid components
Dependently flexible

Parameter Selection

Parameters
Nodes are generated by perturbing previously generated, valid conformations.
We use four parameters in the process:
P_flex = the probability of perturbing the protein in flexible regions
P_rigid = the probability of perturbing the protein in rigid regions
Angle_flex = the angle of perturbation used in flexible regions
Angle_rigid = the angle of perturbation used in rigid regions
[Figure labels: 1 DOF, 2 DOF, rigid, flexible]

Improving Node Generation with MDP Policy Learning
A Markov Decision Process (MDP) is a learning process that involves choosing an action for a given state and receiving a reward for the outcome of that action. The goal of an MDP is to maximize the expected rewards.
[Figure: example state diagram of an MDP]
S : a finite set of states
A : a finite set of actions
R(s) : a reward function
Each action has a finite number of state transitions, each with an associated probability. The outcome is either rewarded or punished. Policy learning is a simple algorithm for choosing an action using these rewards.

Policy Learning Algorithm
1) Choose an action.
2) Generate a reward from the outcome of the action.
3) Add the reward to a score for that action.
4) Bias the selection of parameter sets towards actions with higher scores.

Choosing an Action
An action is selected using the roulette wheel method. The proportion of the wheel that is allotted to a given action is:
[equation missing in source]
The user can also specify a learning rate, which sets the probability of choosing a random action.

Policy Training
The researcher can specify one or more value sets, each containing the possible values for each of the four parameters. From a value set we can pick a parameter set, containing:
{ P_flex, P_rigid, Angle_flex, Angle_rigid }
There are many permutations of parameter sets that can be chosen from a value set. Each one is a possible action, so the number of actions in the MDP is the number of permutations.

Rewarding
Bonds in the folded protein constrain the movement of the protein in rigid regions. Flexible regions of the protein are not constrained in a bonded pair.
Bending a residue in a rigid region may break a bonded pair and lead to a more unfolded conformation. To score a conformation, we use the number of bonds that the conformation shares with the native state. These are called native contacts.

Rewards
Each permutation of the four parameters with the current value set has its own individual score. All scores are initialized to the same value and return to that value whenever they are reset. The rewards are generated for each parameter set based on the resulting conformation:
Large reward: it is in the current layer being filled.
Small reward: it is in another layer that is not full.
Penalty: it is in a layer that is full, or it is not energy feasible.
We slice the funnel into layers to gain an even spread of conformations. If multiple value sets are specified, the value sets are assigned to different sections of the funnel. We fill the funnel layer by layer, and reset the scores when entering a new section of the funnel.

Preliminary Results
We first studied how these parameter values affect the resulting conformations under various metrics, and found the number of collisions and the number of native contacts to be interesting.
[Figure: 4D graph of the average number of native contacts of the resulting conformations for each parameter set]

Motivation for Method
The preliminary results showed that the values for each parameter are highly dependent on each other, which makes it difficult to choose the right parameter values from the start. There is a large set of possibilities, and the best parameter set changes as nodes are generated in different areas of the funnel. This motivated an automated approach that learns which parameter sets work well and chooses them accordingly.
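The action space and reward rules described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the dictionary keys, the function names, the reward magnitudes (`LARGE_REWARD`, `SMALL_REWARD`, `PENALTY`), the starting score, and the idea of a fixed per-layer capacity are all hypothetical, since the poster does not give these values.

```python
import itertools
from typing import Dict, List, Sequence, Tuple

# A parameter set is (p_flex, p_rigid, angle_flex, angle_rigid).
ParameterSet = Tuple[float, float, float, float]

# Assumed reward magnitudes; the poster only says "large", "small", "penalty".
LARGE_REWARD = 2.0
SMALL_REWARD = 0.5
PENALTY = -1.0


def enumerate_actions(value_set: Dict[str, Sequence[float]]) -> List[ParameterSet]:
    """Every way of picking one value per parameter is one MDP action."""
    return list(
        itertools.product(
            value_set["p_flex"],
            value_set["p_rigid"],
            value_set["angle_flex"],
            value_set["angle_rigid"],
        )
    )


def initial_scores(actions: List[ParameterSet], start: float = 1.0) -> Dict[ParameterSet, float]:
    """All scores start equal; calling this again models a reset."""
    return {action: start for action in actions}


def reward(
    layer: int,
    current_layer: int,
    layer_counts: List[int],
    layer_capacity: int,
    energy_feasible: bool,
) -> float:
    """Reward for a new conformation that landed in funnel layer `layer`."""
    if not energy_feasible or layer_counts[layer] >= layer_capacity:
        return PENALTY  # infeasible, or the layer is already full
    if layer == current_layer:
        return LARGE_REWARD  # helps fill the layer currently being filled
    return SMALL_REWARD  # lands in some other layer that still has room
```

With a value set of 2, 2, 3, and 3 candidate values, `enumerate_actions` yields 36 actions; after each sampled conformation, the chosen action's score would be updated with `scores[action] += reward(...)`.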
Results
After supplying the learning algorithm with a subset of the parameter values from before, we can see that the policy learning significantly biases the selection of parameter values towards higher scoring ones.
A histogram of the number of times a parameter value is selected shows that the selection of parameters changes over time.
[Figure: 4D graph of the number of times each parameter set was chosen]
The graph shows that the chosen parameter values are intricately interdependent, as reflected in their scores.

Learning Rate
The learning rate determines the probability of choosing a random parameter set. A low learning rate encourages the selection of higher scoring parameter sets. Further work could study the effect that learning rates and other selection methods have on how quickly good parameter sets are isolated.
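The interplay between the learning rate and roulette wheel selection could look like the sketch below. It rests on assumptions: the poster's equation for the wheel proportion is not available, so the sketch uses the common choice of each action's score divided by the sum of all scores, and it clamps scores to a small positive floor so penalized actions keep a valid weight. The function names and the floor value are hypothetical.

```python
import random
from typing import Dict, Hashable

SCORE_FLOOR = 1e-6  # assumed: keeps wheel weights positive after penalties


def roulette_select(scores: Dict[Hashable, float], rng: random.Random) -> Hashable:
    """Pick an action with probability proportional to its (clamped) score."""
    actions = list(scores.keys())
    weights = [max(scores[a], SCORE_FLOOR) for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def choose_action(
    scores: Dict[Hashable, float],
    learning_rate: float,
    rng: random.Random,
) -> Hashable:
    """With probability `learning_rate` pick uniformly at random; otherwise
    fall back to roulette wheel selection over the current scores."""
    if rng.random() < learning_rate:
        return rng.choice(list(scores.keys()))
    return roulette_select(scores, rng)
```

A learning rate of 0 always uses the wheel, so high scoring parameter sets dominate; a learning rate of 1 ignores the scores entirely and samples parameter sets uniformly.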