2. Introduction to Rosetta and structural modeling

Short history – in the beginning…
Initial goal: find the structure given the sequence.
Motivation: structure defines function.
Challenge: many degrees of freedom.
Rosetta started as a program aimed at finding the structure of a protein from its sequence alone (ab initio). This has been achieved for relatively small proteins, with reliable fold identification for short proteins, and was more recently improved to yield high-resolution models (within 2 Å RMSD).

A short history of Rosetta
The success of the ab initio protocol led to extensions to:
Protein design (design of a new fold: Top7)
Protein loop modeling; homology modeling
Protein-protein docking; protein interface design
Protein-ligand docking
Protein-DNA interactions; RNA modeling
Many more, e.g. solving the phase problem in X-ray crystallography
(Figure shows the sequence ATCSFFGRKLL…)

The basic assumption in structure prediction
The native structure is located at the global minimum (free) energy conformation (GMEC).
A good energy function can select the correct model among decoys.
A good sampling technique can find the GMEC in the rugged landscape.
The assumption is that there is a very deep energy funnel with the native conformation at its bottom, so that the large energy gap prevents the protein from escaping to alternative conformations most of the time. The advantage is that even with errors in the energy function, the signal can still be identified.
(Figure: energy E plotted over conformation space, with the GMEC marked.)

How to approach such a problem
Sampling:
Reduce the system's degrees of freedom (DOF)
Assume ideal bond lengths and angles
Rough sampling until reaching the vicinity of the global minimum
Scoring:
Scoring function based in part on the Boltzmann principle, correlating frequency to energy
In these simulations we assume ideal bond lengths and bond angles, as seen in most crystal structures. This removes many of the system's DOFs, but many still remain, so Rosetta has additional “tricks”, e.g. applying the more accurate, computationally expensive sampling only in low-energy regions.

Two-Step Procedure
Low-resolution step locates potential minima (fast)
Cluster analysis identifies the broadest basins in the landscape
High-resolution step can identify the lowest energy minimum in the basins (slow)
The first step is performed in a smooth energy landscape, where a rough representation of the protein dramatically reduces the DOFs and allows identification of low-energy basins. Next, a rugged space is used, where small changes lead to drastic changes in energy. Such a space requires more gentle sampling.
(Figure: energy E plotted over conformation space, with the GMEC marked.)

Low-Resolution Step
Structure Representation:
Equilibrium bonds and angles (Engh & Huber 1991)
Centroid: average location of the side-chain center of mass (centroid | aa, φ, ψ)
No modeling of side chains
Fast
This step is aimed at identifying the approximate fold of the protein, without interaction details. All backbone atoms are represented, including their hydrogens, while each side chain is replaced by a sphere with the approximate size and properties of that side chain. Using statistical analysis, we can derive properties of the centroids (e.g. for Phe in beta strands, calculate the location of the side-chain center of mass). The picture shows a structure at the end of this step.

Low-Resolution Scoring Function (e.g. score4)
Bayes' theorem: P(str | seq) = P(str) * P(seq | str) / P(seq)
P(str): structure-dependent features
P(seq | str): sequence-dependent features
P(seq): constant (for structure prediction)
Independent components prevent over-counting.
The scoring function is decomposable, with its terms independent of each other. With Bayes' theorem, the probability of a structure given a certain sequence decomposes into [the probability of getting such a structure (comprising terms such as the probability of such compactness) * the probability of the sequence given that structure] / the probability of the sequence. In structure prediction the sequence is fixed, so this last probability is a constant and can be ignored.

Sequence-Dependent Components
Bayes' theorem: P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = Senv + Spair + …
Terms in which a probability is calculated are based on statistics of known, accurate structures.
Environment term: burial of a residue. Count how many Cβ atoms are found within 10 Å of its Cβ (neighbors: Cβ-Cβ < 10 Å). The term is the probability of finding this amino acid given that degree of burial. It implicitly accounts for solvation: hydrophobic residues have more neighbors in a well-solvated structure.
Pair term: the probability of finding two residues, at some sequence separation, at a given distance. This term is normalized by the probability of each of them being found at such a distance from any amino acid, because that information is already accounted for in the env term; the normalization keeps the terms independent. It implicitly accounts for electrostatics and disulfide bond formation: cysteines, and residues of opposite charge, are found together more often.
Rohl et al. (2004) Methods in Enzymology 383:66
Origin: Simons et al., JMB 1997; Simons et al., Proteins 1999
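The burial count behind the environment term can be sketched in a few lines. This is only an illustration of the neighbor-counting step described above, not Rosetta's implementation: the function name and the coordinates are hypothetical, and the statistical table that turns a count into a score is not shown.

```python
import math

def count_neighbors(cb_coords, index, cutoff=10.0):
    """Count residues whose C-beta lies within `cutoff` angstroms of residue `index`."""
    center = cb_coords[index]
    return sum(
        1
        for j, other in enumerate(cb_coords)
        if j != index and math.dist(center, other) < cutoff
    )

# Hypothetical C-beta coordinates (angstroms) for four residues.
cb = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 9.0, 0.0), (20.0, 0.0, 0.0)]
neighbor_counts = [count_neighbors(cb, i) for i in range(len(cb))]
print(neighbor_counts)
```

In a real protein, a buried residue would typically have a much higher count than an exposed one, and the env term would look up the probability of each amino acid type at that count.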

Structure-Dependent Components
P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = … + Srg + Scb + Svdw + …
Rg: radius of gyration (the root mean square distance of the Cα atoms from their centroid, which can equivalently be computed from the squared distances over all Cα atom pairs). It is relatively small for a globular, well-packed structure. It implicitly accounts for attraction (residues near in space) and for solvation, which reduces the value compared to a non-solvated protein.
Cβ: how compact the structure is compared to random burial of amino acids. Corrects for the lack of interactions with the surrounding waters, which were excluded.
Repulsive component of VdW: a penalty for each pair of residues found at a distance smaller than the sum of their VdW radii (the radius represents, for each atom, the region that must not be penetrated).
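As a rough illustration of the Rg term, the sketch below computes the radius of gyration in two equivalent ways: from distances to the centroid and from all pairwise distances. The coordinates are hypothetical, and Rosetta's exact normalization of the rg score is not reproduced here.

```python
import math

def radius_of_gyration(ca_coords):
    """Root mean square distance of the points from their centroid."""
    n = len(ca_coords)
    centroid = [sum(p[k] for p in ca_coords) / n for k in range(3)]
    mean_sq = sum(math.dist(p, centroid) ** 2 for p in ca_coords) / n
    return math.sqrt(mean_sq)

def radius_of_gyration_pairwise(ca_coords):
    """Same quantity from squared distances over all ordered pairs."""
    n = len(ca_coords)
    total = sum(math.dist(p, q) ** 2 for p in ca_coords for q in ca_coords)
    return math.sqrt(total / (2 * n * n))

# Hypothetical C-alpha positions: a compact square and an extended line.
compact = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
extended = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

print(radius_of_gyration(compact), radius_of_gyration(extended))
```

The compact arrangement gives the smaller value, which is the behavior the score rewards.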

Structure-Dependent Components
P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = … + Srama + …
Rama: given an amino acid with some secondary structure, what are the odds of finding it in this φ, ψ bin? Bins span 10-degree intervals, because the odds of observing any exact φ, ψ value pair are ~0.

Low resolution step over!
This is the kind of structure that results from the low-resolution step. The backbone is still free to move in the following step, in the vicinity of the fold achieved here, since side-chain modeling might influence it. The sampling there is finer, however, so we do not stray too far from this model.

High-Resolution Step
Structure Representation: slow, exact step
All-atom (including polar and non-polar hydrogens, but no water)
Side chains as rotamers from a backbone-dependent library
Side-chain conformation adjusted frequently
High-resolution scoring function, e.g. score12; Talaris; …
Locates the global energy minimum
Backbone changes in this step are moderate
For each low-energy basin found in the low-resolution step, a more fine-tuned, high-resolution search is performed. A rotamer is a combination of side-chain dihedral angles (denoted chi1, chi2, etc.) that is commonly observed in proteins for a given amino acid.
Dunbrack 1997

High-Resolution Step: Rotamer Libraries
Side chains have preferred conformations
They are summarized in rotamer libraries
Select one rotamer for each position
Best conformation: lowest-energy combination of rotamers
Rotamer libraries are based on statistics of known structures. Side-chain preferences stem from an attempt to minimize side chain-backbone steric hindrance. Each peak in the distribution is represented by a rotamer.
Serine χ1 preferences: t = 180°, g+ = +60°, g− = −60°
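To make "lowest-energy combination of rotamers" concrete, here is a toy exhaustive search over two positions. All energies and names are hypothetical, and exhaustive enumeration is only feasible for tiny cases like this one; practical side-chain packers rely on heuristic searches instead.

```python
from itertools import product

def best_rotamer_combination(self_energy, pair_energy):
    """Try every rotamer combination and return the lowest-energy one.

    self_energy: {position: [energy of rotamer 0, rotamer 1, ...]}
    pair_energy: {(pos_a, rot_a, pos_b, rot_b): energy}, with pos_a < pos_b
    """
    positions = sorted(self_energy)
    best, best_e = None, float("inf")
    for combo in product(*(range(len(self_energy[p])) for p in positions)):
        e = sum(self_energy[p][r] for p, r in zip(positions, combo))
        for i in range(len(positions)):
            for j in range(i + 1, len(positions)):
                key = (positions[i], combo[i], positions[j], combo[j])
                e += pair_energy.get(key, 0.0)
        if e < best_e:
            best, best_e = dict(zip(positions, combo)), e
    return best, best_e

# Hypothetical energies: rotamer 0 at position 1 clashes with rotamer 1 at position 2.
self_energy = {1: [0.0, 1.0], 2: [0.5, 0.0]}
pair_energy = {(1, 0, 2, 1): 5.0}

best, best_e = best_rotamer_combination(self_energy, pair_energy)
print(best, best_e)
```

Picking the best rotamer at each position independently would choose the clashing pair, which is why the combination has to be scored as a whole.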

High-Resolution Step – Major Contributions
Burial of hydrophobic groups away from water
Void-free packing of buried groups and atoms
Buried polar atoms form intra-molecular hydrogen bonds
At the end of this step the structure should behave like a native one: no buried unsatisfied polar atoms (polar atoms that make no polar interaction), and pockets only where a water molecule can fit in, with no voids smaller than a water molecule.

High-Resolution Scoring Function (score12)
Packing interactions
Score = SLJ(atr + rep) + …
The Lennard-Jones equation describes VdW interactions between two atoms. The equation is linearized in the region where repulsion is dominant: models can only jump between discrete side-chain rotamers, so we cannot make the very small changes that would relieve a clash, and instead we simply do not penalize the clash as dramatically. Between 5 and 5.5 Å there is another linearization, down to 0, so that we do not have to calculate interactions with faraway atoms that barely influence each other.
(Figure: energy as a function of rij, showing the linearized repulsive part, the linearized attractive part and the cutoff d = 5.5 Å; e: well depth from CHARMm19.)
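The shape described above can be sketched as a 12-6 potential with two linear pieces. This is an illustration under stated assumptions, not Rosetta's code: the switch point for the repulsive linearization (`switch_frac = 0.6` of the minimum-energy distance) and the sample values of `sigma` and `eps` are assumptions; only the 5 to 5.5 Å ramp comes from the slide.

```python
def lj_12_6(r, sigma, eps):
    """Standard 12-6 Lennard-Jones energy with its minimum (-eps) at r = sigma."""
    ratio6 = (sigma / r) ** 6
    return eps * (ratio6 * ratio6 - 2.0 * ratio6)

def lj_12_6_slope(r, sigma, eps):
    """Derivative dE/dr of the 12-6 potential."""
    ratio6 = (sigma / r) ** 6
    return eps * (-12.0 * ratio6 * ratio6 + 12.0 * ratio6) / r

def softened_lj(r, sigma, eps, switch_frac=0.6, ramp_start=5.0, cutoff=5.5):
    """12-6 potential, linear below switch_frac * sigma and ramped to zero at the cutoff."""
    r_switch = switch_frac * sigma  # assumed switch point for the repulsive linearization
    if r >= cutoff:
        return 0.0
    if r > ramp_start:
        # Linear ramp from the value at ramp_start down to zero at the cutoff.
        return lj_12_6(ramp_start, sigma, eps) * (cutoff - r) / (cutoff - ramp_start)
    if r < r_switch:
        # Tangent-line extrapolation: clashes are penalized, but not explosively.
        return lj_12_6(r_switch, sigma, eps) + lj_12_6_slope(r_switch, sigma, eps) * (r - r_switch)
    return lj_12_6(r, sigma, eps)

# Hypothetical atom pair: minimum at 4.0 angstroms, well depth 0.1.
for r in (1.0, 3.0, 4.0, 5.25, 6.0):
    print(r, softened_lj(r, sigma=4.0, eps=0.1))
```

At a severe clash (r = 1.0) the softened value stays in the hundreds, whereas the unmodified 12-6 form would be in the millions for these parameters.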

Implicit solvation
Score = … + Ssolvation + …
Excluded-volume implicit solvation model: penalizes buried polars.
The slide's equation is not fully recoverable from the transcript; it contains the terms $\exp(-x_{ij}^2)$ and $\exp(-x_{ji}^2)$, with $x_{ij} = (r_{ij} - R_i)/\lambda_i$, built on the solvation free energy density of atom i.
The idea is that each atom has a reference energy, its energy when surrounded by water only, and from that we subtract a contribution for each surrounding atom that replaces water. The VdW interactions calculated before come at the price of losing the surrounding waters, so this term accounts for that loss too, without actually modeling water molecules (which is computationally expensive).
Lazaridis & Karplus, Proteins 1999

Solvation energy
Excluded-volume implicit solvation model: penalizes buried polars.
The solvation free energy density of atom i is assumed to be approximated by a Gaussian distribution:
$f_i(r)\,4\pi r^2 = \alpha_i \exp(-x_i^2)$, with $x_i = (r - R_i)/\lambda_i$
$\lambda_i = 3.5$ Å (6.0 Å for de-ionized groups): correlation length (width of the first, or first two, solvation shells)
$\alpha_i = 2\,\Delta G_i^{\mathrm{free}} / (\sqrt{\pi}\,\lambda_i)$: proportionality coefficient
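A small numerical check helps show why the proportionality coefficient has that form: integrating the Gaussian density over all space outside the atom's radius recovers the atom's solvation free energy. The sketch below assumes hypothetical values for the radius and the free energy; only the 3.5 Å correlation length comes from the slide.

```python
import math

def solvation_density(r, r_vdw, lam, dg_free):
    """f_i(r): solvation free energy density of atom i at distance r from its center."""
    alpha = 2.0 * dg_free / (math.sqrt(math.pi) * lam)
    x = (r - r_vdw) / lam
    return alpha * math.exp(-x * x) / (4.0 * math.pi * r * r)

def integrated_density(r_vdw, lam, dg_free, r_max=60.0, steps=20000):
    """Midpoint-rule integral of f_i(r) * 4*pi*r^2 from the atom radius outwards."""
    h = (r_max - r_vdw) / steps
    total = 0.0
    for k in range(steps):
        r = r_vdw + (k + 0.5) * h
        total += solvation_density(r, r_vdw, lam, dg_free) * 4.0 * math.pi * r * r * h
    return total

# Hypothetical atom: radius 2.0 angstroms, free energy -5.0 (arbitrary energy units).
recovered = integrated_density(r_vdw=2.0, lam=3.5, dg_free=-5.0)
print(recovered)
```

Each neighboring atom that occupies part of this space removes its share of the density, which is how burial of a polar atom ends up being penalized.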

Hydrogen Bonding Energy
(Kortemme, Morozov & Baker 2003 JMB)
Score = … + Shb(srbb + lrbb + sc) + …
srbb: short-range backbone HB
lrbb: long-range backbone HB
sc: HB with a side-chain atom
H-bonds are scored separately for side chains and for the backbone with itself, the latter between residues near or far in sequence. Bonds involving protein backbones indicate secondary structure. Backbone-sidechain hydrogen bonds are also split into short-range and long-range, but this time according to the physical distance between the hydrogen and the acceptor. This is because the angular distributions of side-chain–side-chain H-bonds differ for the two distance ranges. Residues participating in backbone-backbone H-bonds are not allowed to participate in a backbone-sidechain hydrogen bond (there was a bias towards residues that can form both types).
(Figure: white bars show the standard potential.) Based on statistics from high-resolution structures in the PDB.
Slide from Jeff Gray

Rotamer preference
Score = … + Sdunbrack + …
The probability of the side chain is calculated as [probability of seeing that conformation given a φ, ψ value * probability of seeing an amino acid given that φ, ψ] / probability of that amino acid.
Dunbrack, 1997

Scoring Function: Summary
One long, generic function …
Score = Senv + Spair + Srg + Scb + Svdw + Sss + Ssheet + Shs + Srama + Shb(srbb + lrbb) + docking_score + Sdisulf_cent + Srs + Sco + Scontact_prediction + Sdipolar + Sprojection + Spc + Stether + Sfy + Sw + Ssymmetry + Ssplicemsd + …
docking_score = Sd env + Sd pair + Sd contact + Sd vdw + Sd site constr + Sd + Sfab score
Score = SLJ(atr + rep) + Ssolvation + Shb(srbb + lrbb + sc) + Sdunbrack + Spair – Sref + Sprob1b + Sintrares + Sgb_elec + Sgsolt + Sh2o(solv + hb) + S_plane

Scoring Function: Summary
One long, generic function …
A weighted sum of different terms, knowledge-based and physics-based:
Score12 = w1*SLJatr + w2*SLJrep + w3*Ssolvation + w4*Shb(srbb + lrbb + sc) + w5*Sdunbrack + w6*Spair – Sref
The overall score of the protein is the sum of these different considerations, weighted according to the desired impact of each term in our protocol. Different protocols sometimes use a different weight set: in docking, for instance, the weights of terms related to the interaction between the proteins are increased, while the weights of terms affecting mainly the structure of each monomer are decreased.
How can it be improved?
Feature Analysis Tool: improve parameters
OptE: optimize weights
Leaver-Fay, …, & Baker (2013) Methods in Enzymology 523:109
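The weighted-sum structure can be sketched as follows. The term values and both weight sets are hypothetical numbers chosen for illustration only; they are not Score12's actual weights.

```python
def total_score(terms, weights, reference_energy=0.0):
    """Weighted sum of score terms minus a reference energy."""
    return sum(weights[name] * value for name, value in terms.items()) - reference_energy

# Hypothetical raw term values for one model.
terms = {
    "lj_atr": -120.0,
    "lj_rep": 15.0,
    "solvation": 60.0,
    "hbond": -30.0,
    "dunbrack": 20.0,
    "pair": -5.0,
}

# Hypothetical weight sets: a default one, and one that up-weights hydrogen bonds.
default_weights = {"lj_atr": 0.8, "lj_rep": 0.4, "solvation": 0.6,
                   "hbond": 1.0, "dunbrack": 0.5, "pair": 0.5}
reweighted = dict(default_weights, hbond=2.0)

score_default = total_score(terms, default_weights, reference_energy=3.5)
score_reweighted = total_score(terms, reweighted, reference_energy=3.5)
print(score_default, score_reweighted)
```

Keeping the raw terms separate from the weights is what lets one set of terms serve several protocols, each with its own weight set.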

How are scoring terms optimized?
Nature uses one scoring function…
Aim: one generic function for different applications
Optimization of parameters:
Originally from small molecules (experiments and quantum mechanical calculations)
Today: use of protein structures solved at high accuracy
Benchmarks:
Discriminate the ground state from alternative conformations
Identify the correct side-chain conformation
Predict the effect of point mutations on stability (ΔΔG)
We would like to have one energy function for all the desired tasks. Each task can reveal new global truths that improve the generic score function. Some parameters are optimized using potentials from quantum mechanical calculations, and we complement these terms with observations from protein structures. To optimize a proposed function, benchmarks evaluate its success:
Ground-state discrimination: take different structures solved by X-ray crystallography, perturb them to different non-native conformations and optimize the scoring function until near-native conformations are preferred over the alternatives.
Side-chain recovery: side chains are stripped from a correct backbone conformation and need to be remodeled. This is a subtask of the previous benchmark and is required for its success.
ΔΔG prediction: ΔG is the change in free energy when a protein folds (compared to the unfolded state). ΔΔG is the difference between the ΔG of a wild-type protein and that of a mutant, and it represents the energetic effect of the mutation. Predicting it requires only small changes to the protein, so if this fails, larger structural perturbations will not be modeled successfully either.
Leaver-Fay, …, & Baker (2013) Methods in Enzymology 523:109

Feature Analysis: improve scoring term
e.g. H-bond distance H–Oγ in Ser & Thr
Aim: similar distributions in crystal structures and models
Leaver-Fay, …, & Baker (2013) Methods in Enzymology 523:109

Feature Analysis: improve scoring term
e.g. H-bond distance H–Oγ in Ser & Thr
After correction: the distributions in native and model structures overlap
Figure 6.3: H-bond length distributions for hydroxyl donors (SER/THR) to backbone oxygens. The thick curves are kernel density estimations from observed data, normalized for equal volume per unit distance. The black curve in the background of each panel represents the Native sample source.
(A) Boltzmann distribution for the length term in the Rosetta H-bond model with the Score12 and NewHB parameterizations.
(B) Relaxed Natives with the Score12 energy function. The excessive peakiness is due to a discontinuity in the Score12 parameterization of the H-bond model.
(C) Relaxed Natives with the NewHB energy function.
(D) Relaxed Natives with the NewHB energy function and the Lennard-Jones minima between the acceptor and hydroxyl heavy atoms adjusted from 3.0 to 2.6 Å, and between the acceptor and the hydrogen atoms adjusted from 1.95 to 1.75 Å.
Aim: similar distributions in crystal structures and models
Leaver-Fay, …, & Baker (2013) Methods in Enzymology 523:109

OptE: optimize weights
Score12 = w1*SLJatr + w2*SLJrep + w3*Ssolvation + w4*Shb(srbb + lrbb + sc) + w5*Sdunbrack + w6*Spair – Sref
Maximum likelihood parameter estimation
Benchmarks:
Discriminate the ground state from alternative conformations
Identify the correct side-chain conformation
Sequence recovery in design: choose the correct amino acid residue
Predict the effect of point mutations on stability (ΔΔG)
& more …
Aim: best score for the correct prediction
Leaver-Fay, …, & Baker (2013) Methods in Enzymology 523:109

Talaris-2013 scoring function
Parameter changes in some terms (e.g. improved geometries for H-bonds and disulfides; new rotamer libraries)
Better interpolation of knowledge-based terms
Explicit Coulombic electrostatic term with a distance-dependent dielectric; removal of the fa_pair term
(Figure: energy as a function of rij, with cutoff d = 6.0 Å.)
In this course we use a relatively old version of Rosetta, since it is considered stable. The newer, recommended versions include significant improvements to the scoring function. The most prominent change is the addition of an explicit electrostatic term, together with the removal of the fa_pair term, which was meant to account for electrostatics implicitly through the preference of charged residues to be near each other.
Leaver-Fay et al., Methods in Enzymology 2013
O’Meara et al., J. Chem. Theory Comput. 2015

Representations of protein structure: Cartesian and polar coordinates
(Figure: a PDB excerpt listing ATOM records with x, y, z coordinates for the N, CA, C and O atoms of a GLN residue in chain A, next to an internal-coordinate table with columns Position, PHI, PSI, OMEGA, CHI1, CHI2, CHI3, CHI4.)

2 ways to represent the protein structure
Cartesian coordinates (x, y, z; PDB format)
Polar coordinates (φ-ψ-ω; equilibrium angles and bond lengths)
In the PDB a protein is represented as a set of atoms, each with its own x, y, z coordinates. Once we assume fixed bond lengths and bond angles, an alternative representation can store the first 3 atoms of the first residue as coordinates, and then represent all other atoms only by their dihedral angles (φ, ψ, ω) relative to the previous atoms.

The two representations in 2D
Same idea in 2D, for simplification.
Cartesian representation:
points: (0,0), (1,1), (1,2), (2,2), (3,3)
connections (of predefined length): 1-2, 2-3, 3-4, 4-5
Internal coordinates:
bond lengths (predefined): r = √2, 1, 1, √2
angles: θ = 45°, 90°, 0°, 45°

The two representations in 2D
Constraint: keep bond lengths fixed upon movement.
Cartesian representation: (0,0), (1,1), (1,2), (2,2), (3,3) → (0,0), (1,1), (1,2), (2,2), (3,0). Bond length changed! The last bond, from (2,2) to the moved point, goes from √2 to √5.
Polar coordinates: 45°, 90°, 0°, 45° → 45°, 90°, 45°, 45°. Bond length unchanged!
The advantage of the polar representation over Cartesian coordinates is that a change in an angle never changes bond lengths, so the structure is not distorted.

Polar Cartesian coordinates
Convert r and q to x and y x y √2,1,1,√2 450,90o,0o,45o (0,0),(1,1),(1,2),(2,2),(3,3) From wikipedia

Cartesianpolar coordinates
Convert x and y to r and q x y (0,0),(1,1),(1,2),(2,2),(3,3) √2,1,1,√2 450,90o,0o,45o
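The 2D example above can be reproduced with a short sketch. It assumes, as the slide's numbers suggest, that each angle is the direction of a bond measured from the x-axis; the function names are illustrative only.

```python
import math

def polar_to_cartesian(lengths, angles_deg, origin=(0.0, 0.0)):
    """Build points one bond at a time from bond lengths and bond directions."""
    points = [origin]
    for length, angle in zip(lengths, angles_deg):
        x, y = points[-1]
        theta = math.radians(angle)
        points.append((x + length * math.cos(theta), y + length * math.sin(theta)))
    return points

def cartesian_to_polar(points):
    """Recover bond lengths and bond directions (degrees) from consecutive points."""
    lengths, angles = [], []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        lengths.append(math.hypot(x1 - x0, y1 - y0))
        angles.append(math.degrees(math.atan2(y1 - y0, x1 - x0)))
    return lengths, angles

lengths = [math.sqrt(2), 1.0, 1.0, math.sqrt(2)]
angles = [45.0, 90.0, 0.0, 45.0]

points = polar_to_cartesian(lengths, angles)  # close to (0,0),(1,1),(1,2),(2,2),(3,3)

# Move in polar space: change the third angle from 0 to 45 degrees.
moved = polar_to_cartesian(lengths, [45.0, 90.0, 45.0, 45.0])
moved_lengths, _ = cartesian_to_polar(moved)

# Move in Cartesian space: shift the last point from (3,3) to (3,0).
distorted_lengths, _ = cartesian_to_polar([(0, 0), (1, 1), (1, 2), (2, 2), (3, 0)])
print(moved_lengths, distorted_lengths)
```

The polar move leaves every bond length as it was, while the Cartesian move stretches the last bond to √5.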

Moving to a 3D world
Cartesian representation:
points, with an additional z-axis: (0,0,0), (1,1,0), (1,2,0), (2,2,0), (3,3,0)
connections (predefined): 1-2, 2-3, 3-4, 4-5
Internal coordinates:
bond lengths (predefined): √2, 1, 1, √2
angles: 45°, 90°, 0°, 45°
dihedral angles: 180°, 180°
Proteins: bond lengths and angles are fixed. Only dihedral angles are varied.

2 ways to represent the protein structure
Cartesian coordinates (x, y, z; PDB format):
Intuitive: look at molecules in space
Easy calculation of the energy score (based on atom-atom distances)
Difficult to change the conformation of the structure (while keeping bond lengths and bond angles unchanged)
Polar coordinates (φ-ψ-ω; equilibrium angles and bond lengths):
Compact (3 values/residue)
Easy changes of protein structure (turn around one or more dihedral angles)
Non-intuitive
Difficult to evaluate the energy score (calculation of the neighbor matrix is complicated)
Another advantage of polar coordinates is that the representation is compact: less information needs to be stored to know where each atom lies. The disadvantages of this representation are exactly the advantages of the Cartesian representation.

Solution: toggle
MOVE STRUCTURE (polar coordinates): introduce changes in the structure by rotating around dihedral angle(s), i.e. change φ-ψ values
Transform: build positions in space according to the dihedral angles
CALCULATE ENERGY (Cartesian coordinates): derive the distance matrix (neighbor list) for the energy score calculation
Transform: calculate dihedral angles from the coordinates
It is possible to enjoy the advantages of both representations, because it is rather easy to convert one into the other. The workflow is as follows: move the structure in polar space (without distorting it); build the Cartesian representation from the polar coordinates and calculate the energy (which relies heavily on atom distances, very simple in Cartesian space); build the polar representation and move the structure again, and so on.
(Figure: the PDB coordinate listing and the PHI/PSI/OMEGA/CHI table, with the 2D example (0,0), (1,1), (1,2), (2,2), (3,3) ↔ 45°, 90°, 0°, 45°.)

Cartesian polar coordinates
How to calculate polar from Cartesian coordinates: example F: C’-N-Ca-C define plane perpendicular to N-Ca (b2) vector calculate projection of Ca-C (b3) and C’-N (b1) onto plane calculate angle between projections PDB x y z … ATOM C GLN A N ATOM N GLY A C ATOM CA GLY A C ATOM O GLY A O ….. …. Position PHI PSI OMEGA CHI1 CHI2 CHI3 CHI4 ….. 33 34 …. … (0,0),(1,1),(1,2),(2,2),(3,3) 450,90o,0o,45o
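The projection recipe above translates almost line by line into code. The sketch below is a plain-Python illustration with hypothetical coordinates; it returns a signed angle in degrees, with the sign convention (positive for a clockwise turn when looking along the central bond) being an assumption of this sketch.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def scale(u, s):
    return tuple(a * s for a in u)

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle p0-p1-p2-p3 in degrees (e.g. phi for C'-N-CA-C)."""
    back = sub(p0, p1)        # from N back to C'
    axis = sub(p2, p1)        # N -> CA, the bond being rotated about
    forward = sub(p3, p2)     # CA -> C
    axis = scale(axis, 1.0 / math.sqrt(dot(axis, axis)))
    # Project both outer bonds onto the plane perpendicular to the axis.
    v = sub(back, scale(axis, dot(back, axis)))
    w = sub(forward, scale(axis, dot(forward, axis)))
    # Angle between the two projections, with sign from the axis direction.
    x = dot(v, w)
    y = dot(cross(axis, v), w)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates: three atoms in the xy-plane, fourth atom varied.
c_prev, n_atom, ca_atom = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
print(dihedral_deg(c_prev, n_atom, ca_atom, (1.0, 1.0, 0.0)))   # cis arrangement
print(dihedral_deg(c_prev, n_atom, ca_atom, (1.0, 0.0, 1.0)))   # quarter turn
print(dihedral_deg(c_prev, n_atom, ca_atom, (1.0, -1.0, 0.0)))  # trans arrangement
```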

Polar Cartesian coordinates
Find x,y,z coordinates of C, based on atom positions of C’, N and Ca, and a given F value (F: C’-N-Ca-C) create Ca-C vector: size Ca-C=1.51A (equilibrium bond length) angle N-Ca-C= 111o (equilibrium value for N-Ca-C angle) rotate vector around N-Ca axis to obtain projections of Ca-C and N-C’ with desired F PDB x y z … ATOM C GLN A N ATOM N GLY A C ATOM CA GLY A C ATOM O GLY A O ….. …. Position PHI PSI OMEGA CHI1 CHI2 CHI3 CHI4 ….. 33 34 …. … (0,0),(1,1),(1,2),(2,2),(3,3) 450,90o,0o,45o
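Placing the next atom from three known atoms, a bond length, a bond angle and a dihedral can be sketched as below. The 1.51 Å and 111° values come from the slide; the input coordinates are hypothetical and not realistic protein geometry, the function names are illustrative, and the sign convention (positive for a clockwise turn looking along the central bond) is an assumption of this sketch.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def scale(u, s):
    return tuple(a * s for a in u)

def unit(u):
    return scale(u, 1.0 / math.sqrt(dot(u, u)))

def place_atom(a, b, c, bond_length, bond_angle_deg, torsion_deg):
    """Return d with |c-d| = bond_length, angle b-c-d = bond_angle, dihedral a-b-c-d = torsion."""
    axis = unit(sub(c, b))                       # rotation axis, b -> c
    ab = sub(a, b)
    n = unit(sub(ab, scale(axis, dot(ab, axis))))  # direction of a, projected off the axis
    m = cross(axis, n)                           # completes a right-handed frame
    theta = math.radians(bond_angle_deg)
    tau = math.radians(torsion_deg)
    direction = add(
        scale(axis, -math.cos(theta)),
        add(scale(n, math.sin(theta) * math.cos(tau)),
            scale(m, math.sin(theta) * math.sin(tau))),
    )
    return add(c, scale(direction, bond_length))

# Hypothetical positions for C' (previous residue), N and CA.
c_prev, n_atom, ca_atom = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
c_atom = place_atom(c_prev, n_atom, ca_atom, 1.51, 111.0, -60.0)
print(c_atom)
```

Changing only `torsion_deg` swings the new atom around the N-Cα axis while the bond length and bond angle stay at their equilibrium values.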

Representation of protein structure
3 backbone dihedral angles per residue
Build the coordinates of the structure starting from the first atom, according to the dihedral angles (and equilibrium bond lengths and angles): the atom tree
Sampling and minimization in TORSIONAL space: change an angle and rebuild, starting from the changed angle
In Rosetta a dedicated object, the atom tree, stores the information needed to convert between representations. When a dihedral angle between residue 6 and residue 7 changes, there is no need to recalculate the Cartesian coordinates of all the previous residues, only of those following the change (even if only one dihedral changed, the whole following segment moves along with that rotation).
(Figure: a chain of residues 1 to 8, with residues 7 and 8 rebuilt after the change.)
Based on slides by Chu Wang

Representation of protein structure
Rosetta folding: 3 backbone dihedral angles per residue; sampling and minimization in TORSIONAL space
Rosetta docking: backbone dihedral angles fixed (rigid body); sampling and minimization in RIGID-BODY space; 6 rigid-body DOFs (3 translational vectors, 3 rotational angles)
The atom tree makes it possible to change the representation for all covalently connected residues, but when two bodies interact (e.g. two different proteins, or a protein and a ligand) other degrees of freedom need to be sampled: movement along the x, y, z axes and 3 possible rotations.
How can those two types of degrees of freedom be combined?
(Figure: a folding chain with residues 1 to 8, and two docked rigid chains, 1 to 8 and 1’ to 8’.)

Fold tree representation
Construct fold trees to treat a variety of protein folding and docking problems.
Allows simultaneous optimization of rigid-body and backbone/side-chain torsional degrees of freedom.
Example: fold-tree based docking
“peptide” edge: 3 backbone dihedral angles
“long-range” edge (jump): 6 rigid-body DOFs
The fold tree object is responsible for defining the connectivity between residues in different entities. In the case of two interacting proteins, each polymeric connection within the monomers is represented by a peptide edge, and there is one non-polymeric connection, a jump edge, that defines the orientation of one protein relative to the other.
(Figure: two chains, residues 1 to 8 and 1’ to 8’, each connected by peptide edges and linked to each other by a jump.)
Fold tree: Bradley and Baker, Proteins (2006)

Fold-trees for different modeling tasks
Example: protein folding (N to C)
Figure legend: color – flexible backbone (bb); gray – fixed bb; flexible “peptide” edge; rigid “peptide” edge; rigid “jump”; flexible “jump”. N: N-terminal; C: C-terminal; X: chain break; O: root of the tree.

Fold-tree for loop modeling
(Figure: a chain with two loops, each bridged by a jump, 1-1’ and 2-2’, and each containing a chain break x.)
If we have a protein fold we want to preserve but want to model two loops in it, we can define which residues remain fixed. A rigid jump between the two ends of each loop defines how the structure continues relative to the segment before the loop (so a dihedral change in the loop does not propagate along the structure), while the loop itself is flexible. The problem is that the residues in the loop now have instructions on how to be built from two routes in the tree, and these instructions conflict once the loop is moved. In other words, this representation does not allow any cycles to exist. The solution is therefore to add a cut point (x) wherever there is a cycle in the tree.
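The "no cycles" requirement can be checked mechanically. The sketch below is a toy model, not Rosetta's fold tree class: residues are nodes, peptide edges connect neighbors in sequence, jumps connect arbitrary residues, and a union-find pass detects whether any residue can be reached by two routes. The 10-residue chain, the loop position and the function names are hypothetical.

```python
def build_edges(n_residues, jumps, cutpoints):
    """Peptide edges i -> i+1, skipped after each cutpoint residue, plus jump edges."""
    peptide = [(i, i + 1) for i in range(1, n_residues) if i not in cutpoints]
    return peptide + list(jumps)

def has_cycle(n_residues, edges):
    """Union-find cycle check: True if some edge joins two already-connected residues."""
    parent = list(range(n_residues + 1))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True
        parent[root_a] = root_b
    return False

# Hypothetical 10-residue protein with a loop at residues 4-7,
# bridged by a jump between the fixed residues 3 and 8.
n = 10
jump = [(3, 8)]

without_cut = build_edges(n, jump, cutpoints=set())
with_cut = build_edges(n, jump, cutpoints={5})  # chain break between residues 5 and 6

print(has_cycle(n, without_cut), has_cycle(n, with_cut))
```

With the cut point in place the graph has exactly n - 1 edges and no cycle, so every residue is built by a single route.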

Further fold-tree examples shown: fully flexible docking; docking with loop modeling; docking with hinge motion.
What complicates things is that this representation does not allow any cycle, to avoid situations where a residue can be reached by two different routes in the tree and therefore has conflicting instructions on how to be built.

Figure legend for the remaining examples: color – flexible bb; gray – fixed bb; pale – symmetry operation; filled colored circles – flexible side chain (sc); empty colored circles – flexible amino acid (design).
