2. Introduction to Rosetta and structural modeling (from Ora Schueler-Furman)
– Approaches for structural modeling of proteins
– The Rosetta framework and its prediction modes
– Cartesian and polar coordinates
– Sampling (finding the structure) and scoring (selecting the structure)
Structural Modeling of Proteins - Approaches
Prediction of Structure from Sequence (flowchart)
– Compare the query sequence to the nr database.
– Similar to a sequence of known structure? Yes: homology modeling (comparative modeling).
– No: fold recognition (threading). Does the sequence fit a known fold? Yes: model it on that fold.
– No: ab initio prediction.
The Rosetta framework and its prediction modes
A short history of Rosetta
– In the beginning: ab initio modeling of protein structure, starting from sequence (e.g. ATCSFFGRKLL…)
– Short fragments of known proteins are assembled by a Monte Carlo strategy to yield native-like protein conformations
– Reliable fold identification for short proteins
– Recently improved to high-resolution models (within 2 Å RMSD)
A short history of Rosetta (continued)
The success of the ab initio protocol led to its extension to:
– Protein design; design of a new fold: TOP7
– Protein loop modeling; homology modeling
– Protein-protein docking; protein interface design
– Protein-ligand docking
– Protein-DNA interactions; RNA modeling
– Many more, e.g. solving the phase problem in X-ray crystallography
The Rosetta Strategy
– Observation: local sequence preferences bias, but do not uniquely define, the local structure of a protein
– Goal: mimic the interplay of local and global interactions that determines protein structure
– Local interactions: fragments derived from known structures (sampled for similar sequence and secondary structure propensity)
– Global (non-local) interactions: buried hydrophobic residues, paired strands, specific side chain interactions, etc.
The Rosetta Strategy (continued)
– Local interactions are captured by fragments: a fragment library represents the accessible local structures for all short sequences in a protein chain, derived from known structures
– Global (non-local) interactions are captured by the scoring function, derived from conformational statistics of known structures
Scoring and Sampling
The basic assumption in structure prediction
The native structure is located at the global minimum (free) energy conformation (GMEC)
➜ A good energy function can select the correct model among decoys
➜ A good sampling technique can find the GMEC in the rugged landscape
[Figure: energy E plotted over conformation space, with the GMEC at the lowest point]
Two-Step Procedure
1. Low-resolution step locates potential minima (fast)
2. Cluster analysis identifies the broadest basins in the landscape
3. High-resolution step can identify the lowest energy minimum in the basins (slow)
[Figure: energy E plotted over conformation space, with the GMEC marked]
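The procedure can be illustrated on a toy one-dimensional landscape. The sketch below is not Rosetta code: the energy function, the grid spacing, the energy window and the helper names (energy, coarse_scan, cluster_basins, refine) are all assumptions made for illustration. It scans coarsely, groups the low-energy samples into basins, picks the broadest basin, and only then spends effort on a fine downhill search.

```python
import math

def energy(x):
    """Toy rugged landscape (assumed): a broad funnel plus ripples."""
    return 0.1 * (x - 3.0) ** 2 + 0.3 * math.sin(9.0 * x)

def coarse_scan(lo, hi, step):
    """Step 1: fast, low-resolution scan on a coarse grid."""
    n = int(round((hi - lo) / step))
    return [lo + i * step for i in range(n + 1)]

def cluster_basins(xs, window, gap):
    """Step 2: keep samples within `window` of the best energy and
    group neighbours that are closer than `gap` into basins."""
    best = min(energy(x) for x in xs)
    low = sorted(x for x in xs if energy(x) <= best + window)
    basins = []
    for x in low:
        if basins and x - basins[-1][-1] <= gap:
            basins[-1].append(x)
        else:
            basins.append([x])
    return basins

def refine(x, step=1e-3, max_iter=100000):
    """Step 3: slow, fine-grained downhill walk inside one basin."""
    e = energy(x)
    for _ in range(max_iter):
        e_new, x_new = min((energy(c), c) for c in (x - step, x + step))
        if e_new >= e:
            break  # neither neighbour is lower: local minimum reached
        x, e = x_new, e_new
    return x, e

samples = coarse_scan(-2.0, 8.0, 0.25)
basins = cluster_basins(samples, window=0.5, gap=0.3)
broadest = max(basins, key=len)        # broadest basin, not the single best point
start = min(broadest, key=energy)      # best coarse sample inside that basin
x_min, e_min = refine(start)
print(f"{len(basins)} basins; refined minimum near x = {x_min:.3f}, E = {e_min:.3f}")
```

In a real protocol the low-resolution and high-resolution steps use different structure representations and scoring functions; here a single function is sampled at two resolutions to keep the example short.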
Low-Resolution Step
– Fast
– Structure representation: equilibrium bonds and angles (Engh & Huber 1991)
– Centroid: average location of the side-chain center of mass (centroid | aa, φ, ψ)
– No modeling of side chains
Low-Resolution Scoring Function
Bayes' theorem: independent components prevent over-counting
P(str | seq) = P(str) * P(seq | str) / P(seq)
– P(seq): constant
– P(seq | str): sequence-dependent features
– P(str): structure-dependent features
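Taking the negative logarithm of Bayes' theorem shows why the score can be written as a sum of terms. This is a sketch of the standard argument. The symbols $S_k^{\mathrm{str}}$ and $S_l^{\mathrm{seq}}$ are introduced here for illustration, and the second line holds only under the stated assumption that each probability factorizes into independent components.

```latex
\begin{align}
-\ln P(\mathrm{str}\mid\mathrm{seq})
  &= -\ln P(\mathrm{str}) \;-\; \ln P(\mathrm{seq}\mid\mathrm{str}) \;+\; \ln P(\mathrm{seq}) \\
  &\approx \sum_k S_k^{\mathrm{str}} \;+\; \sum_l S_l^{\mathrm{seq}} \;+\; \mathrm{const},
  \qquad S = -\ln P \text{ for each independent component}
\end{align}
```

For a fixed sequence, $\ln P(\mathrm{seq})$ is the same for every candidate structure, so it does not affect the ranking of models.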
Sequence-Dependent Components: P(seq | str)
P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = S_env + S_pair + …
Neighbors: Cβ-Cβ < 10 Å
Rohl et al. (2004) Methods in Enzymology 383:66. Origin: Simons et al., JMB 1997; Simons et al., Proteins 1999
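The neighbor criterion on this slide can be sketched as a simple count of Cβ atoms within 10 Å. The function name and the coordinates below are made up for illustration. How such counts are turned into an environment score (through statistics per amino acid type) is not shown here.

```python
import math

def count_neighbors(cb_coords, cutoff=10.0):
    """Count, for each residue, how many other C-beta atoms lie within `cutoff` angstroms."""
    counts = [0] * len(cb_coords)
    for i in range(len(cb_coords)):
        for j in range(i + 1, len(cb_coords)):
            if math.dist(cb_coords[i], cb_coords[j]) < cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts

# Hypothetical C-beta positions (angstroms): three close residues and one far away.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(count_neighbors(toy))  # [2, 2, 2, 0]
```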
Structure-Dependent Components: P(str)
P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = … + S_rg + S_cβ + S_vdw + …
Score = … + S_ss + …
Score = … + S_sheet + S_hs + … + S_rama
High-Resolution Step
– Slow, exact step
– Locates the global energy minimum
– Structure representation: all-atom (including polar and non-polar hydrogens, but no water)
– Side chains as rotamers from a backbone-dependent library (Dunbrack 1997)
– Side chain conformation adjusted frequently
High-Resolution Step: Rotamer Libraries
– Side chains have preferred conformations, which are summarized in rotamer libraries
– Select one rotamer for each position
– Best conformation: lowest-energy combination of rotamers
– Serine χ1 preferences: t = 180°, g− = −60°, g+ = +60°
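"Lowest-energy combination of rotamers" means the best choice at one position depends on its neighbors. The toy below enumerates all combinations for three positions. The self energies and clash penalties are invented numbers, and exhaustive enumeration is used only because the example is tiny. It is not how a real packer searches.

```python
from itertools import product

# chi1 rotamer states from the slide (degrees)
CHI1 = {"t": 180.0, "g-": -60.0, "g+": 60.0}

# Hypothetical one-body energies per position and rotamer.
self_energy = [
    {"t": 0.5, "g-": 0.0, "g+": 0.3},   # position 0
    {"t": 0.0, "g-": 0.3, "g+": 0.1},   # position 1
    {"t": 0.4, "g-": 0.1, "g+": 0.0},   # position 2
]
# Hypothetical two-body clash penalties: (pos_i, rot_i, pos_j, rot_j) -> energy.
pair_energy = {
    (0, "g-", 1, "t"): 2.0,
    (1, "g+", 2, "g+"): 1.5,
}

def total_energy(assignment):
    """Sum one-body terms plus any clash penalty triggered by the assignment."""
    e = sum(self_energy[i][r] for i, r in enumerate(assignment))
    for (i, ri, j, rj), penalty in pair_energy.items():
        if assignment[i] == ri and assignment[j] == rj:
            e += penalty
    return e

def best_rotamers():
    """Brute-force search over every rotamer combination."""
    return min(product(CHI1, repeat=len(self_energy)), key=total_energy)

greedy = tuple(min(e, key=e.get) for e in self_energy)  # best rotamer per position, ignoring pairs
best = best_rotamers()
print("greedy:", greedy, total_energy(greedy))
print("best:  ", best, total_energy(best))
```

Picking the best rotamer at each position independently runs into a clash here, while the combined search finds a lower total energy.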
High-Resolution Scoring Function
Major contributions:
– Burial of hydrophobic groups away from water
– Void-free packing of buried groups and atoms
– Buried polar atoms form intra-molecular hydrogen bonds
High-Resolution Scoring Function: packing interactions
Score = S_LJ(atr + rep) + …
– Lennard-Jones attraction and repulsion as a function of the interatomic distance r_ij
– Linearized repulsive part
– ε: well depth from CHARMm19
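A linearized repulsive part can be sketched as a standard 12-6 Lennard-Jones potential that is replaced by its tangent line below a switch distance. The switch at 0.6 of the minimum-energy distance, the parameter values and the function names are assumptions for this illustration, not a statement of the exact functional form used in any Rosetta release.

```python
def lj_12_6(r, r_min, eps):
    """Standard 12-6 Lennard-Jones energy with minimum -eps at r = r_min."""
    q = (r_min / r) ** 6
    return eps * (q * q - 2.0 * q)

def lj_slope(r, r_min, eps):
    """Derivative dE/dr of the 12-6 potential."""
    q = (r_min / r) ** 6
    return eps * (-12.0 * q * q + 12.0 * q) / r

def lj_linearized(r, r_min, eps, switch=0.6):
    """12-6 potential above r_switch; straight-line extrapolation below it.

    The linear branch keeps the energy finite (and the gradient bounded)
    when two atoms overlap strongly.
    """
    r_switch = switch * r_min
    if r >= r_switch:
        return lj_12_6(r, r_min, eps)
    return lj_12_6(r_switch, r_min, eps) + lj_slope(r_switch, r_min, eps) * (r - r_switch)

# Hypothetical atom pair: r_min = 4.0 angstroms, well depth 0.1
for r in (0.0, 1.0, 2.4, 4.0, 6.0):
    print(r, round(lj_linearized(r, 4.0, 0.1), 3))
```

The practical effect is that a badly clashing starting model gets a large but finite penalty, which makes minimization better behaved than with the raw r^-12 wall.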
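High-Resolution Scoring Function: implicit solvation
Score = … + S_solvation + … (Lazaridis & Karplus, Proteins 1999)
– Based on the solvation free energy density of atom i
– Reduced distance: $x_{ij} = (r_{ij} - R_i)/\lambda_i$; the pairwise term contains Gaussian factors in $x_{ij}^2$ and $x_{ji}^2$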
High-Resolution Scoring Function: hydrogen bonds (original function)
Score = … + S_hb(srbb + lrbb + sc) + … (Kortemme 2003; Morozov 2004)
– srbb: short-range backbone hydrogen bond
– lrbb: long-range backbone hydrogen bond
– sc: hydrogen bond with a side chain atom
[Figure: N-H···O=C hydrogen bond geometry with distance d]
Hydrogen Bonding Energy
Based on statistics from high-resolution structures in the Protein Data Bank (rcsb.org) (Kortemme, Morozov & Baker 2003 JMB)
(Slide from Jeff Gray)
High-Resolution Scoring Function: rotamer preference
Score = … + S_dunbrack + … (Dunbrack 1997)
Scoring Function: Summary
One long, generic function …
Score = S_env + S_pair + S_rg + S_cβ + S_vdw + S_ss + S_sheet + S_hs + S_rama + S_hb(srbb + lrbb) + docking_score + S_disulf_cent + S_r + S_co + S_contact_prediction + S_dipolar + S_projection + S_pc + S_tether + S + S + S_symmetry + S_splicemsd + …
docking_score = S_d_env + S_d_pair + S_d_contact + S_d_vdw + S_d_site_constr + S_d + S_fab_score
Score = S_LJ(atr + rep) + S_solvation + S_hb(srbb + lrbb + sc) + S_dunbrack + S_pair - S_ref + S_prob1b + S_intrares + S_gb_elec + S_gsolt + S_h2o(solv + hb) + S _plane
Representations of protein structure: Cartesian and polar coordinates
– Polar (internal) table, one row per position: PHI, PSI, OMEGA, CHI1, CHI2, CHI3, CHI…
– Cartesian (PDB) records, one row per atom with x, y, z: ATOM 490 N GLN A, ATOM 491 CA GLN A, ATOM 492 C GLN A, ATOM 493 O GLN A, …
2 ways to represent the protein structure
Cartesian coordinates (x, y, z; PDB format)
+ Intuitive: look at molecules in space
+ Easy calculation of energy score (based on atom-atom distances)
– Difficult to change the conformation of the structure (while keeping bond lengths and bond angles unchanged)
Polar coordinates (φ, ψ, ω; equilibrium angles and bond lengths)
+ Compact (3 values/residue)
+ Easy changes of protein structure (turn around one or more dihedral angles)
– Non-intuitive
– Difficult to evaluate energy score (calculation of the neighbor matrix is complicated)
A snake in the 2D world
Cartesian representation:
– points: (0,0), (1,1), (1,2), (2,2), (3,3)
– connections (predefined): 1-2, 2-3, 3-4, 4-5
A snake in the 2D world
Internal coordinates:
– bond lengths (predefined): √2, 1, 1, √2
– angles (direction of each bond relative to the x axis): 45°, 90°, 0°, 45°
(Figure from Wikipedia)
A snake wiggling in the 2D world
Constraint: keep bond length fixed
Move in Cartesian representation:
(0,0), (1,1), (1,2), (2,2), (3,3) → (0,0), (1,1), (1,2), (2,2), (3,0)
Bond length changed! The last bond goes from √2 to √5.
A snake wiggling in the 2D world
Constraint: keep bond length fixed
Move in polar coordinates:
45°, 90°, 0°, 45° → 45°, 90°, 45°, 45°
Bond length unchanged! Large impact on structure
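Both snake moves can be checked numerically. The sketch below rebuilds the points from bond lengths and angles, assuming each angle is the direction of its bond relative to the x axis (this reading reproduces the points on the slides). The function names are chosen for this example. The Cartesian move stretches the last bond to √5, while the angle move leaves every bond length untouched.

```python
import math

def build_points(lengths, angles_deg, origin=(0.0, 0.0)):
    """Polar -> Cartesian: walk along the chain, one bond (r, theta) at a time."""
    points = [origin]
    for r, theta in zip(lengths, angles_deg):
        x, y = points[-1]
        t = math.radians(theta)
        points.append((x + r * math.cos(t), y + r * math.sin(t)))
    return points

def bond_lengths(points):
    """Distance between each pair of consecutive points."""
    return [math.dist(p, q) for p, q in zip(points, points[1:])]

lengths = [math.sqrt(2), 1.0, 1.0, math.sqrt(2)]
original = build_points(lengths, [45, 90, 0, 45])

# Move in Cartesian space: drag the last point from (3,3) to (3,0).
cartesian_move = [(0, 0), (1, 1), (1, 2), (2, 2), (3, 0)]
# Move in internal coordinates: change the third angle from 0 to 45 degrees.
internal_move = build_points(lengths, [45, 90, 45, 45])

print([round(b, 3) for b in bond_lengths(cartesian_move)])  # last bond is sqrt(5)
print([round(b, 3) for b in bond_lengths(internal_move)])   # all bonds unchanged
print([(round(x, 3), round(y, 3)) for x, y in internal_move])
```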
Polar → Cartesian coordinates
Convert r and θ to x and y
√2, 1, 1, √2 and 45°, 90°, 0°, 45° → (0,0), (1,1), (1,2), (2,2), (3,3)
(Figure from Wikipedia)
Cartesian → polar coordinates
Convert x and y to r and θ
(0,0), (1,1), (1,2), (2,2), (3,3) → √2, 1, 1, √2 and 45°, 90°, 0°, 45°
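The Cartesian to polar direction is a one-liner per bond: r is the length of the bond vector and θ its direction. A minimal sketch, with the function name to_internal chosen for this example and θ again taken relative to the x axis:

```python
import math

def to_internal(points):
    """Cartesian -> polar: bond length r and direction theta (degrees) for each bond."""
    lengths, angles = [], []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        lengths.append(math.hypot(dx, dy))                # r = sqrt(dx^2 + dy^2)
        angles.append(math.degrees(math.atan2(dy, dx)))   # theta = atan2(dy, dx)
    return lengths, angles

snake = [(0, 0), (1, 1), (1, 2), (2, 2), (3, 3)]
lengths, angles = to_internal(snake)
print([round(r, 3) for r in lengths])   # [1.414, 1.0, 1.0, 1.414]
print([round(a, 1) for a in angles])    # [45.0, 90.0, 0.0, 45.0]
```

Using atan2 rather than atan(dy/dx) keeps the correct quadrant and avoids a division by zero for vertical bonds such as the second one.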
Moving the snake to the 3D world
Cartesian representation:
– points, with an additional z-axis: (0,0,0), (1,1,0), (1,2,0), (2,2,0), (3,3,0)
– connections (predefined): 1-2, 2-3, 3-4, 4-5
Internal coordinates:
– bond lengths (predefined): √2, 1, 1, √2
– angles: 45°, 90°, 0°, 45°
– dihedral angles: 180°, 180°
Proteins: bond lengths and angles fixed. Only dihedral angles are varied
Dihedral angles
– Dihedral angle: defines the geometry of 4 consecutive atoms (given bond lengths and angles)
– Dihedral angles χ1-χ4 define the side chain
(Figure from Wikipedia)
What we learned from our snake
– Cartesian representation: easy to look at, difficult to move. Moves do not preserve bond lengths (and angles in 3D)
– Internal coordinates: easy to move, difficult to see. Calculation of distances between points is not trivial
– Proteins: bond lengths and angles fixed. Only dihedral angles are varied
Solution: toggle
– MOVE STRUCTURE, polar coordinates: introduce changes in the structure by rotating around dihedral angle(s) (change values in the table: Position, PHI, PSI, OMEGA, CHI1, CHI2, CHI3, …)
– Transform: build positions in space according to dihedral angles
– CALCULATE ENERGY, Cartesian coordinates (PDB x, y, z): derive the distance matrix (neighbor list) for energy score calculation
– Transform: calculate dihedral angles from coordinates
Snake example: (0,0), (1,1), (1,2), (2,2), (3,3) ↔ 45°, 90°, 0°, 45°
Cartesian → polar coordinates
How to calculate polar from Cartesian coordinates. Example, φ: C'-N-Cα-C
– define the plane perpendicular to the N-Cα (b2) vector
– calculate the projections of Cα-C (b3) and C'-N (b1) onto that plane
– calculate the angle between the projections
[Figure: PDB ATOM records (x, y, z) mapped to a table of PHI, PSI, OMEGA, CHI1 to CHI4 per position; snake example (0,0), (1,1), (1,2), (2,2), (3,3) → 45°, 90°, 0°, 45°]
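The three steps above translate directly into vector arithmetic. The sketch below is one common way to write it, with helper names chosen for this example. The sign is obtained from atan2 so that the result lies in (−180°, 180°], positive when the far bond is rotated clockwise looking down the central bond.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Dihedral angle (degrees) defined by four consecutive atoms p0-p1-p2-p3."""
    b1 = sub(p0, p1)                      # from atom 2 back to atom 1 (e.g. N -> C')
    b2 = sub(p2, p1)                      # central bond, the rotation axis (N -> CA)
    b3 = sub(p3, p2)                      # last bond (CA -> C)
    norm = math.sqrt(dot(b2, b2))
    axis = tuple(x / norm for x in b2)
    # Project b1 and b3 onto the plane perpendicular to the central bond.
    v = sub(b1, tuple(dot(b1, axis) * x for x in axis))
    w = sub(b3, tuple(dot(b3, axis) * x for x in axis))
    # Signed angle between the two projections.
    x = dot(v, w)
    y = dot(cross(axis, v), w)
    return math.degrees(math.atan2(y, x))

print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))   # cis
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))   # quarter turn
print(dihedral((0, 0, 0), (1, 1, 0), (1, 2, 0), (2, 2, 0)))   # first four snake points
```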
Polar → Cartesian coordinates
Find the x, y, z coordinates of C, based on the atom positions of C', N and Cα and a given φ value (φ: C'-N-Cα-C)
– create the Cα-C vector: size Cα-C = 1.51 Å (equilibrium bond length), angle N-Cα-C = 111° (equilibrium value for the N-Cα-C angle)
– rotate the vector around the N-Cα axis until the projections of Cα-C and N-C' form the wanted φ
[Figure: table of PHI, PSI, OMEGA, CHI1 to CHI4 per position mapped to PDB ATOM records (x, y, z); snake example 45°, 90°, 0°, 45° → (0,0), (1,1), (1,2), (2,2), (3,3)]
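Placing one atom from three predecessors, a bond length, a bond angle and a dihedral can be sketched as follows. This is a generic construction in a local frame built on the last bond. The function name place_atom and the input coordinates are hypothetical, and the sign convention is the usual one (positive dihedral = clockwise looking down the central bond).

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def place_atom(a, b, c, bond, angle_deg, torsion_deg):
    """Return d with |c-d| = bond, angle(b, c, d) = angle_deg, dihedral(a, b, c, d) = torsion_deg.

    Requires that a, b and c are not collinear.
    """
    angle = math.radians(angle_deg)
    torsion = math.radians(torsion_deg)
    bc = unit(sub(c, b))                    # rotation axis (e.g. N -> CA)
    n = unit(cross(sub(b, a), bc))          # normal of the a-b-c plane
    m = cross(n, bc)                        # completes the local frame
    # Coordinates of d in the local frame (bc, m, n), origin at c.
    dx = -bond * math.cos(angle)
    dy = bond * math.sin(angle) * math.cos(torsion)
    dz = bond * math.sin(angle) * math.sin(torsion)
    return tuple(c[i] + dx * bc[i] + dy * m[i] + dz * n[i] for i in range(3))

# Hypothetical positions of C' (previous residue), N and CA, in angstroms.
c_prev, n_atom, ca = (-0.7, 1.1, 0.0), (0.0, 0.0, 0.0), (1.46, 0.0, 0.0)
c_atom = place_atom(c_prev, n_atom, ca, bond=1.51, angle_deg=111.0, torsion_deg=-60.0)
print(tuple(round(x, 3) for x in c_atom))
```

Repeating this placement atom by atom, starting from the first atom, rebuilds a whole chain from its dihedral angles.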
Representation of protein structure
Rosetta folding: 3 backbone dihedral angles per residue
– Sampling and minimization in TORSIONAL space: change an angle and rebuild, starting from the changed angle
– Build the coordinates of the structure starting from the first atom, according to the dihedral angles (and equilibrium bond lengths and angles)
(Based on slides by Chu Wang)
Representation of protein structure
– Rosetta folding: 3 backbone dihedral angles per residue. Sampling and minimization in TORSIONAL space
– Rosetta docking: backbone dihedral angles fixed (rigid body); 6 rigid-body DOFs (3 translations, 3 rotation angles). Sampling and minimization in RIGID-BODY space
How can those two types of degrees of freedom be combined?
Fold tree representation (Bradley and Baker, Proteins 2006)
– “peptide” edge: 3 backbone dihedral angles
– “long-range” edge: 6 rigid-body DOFs
– Originally developed to improve sampling of strand registers in β-sheet proteins
– Allows simultaneous optimization of rigid-body and backbone/side-chain torsional degrees of freedom
– Example: fold-tree based docking
– Construct fold trees to treat a variety of protein folding and docking problems
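The idea can be sketched with a toy data structure: a list of directed edges, each either a peptide edge or a long-range jump. Everything here (the tuple layout, the names PEPTIDE, JUMP, downstream_residues and count_dofs, and the eight-residue example) is a hypothetical illustration, not the Rosetta FoldTree API. The useful property is that a change in any degree of freedom moves exactly the residues downstream of it in the tree.

```python
PEPTIDE, JUMP = "peptide", "jump"

# Toy docking fold tree: chain A = residues 1-4, chain B = residues 5-8.
# Each edge is (start, stop, kind); the root of the tree is residue 1.
edges = [
    (1, 2, PEPTIDE),
    (2, 4, PEPTIDE),
    (2, 6, JUMP),       # long-range edge connecting the two chains
    (6, 5, PEPTIDE),    # chain B is built outwards from the jump residue
    (6, 8, PEPTIDE),
]

def downstream_residues(edges, vertex):
    """Residues whose coordinates depend on `vertex` (the vertex itself included)."""
    moved = {vertex}
    stack = [vertex]
    while stack:
        v = stack.pop()
        for start, stop, kind in edges:
            if start != v:
                continue
            if kind == PEPTIDE:
                step = 1 if stop >= start else -1
                moved.update(range(start, stop + step, step))
            else:
                moved.add(stop)
            stack.append(stop)
    return moved

def count_dofs(edges, n_residues):
    """Nominal count: 3 backbone dihedrals per residue plus 6 rigid-body DOFs per jump."""
    n_jumps = sum(1 for _, _, kind in edges if kind == JUMP)
    return 3 * n_residues + 6 * n_jumps

print(sorted(downstream_residues(edges, 6)))   # [5, 6, 7, 8]: the jump moves all of chain B
print(count_dofs(edges, 8))                    # 30
```

Moving the jump repositions chain B as a rigid body while chain A stays in place, and torsions on either chain can still be sampled in the same tree.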
Fold-trees for different modeling tasks
Legend: N: N-terminal; C: C-terminal; X: chain break; O: root of the tree. Edges: flexible or rigid “peptide” edge; rigid or flexible “jump” (1 to 1'). Color: flexible backbone; gray: fixed backbone; pale: symmetry operation; filled colored circles: flexible side chains; empty colored circles: flexible amino acid identity (design)
Tasks shown (one fold-tree diagram each):
– protein folding
– loop modeling
– fully flexible docking
– docking with hinge motion
– docking with loop modeling
The Rosetta sampling strategy: a general overview
– 9-residue fragments
– 3-residue fragments
– Gradual addition of parameters to the scoring function
– Quick quenching
– Fragment sampling
– Strategies to keep fragment insertion/perturbation local
– Monte Carlo (MC) sampling
– MC sampling with minimization
– Local optimization
– Repacking and refinement
– Side chain rearrangement
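The fragment stages can be sketched as Monte Carlo fragment insertion in torsion space: first with 9-residue fragments, then with 3-residue fragments at a lower temperature. This is a toy under loud assumptions. Each residue has a single angle, the fragment library is simply every combination of two states, and the score counts mismatches against a known target so that the example is self-contained. A real run uses sequence-specific fragments from known structures and a scoring function that does not know the answer.

```python
import itertools
import math
import random

HELIX, STRAND = -60.0, 120.0   # toy single-angle states (assumed)

def mismatches(torsions, target, tol=30.0):
    """Toy score: number of residues further than `tol` degrees from the target."""
    def diff(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)
    return sum(diff(t, g) > tol for t, g in zip(torsions, target))

def fragment_mc(start, target, frag_len, n_steps, kT, rng):
    """Metropolis Monte Carlo with fragment insertion; returns the best state seen."""
    library = list(itertools.product((HELIX, STRAND), repeat=frag_len))
    current, e_cur = list(start), mismatches(start, target)
    best, e_best = list(current), e_cur
    for _ in range(n_steps):
        pos = rng.randrange(len(current) - frag_len + 1)   # insertion window
        frag = rng.choice(library)
        trial = current[:pos] + list(frag) + current[pos + frag_len:]
        e_try = mismatches(trial, target)
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if e_try <= e_cur or rng.random() < math.exp(-(e_try - e_cur) / kT):
            current, e_cur = trial, e_try
            if e_cur < e_best:
                best, e_best = list(current), e_cur
    return best, e_best

rng = random.Random(0)
target = [HELIX] * 8 + [STRAND] * 8
start = [180.0] * len(target)                               # extended chain
coarse, e_coarse = fragment_mc(start, target, frag_len=9, n_steps=500, kT=1.0, rng=rng)
fine, e_fine = fragment_mc(coarse, target, frag_len=3, n_steps=500, kT=0.5, rng=rng)
print("start:", mismatches(start, target), "after 9-mers:", e_coarse, "after 3-mers:", e_fine)
```

Large fragments change the chain drastically and help find the overall fold early on; small fragments make local corrections once the chain is roughly in place.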