In silico Protein Design: Implementing Dead-End Elimination algorithm CS273 Tyrone Anderson, Yu Bai & Caroline E. Moore-Kochlacs May 31st 2005
Computational protein design Backbone scaffold New sequence Iterative refinement Native structure Given backbone coordinates, find the best sequence(s) with which the protein is stable.
Components of the problem The protein design problem can be roughly divided into searching procedure and scoring function. The searching procedure samples the sequence space AND side-chain conformational space to create conformations. The scoring function evaluates each conformation created by the searching procedure. The evaluation scores are used to rank the conformations (and therefore the sequences) and pick the best one to be the final model.
Why is searching procedure difficult ? Consider a short protein with 20 amino acids. Possible sequence: S = 2020 ~1026 Each side chain has on average 2 dihedral angles (χ angles). Assuming that we will sample every 40º in the dihedral angle space, N = (360/40)(202) ~1038 This number S*N is too large to be naively sampled Algorithms that find good solutions by screening only parts of the search space are needed
Rotamer libraries Already in the 70s, Janin et al. showed that different side chain conformations are not found in equal distribution over the dihedral angle space but tend to cluster at specific regions of the space, much as in the Ramachandran plot. In the 80’s, this observation was used to improve modeling of side chain conformations. MAY BE SKIP Today, essentially all programs that model side chain conformations use rotamer libraries.
What do rotamer libraries provide? Rotamer libraries reduce significantly the number of conformations that need to be evaluated during the search. This is done with almost no risk of missing the real conformations. Even small libraries of about 100-150 rotamers cover about 96-97% of the conformations actually found in protein structures. The probabilities of each rotamer in the library can be applied to estimate the potential energy due to interactions within the side chain and with the local backbone atoms, using the Boltzmann distribution. (Not applied in this project) E ln(P)
Rotamer Library Creation Source: http://honiglab.cpmc.columbia.edu/programs/sidechain/rotamers.html Parsing: Select all Nitrogens (N), Oxygens (O), Alpha Carbons (CA) , & all other Carbons i.e CD, CZ, etc. Exclude all other elements and the end of file Store in a 3D array: Residues (1D) Rotamers (2D)
Rotamer Library Creation Original Rotamer: ('R', 1) Atom Residue X Y Z N R 4.986 -6.494 10.983 CA 3.99 -5.511 11.369 C 2.774 -6.288 11.848 O 2.209 -7.039 11.068 CB 3.528 -4.689 10.203 CG 4.563 -3.671 9.815 CD 4.228 -2.905 8.546 NE 5.119 -1.768 8.418 CZ 5.354 -1.136 7.268 NH1 4.736 -1.573 6.181 NH2 6.164 -0.097 7.202 HN 4.75 -7.184 10.299 HA 4.344 -4.953 12.119 1HB 3.357 -5.292 9.424 2HB 2.682 -4.216 10.45 1HG 4.656 -3.012 10.561 2HG 5.434 -4.141 9.675 1HD 4.338 -3.509 7.756 2HD 3.281 -2.586 8.592 HE 5.618 -1.384 9.195 1HH1 4.882 -1.115 5.304 2HH1 4.123 -2.361 6.237 1HH2 6.314 0.365 6.328 2HH2 6.628 0.228 8.026 HEAD 1 7.156 -7.614 11.118 2 6.202 -6.499 11.516 3 6.564 -5.625 12.293 Rotamer Library Creation Example: Black: Include in array Red: Exclude from array Blue: *Not part of the array
Aligning with the Backbone Translate backbone and rotamer to origin CA atom of ‘R’, 1 and backbone = (0, 0, 0) Rotate rotamer around X-axis Rotate rotamer around Z-axis Translate rotamer back to original position based on original position of CA atom i.e. CA atom of ‘R’, 1 = (3.99, -5.511 , 11.369)
Rotamer Library Manipulation Retrieve a specific rotamer: Provide the residue and the rotamer number i.e. ‘R’, 1 Gives you the 1st rotamer related to the Arginine residue Rotamer is already aligned with the backbone Only the coordinates of the atoms are returned in a 2D array Aligned Rotamer: ('R', 1) 5.0288367 -5.67582 10.82734 3.99 -5.511 11.369 2.7217014 -5.64128 12.04117 2.1324015 -5.76721 10.94661 3.50813 -5.37317 9.732783 4.587644 -5.20248 9.188313 4.2382361 -5.07404 7.407558 5.4126639 -4.77743 5.614174 24.594637 -4.74892 15.97595
Now, Consider again our protein of 20 amino acids. Each side chain has on average 9 rotamers. Assuming that we search now in the space of rotamers: N = 920 ≈ 1019 The searching space is restricted and oriented but the number of conformations is still too large for a naive search
Algorithms in searching (side-chain) conformational space Greed search (systematically scans the search space) DEE (Algorithmic approaches to reduce the search space) Self consistent algorithms (iterative sequential procedure) Monte Carlo algorithms (random search)
DEE (Dead-End Elimination) Aims to safely eliminates (clusters of) rotamers without loosing the GMEC (Global Minimum Energy Conformation). rotamer ir in force field of backbone only rotamer ir with rotamer(s) of other residues Given residue i, eliminate a rotamer ir if the minimum energy it can obtain by interaction with conformational background (js) is higher (worse) than the maximum possible energy that another rotamer it (of the same residue) can have
E(i,j) is it js rotamer background Desmet et al., 1992
The Goldstein improvement Rotamer ir can be safely eliminated when some other rotamer it exists with lower (better) energy for a certain environment that mostly favors ir. This criteria is much less restrictive and therefore more powerful. It requires though more computational time.
The Goldstein improvement is E(i,j) it js rotamer background
Scoring function: Energy function Terms: Van der Waals represents packing specificity Hydrogen bonding typically represented by an angle dependent, 12-10 hydrogen bond potential Electrostatics Guard against destabilizing interactions between like charged residues Internal coordinate terms ‘bonded’ energies Solvation energy Protein-solvent interactions Entropy Assumes conformational space is completely restricted in the folded state Gordon et al, 1999
Scoring function: Energy function Terms: Van der Waals represents packing specificity Hydrogen bonding typically represented by an angle dependent, 12-10 hydrogen bond potential Electrostatics Guard against destabilizing interactions between like charged residues Internal coordinate terms ‘bonded’ energies Solvation energy Protein-solvent interactions Entropy Assumes conformational space is completely restricted in the folded state Gordon et al, 1999
Van der Waals Interaction between two uncharged atoms Mildly attractive as two atoms approach from a distance Repulsive as they approach too close Represents packing specificity Prefers native-like folded states with well-organized cores over disordered or molten-globule states Gordon et al, 1999
Van der Waals 12-6 Lennard-Jones potential Standard approximation http://employees.csbsju.edu/hjakubowski/classes/ch331/protstructure/ilennardjones2.gif 12-6 Lennard-Jones potential Standard approximation R = distance between atoms R0 = van der Waals radii Dij = well depth Variation from Kuhlman and Baker, 2004 Erep is dampened to account for the fixed backbone and rotamer set being used.
Electrostatics Stability Specificity Moderate temperatures: favorable electrostatic interactions not thought to be strong enough to compensate for the energy of desolvation Extreme conditions: salt bridges may stabilize Specificity folding and functional interactions maybe the more significant role of electrostatics Currently, term guards against destabilizing interactions between like-charged residues Gordon et al, 1999
Electrostatics Approximations: Coulomb’s Law (Gordon et al, 1999) Qi,Qj = charge on amino acid R = distance ε= dielectric constant = 40 Bayesian version (Kuhlman & Baker, 2004) Probability of two amino acids close together given environment and distance (from PDB) aa=amino acid, d = distance, env =environment
Solvation Hydrophobic effects drive folding, modeling solvation effects is critical to a protein design force field Computationally expensive Solvent model from Lazaridis and Karplus, 1999 dij = distance between atoms, rij = van der Waals radii, Vi = atomic volume ΔGref = reference solvation free energy, ΔGfree = solvation free energy of free (isolated) group λ = correlation length
Energy Function: Incomplete model Current standard models include Bayesian terms based on PDB statistics Several terms have not been thoroughly validated as useful for design (Gordon et al, 1999) Hydrogen bonding Electrostatics Internal coordinates Current standard models are ad hoc, physical quantities and variables are weighted based on “what works best”
Integrated algorithm schema ..N1 I2L3D2E1F2. .. . . . . .D1. . . ..N1 L2L3K2N1V1. .. ..W7L3D2K9K10G1. .. Best seq . . . . .D2. . . 2nd DEE Exhaustive search 1st order DEE ..N1 . . .D2... Iterating algorithm till no more rotamers can be eliminated. only one rotamer is left for each of several side chains (i.e., these are part of the GMEC). For several others, only a (hopefully) few are left. DEE is extended to the Second order: cluster size =2, only core design finished An additional exhaustive search is then applied to obtain the final model D … N N1 N2 N3 . . . N1 . . .
Design cold-shock protein (core) & Trp-Cage protein Cold-shock protein (1MJC.pdb) 10 residues (core) Trp-Cage(1L2Y.pdb) 20 residues
cold-shock protein (core) After 1st-order DEE 2 3 7 8 9 5 4 1 6 Residue 0 A 1 F 2 3 I L V W 4 5 6 7 … Residue 3 Residue 8 Residue 1 A 1 F 2 I 3 L V W 5 6 … Residue 3 7 Residue 8 Hydrophobic Amino acids: A (1), F (3), I (3), L (2), V (2), W(7)
Trp-Cage protein All 20 AA After 1st-order DEE . . . Residue 9 H: 1...8 I: 1,2,3 K: 1...87 L: 1,2 M: 1...17 N: 1...9 P: 1 Q: 1...30 R: 1...114 S: 1,2 T: 1 V: 1,2 W:1...7 Y: 1,2,3 After 1st-order DEE . . . Residue 9 A: 1 C: 2 D: 6 E: 6,15 F: 1 G: 1 H: 7 I: 2 K: 18,22,59 L: 1 M: 1,12 N: 6 P: 1 Q: 4 R: 7,107 S: 2 T: 1 V: 1,2 W:6 Y: 1 All 20 AA Both finished the first order DEE, then go to 2nd order DEE, Trp-Cage is still running. core design finished two rounds, and further Evaluation for the best sequences are applied.
Results for cold-shock protein (core) Seq. EScore N: V F I V V I L V F V -46.47 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. . . . I F I I I I L I F V -53.58 I F I I V I L I F V -52.48 V F I I I I L I F V -51.70 I F I I I I L V F V -50.72 V F I I V I L I F V -50.53 I F I I V I L V F V -49.63 I F V V I I L I F V -49.34 V F V I I I L I F V -49.23 I F V I I I L V F V -48.92 V F I I I I L V F V -48.88 Cold-shock protein (1MJC.pdb) 10 residues (core) We are very happy to see the predicted sequences are at minimum 50% identical to the native one, and the rest residues are limited to 2 amino acids which include the native ones. Secondly, there is high similarity of the energies of the sequences which suggests the global E minima of conformational space is smooth and our solution is likely to be stable.
Summary & Future Speed Accuracy Achievement: Naïve ~ 107 sequence X 104 rotamers DEE ~ 3000 sequences X 200 rotamers BioX-cluster(~600 2.8GHz Xeon CPUs) 26 hrs Future: Rotamers ordering (by self-energies) (Gordon 1998) Comparison cluster focusing (Looger 2001) Stronger elimination criteria (Looger 2001) Accuracy Achievement: 50 % identical with native sequence High similarity in total energy Future: Additional energy terms (H-bond, solvation) Incorporate rigorous force field calculators(Gromacs) Structure relaxation Have not had chance to compare with other algorithms( greedy, MC), but simplest demenstration is naïve search needs
Thanks !