Protein Structure Prediction Ming-Jing Hwang (黃明經) N121, Institute of Biomedical Sciences Academia Sinica http://gln.ibms.sinica.edu.tw/
Objective To understand the what, why, and how aspects of protein structure prediction, as well as its current status and use.
Science 2005
Why structure? Most proteins fold to function. Structure allows us to understand how a protein functions, often with mechanistic details, more than sequence can. With the knowledge we can design experiments to further probe the protein’s function, or, in the case of a disease protein, devise ways to counter the disease process (e.g. drug design).
Ex: Structure & function of potassium channel MacKinnon, 1998 (2003 Nobelist)
Some structural biology Nobel winners
F. H. C. Crick, J. D. Watson, M. H. C. Wilkins (Physiology or Medicine, 1962): for their discoveries concerning the molecular structure of nucleic acids and its significance for information transfer in living material
M. F. Perutz, Sir J. C. Kendrew (Chemistry, 1962): for their studies of the structures of globular proteins
D. Crowfoot Hodgkin (Chemistry, 1964): for her determinations by X-ray techniques of the structures of important biochemical substances
Sir A. Klug (Chemistry, 1982): for his development of crystallographic electron microscopy and his structural elucidation of biologically important nucleic acid-protein complexes
J. Deisenhofer, R. Huber, H. Michel (Chemistry, 1988): for the determination of the three-dimensional structure of a photosynthetic reaction centre
P. D. Boyer, J. E. Walker, J. C. Skou (Chemistry, 1997): for their elucidation of the enzymatic mechanism underlying the synthesis of adenosine triphosphate (ATP) [Boyer, Walker]; for the first discovery of an ion-transporting enzyme, Na+, K+-ATPase [Skou]
J. B. Fenn, K. Tanaka, K. Wüthrich (Chemistry, 2002): for the development of methods for identification and structure analyses of biological macromolecules; for their development of soft desorption ionisation methods for mass spectrometric analyses of biological macromolecules [Fenn, Tanaka]; for his development of nuclear magnetic resonance spectroscopy for determining the three-dimensional structure of biological macromolecules in solution [Wüthrich]
R. D. Kornberg (Chemistry, 2006): for his studies of the molecular basis of eukaryotic transcription
V. Ramakrishnan, T. A. Steitz, A. E. Yonath (Chemistry, 2009): for studies of the structure and function of the ribosome
http://www.imb-jena.de/IMAGE_NOBEL.html
Why prediction? Structure determination by experimental methods (X-ray, NMR, etc.) is still hard, especially with obstacles at early steps (e.g. expression and crystallization). Prediction helps bridge the widening gap between sequence and structure.
Sequence/Structure Gap. As of June 02, 2009, the number of entries in protein sequence and structure databases: SWISS-PROT/TREMBL: 468,851/7,916,844; PDB: 57,835. [Figure labels: Sequence, Structure]
Structural genomics and drug design. Structural genomics: HM as the workhorse. Baker & Sali, 2001
Structure prediction: 1D->3D, then Function MADWVTGKVTKVQNWTDALFSLTVHAPVLPFTAGQFTKLGLEIDGERVQRAYSYVNSPDNPDLEFYLVTVPDGKLSPRLAALKPGDEVQVVSEAAGFFVLDEVPHCETLWMLATGTAIGPYLSILR UNKNOWN KNOWN
Prediction is very hard, especially if you are predicting unknowns.
Why do we believe in prediction at all? Christian Anfinsen, in an elegant experiment in 1957, showed that ribonuclease A (124 aa’s), after having been completely denatured using 8M urea and 2-mercaptoethanol, regained full enzymatic activity when the urea and 2-ME were slowly removed by dialysis. All the information needed to fold is contained within the primary sequence.
Energy Landscape Theory of Structure Prediction. Nature makes the landscapes of real proteins funneled. You have to work to make the energy landscapes of structure prediction schemes funneled. Let me show you some of the things you have to consider. Zaida (Zan) Luthey-Schulten
How to do 1D->3D? (I) Physics-based approach: computing energy as a function of structure (surfing the energy surface)
Molecular Mechanics (Force Field) http://cmm.info.nih.gov/modeling/guide_documents/molecular_mechanics_document.html
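To make "computing energy as a function of structure" concrete, here is a minimal sketch of a molecular-mechanics style energy: a harmonic bond-stretch term plus a 12-6 Lennard-Jones term for non-bonded pairs. The functional forms are the usual textbook ones, but every parameter value (k, r0, epsilon, sigma) is a made-up placeholder rather than a value from any real force field, and angle, torsion, and electrostatic terms are left out.

```python
import math

def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def bond_energy(r, k=300.0, r0=1.5):
    """Harmonic bond stretch k * (r - r0)^2 (k and r0 are placeholders)."""
    return k * (r - r0) ** 2

def lennard_jones(r, epsilon=0.1, sigma=3.4):
    """12-6 Lennard-Jones energy for one non-bonded pair (placeholder parameters)."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)

def total_energy(coords, bonds):
    """Sum bond terms over bonded pairs and Lennard-Jones over all other pairs.

    Simplification: real force fields also exclude or scale 1-3 and 1-4 pairs.
    """
    bonded = {(min(i, j), max(i, j)) for i, j in bonds}
    energy = sum(bond_energy(distance(coords[i], coords[j])) for i, j in bonded)
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if (i, j) not in bonded:
                energy += lennard_jones(distance(coords[i], coords[j]))
    return energy

# Toy three-atom chain with both bonds at their rest length.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(total_energy(atoms, bonds=[(0, 1), (1, 2)]))
```

Moving any atom changes the returned energy, which is what "surfing the energy surface" refers to: minimization and MD both repeatedly evaluate a function of this kind.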
Levitt
A POP study: 1-microsecond MD simulation 980ns villin headpiece 36 a.a. 3000 H2O 12,000 atoms 256 CPUs (CRAY) ~4 months single trajectory Duan & Kollman, 1998
Science 2010 (1 millisecond; previous longest 10 microsecond; Amber FF) Fig. 1 Folding proteins at x-ray resolution, showing comparison of x-ray structures (blue) (15, 24) and last frame of MD simulation (red): (A) simulation of villin at 300 K, (B) simulation of FiP35 at 337 K. Simulations were initiated from completely extended structures. Villin and FiP35 folded to their native states after 68 µs and 38 µs, respectively, and simulations were continued for an additional 20 µs after the folding event to verify the stability of the native fold.
Massively distributed computing. Letters to Nature (2002): engineered protein (BBA5), zinc finger fold (w/o metal), 23 a.a., solvation model; thousands of trajectories, each of 5-20 ns, totaling 700 µs. Folding@home: 30,000 internet volunteers, several months, or ~a million CPU days of simulation
Worldwide distributed computing Pande group
Massively distributed computing SETI@home: Folding@home FightAIDS@home …
The problem: timescales. [Figure: time axis from 10^-15 s (femto) through 10^-12 (pico), 10^-9 (nano), 10^-6 (micro), and 10^-3 (milli) to 10^0 s (seconds). Events marked: bond vibration, isomerisation, water dynamics, helix forms, fastest folders, typical folders, slow folders. Annotations: MD step, long MD run, where we need to be, where we’d love to be.] 16 orders of magnitude range. Femtosecond timesteps. Need to simulate micro- to milliseconds. Pande group
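The size of the timescale gap is easy to check with a little arithmetic. The sketch below counts how many MD steps are needed to reach a given simulated time, assuming a 1 fs timestep (the slide only says "femtosecond timesteps", so the exact value is an assumption).

```python
def md_steps(simulated_seconds, timestep_seconds=1e-15):
    """Number of integration steps needed to cover the simulated time."""
    return round(simulated_seconds / timestep_seconds)

for label, seconds in [("1 microsecond", 1e-6),
                       ("1 millisecond", 1e-3),
                       ("1 second", 1.0)]:
    print(f"{label}: about {md_steps(seconds):.0e} steps")
```

Each factor of 1000 in simulated time costs a factor of 1000 in steps, which is why single long trajectories and massively distributed short trajectories are both used to attack the problem.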
Biology Can’t Wait! (Evolution to rescue) One Big Family.
How to do 1D->3D ab initio? (II) Biology-based approach: data (knowledge)-mining. Ignore the actual folding process in the cell; instead, focus on the end point!
The 1-2-3 (1D -> fragment -> 3D) approach. Primary: LGINCRGSSQCGLSGGNLMVRIRDQACGNQGQTWCPGERRAKVCGTGNSISAYVQSTNNCISGTEACRHLTNLVNHGCRVCGSDPLYAGNDVSRGQLTVNYVNSC. Seq. to str. mapping: fragments (structural motifs). Tertiary: fragment assembly
The I-sites library (Baker’s group)
Fragment insertion Monte Carlo. Rosetta: a folding simulation program (a trial-and-error process). [Diagram labels: backbone torsion angles; choose a fragment; change backbone angles; convert to 3D; evaluate (energy function); accept or reject fragments] http://www.cs.huji.ac.il/course/2002/cbio/handouts/Class8
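The loop on this slide (choose a fragment, change the backbone angles, evaluate, accept or reject) can be sketched as a Metropolis Monte Carlo search over backbone torsions. This is an illustration of the idea only, not Rosetta's code: the fragment library, the energy function, and all names below are hypothetical, and the "convert to 3D" step is skipped because the toy energy is computed directly from the torsion angles.

```python
import math
import random

# Hypothetical 3-residue fragment library: each fragment is a list of
# (phi, psi) torsion pairs in degrees. A real library is sequence-specific.
EXAMPLE_FRAGMENTS = [
    [(-60.0, -45.0)] * 3,    # helix-like
    [(-120.0, 130.0)] * 3,   # strand-like
    [(-75.0, 150.0)] * 3,    # arbitrary third choice
]

def toy_energy(torsions, target=(-60.0, -45.0)):
    """Stand-in energy: squared deviation from one target (phi, psi) pair."""
    return sum((phi - target[0]) ** 2 + (psi - target[1]) ** 2
               for phi, psi in torsions) / 1000.0

def fragment_insertion_mc(n_residues, fragments, energy,
                          steps=2000, temperature=1.0, seed=0):
    """Metropolis search: insert random fragments at random positions."""
    rng = random.Random(seed)
    frag_len = len(fragments[0])
    torsions = [(180.0, 180.0)] * n_residues   # start fully extended
    current = energy(torsions)
    for _ in range(steps):
        start = rng.randrange(n_residues - frag_len + 1)
        fragment = rng.choice(fragments)
        trial = torsions[:start] + list(fragment) + torsions[start + frag_len:]
        trial_energy = energy(trial)
        # Always accept downhill moves; accept uphill moves with
        # Boltzmann probability exp(-dE / T).
        delta = trial_energy - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            torsions, current = trial, trial_energy
    return torsions, current

final_torsions, final_energy = fragment_insertion_mc(10, EXAMPLE_FRAGMENTS, toy_energy)
print(final_energy)
```

In practice the energy function and the quality of the fragment library decide whether such a search ends near the native fold; the loop itself stays this simple.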
Does it work? The CASP experiments
One lab dominated in CASP4. Baker’s group dominated ab initio (knowledge-based) prediction in CASP4.
Some CASP4 successes Baker’s group
ROSETTA results at CASP5: # of residues with cRMS below 4Å/6Å

Name   Length   Human   Automatic   Best decoy
T135   106      83/98   54/64       94/105
T149   116      52/71   44/62       76/92
T161   154      45/83   57/79       55/95

Rosetta predictions in CASP5: Successes, failures, and prospect for complete automation. Baker et al., Proteins, 53:457-468 (2003)
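For reference, cRMS (coordinate root-mean-square deviation) between a model and the experimental structure can be computed as below. This is a minimal sketch that assumes the two coordinate sets are already optimally superimposed and list the same atoms in the same order; it does not perform the superposition, nor the search for the largest subset of residues under a 4 Å or 6 Å cutoff that the table reports.

```python
import math

def crms(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes the structures are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Toy example: the "model" is the "native" shifted by (3, 4, 0).
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
model = [(x + 3.0, y + 4.0, z) for x, y, z in native]
print(crms(native, model))
```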
Toward High-Resolution de Novo Structure Prediction for Small Proteins --Philip Bradley, Kira M. S. Misura, David Baker (Science 2005) The prediction of protein structure from amino acid sequence is a grand challenge of computational molecular biology. By using a combination of improved low- and high-resolution conformational sampling methods, improved atomically detailed potential functions that capture the jigsaw puzzle–like packing of protein cores, and high-performance computing, high-resolution structure prediction (<1.5 angstroms) can be achieved for small protein domains (<85 residues). The primary bottleneck to consistent high-resolution prediction appears to be conformational sampling.
Still, not practical for most … Small proteins Expensive (computationally): sampling Not for everyday biologists …
HM: the poor man’s solution
Similar sequences
Similar structures with low sequence similarity 9% sequence identity Shapiro & Harris, 2000
Another example FtsZ and tubulin would not be recognized as homologous by sequence comparison Burns, R., Nature 391:121-123 (1998)
Fold recognition Query sequence Library of known folds Best-fit fold Mark Gerstein Lab
FR by threading. Query sequence: thread the sequence onto the fold template. Use structural properties to evaluate the fit: environment, pairwise interactions. Mark Gerstein Lab
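A toy version of threading, using only the environment term, might look like the sketch below. Everything in it is a simplifying assumption made for illustration: templates are reduced to strings of B (buried) and E (exposed) positions, the score is +1 when a hydrophobic residue sits in a buried position or a polar residue in an exposed one and -1 otherwise, the alignment is gapless, and pairwise interactions are ignored.

```python
HYDROPHOBIC = set("AVILMFWC")   # crude hydrophobic/polar split (assumption)

def position_score(residue, environment):
    """+1 if residue type suits the environment ('B' or 'E'), else -1."""
    buried = environment == "B"
    return 1 if (residue in HYDROPHOBIC) == buried else -1

def thread(query, template_env):
    """Slide the query along the template without gaps.

    Returns (best_score, offset), or None if the template is too short.
    """
    best = None
    for offset in range(len(template_env) - len(query) + 1):
        score = sum(position_score(res, template_env[offset + i])
                    for i, res in enumerate(query))
        if best is None or score > best[0]:
            best = (score, offset)
    return best

def best_fold(query, library):
    """Pick the template name with the highest threading score."""
    hits = {name: thread(query, env) for name, env in library.items()}
    hits = {name: hit for name, hit in hits.items() if hit is not None}
    if not hits:
        return None
    return max(hits.items(), key=lambda item: item[1][0])

# Hypothetical two-template library.
library = {"fold1": "BBEE", "fold2": "EEBB"}
print(best_fold("AVKE", library))
```

Real threading programs allow gaps and use statistically derived potentials, but the overall shape is the same: score the query against every fold in the library and report the best fit.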
Pitfalls of comparative (homology) modeling
Difficult to detect and correct alignment errors
More similar to template than to true structure
Cannot predict novel folds (template may be wrong!)
Structure Prediction Methods. [Figure: homology modeling, fold recognition, and ab initio prediction placed along a % sequence identity axis (0 to 100), with the twilight zone marked]
Protein Structure Prediction clickable map http://speedy.embl-heidelberg.de/gtsp/flowchart2.html
Reliability and uses of comparative models Marti-Renom et al. (2000)
Success and limitations of structure prediction
Limitations: Models of large and remotely related proteins are not very accurate. Domain boundaries are difficult to define. Models often do not provide details for functional annotation.
Success: Accuracy scores almost doubled from CASP1 to CASP6, possibly because of growth in database size. Models of small targets are very accurate.
Kryshtafovych et al. 2005 http://www.bioeng.ru/Manager/Files/Panchenko/shaitan_kurs_lab2.ppt
Structural Bioinformatics: Sequence/Structure Relationship. [Figure labels: all possible sequences of amino acids; protein sequences observed in nature; protein structures observed in nature; percent identity axis from 100 down to 10; twilight zone; midnight zone]
Final exam assignment. Find a protein sequence from any organism that shares no more than 40% sequence identity with any accessible entry in PDB. Predict the 3D structure of your protein using whatever method/tool/server/database you like. Write a ~5 page report documenting how you found the sequence, how you did (or got) the prediction, and how you visualized/described the predicted model, along with thoughts/comments on your learning process. Submit your report to Cathy by 6/22/2011. Need help? Ask, read/surf, and try it!
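To check the 40% criterion yourself, percent identity can be computed from a pairwise alignment, for example as sketched below. Note the assumption: identity is counted over aligned (non-gap) columns only. Other tools divide by the alignment length or by the shorter sequence length, so reported values can differ.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

# Toy aligned pair (made up): 3 of 4 aligned columns match.
print(percent_identity("ACDE", "ACDF"))
```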
PDB: the one-stop shop for structure bioinformatics
Selected Structural Biology Databases, Servers and Services
CASP-certified protein structure prediction servers: I-TASSER, ROBETTA, HHpred, METATASSER, MULTICOM, Pcons, SAM-T08, 3D-Jury, THREADER
Comparative modeling servers: SwissModel, MODELLER
Protein secondary structure prediction servers: PSIpred, JPRED
Database of protein structures: PDB (Protein Data Bank)
Structural classifications of proteins: SCOP, CATH
Structural neighbors database: Dali Database
Thank You!
3D to 1D? Science 2003
A computer-designed protein (93 aa) with 1.2 Å resolution
Structure prediction servers http://bioinfo.pl/cafasp/list.html
Hybrid approach for solving macromolecular complex structures
(Rost, 1996)
Levinthal’s paradox (1969). If we assume three possible states for every flexible dihedral angle in the backbone of a 100-residue protein, the number of possible backbone configurations is 3^200. Even an incredibly fast computational or physical sampling in 10^-15 s would mean that a complete sampling would take 10^80 s, which exceeds the age of the universe by more than 60 orders of magnitude. Yet proteins fold in seconds or less! Berendsen
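The numbers in Levinthal's argument can be verified with a few lines of arithmetic. The sketch assumes two flexible backbone dihedrals (phi and psi) per residue, which gives 200 angles for 100 residues, and takes the age of the universe as roughly 4.3 x 10^17 s (about 13.8 billion years); both values are supplied here for the check and are not stated on the slide.

```python
import math

states_per_angle = 3
n_angles = 2 * 100                 # phi and psi for each of 100 residues
seconds_per_sample = 1e-15         # one conformation per femtosecond
age_of_universe_s = 4.3e17         # ~13.8 billion years (assumed value)

# Work in log10 to keep the numbers readable.
log10_conformations = n_angles * math.log10(states_per_angle)
log10_total_seconds = log10_conformations + math.log10(seconds_per_sample)
log10_ratio = log10_total_seconds - math.log10(age_of_universe_s)

print(f"conformations  ~ 10^{log10_conformations:.1f}")
print(f"sampling time  ~ 10^{log10_total_seconds:.1f} s")
print(f"vs. age of universe: 10^{log10_ratio:.1f} times longer")
```

The result agrees with the slide: about 10^95 conformations, about 10^80 s to sample them all, and more than 60 orders of magnitude beyond the age of the universe.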
The Rosetta method. DECOYS: Generate a large number of possible shapes. Need good decoy structures. DISCRIMINATION: Select the correct, native-like fold. Need a good energy function. Kochl
Nature 2007