Protein Structure Prediction Xiaole Shirley Liu And Jun Liu STAT115
2 Protein Structure Prediction Ram Samudrala University of Washington
STAT1153 Outline Motivations and introduction Protein 2 nd structure predictionProtein 2 nd structure prediction Protein 3D structure prediction –CASPCASP –Homology modelingHomology modeling –Fold recognitionFold recognition –ab initio predictionab initio prediction –Manual vs automationManual vs automation Structural genomics
STAT1154 Protein Structure Sequence determines structure, structure determines function Most proteins can fold by itself very quickly Folded structure: lowest energy state
5 Protein Structure Main forces for considerations –Steric complementarity –Secondary structure preferences (satisfy H bonds) –Hydrophobic/polar patterning –Electrostatics
6 Rationale for understanding protein structure and function Protein sequence -large numbers of sequences, including whole genomes Protein function - rational drug design and treatment of disease - protein and genetic engineering - build networks to model cellular pathways - study organismal function and evolution ? structure determination structure prediction homology rational mutagenesis biochemical analysis model studies Protein structure - three dimensional - complicated - mediates function
STAT1157 Protein Databases SwissProt: protein knowledgebaseSwissProt PDB: Protein Data Bank, 3D structurePDB
8 View Protein Structure Free interactive viewers Download 3D coordinate file from PDB Quick and dirty: –VRML –Rasmol –Chime More powerful –Swiss-PdbViewer
9 Compare Protein Structures Structure is more conserved than sequence Why compare? –Detect evolutionary relationships –Identify recurring structural motifs –Predicting function based on structure –Assess predicted structures Protein structure comparison and classification –Manual: SCOP –Automated: DALI
10 Compare protein structures Need ways to determine if two protein structures are related and to compare predicted models to experimental structures Commonly used measure is the root mean square deviation (RMSD) of the Cartesian atoms between two structures after optimal superposition (McLachlan, 1979): Usually use C atoms 3.6 Å2.9 Å NK-lysin (1nkl)Bacteriocin T102/as48 (1e68)T102 best model Other measures include contact maps and torsion angle RMSDs
STAT11511 SCOP Compare protein structure, identify recurring structural motifs, predict function A. Murzin et al, 1995 –Manual classification –A few folds are highly populated –5 folds contain 20% of all homologous superfamilies –Some folds are multifunctional
STAT11512 Determine Protein Structure X-ray crystallography (gold standard) –Grow crystals, rate limiting, relies on the repeating structure of a crystalline lattice –Collect a diffraction pattern –Map to real space electron density, build and refine structural model –Painstaking and time consuming
STAT11513 Protein Structure Prediction Since AA sequence determines structure, can we predict protein structure from its AA sequence? = predicting the three angles, unlimited DoF! Physical properties that determine fold –Rigidity of the protein backbone –Interactions among amino acids, including Electrostatic interactions van der Waals forces Volume constraints Hydrogen, disulfide bonds –Interactions of amino acids with water
14 unfolded Protein folding landscape Large multi-dimensional space of changing conformations free energy folding reaction molten globule J=10 -8 s native J=10 -3 s *G**G* barrier height
15 Protein primary structure twenty types of amino acids R H C OHOH O N H H Cα two amino acids join by forming a peptide bond R Cα H C O N H HN H C O OHOH R H R H C O N H N H C O R H R H C O N H N H C O R H each residue in the amino acid main chain has two degrees of freedom ( and the amino acid side chains can have up to four degrees of freedom 1-4
STAT nd Structure Prediction helix, sheet, turn/loop
STAT nd Structure Prediction Chou-Fasman 1974 Base on 15 proteins (2473 AAs) of known conformation, determine P , P fromP , P Empirical rules for 2 nd struct nucleation –4 H or h out of 6 AA, extends to both dir, P > 1.03, P > P , no breakers –3 H or h out of 5 AA, extends to both dir, P > 1.05, P > P , no breakers Have ~50-60% accuracy
STAT11518 P and P
STAT nd Structure Prediction Garnier, Osguthorpe, Robson, 1978 Assumption: each AA influenced by flanking positions GOR scoring tables (problem: limited dataset) Add scores, assign 2 nd with highest score
STAT nd Structure Prediction D. Eisenberg, 1986 –Plot hydrophobicity as function of sequence position, look for periodic repeats –Period = 3-4 AA, (3.6 aa / turn) –Period = 2 AA, sheet Best overall JPRED by Geoffrey Barton, use many different approaches, get consensusJPRED –Overall accuracy: 72.9%
STAT D Protein Structure Prediction CASP contest: Critical Assessment of Structure Prediction Biannual meeting since 1994 at Asilomar, CA Experimentalists: before CASP, submit sequence of to-be-solved structure to central repository Predictors: download sequence and minimal information, make predictions in three categories Assessors: automatic programs and experts to evaluate predictions quality
STAT11522 CASP Category I Homology Modeling (sequences with high homology to sequences of known structure) Given a sequence with homology > 25-30% with known structure in PDB, use known structure as starting point to create a model of the 3D structure of the sequence Takes advantage of knowledge of a closely related protein. Use sequence alignment techniques to establish correspondences between known “template” and unknown.
STAT11523 CASP Category II Fold recognition (sequences with no sequence identity (<= 30%) to sequences of known structure Given the sequence, and a set of folds observed in PDB, see if any of the sequences could adopt one of the known folds Takes advantage of knowledge of existing structures, and principles by which they are stabilized (favorable interactions)
STAT11524 CASP Category III Ab initio prediction (no known homology with any sequence of known structure) Given only the sequence, predict the 3D structure from “first principles”, based on energetic or statistical principles Secondary structure prediction and multiple alignment techniques used to predict features of these molecules. Then, some method necessary for assembling 3D structure.
STAT11525 Structure Prediction Evaluation Hydrophobic core similar? 2 nd struct identified? Energy: minimized? H-bond contacts? Compare with solved crystal structure: gold standard
26 Comparative modelling of protein structure KDHPFGFAVPTKNPDGTMNLMNWECAIP KDPPAGIGAPQDN----QNIMLWNAVIP ** * * * * * * * ** …… scan align build initial model construct non-conserved side chains and main chains refine
STAT11527 Homology Modeling Results When sequence homology is > 70%, high resolution models are possible (< 3 Å RMSD) MODELLER (Sali et al)MODELLER –Find homologous proteins with known structure and align –Collect distance distributions between atoms in known protein structures –Use these distributions to compute positions for equivalent atoms in alignment –Refine using energetics
STAT11528 Homology Modeling Results Many places can go wrong: –Bad template - it doesn’t have the same structure as the target after all –Bad alignment (a very common problem) –Good alignment to good template still gives wrong local structure –Bad loop construction –Bad side chain positioning
STAT11529 Homology Modeling Results Use of sensitive multiple alignment (e.g. PSI-BLAST) techniques helped get best alignments Sophisticated energy minimization techniques do not dramatically improve upon initial guess
STAT11530 Fold Recognition Results Also called protein threading Given new sequence and library of known folds, find best alignment of sequence to each fold, returned the most favorable one
STAT11531 Fold Recognition with Dynamic Programming Environmental class for each AA based on known folds (buried status, polarity, 2 nd struct)
STAT11532 Protein Folding with Dynamic Programming D. Eisenburg 1994 Align sequence to each fold (a string of environmental classes) Advantages: fast and works pretty well Disadvantages: do not consider AA contacts
STAT11533 Fold Recognition Results Each predictor can submit N top hits Every predictor does well on something Common folds (more examples) are easier to recognize Fold recognition was the surprise performer at CASP1. Incremental progress at CASP2, CASP3, CASP4…
STAT11534 Fold Recognition Results Alignment (seq to fold) is a big problem
STAT11535 ab initio Predict interresidue contacts and then compute structure (mild success) Simplified energy term + reduced search space (phi/psi or lattice) (moderate success) Creative ways to memorize sequence structure correlations in short segments from the PDB, and use these to model new structures: ROSETTA
36 Ab initio prediction of protein structure sample conformational space such that native-like conformations are found astronomically large number of conformations 5 states/100 residues = = select hard to design functions that are not fooled by non-native conformations (“decoys”)
37 Sampling conformational space – continuous approaches Most work in the field - Molecular dynamics - Continuous energy minimization (follow a valley) - Monte Carlo simulation - Genetic Algorithms Like real polypeptide folding process Cannot be sure if native-like conformations are sampled energy
38 Molecular dynamics Force = -dU/dx (slope of potential U); acceleration, force = m ×a(t) All atoms are moving so forces between atoms are complicated functions of time Analytical solution for x(t) and v(t) is impossible; numerical solution is trivial Atoms move for very short times of seconds or picoseconds (ps) x(t+ t) = x(t) + v(t) t + [4a(t) – a(t- t)] t 2 /6 v(t+ t) = v(t) + [2a(t+ t)+5a(t)-a(t- t)] t/6 U kinetic = ½ Σ m i v i (t) 2 = ½ n K B T Total energy (U potential + U kinetic ) must not change with time new position old position new velocity old velocity acceleration n is number of coordinates (not atoms)
39 Energy minimization For a given protein, the energy depends on thousands of x,y,z Cartesian atomic coordinates; reaching a deep minimum is not trivial Furthermore, we want to minimize the free energy, not just the potential energy. energy number of steps deep minimum starting conformation
40 Monte Carlo Simulation Propose moves in torsion or Cartesian conformation space Evaluate energy after every move, compute E Accept the new conformation based on If run infinite time, the simulated conformation follows the Boltzmann distribution Many variations, including simulated annealing and other heuristic approaches.
41 Scoring/energy functions Need a way to select native-like conformations from non-native ones Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms. Knowledge-based scoring functions: –Derive information about atomic properties from a database of experimentally determined conformations –Common parameters include pairwise atomic distances and amino acid burial/exposure.
STAT11542 Rosetta D. Baker, U. Wash Break sequence into short segments (7-9 AA) Sample 3D from library of known segment structures, parallel computation Use simulated annealing (metropolis-type algorithm) for global optimization –Propose a change, if better energy, take; otherwise take at smaller probability Create 1000 structures, cluster and choose one representative from each cluster to submit
STAT11543 Manual Improvements and Automation Very often manual examination could improve prediction –Catch errors –Need domain knowledge –A. Murzin’s success at CASP2 CAFASP: Critical Assessment of Fully Automated Structure Prediction –Murzin Can’t play!! MetaServers: combine different methods to get consensus
STAT11544 CAFASP Evaluation
STAT11545 Structural Genomics With more and more solved structures and novel folds, computational protein structure prediction is going to improve Structural genomics:Structural genomics –Worldwide initiative to high throughput determine many protein structures –Especially, solve structures that have no homology
STAT11546 Summary Protein structures: 1 st, 2 nd, 3 rd, 4 th –Different DB: SwissProt, PDB and SCOP –Determine structure: X-ray crystallography Protein structure prediction: –2 nd structure prediction –Homology modeling –Fold recognition –Ab initio –Evaluation: energy, RMSD, etc –CASP and CAFASP contest Manual improvement and combination of computational approaches work better Structural Genomics, still very difficult problem…
STAT11547 Acknowledgement Amy Keating Michael Yaffe Mark Craven Russ Altman