Protein Structure Prediction Xiaole Shirley Liu And Jun Liu STAT115.

2 Protein Structure Prediction Ram Samudrala University of Washington

3 Outline
Motivations and introduction
Protein 2nd structure prediction
Protein 3D structure prediction
–CASP
–Homology modeling
–Fold recognition
–ab initio prediction
–Manual vs automation
Structural genomics

4 Protein Structure
Sequence determines structure, structure determines function
Most proteins can fold by themselves very quickly
Folded structure: lowest energy state

5 Protein Structure
Main forces to consider
–Steric complementarity
–Secondary structure preferences (satisfy H bonds)
–Hydrophobic/polar patterning
–Electrostatics

6 Rationale for understanding protein structure and function
Protein sequence
–Large numbers of sequences, including whole genomes
Protein structure
–Three dimensional
–Complicated
–Mediates function
Protein function
–Rational drug design and treatment of disease
–Protein and genetic engineering
–Build networks to model cellular pathways
–Study organismal function and evolution
Approaches linking sequence, structure and function (diagram labels): structure determination, structure prediction, homology, rational mutagenesis, biochemical analysis, model studies

7 Protein Databases
SwissProt: protein knowledgebase
PDB: Protein Data Bank, 3D structure

8 View Protein Structure
Free interactive viewers
Download 3D coordinate file from PDB
Quick and dirty:
–VRML
–RasMol
–Chime
More powerful:
–Swiss-PdbViewer

9 Compare Protein Structures
Structure is more conserved than sequence
Why compare?
–Detect evolutionary relationships
–Identify recurring structural motifs
–Predict function based on structure
–Assess predicted structures
Protein structure comparison and classification
–Manual: SCOP
–Automated: DALI

10 Compare protein structures
Need ways to determine whether two protein structures are related, and to compare predicted models to experimental structures
Commonly used measure: the root mean square deviation (RMSD) of Cartesian atomic coordinates between two structures after optimal superposition (McLachlan, 1979)
Usually computed over Cα atoms
[Figure: NK-lysin (1nkl), bacteriocin T102/as48 (1e68) and the T102 best model, with RMSD labels 3.6 Å and 2.9 Å]
Other measures include contact maps and torsion angle RMSDs
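
The RMSD formula itself is not preserved in this transcript. A standard way to write it is given below, as an assumption about what the slide showed: N is the number of matched atoms (typically Cα), x_i and y_i are the matched coordinates in the two structures, and the rotation R and translation t are chosen to minimize the deviation.

```latex
\mathrm{RMSD} \;=\; \min_{R,\,\mathbf{t}}
\sqrt{\frac{1}{N}\sum_{i=1}^{N}
\bigl\lVert \mathbf{x}_i - \left(R\,\mathbf{y}_i + \mathbf{t}\right) \bigr\rVert^{2}}
```

The minimization over R and t is the "optimal superposition" step. Without it, the value would depend on how each structure happens to be placed in its coordinate frame.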

11 SCOP
Compare protein structure, identify recurring structural motifs, predict function
A. Murzin et al, 1995
–Manual classification
–A few folds are highly populated
–5 folds contain 20% of all homologous superfamilies
–Some folds are multifunctional

12 Determine Protein Structure
X-ray crystallography (gold standard)
–Grow crystals: rate limiting, relies on the repeating structure of a crystalline lattice
–Collect a diffraction pattern
–Map to real space electron density, build and refine structural model
–Painstaking and time consuming

13 Protein Structure Prediction
Since AA sequence determines structure, can we predict protein structure from its AA sequence?
= predicting the three backbone angles of every residue: a vast number of degrees of freedom (DoF)!
Physical properties that determine fold
–Rigidity of the protein backbone
–Interactions among amino acids, including electrostatic interactions, van der Waals forces, volume constraints, hydrogen and disulfide bonds
–Interactions of amino acids with water

14 Protein folding landscape
Large multi-dimensional space of changing conformations
[Figure: free energy plotted along the folding reaction, with states labeled unfolded, molten globule (10^-8 s) and native (10^-3 s), and the barrier height (ΔG) marked]

15 Protein primary structure
Twenty types of amino acids
Two amino acids join by forming a peptide bond
[Figure: chemical structures of a single amino acid and of a peptide chain]
Each residue in the amino acid main chain has two degrees of freedom (φ and ψ)
The amino acid side chains can have up to four degrees of freedom (χ1-4)

16 2nd Structure Prediction
α helix, β sheet, turn/loop

17 2nd Structure Prediction
Chou-Fasman 1974
Based on 15 proteins (2473 AAs) of known conformation, determine Pα, Pβ, ranging from 0.5 to 1.5
Empirical rules for 2nd struct nucleation
–4 Hα or hα out of 6 AA, extends in both directions, Pα > 1.03, Pα > Pβ, no α breakers
–3 Hβ or hβ out of 5 AA, extends in both directions, Pβ > 1.05, Pβ > Pα, no β breakers
Have ~50-60% accuracy
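
A minimal sketch of the window-based nucleation step described above, assuming the former and breaker classes are supplied by the caller. The residue sets below are illustrative placeholders, not the published Chou-Fasman table, and the function name `find_nuclei` is hypothetical. The extension step and the Pα/Pβ averages are left out.

```python
def find_nuclei(seq, formers, breakers, window=6, min_formers=4):
    """Return start indices of windows that satisfy the nucleation rule:
    at least `min_formers` former residues and no breaker residue."""
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        if any(aa in breakers for aa in segment):
            continue  # a breaker anywhere in the window vetoes nucleation
        if sum(aa in formers for aa in segment) >= min_formers:
            hits.append(start)
    return hits


# Illustrative residue classes only; NOT the published propensity classes.
HELIX_FORMERS = set("EALM")
HELIX_BREAKERS = set("PG")

# Helix rule: 4 formers out of 6 residues.
nuclei = find_nuclei("GSAEELLKKPAMEALG", HELIX_FORMERS, HELIX_BREAKERS)
print(nuclei)
```

The sheet rule would use the same function with `window=5, min_formers=3` and the corresponding β classes.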

18 Pα and Pβ

19 2nd Structure Prediction
Garnier, Osguthorpe, Robson, 1978
Assumption: each AA is influenced by flanking positions
GOR scoring tables (problem: limited dataset)
Add scores, assign the 2nd structure with the highest score

20 2nd Structure Prediction
D. Eisenberg, 1986
–Plot hydrophobicity as function of sequence position, look for periodic repeats
–Period = 3-4 AA, α helix (3.6 aa / turn)
–Period = 2 AA, β sheet
Best overall: JPRED by Geoffrey Barton, uses many different approaches to get a consensus
–Overall accuracy: 72.9%
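
One way to quantify the periodicity described above is a hydrophobic moment: sum the per-residue hydrophobicity as vectors rotated by a fixed angle per residue, and compare the magnitude at 100° (3.6 residues per turn) with 180° (period 2). The sketch below is an illustration under assumptions: it uses the Kyte-Doolittle hydropathy scale, which the slide does not name, and `hydrophobic_moment` is a hypothetical helper.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values (assumption: the slide does not state
# which hydrophobicity scale was used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(seq, angle_deg, scale=KYTE_DOOLITTLE):
    """Magnitude of sum_n H_n * exp(i * n * angle) over the segment."""
    delta = math.radians(angle_deg)
    total = sum(scale[aa] * cmath.exp(1j * delta * n)
                for n, aa in enumerate(seq))
    return abs(total)


# A strictly alternating hydrophobic/polar segment has period 2.
strand_like = "VKVKVKVK"
moment_helix = hydrophobic_moment(strand_like, 100.0)   # helix periodicity
moment_strand = hydrophobic_moment(strand_like, 180.0)  # strand periodicity
print(moment_helix, moment_strand)
```

For this alternating segment the 180° moment is much larger than the 100° moment, which is the signal that would suggest a β strand rather than an α helix.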

21 3D Protein Structure Prediction
CASP contest: Critical Assessment of Structure Prediction
Biennial meeting since 1994 at Asilomar, CA
Experimentalists: before CASP, submit sequence of to-be-solved structure to central repository
Predictors: download sequence and minimal information, make predictions in three categories
Assessors: automatic programs and experts evaluate prediction quality

22 CASP Category I
Homology modeling (sequences with high homology to sequences of known structure)
Given a sequence with > 25-30% sequence identity to a known structure in PDB, use the known structure as starting point to create a model of the 3D structure of the sequence
Takes advantage of knowledge of a closely related protein
Use sequence alignment techniques to establish correspondences between the known “template” and the unknown

23 CASP Category II
Fold recognition (sequences with no significant sequence identity (<= 30%) to sequences of known structure)
Given the sequence, and a set of folds observed in PDB, see whether the sequence could adopt one of the known folds
Takes advantage of knowledge of existing structures, and principles by which they are stabilized (favorable interactions)

24 CASP Category III
Ab initio prediction (no known homology with any sequence of known structure)
Given only the sequence, predict the 3D structure from “first principles”, based on energetic or statistical principles
Secondary structure prediction and multiple alignment techniques are used to predict features of these molecules
Then, some method is necessary for assembling the 3D structure

25 Structure Prediction Evaluation
Hydrophobic core similar?
2nd struct identified?
Energy: minimized? H-bond contacts?
Compare with solved crystal structure: gold standard

26 Comparative modelling of protein structure
KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   ** ……
Steps: scan, align, build initial model, construct non-conserved side chains and main chains, refine
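
The identity thresholds quoted for homology modeling can be checked against an alignment such as the one above. A minimal sketch, assuming identity is counted over columns where neither sequence has a gap (other denominators are also in use); `percent_identity` is a hypothetical helper, and the slide does not say which row is the target and which the template.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (matches, aligned_columns, percent identity) for two rows
    of a pairwise alignment, ignoring columns that contain a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    matches = sum(a == b for a, b in pairs)
    return matches, len(pairs), 100.0 * matches / len(pairs)


row_1 = "KDHPFGFAVPTKNPDGTMNLMNWECAIP"
row_2 = "KDPPAGIGAPQDN----QNIMLWNAVIP"
matches, aligned, identity = percent_identity(row_1, row_2)
print(matches, aligned, round(identity, 1))
```

For the fragment shown on the slide this gives 11 identical residues over 24 aligned positions, about 46% identity, which agrees with the 11 asterisks under the alignment.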

27 Homology Modeling Results
When sequence identity is > 70%, high resolution models are possible (< 3 Å RMSD)
MODELLER (Sali et al)
–Find homologous proteins with known structure and align
–Collect distance distributions between atoms in known protein structures
–Use these distributions to compute positions for equivalent atoms in alignment
–Refine using energetics

28 Homology Modeling Results
Many things can go wrong:
–Bad template: it doesn’t have the same structure as the target after all
–Bad alignment (a very common problem)
–Good alignment to good template still gives wrong local structure
–Bad loop construction
–Bad side chain positioning

29 Homology Modeling Results
Use of sensitive multiple alignment techniques (e.g. PSI-BLAST) helped get the best alignments
Sophisticated energy minimization techniques do not dramatically improve upon the initial guess

30 Fold Recognition Results
Also called protein threading
Given a new sequence and a library of known folds, find the best alignment of the sequence to each fold, return the most favorable one

31 Fold Recognition with Dynamic Programming
Environmental class for each AA based on known folds (buried status, polarity, 2nd struct)

32 Protein Folding with Dynamic Programming
D. Eisenberg 1994
Align sequence to each fold (a string of environmental classes)
Advantages: fast and works pretty well
Disadvantages: does not consider AA contacts
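
The alignment of a sequence to a string of environmental classes can be sketched as ordinary global dynamic programming, with a residue-versus-environment score in place of a residue-versus-residue substitution matrix. Everything specific below is an assumption made for illustration: the two classes ("B" buried, "E" exposed), the hydrophobic residue set, the ±1 scores, the gap penalty and the tiny fold library. A real method would use many more classes and scores derived from known structures.

```python
HYDROPHOBIC = set("AVLIMFWC")  # hypothetical residue grouping


def env_score(aa, env):
    """Toy score: +1 if the residue type suits the environment, else -1."""
    buried = env == "B"
    return 1 if (aa in HYDROPHOBIC) == buried else -1


def thread_score(seq, envs, gap=-2):
    """Best global alignment score of `seq` against environment string `envs`."""
    rows, cols = len(seq) + 1, len(envs) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + env_score(seq[i - 1], envs[j - 1]),
                dp[i - 1][j] + gap,   # residue aligned to a gap
                dp[i][j - 1] + gap,   # fold position left unfilled
            )
    return dp[-1][-1]


def best_fold(seq, library):
    """Name of the fold whose environment string scores highest."""
    return max(library, key=lambda name: thread_score(seq, library[name]))


# Hypothetical two-fold library.
FOLDS = {"alternating": "BEBE", "all_buried": "BBBB"}
print(best_fold("LKLK", FOLDS))
```

Because each position is scored independently of its neighbours in space, the recursion stays quadratic in time. That is also why this formulation cannot account for residue-residue contacts.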

33 Fold Recognition Results
Each predictor can submit N top hits
Every predictor does well on something
Common folds (more examples) are easier to recognize
Fold recognition was the surprise performer at CASP1
Incremental progress at CASP2, CASP3, CASP4…

34 Fold Recognition Results
Alignment (seq to fold) is a big problem

35 ab initio
Predict interresidue contacts and then compute structure (mild success)
Simplified energy term + reduced search space (phi/psi or lattice) (moderate success)
Creative ways to memorize sequence-to-structure correlations in short segments from the PDB, and use these to model new structures: ROSETTA

36 Ab initio prediction of protein structure
Sample conformational space such that native-like conformations are found
–Astronomically large number of conformations: 5 states per residue, 100 residues = 5^100 ≈ 10^70
Select
–Hard to design functions that are not fooled by non-native conformations (“decoys”)
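
The size estimate can be checked with a short calculation, assuming 5 discrete states per residue and 100 residues:

```latex
5^{100} \;=\; 10^{\,100\,\log_{10} 5} \;\approx\; 10^{\,100 \times 0.699}
\;=\; 10^{69.9} \;\approx\; 7.9 \times 10^{69} \;\approx\; 10^{70}
```

Even at an assumed rate of one conformation per picosecond, enumerating these states would take on the order of 10^58 seconds, so exhaustive sampling is not an option.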

37 Sampling conformational space – continuous approaches
Most work in the field
–Molecular dynamics
–Continuous energy minimization (follow a valley)
–Monte Carlo simulation
–Genetic algorithms
Like the real polypeptide folding process
Cannot be sure if native-like conformations are sampled
[Figure: energy plot]

38 Molecular dynamics
Force = −dU/dx (negative slope of potential U); acceleration from force = m × a(t)
All atoms are moving, so forces between atoms are complicated functions of time
Analytical solution for x(t) and v(t) is impossible; numerical solution is trivial
Atoms move for very short times of 10^-15 seconds or 0.001 picoseconds (ps)
$x(t+\Delta t) = x(t) + v(t)\,\Delta t + \left[4a(t) - a(t-\Delta t)\right]\Delta t^{2}/6$
$v(t+\Delta t) = v(t) + \left[2a(t+\Delta t) + 5a(t) - a(t-\Delta t)\right]\Delta t/6$
$U_{\text{kinetic}} = \tfrac{1}{2}\sum_i m_i v_i(t)^{2} = \tfrac{1}{2}\, n\, k_B T$
Term labels: x(t+Δt) new position, x(t) old position, v(t+Δt) new velocity, v(t) old velocity, a acceleration
Total energy (U_potential + U_kinetic) must not change with time
n is the number of coordinates (not atoms)
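
The two update equations above have the form of Beeman's integration scheme. A minimal sketch on a one-dimensional harmonic oscillator (unit mass and unit force constant, chosen purely for illustration) shows the update order and the energy-conservation check. The first step needs a(t−Δt), which is not available at the start; setting it equal to a(t) is an assumption of this sketch.

```python
def beeman_trajectory(x0, v0, accel, dt, steps):
    """Integrate one coordinate with the position/velocity updates above.
    `accel` maps a position to an acceleration (force / mass)."""
    x, v = x0, v0
    a = accel(x)
    a_prev = a  # assumption: bootstrap a(t - dt) with a(t) for the first step
    for _ in range(steps):
        x_new = x + v * dt + (4.0 * a - a_prev) * dt * dt / 6.0
        a_new = accel(x_new)
        v_new = v + (2.0 * a_new + 5.0 * a - a_prev) * dt / 6.0
        x, v, a_prev, a = x_new, v_new, a, a_new
    return x, v


def total_energy(x, v):
    """Kinetic plus potential energy for m = 1, U(x) = x^2 / 2."""
    return 0.5 * v * v + 0.5 * x * x


# Roughly one oscillation period (2*pi) with a time step of 0.01.
x_end, v_end = beeman_trajectory(1.0, 0.0, lambda x: -x, dt=0.01, steps=628)
print(x_end, v_end, total_energy(x_end, v_end))
```

In a real simulation the acceleration of each atom depends on the positions of all the others, so `accel` would evaluate the full force field at every step. Watching the total energy is the same sanity check the slide mentions.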

39 Energy minimization
For a given protein, the energy depends on thousands of x,y,z Cartesian atomic coordinates; reaching a deep minimum is not trivial
Furthermore, we want to minimize the free energy, not just the potential energy
[Figure: energy vs. number of steps, from the starting conformation to a deep minimum]

40 Monte Carlo Simulation
Propose moves in torsion or Cartesian conformation space
Evaluate energy after every move, compute ΔE
Accept the new conformation based on the Metropolis criterion: always if ΔE ≤ 0, otherwise with probability exp(−ΔE/kT)
If run for infinite time, the simulated conformation follows the Boltzmann distribution
Many variations, including simulated annealing and other heuristic approaches
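
A common acceptance rule for this kind of simulation is the Metropolis criterion. The sketch below assumes that rule and applies it to a one-dimensional toy energy instead of a protein. The function names, the step size and the energy function are all hypothetical choices for illustration.

```python
import math
import random


def metropolis_accept(delta_e, kT, u):
    """Accept downhill moves always; accept uphill moves when the uniform
    random number u in [0, 1) falls below exp(-delta_e / kT)."""
    if delta_e <= 0:
        return True
    return u < math.exp(-delta_e / kT)


def sample(energy, x0, kT, steps, step_size, seed=0):
    """Run a Metropolis walk and return the visited coordinates."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    trajectory = [x]
    for _ in range(steps):
        x_new = x + rng.uniform(-step_size, step_size)  # propose a move
        e_new = energy(x_new)
        if metropolis_accept(e_new - e, kT, rng.random()):
            x, e = x_new, e_new                          # accept
        trajectory.append(x)                             # rejected moves repeat x
    return trajectory


trajectory = sample(lambda x: x * x, x0=0.0, kT=1.0, steps=100, step_size=0.5)
print(len(trajectory))
```

Lowering `kT` gradually during the run turns the same loop into simulated annealing, one of the variations named on the slide.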

41 Scoring/energy functions
Need a way to select native-like conformations from non-native ones
Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms
Knowledge-based scoring functions:
–Derive information about atomic properties from a database of experimentally determined conformations
–Common parameters include pairwise atomic distances and amino acid burial/exposure

42 Rosetta
D. Baker, U. Wash
Break sequence into short segments (7-9 AA)
Sample 3D from library of known segment structures, parallel computation
Use simulated annealing (Metropolis-type algorithm) for global optimization
–Propose a change; if the energy is better, accept it; otherwise accept it with a smaller probability
Create 1000 structures, cluster and choose one representative from each cluster to submit

43 Manual Improvements and Automation
Very often manual examination could improve prediction
–Catch errors
–Need domain knowledge
–A. Murzin’s success at CASP2
CAFASP: Critical Assessment of Fully Automated Structure Prediction
–Murzin can’t play!!
MetaServers: combine different methods to get consensus

44 CAFASP Evaluation

45 Structural Genomics
With more and more solved structures and novel folds, computational protein structure prediction is going to improve
Structural genomics:
–Worldwide initiative for high-throughput determination of many protein structures
–Especially, solve structures that have no homology to known structures

46 Summary
Protein structures: 1st, 2nd, 3rd, 4th
–Different DB: SwissProt, PDB and SCOP
–Determine structure: X-ray crystallography
Protein structure prediction:
–2nd structure prediction
–Homology modeling
–Fold recognition
–Ab initio
–Evaluation: energy, RMSD, etc
–CASP and CAFASP contest
Manual improvement and combination of computational approaches work better
Structural genomics, still a very difficult problem…

47 Acknowledgement
Amy Keating
Michael Yaffe
Mark Craven
Russ Altman
