Protein Structure Prediction
Ram Samudrala, University of Washington

Rationale for understanding protein structure and function
- Protein sequence: large numbers of sequences, including whole genomes
- Protein structure: three dimensional, complicated, mediates function
- Protein function: rational drug design and treatment of disease; protein and genetic engineering; build networks to model cellular pathways; study organismal function and evolution
[Diagram labels linking sequence, structure, and function: "?", structure determination, structure prediction, homology, rational mutagenesis, biochemical analysis, model studies]

Protein folding
- DNA: …-CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU-…
- Protein sequence: …-L-K-E-G-V-S-K-D-… (each letter is one amino acid)
- Unfolded protein: not unique, mobile, inactive, expanded, irregular
- Spontaneous self-organisation (~1 second)
- Native state: unique shape, precisely ordered, stable/functional, globular/compact, helices and sheets

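Protein folding landscape
- Large multi-dimensional space of changing conformations
[Figure: free energy along the folding reaction, from the unfolded state through the molten globule (J = 10^-8 s) to the native state (J = 10^-3 s); ΔG labels mark the free-energy differences, including the barrier height]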

Protein primary structure
- Twenty types of amino acids
- Two amino acids join by forming a peptide bond
- Each residue in the amino acid main chain has two degrees of freedom (φ and ψ)
- The amino acid side chains can have up to four degrees of freedom (χ1-4)
[Figure: chemical structures of a single amino acid (Cα, side chain R, amino and carboxyl groups) and of a polypeptide chain joined by peptide bonds, with the torsion angles marked]

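Protein secondary structure
- Many φ,ψ combinations are not possible
- α helix
- β sheet (anti-parallel)
- β sheet (parallel)
[Figure: plot of φ against ψ, each running from -180 to +180, with the α, β, and L regions marked; sheet diagrams show the N-to-C direction of each strand]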

Protein tertiary and quaternary structures Ribonuclease inhibitor (2bnh) Haemoglobin (1hbh) Hemagglutinin (1hgd)

Methods for determining protein structure
- Protein sequence: large numbers of sequences, including whole genomes
- Protein structure: three dimensional, complicated, mediates function
- Protein function: rational drug design and treatment of disease; protein and genetic engineering; build networks to model cellular pathways; study organismal function and evolution
[Diagram labels linking sequence, structure, and function: "?", X-ray crystallography, NMR spectroscopy, homology, rational mutagenesis, biochemical analysis, model studies]

X-ray crystallography - concept
- X-rays interact with electrons in protein molecules arranged in a crystal to produce diffraction patterns
- The diffraction patterns of the X-rays can be used to determine the three-dimensional structure of proteins
- Provides a “static” picture

X-ray crystallography - details
- Prepare protein crystals in which the proteins are organised in a precise crystal lattice
- Shine X-rays on the crystals; the X-rays diffract off the electrons of the atoms in the crystal, and the intensities of the individual reflections are measured
- Phases are usually obtained indirectly by isomorphous replacement, from the way one or a few heavy atoms incorporated into the same isomorphous crystal lattice affect the diffraction pattern
- Intensities and phases of all reflections are combined in a Fourier transform to provide maps of electron density
- Interpret the map by fitting the polypeptide chain to the contours
- Refine the model by minimising the distance between the observed amplitudes and the calculated amplitudes

NMR spectroscopy - concept
- The magnetic-spin properties of atomic nuclei within a molecule are used to obtain a list of distance constraints between atoms in the molecule, from which a three-dimensional structure of the protein molecule can be obtained
- Provides a “dynamic” picture
- Examples shown: NK-lysin (1nkl), S1 RNA binding domain (1sro)

NMR spectroscopy - details
- Protein molecules placed in a strong magnetic field have their hydrogen atoms aligned to the field; the alignment can be excited by applying radio frequency (RF) pulses
- It is possible to obtain a unique signal (chemical shift) for each hydrogen atom in a protein molecule
- Structural information arises primarily from the Nuclear Overhauser Effect (NOE), which gives information about distances between atoms in a molecule
- A pair of protons gives a detectable NOE cross-peak if they are within 5.0 Å of each other in space
- After obtaining NOE data for protons throughout the structure, a number of independent structures can be generated that are consistent with the distance constraints

Computer representation of protein structure
- Structures are stored in the Protein Data Bank (PDB), a repository of mostly experimental models based on X-ray crystallographic and NMR studies
- Atoms are defined by their Cartesian coordinates:

  ATOM   1  N    GLU  1   18.222  18.496  -16.203  1.00  21.95
  ATOM   2  CA   GLU  1   17.706  17.982  -14.905  1.00  16.74
  ATOM   3  C    GLU  1   17.368  16.466  -15.121  1.00  15.45
  ATOM   4  O    GLU  1   16.780  16.073  -16.175  1.00  18.81
  ATOM   5  CB   GLU  1   16.552  18.744  -14.351  1.00  17.35
  ATOM   6  CG   GLU  1   16.952  20.118  -13.803  1.00  24.48
  ATOM   7  CD   GLU  1   15.881  21.145  -13.597  1.00  31.51
  ATOM   8  OE1  GLU  1   16.012  22.316  -13.292  1.00  29.12
  ATOM   9  OE2  GLU  1   14.701  20.768  -13.799  1.00  35.19
  ATOM  10  N    PHE  2   17.762  15.746  -14.052  1.00  15.83
  ATOM  11  CA   PHE  2   17.509  14.262  -14.184  1.00  13.24

- These structures provide the basis for most theoretical work in protein folding and protein structure prediction
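As a minimal sketch, the simplified records printed on the slide can be read into Python by splitting on whitespace. The `Atom` type and `parse_atom_line` function are hypothetical names introduced here for illustration. The last two columns are taken to be occupancy and temperature factor, as in standard PDB files. Real PDB files are fixed-column and carry extra fields such as chain identifiers, so a production parser would slice columns rather than split on whitespace.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom_line(line: str) -> Atom:
    """Parse one whitespace-separated ATOM record as printed on the slide.

    Assumption: ten fields per record; this is NOT the full fixed-column
    PDB format.
    """
    fields = line.split()
    if len(fields) != 10 or fields[0] != "ATOM":
        raise ValueError(f"not a simplified ATOM record: {line!r}")
    return Atom(
        int(fields[1]), fields[2], fields[3], int(fields[4]),
        float(fields[5]), float(fields[6]), float(fields[7]),
        float(fields[8]), float(fields[9]),
    )


# Four of the records from the slide.
RECORDS = """\
ATOM 1 N GLU 1 18.222 18.496 -16.203 1.00 21.95
ATOM 2 CA GLU 1 17.706 17.982 -14.905 1.00 16.74
ATOM 10 N PHE 2 17.762 15.746 -14.052 1.00 15.83
ATOM 11 CA PHE 2 17.509 14.262 -14.184 1.00 13.24
"""

atoms = [parse_atom_line(line) for line in RECORDS.splitlines()]

# Keep only the C-alpha atoms, the subset usually used for comparisons.
ca_atoms = [atom for atom in atoms if atom.name == "CA"]
```

Selecting the `CA` records gives the one-point-per-residue trace that structure comparisons usually work with.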

Comparison of protein structures
- Need ways to determine whether two protein structures are related and to compare predicted models to experimental structures
- A commonly used measure is the root mean square deviation (RMSD) of the Cartesian atoms between two structures after optimal superposition (McLachlan, 1979)
- Usually use Cα atoms
- Other measures include contact maps and torsion angle RMSDs
[Figure: NK-lysin (1nkl), bacteriocin T102/as48 (1e68), and the T102 best model, with RMSD labels of 3.6 Å and 2.9 Å]
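The RMSD formula on this slide did not survive in the transcript. For reference, the standard definition can be written as follows. The symbols are assumptions introduced here: x_i and y_i are the coordinates of the N equivalent atoms in the two structures, and R and t are the rotation and translation chosen to minimise the deviation.

```latex
\mathrm{RMSD}
  = \min_{R,\,\mathbf{t}}
    \sqrt{\frac{1}{N}\sum_{i=1}^{N}
      \left\lVert R\,\mathbf{x}_i + \mathbf{t} - \mathbf{y}_i \right\rVert^{2}}
```

With Cα atoms only, N is the number of aligned residues.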

Methods for predicting protein structure
- Protein sequence: large numbers of sequences, including whole genomes
- Protein structure: three dimensional, complicated, mediates function
- Protein function: rational drug design and treatment of disease; protein and genetic engineering; build networks to model cellular pathways; study organismal function and evolution
[Diagram labels linking sequence, structure, and function: "?", comparative modelling, fold recognition, ab initio prediction, homology, rational mutagenesis, biochemical analysis, model studies]

Comparative modelling of protein structure
- Proteins that have similar sequences (i.e., related by evolution) have similar three-dimensional structures
- A model of a protein whose structure is not known can be constructed if the structure of a related protein has been determined by experimental methods
- Similarity must be obvious and significant for good models to be built
- Need ways to build regions that are not similar between the two related proteins
- Need ways to move the model closer to the native structure

Comparative modelling of protein structure
Steps: scan, align, build initial model, construct non-conserved side chains and main chains, refine

  KDHPFGFAVPTKNPDGTMNLMNWECAIP
  KDPPAGIGAPQDN----QNIMLWNAVIP
  ** * *   *  *     * * *   **

Fold recognition
- The number of possible protein structures/folds is limited (large number of sequences but few folds)
- Proteins that do not have similar sequences sometimes have similar three-dimensional structures
- A sequence whose structure is not known is fitted directly (or “threaded”) onto a known structure and the “goodness of fit” is evaluated using a discriminatory function
- Need ways to move the model closer to the native structure
- Example shown: NK-lysin (1nkl) and bacteriocin T102/as48 (1e68), 3.6 Å, 5% ID

Fold recognition
Steps: evaluate fit, build initial model, construct non-conserved side chains and main chains, refine

  KDHPFGFAVPTKNPDGTMNLMNWECAIP
  KDPPAGIGAPQDN----QNIMLWNAVIP
  ** * *   *  *     * * *   **

Ab initio prediction of protein structure – concept
- Go from sequence to structure by sampling the conformational space in a reasonable manner and selecting a native-like conformation using a good discrimination function
- Problems: the conformational space is astronomical, and it is hard to design functions that are not fooled by non-native conformations (or “decoys”)

Ab initio prediction of protein structure
- Sample conformational space such that native-like conformations are found
- Astronomically large number of conformations: 5 states per residue for 100 residues gives 5^100 ≈ 10^70
- Select: hard to design functions that are not fooled by non-native conformations (“decoys”)
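The size estimate can be checked with a few lines of arithmetic. The variable names are illustrative; the 5 states and 100 residues come from the slide.

```python
import math

states_per_residue = 5
residues = 100

# Exact count (Python integers have arbitrary precision).
conformations = states_per_residue ** residues

# Power of ten: 100 * log10(5), which is just under 70.
exponent = residues * math.log10(states_per_residue)

print(f"5^100 is about 10^{exponent:.1f}")
```

The exponent comes out just under 70, which is where the slide's figure of roughly 10^70 conformations comes from.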

Sampling conformational space – continuous approaches
Most work in the field:
- Molecular dynamics
- Continuous energy minimisation (follow a valley)
- Monte Carlo simulation
- Genetic algorithms
Like the real polypeptide folding process
Cannot be sure if native-like conformations are sampled
[Figure: energy landscape]

Molecular dynamics
- Force = $-dU/dx$ (slope of the potential $U$); the acceleration follows from $m\,a(t) = \text{force}$
- All atoms are moving, so forces between atoms are complicated functions of time
- Analytical solution for $x(t)$ and $v(t)$ is impossible; numerical solution is trivial
- Atoms move for very short time steps of $10^{-15}$ seconds, or 0.001 picoseconds (ps)
- New position from the old position, old velocity, and acceleration: $x(t+\Delta t) = x(t) + v(t)\,\Delta t + [4a(t) - a(t-\Delta t)]\,\Delta t^{2}/6$
- New velocity from the old velocity and acceleration: $v(t+\Delta t) = v(t) + [2a(t+\Delta t) + 5a(t) - a(t-\Delta t)]\,\Delta t/6$
- $U_{\text{kinetic}} = \tfrac{1}{2}\sum_i m_i v_i(t)^{2} = \tfrac{1}{2}\,n\,k_B T$, where $n$ is the number of coordinates (not atoms)
- Total energy ($U_{\text{potential}} + U_{\text{kinetic}}$) must not change with time
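The two update equations can be tried on a toy system. The sketch below applies them to a single coordinate on a harmonic spring. The spring, its constants, the time step, and the function names are assumptions made for illustration and are not a molecular force field. At the first step, a(t - Δt) is simply approximated by a(t).

```python
def beeman(x, v, accel, dt, steps):
    """Integrate one coordinate with the position and velocity updates above."""
    a = accel(x)
    a_prev = a  # assumption: a(t - dt) is approximated by a(t) at the first step
    for _ in range(steps):
        # New position from old position, old velocity, and accelerations.
        x_new = x + v * dt + (4.0 * a - a_prev) * dt * dt / 6.0
        # The force depends only on position here, so a(t + dt) is available.
        a_new = accel(x_new)
        # New velocity from old velocity and accelerations.
        v_new = v + (2.0 * a_new + 5.0 * a - a_prev) * dt / 6.0
        x, v, a_prev, a = x_new, v_new, a, a_new
    return x, v


# Hypothetical toy system: unit mass on a unit spring, U = 0.5 * k * x^2.
K_SPRING = 1.0
MASS = 1.0


def harmonic_accel(x):
    return -K_SPRING * x / MASS


def total_energy(x, v):
    return 0.5 * K_SPRING * x * x + 0.5 * MASS * v * v


x0, v0 = 1.0, 0.0
x_end, v_end = beeman(x0, v0, harmonic_accel, dt=0.01, steps=1000)

# Total energy should stay (very nearly) constant over the run.
energy_drift = abs(total_energy(x_end, v_end) - total_energy(x0, v0))
```

Checking that `energy_drift` stays small is the toy version of the slide's requirement that total energy must not change with time.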

Energy minimisation
- For a given protein, the energy depends on thousands of x, y, z Cartesian atomic coordinates; reaching a deep minimum is not trivial
- With convergence, we have an accurate equilibrium conformation and a well-defined energy value
[Figure: energy and RMSD against number of steps for steepest descent and conjugate gradient, from the starting conformation to a deep minimum, with "converge" and "give up" outcomes marked]
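As a minimal sketch of steepest descent with a convergence test, the code below minimises a hypothetical two-coordinate quadratic "energy". It stands in for a molecular energy function over thousands of coordinates. The function names, step size, and tolerance are illustrative assumptions.

```python
import math


def energy(p):
    """Hypothetical toy energy with its minimum at (1, -2)."""
    x, y = p
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2


def gradient(p):
    x, y = p
    return (2.0 * (x - 1.0), 20.0 * (y + 2.0))


def steepest_descent(start, step=0.05, tol=1e-6, max_steps=10000):
    """Follow the negative gradient until its norm drops below tol.

    Returns (point, steps_taken, converged). Running out of steps
    corresponds to the "give up" outcome.
    """
    p = start
    for n in range(max_steps):
        g = gradient(p)
        if math.hypot(g[0], g[1]) < tol:
            return p, n, True
        p = (p[0] - step * g[0], p[1] - step * g[1])
    return p, max_steps, False


minimum, steps_taken, converged = steepest_descent((0.0, 0.0))
```

On real energy surfaces a fixed step size is rarely adequate, which is one reason methods such as conjugate gradient are used as well.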

Monte Carlo simulation
- Discrete moves in torsion or Cartesian conformational space
- Evaluate the energy after every move and compare it to the previous energy (ΔE)
- Accept the conformation based on the Boltzmann probability
- Many variations, including simulated annealing (starting with a high temperature so more moves are accepted initially, and then cooling)
- If run for infinite time, the simulation will produce a Boltzmann distribution
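The acceptance formula itself is missing from the transcript. The sketch below assumes the standard Metropolis criterion: always accept when ΔE ≤ 0, otherwise accept with probability exp(-ΔE / k_B T). The one-dimensional double-well energy, the cooling schedule, and all names are hypothetical choices for illustration.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng, k_b=1.0):
    """Assumed Metropolis rule: downhill always, uphill with Boltzmann probability."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / (k_b * temperature))


def toy_energy(x):
    """Hypothetical double-well energy; the left well is the deeper one."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def simulated_annealing(x0, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    x, e = x0, toy_energy(x0)
    best_x, best_e = x, e
    for i in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / (steps - 1))
        trial = x + rng.uniform(-0.5, 0.5)  # discrete random move
        e_trial = toy_energy(trial)
        if metropolis_accept(e_trial - e, t, rng):
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = simulated_annealing(2.0)
```

Starting hot lets the walk cross the barrier between the wells; as the temperature falls, uphill moves are accepted less and less often.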

Genetic algorithms
- Generate an initial pool of conformations
- Perform crossover and mutation operations on this set to generate a much larger pool of conformations
- Select a subset of the fittest conformations from this large pool
- Repeat the above two steps until convergence
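The four steps above can be sketched on a toy problem. Here a conformation is a list of discrete states, one per residue. The score simply counts mismatches against a hypothetical target string and stands in for a real energy or scoring function. Pool sizes, mutation rate, and all names are illustrative assumptions.

```python
import random

N_STATES = 5
# Hypothetical "native" state assignment, used only to define a toy score.
TARGET = [0, 3, 1, 4, 2, 2, 0, 1, 3, 4]


def score(conf):
    """Lower is fitter: number of positions that differ from TARGET."""
    return sum(1 for a, b in zip(conf, TARGET) if a != b)


def crossover(parent1, parent2, rng):
    """Single-point crossover between two parent conformations."""
    cut = rng.randrange(1, len(parent1))
    return parent1[:cut] + parent2[cut:]


def mutate(conf, rng, rate=0.1):
    """Reassign each position to a random state with probability rate."""
    return [rng.randrange(N_STATES) if rng.random() < rate else s for s in conf]


def genetic_search(pool_size=20, offspring=100, generations=50, seed=0):
    rng = random.Random(seed)
    # Step 1: initial pool of random conformations.
    pool = [[rng.randrange(N_STATES) for _ in TARGET] for _ in range(pool_size)]
    initial_best = min(score(c) for c in pool)
    for _ in range(generations):
        # Step 2: crossover and mutation produce a much larger pool.
        children = [
            mutate(crossover(rng.choice(pool), rng.choice(pool), rng), rng)
            for _ in range(offspring)
        ]
        # Step 3: keep the fittest; parents stay eligible, so the best never worsens.
        pool = sorted(pool + children, key=score)[:pool_size]
        # Step 4: repeat until converged (here: a perfect score).
        if score(pool[0]) == 0:
            break
    return pool[0], initial_best


best_conf, initial_best = genetic_search()
```

Keeping the parents in the selection step is a design choice (elitism); without it the best conformation found so far can be lost between generations.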

Sampling conformational space – exhaustive approaches
- Enumerate all possible conformations
- View the entire space (perfect partition function)
- Computationally intractable: 5 states per residue for 100 residues gives 5^100 ≈ 10^70 possible conformations
- Must use discrete state models to minimise the number of conformations explored
- Select

Scoring/energy functions
- Need a way to select native-like conformations from non-native ones
- Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms
- Knowledge-based scoring functions: derive information about atomic properties from a database of experimentally determined conformations; common parameters include pairwise atomic distances and amino acid burial/exposure

Requirements for sampling methods and scoring functions
- Sampling methods must produce good decoy sets that are comprehensive and include several native-like structures
- Scoring function scores must correlate well with the RMSD of conformations (the better the score/energy, the lower the RMSD)

Overview of CASP experiment
- Three categories: comparative/homology modelling, fold recognition/threading, and ab initio prediction
- The goal is to assess structure prediction methods in a blind and rigorous manner; blind prediction is necessary for accurate assessment of methods
- Ask modellers to build models of structures while they are in the process of being solved experimentally
- After the prediction season is over, compare predicted models to the experimental structures
- Discuss what went right, what went wrong, and why
- Compare progress from CASP1 to CASP4
- Results published in special issues of Proteins: Structure, Function, Genetics 1995, 1997, 1999, 2002

Comparative modelling at CASP - methods
- Alignment: PSI-BLAST, FASTA, CLUSTALW; multiple sequence alignments carefully hand-edited using secondary structure information
- More successful side chain prediction methods include:
  - backbone-dependent rotamer libraries (Bower & Dunbrack)
  - segment matching followed by energy minimisation (Levitt)
  - self-consistent mean field optimisation (Bates et al)
  - graph-theory + knowledge-based functions (Samudrala et al)
- More successful loop building methods include:
  - satisfaction of spatial restraints (Sali)
  - internal coordinate mechanics energy optimisation (Abagyan et al)
  - graph-theory + knowledge-based functions (Samudrala et al)
- Overall model building: there is no substitute for careful hand-constructed models (Sternberg et al, Venclovas)

A graph theoretic representation of protein structure
Steps: represent residues as nodes, weigh nodes, construct graph, find cliques
[Figure: example graph with node weights -0.6 (V1), -0.9 (V2), -1.0 (F), -0.7 (K), -0.5 (I); edge weights between -0.1 and -0.4; clique with total weight W = -4.5]
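The clique-finding step can be sketched with a brute-force search over a very small graph. The node weights are those shown on the slide. The transcript does not say which nodes each edge connects, so the edge list below is a hypothetical assignment and the result will not reproduce the slide's W = -4.5. V1 and V2 are treated as mutually exclusive alternatives with no edge between them, which is also an assumption.

```python
from itertools import combinations

# Node weights as shown on the slide.
node_weight = {"V1": -0.6, "V2": -0.9, "F": -1.0, "K": -0.7, "I": -0.5}

# Hypothetical edge assignment (the slide's connectivity was not recoverable).
edge_weight = {
    frozenset(pair): weight
    for pair, weight in [
        (("F", "K"), -0.1), (("F", "I"), -0.2), (("K", "I"), -0.1),
        (("V1", "F"), -0.2), (("V1", "K"), -0.1), (("V1", "I"), -0.1),
        (("V2", "F"), -0.4), (("V2", "K"), -0.2), (("V2", "I"), -0.3),
    ]
}


def is_clique(nodes):
    """True if every pair of nodes is joined by an edge."""
    return all(frozenset(pair) in edge_weight for pair in combinations(nodes, 2))


def clique_weight(nodes):
    """Sum of node weights plus the weights of all edges inside the clique."""
    total = sum(node_weight[n] for n in nodes)
    total += sum(edge_weight[frozenset(pair)] for pair in combinations(nodes, 2))
    return total


def best_clique():
    """Exhaustively find the clique with the lowest (most favourable) weight."""
    best, best_w = (), 0.0
    for size in range(1, len(node_weight) + 1):
        for nodes in combinations(sorted(node_weight), size):
            if is_clique(nodes):
                w = clique_weight(nodes)
                if w < best_w:
                    best, best_w = nodes, w
    return set(best), best_w


clique, weight = best_clique()
```

Exhaustive enumeration is only workable for a handful of nodes; larger graphs need a dedicated clique-finding algorithm.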

Historical perspective on comparative modelling

          alignment   side chain   short loops   longer loops
  BC      excellent   ~80%         1.0 Å         2.0 Å
  CASP1   poor        ~50%         ~3.0 Å        >5.0 Å

Prediction for CASP4 target T128/sodm: Cα RMSD of 1.0 Å for 198 residues (PID 50%)

Prediction for CASP4 target T111/eno: Cα RMSD of 1.7 Å for 430 residues (PID 51%)

Prediction for CASP4 target T122/trpa: Cα RMSD of 2.9 Å for 241 residues (PID 33%)

Prediction for CASP4 target T125/sp18: Cα RMSD of 4.4 Å for 137 residues (PID 24%)

Prediction for CASP4 target T112/dhso: Cα RMSD of 4.9 Å for 348 residues (PID 24%)

Prediction for CASP4 target T92/yeco: Cα RMSD of 5.6 Å for 104 residues (PID 12%)

Comparative modelling at CASP - conclusions
CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity
  **T128/sodm – 1.0 Å (198 residues; 50%)
  **T111/eno – 1.7 Å (430 residues; 51%)
  **T122/trpa – 2.9 Å (241 residues; 33%)
  **T125/sp18 – 4.4 Å (137 residues; 24%)
  **T112/dhso – 4.9 Å (348 residues; 24%)
  **T92/yeco – 5.6 Å (104 residues; 12%)

          alignment   side chain   short loops   longer loops
  BC      excellent   ~80%         1.0 Å         2.0 Å
  CASP1   poor        ~50%         ~3.0 Å        >5.0 Å
  CASP2   fair        ~75%         ~1.0 Å        ~3.0 Å
  CASP3   fair        ~75%         ~1.0 Å        ~2.5 Å
  CASP4   fair        ~75%         ~1.0 Å        ~2.0 Å

Fold recognition at CASP - methods
- Visual inspection with sequence comparison (Murzin group)
- Procyon: potential of mean force based on pairwise interactions and global dynamic programming (Sippl group)
- Threader: potential of mean force and double dynamic programming (Jones group)
- Environmental 3D profiles (Eisenberg group)
- NCBI Threading Program using contact potentials and models of sequence-structure conservation (Bryant group)
- Hidden Markov Models (Karplus group)
- Combination of threading with ab initio approaches (Friesner group)
- Environment-specific substitution tables and structure-dependent gap penalties (Blundell group)

Fold recognition at CASP - conclusions
- Fold recognition has been one of the more successful approaches to predicting structure at all four CASPs
- At CASP2 and CASP4, one of the best methods was simple sequence searching with careful manual inspection (Murzin group)
- At CASP3 and CASP4, none of the threading targets could have been recognised by the best standard sequence comparison methods such as PSI-BLAST
- For the most difficult targets, the methods were able to predict roughly 60 residues to 6.0 Å Cα RMSD, approaching comparative modelling accuracies as the similarity between proteins increased

Ab initio prediction at CASP – methods
- Assembly of fragments with simulated annealing (Simons et al)
- Exhaustive sampling and pruning using knowledge-based scoring functions (Samudrala et al)
- Constraint-based Monte Carlo optimisation (Skolnick et al)
- Thermodynamic model for secondary structure prediction with manual docking of secondary structure elements and minimisation (Lomize et al)
- Minimisation of a physical potential energy function with a simplified representation (Scheraga et al, Osguthorpe et al)
- Neural networks to predict secondary structure (Jones, Rost)

Semi-exhaustive segment-based folding
Sequence: EFDVILKAAGANKVAVIKAVRGATGLGLKEAKDLVESAPAALKEGVSKDDAEALKKALEEAGAEVEVK
- Generate: fragments from database; 14-state φ,ψ model
- Minimise: Monte Carlo with simulated annealing; conformational space annealing, GA
- Filter: all-atom pairwise interactions, bad contacts; compactness, secondary structure

Historical perspective on ab initio prediction
- Before CASP (BC): “solved” (biased results)
- CASP1: worse than random
- CASP2: worse than random with one exception
- CASP3: consistently predicted correct topology, ~6.0 Å for 60+ residues
  *T56/dnab – 6.8 Å (60 residues; 67-126)
  **T59/smd3 – 6.8 Å (46 residues; 30-75)
  **T61/hdea – 7.4 Å (66 residues; 9-74)
  **T64/sinr – 4.8 Å (68 residues; 1-68)
  *T74/eps15 – 7.0 Å (60 residues; 154-213)
  **T75/ets1 – 7.7 Å (77 residues; 55-131)
- CASP4: ?

Prediction for CASP4 target T110/rbfa: Cα RMSD of 4.0 Å for 80 residues (1-80)

Prediction for CASP4 target T97/er29: Cα RMSD of 6.2 Å for 80 residues (18-97)

Prediction for CASP4 target T106/sfrp3: Cα RMSD of 6.2 Å for 70 residues (6-75)

Prediction for CASP4 target T98/sp0a: Cα RMSD of 6.0 Å for 60 residues (37-105)

Prediction for CASP4 target T126/omp: Cα RMSD of 6.5 Å for 60 residues (87-146)

Prediction for CASP4 target T114/afp1: Cα RMSD of 6.5 Å for 45 residues (36-80)

Postdiction for CASP4 target T102/as48: Cα RMSD of 5.3 Å for 70 residues (1-70)

Ab initio prediction at CASP - conclusions
- Before CASP (BC): “solved” (biased results)
- CASP1: worse than random
- CASP2: worse than random with one exception
- CASP3: consistently predicted correct topology, ~6.0 Å for 60+ residues
- CASP4: consistently predicted correct topology, ~4.0-6.0 Å for 60-80+ residues
  **T110/rbfa – 4.0 Å (80 residues; 1-80)
  *T114/afp1 – 6.5 Å (45 residues; 36-80)
  **T97/er29 – 6.0 Å (80 residues; 18-97)
  **T106/sfrp3 – 6.2 Å (70 residues; 6-75)
  *T98/sp0a – 6.0 Å (60 residues; 37-105)
  **T102/as48 – 5.3 Å (70 residues; 1-70)

Computational aspects of structural genomics
A. sequence space
B. comparative modelling
C. fold recognition
D. ab initio prediction
E. target selection (targets)
F. analysis
(Figure idea by Steve Brenner.)

Key points
- DNA/gene is the blueprint; proteins are the functional representatives of genes
- Protein structure can be used to understand protein function
- Large numbers of genes are being sequenced; we need structures
- Protein folding (from primary sequence to tertiary structure) is a fast self-organising process in which a disordered, non-functional chain of amino acids becomes a stable, compact, and functional molecule
- The free energy difference between the folded and unfolded states is not very high
- Experimental methods to determine protein structures include X-ray crystallography and NMR spectroscopy
- Theoretical methods to predict protein structures include comparative/homology modelling, fold recognition/threading, and ab initio prediction
- For ab initio prediction, you need a method that samples the conformational space adequately (to find native-like conformations) and a function that can identify them
- The CASP experiment shows limited progress in protein structure prediction

Acknowledgements
- Michael Levitt, Stanford University
- John Moult, CARB
- Patrice Koehl, Stanford University
- Yu Xia, Stanford University
- Levitt and Moult groups
