Protein Structure Prediction Ram Samudrala University of Washington.

Presentation transcript:

Protein Structure Prediction Ram Samudrala University of Washington

Rationale for understanding protein structure and function
Protein sequence
- large numbers of sequences, including whole genomes
Protein structure
- three dimensional
- complicated
- mediates function
Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution
Diagram labels on the links between sequence, structure, and function: ?, structure determination, structure prediction, homology, rational mutagenesis, biochemical analysis, model studies

Protein folding
DNA (written as codons): …-CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU-…
Protein sequence: …-L-K-E-G-V-S-K-D-… (each codon specifies one amino acid)
Unfolded protein: not unique, mobile, inactive, expanded, irregular
Spontaneous self-organisation (~1 second)
Native state

Protein folding
DNA (written as codons): …-CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU-…
Protein sequence: …-L-K-E-G-V-S-K-D-… (each codon specifies one amino acid)
Unfolded protein: not unique, mobile, inactive, expanded, irregular
Spontaneous self-organisation (~1 second)
Native state: unique shape, precisely ordered, stable/functional, globular/compact, helices and sheets

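Protein folding landscape
Large multi-dimensional space of changing conformations
[Figure: free energy along the folding reaction, from the unfolded state through the molten globule (10^-8 s) to the native state (10^-3 s); free energy differences (ΔG) and the barrier height are marked.]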

Protein primary structure
Twenty types of amino acids
[Figure: a single amino acid, with an amino group, a central Cα carrying H and the side chain R, and a carboxyl group.]
Two amino acids join by forming a peptide bond
[Figure: a polypeptide backbone of repeating N-Cα-C(=O) units, each Cα carrying a side chain R.]
Each residue in the amino acid main chain has two degrees of freedom (φ and ψ)
The amino acid side chains can have up to four degrees of freedom (χ1-4)

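Protein secondary structure
[Figure: Ramachandran plot of the φ and ψ angles, with axes crossing at 0 and the α, β, and L regions marked.]
Many φ, ψ combinations are not possible
[Figures: α helix; β sheet (anti-parallel), with N and C termini marked; β sheet (parallel).]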

Protein tertiary and quaternary structures
[Figures: Ribonuclease inhibitor (2bnh); Haemoglobin (1hbh); Hemagglutinin (1hgd).]

Methods for determining protein structure
Protein sequence
- large numbers of sequences, including whole genomes
Protein structure
- three dimensional
- complicated
- mediates function
Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution
Diagram labels on the links between sequence, structure, and function: ?, X-ray crystallography, NMR spectroscopy, homology, rational mutagenesis, biochemical analysis, model studies

X-ray crystallography - concept
X-rays interact with electrons in protein molecules arranged in a crystal to produce diffraction patterns
The diffraction patterns of the X-rays can be used to determine the three-dimensional structure of proteins
Provides a “static” picture

X-ray crystallography - details
Prepare protein crystals where the proteins are organised in a precise crystal lattice
Shine X-rays on the crystals, which diffract off the electrons of the atoms in the crystals; the intensities of the individual reflections are measured
Phases are usually obtained indirectly by isomorphous replacement, from the way one or a few heavy atoms incorporated into the same isomorphous crystal lattice affect the diffraction pattern
Intensities and phases of all reflections are combined in a Fourier transform to provide maps of electron density
Interpret the map by fitting the polypeptide chain to the contours
Refine the model by minimising the difference between the observed amplitudes and the calculated amplitudes

NMR spectroscopy - concept
The magnetic-spin properties of atomic nuclei within a molecule are used to obtain a list of distance constraints between atoms in the molecule, from which a three-dimensional structure of the protein molecule can be obtained
Provides a “dynamic” picture
[Figures: NK-lysin (1nkl); S1 RNA binding domain (1sro).]

NMR spectroscopy - details
Protein molecules placed in a strong magnetic field have the nuclear spins of their hydrogen atoms aligned with the field; the alignment can be excited by applying radio frequency (RF) pulses
It is possible to obtain a unique signal (chemical shift) for each hydrogen atom in a protein molecule
Structural information arises primarily from the Nuclear Overhauser Effect (NOE), which gives information about distances between atoms in a molecule
A pair of protons gives a detectable NOE cross-peak if they are within 5.0 Å of each other in space
After obtaining NOE data for protons throughout the structure, a number of independent structures can be generated that are consistent with the distance constraints
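The 5.0 Å NOE rule can be illustrated with a minimal Python sketch. The proton names and coordinates below are hypothetical, chosen only to show which pairs would fall inside the cutoff; real NOE analysis works from measured cross-peak intensities, not from known coordinates.

```python
import math
from itertools import combinations

NOE_CUTOFF = 5.0  # Å, the detection limit quoted on the slide


def noe_pairs(protons, cutoff=NOE_CUTOFF):
    """Return name pairs of protons that lie within `cutoff` Å of each other."""
    pairs = []
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(sorted(protons.items()), 2):
        if math.dist(xyz_a, xyz_b) <= cutoff:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical proton coordinates in Å (not taken from any real structure).
protons = {
    "HA_1": (0.0, 0.0, 0.0),
    "HN_2": (3.0, 0.0, 0.0),
    "HB_9": (0.0, 4.5, 0.0),
    "HN_20": (12.0, 0.0, 0.0),
}

for pair in noe_pairs(protons):
    print("expected NOE cross-peak:", pair)
```

Each pair returned here corresponds to one distance constraint of the kind used to generate the family of NMR structures.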

Computer representation of protein structure
Structures are stored in the Protein Data Bank (PDB), a repository of mostly experimental models based on X-ray crystallographic and NMR studies
Atoms are defined by their Cartesian coordinates (coordinate columns not shown):
ATOM 1 N GLU
ATOM 2 CA GLU
ATOM 3 C GLU
ATOM 4 O GLU
ATOM 5 CB GLU
ATOM 6 CG GLU
ATOM 7 CD GLU
ATOM 8 OE1 GLU
ATOM 9 OE2 GLU
ATOM 10 N PHE
ATOM 11 CA PHE
These structures provide the basis for most theoretical work in protein folding and protein structure prediction
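A PDB ATOM record stores each field in fixed columns, with x, y, z in columns 31-54. The sketch below writes and reads back a simplified record to show the layout. The helper names `format_atom_line` and `parse_atom_line` are illustrative, the chain is assumed to be "A", and the coordinates are made up; occupancy, temperature factor, and element columns are left out.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    x: float
    y: float
    z: float


def format_atom_line(serial, name, res_name, res_seq, x, y, z):
    """Build a simplified ATOM record using the standard PDB column positions."""
    return (f"ATOM  {serial:5d}  {name:<3s} {res_name:>3s} A{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom_line(line):
    """Read serial, atom name, residue name, and coordinates by column slice."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


# Made-up coordinates for the first two atoms of a GLU residue.
lines = [
    format_atom_line(1, "N", "GLU", 1, 11.104, 13.207, 2.100),
    format_atom_line(2, "CA", "GLU", 1, 12.560, 13.300, 2.450),
]
for line in lines:
    print(line)
    print(parse_atom_line(line))
```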

Comparison of protein structures
Need ways to determine if two protein structures are related and to compare predicted models to experimental structures
A commonly used measure is the root mean square deviation (RMSD) of the Cartesian coordinates of the atoms in two structures after optimal superposition (McLachlan, 1979): $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert \mathbf{r}_i^{A} - \mathbf{r}_i^{B} \rVert^{2}}$, where $N$ is the number of atom pairs compared
Usually use Cα atoms
[Figure: NK-lysin (1nkl), Bacteriocin T102/as48 (1e68), and the T102 best model, with RMSD labels of 3.6 Å and 2.9 Å.]
Other measures include contact maps and torsion angle RMSDs
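As a minimal sketch of the RMSD measure, the function below takes the square root of the mean squared distance between paired atoms. It assumes the two coordinate sets have already been optimally superposed; the superposition step itself (for example by a least-squares rotation fit) is not shown, and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD in Å between two equal-length lists of (x, y, z) points.

    Assumes the two structures are already superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Made-up Cα traces: the model is shifted by 1 Å along y.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(f"Cα RMSD = {rmsd(model, native):.2f} Å")
```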

Methods for predicting protein structure
Protein sequence
- large numbers of sequences, including whole genomes
Protein structure
- three dimensional
- complicated
- mediates function
Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution
Diagram labels on the links between sequence, structure, and function: ?, comparative modelling, fold recognition, ab initio prediction, homology, rational mutagenesis, biochemical analysis, model studies

Comparative modelling of protein structure
Proteins that have similar sequences (i.e., related by evolution) have similar three-dimensional structures
A model of a protein whose structure is not known can be constructed if the structure of a related protein has been determined by experimental methods
Similarity must be obvious and significant for good models to be built
Need ways to build regions that are not similar between the two related proteins
Need ways to move the model closer to the native structure

Comparative modelling of protein structure
KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   **
……
Steps: scan, align, build initial model, construct non-conserved side chains and main chains, refine
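The starred columns in the alignment are the identical positions. A short sketch of how the identity between the two aligned sequences could be counted; the convention of dividing by the number of non-gap columns is an assumption, since other denominators are also in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (matches, aligned columns, percent identity) for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    aligned = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        aligned += 1
        if a == b:
            matches += 1
    if aligned == 0:
        raise ValueError("no aligned columns")
    return matches, aligned, 100.0 * matches / aligned


target = "KDHPFGFAVPTKNPDGTMNLMNWECAIP"
template = "KDPPAGIGAPQDN----QNIMLWNAVIP"
matches, aligned, pid = percent_identity(target, template)
print(f"{matches} identical of {aligned} aligned columns = {pid:.1f}% identity")
```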

Fold recognition
The number of possible protein structures/folds is limited (large number of sequences but few folds)
Proteins that do not have similar sequences sometimes have similar three-dimensional structures
A sequence whose structure is not known is fitted directly (or “threaded”) onto a known structure and the “goodness of fit” is evaluated using a discriminatory function
Need ways to move the model closer to the native structure
[Figure: NK-lysin (1nkl) and Bacteriocin T102/as48 (1e68), 3.6 Å RMSD at 5% sequence identity.]

Fold recognition
KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   **
……
Steps: evaluate fit, build initial model, construct non-conserved side chains and main chains, refine

Ab initio prediction of protein structure – concept
Go from sequence to structure by sampling the conformational space in a reasonable manner and selecting a native-like conformation using a good discrimination function
Problems: conformational space is astronomical, and it is hard to design functions that are not fooled by non-native conformations (or “decoys”)

Ab initio prediction of protein structure
Sample conformational space such that native-like conformations are found
- astronomically large number of conformations: 5 states per residue over 100 residues = 5^100 ≈ 10^70
Select
- hard to design functions that are not fooled by non-native conformations (“decoys”)

Sampling conformational space – continuous approaches
Most work in the field:
- Molecular dynamics
- Continuous energy minimisation (follow a valley)
- Monte Carlo simulation
- Genetic algorithms
Like the real polypeptide folding process
Cannot be sure if native-like conformations are sampled
[Figure: energy plot.]

Molecular dynamics
Force $F = -dU/dx$ (slope of the potential $U$); acceleration follows from $m\,a(t) = F$
All atoms are moving, so forces between atoms are complicated functions of time
Analytical solution for $x(t)$ and $v(t)$ is impossible; numerical solution is trivial
Atoms move for very short time steps (expressed in seconds or picoseconds, ps)
$x(t+\Delta t) = x(t) + v(t)\,\Delta t + \left[4a(t) - a(t-\Delta t)\right]\Delta t^{2}/6$
$v(t+\Delta t) = v(t) + \left[2a(t+\Delta t) + 5a(t) - a(t-\Delta t)\right]\Delta t/6$
The first equation gives the new position from the old position, the old velocity, and the accelerations; the second gives the new velocity from the old velocity and the accelerations
$U_{\mathrm{kinetic}} = \tfrac{1}{2}\sum_i m_i v_i(t)^{2} = \tfrac{1}{2}\, n\, k_B T$, where $n$ is the number of coordinates (not atoms)
Total energy ($U_{\mathrm{potential}} + U_{\mathrm{kinetic}}$) must not change with time
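The position and velocity updates on this slide have the form of Beeman's integration scheme. A minimal sketch for a single particle on a one-dimensional harmonic spring, assuming unit mass and force constant and taking a(t-Δt) equal to a(t) for the very first step (both are choices of this example, not from the slides):

```python
import math


def beeman(x, v, accel, dt, n_steps):
    """Integrate one coordinate with the position/velocity updates from the slide."""
    a = accel(x)
    a_prev = a  # assumption: no earlier step exists, so reuse a(0) for a(-dt)
    for _ in range(n_steps):
        x_new = x + v * dt + (4.0 * a - a_prev) * dt * dt / 6.0
        a_new = accel(x_new)
        v = v + (2.0 * a_new + 5.0 * a - a_prev) * dt / 6.0
        x, a_prev, a = x_new, a, a_new
    return x, v


def total_energy(x, v, k=1.0, m=1.0):
    """Potential plus kinetic energy of the toy spring."""
    return 0.5 * k * x * x + 0.5 * m * v * v


# Toy system: U = x^2 / 2, so a = -x; the exact solution is x(t) = cos(t).
x_end, v_end = beeman(1.0, 0.0, lambda x: -x, dt=0.01, n_steps=1000)
print(f"x(10) = {x_end:.4f}, exact = {math.cos(10.0):.4f}")
print(f"energy drift = {total_energy(x_end, v_end) - total_energy(1.0, 0.0):.2e}")
```

Checking that the total energy stays constant, as the slide requires, is a common sanity test for the time step.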

Energy minimisation
For a given protein, the energy depends on thousands of x, y, z Cartesian atomic coordinates; reaching a deep minimum is not trivial
With convergence, we have an accurate equilibrium conformation and a well-defined energy value
[Figure: energy against number of steps, from the starting conformation to a deep minimum, for steepest descent and conjugate gradient.]
[Figure: energy and RMSD against number of steps, with “give up” and “converge” outcomes marked.]
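Steepest descent, one of the two minimisers named in the figure, can be sketched on a toy two-coordinate energy surface. The quadratic energy, the fixed step size, and the convergence tolerance are all assumptions made for illustration; a protein force field would have thousands of coordinates and a far rougher surface.

```python
import math


def steepest_descent(gradient, start, step=0.04, tol=1e-8, max_steps=10_000):
    """Follow the negative gradient until its norm drops below `tol`.

    Returns (point, steps taken, converged flag).
    """
    point = list(start)
    for n in range(max_steps):
        g = gradient(point)
        if math.sqrt(sum(c * c for c in g)) < tol:
            return point, n, True  # converged
        point = [p - step * c for p, c in zip(point, g)]
    return point, max_steps, False  # gave up


def toy_energy(p):
    # Hypothetical energy surface with a single minimum at (1, -2).
    x, y = p
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2


def toy_gradient(p):
    x, y = p
    return [2.0 * (x - 1.0), 20.0 * (y + 2.0)]


minimum, steps, converged = steepest_descent(toy_gradient, [5.0, 5.0])
print(f"converged={converged} after {steps} steps at {minimum}")
print(f"energy at minimum = {toy_energy(minimum):.2e}")
```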

Monte Carlo simulation
Discrete moves in torsion or Cartesian conformational space
Evaluate energy after every move and compare to previous energy (ΔE)
Accept conformation based on Boltzmann probability: $P = \exp(-\Delta E / k_B T)$
Many variations, including simulated annealing (starting with a high temperature so more moves are accepted initially and then cooling)
If run for infinite time, simulation will produce a Boltzmann distribution
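A minimal sketch of the Monte Carlo accept/reject step, using the standard Metropolis form of the Boltzmann criterion: downhill moves are always accepted and uphill moves are accepted with probability exp(-ΔE/kT). The one-torsion energy function, move size, and temperature are hypothetical.

```python
import math
import random


def acceptance_probability(delta_e, kt):
    """Metropolis rule: always accept downhill, accept uphill with exp(-dE/kT)."""
    if delta_e <= 0.0:
        return 1.0
    return math.exp(-delta_e / kt)


def toy_energy(angle):
    # Hypothetical one-torsion energy with its minimum at 0 degrees.
    return 1.0 - math.cos(math.radians(angle))


def monte_carlo(start, kt, n_steps, rng, max_move=30.0):
    """Random torsion moves with Metropolis acceptance at fixed temperature."""
    angle, energy = start, toy_energy(start)
    accepted = 0
    for _ in range(n_steps):
        trial = angle + rng.uniform(-max_move, max_move)
        trial_energy = toy_energy(trial)
        if rng.random() < acceptance_probability(trial_energy - energy, kt):
            angle, energy = trial, trial_energy
            accepted += 1
    return angle, energy, accepted


rng = random.Random(1)
final_angle, final_energy, n_accepted = monte_carlo(150.0, kt=0.1, n_steps=2000, rng=rng)
print(f"final energy {final_energy:.3f}, accepted {n_accepted} of 2000 moves")
```

Simulated annealing would wrap this loop in a schedule that lowers `kt` over time.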

Genetic algorithms
Generate an initial pool of conformations
Perform crossover and mutation operations on this set to generate a much larger pool of conformations
Select a subset of the fittest conformations from this large pool
Repeat the above two steps until convergence
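The generate, recombine, select loop can be sketched as follows. Conformations are represented as lists of torsion angles, the energy function is a made-up stand-in for a real scoring function, and a fixed number of generations replaces a convergence test; all of these are assumptions of the example.

```python
import math
import random


def toy_energy(torsions):
    # Hypothetical energy: zero when every torsion angle is -60 degrees.
    return sum(1.0 - math.cos(math.radians(t + 60.0)) for t in torsions)


def crossover(parent_a, parent_b, rng):
    """Join the start of one parent to the end of the other at a random cut."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(conformation, rng, rate=0.1):
    """Replace each torsion with a random angle with probability `rate`."""
    return [rng.uniform(-180.0, 180.0) if rng.random() < rate else t
            for t in conformation]


def genetic_search(n_torsions=8, pool_size=20, n_children=100, generations=40, seed=0):
    rng = random.Random(seed)
    # Step 1: initial pool of random conformations.
    pool = [[rng.uniform(-180.0, 180.0) for _ in range(n_torsions)]
            for _ in range(pool_size)]
    pool.sort(key=toy_energy)
    best_per_generation = [toy_energy(pool[0])]
    for _ in range(generations):
        # Step 2: crossover and mutation produce a much larger pool.
        children = []
        for _ in range(n_children):
            parent_a, parent_b = rng.sample(pool, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        # Step 3: keep the fittest subset (parents stay eligible).
        pool = sorted(pool + children, key=toy_energy)[:pool_size]
        best_per_generation.append(toy_energy(pool[0]))
    return pool[0], best_per_generation


best, history = genetic_search()
print(f"best energy: {history[0]:.3f} -> {history[-1]:.3f}")
```

Because parents compete with their children in the selection step, the best energy in the pool can never get worse from one generation to the next.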

Sampling conformational space – exhaustive approaches
Enumerate all possible conformations
- view entire space (perfect partition function)
- computationally intractable: 5 states per residue over 100 residues = 5^100 ≈ 10^70 possible conformations
Select
Must use discrete state models to minimise the number of conformations explored
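The size of the search space for a discrete-state model is simply the number of states per residue raised to the number of residues. A quick sketch of the arithmetic; the evaluation rate of 10^12 conformations per second is a hypothetical figure used only to show the scale.

```python
def count_conformations(states_per_residue, n_residues):
    """Number of chain conformations when each residue has a fixed set of states."""
    return states_per_residue ** n_residues


total = count_conformations(5, 100)
print(f"5^100 has {len(str(total))} digits, about {total:.1e} conformations")

# Hypothetical evaluation rate, chosen only to illustrate the scale.
RATE_PER_SECOND = 10 ** 12
SECONDS_PER_YEAR = 365 * 24 * 3600
years = total // (RATE_PER_SECOND * SECONDS_PER_YEAR)
print(f"full enumeration would take more than 10^{len(str(years)) - 1} years")
```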

Scoring/energy functions
Need a way to select native-like conformations from non-native ones
Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms
Knowledge-based scoring functions: derive information about atomic properties from a database of experimentally determined conformations; common parameters include pairwise atomic distances and amino acid burial/exposure
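One common way to turn database statistics into a knowledge-based score is an inverse-Boltzmann style log ratio: distances seen more often in experimental structures than in a reference distribution get a favourable (negative) score. The sketch below is a generic illustration with made-up counts and an assumed pseudocount; it is not the specific function used by any group named in these slides.

```python
import math


def pair_potential(observed, reference, pseudocount=1.0):
    """Per-bin scores -ln(p_observed / p_reference) from distance-bin counts."""
    if len(observed) != len(reference):
        raise ValueError("count lists must have the same number of bins")
    n_bins = len(observed)
    obs_total = sum(observed) + pseudocount * n_bins
    ref_total = sum(reference) + pseudocount * n_bins
    scores = []
    for obs, ref in zip(observed, reference):
        p_obs = (obs + pseudocount) / obs_total
        p_ref = (ref + pseudocount) / ref_total
        scores.append(-math.log(p_obs / p_ref))
    return scores


# Made-up counts for one atom-type pair in three distance bins.
observed = [5, 60, 35]    # seen in experimental structures
reference = [30, 40, 30]  # expected from a reference distribution
scores = pair_potential(observed, reference)
for label, score in zip(["short", "medium", "long"], scores):
    print(f"{label:>6s} distance bin: {score:+.2f}")
```

A conformation's total score would then be the sum of these per-bin values over all of its atom pairs.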

Requirements for sampling methods and scoring functions
Sampling methods must produce good decoy sets that are comprehensive and include several native-like structures
Scoring function scores must correlate well with RMSD of conformations (the better the score/energy, the lower the RMSD)
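The score-versus-RMSD requirement can be checked with a correlation coefficient over a decoy set. A minimal sketch using the Pearson coefficient; the decoy values are made up, and lower scores are assumed to mean better conformations, so a good function gives a strongly positive correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up decoy set: (RMSD to native in Å, score); lower score is better.
decoys = [(1.5, -42.0), (3.0, -35.5), (4.2, -30.1), (6.8, -18.7), (9.5, -11.0)]
r = pearson([rmsd for rmsd, _ in decoys], [score for _, score in decoys])
print(f"score/RMSD correlation: {r:.2f}")
```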

Overview of CASP experiment
Three categories: comparative/homology modelling, fold recognition/threading, and ab initio prediction
Goal is to assess structure prediction methods in a blind and rigorous manner; blind prediction is necessary for accurate assessment of methods
Ask modellers to build models of structures as they are in the process of being solved experimentally
After prediction season is over, compare predicted models to the experimental structures
Discuss what went right, what went wrong, and why
Compare progress from CASP1 to CASP4
Results published in special issues of Proteins: Structure, Function, Genetics 1995, 1997, 1999, 2002

Comparative modelling at CASP - methods
Alignment: PSI-BLAST, FASTA, CLUSTALW; multiple sequence alignments carefully hand-edited using secondary structure information
More successful side chain prediction methods include:
- backbone-dependent rotamer libraries (Bower & Dunbrack)
- segment matching followed by energy minimisation (Levitt)
- self-consistent mean field optimisation (Bates et al)
- graph-theory + knowledge-based functions (Samudrala et al)
More successful loop building methods include:
- satisfaction of spatial restraints (Sali)
- internal coordinate mechanics energy optimisation (Abagyan et al)
- graph-theory + knowledge-based functions (Samudrala et al)
Overall model building: there is no substitute for careful hand-constructed models (Sternberg et al, Venclovas)

A graph theoretic representation of protein structure
Represent residues as nodes
Weigh nodes: -0.6 (V1), -1.0 (F), -0.7 (K), -0.5 (I), -0.9 (V2)
Construct graph (an edge label of -0.1 is shown)
Find cliques: -0.5 (I), -0.9 (V2), -1.0 (F), -0.7 (K); W = -4.5
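A small sketch of the clique search, using the node weights from the slide. The edge set is an assumption: V1 and V2 are treated as alternative conformations of the same residue that cannot coexist, and every other pair is taken as compatible. Edge weights are left out, so the total differs from the W = -4.5 shown on the slide, and the brute-force enumeration is only practical for tiny graphs.

```python
from itertools import combinations

# Node weights from the slide.
node_weights = {"V1": -0.6, "V2": -0.9, "F": -1.0, "K": -0.7, "I": -0.5}

# Hypothetical edges: all pairs are compatible except V1-V2.
edges = {frozenset(pair) for pair in combinations(node_weights, 2)}
edges.discard(frozenset(("V1", "V2")))


def is_clique(nodes):
    """True if every pair of nodes in the set is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(nodes, 2))


def best_clique():
    """Brute-force search for the clique with the lowest summed node weight."""
    best_nodes, best_weight = (), 0.0
    for size in range(1, len(node_weights) + 1):
        for nodes in combinations(node_weights, size):
            if is_clique(nodes):
                weight = sum(node_weights[n] for n in nodes)
                if weight < best_weight:
                    best_nodes, best_weight = nodes, weight
    return set(best_nodes), best_weight


clique, weight = best_clique()
print(f"best clique {sorted(clique)} with node weight sum {weight:.1f}")
```

With these assumptions the search picks V2 over V1, giving the same four nodes (I, V2, F, K) that the slide shows in its clique.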

Historical perspective on comparative modelling
        alignment   side chain   short loops   longer loops
BC      excellent   ~80%         1.0 Å         2.0 Å

Historical perspective on comparative modelling
        alignment   side chain   short loops   longer loops
BC      excellent   ~80%         1.0 Å         2.0 Å
CASP1   poor        ~50%         ~3.0 Å        >5.0 Å

Prediction for CASP4 target T128/sodm: Cα RMSD of 1.0 Å for 198 residues (PID 50%)

Prediction for CASP4 target T111/eno: Cα RMSD of 1.7 Å for 430 residues (PID 51%)

Prediction for CASP4 target T122/trpa: Cα RMSD of 2.9 Å for 241 residues (PID 33%)

Prediction for CASP4 target T125/sp18: Cα RMSD of 4.4 Å for 137 residues (PID 24%)

Prediction for CASP4 target T112/dhso: Cα RMSD of 4.9 Å for 348 residues (PID 24%)

Prediction for CASP4 target T92/yeco: Cα RMSD of 5.6 Å for 104 residues (PID 12%)

Comparative modelling at CASP - conclusions
        alignment   side chain   short loops   longer loops
BC      excellent   ~80%         1.0 Å         2.0 Å
CASP1   poor        ~50%         ~3.0 Å        >5.0 Å
CASP2   fair        ~75%         ~1.0 Å        ~3.0 Å
CASP3   fair        ~75%         ~1.0 Å        ~2.5 Å
CASP4   fair        ~75%         ~1.0 Å        ~2.0 Å
CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity
**T112/dhso – 4.9 Å (348 residues; 24%)
**T92/yeco – 5.6 Å (104 residues; 12%)
**T128/sodm – 1.0 Å (198 residues; 50%)
**T125/sp18 – 4.4 Å (137 residues; 24%)
**T111/eno – 1.7 Å (430 residues; 51%)
**T122/trpa – 2.9 Å (241 residues; 33%)

Fold recognition at CASP - methods
Visual inspection with sequence comparison (Murzin group)
Procyon - potential of mean force based on pairwise interactions and global dynamic programming (Sippl group)
Threader - potential of mean force and double dynamic programming (Jones group)
Environmental 3D Profiles (Eisenberg group)
NCBI Threading Program using contact potentials and models of sequence-structure conservation (Bryant group)
Hidden Markov Models (Karplus group)
Combination of threading with ab initio approaches (Friesner group)
Environment-specific substitution tables and structure-dependent gap penalties (Blundell group)

Fold recognition at CASP - conclusions
Fold recognition is one of the more successful approaches at predicting structure at all four CASPs
At CASP2 and CASP4, one of the best methods was simple sequence searching with careful manual inspection (Murzin group)
At CASP3 and CASP4, none of the threading targets could have been recognised by the best standard sequence comparison methods such as PSI-BLAST
For the most difficult targets, the methods were able to predict 60 residues to 6.0 Å Cα RMSD, approaching comparative modelling accuracies as the similarity between proteins increased

Ab initio prediction at CASP – methods
Assembly of fragments with simulated annealing (Simons et al)
Exhaustive sampling and pruning using knowledge-based scoring functions (Samudrala et al)
Constraint-based Monte Carlo optimisation (Skolnick et al)
Thermodynamic model for secondary structure prediction with manual docking of secondary structure elements and minimisation (Lomize et al)
Minimisation of a physical potential energy function with a simplified representation (Scheraga et al, Osguthorpe et al)
Neural networks to predict secondary structure (Jones, Rost)

Semi-exhaustive segment-based folding
EFDVILKAAGANKVAVIKAVRGATGLGLKEAKDLVESAPAALKEGVSKDDAEALKKALEEAGAEVEVK
Generate: fragments from database; 14-state φ, ψ model
Minimise: Monte Carlo with simulated annealing; conformational space annealing, GA
Filter: all-atom pairwise interactions, bad contacts; compactness, secondary structure

Historical perspective on ab initio prediction
Before CASP (BC): “solved” (biased results)
CASP1: worse than random
CASP2: worse than random with one exception
CASP3: consistently predicted correct topology - ~ 6.0 Å for 60+ residues
*T56/dnab – 6.8 Å (60 residues)
**T59/smd3 – 6.8 Å (46 residues; 30-75)
**T61/hdea – 7.4 Å (66 residues; 9-74)
**T64/sinr – 4.8 Å (68 residues; 1-68)
*T74/eps15 – 7.0 Å (60 residues)
**T75/ets1 – 7.7 Å (77 residues)
CASP4: ?

Prediction for CASP4 target T110/rbfa: Cα RMSD of 4.0 Å for 80 residues (1-80)

Prediction for CASP4 target T97/er29: Cα RMSD of 6.2 Å for 80 residues (18-97)

Prediction for CASP4 target T106/sfrp3: Cα RMSD of 6.2 Å for 70 residues (6-75)

Prediction for CASP4 target T98/sp0a: Cα RMSD of 6.0 Å for 60 residues (37-105)

Prediction for CASP4 target T126/omp: Cα RMSD of 6.5 Å for 60 residues (87-146)

Prediction for CASP4 target T114/afp1: Cα RMSD of 6.5 Å for 45 residues (36-80)

Postdiction for CASP4 target T102/as48: Cα RMSD of 5.3 Å for 70 residues (1-70)

Ab initio prediction at CASP - conclusions
Before CASP (BC): “solved” (biased results)
CASP1: worse than random
CASP2: worse than random with one exception
CASP3: consistently predicted correct topology - ~ 6.0 Å for 60+ residues
CASP4: consistently predicted correct topology - ~4-6.0 Å over the listed residues
**T110/rbfa – 4.0 Å (80 residues; 1-80)
*T114/afp1 – 6.5 Å (45 residues; 36-80)
**T97/er29 – 6.0 Å (80 residues; 18-97)
**T106/sfrp3 – 6.2 Å (70 residues; 6-75)
*T98/sp0a – 6.0 Å (60 residues)
**T102/as48 – 5.3 Å (70 residues; 1-70)

Computational aspects of structural genomics
A. sequence space
B. comparative modelling
C. fold recognition
D. ab initio prediction
E. target selection (targets)
F. analysis
(Figure idea by Steve Brenner.)

Key points
DNA/gene is the blueprint - proteins are the functional representatives of genes
Protein structure can be used to understand protein function
Large numbers of genes being sequenced - need structures
Protein folding (from primary sequence to tertiary structure) is a fast self-organising process where a disordered non-functional chain of amino acids becomes a stable, compact, and functional molecule
The free energy difference between the folded and unfolded states is not very high
Experimental methods to determine protein structures include X-ray crystallography and NMR spectroscopy
Theoretical methods to predict protein structures include comparative/homology modelling, fold recognition/threading, and ab initio prediction
For ab initio prediction, you need a method that samples the conformational space adequately (to find native-like conformations) and a function that can identify them
CASP experiment shows limited progress in protein structure prediction

Acknowledgements
Michael Levitt, Stanford University
John Moult, CARB
Patrice Koehl, Stanford University
Yu Xia, Stanford University
Levitt and Moult groups