Protein Structure Prediction. Graham Wood, Charlotte Deane
The problem - in brief: sequence (MVLSEGEWQL VLHVWAKVEA DVAGHGQDIL … AKYKELCYOG) + databases, algorithms and software = [Figure: 3D structure]
Why is protein structure prediction needed? Essential functioning of cells is mediated by proteins. It is protein structure that leads to protein function. 3D structure determination (by X-ray crystallography or NMR) is expensive, slow and difficult. Prediction also assists in the engineering of new proteins.
Terminology. Target: the unknown structure you are trying to model. Parent: a known structure which provides a basis for modelling.
The problem - more detail [Figure labels: target sequence EKGPDLYLIPLT; configuration space; energy; protein databases; Biologist; Physicist]
CASP: Critical Assessment of Structure Prediction. [Timeline figure: Jan-Apr, May, Jun, Jul, Aug, Sept, Oct, Nov, Dec] Participants: Biologists, Caspers, Organisers. Steps shown: call for structures; publish sequences on web; give sequences to organisers; structure determination; give structures to organisers; predict structure from sequence; expert assessment; 4-day meeting.
Degree of evolutionary conservation, from less conserved (information poor) to more conserved (information rich): DNA seq, Protein seq, Structure, Function. Example DNA seq: ACAGTTACAC CGGCTATGTA CTATACTTTG. Example protein seq: HDSFKLPVMS KFDWEMFKPC GKFLDSGKLG.
Three main approaches (in order of current success): 1. Comparative modelling 2. Fold recognition 3. De novo
Comparative modelling [Figure labels: target sequence EKGPDLYLIPLT; energy; close homologues; conserved backbone; variable backbone; side chains]
Comparative modelling (protein building): 1. Prepare the raw materials 2. Build the model (two methods) 3. Check the model 4. Accept or reject the model
C1: Preparing the raw materials. Given the target AA sequence (e.g. EKGPDLYLIPLT): identify parents (homologues); structurally align parents; align target to parents.
Structurally conserved regions (SCRs) and structurally variable regions (SVRs) [Figure labels: secondary structure region (SCR); loop region (SVR)]
C2: Building (choice of two methods). Assemble fragments: determine SCRs and build associated backbone; determine SVRs and build rest of backbone; attach and orient side-chains; refine model. Use spatial restraints: optimally satisfy spatial restraints; orient side-chains; refine model.
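
The "optimally satisfy spatial restraints" step can be illustrated with a toy example. This is a minimal sketch, assuming only harmonic distance restraints between points (think of Cα atoms) and plain gradient descent; real restraint-based modelling uses many more restraint types and a far better optimiser. The names `violation` and `satisfy_restraints`, the step size and the example distances are all illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple

Point = List[float]
Restraints = Dict[Tuple[int, int], float]  # (i, j) -> target distance


def violation(coords: List[Point], restraints: Restraints) -> float:
    """Sum of squared differences between actual and target distances."""
    return sum((math.dist(coords[i], coords[j]) - target) ** 2
               for (i, j), target in restraints.items())


def satisfy_restraints(coords: List[Point], restraints: Restraints,
                       step: float = 0.05, iterations: int = 2000) -> List[Point]:
    """Move points by gradient descent to reduce the restraint violation."""
    coords = [list(p) for p in coords]  # work on a copy
    for _ in range(iterations):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for (i, j), target in restraints.items():
            dist = math.dist(coords[i], coords[j])
            if dist == 0.0:
                continue  # direction undefined for coincident points
            factor = 2.0 * (dist - target) / dist
            for k in range(3):
                g = factor * (coords[i][k] - coords[j][k])
                grads[i][k] += g
                grads[j][k] -= g
        for point, grad in zip(coords, grads):
            for k in range(3):
                point[k] -= step * grad[k]
    return coords


# Toy input: three points with made-up target distances (illustrative only).
start = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
restraints = {(0, 1): 3.8, (1, 2): 3.8, (0, 2): 6.0}
model = satisfy_restraints(start, restraints)
print(violation(start, restraints), violation(model, restraints))
```

The violation of the returned coordinates should be far smaller than that of the starting coordinates; the restraints in this toy case are mutually consistent, so they can all be met at once.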
[Figure; residue labels: D T N V A Y C N K D]
C3: Test model (C4: then accept or reject). Examine the model in the light of all experimental data. Tools: PROCHECK, VERIFY3D, PROSA II, JOY, and visual inspection using 3D software.
Problems in comparative modelling: aligning the target to the parents; the packing of secondary structure elements in the core; the long insertions and deletions in the structurally variable regions.
Fold Recognition [Figure: target with unknown fold, marked "?"]
Fold recognition [Figure labels: target sequence EKGPDLYLIPLT; energy; structurally similar proteins]
Fold recognition (protein finding): 1. Obtain library of non-duplicate folds 2. Perform sequence-structure alignment 3. Assess success of alignment (Biologist: use substitution matrix; Physicist: use potentials) 4. Accept or reject the model
Sequence-structure alignment: 1. Construct sequence profile 2. Use profile to score the sequence [Figure labels: Target; Parent; BLASTP; OWL; MULTAL; dynamic programming algorithm; score]
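
Scoring a sequence against a profile by dynamic programming can be sketched as follows. This is a minimal sketch, assuming a global (Needleman-Wunsch style) alignment with a linear gap penalty; `align_to_profile`, the toy profile values, the gap penalty and the default mismatch score are illustrative assumptions, not the scoring scheme of any tool named on the slide.

```python
from typing import Dict, List

Profile = List[Dict[str, float]]  # one residue -> score table per parent position


def align_to_profile(target: str, profile: Profile,
                     gap: float = -2.0, default: float = -1.0) -> float:
    """Best global alignment score of a target sequence against a profile."""
    n, m = len(target), len(profile)
    # score[i][j]: best score aligning target[:i] with profile[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Residues missing from a profile column get the default score.
            match = score[i - 1][j - 1] + profile[j - 1].get(target[i - 1], default)
            gap_in_profile = score[i - 1][j] + gap
            gap_in_target = score[i][j - 1] + gap
            score[i][j] = max(match, gap_in_profile, gap_in_target)
    return score[n][m]


# Toy three-column profile (made-up scores).
toy_profile = [{"E": 2.0}, {"K": 2.0}, {"G": 2.0}]
print(align_to_profile("EKG", toy_profile))  # 6.0: three matches
print(align_to_profile("EG", toy_profile))   # 2.0: two matches and one gap
```

A traceback over the same table would recover the alignment itself; only the score is computed here to keep the sketch short.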
Amino acid substitutions are constrained by local environments. Different environments give different substitution patterns, captured in environment-specific substitution tables.
Definition of local environments: main-chain conformation and secondary structure (α-helix, β-strand, coil and positive φ); solvent accessibility (accessible and inaccessible); hydrogen bonds (side-chain to main-chain NH, side-chain to main-chain CO and side-chain to side-chain).
Substitution scores. Terms annotated on the slide: frequency of observing amino acid a in environment E replaced by b; probability that amino acid a in environment E is replaced by amino acid b; background probability of observing amino acid b (match occurring by chance); log-odds score scaled to the nearest integer.
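
The relationship between these terms can be sketched in code. This is a minimal sketch, assuming the usual log-odds form: the replacement probability is the observed frequency normalised over all b, and the score is round(scale × log2(P(b | a, E) / P(b))). The scale factor, the log base, the absence of pseudocounts and the name `substitution_scores` are assumptions, and the counts below are made up.

```python
import math
from typing import Dict


def substitution_scores(counts: Dict[str, float],
                        background: Dict[str, float],
                        scale: float = 2.0) -> Dict[str, int]:
    """Integer log-odds scores for one amino acid a in one environment E.

    counts[b]     : frequency of observing a (in E) replaced by b; must be > 0
    background[b] : probability of observing b by chance
    """
    total = sum(counts.values())
    scores = {}
    for b, freq in counts.items():
        p_replace = freq / total          # P(b | a, E)
        odds = p_replace / background[b]  # how much more often than chance
        scores[b] = round(scale * math.log2(odds))
    return scores


# Made-up counts for one (a, E) pair and a toy three-letter alphabet.
toy_counts = {"A": 60, "S": 30, "W": 10}
toy_background = {"A": 0.25, "S": 0.25, "W": 0.5}
print(substitution_scores(toy_counts, toy_background))
# {'A': 3, 'S': 1, 'W': -5}
```

Positive scores mark replacements seen more often than chance in that environment, negative scores those seen less often.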
Scoring with potentials: energy potential; solvation potential.
The Novel Fold Problem [Figure: sequence asdghklprtwecvmnasety with unknown structure, marked "?"]
De novo: new fold methods [Figure labels: target sequence EKGPDLYLIPLT; energy; segment configurations; sets of local configurations]
Defining a “New Fold”. CATH: somewhat objective. SCOP: no objective definition; tends towards evolutionary relationships (ask A. Murzin).
New fold approach. All structure information is in the AA sequence (Anfinsen, Science, 1973). Seek the “lowest free energy conformation”. The tactic is to simplify the problem, for example: a simplified model of the protein (one atom per residue); a simple or knowledge-based potential function. Assist in detecting distant homologues.
New fold recognition (structure discovery): 1. Set up domain and objective function 2. Perform optimisation 3. Check the model 4. Accept or reject the model
De Novo (biologist): ROSETTA (Baker et al.). Domain of objective function [Figure labels: sequence; 9 residues...; set of local structures consistent with local sequence]
De Novo (biologist): ROSETTA. Objective function to be maximised [Equation; term labels: constant; function of energy]
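
The objective can be written with Bayes' theorem, which is consistent with the P(structure|sequence) acceptance criterion used in the ROSETTA procedure. This is a sketch under that assumption, not a recovery of the slide's own notation. For a fixed target sequence the denominator P(sequence) does not depend on the structure, which would match the "constant" label; which term the "function of energy" label annotated is not clear from the text.

```latex
P(\text{structure} \mid \text{sequence})
  = \frac{P(\text{sequence} \mid \text{structure}) \, P(\text{structure})}
         {P(\text{sequence})}
```

Because the denominator is fixed, maximising the left-hand side over structures is the same as maximising the numerator.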
De Novo (biologist): ROSETTA. Maximising the probability of the sequence: 1. Choose each local conformation and start with a fully extended chain 2. Generate a neighbouring conformation 3. Accept in simulated annealing style, using P(structure|sequence) 4. Do this many times and cluster results; use centre of largest cluster as prediction
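
Steps 2 and 3 above follow the general simulated annealing pattern, which can be sketched on a toy problem. This is a minimal sketch, not ROSETTA itself: each chain position holds the index of one of a few candidate local conformations, a move swaps the choice at one random position, and the energy is a made-up stand-in for -log P(structure|sequence). The function `anneal`, the cooling schedule and `toy_energy` are all illustrative assumptions. Step 4 (clustering many runs) is not shown.

```python
import math
import random
from typing import Callable, List


def anneal(start: List[int], energy: Callable[[List[int]], float],
           n_states: int, steps: int = 5000,
           t_start: float = 2.0, t_end: float = 0.01,
           seed: int = 0) -> List[int]:
    """Return the lowest-energy state seen during one annealing run."""
    rng = random.Random(seed)
    current, e_current = list(start), energy(start)
    best, e_best = list(current), e_current
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        # Neighbouring conformation: change the choice at one position.
        candidate = list(current)
        candidate[rng.randrange(len(candidate))] = rng.randrange(n_states)
        delta = energy(candidate) - e_current
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = candidate, e_current + delta
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best


# Toy objective: count positions that differ from a hidden "native" choice.
hidden = [2, 0, 3, 1, 1, 2]


def toy_energy(state: List[int]) -> float:
    return float(sum(a != b for a, b in zip(state, hidden)))


prediction = anneal([0] * len(hidden), toy_energy, n_states=4)
print(prediction, toy_energy(prediction))
```

Starting from all zeros plays the role of the fully extended chain; in practice many such runs with different seeds would be clustered.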
De Novo (physicist): ASTROFOLD (Floudas et al.). 1. Predict α-helices and β-strands 2. Predict β-sheets and disulphide bridges using integer linear programming (ILP) 3. Use deterministic global optimisation, with energy function and constraints, to predict tertiary structure
Testing of prediction servers - LiveBench. Table columns: Server; Type; Sensitivity (Easy, Hard); Specificity (All, Hard); Added Value (Easy, Hard). Servers (type), in table order: Pcons2 (Consensus); ShotGun on 5 (Consensus); ShotGun on 3 (Consensus); Shotgun-INBGU (Threading); INBGU (Threading); Fugue3 (Threading); Fugue2 (Threading); Fugue1 (Threading); mGenTHREADER (Threading); GenTHREADER (Threading); 3D-PSSM (Threading); ORFeus (Sequence); FFAS (Sequence); Sam-T99 (Sequence); Superfamily (Sequence); ORF-BLAST (BLAST); PDB-BLAST (BLAST); BLAST 18
Review - comparative modelling [Figure labels: target sequence EKGPDLYLIPLT; energy; close homologues; conserved backbone; variable backbone; side chains]
Review - fold recognition [Figure labels: target sequence EKGPDLYLIPLT; energy; structurally similar proteins]
Review - new fold methods [Figure labels: target sequence EKGPDLYLIPLT; energy; segment configurations; sets of local configurations]
Summary: Prediction Methods. Comparative modelling: there exists a protein with clear homology (PSI-BLAST). Fold recognition: there exists a protein of similar fold (analogy) (DALI; CATH & SCOP). Novel fold methods: the sequence has a new fold. Better methods are still needed for it all to be useful!