Comparative Modelling Threading Baldomero Oliva Miguel UniversitatPompeuFabra
EVOLUTIONEVOLUTION
Diferences(proteins) Genetic information (DNA)
DNA secuence 3D
The Sanger Centre U.S. Genome Project ( 1990 ) Identify human genes Store the information in Data Banks Development of tools for analisis Venter, Patrinos & Collins DOEDOE NHGRINHGRINIHNIH
Increase of sequences GenBank (DNA) TrEMBL 92211Swiss Prot
Public Database Holdings:
3D known (20%) Homologs (30%) Remotes (10%) Unknown function (20%) NO homology (20%) sec.3D
Knowing the structure helps to: Knowing the structure helps to: TO KNOW TO KNOW the function: annotation IMPROVE IMPROVE the function: genetic engineering INHIBIT INHIBIT the function: drugs
Structure Factory express & purify cristallize X-ray analises structure
Objetive Enough information Not accurate NO information 3D
How does modelling help? Structural Genomics X-ray
Can we predict protein structures from genome sequences?
Study of known structures Assigning structure to sequences with unknown structure
secondary Tertiary Met Val Leu Tyr Asp Ser Thr sequence quaternary super-secondary domain Caracterize the conformation Ligand docking
TrioseFosfateIsomerase Carboxypeptidase Glucanase Ribonuclease Mioglobin
The number of different protein folds is limited: New Folds Known Folds
The number of different protein folds is limited: [ last update: Oct 2001 ]
Nº sequences > Nº folds Structura 3ª Classification
UCL, Janet Thornton & Christine Orengo Class (C), Architecture(A), Topology(T), Homologous superfamily (H) Protein Structure Classification CATH - Protein Structure Classification [ ] SCOP - Structural Classification of Proteins MRC Cambridge (UK), Alexey Murzin, Brenner S. E., Hubbard T., Chothia C. created by manual inspection comprehensive description of the structural and evolutionary relationships [ ]
Class(C) derived from secondary structure content is assigned automatically Architecture(A) describes the gross orientation of secondary structures, independent of connectivity. Topology(T) clusters structures according to their topological connections and numbers of secondary structures Homologous superfamily (H)
TSS toxin 8.8% Id Remote Cholera toxin 80% Id Homolog SCOP Family Superfamily Fold Class Enterotoxin Aminoacyl tRNA synthetase 4.4% Id Analog
MNIFEMLRID EGLRLKIYKD TEGYYTIGIG HLLTKSPSLN AAKSELDKAI GRNCNGVITK DEAEKLFNQD VDAAVRGILR NAKLKPVYDS LDAVRRCALI NMVFQMGETG VAGFTNSLRM LQQKRWDEAA VNLAKSRWYN QTPNRAKRVI TTFRTGTWDA YKNL Can we predict protein structures ?
Homology modeling Idea: Extrapolation of the structure for a new (target) sequence from the known 3D-structures of related family members (templates).
Alignments Nº sequences > Nº folds Step 1 Selection of templates Step 2 Alignment with templates
identity Number of residues aligned Percentage sequence identity/similarity (B.Rost, Columbia, NewYork) Sequence identity implies structural similarity Don’t know region..... Sequence similarity implies structural similarity?
Hidropaticity Plot 2D Structure 2D prediction Nº sequences > Nº folds
2D prediction (PHD)
Alignments Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs Hidropaticity Plot 2D Structure 2D prediction
Comparative Modelling (homology) Fold is more conserved than sequence. Secondary structure is the most conserved structure Loops have the higher variability in structure.
Comparative modelling - Composer - Modeler SCR & VR SCR VR - Ring Closure - DB Search - Classification Alignments Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs Hidropaticity Plot 2D Structure 2D prediction
MODEL Comparative modelling - Composer - Modeler SCR & VR SCR VR - Ring Closure - DB Search - Classification Alignments Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs Hidropaticity Plot 2D Structure 2D prediction Step 3 Model Building
Fragment Extraction Rigid-body Assembly Loop Construction Spatial Constrains Optimisation Methods Satisfaction of Spatial Constrains
averaging core template backbone atoms (weighted by local sequence similarity with the target sequence) Leave non-conserved regions (loops) for later …. a) Build conserved core framework Rigid Body assembly
use the “spare part” algorithm to find compatible fragments in a Loop-Database “ab-initio” rebuilding of loops (Monte Carlo, molecular dynamics, genetic algorithms, etc.) b) Loop modeling Rigid Body assembly
EF-Hand Calcium binding aa{baalal}bbXh{DXDpDG}Xh aa{baalal}bb Xh{DXDpDG}Xh D 100% D/N 100%
P-loop GTP binding Conformation bb{eppgag}aa Motif sequence hh{GhXXpG}Kp P-loop GTP binding Conformation bb{eppgag}aa Motif sequence hh{GhXXpG}Kp
NAD(P)/FADbinding bb{eab}aahh{GhG}hX bb{eab}aa hh{GhG}hX
Ring closure search Distance restraints
Building CA Superposition
Building
A. Sali & T. Blundel (1993) JMB 234, Modeling by Satisfaction of Spatial restraints M A T E A F T S G Q
Feature properties can be associated with a protein (e.g. X-ray resolution) residues (e.g. solvent accessibility) pairs of residues (e.g. C a - C a distance) other features (e.g. main chain classes) How can we derive modeling restraints from this data? A restraint is defined as probability density function (pdf) p(x): with Modeling by Satisfaction of Spatial restraints
combine basis pdfs to molecular probability density functions Modeling by Satisfaction of Spatial restraints
Satisfaction of spatial restraints Find the protein model with the highest probability Variable target function: Start with e.g. a random conformation model and use only local restraints minimize some steps using a conjugate gradient optimization repeat with introducing more and more long range restraints until all restraints are used Modeling by Satisfaction of Spatial restraints
Threading Idea: Find the optimal structure for a new (target) sequence in the set of known 3D-structures (templates) by threading the target sequence.
Nº sequences > Nº folds Classification - Homologs - Remotes - Analogs Step 1 Selection of templates FOLD ASSIGNMENT Structure 3D
Pseudo Potencial Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs
Pseudo-potencial Principles Potential of Mean Force: Applied on proteins Boltzman law Use frequencies instead of probabilities. They are taken from non-homologous proteins on the PDB.
Fold recognition / Threading Principle: Find a compatible fold for a given sequence.... >Protein XY MSTLYEKLGGTTAVDLAV DKFYERVLQDDRIKHFFA DVDMAKQRAHQKAFLTYA FGGTDKYDGRYMREAHKE LVENHGLNGEHFDAVAED LLATLKEMGVPEDLIAEV AAVAGAPAHKRDVLNQ ? Using... 1D – 3D profile matching, secondary structure predictions, position specific scoring matrices (PSSM), mean force potentials, keyword statistics,....
Predicción de “fold” Pseudo Potencial Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs “fold” prediction
Predicción del “fold” Combine sequence with all folds Evaluate energies Data Base PDB Data Base PDB Energy < threshold ? Sequence Are energy profiles native-like ? YES NO
Fold recognition from Secondary structure prediction TOPITS
MODEL Predicción de “fold” Pseudo Potencial Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs “fold” prediction
Hidropaticity Plot 2D prediction MODEL Predicción de “fold” Pseudo Potencial Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs “fold” prediction 2D Structure Comparative modelling - Composer - Modeler SCR & VR SCR VR - Ring Closure - DB Search - Classification Alignments
Hidropaticity Plot 2D prediction Evaluate de model MODEL Predicción de “fold” Pseudo Potencial Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs “fold” prediction 2D Structure Comparative modelling - Composer - Modeler SCR & VR SCR VR - Ring Closure - DB Search - Classification Alignments
Why Model Evaluation ? Just because it’s pretty doesn’t make it right ! Topics: correct fold model coverage (%) C - deviation (rmsd) alignment accuracy (%) side chain placement
1.Errors in side-chain packing. 2.Shifts of correctly aligned residues. 3.Regions without template. 4.Errors due to misalignments. 5.Errors produced by incorrect templates. Types of Errors
EVA Evaluation of Automatic protein structure prediction [ Burkhard Rost, Andrej Sali, ] CASP Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction 3D - Crunch Very Large Scale Protein Modelling Project Model Accuracy Evaluation
Hidropaticity Plot 2D prediction Evaluate de model MODEL Predicción de “fold” Pseudo Potencial Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs “fold” prediction 2D Structure Comparative modelling - Composer - Modeler SCR & VR SCR VR - Ring Closure - DB Search - Classification Alignments
Hidropaticity Plot 2D prediction Evaluate de model MODEL Predicción de “fold” Pseudo Potencial Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs “fold” prediction 2D Structure Comparative modelling - Composer - Modeler SCR & VR SCR VR - Ring Closure - DB Search - Classification Alignments Side-chains - Multi-Copy
c) Side Chain placement Find the most probable side chain conformation, using homologues structures back-bone dependent rotamer libraries energetic and packing criteria Rigid Body assembly
Hidropaticity Plot 2D prediction Evaluate de model MODEL Predicción de “fold” Pseudo Potencial Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs “fold” prediction 2D Structure Comparative modelling - Composer - Modeler SCR & VR SCR VR - Ring Closure - DB Search - Classification Alignments Side-chains - Multi-Copy
Hidropaticity Plot 2D prediction Evaluate de model MODEL Predicción de “fold” Pseudo Potencial Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs “fold” prediction 2D Structure Comparative modelling - Composer - Modeler SCR & VR SCR VR - Ring Closure - DB Search - Classification Alignments Side-chains - Multi-Copy Optimization - Minimization E - Simulated Annealing
modeling will produce unfavorable contacts and bonds idealization of local bond and angle geometry extensive energy minimization will move coordinates away keep it to a minimum SwissModel is using GROMOS 96 force field for a steepest descent d) Energy minimization Rigid Body assembly
OPTIMIZATION 1- Minimization by Newton-Rapson X i+1 = X i + grad(E) 2- Simulated Annealing
Optimization schedule and progress [ A. Sali & T. Blundel (1993) JMB 234, 779 – 815 ] Modeling by Satisfaction of Spatial restraints
Hidropaticity Plot 2D prediction Evaluate de model MODEL Predicción de “fold” Pseudo Potencial Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs “fold” prediction 2D Structure Comparative modelling - Composer - Modeler SCR & VR SCR VR - Ring Closure - DB Search - Classification Alignments Side-chains - Multi-Copy Optimization - Minimization E - Simulated Annealing
Hidropaticity Plot 2D prediction Evaluate de model MODEL Predicción de “fold” Pseudo Potencial Nº sequences > Nº folds Structure 3D Classification - Homologs - Remotes - Analogs “fold” prediction 2D Structure Comparative modelling - Composer - Modeler SCR & VR SCR VR - Ring Closure - DB Search - Classification Alignments Side-chains - Multi-Copy Optimization - Minimization E - Simulated Annealing Molecular Dynamics
MOLECULAR DYNAMICS Finite differences algorithm Coupling to a thermal bath Statistical Mechanics Ergodic Hipothesis
EXAMPLE
clevage clevage + Activation segment activation Enzime PRO-CARBOXIPEPTIDASA
PRO-CARBOXYPEPTIDASES Porcine Pro Carboxypeptidase B PCPBpPorcine Pro Carboxypeptidase A1 PCPA1pBovine PCPA1b
PCPA2h -ETFVGDQVLEIVPSNEEQIKNLLQLEAQEHLQLDFWKSPTTPGETAHVRVPFVNVQAVKVFLESQGIAYSIMIEDVQVL PCPA2h PCPA1b KEDFVGHQVLRITAADEAEVQTVKELEDLEHLQLDFWRGPGQPGSPIDVRVPFPSLQAVKVFLEAHGIRYRIMIEDVQSL PCPA1b PCPA1p KEDFVGHQVLRISVDDEAQVQKVKELEDLEHLQLDFWRGPARPGFPIDVRVPFPSIQAVKVFLEAHGIRYTIMIEDVQLL PCPA1p PCPA2h LDKENEEMLFNRRRERSGN-FNFGAYHTLEEISQEMDNLVAEHPGLVSKVNIGSSFENRPMNVLKFSTGG-DKPAIWLDA PCPA2h PCPA1b LDEEQEQMFASQSRARSTNTFNYATYHTLDEIYDFMDLLVAEHPQLVSKLQIGRSYEGRPIYVLKFSTGGSNRPAIWIDL PCPA1b PCPA1p LDEEQEQMFASQGRARTTSTFNYATYHTLEEIYDFMDILVAEHPALVSKLQIGRSYEGRPIYVLKFSTGGSNRPAIWIDS PCPA1p PCPA2h GIHAREWVTQATALWTANKIVSDYGKDPSITSILDALDIFLLPVTNPDGYVFSQTKNRMWRKTRSKVSGSLCVGVDPNRN PCPA2h PCPA1b GIHSREWITQATGVWFAKKFTEDYGQDPSFTAILDSMDIFLEIVTNPDGFAFTHSQNRLWRKTRSVTSSSLCVGVDANRN PCPA1b PCPA1p GIXSRXWITQASGVWFAKKITENYGQNSSFTAILDSMDIFLEIVTNPNGFAFTHSDNRLWRKTRSKASGSLCVGSDSNRN PCPA1p PCPA2h WDAGFGGPGASSNPCSDSYHGPSANSEVEVKSIVDFIKSHGKVKAFIILHSYSQLLMFPYGYKCTKLDDFDELSEVAQKA PCPA2h PCPA1b WDAGFGKAGASSSPCSETYHGKYANSEVEVKSIVDFVKDHGNFKAFLSIHSYSQLLLYPYGYTTQSIPDKTELNQVAKSA PCPA1b PCPA1p WDAGFGGAGASSSPCAETYHGKYPNSEVEVKSITDFVKNNGNIKAFISIXSYSQLLLYPYGYKTQSPADKSELNQIAKSA PCPA1p PCPA2h AQSLRSLHGTKYKVGPICSVIYQASGGSIDWSYDYGIKYSFAFELRDTGRYGFLLPARQILPTAEETWLGLKAIMEHVRD PCPA2h PCPA1b VEALKSLYGTSYKYGSIITTIYQASGGSIDWSYNQGIKYSFTFELRDTGRYGFLLPASQIIPTAQETWLGVLTIMEHTLN PCPA1b PCPA1p VAALKSLYGTSYKYGSIITVIYQASGGVIDWTYNQGIKYSFSFELRDTGRRGFLLPASQIIPTAQETWLALLTIMEHTLN PCPA1p SEQUENCE ALIGNMENT OF PRO-CARBOXYPEPTIDASES
ComparativeModelling Model building of conserved regions Baldomero Oliva Miguel UniversitatPompeuFabra
UniversitatPompeuFabra ComparativeModelling Model building of conserved regions
Add and model build loop regions Baldomero Oliva Miguel UniversitatPompeuFabra ComparativeModelling Model building of conserved regions
Baldomero Oliva Miguel UniversitatPompeuFabra Add and model build loop regions ComparativeModelling Model building of conserved regions
Secondary Structure Prediction and Multiple Alignment of Pro-Carboxypeptidases EEEEE HHHHHHHHHHHHHHH EEEE EEEEE HHHHHHHHHH EEEEHHHHHHHH PCPA2h PHD PCPA2h -ETFVGDQVLEIVPSNEEQIKNLLQLEAQEHLQLDFWKSPTTPGETAHVRVPFVNVQAVKVFLESQGIAYSIMIEDVQVL PCPA2h PCPA1b KEDFVGHQVLRITAADEAEVQTVKELEDLEHLQLDFWRGPGQPGSPIDVRVPFPSLQAVKVFLEAHGIRYRIMIEDVQSL PCPA1b PCPA1p KEDFVGHQVLRISVDDEAQVQKVKELEDLEHLQLDFWRGPARPGFPIDVRVPFPSIQAVKVFLEAHGIRYTIMIEDVQLL PCPA1p HHHHHHHHHHHHHH PCPA2h PHD PCPA2h LDKENEEMLFNRRRERSGN-FNFGAYHTLEEISQEMDNLVAEHPGLVSKVNIGSSFENRPMNVLKFSTGG-DKPAIWLDA PCPA2h PCPA1b LDEEQEQMFASQSRARSTNTFNYATYHTLDEIYDFMDLLVAEHPQLVSKLQIGRSYEGRPIYVLKFSTGGSNRPAIWIDL PCPA1b PCPA1p LDEEQEQMFASQGRARTTSTFNYATYHTLEEIYDFMDILVAEHPALVSKLQIGRSYEGRPIYVLKFSTGGSNRPAIWIDS PCPA1p PCPA2h GIHAREWVTQATALWTANKIVSDYGKDPSITSILDALDIFLLPVTNPDGYVFSQTKNRMWRKTRSKVSGSLCVGVDPNRN PCPA2h PCPA1b GIHSREWITQATGVWFAKKFTEDYGQDPSFTAILDSMDIFLEIVTNPDGFAFTHSQNRLWRKTRSVTSSSLCVGVDANRN PCPA1b PCPA1p GIXSRXWITQASGVWFAKKITENYGQNSSFTAILDSMDIFLEIVTNPNGFAFTHSDNRLWRKTRSKASGSLCVGSDSNRN PCPA1p PCPA2h WDAGFGGPGASSNPCSDSYHGPSANSEVEVKSIVDFIKSHGKVKAFIILHSYSQLLMFPYGYKCTKLDDFDELSEVAQKA PCPA2h PCPA1b WDAGFGKAGASSSPCSETYHGKYANSEVEVKSIVDFVKDHGNFKAFLSIHSYSQLLLYPYGYTTQSIPDKTELNQVAKSA PCPA1b PCPA1p WDAGFGGAGASSSPCAETYHGKYPNSEVEVKSITDFVKNNGNIKAFISIXSYSQLLLYPYGYKTQSPADKSELNQIAKSA PCPA1p PCPA2h AQSLRSLHGTKYKVGPICSVIYQASGGSIDWSYDYGIKYSFAFELRDTGRYGFLLPARQILPTAEETWLGLKAIMEHVRD PCPA2h PCPA1b VEALKSLYGTSYKYGSIITTIYQASGGSIDWSYNQGIKYSFTFELRDTGRYGFLLPASQIIPTAQETWLGVLTIMEHTLN PCPA1b PCPA1p VAALKSLYGTSYKYGSIITVIYQASGGVIDWTYNQGIKYSFSFELRDTGRRGFLLPASQIIPTAQETWLALLTIMEHTLN PCPA1p
Pseudo-energy(threading) Wrong model Evaluation
G (C’) N (C’’) S (Ccap ) R (C1) E (C2) R (C3) R (C5) R (C4) Schellman Motif Profile C´´ » C3 / C´ Gly C3 C2 C1 Ccap C’ C’’ L Q E H G I L Q E H G I A R A K G V A R A K G V M K K L G K M K K L G K pont d’ hidrògen Interacció hidrofòbica N R R R E R S G N F N ponts d’hidrògen C3 Ccap C’’ ^ ^ ^ -Helix C-cap Schellman Motif Schellman Motif
Refinement of the Model
pseudo-energy residue IMPROVED
Protein Structure Resources PDBhttp:// PDB – Protein Data Bank of experimentally solved structures (RCSB) CATHhttp:// Hierarchical classification of protein domain structures SCOPhttp://scop.mrc-lmb.cam.ac.uk/scop/ Alexey Murzin’s Structural Classification of proteins DALIhttp://www2.ebi.ac.uk/dali/ Lisa Holm and Chris Sander’s protein structure comparison server SS-Prediction and Fold Recognition PHDhttp://cubic.bioc.columbia.edu/predictprotein/ Burkhard Rost’s Secondary Structure and Solvent Accessibility Prediction Server 3DPSSMhttp:// Fold Recognition Server using 1D and 3D Sequence Profiles coupled with Secondary Structure and Solvation Potential Information.
Protein Homology Modeling Resources SWISS MODEL: Deep View - SPDBV: homepage: Tutorials WhatIfhttp:// Gert Vriend’s protein structure modeling analysis program WhatIf Modeller: Andrej Sali's homology protein structure modelling by satisfaction of spatial restraints FAMS: Full Automatic Modelling System (FAMS); Kitasato University; Tokyo, Japan 3D-JIGSAW: Comparative Modelling Server; Imperial Cancer Research Fund; London, UK CPHmodels: Centre for Biological Sequence Analysis; The Technical University of Denmark; Denmark SDSC1: SDSC Structure Homology Modelling Server; San Diego Supercomputing Centre
Protein Homology Modeling Resources SWISS MODEL: Example Dear baldo, Title of your Request: subtilisin Swiss-Model '(SWISS-MODEL Version )' started on 27-Oct-2003 Process identification is AAAa08uiF The modelling procedure is now in progress, and its results should be sent to you shortly. In case of difficulties, please contact: Torsten Schwede Nicolas Guex
Protein Homology Modeling Resources SWISS MODEL: Example Length of target sequence: 379 residues Searching sequences of known 3D structures Found 1c3lA.pdb with P(N)=0 Found 1be8_.pdb with P(N)=0 Found 1cseE.pdb with P(N)=0 Found 1scd_.pdb with P(N)=0 Found 1sbc_.pdb with P(N)=0 Found 1be6_.pdb with P(N)=0 …….
Protein Homology Modeling Resources 3D JigSaw Example This is an automatic split of your sequence in protein domains Follow the link below and select the modelling templates for each modelable domain 2 domain/s found in your query sequence (2 have homologous templates). See them at: (this file will be erased in a week time) Your submission: >subtilisin MMRKKSFWFGMLTAFMLVFTMEFSDSASAAQPGKNVEKDYFVGFKSGVKTASVKKDIIKE SGGKVDKQFRIINAAKATLDKEALKEVKNDPDVAYVEEDHVAHALGQTVPYGIPLIKADK VQAQGFKGANVKVAVLDTGIQASHPDLNVVGGASFVAGEAYNTDGNGHGTHVAGTVAALD NTTGVLGVAPSVSLFAVKVLNSSGSGSYSGIVSGIEWATTNGMDVINMSLGGPSGSTAMK QAVDNAYSKGVVPVAAAGNSGSSGYTNTIGYPAKYDSVIAVGAVDSNSNRASFSSVGAEL EVMAPGAGVYSTYPTNTYATLNGTSMASPHVAGAAALILSKHPNLSASQVRNRLSSTATY LGSSFYYGKGLINVEAAAQ Send comments See you soon