Protein homology modeling Morten Nielsen, CBS, BioCentrum, DTU.

Objectives
Understand the basic concepts of homology modeling
Learn why even sequences with very low sequence similarity can be modeled
–Understand why %id is such a terrible measure of reliability
See the beauty of sequence profiles

Background. Why protein modeling? Because it works! –Close to 50% of all new sequences can be homology modeled Experimental effort to determine protein structure is very large and costly The gap between the size of the protein sequence data and protein structure data is large and increasing

Homology modeling and the human genome

Swiss-Prot database ~ in Swiss-Prot ~ if TrEMBL is included

PDB New Fold Growth
The number of unique folds in nature is fairly small (possibly a few thousand)
90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
The number of new folds is NOT growing
(Figure legend: new folds, old folds, new PDB structures)

Worldwide Structural Genomics ”Fold space coverage” Complete genomes Signaling proteins Improving technology Disease-causing organisms Model organisms Membrane proteins Protein-ligand interactions

Structural Genomics in North America 10 year $600 million project initiated in 2000, funded largely by NIH AIM: structural information on unique proteins (now ), so far 1000 have been determined Improve current techniques to reduce time (from months to days) and cost (from $ to $20.000/structure) 9 research centers currently funded (2005), targets are from model and disease-causing organisms (a separate project on TB proteins)

Homology modeling for structural genomics Roberto Sánchez et al. Nature Structural Biology 7, (2000) What a new fold can give

Sali, A. & Kuriyan, J. Trends Biochem. Sci. 22, M20–M24 (1999) How well can we do it?

Homology modeling. Why can we do it? The structure of a protein is uniquely determined by its amino acid sequence (but sequence is sometimes not enough): –prions –pH, ions, cofactors, chaperones Structure is conserved much longer than sequence in evolution

Identification of fold
If sequence similarity is high, proteins share structure (safe zone)
If sequence similarity is low, proteins may share structure (twilight zone)
Most proteins do not have a homologous partner with high sequence similarity
Rajesh Nair & Burkhard Rost, Protein Science, 2002, 11

Example.
A postdoc in our group, Anne Mølgaard, did her PhD obtaining the structure of the sequence below
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV
VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL
FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL
GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
What is the function?
Where is the active site?

Could she have saved three years' work?
Function
–Run Blast against PDB: no significant hits
–Run Blast against NR (sequence database): function is acetylesterase?
Where is the active site?

Example. Where is the active site? 1WAB Acetylhydrolase 1G66 Acetylxylan esterase 1USW Hydrolase

Example. Where is the active site?
Align sequence against structures of known acetylesterases, like 1WAB, 1FXW, …
Cannot be aligned. Too low sequence similarity
1K7C.A 1WAB._ RMSD
QAL 1K7C.A  71 GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF
DAL 1WAB._ 160 GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY

Is it really impossible?
Worked for 2-3 years in SBI-AT developing methods for homology modeling in the twilight zone
Shown that homology modeling is possible even for very low sequence homology
So, try to show that Anne could have saved 3 years of work if she had used the most advanced homology modeling techniques

How can we do it?
Identify template(s) – initial alignment (can give you protein function)
Improve alignment (can give you active site)
Backbone generation
Loop modeling (most difficult part)
Side chains
Refinement
Validation

How to do it
Identify fold (template) for modeling
–Find the structure in the PDB database that resembles your new protein the most
–Can be used to predict function
–And maybe active sites
Align protein sequence to template
–Simple alignment methods
–Sequence profiles
–Threading methods
–Pseudo force fields
Model side chains and loops
(Slide annotations: My work, Cristina's work)

Protein homology modeling is only possible if %id is greater than 30-50%: WRONG!!!!!!!

Why %id is so bad!! 1200 models sharing 25-95% sequence identity with the submitted sequences (

Identification of correct fold
% ID is a poor measure
–Many evolutionarily related proteins share low sequence homology
–A short alignment of 5 amino acids can share 100% id, what does this mean?
Alignment score is even worse
–Many sequences will score high against everything (hydrophobic stretches)
P-value or E-value is more reliable
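The point about short alignments can be made concrete with a small sketch. The function name and the toy sequences below are made up for illustration; the code only counts identical residues over aligned, non-gap columns, which is one common way to define %id.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the aligned (non-gap) columns of two sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# A 5-residue fragment that happens to match perfectly (toy data).
short = percent_identity("AGDSL", "AGDSL")

# A 100-column alignment in which only the first 30 columns are identical (toy data).
longer = percent_identity("LAGDS" * 20, "LAGDS" * 6 + "IVKTN" * 14)

print(short, longer)
```

The 5-residue match reports a higher %id than the 100-column alignment, although the longer alignment is far stronger evidence of a relationship. %id carries no information about alignment length or chance expectation, which is what P-values and E-values add.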

What are P and E values?
E-value
–Number of expected hits in database with score higher than match
–Depends on database size
P-value
–Probability that a random hit will have score higher than match
–Database size independent
Example (score distribution figure): 10 hits with higher score (E=10), 10000 hits in database => P = 10/10000 = 0.001
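The relation used in the slide's example (P = E / database size) can be written as a short sketch. The function names are hypothetical, and the code follows the slide's simple definition rather than the exact statistics of any particular search program.

```python
def p_value_from_e_value(e_value, database_size):
    """P-value under the slide's definition: expected hits divided by database size."""
    if database_size <= 0:
        raise ValueError("database_size must be positive")
    return e_value / database_size


def e_value_from_p_value(p_value, database_size):
    """Expected number of random hits with a higher score in a database of this size."""
    if database_size <= 0:
        raise ValueError("database_size must be positive")
    return p_value * database_size


# The slide's example: E = 10 in a database of 10000 entries.
p = p_value_from_e_value(10, 10000)

# The same P-value gives a ten times larger E-value in a ten times larger database.
e_big = e_value_from_p_value(p, 100000)

print(p, e_big)
```

This illustrates why the E-value depends on database size while the P-value does not.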

Template identification
Simple sequence based methods
–Align (BLAST) sequence against sequence of proteins with known structure (PDB database)
Sequence profile based methods
–Align sequence profile (Psi-BLAST) against sequence of proteins with known structure (PDB, FUGUE)
–Align sequence profile against profile of proteins with known structure (FFAS)
Sequence and structure based methods
–Align profile and predicted secondary structure against proteins with known structure (3D-PSSM, Phyre)
Sequence profiles and structure based methods
–Our work

Template quality Selecting the best template is crucial! The best template may not be the one with the highest % id (best p-value…) –Template 1: 93% id, 3.5 Å resolution –Template 2: 90% id, 1.5 Å resolution

Template quality – Ramachandran plot

What goes wrong when Blast fails?
Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences
This scoring matrix is identical at all positions in the protein sequence!
(Figure residue: EVVFIGDSLVQLMHQC X X AGDS.GGGDS AGDS.GGGDS)

When Blast works! 1PLC._ 1PLB._

When Blast fails! 1PLC._ 1PMY._

When Blast fails, use sequence profiles! 1PLC._ 1PMY._

Blosum scoring matrix (figure: 20 x 20 matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V)

Sequence profiles
In reality not all positions in a protein are equally likely to mutate
Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high
Other amino acids can mutate almost for free, and the penalty for a mismatch should be lower than in the BLOSUM matrix
Sequence profiles can capture these differences

Protein world Protein fold Protein structure classification Protein superfamily Protein family New Fold

Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Conserved: matching anything but G => large negative score
Non-conserved: anything can match
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

How to make sequence profiles
1. Align (BLAST) sequence against large sequence database (Swiss-Prot)
2. Select significant alignments and make profile (weight matrix) using techniques for sequence weighting and pseudo counts
3. Use weight matrix to align against sequence database to find new significant hits
4. Repeat 2 and 3 (normally 3 times!)
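Step 2 (turning aligned sequences into a weight matrix) can be sketched in a few lines. This is a deliberately simplified illustration, not the procedure Psi-BLAST uses: it assumes a uniform background frequency of 1/20, adds a constant pseudo count to every amino acid, and skips sequence weighting entirely. The function name and the toy alignment are made up.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping amino acid -> log-odds score (bits)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    profile = []
    for col in range(length):
        # Start every amino acid at the pseudo count so unseen residues keep a finite score.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            residue = seq[col]
            if residue in counts:  # gaps and unknown characters are ignored
                counts[residue] += 1.0
        total = sum(counts.values())
        profile.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment: the first and last columns vary, the GDS motif is fully conserved.
toy_alignment = ["AGDSL", "IGDSV", "VGDST", "LGDSA"]
profile = build_profile(toy_alignment)

print(round(profile[1]["G"], 2), round(profile[1]["A"], 2), round(profile[0]["A"], 2))
```

In the conserved column, G receives a clearly positive score and every other amino acid a negative one, while in the variable first column an observed residue such as A scores only mildly positive. A real profile would also down-weight redundant sequences and use substitution-matrix-based pseudo counts.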

And how to really do it?
Make profile (three iterations)
blastpgp -i fastafile -d nr -j 4 -e 1e-5 -C restart.file
Run profile against database
blastpgp -i fastafile -d db.fsa -R restart.file

Sequence profiles (1J2J.B) (figure panels: 0 iterations (Blosum62), 1 iteration, 2 iterations, 3 iterations)

Example. Anne's sequence (SGNH active site)

Example. Where is the active site? Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195

Profile-profile scoring matrix 1K7C.A 1WAB._

Example. Where is the active site?
Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = % ID
1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
(marked residues: S G N)
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG
1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP
1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
(marked residue: H)
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L

Structural superposition Blue: 1K7C.A Red: 1WAB._

Where was the active site? Rhamnogalacturonan acetylesterase (1k7c)

Including structure
Sequences within a protein superfamily share only remote sequence homology, but they share high structural homology
Structure is known for template
Predict structural properties for query
–Secondary structure
–Surface exposure
Position specific gap penalties derived from secondary structure and surface exposure

Using structure
Sequence & structure profile-profile based alignments
–Template: sequence based profiles, annotated secondary structure, predicted secondary structure
–Query: sequence based profile, predicted secondary structure
–Position specific gap penalties derived from secondary structure

How good are we?

Alignment accuracy. Scoring functions
Blosum62 score matrix. Fg=1. Ng=0?
(Figure: path matrix for sequences LAGDSD and FIGDSL; cell values not recoverable)
Alignment
LAGDS
I-GDS
Score = 2-1+6+6+4 = 17
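The score of 17 for the alignment LAGDS / I-GDS can be checked with a short sketch. It reads Fg as the penalty for the first gap position and Ng as the penalty for each following gap position, which is an interpretation of the slide's abbreviations. Only the BLOSUM62 entries needed for this example are hard-coded, and the function names are hypothetical.

```python
# Subset of BLOSUM62: only the residue pairs that occur in the example alignment.
BLOSUM62_SUBSET = {
    ("L", "I"): 2,
    ("G", "G"): 6,
    ("D", "D"): 6,
    ("S", "S"): 4,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    if (b, a) in matrix:
        return matrix[(b, a)]
    raise KeyError("no score for pair %s/%s in this matrix subset" % (a, b))


def alignment_score(top, bottom, first_gap=1, next_gap=0):
    """Sum substitution scores and subtract gap penalties along an alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            # First gap position costs first_gap, each extension costs next_gap.
            score -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            score += pair_score(a, b)
            in_gap = False
    return score


score = alignment_score("LAGDS", "I-GDS")
print(score)
```

The terms are L/I = 2, one gap = -1, G/G = 6, D/D = 6 and S/S = 4. For simplicity the sketch does not distinguish a gap in the top sequence from a gap in the bottom sequence when deciding whether a gap is being extended.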

Alignment accuracy. Scoring functions
How to make the most of sequence profiles?
(Figure: two sequence profiles with columns A R N D C Q E G H I L K M F P S T W Y V; rows are positions T, T, V, …, G, D, S of 1K7C.A and E, V, …, G, D, S of 1WAB._)

Alignment accuracy

Fold recognition
Benchmark
–Query set: 100 sequences in the training set, 200 in the test set
–Database of 355 PDB structures
–Align query against database: if structurally similar hit = 1, else hit = 0
–Use CE to define structural similarity
Calculate AUC (area under the ROC curve)
–Perfect method can separate hits from non-hits
How to rank hits?
–Alignment score?
–%Id
–Z score (p-value)

CE structural alignment (combinatorial extension)

AUC performance measure
(Score and Hit/nonhit values were lost in extraction)
Query Templ Score Hit/nonhit
1CJ0.A 1B78.A
1CJ0.A 1B8A.A
1CJ0.A 1B8B.A
1CJ0.A 1B8G.A
1CJ0.A 1B9H.A
1CJ0.A 1BAR.A
1CJ0.A 1BAV.C
Query Templ Score Hit/nonhit
1CJ0.A 1B8G.A
1CJ0.A 1DTY.A
1CJ0.A 1DGD._
1CJ0.A 1GTX.A
1CJ0.A 2GSA.A
1CJ0.A 1BW9.A
1CJ0.A 1AUP._
1CJ0.A 1GTM.A
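A minimal sketch of the AUC calculation for one query follows, using the pairwise definition: the fraction of (hit, non-hit) pairs in which the hit has the higher score, with ties counting one half. The scores and labels are invented toy values, not the benchmark data, and the function name is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve from scores and 0/1 hit labels (higher score = better)."""
    hits = [s for s, y in zip(scores, labels) if y == 1]
    non_hits = [s for s, y in zip(scores, labels) if y == 0]
    if not hits or not non_hits:
        raise ValueError("need at least one hit and one non-hit")
    wins = 0.0
    for h in hits:
        for n in non_hits:
            if h > n:
                wins += 1.0  # hit ranked above non-hit
            elif h == n:
                wins += 0.5  # tie counts as half
    return wins / (len(hits) * len(non_hits))


# A perfect ranking: both hits score above both non-hits (toy values).
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])

# One hit is ranked below one non-hit (toy values).
imperfect = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])

print(perfect, imperfect)
```

A method that separates all hits from all non-hits reaches an AUC of 1.0, and random ranking gives about 0.5. Any of the ranking criteria listed in the benchmark (alignment score, %Id, Z score) can be supplied as `scores`.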

Fold recognition performance

Outlook
Include position dependent gap penalties
The method now uses equal gap penalties throughout the scoring matrix
In real proteins placement of insertions and deletions is highly structure dependent
–No gaps in secondary structure elements
–Gaps most frequent in loops
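One possible way to express position dependent gap penalties is to look up a gap-open cost from the secondary structure state at each position. The penalty values, the three-state alphabet (H helix, E strand, C coil/loop) and the function name below are all hypothetical choices for illustration, not values from the method described here.

```python
# Hypothetical gap-open penalties per secondary structure state (illustrative values only).
GAP_OPEN_BY_STATE = {"H": 15.0, "E": 15.0, "C": 5.0}


def position_specific_gap_open(secondary_structure, penalties=GAP_OPEN_BY_STATE, default=10.0):
    """Return one gap-open penalty per position, high in helices/strands and low in loops."""
    return [penalties.get(state, default) for state in secondary_structure]


# Toy secondary structure string for a template (made up).
template_ss = "CCHHHHHCCEEEECC"
gap_open = position_specific_gap_open(template_ss)

print(gap_open)
```

An alignment routine would then use `gap_open[i]` instead of a single constant when opening a gap at position `i`, which makes gaps inside helices and strands expensive and gaps in loops cheap.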

CASP. Which are the best methods?
Critical Assessment of Structure Prediction
Every second year
Sequences from about-to-be-solved structures are given to groups who submit their predictions before the structure is published
Modelers make predictions
Meeting in December where correct answers are revealed

CASP6 results

The top 4 homology modeling groups in CASP6 All winners use consensus predictions – The wisdom of the crowd Same approach as in CASP5! Nothing has happened in 2 years!

The Wisdom of the Crowds
The Wisdom of Crowds. Why the Many are Smarter than the Few. James Surowiecki
One day in the fall of 1906, the British scientist Francis Galton left his home and headed for a country fair… He believed that only a very few people had the characteristics necessary to keep societies healthy. He had devoted much of his career to measuring those characteristics, in fact, in order to prove that the vast majority of people did not have them. … Galton came across a weight-judging competition… Eight hundred people tried their luck. They were a diverse lot, butchers, farmers, clerks and many other non-experts… The crowd had guessed … pounds, the ox weighed 1,198 pounds

The wisdom of the crowd! –The highest scoring hit will often be wrong Not one single prediction method is consistently best –Many prediction methods will have the correct fold among the top hits –If many different prediction methods all have a common fold among the top hits, this fold is probably correct

3D-Jury (Best group)
Inspired by ab initio modeling methods
–Average of frequently obtained low energy structures is often closer to the native structure than the lowest energy structure
Find most abundant high scoring model in a list of predictions from several predictors
1. Use output from a set of servers
2. Superimpose all pairs of structures
3. Similarity score $S_{ij}$ = number of C$\alpha$ pairs within 3.5 Å (if the number is > 40; else $S_{ij} = 0$)
4. 3D-Jury score = $\sum_{ij} S_{ij} / (N+1)$
Similar methods developed by A Elofsson (Pcons) and D Fischer (3D shotgun)
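Steps 3 and 4 can be sketched as follows, assuming the pairwise superpositions have already been done and the number of Cα pairs within 3.5 Å is available for every pair of models. The sketch computes one score per model by summing over all other models and takes N to be the number of other models; both are assumptions about how to read the formula, and the pair counts are invented.

```python
def jury_scores(pair_counts, min_pairs=40):
    """Score each model by its thresholded similarity to all other models.

    pair_counts[i][j] is the number of C-alpha pairs within 3.5 A after
    superimposing model i on model j (assumed to be precomputed).
    """
    n_models = len(pair_counts)
    n_others = n_models - 1  # assumption: N is the number of other models
    scores = []
    for i in range(n_models):
        total = 0
        for j in range(n_models):
            if i == j:
                continue
            count = pair_counts[i][j]
            # Similarities at or below the threshold contribute nothing.
            total += count if count > min_pairs else 0
        scores.append(total / (n_others + 1))
    return scores


# Toy data: models 0-2 resemble each other, model 3 is an outlier.
pair_counts = [
    [0, 80, 90, 10],
    [80, 0, 85, 20],
    [90, 85, 0, 30],
    [10, 20, 30, 0],
]
scores = jury_scores(pair_counts)
best_model = max(range(len(scores)), key=lambda i: scores[i])

print(scores, best_model)
```

The outlier scores zero because none of its similarities exceed the threshold, and the model most similar to the rest of the crowd wins. This is the consensus idea: a fold that many independent predictors agree on is probably correct.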

How to do it? Where is the crowd?
Meta prediction server
–Web interface to a list of public protein structure prediction servers
–Submit query sequence to all selected servers in one go

Meta Server Evaluating the crowd.

Meta Server Evaluating the crowd. 3D Jury

Take home message
Identifying the correct fold is only a small step towards successful homology modeling
Do not trust % ID or alignment score to identify the fold. Use p-values
Use sequence profiles and local protein structure to align sequences
Do not trust one single prediction method, use consensus methods (3D Jury)
Only if everything else fails, use ab initio methods