1
Protein Fold Recognition
Morten Nielsen, CBS, Department of Systems Biology, DTU
2
Objectives
Understand the basic concepts of fold recognition
Learn why even sequences with very low sequence similarity can be modeled
–Understand why %id is such a terrible measure of reliability
See the beauty of sequence profiles
–Position specific scoring matrices (PSSMs)
3
Objectives and..... See the beauty of sequence profiles –Position specific scoring matrices (PSSMs)
4
Background. Why protein modeling?
Because it works!
–Close to 50% of all new sequences can be homology modeled
Experimental effort to determine protein structure is very large and costly
The gap between the size of the protein sequence data and protein structure data is large and increasing
5
Growth of databases
6
Homology modeling and the human genome
7
How can we do it?
Identify template(s) – initial alignment
–Can give you protein function
Improve alignment
–Can give you active site
Backbone generation
Loop modeling
–Most difficult part
Side chains
Refinement
Validation
8
Identification of fold
If sequence similarity is high, proteins share structure (Safe zone)
If sequence similarity is low, proteins may share structure (Twilight zone)
Most proteins do not have a partner with high sequence similarity
Rajesh Nair & Burkhard Rost, Protein Science, 2002, 11, 2836-47
9
Structural Genomics in North America
10-year $600 million project initiated in 2000, funded largely by NIH
AIM: structural information on 10000 unique proteins (now 4-6000), so far 1000 have been determined
Improve current techniques to reduce time (from months to days) and cost (from $100,000 to $20,000/structure)
9 research centers currently funded (2005), targets are from model and disease-causing organisms (a separate project on TB proteins)
10
Homology modeling for structural genomics
Roberto Sánchez et al. Nature Structural Biology 7, 986-990 (2000)
What a new fold can give
11
Example.
A post doc in our group did her PhD obtaining the structure of the sequence below
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV
VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL
FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL
GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
What is the function?
Where is the active site?
12
What would you do?
Function
–Run Blast against PDB
–No significant hits
–Run Blast against NR (Sequence database)
–Function is Acetylesterase?
Where is the active site?
13
Example. Where is the active site?
1WAB Acetylhydrolase
1G66 Acetylxylan esterase
1USW Hydrolase
14
Example. Where is the active site?
Align sequence against structures of known acetylesterases, like 1WAB, 1FXW, …
Cannot be aligned. Too low sequence similarity
1K7C.A 1WAB._ RMSD 11.2397
QAL 1K7C.A  71 GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF
DAL 1WAB._ 160 GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY
15
Is it really impossible?
"Protein homology modeling is only possible if %id is greater than 30-50%"
WRONG!
16
Why %id is so bad!! 1200 models sharing 25-95% sequence identity with the submitted sequences (www.expasy.ch/swissmod)
17
Identification of correct fold
% ID is a poor measure
–Many evolutionarily related proteins share low sequence homology
–A short alignment of 5 amino acids can share 100% id, what does this mean?
Alignment score even worse
–Many sequences will score high against everything (hydrophobic stretches)
P-value or E-value more reliable
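A minimal sketch of why %id on its own says little: the same percentage can come from very different numbers of aligned residues. The function name and the convention of skipping gapped columns are assumptions made for this illustration, not something taken from the slides.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (%id, number of aligned residue pairs) for two gapped sequences."""
    # Assumption: columns containing a gap ("-") are left out of the count.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0, 0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs), len(pairs)


# A 5-residue hit is 100% identical, but it covers only 5 positions.
short_hit = percent_identity("AGDSL", "AGDSL")
# A gapped toy alignment: 3 identical pairs out of 4 aligned pairs.
gapped_hit = percent_identity("LAGDS", "I-GDS")
```

Reporting the number of aligned pairs next to the percentage makes it obvious when a high %id rests on a handful of residues.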
18
What are P and E values?
E-value
–Number of expected hits in database with score higher than match
–Depends on database size
P-value
–Probability that a random hit will have score higher than match
–Database size independent
Figure: P(Score) versus Score (Score 150)
10 hits with higher score (E=10)
10000 hits in database => P = 10/10000 = 0.001
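The arithmetic on this slide can be sketched as follows. This uses the simplified relation from the slide (P is the number of higher-scoring hits divided by the database size); the function names are hypothetical, and real search tools estimate these values from a fitted score distribution rather than by counting.

```python
def p_value(hits_with_higher_score, database_size):
    """Slide definition: fraction of database hits scoring above the match."""
    return hits_with_higher_score / database_size


def e_value(p, database_size):
    """Expected number of higher-scoring hits in a database of the given size."""
    return p * database_size


# Numbers from the slide: 10 higher-scoring hits among 10000 database hits.
p = p_value(10, 10000)
# The same P-value in a database ten times larger gives a ten times larger E-value.
e_in_larger_db = e_value(p, 100000)
```

The second call shows why the E-value depends on database size while the P-value does not.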
19
What goes wrong when Blast fails?
Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences
This scoring matrix is identical at all positions in the protein sequence!
EVVFIGDSLVQLMHQC X X AGDS.GGGDSAGDS.GGGDS
20
Blosum scoring matrix
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
21
Alignment accuracy. Scoring functions
Blosum62 score matrix. Fg=1. Ng=0?
     L   A   G   D   S   D
F    0  -2  -3  -3  -2  -3
I    2  -1  -4  -3  -2  -3
G   -4   0   6  -1   0  -1
D   -4  -2  -1   6   0   6
S   -2   1   0   0   4   0
L    4  -1  -4  -4  -2  -4
Alignment
LAGDS
I-GDS
Score = 2+6+6+4-1 = 17
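The score on this slide can be reproduced with a short sketch. It assumes that Fg is the penalty for the first gap position and Ng the penalty for each following gap position; the slides do not spell this out. Only the Blosum62 entries needed for this one alignment are included, and the function name is hypothetical.

```python
# Blosum62 entries for the residue pairs that occur in the slide's alignment.
BLOSUM62_SUBSET = {("L", "I"): 2, ("G", "G"): 6, ("D", "D"): 6, ("S", "S"): 4}


def alignment_score(seq_a, seq_b, first_gap=1, next_gap=0):
    """Sum substitution scores over aligned columns and subtract gap penalties."""
    score = 0
    in_gap = False
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            # Assumption: the first gap position costs first_gap, later ones next_gap.
            score -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            pair = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
            score += BLOSUM62_SUBSET[pair]  # raises KeyError for pairs not listed
            in_gap = False
    return score


score = alignment_score("LAGDS", "I-GDS")  # 2 - 1 + 6 + 6 + 4
```

Swapping in the full matrix would only change the lookup table, not the scoring loop.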
22
When Blast works! 1PLC._ 1PLB._
23
When Blast fails! 1PLC._ 1PMY._
24
Sequence profiles
In reality not all positions in a protein are equally likely to mutate
Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high
Other amino acids can mutate almost for free, and the penalty for a mismatch should be lower than in the BLOSUM matrix
Sequence profiles can capture these differences
25
Protein structure hierarchy
Protein world
Protein fold
Protein superfamily
Protein family
New Fold
26
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDGMERNTAGVP
Conserved: matching anything but G => large negative score
Non-conserved: anything can match
27
How to make sequence profiles
1.Align (BLAST) sequence against large sequence database (Swiss-Prot)
2.Select significant alignments and make profile (weight matrix) using techniques for sequence weighting and pseudo counts
3.Use weight matrix to align against sequence database to find new significant hits
4.Repeat 2 and 3 (normally 3 times!)
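The profile-building step can be sketched in a few lines. This is a deliberately simplified illustration, not the Psi-BLAST procedure: it assumes a uniform background frequency, adds a flat pseudo count to every amino acid, and leaves out sequence weighting altogether. The toy alignment and all names are made up for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)  # gaps are skipped
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / total
            row[aa] = 2.0 * math.log2(frequency / background)  # half-bit units
        pssm.append(row)
    return pssm


# Toy alignment: column 2 is fully conserved (G), column 1 is variable.
toy_alignment = ["AGDS", "VGES", "LGDT", "IGNS"]
pssm = build_pssm(toy_alignment)
```

In the result, G at the conserved column gets a high score and every other amino acid there a negative one, while the variable column gives mild positive scores to each residue it has seen.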
28
Protein world Blast iterations Protein
29
Blast2logo
31
Last position-specific scoring matrix computed
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
 2 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
 3 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
 4 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
 5 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 6 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
 7 Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 8 I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
 9 P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
10 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2

Blosum scoring matrix
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
32
Blast2logo
34
Last position-specific scoring matrix computed
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 V  -2 -4 -4 -5 -2 -4 -4 -5 -4  5  2 -4  0 -1 -4 -3 -2 -4 -2  4
 2 A   5  0 -3 -3 -3 -2  1 -2 -3  0 -3 -2 -2 -4  0  0 -2 -4 -3  0
 3 L  -4 -5 -6 -6 -4 -5 -5 -6 -5  5  4 -5  1 -2 -5 -5 -3 -4  0  1
 4 A   1 -4 -1 -1  3 -1  2 -4 -3  0 -1 -2 -3  1 -4  0  0 -4  2  2
 5 E  -2  0 -2  6 -6  0  4 -4  2 -5 -5 -2 -5 -6 -4 -2  0 -6 -4 -5
 6 L  -1 -2 -4 -4 -4 -2 -1  2  3  3  2 -1  0 -2 -5 -1 -1 -5 -3  1
 7 Y  -4 -5 -5 -6 -4 -5 -5 -4  0  1  4 -5 -1  3 -5 -5 -4 -3  5  3
 8 I  -1 -2 -5 -5 -4 -5 -2 -6 -5  4  3 -5 -1  3 -5 -4 -2 -4 -1  3
 9 P   3 -4 -4 -3 -4  1  1 -4 -2 -2 -3 -2 -4 -5  6 -1  0 -5 -5 -2
10 E   2 -2 -3 -2 -3  0  1 -1 -3 -4 -3 -1 -1 -4  6 -2 -2 -4 -4 -3
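A small sketch of how such a matrix is read: the score depends on the position in the query, not only on the pair of amino acids. Rows 3 and 5 are copied from the matrix above; the dictionary layout and the function name are assumptions for this illustration.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows 3 (query residue L) and 5 (query residue E) of the matrix above.
PSSM_ROWS = {
    3: [-4, -5, -6, -6, -4, -5, -5, -6, -5, 5, 4, -5, 1, -2, -5, -5, -3, -4, 0, 1],
    5: [-2, 0, -2, 6, -6, 0, 4, -4, 2, -5, -5, -2, -5, -6, -4, -2, 0, -6, -4, -5],
}


def pssm_score(position, amino_acid):
    """Score for aligning the given amino acid to this query position."""
    return PSSM_ROWS[position][AMINO_ACIDS.index(amino_acid)]


# At position 5 the query has E, yet D scores higher than E itself.
score_d = pssm_score(5, "D")
score_e = pssm_score(5, "E")
# At position 3 the query has L, yet I scores higher than L.
score_i = pssm_score(3, "I")
score_l = pssm_score(3, "L")
```

A fixed Blosum matrix gives E against D a score of 2 everywhere; the profile has learned from the aligned sequences that D is favored at this particular position.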
35
Example.
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV
VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL
FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL
GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
What is the function?
Where is the active site?
36
When Blast fails! 1K7C.A 1WAB._
37
Profile-profile scoring matrix 1K7C.A 1WAB._
38
Example. (SGNH active site)
39
Example. Where is the active site? Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195
40
Example. Where is the active site?
Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = 5.29522. 14% ID
1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
              S                                G                               N
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------

1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ ---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP

1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
                                       H
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L
41
Structural superposition Blue: 1K7C.A Red: 1WAB._
42
Where was the active site? Rhamnogalacturonan acetylesterase (1k7c)
43
Using Iterative Blast
45
Using Iterative Blast (1st iteration)
46
Using Iterative Blast (3rd iteration)
47
Including structure
Sequences within a protein superfamily share remote sequence homology, but they share high structural homology
Structure is known for template
Predict structural properties for query
–Secondary structure
–Surface exposure
Position specific gap penalties derived from secondary structure and surface exposure
48
Using structure
Sequence & structure profile-profile based alignments
–Template
  Sequence based profiles
  Annotated secondary structure
  Predicted secondary structure
–Query
  Sequence based profile
  Predicted secondary structure
–Position specific gap penalties derived from secondary structure
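One way to picture position specific gap penalties is sketched below. Every number here is hypothetical, as are the names; the slides do not give the actual values or formula. The only idea taken from the slides is that gaps should be more expensive inside secondary structure elements and in buried positions than in exposed loops.

```python
BASE_GAP_OPEN = 10.0  # hypothetical base penalty
# Hypothetical scaling: gaps inside helices (H) and strands (E) cost more than in coil (C).
SS_FACTOR = {"H": 1.5, "E": 1.5, "C": 0.5}


def gap_open_penalties(secondary_structure, exposure=None):
    """Return one gap-open penalty per position.

    secondary_structure: string of H/E/C states, one per residue.
    exposure: optional list of relative surface exposure values in [0, 1].
    """
    penalties = []
    for i, state in enumerate(secondary_structure):
        penalty = BASE_GAP_OPEN * SS_FACTOR[state]
        if exposure is not None:
            # Hypothetical rule: fully exposed residues get half the penalty.
            penalty *= 1.0 - 0.5 * exposure[i]
        penalties.append(penalty)
    return penalties


by_structure = gap_open_penalties("HHCCEE")
with_exposure = gap_open_penalties("HC", exposure=[0.0, 1.0])
```

An alignment routine would then look up the penalty for the position where a gap opens instead of using one global value.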
49
Handout exercise Using Psi-Blast Profiles
50
How good are we?
51
CpHModels-3.0 www.cbs.dtu.dk/services/CPHmodels-3.0/
52
CASP8 - Ranked as 15-20 best server
53
Why did we not win?
Multiple template modeling
First hit is not always the best
Loop modeling
...
54
What are the different methods?
Simple sequence based methods
–Align (BLAST) sequence against sequence of proteins with known structure (PDB database)
Sequence profile based methods
–Align sequence profile (Psi-BLAST) against sequence of proteins with known structure (PDB, FUGUE)
–Align sequence profile against profile of proteins with known structure (FFAS)
Sequence and structure based methods
–Align profile and predicted secondary structure against proteins with known structure (3D-PSSM, Phyre)
Sequence profiles and structure based methods
–HHpred, CpHModels
Multiple template methods
–Modeller (via HHpred, 3D-Jury)
55
Take home message
Identifying the correct fold is only a small step towards successful homology modeling
Do not trust % ID or alignment score to identify the fold. Use P-values
You can do reliable fold recognition AND homology modeling even for low sequence homology
Use sequence profiles and local protein structure to align sequences
56
CASP. Which are the best methods?
Critical Assessment of Structure Prediction
Every second year
Sequences from about-to-be-solved structures are given to groups who submit their predictions before the structure is published
Modelers make predictions
Meeting in December where correct answers are revealed
57
CASP6 results
58
The top 4 homology modeling groups in CASP6
All winners use consensus predictions – The wisdom of the crowd
Same approach as in earlier CASPs
59
The Wisdom of the Crowds
The Wisdom of Crowds. Why the Many are Smarter than the Few. James Surowiecki
One day in the fall of 1906, the British scientist Francis Galton left his home and headed for a country fair… He believed that only a very few people had the characteristics necessary to keep societies healthy. He had devoted much of his career to measuring those characteristics, in fact, in order to prove that the vast majority of people did not have them. … Galton came across a weight-judging competition… Eight hundred people tried their luck. They were a diverse lot, butchers, farmers, clerks and many other non-experts… The crowd had guessed … 1,197 pounds, the ox weighed 1,198
60
The wisdom of the crowd!
–The highest scoring hit will often be wrong
–Not one single prediction method is consistently best
–Many prediction methods will have the correct fold among the top 10-20 hits
–If many different prediction methods all have a common fold among the top hits, this fold is probably correct
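The consensus idea can be sketched as a simple vote over the top hits of several methods. The method names and the ranked lists below are made up for illustration, and a plain count is used here in place of any real meta-server scoring scheme.

```python
from collections import Counter


def consensus_fold(top_hits_by_method, top_n=10):
    """Count, for each fold, how many methods list it among their top_n hits."""
    votes = Counter()
    for hits in top_hits_by_method.values():
        votes.update(set(hits[:top_n]))  # one vote per method per fold
    return votes.most_common()


# Hypothetical ranked fold predictions from three methods.
predictions = {
    "method_A": ["TIM barrel", "SGNH hydrolase", "Rossmann fold"],
    "method_B": ["SGNH hydrolase", "Ferredoxin-like"],
    "method_C": ["Rossmann fold", "SGNH hydrolase"],
}
ranking = consensus_fold(predictions)
best_fold, support = ranking[0]
```

In this toy case only one of the three methods ranks the consensus fold first, yet it is the only fold that all three have among their top hits.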
61
3D-Jury
Inspired by Ab initio modeling methods
–Average of frequently obtained low energy structures is often closer to the native structure than the lowest energy structure
Find most abundant high scoring model in a list of predictions from several predictors
1.Use output from a set of servers
2.Superimpose all pairs of structures
3.Similarity score S_ij = # of Cα pairs within 3.5Å (if # > 40; else S_ij = 0)
4.3D-Jury score = Σ_ij S_ij / (N+1)
Similar methods developed by A Elofsson (Pcons) and D Fischer (3D shotgun)
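Steps 3 and 4 can be sketched as follows, under one reading of the slide's formula: each model's score is the sum of its thresholded similarities to all other models, divided by N+1, where N is the number of other models it was compared with. The structural superposition itself is not shown; the pair counts are hypothetical input, and the function name is an assumption.

```python
def jury_scores(pair_counts, threshold=40):
    """Score every model from a symmetric matrix of C-alpha pair counts.

    pair_counts[a][b] is the number of C-alpha pairs within 3.5 A after
    superimposing models a and b (assumed to be computed elsewhere).
    """
    n_models = len(pair_counts)
    scores = []
    for a in range(n_models):
        total = 0
        for b in range(n_models):
            if a == b:
                continue
            count = pair_counts[a][b]
            total += count if count > threshold else 0  # S_ab = 0 at or below threshold
        n_compared = n_models - 1
        scores.append(total / (n_compared + 1))
    return scores


# Hypothetical pair counts for four models; model 3 resembles none of the others.
pair_counts = [
    [0, 90, 85, 20],
    [90, 0, 80, 30],
    [85, 80, 0, 10],
    [20, 30, 10, 0],
]
scores = jury_scores(pair_counts)
best_model = scores.index(max(scores))
```

The model that sits closest to the bulk of the other predictions gets the highest score, regardless of how its own server ranked it.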
62
How to do it? Where is the crowd?
Meta prediction server
–Web interface to a list of public protein structure prediction servers
–Submit query sequence to all selected servers in one go
http://bioinfo.pl/meta/
65
Meta Server Evaluating the crowd.
66
Meta Server Evaluating the crowd. 3D Jury
67
Take home message
Identifying the correct fold is only a small step towards successful homology modeling
Do not trust % ID or alignment score to identify the fold. Use P-values
Use sequence profiles and local protein structure to align sequences
Do not trust one single prediction method, use consensus methods (3D-Jury)