1
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK DTU
Protein Fold Recognition
Morten Nielsen, CBS, BioCentrum, DTU
2
Outline
Why model protein structure
Classification of protein structures
  – Fold, Superfamily, Family
Protein homology modeling
  – Template (fold) recognition
  – Alignment
  – Side chain modeling
  – Loop modeling
Reliability measures
  – %id bad, P-value good
Historical overview
  – BLAST (simple alignment)
  – PSI-BLAST (profiles)
  – Profile-profile alignment
  – Structural features
  – Recombinant or democratic homology modeling
Best methods
3
Why protein modeling?
Determining a protein structure experimentally is laborious and costly
The gap between the amount of protein sequence data and the amount of protein structure data is large and increasing
Close to 50% of all new sequences can be homology modeled
4
Swiss-Prot database
5
PDB New Fold Growth
The number of unique folds in nature is fairly small (possibly a few thousand)
90% of the new structures submitted to the PDB in the past three years have folds similar to ones already in the PDB
[Figure: new PDB structures, divided into new folds and old folds]
6
Protein classification
The number of protein sequences grows exponentially
The number of solved structures grows exponentially
The number of new folds identified is very small (and close to constant)
Protein classification can
  – Generate an overview of structure types
  – Detect similarities (evolutionary relationships) between protein sequences
7
Protein structure classification
[Diagram labels: Protein world, Protein fold, Protein superfamily, Protein family, New Fold]
8
Classification schemes
SCOP
  – Manual classification (A. Murzin)
CATH
  – Semi-manual classification (C. Orengo)
FSSP
  – Automatic classification (L. Holm)
9
Levels in SCOP
Class                                # Folds   # Superfamilies   # Families
All alpha proteins                       202               342          550
All beta proteins                        141               280          529
Alpha and beta proteins (a/b)            130               213          593
Alpha and beta proteins (a+b)            260               386          650
Multi-domain proteins                     40                40           55
Membrane and cell surface proteins        42                82           91
Small proteins                            72               104          162
Total                                    887              1447         2630
Source: http://scop.berkeley.edu/count.html#scop-1.67
10
Major classes in SCOP
Classes
  – All alpha proteins
  – All beta proteins
  – Alpha and beta proteins (a/b)
  – Alpha and beta proteins (a+b)
  – Multi-domain proteins
  – Membrane and cell surface proteins
  – Small proteins
11
All α: Hemoglobin (1bab)
12
All β: Immunoglobulin (8fab)
13
α/β: Triosephosphate isomerase (1hti)
14
α+β: Lysozyme (1jsf)
15
Families
Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
Families are further subdivided into Proteins
Proteins are divided into Species
  – The same protein may be found in several species
16
Superfamilies
Proteins which are (remotely) evolutionarily related
  – Sequence similarity low
  – Share function
  – Share special structural features
Relationships between members of a superfamily may not be readily recognizable from the sequence alone
17
Folds*
Proteins which have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold
No evolutionary relation between proteins
*Confusingly, also called fold classes
18
Links
PDB (protein structure database)
  – www.rcsb.org/pdb/
SCOP (protein classification database)
  – scop.berkeley.edu
CATH (protein classification database)
  – www.biochem.ucl.ac.uk/bsm/cath
FSSP (protein classification database)
  – www.ebi.ac.uk/dali/fssp/fssp.html
19
Superfamilies
Proteins which are (remotely) evolutionarily related
  – Sequence similarity low
  – Share function
  – Share special structural features
Relationships between members of a superfamily may not be readily recognizable from the sequence alone
[Diagram labels: Fold, Superfamily, Family, Proteins]
20
Model accuracy
Swiss-Model: 1200 models sharing 25-95% sequence identity with the submitted sequences (www.expasy.ch/swissmod)
21
Identification of fold
If sequence similarity is high, proteins share structure (safe zone)
If sequence similarity is low, proteins may share structure (twilight zone)
Most proteins do not have a partner with high sequence similarity
Rajesh Nair & Burkhard Rost, Protein Science, 2002, 11, 2836-47
22
Identification of correct fold
% ID is a poor measure
  – Many evolutionarily related proteins share low sequence similarity
Alignment score is even worse
  – Many sequences will score high against everything (hydrophobic stretches)
P-value or E-value is more reliable
23
P and E values
E-value
  – Expected number of hits in the database with a score higher than the match
  – Depends on database size
P-value
  – Probability that a random hit will have a score higher than the match
  – Independent of database size
[Figure: score distribution, P(Score) against Score, with the match score of 150 marked]
Example: 10 hits with a higher score (E = 10) and 10000 hits in the database => P = 10/10000 = 0.001
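
The arithmetic in the example above can be written as a small Python sketch. It assumes the empirical reading used on the slide: E is a count of database hits scoring above the match, and P is that count divided by the number of hits. The function name `e_and_p_value` and the toy score list are illustrative and not part of any BLAST tool; real search programs estimate these values from a fitted score distribution rather than by counting.

```python
def e_and_p_value(database_scores, match_score):
    """Empirical E- and P-value of a match, following the slide's definitions.

    E = number of database hits scoring higher than the match.
    P = E divided by the number of hits in the database.
    """
    e_value = sum(1 for score in database_scores if score > match_score)
    p_value = e_value / len(database_scores)
    return e_value, p_value


# Toy database (assumption): 10 hits score above 150, 9990 score below it.
scores = [200] * 10 + [100] * 9990
e, p = e_and_p_value(scores, match_score=150)
print(e, p)
```

Doubling the database while keeping the same fraction of high-scoring hits doubles E but leaves P unchanged, which is the size dependence stated on the slide.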
24
Protein homology modeling
Identify fold (template) for modeling
  – Find the structure in the PDB database that most resembles the unknown structure
  – Can be used to predict function
Align protein sequence to template
  – Simple alignment methods
  – Sequence profiles
  – Threading methods
  – Pseudo force fields
Model side chains and loops
25
Template identification
Simple sequence based methods
  – Align (BLAST) sequence against sequences of proteins with known structure (PDB database)
Sequence profile based methods
  – Align sequence profile (PSI-BLAST) against sequences of proteins with known structure (PDB)
  – Align sequence profile against profiles of proteins with known structure (FFAS)
Sequence and structure based methods
  – Align profile and predicted secondary structure against proteins with known structure (3D-PSSM)
26
Template identification
Threading methods
  – Align sequence against the structural environment of proteins with known structure
Use biological information
  – Functional annotation in databases
  – Active sites
27
Sequence profiles
In conventional alignment, a scoring matrix (BLOSUM62) gives the score for matching two amino acids
  – In reality, not all positions in a protein are equally likely to mutate
  – Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high
  – Other amino acids can mutate almost for free, and the penalty for a mismatch is lower than the BLOSUM penalty
Sequence profiles can capture this
28
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
Conserved position: matching anything but G => large negative score
Non-conserved position: anything can match
29
Sequence profiles
1. Align (BLAST) sequence against a large sequence database (Swiss-Prot)
2. Select significant alignments and make a profile (weight matrix) using techniques for sequence weighting and pseudo counts (see lecture on HMMs)
3. Use the weight matrix to align against the sequence database to find new significant hits
4. Repeat 2 and 3 until a stop criterion is met
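
Step 2 above, turning a set of aligned sequences into a weight matrix, can be illustrated with a minimal Python sketch. It makes several simplifying assumptions that are not taken from the slides: a uniform background frequency of 1/20, a flat pseudo count added to every amino acid, no sequence weighting, and gaps simply ignored. The function name `build_profile` is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping amino acid -> log-odds score.

    Simplifications (assumptions): uniform background, flat pseudo counts,
    no sequence weighting, gap characters are skipped.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for pos in range(len(alignment[0])):
        # Count the amino acids observed in this column.
        counts = Counter(seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        profile.append(column)
    return profile


# Toy alignment (assumption): column 0 is fully conserved, column 1 is variable.
profile = build_profile(["GA", "GC", "GD"])
print(round(profile[0]["G"], 2), round(profile[0]["A"], 2), round(profile[1]["A"], 2))
```

In the conserved column, G gets a clearly positive score and every other amino acid a negative one, while in the variable column the scores are much flatter. This is the position-specific behaviour that a single substitution matrix cannot express.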
30
PDB-BLAST
Procedure
1. Build a sequence profile by iterative PSI-BLAST search against a sequence database
2. Use the profile to search a database of proteins with known structure (PDB)
31
Transitive BLAST
Procedure
1. Find homologues to the query (your) sequence
2. Find homologues to these homologues
3. Etc.
  – Can be implemented with e.g. BLAST or PSI-BLAST
Also known as Intermediate Sequence Search (ISS)
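
The transitive procedure is a breadth-first expansion over "is a homologue of" links, which the following Python sketch illustrates. The callable `find_homologues` is a hypothetical stand-in for one BLAST or PSI-BLAST search returning the identifiers of significant hits; the `max_rounds` limit and the toy hit table are likewise assumptions made for the example.

```python
from collections import deque


def transitive_search(query, find_homologues, max_rounds=3):
    """Collect sequences reachable from the query through chains of homologues.

    find_homologues(seq_id) is assumed to return the significant hits of one
    search started from seq_id. max_rounds limits the length of the chain.
    """
    seen = {query}
    frontier = deque([(query, 0)])
    while frontier:
        seq_id, depth = frontier.popleft()
        if depth == max_rounds:
            continue  # do not search further from sequences found in the last round
        for hit in find_homologues(seq_id):
            if hit not in seen:
                seen.add(hit)
                frontier.append((hit, depth + 1))
    seen.discard(query)
    return seen


# Toy hit table (assumption): the query Q only reaches T through A and B.
toy_hits = {"Q": ["A"], "A": ["B"], "B": ["T"], "T": []}
print(sorted(transitive_search("Q", toy_hits.__getitem__, max_rounds=3)))
```

A direct search from Q finds only A in this toy table; the remote relative T appears only after two intermediate sequences have been used as new queries.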
32
Example: sequence profiles
Alignment of protein sequences 1PLC._ and 1GYC.A
  – E-value > 1000
Profile alignment
  – Align 1PLC._ against Swiss-Prot
  – Make a position specific weight matrix from the alignment
  – Use this matrix to align 1PLC._ against 1GYC.A
  – E-value = 9e-22, Rmsd = 3.3 Å
33
Sequence profiles
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)
1PLC._:  3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
           F + G++ N+ + +G + +
1GYC.A: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79
1PLC._: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
           A G +F G + ++ G+ G V
1GYC.A: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
Rmsd = 3.3 Å
[Figure: superimposed structures, model in red, template in blue]
34
Including structure
Sequences within a protein superfamily share only remote sequence similarity, but high structural similarity
Structure is known for the template
Predict structural properties for the query
  – Secondary structure
  – Surface exposure
35
Using structure
Sequence and structure profile-profile based alignments
  – Template profiles
      Multiple structure alignments
      Sequence based profiles
  – Query profile
      Sequence based profile
      Predicted secondary structure
  – Position specific gap penalties derived from secondary structure
36
Structure biased alignment (3D-PSSM)
http://www.sbg.bio.ic.ac.uk/~3dpssm/
37
Threading
[Figure: sequence ..A T N L Y K E T L.. threaded onto numbered template positions 1-10, with deletions and an insertion marked]
Alignment score from structural fitness (pair potential)
How well does K fit the environment at P6?
If P8 is acidic then fine, if P8 is basic then poor
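
The K-at-P6 question can be made concrete with a deliberately crude Python sketch of a pair potential. Everything in it is an assumption for illustration: the potential only looks at formal charge, a higher score means a better fit, and the contact list stands in for the spatial neighbours defined by the template structure. Real threading potentials are derived statistically from known structures and cover all residue pairs and distances.

```python
# Formal side-chain charges (assumption: only these four residues are charged).
CHARGE = {"K": +1, "R": +1, "D": -1, "E": -1}


def pair_score(res_a, res_b):
    """Toy pair potential: +1 for opposite charges, -1 for like charges, else 0."""
    return -(CHARGE.get(res_a, 0) * CHARGE.get(res_b, 0))


def threading_score(assignment, contacts):
    """Sum the pair scores over all template positions that are in contact.

    assignment maps template position -> residue placed there by the alignment.
    contacts lists pairs of template positions that are close in space.
    """
    return sum(pair_score(assignment[i], assignment[j]) for i, j in contacts)


contacts = [(6, 8)]  # hypothetical: P6 and P8 are neighbours in the template
print(threading_score({6: "K", 8: "E"}, contacts))  # acidic residue at P8
print(threading_score({6: "K", 8: "R"}, contacts))  # basic residue at P8
```

Shifting the alignment changes which residues land on P6 and P8, so the same template yields different scores for different alignments, and the alignment search looks for the placement with the best total.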
38
Threading
Threading does not work
  – The average protein does not exist
Threading can be used in combination with sequence profiles and local structural features to improve alignment
39
CASP
  – Critical Assessment of protein Structure Prediction
  – Held every second year
  – Sequences from about-to-be-solved structures are given to groups, who submit their predictions before the structure is published
  – Modelers make predictions
  – Meeting in December where the correct answers are revealed
40
CASP5 overview
41
Successful fold recognition groups at CASP5
3D-Jury (Leszek Rychlewski)
3D-CAM (Krzysztof Ginalski)
Template recombination (Paul Bates)
HMAP (Barry Honig)
PROSPECT (Ying Xu)
ATOME (Gilles Labesse)
42
Democratic homology modeling
Let the silent majority rule
  – The highest-scoring hit will often be wrong
  – Many prediction methods will have the correct fold among the top 10-20 hits
  – If many different prediction methods all have the same fold among their top hits, this fold is probably correct
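
The majority rule described above amounts to counting, for each fold, how many methods list it among their top hits. The Python sketch below shows this under stated assumptions: each method's output is a ranked list of fold identifiers, every method gets one vote per fold, and the method names, fold labels and the `consensus_fold` helper are all hypothetical.

```python
from collections import Counter


def consensus_fold(predictions, top_n=10):
    """Return the fold listed in the top_n hits of the most methods.

    predictions maps a method name to its ranked list of fold identifiers.
    Each method votes at most once per fold, regardless of rank.
    """
    votes = Counter()
    for ranked_folds in predictions.values():
        votes.update(set(ranked_folds[:top_n]))
    fold, count = votes.most_common(1)[0]
    return fold, count


# Toy rankings (assumption): no method ranks fold "b" first, but all list it.
predictions = {
    "method_1": ["a", "b", "c"],
    "method_2": ["d", "b", "e"],
    "method_3": ["f", "g", "b"],
}
print(consensus_fold(predictions))
```

In the toy input every method has a different top hit, yet fold "b" is the only one they agree on, so it wins the vote.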
43
3D-Jury (Rychlewski)
Inspired by ab initio modeling methods
  – The average of frequently obtained low energy structures is often closer to the native structure than the lowest energy structure
Find the most abundant high scoring model in a list of predictions from several predictors
1. Use output from a set of servers
2. Superimpose all pairs of structures
3. Similarity score $S_{ij}$ = number of C$\alpha$ pairs within 3.5 Å (if the number is > 40; else $S_{ij} = 0$)
4. 3D-Jury score $= \sum_{ij} S_{ij} / (N+1)$
Similar methods developed by A. Elofsson (Pcons) and D. Fischer (3D-SHOTGUN)
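
Steps 3 and 4 can be sketched in Python under explicit assumptions, none of which come from the 3D-Jury implementation itself. The models are given as lists of Cα coordinates that have already been superimposed (step 2 is skipped), residues are paired by index, the score of one model is read as the sum of its similarities to all other models divided by N + 1, and N is taken to be the number of other models. The function names are hypothetical.

```python
import math


def similarity(model_a, model_b, cutoff=3.5, min_pairs=40):
    """Number of index-matched C-alpha pairs within cutoff (in Å), or 0 if not above min_pairs."""
    close = sum(1 for a, b in zip(model_a, model_b) if math.dist(a, b) <= cutoff)
    return close if close > min_pairs else 0


def jury_scores(models):
    """Score each model by its summed similarity to all other models, divided by N + 1."""
    scores = []
    for i, model_i in enumerate(models):
        others = [m for j, m in enumerate(models) if j != i]
        total = sum(similarity(model_i, m) for m in others)
        scores.append(total / (len(others) + 1))
    return scores


# Toy models (assumption): 50 residues on a line; two agree, one is far away.
base = [(3.8 * i, 0.0, 0.0) for i in range(50)]
near = [(x, y + 1.0, z) for x, y, z in base]
far = [(x, y + 100.0, z) for x, y, z in base]
print(jury_scores([base, near, far]))
```

The two models that agree with each other support one another and receive the same non-zero score, while the outlier scores zero, which is the "most abundant model wins" behaviour described on the slide.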
44
LiveBench
The LiveBench project is a continuous benchmarking program. Every week, sequences of newly released PDB proteins are submitted to participating fold recognition servers. The results are collected and continuously evaluated using automated model assessment programs. A summary of the results is produced after several months of data collection. To participate, the servers must delay the updating of their structural template libraries by one week.
45
Meta prediction server
Web interface to a list of public protein structure prediction servers
Submit the query sequence to all selected servers in one go
http://bioinfo.pl/meta/
46
Meta Server
47
Meta Server
48
3D Jury
49
188 targets in total
Threshold for 5 false positives: 50 for 3D Jury
50
Links to fold recognition servers
Databases of links
  – http://bioinfo.pl/meta/servers.html
  – http://mmtsb.scripps.edu/cgi-bin/renderrelres?protmodel
Meta server
  – http://bioinfo.pl/meta/
3DPSSM (good graphical output)
  – http://www.sbg.bio.ic.ac.uk/servers/3dpssm/
GenTHREADER
  – http://bioinf.cs.ucl.ac.uk/psipred/
FUGUE2
  – http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
SAM
  – http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html
FOLD
  – http://fold.doe-mbi.ucla.edu/
FFAS/PDBBLAST
  – http://bioinformatics.burnham-inst.org/
51
From fold to structure
Flying to the moon has not made man conquer space
Finding the right fold does not allow you to make accurate protein models
  – Can allow prediction of protein function
Alignment is still a very hard problem
  – Most protein interactions are determined by the loops, and they are the least conserved parts of a protein structure
52
Ab initio protein modeling
Modeling of new-fold proteins
Only when everything else fails
Challenge
  – Close to impossible to model Nature's folding potential
Example
53
Challenge
New folds are in general constructed from a set of subunits, where each subunit is a part of a known fold.
The subunits are small compared to the overall fold of the protein.
No objective function exists to guide the global packing of the subunits.
[Figure labels: Folding potential, d_ij = 6 Å; Objective function, s_ij = 120 aa]
54
A way to solution
Glue the structure together piecewise from fragments.
Guide the process by an empirical potential (potential of mean force).
[Figure labels: Fragments with correct local structure; Nature's potential; Empirical potential]
55
Examples (Rosetta web server)
www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php
[Figure labels: Rosetta prediction; Homology modeling]
56
Take home message
Identifying the correct fold is only a small step towards successful homology modeling
Do not trust % ID or alignment score to identify the fold. Use P-values
Use sequence profiles and local protein structure to align sequences
Do not trust one single prediction method; use consensus methods (3D Jury)
Only if everything else fails, use ab initio methods