1 Protein structure Prediction. 2 Copyright notice Many of the images in this power point presentation are from Bioinformatics and Functional Genomics.

1 Protein structure Prediction

2 Copyright notice Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc. Many slides of this presentation are from slides of Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!

3 Levels of Protein Structure

4 Why is protein structure prediction needed?
- 3D structure determination (by X-ray crystallography or NMR) is expensive, slow and difficult
- Prediction assists in the engineering of new proteins

5 Approaches to predicting protein structures
- Ab initio
  – Use just first principles: energy, geometry, and kinematics
- Homology / comparative modeling
  – Find the best match to a database of sequences with known 3D structure
- Threading
- Combinations

6 Protein Data Bank (PDB) http://www.pdb.org
Database of templates:
- Separate into single chains
- Remove bad structures (models)
- Create BLAST database
Comparative modeling workflow: target sequence + known structures (templates) → template(s) selection → sequence alignment → structure modeling → structure evaluation → final structural models

7 Comparative Modeling: Template(s) selection
- Sequence similarity / fold recognition
- Structure quality (resolution, experimental method)
- Experimental conditions (ligands and cofactors)

8 Comparative Modeling: Sequence Alignment
- Key step in homology modeling
- Global alignment is required
- A small error in the alignment can lead to a big error in the model
- Multiple alignments are better than pairwise alignments

9 Comparative Modeling: Structure Modeling
- Template-based fragment assembly (SWISS-MODEL)
- Satisfaction of spatial restraints: MODELLER

10 Comparative Modeling: Structure Evaluation
- Errors in template selection or alignment result in bad models
- Iterative cycles of alignment, modeling and evaluation

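11 Measuring Protein Structure Similarity
- We need ways to determine whether two protein structures are related and to compare predicted models with experimental structures
- A commonly used measure is the root mean square deviation (RMSD) of the Cartesian coordinates of corresponding atoms in two structures after optimal superposition (McLachlan, 1979): $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert \mathbf{x}_i - \mathbf{y}_i \rVert^2}$, where $\mathbf{x}_i$ and $\mathbf{y}_i$ are the superposed coordinates of the $i$-th of $N$ atom pairs
- Usually only the Cα atoms are used
- [Figure: NK-lysin (1nkl), bacteriocin T102/as48 (1e68) and the T102 best model; RMSD values shown: 3.6 Å and 2.9 Å]
- Other measures include contact maps and torsion angle RMSDs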
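The RMSD after optimal superposition can be sketched in a few lines of plain Python. This is a minimal illustration, not the procedure used for the figures on the slide: it assumes the two structures are already given as equal-length lists of matched (x, y, z) coordinates (for example Cα atoms), and it finds the best rotation through the largest eigenvalue of Horn's 4x4 quaternion matrix using simple power iteration. The function name `superposed_rmsd` and the fixed start vector are choices made for this example.

```python
import math

def superposed_rmsd(a, b, iterations=500):
    """RMSD between two equal-length lists of (x, y, z) points after
    optimal rigid-body superposition (translation plus rotation)."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("need two non-empty coordinate sets of equal length")
    # Move both sets so that their centroids sit at the origin.
    ca = [sum(p[k] for p in a) / n for k in range(3)]
    cb = [sum(p[k] for p in b) / n for k in range(3)]
    x = [[p[k] - ca[k] for k in range(3)] for p in a]
    y = [[p[k] - cb[k] for k in range(3)] for p in b]
    # s[i][j] = sum over atoms of x_i * y_j; g = total squared norm of both sets.
    s = [[sum(x[m][i] * y[m][j] for m in range(n)) for j in range(3)]
         for i in range(3)]
    g = sum(v * v for p in x + y for v in p)
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Horn's symmetric 4x4 matrix: its largest eigenvalue equals the best
    # achievable overlap sum(y . R x) over all rotations R.
    k = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Shift by the Frobenius norm so that every eigenvalue is non-negative,
    # then use plain power iteration to find the largest one.
    shift = math.sqrt(sum(v * v for row in k for v in row))
    if shift == 0.0:
        return math.sqrt(g / n)
    q = [1.0, 0.7, 0.4, 0.2]  # arbitrary start vector (assumption of this sketch)
    for _ in range(iterations):
        q = [sum(k[i][j] * q[j] for j in range(4)) + shift * q[i]
             for i in range(4)]
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    lam = sum(q[i] * k[i][j] * q[j] for i in range(4) for j in range(4))
    # Minimum summed squared distance is g - 2 * lam.
    return math.sqrt(max(0.0, g - 2.0 * lam) / n)
```

A structure compared with a rotated and translated copy of itself should give an RMSD close to zero, which is a quick sanity check before using such a function on real models.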

12 Comparative modeling In general, accuracy of structure prediction depends on the percent amino acid identity shared between target and template. For >50% identity, RMSD is often only 1 Å.

13 Comparative modeling Many web servers offer comparative modeling services. Examples are:
- SWISS-MODEL (ExPASy)
- PredictProtein server (Columbia)
- WHAT IF (CMBI, Netherlands)

14 Ab Initio Methods Ab initio: “From the beginning”.
Assumption 1: All the information about the structure of a protein is contained in its sequence of amino acids.
Assumption 2: The structure that a (globular) protein folds into is the structure with the lowest free energy.
Finding native-like conformations requires:
- A scoring function (potential).
- A search strategy.

15 Ab initio protein structure prediction Ab initio prediction can be performed when a protein has no detectable homologs. Protein folding is modeled based on global free-energy minimum estimates.

16 Ab initio Prediction
- Sampling the global conformation space
  – Lattice models / discrete-state models
  – Molecular dynamics
- Picking native conformations with an energy function
  – Solution model: how the protein interacts with water
  – Pair interactions between amino acids
- Predicting secondary structure
  – Local homology
  – Fragment libraries

17 ROSETTA ROSETTA is mainly an ab initio structure prediction algorithm, although various parts of it can be used for other purposes as well (such as homology modeling).
Rationale:
  – Local structures often fold independently of the full protein
  – Can predict large areas of a protein by matching sequence to I-Sites
David Baker

18 Ab initio Prediction – ROSETTA
1. PSI-BLAST – homology search
   – Discard sequences with >25% homology
2. PHD
   – For each 3-long and each 9-long sequence fragment, get 25 structure fragments that match “well”
3. Markov-Chain Monte Carlo method
   – Insert and remove iteratively one short structure fragment at a time
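The Monte Carlo step can be illustrated with a toy Metropolis loop that repeatedly overwrites a short window of backbone torsion angles with a fragment from a library. This is only a sketch of the general idea, not ROSETTA's implementation: the conformation is reduced to a list of (phi, psi) pairs, the fragment library is a single shared list rather than 25 position-specific fragments, and `toy_energy` is a hypothetical stand-in for a real scoring function.

```python
import math
import random

def fragment_monte_carlo(torsions, fragments, energy, steps=2000, kT=1.0, seed=0):
    """Metropolis search that replaces one window of (phi, psi) angles at a
    time with a library fragment; returns the best conformation seen."""
    rng = random.Random(seed)
    current = list(torsions)
    e_current = energy(current)
    best, e_best = list(current), e_current
    for _ in range(steps):
        frag = rng.choice(fragments)
        pos = rng.randrange(len(current) - len(frag) + 1)
        trial = current[:pos] + list(frag) + current[pos + len(frag):]
        e_trial = energy(trial)
        # Metropolis criterion: always accept improvements, sometimes accept
        # a worse conformation so the search can leave local minima.
        if e_trial <= e_current or rng.random() < math.exp((e_current - e_trial) / kT):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best

def toy_energy(conformation):
    """Hypothetical energy: squared deviation from helical (phi, psi) angles."""
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi in conformation)

# Hypothetical library with 3-residue and 9-residue fragments.
LIBRARY = [
    [(-60.0, -45.0)] * 3,
    [(-120.0, 130.0)] * 3,
    [(-60.0, -45.0)] * 9,
]
start = [(-120.0, 130.0)] * 20
best, e_best = fragment_monte_carlo(start, LIBRARY, toy_energy)
```

Tracking the best conformation separately from the current one means the returned energy can never be worse than the starting energy, even though individual accepted moves may go uphill.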

19 Ab initio Prediction

20 Protein Threading
- The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB
- Energy function: knowledge (or statistics) based rather than physics based
  – Should be able to distinguish correct structural folds from incorrect structural folds
  – Should be able to distinguish correct sequence-fold alignments from incorrect sequence-fold alignments
[Figure: target sequence MTYKLILN …. NGVDGEWTYTE threaded onto a fold]

21 Threading
- Threading is in between homology-based prediction and molecular modeling
- Main difference between homology-based prediction and threading: threading uses the structure to compute the energy function during alignment
[Figure: target sequence MTYKLILN …. NGVDGEWTYTE]

22 Threading – Overview
- Build a structural template database
- Define a sequence–structure energy function
- Apply a threading algorithm to the query sequence
- Perform local refinement of secondary structure
- Report the best resulting structural model

23 Threading – Template Database
- FSSP, SCOP, CATH
- Remove pairs of proteins with highly similar structures
  – Efficiency
  – Statistical skew in favor of large families

24 Threading – Energy Function
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
- $E_m$: how often a residue mutates to the template residue
- $E_s$: how well a residue fits a structural environment
- $E_p$: how preferable it is to put two particular residues nearby
- $E_g$: alignment gap penalty
- $E_{ss}$: compatibility with local secondary structure prediction
- Total energy: $E = w_m E_m + w_s E_s + w_p E_p + w_g E_g + w_{ss} E_{ss}$

25 Protein Threading -- algorithm
- Threading algorithm: find a sequence-structure alignment with the minimum energy
  – considering only singleton energy and gap penalty
  – considering all three energy terms
[Figure labels: sequence, fold, links]
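When only the singleton energy and the gap penalty are considered, the minimum-energy alignment can be found with ordinary dynamic programming, because each aligned pair contributes independently. The sketch below assumes a template described as a string of per-position environments ("B" for buried, "E" for exposed) and a hypothetical `toy_singleton_energy`; neither is taken from PROSPECT or any other real threading program.

```python
HYDROPHOBIC = set("AVLIMFWC")

def toy_singleton_energy(residue, environment):
    """Hypothetical fitness: hydrophobic residues fit buried ('B') positions,
    all other residues fit exposed ('E') positions."""
    fits = (residue in HYDROPHOBIC) == (environment == "B")
    return -1.0 if fits else 1.0

def thread(seq, environments, singleton_energy, gap_penalty=0.5):
    """Global sequence-to-template alignment minimizing singleton energy plus
    a linear gap penalty. Returns (energy, [(seq_index, template_index), ...])."""
    n, m = len(seq), len(environments)
    # dp[i][j]: minimum energy for aligning seq[:i] with environments[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0], move[i][0] = i * gap_penalty, "skip_residue"
    for j in range(1, m + 1):
        dp[0][j], move[0][j] = j * gap_penalty, "skip_position"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (dp[i - 1][j - 1] + singleton_energy(seq[i - 1], environments[j - 1]), "align"),
                (dp[i - 1][j] + gap_penalty, "skip_residue"),
                (dp[i][j - 1] + gap_penalty, "skip_position"),
            ]
            dp[i][j], move[i][j] = min(options, key=lambda option: option[0])
    # Trace back from the corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "align":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_residue":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs
```

Once the pairwise term is included, the energy of one aligned pair depends on where other residues are placed, so this simple recurrence no longer applies directly.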

26 Protein Threading -- algorithm
- Iterative procedures, e.g. repeated 3D-profile alignment
- Double dynamic programming
- Integer programming

27 Assessing Prediction Reliability
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
[Figure: four candidate folds with Score = -1500, Score = -900, Score = -1120 and Score = -720]
Which one, if any, is the correct structural fold for the target sequence? The one with the highest score?

28 Assessing Prediction Reliability
Template #1: AATTAATACATTAATATAATAAAATTACTGA
Query sequence: AAAA
Template #2: CGGTAGTACGTAGTGTTTAGTAGCTATGAA
Better template? Which of these two sequences has a better chance of matching the query sequence well after being randomly reshuffled?

29 Assessing Prediction Reliability
- Different template structures may have different background scores, making direct comparison of threading scores against different templates invalid
- Threading results should be compared by how far each score stands out from its own background score distribution, rather than by the raw threading scores

30 Assessing Prediction Reliability
- Threading 100,000 sequences against a template structure provides baseline information about the background scores of the template
- By locating where the threading score of a particular query sequence falls in this distribution, one can decide how significant the score, and hence the threading result, is
[Figure: background score distribution with "not significant" and "significant" regions; significance is reported as an E-value]
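The idea of judging a score against its background can be sketched as follows. This is an illustrative approximation, not the E-value calculation of any particular threading program: the background is built by shuffling the query itself (the slides describe threading a large set of other sequences), lower scores are assumed to be better, and `toy_score` is a hypothetical scoring function for a template with four buried and four exposed positions.

```python
import random
import statistics

def background_significance(query, score_fn, n_shuffles=1000, seed=0):
    """Compare the query's score with scores of shuffled versions of itself.
    Returns (z_score, empirical_p), assuming lower scores are better."""
    rng = random.Random(seed)
    observed = score_fn(query)
    residues = list(query)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        background.append(score_fn("".join(residues)))
    mean = statistics.mean(background)
    sd = statistics.pstdev(background)
    z = (observed - mean) / sd if sd > 0 else 0.0
    # Fraction of background scores at least as good as the observed one;
    # the +1 terms keep the estimate away from an impossible p of zero.
    p = (1 + sum(b <= observed for b in background)) / (n_shuffles + 1)
    return z, p

def toy_score(seq, environments="BBBBEEEE"):
    """Hypothetical energy: -1 for every residue that fits its position."""
    hydrophobic = set("AVLIMFWC")
    return -sum((r in hydrophobic) == (e == "B") for r, e in zip(seq, environments))

z, p = background_significance("LLLLKKKK", toy_score)
```

A strongly negative z-score together with a small empirical p indicates that the observed score is unlikely to arise from composition alone, which is the property that makes scores against different templates comparable.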

31 Assessing Prediction Reliability
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
- Score = -1500, E-value = e-1
- Score = -900, E-value = e-21
- Score = -1120, E-value = 0.5 e-1
- Score = -720, E-value = e-2
If no prediction has a significant E-value, a prediction program should indicate that it could not make a prediction!

32 Prediction of Protein Structures
- Thread the query against a template database
- Select the hits with good E-values, e.g., < e-10
- Place the backbone atoms of each query residue at the position of its aligned template residue
  FMFTAIGEEVVQRSRKIL---DDLVELVK
  AVLTRYGQRLIQLYDLLAQIQQKAFDVLS
- Unaligned residues will not have 3D coordinates

33 Prediction of Protein Structures
- Protein threading can predict only the backbone structure of a protein (side chains have to be predicted using other methods)
- Typically, the lower the E-value, the higher the prediction accuracy
[Figure: predicted vs. actual structure. Blue: actual structure. Green: predicted structure]

34 Prediction of Protein Structures Examples – a few good examples [Figures: actual vs. predicted structures]

35 Prediction of Protein Structures Not so good example

36 Prediction of Protein Structures
- State of the art: about 50% of the soluble proteins in a microbial genome could have a correct fold prediction, and perhaps 50% of these have a good backbone structure prediction
- Functional inference could be made based on
  – accurately predicted structures:
  – correctly identified structural folds:

37 Prediction of Protein Structures
- All-atom structures could be predicted through
  – prediction of backbone structure
  – prediction of side-chain packing: backbone-dependent rotamers, ab initio prediction of side chains
- State of the art: accurate prediction of side chains remains a challenging problem

38 Structure prediction using additional information
Some structural information may be available before the whole structure is solved:
  – disulfide bonds
  – active sites
  – residues identified as buried/exposed
  – (partial) secondary structure
  – partial NMR data
  – inter-residue distances from cross-linking and mass spectrometry
  – overall shape derived from cryo-EM
  – …
These data can provide highly useful constraints on threading prediction

39 Structure prediction using additional information The basic idea
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Distance or other types of constraints could be derived before the structure is solved, which could help make the structure prediction more accurate

40 Applications
- Many protein structures have been successfully predicted before their experimental structures were solved (and were later verified by the experimental structures)
- Structure predictions of all predicted genes in three microbial genomes: Synechococcus, Prochlorococcus MIT/MED
- ~60% of predicted genes have structural fold assignments

41 Existing Prediction Programs
- PROSPECT
  – https://csbl.bmb.uga.edu/protein_pipeline
- FUGUE
  – http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
- THREADER
  – http://bioinf.cs.ucl.ac.uk/threader/

42 CASP: Critical Assessment of Structure Prediction
- Reference: John Moult, "A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction"
- First held in 1994, every 2 years afterwards
- Teams make structure predictions from sequences alone

43 CASP Two categories of predictors
- Automated
  – Automatic servers, must complete analysis within 48 hours
  – Shows what is possible through computer analysis alone
- Non-automated
  – Groups spend considerable time and effort on each target
  – Utilize computer techniques and human analysis techniques

44 CAFASP GOAL The goal of CAFASP is to evaluate the performance of fully automatic structure prediction servers available to the community. In contrast to the normal CASP procedure, CAFASP aims to answer the question of how well servers do without any intervention of experts, i.e. how well ANY user using only automated methods can predict protein structure. CAFASP assesses the performance of methods without the user intervention allowed in CASP.

45 Performance Evaluation in CAFASP3

Servers (54 in total)    Sum MaxSub Score    # correct (30 FR targets)
3ds5 robetta             5.17-5.25           15-17
pmod 3ds3 pmode3         4.21-4.36           13-14
RAPTOR                   3.98                13
shgu                     3.93                13
3dsn                     3.64-3.90           12-13
pcons3                   3.75                12
fugu3 orf_c              3.38-3.67           11-12
…                        …                   …
pdbblast                 0.00                0

(http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released in December, 2002.)
- Servers with names in italic are meta servers
- The MaxSub score ranges from 0 to 1; therefore, the maximum total score is 30

46 One structure where RAPTOR did best
- Red: true structure. Blue: correct part of prediction. Green: wrong part of prediction
- Target size: 144
- Super-imposable size within 5 Å: 118
- RMSD: 1.9

47 Some more results by other programs

50 Summary of current state of the art

51 Secondary Structure Prediction Given a protein sequence $a_1 a_2 \ldots a_N$, secondary structure prediction aims at defining the state of each amino acid $a_i$ as being either H (helix), E (extended = strand), or O (other). (Some methods have 4 states: H, E, T for turns, and O for other.)

52 Measures used to evaluate secondary structure predictions Percentage of residues predicted ("PP"): the percentage of residues for which a secondary structure prediction was made (residues were assigned a secondary structure with nonzero probability). This number is provided for reference.

53 Measures used to evaluate secondary structure predictions Qindex: Qindex (Qhelix, Qstrand, Qcoil, Q3) gives the percentage of residues predicted correctly as helix (H), strand (E), coil (C), or for all three conformational states.
- Qhelix ("Q_H")
- Qstrand ("Q_S")
- Qcoil ("Q_C")
- Q3 ("Q3")

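54 Qindex
For a single conformational state: $Q_i = \frac{\text{number of residues correctly predicted in state } i}{\text{number of residues observed in state } i} \times 100$, where $i$ is either helix, strand or coil.
For all three states: $Q_3 = \frac{\text{number of residues correctly predicted}}{\text{total number of residues}} \times 100$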

55 Limitations of Q3
Amino acid sequence:          ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual secondary structure:   hhhhhooooeeeeoooeeeooooohhhhh
Prediction 1:                 ohhhooooeeeeoooooeeeooohhhhhh   Q3 = 22/29 = 76% (useful prediction)
Prediction 2:                 hhhhhoooohhhhooohhhooooohhhhh   Q3 = 22/29 = 76% (terrible prediction)
- Both predictions have the same Q3, yet the second one misses every strand
- Q3 for random prediction is 33%
- Secondary structure assignment in real proteins is uncertain to about 10%; therefore, a “perfect” prediction would have Q3 = 90%
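The limitation can be checked directly on the example strings from this slide. The helper below is a small sketch (the name `q_index` and its `state` argument are choices made for this example): with `state=None` it returns Q3, otherwise the per-state index over the residues observed in that state.

```python
def q_index(observed, predicted, state=None):
    """Percentage of residues predicted correctly. With state=None this is Q3;
    otherwise only residues observed in the given state are counted."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    pairs = [(o, p) for o, p in zip(observed, predicted)
             if state is None or o == state]
    if not pairs:
        return float("nan")
    return 100.0 * sum(o == p for o, p in pairs) / len(pairs)

ACTUAL = "hhhhhooooeeeeoooeeeooooohhhhh"
PRED_1 = "ohhhooooeeeeoooooeeeooohhhhhh"
PRED_2 = "hhhhhoooohhhhooohhhooooohhhhh"

print(round(q_index(ACTUAL, PRED_1), 1))        # 75.9
print(round(q_index(ACTUAL, PRED_2), 1))        # 75.9
print(round(q_index(ACTUAL, PRED_1, "e"), 1))   # 71.4
print(round(q_index(ACTUAL, PRED_2, "e"), 1))   # 0.0
```

The per-state strand index separates the two predictions even though Q3 does not, which is why per-state measures are reported alongside Q3.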

56 Early methods for Secondary Structure Prediction
- Chou and Fasman (Chou and Fasman. Prediction of protein conformation. Biochemistry, 13: 211-245, 1974)
- GOR (Garnier, Osguthorpe and Robson. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol., 120: 97-120, 1978)

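57 Chou and Fasman Start by computing each amino acid's propensity to belong to a given type of secondary structure: $P_s(i) = \frac{n_{i,s} / n_i}{n_s / n}$, where $n_{i,s}$ is the number of residues of type $i$ found in structure type $s$, $n_i$ the total number of residues of type $i$, $n_s$ the total number of residues in structure type $s$, and $n$ the total number of residues. Propensities > 1 mean that residue type $i$ is likely to be found in the corresponding secondary structure type.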

58 Chou and Fasman

Amino acid   α-Helix   β-Sheet   Turn    Favors
Ala          1.29      0.90      0.78    α-helix
Cys          1.11      0.74      0.80    α-helix
Leu          1.30      1.02      0.59    α-helix
Met          1.47      0.97      0.39    α-helix
Glu          1.44      0.75      1.00    α-helix
Gln          1.27      0.80      0.97    α-helix
His          1.22      1.08      0.69    α-helix
Lys          1.23      0.77      0.96    α-helix
Val          0.91      1.49      0.47    β-strand
Ile          0.97      1.45      0.51    β-strand
Phe          1.07      1.32      0.58    β-strand
Tyr          0.72      1.25      1.05    β-strand
Trp          0.99      1.14      0.75    β-strand
Thr          0.82      1.21      1.03    β-strand
Gly          0.56      0.92      1.64    turn
Ser          0.82      0.95      1.33    turn
Asp          1.04      0.72      1.41    turn
Asn          0.90      0.76      1.23    turn
Pro          0.52      0.64      1.91    turn
Arg          0.96      0.99      0.88

59 Chou and Fasman
Predicting helices:
- find nucleation site: 4 out of 6 contiguous residues with P(α) > 1
- extension: extend helix in both directions until a set of 4 contiguous residues has an average P(α) < 1 (breaker)
- if average P(α) over whole region is > 1, it is predicted to be helical
Predicting strands:
- find nucleation site: 3 out of 5 contiguous residues with P(β) > 1
- extension: extend strand in both directions until a set of 4 contiguous residues has an average P(β) < 1 (breaker)
- if average P(β) over whole region is > 1, it is predicted to be a strand
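The helix rules can be turned into a short sketch using the α-helix propensities from the table on slide 58. It is a simplified reading of the rules rather than the full published procedure: extension moves one residue at a time and stops as soon as the four residues at the growing end average below 1, overlapping regions are merged, and conflicts with strand or turn predictions are ignored. The function names are chosen for this example.

```python
# Helix propensities P(alpha), one-letter codes, copied from the slide 58 table.
P_ALPHA = {
    "A": 1.29, "C": 1.11, "L": 1.30, "M": 1.47, "E": 1.44,
    "Q": 1.27, "H": 1.22, "K": 1.23, "V": 0.91, "I": 0.97,
    "F": 1.07, "Y": 0.72, "W": 0.99, "T": 0.82, "G": 0.56,
    "S": 0.82, "D": 1.04, "N": 0.90, "P": 0.52, "R": 0.96,
}

def mean_propensity(segment):
    """Average helix propensity of a non-empty sequence segment."""
    return sum(P_ALPHA[r] for r in segment) / len(segment)

def predict_helices(seq):
    """Return half-open (start, end) index pairs of predicted helices."""
    n = len(seq)
    helices = []
    i = 0
    while i + 6 <= n:
        # Nucleation: at least 4 of 6 contiguous residues with P(alpha) > 1.
        if sum(P_ALPHA[r] > 1.0 for r in seq[i:i + 6]) < 4:
            i += 1
            continue
        start, end = i, i + 6
        # Extend to the right until the last 4 residues average below 1.
        while end < n and mean_propensity(seq[end - 3:end + 1]) >= 1.0:
            end += 1
        # Extend to the left in the same way.
        while start > 0 and mean_propensity(seq[start - 1:start + 3]) >= 1.0:
            start -= 1
        # Accept the region only if its overall average is above 1.
        if mean_propensity(seq[start:end]) > 1.0:
            if helices and start <= helices[-1][1]:
                helices[-1] = (min(start, helices[-1][0]), end)
            else:
                helices.append((start, end))
        i = end
    return helices
```

The same structure works for strands by swapping in the β propensities and the 3-of-5 nucleation rule.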

60 Chou and Fasman Position-specific parameters for turn: each of the four turn positions, with frequencies f(i), f(i+1), f(i+2), f(i+3), has distinct amino acid preferences. Examples:
- At position 2, Pro is highly preferred; Trp is disfavored
- At position 3, Asp, Asn and Gly are preferred
- At position 4, Trp, Gly and Cys are preferred

61 Chou and Fasman
Predicting turns:
- for each tetrapeptide starting at residue i, compute:
  - P(turn) (average turn propensity over all 4 residues)
  - F = f(i)*f(i+1)*f(i+2)*f(i+3)
- if P(turn) > P(α) and P(turn) > P(β) and P(turn) > 1 and F > 0.000075, the tetrapeptide is considered a turn
Chou and Fasman prediction: http://fasta.bioch.virginia.edu/fasta_www/chofas.htm

62 The GOR method
- Position-dependent propensities for helix, sheet or turn are calculated for each amino acid. For each position j in the sequence, eight residues on either side are considered.
- A helix propensity table contains information about the propensity for residues at 17 positions when the conformation of residue j is helical. The helix propensity tables have 20 x 17 entries.
- Build similar tables for strands and turns.
- GOR simplification: the predicted state of AAj is calculated as the sum of the position-dependent propensities of all residues around AAj.
- GOR can be used at: http://abs.cit.nih.gov/gor/ (current version is GOR IV)

63 Accuracy Both Chou and Fasman and GOR have been assessed, and their accuracy is estimated to be Q3 = 60-65%. (Initially, higher scores were reported, but the experiments set up to measure Q3 were flawed, as the test cases included proteins used to derive the propensities!)

64 Secondary Structure Prediction
Available servers:
- JPRED: http://www.compbio.dundee.ac.uk/~www-jpred/
- PHD: http://cubic.bioc.columbia.edu/predictprotein/
- PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/
- NNPREDICT: http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
- Chou and Fasman: http://fasta.bioch.virginia.edu/fasta_www/chofas.htm
