1
Protein Structure Modelling
Structural Bioinformatics II Protein Structure Modelling R.S.K. Vijayan ,
2
Overview of today's lecture
- Levels of protein structure
- Protein structure prediction
- Secondary structure prediction
  - Chou-Fasman method
  - GOR method
  - NN-based methods
- Tertiary structure prediction
  - ab initio methods
  - Challenges
  - Limitations
- Overview of the Rosetta method
- Overview of CASP and CAMEO
3
Levels of Protein Structure
There are four levels of protein structure, with super-secondary structure, folds and domains as an intermediate level:
- Primary structure (1°)
- Secondary structure (2°)
- Super-secondary structure, folds and domains
- Tertiary structure (3°)
- Quaternary structure (4°)
The primary structure of a protein refers to the amino acid sequence of the polypeptide chain.
4
Secondary structure in Proteins
Secondary structure is the general three-dimensional form of local segments of proteins. The Dictionary of Protein Secondary Structure (DSSP) is commonly used to describe protein secondary structure with single-letter codes. There are eight different types of secondary structure:
- G = 3-turn helix (3₁₀ helix). Min length 3 residues.
- H = 4-turn helix (α helix). Min length 4 residues.
- I = 5-turn helix (π helix). Min length 5 residues (extremely rare).
- T = hydrogen-bonded turn (3, 4 or 5 turn).
- E = extended β strand (parallel and/or anti-parallel). Min length 2 residues.
- B = residue in isolated β-bridge (single pair β-sheet hydrogen bond formation).
- S = bend (the only non-hydrogen-bond-based assignment).
- C = coil (residues which are not in any of the above conformations).
In helix notation such as 3₁₀, the principal number denotes the number of residues per turn and the subscript gives the number of atoms in the ring formed by closing the hydrogen bond.
5
Protein Tertiary Structure
Tertiary structure refers to the three-dimensional structure of the entire polypeptide chain. The tertiary structure is defined by its atomic coordinates and is determined using techniques such as X-ray crystallography, NMR spectroscopy, and cryo-EM. The function of a protein depends on its tertiary structure.
Sequence → Structure → Function
6
Quaternary Structure
- Many proteins are made up of a single, continuous polypeptide chain (monomeric).
- Some proteins contain two or more polypeptide chains called subunits/chains (multimeric).
- Quaternary structure describes the arrangement of two or more subunits/chains to form one integral structure in a multiunit protein.
- The arrangement of the subunits gives rise to a stable structure.
- It includes organizations from simple dimers to large homo-oligomers and complexes.
- Subunits may be identical (homo) or different (hetero).
Examples: GABA-A ion channel (hetero-pentamer); HIV protease (homo-dimer).
7
Levels of Protein Structure
8
Deciphering the Protein Folding Code
The protein folding problem is the "holy grail" of modern biological research.
- Forward folding problem: given an amino acid sequence, predict its 3D structure.
  - How do proteins fold so quickly? (Levinthal's paradox)
  - What happens when this process goes awry (when proteins misfold)?
  - It has been studied for more than four decades and is still very much an open problem.
- "Inverse folding" problem: given a particular 3D fold, identify amino acid sequences that can adopt this fold.
  - There will be a number of sequences compatible with a particular target, because homologous proteins are known to adopt the same fold.
  - Protein design: rational design of new protein molecules, with the ultimate goal of designing novel function and/or behavior.
  - Bioengineering and biomedical applications.
9
Protein Secondary Structure Prediction
Predicting protein secondary structure from amino acid sequence has been attempted since the late 1950s. Secondary structure prediction methods aim to predict the local secondary structures of proteins based only on knowledge of their primary sequence, assigning regions of the amino acid sequence as likely alpha helices, beta strands, or turns. The principle behind most secondary structure predictions is to look for patterns of residue conservation that are indicative of particular secondary structures. The early methods suffered from a lack of data. To date, over 20 different secondary structure prediction methods have been developed. Current methods can achieve up to 80% overall accuracy for globular proteins. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.
10
Amino-acids Propensity Values
The main criterion for alpha helix preference is that the amino acid side chain should cover and protect the backbone H-bonds in the core of the helix.
- Helix formers: Ala, Leu, Met, Phe, Glu, Gln, His, Lys, Arg
- Helix breakers:
  - Gly: side chain (H) is too small to protect the backbone H-bond.
  - Pro: rigid structure (phi = -60°), side chain linked to the alpha N.
  - Asp, Asn, Ser: H-bonding side chains compete directly with backbone H-bonds.
Large aromatic residues (Tyr, Phe and Trp) and β-branched amino acids (Thr, Val, Ile) are favored in β strands in the middle of β sheets, because every other side chain in a sheet points in the opposite direction, leaving room for β-branched side chains to pack.
References:
Guzzo AV: The influence of amino acid sequence on protein structure. Biophys J 1965, 5:809–822.
Chou and Fasman, Ann. Rev. Biochem. 47, 258 (1978).
11
PSSP Applications
Prediction of protein secondary structure provides information that is useful:
a) for ab initio structure prediction
b) as an additional constraint for fold-recognition algorithms
c) to help the design of site-directed or deletion mutants that will preserve the native protein structure (where and how to subclone protein fragments for expression)
d) for refinement of sequence alignments
e) as a step toward the goal of understanding protein folding (a hierarchical approach to solving the protein folding problem)
f) for identifying protein function
Secondary structure elements start to form at specific nucleation points during folding.
The quality of secondary structure prediction is measured by the Q3 score: the percentage of residues whose three-state assignment (helix, sheet, loop) is predicted correctly. For each state i (i = helix, sheet, loop), Qi is the percentage of correctly predicted residues in state i relative to the total number of experimentally observed residues in state i.
12
PSSP Algorithms
There are three generations of PSSP algorithms:
- First generation: based on statistical information about single amino acids; limited by the small number of proteins with solved structures.
  - Chou-Fasman, 1974 (first approach): uses a combination of statistical and heuristic rules.
  - GOR, 1978: information-theoretic framework.
- Second generation: larger databases and statistics based on windows (segments) of amino acids. Typically a window contains amino acids. The second-level approximation, involving pairs of residues (local dependencies), provides a better model (the GOR3 algorithm).
- Third generation: based on evolutionary information. Incorporates multiple sequence alignment to obtain additional information from the observed patterns of sequence variability and the location of insertions and deletions.
13
Chou and Fasman Algorithm
Start by computing each amino acid's propensity to belong to a given type of secondary structure.
[Table: propensities for α-helix, β-sheet and turn, listed for Ala, Cys, Leu, Met, Glu, Gln, His, Lys, Val, Ile, Phe, Tyr, Trp, Thr, Gly, Ser, Asp, Asn, Pro, Arg. Legend: a propensity > 1 marks a residue as favoring α-helix, β-strand or turn.]
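The propensity calculation can be sketched as follows. This is a minimal illustration, assuming the usual definition of a propensity (the fraction of a residue type observed in a state, divided by the fraction of all residues observed in that state). The function name `propensities` and the toy data are hypothetical and are not taken from the slides.

```python
from collections import Counter

def propensities(sequences, structures, state):
    """Propensity of each residue type for `state`.

    P(aa, state) = [n(aa in state) / n(aa)] / [n(state) / n(all residues)]
    A value above 1 means the residue occurs in `state` more often than average.
    """
    aa_total = Counter()
    aa_in_state = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            aa_total[aa] += 1
            if s == state:
                aa_in_state[aa] += 1
    # Background frequency of the state over the whole data set
    # (assumes the state occurs at least once).
    background = sum(aa_in_state.values()) / sum(aa_total.values())
    return {aa: (aa_in_state[aa] / aa_total[aa]) / background for aa in aa_total}

# Toy data, made up for illustration: 'h' = helix, 'o' = other.
toy_sequences = ["AAGG", "AGAG"]
toy_structures = ["hhoo", "hoho"]
p_helix = propensities(toy_sequences, toy_structures, "h")
```

In the toy data every Ala is helical and no Gly is, so Ala ends up with a propensity above 1 and Gly with a propensity of 0. A real table would be derived from a large set of solved structures.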
14
Chou and Fasman Algorithm (cont...)
Predicting helices:
- find nucleation site: 4 out of 6 contiguous residues with P(α) > 1.
- extension: extend helix in both directions until a set of 4 contiguous residues has an average P(α) < 1 (breaker).
- if average P(α) over whole region is > 1, it is predicted to be helical.
Predicting strands:
- find nucleation site: 3 out of 5 contiguous residues with P(β) > 1.
- extension: extend strand in both directions until a set of 4 contiguous residues has an average P(β) < 1 (breaker).
- if average P(β) over whole region is > 1, it is predicted to be a strand.
Resolving overlaps:
- any region containing overlapping α-helical and β-sheet assignments is taken to be helical if the average P(α-helix) > P(β-sheet) for that region.
- it is a β sheet if the average P(β-sheet) > P(α-helix) for that region.
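The nucleation and extension rules can be sketched as follows. This is a rough illustration, not the published implementation: the propensity values in `TOY_P_ALPHA` are made up, and the way the 4-residue probe window slides outward one residue at a time is one possible reading of the extension rule. All function names are hypothetical.

```python
def window_mean(seq, table, start, end):
    """Mean propensity of the residues in seq[start:end]."""
    return sum(table[aa] for aa in seq[start:end]) / (end - start)

def find_nuclei(seq, table, width, min_formers):
    """Windows of `width` residues with at least `min_formers` propensities > 1."""
    nuclei = []
    for i in range(len(seq) - width + 1):
        formers = sum(1 for aa in seq[i:i + width] if table[aa] > 1.0)
        if formers >= min_formers:
            nuclei.append((i, i + width))
    return nuclei

def extend(seq, table, start, end, probe=4):
    """Grow [start, end) outward until a 4-residue window averages below 1."""
    while end < len(seq) and window_mean(seq, table, end - probe + 1, end + 1) >= 1.0:
        end += 1
    while start > 0 and window_mean(seq, table, start - 1, start - 1 + probe) >= 1.0:
        start -= 1
    return start, end

def predict_regions(seq, table, width, min_formers):
    """Nucleate, extend, keep regions with mean propensity > 1, merge overlaps."""
    regions = []
    for start, end in find_nuclei(seq, table, width, min_formers):
        start, end = extend(seq, table, start, end)
        if window_mean(seq, table, start, end) > 1.0:
            regions.append((start, end))
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Made-up propensities for a four-letter toy alphabet (not the published table).
TOY_P_ALPHA = {"A": 1.4, "E": 1.5, "G": 0.5, "P": 0.5}
helices = predict_regions("GPGPAEAEAEGPGP", TOY_P_ALPHA, width=6, min_formers=4)
```

Helices use `width=6, min_formers=4`; strands would use the same functions with a P(β) table and `width=5, min_formers=3`. Regions are half-open index ranges into the sequence.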
15
Chou and Fasman Algorithm (cont...)
Predicting turns:
- for each tetrapeptide starting at residue i, compute:
  - PTurn (average turn propensity over all 4 residues)
  - P(t) = f(i)*f(i+1)*f(i+2)*f(i+3)
- if the averages for the tetrapeptide obey the inequalities PTurn > P(α), PTurn > P(β) and PTurn > 1, and P(t) exceeds the cutoff value, then the tetrapeptide is considered a turn.
Position-specific parameters for turns:
- each position has distinct amino acid preferences.
- example: at position 2, Pro is highly preferred; Trp is disfavored.
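The turn test can be sketched as follows. This is a minimal illustration under stated assumptions: the propensity tables, the position-specific frequencies and the cutoff passed to `is_turn` are placeholders chosen for the example, since the slide does not give them. The function names are hypothetical.

```python
def is_turn(tetra, p_turn, p_alpha, p_beta, f_pos, cutoff):
    """Apply the four turn conditions to one tetrapeptide."""
    def mean(table):
        return sum(table[aa] for aa in tetra) / 4

    # P(t): product of the position-specific turn frequencies f(i)..f(i+3).
    p_t = 1.0
    for position, aa in enumerate(tetra):
        p_t *= f_pos[position][aa]

    return (mean(p_turn) > mean(p_alpha)
            and mean(p_turn) > mean(p_beta)
            and mean(p_turn) > 1.0
            and p_t > cutoff)

def predict_turns(seq, p_turn, p_alpha, p_beta, f_pos, cutoff):
    """Start indices of all tetrapeptides that pass the turn test."""
    return [i for i in range(len(seq) - 3)
            if is_turn(seq[i:i + 4], p_turn, p_alpha, p_beta, f_pos, cutoff)]

# Placeholder parameters for a two-letter toy alphabet (not published values).
toy_p_turn = {"G": 1.5, "L": 0.6}
toy_p_alpha = {"G": 0.6, "L": 1.2}
toy_p_beta = {"G": 0.8, "L": 1.3}
toy_f_pos = [{"G": 0.1, "L": 0.05} for _ in range(4)]
toy_cutoff = 1e-5  # placeholder cutoff, chosen only for this example

turn_starts = predict_turns("LLLLGGGGLLLL", toy_p_turn, toy_p_alpha,
                            toy_p_beta, toy_f_pos, toy_cutoff)
```

Passing the cutoff and the tables as arguments keeps the sketch independent of any particular parameter set.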
16
Beware of Q3 Values
It's important to be aware that the Q3 score can give an over-optimistic estimate of accuracy. Because there are only 3 states, even random guessing would yield a 3-state accuracy (Q3) of about 33%, assuming that all states are equally likely. The numbers of residues in helices, strands, and loops in the database are frequently not evenly distributed, with loops usually comprising the greatest proportion.
Amino acid sequence:        ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual secondary structure: hhhhhooooeeeeoooeeeooooohhhhh
Prediction 1:               ohhhooooeeeeoooooeeeooohhhhhh   Q3 = 22/29 = 76%
Prediction 2:               hhhhhoooohhhhooohhhooooohhhhh   Q3 = 22/29 = 76%
Both predictions score the same, although the second misses every strand.
Secondary structure assignment in real proteins is uncertain to about 10% (disagreement between DSSP and STRIDE); therefore, a "perfect" prediction would have Q3 = 90%.
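The Q3 figures for the example on this slide can be checked with a short script. This is a small sketch; the function names `q3` and `per_state_q` are hypothetical, and Q3 is taken here as the fraction of residues whose state is predicted correctly.

```python
def q3(actual, predicted):
    """Fraction of residues whose three-state assignment is predicted correctly."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def per_state_q(actual, predicted, state):
    """Fraction of residues observed in `state` that are also predicted in `state`."""
    predicted_at_state = [p for a, p in zip(actual, predicted) if a == state]
    return sum(p == state for p in predicted_at_state) / len(predicted_at_state)

# Strings copied from the slide: h = helix, e = strand, o = other (loop).
actual       = "hhhhhooooeeeeoooeeeooooohhhhh"
prediction_1 = "ohhhooooeeeeoooooeeeooohhhhhh"
prediction_2 = "hhhhhoooohhhhooohhhooooohhhhh"

for name, pred in (("prediction 1", prediction_1), ("prediction 2", prediction_2)):
    print(name, "Q3 =", round(100 * q3(actual, pred)), "%",
          "Q_strand =", round(100 * per_state_q(actual, pred, "e")), "%")
```

Both predictions reach Q3 = 22/29, but the per-state strand score separates them: the first recovers 5 of the 7 strand residues and the second recovers none. Reporting per-state scores next to Q3 is one way to expose this.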
17
Chou and Fasman Algorithm (cont...)
Advantages of Chou-Fasman:
- Propensity for a specific conformation is evaluated in the "context" of the flanking residues using simple rules.
Disadvantages of Chou-Fasman:
- Correlations between different positions in the sequence are based completely on empirical rules.
- Ambiguity in the assignment of overlapping regions.
- Accuracy below 60% (remember 33.3% is the lower limit).
18
GOR Method
- The GOR method (Garnier-Osguthorpe-Robson) is an information theory-based method.
- It is also based on probability parameters derived from empirical studies of known experimental structures.
- It takes into account not only the propensities of individual amino acids to form particular secondary structures, but also the conditional probability of the amino acid to form a secondary structure given that its immediate neighbors have already formed that structure.
- Each residue is evaluated together with the 8 adjacent N-terminal and 8 adjacent carboxyl-terminal residues (a sliding window of 17 residues).
- Underpredicts β-strand regions.
- GOR method accuracy: Q3 = ~64%
19
GOR Method
Position-dependent propensities for helix, sheet or turn have been calculated for all residue types. For each position j in the sequence, eight residues on both sides of the actual position are considered. Statistical information derived from proteins of known structure is stored in three 17 × 20 matrices, one each for α, β and coil. The helix propensity table, for example, contains the propensity of each residue type at each of the 17 window positions when the conformation of residue j is helical. The predicted state of a_j is calculated from the sum of the position-dependent propensities of all residues around a_j. Suppose a_j is the amino acid that we are trying to categorize. GOR looks at the residues a_(j−8), a_(j−7), ..., a_j, ..., a_(j+7), a_(j+8). Intuitively, it assigns position-dependent probabilities based on what it has calculated from protein databases.
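The window summation can be sketched as follows. This is a simplified illustration that assumes the predicted state is the one with the largest summed score. The three tables in `toy_info` are made-up stand-ins for the 17 × 20 matrices, and the function name `gor_predict` is hypothetical.

```python
STATES = ("helix", "strand", "coil")
HALF_WINDOW = 8  # 8 residues on each side of position j: 17 positions in total

def gor_predict(seq, info):
    """Predict one state per residue.

    info[state][offset + HALF_WINDOW][aa] is the score contributed by residue
    type `aa` at relative position `offset` (-8..+8) towards `state` at j.
    """
    prediction = []
    for j in range(len(seq)):
        scores = {}
        for state in STATES:
            total = 0.0
            for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
                k = j + offset
                if 0 <= k < len(seq):  # window positions beyond the chain ends are skipped
                    total += info[state][offset + HALF_WINDOW].get(seq[k], 0.0)
            scores[state] = total
        # Ties are resolved in the order of STATES.
        prediction.append(max(STATES, key=lambda s: scores[s]))
    return prediction

# Made-up tables: Ala counts towards helix, Val towards strand, Gly towards coil.
toy_info = {
    "helix":  [{"A": 1.0} for _ in range(17)],
    "strand": [{"V": 1.0} for _ in range(17)],
    "coil":   [{"G": 1.0} for _ in range(17)],
}
helix_run = gor_predict("AAAAA", toy_info)
strand_run = gor_predict("VVVVV", toy_info)
```

In a real table each of the 17 rows would hold 20 values, one per amino acid, estimated from proteins of known structure.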
20
GOR Method
21
Third Generation Methods
Use evolutionary information based on multiple sequence alignment and expert methods (neural networks) for prediction.
The most important algorithms of today:
- PHD
- NNPREDICT
- PSIPRED
Due to the improvement of protein information in databases, i.e. better evolutionary information, today's predictive accuracy is ~80%. It is believed that the maximum reachable accuracy is 88%.
An artificial neural network is composed of many artificial neurons that are linked together according to a specific network architecture. The goal of the neural network is to transform the inputs into meaningful outputs.
22
Tertiary Structure Prediction
Major techniques:
- Template-based modeling
  - Homology modeling
  - Threading
- Template-free modeling (prediction from sequence using first principles)
  - ab initio methods
    - Physics-based
    - Knowledge-based
Synonyms: de novo modelling, physics-based.
23
Overview of ab initio method
Typically, ab initio modelling conducts a conformational search under the guidance of a designed energy function. This procedure usually generates a number of possible conformations (structure decoys), and final models are selected from them. Therefore, successful ab initio modelling depends on three factors: (1) an accurate energy function with which the native structure of a protein corresponds to the most thermodynamically stable state, compared to all possible decoy structures; (2) an efficient search method which can quickly identify the low-energy states through conformational search; (3) selection of native-like models from a pool of decoy structures.
24
Overview of ab initio method
Disadvantages:
- Ab initio prediction is not practical for large sequences (it works only for < 100 aa).
- Computationally very expensive.
- Currently, the accuracy of ab initio modelling is low and the success is limited to small proteins.
Advantages:
- Can give insights into the folding mechanism.
- Helps in understanding protein misfolding.
- Doesn't require homologs.
- Only way to model new folds.
- Useful for de novo protein design.
25
Challenges in Protein folding
- Energetics
  - We don't know all the forces involved in detail.
  - Too computationally expensive BY FAR! (Folding takes place on the order of microseconds to milliseconds.)
- Conformational search is impossibly large
  - 100 a.a. protein, 2 moving dihedrals per residue, 2 possible positions for each dihedral: 2^200 conformations!
  - Levinthal's paradox: yet proteins fold in a couple of seconds.
- Multiple-minima problem
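The size of the search space quoted above can be worked out directly. This is a back-of-the-envelope sketch: the sampling rate of 10^13 conformations per second is an assumption made for illustration and is not a figure from the slides.

```python
residues = 100
dihedrals_per_residue = 2
positions_per_dihedral = 2

# 2 positions for each of 200 dihedrals: 2**200 conformations.
conformations = positions_per_dihedral ** (residues * dihedrals_per_residue)

SAMPLING_RATE = 1e13  # conformations tried per second (assumed for illustration)
seconds = conformations / SAMPLING_RATE
years = seconds / (365.25 * 24 * 3600)

print(f"conformations: {conformations:.2e}")
print(f"exhaustive search at the assumed rate: {years:.1e} years")
```

2^200 is roughly 1.6 × 10^60, so even at this optimistic rate an exhaustive search would take vastly longer than the seconds in which real proteins fold. That mismatch is the core of Levinthal's paradox.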
26
Understanding protein folding via molecular simulation
Advances in computer hardware, software and algorithms have now made it possible to simulate protein folding.
- Atomistic models have been used for decades to address the protein folding problem (M. Levitt, A. Warshel 1975).
- The first long-timescale study of protein folding using MD simulation (Peter Kollman 1998).
[Figure: time scale for protein folding]
Challenges:
- Accurate force fields
- Adequate sampling
- Robust data analysis
27
Rosetta Approach
The Rosetta approach (David Baker lab, Univ. of Washington):
- Performs a Monte Carlo search through the space of conformations to find the minimal-energy conformation.
- Searches structure space by replacing the torsion angles of a fragment in the current model with torsion angles from known structure fragments.
28
The Rosetta Approach
Given: protein sequence P
for each window of length 9 in P
    assemble a set of structure fragments (using PSI-BLAST)
M = initial structure model of P (fully extended conformation)
S = score(M)
while stopping criteria not met
    randomly select a fixed-width "window" of amino acids from P
    randomly select a fragment from the list for this window
    M' = M with torsion angles in window replaced by angles from fragment
    S' = score(M')
    if Metropolis criterion(S, S') satisfied
        M = M'
        S = S'
Return: predicted structure M
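The search loop above can be sketched in Python as follows. This is a toy illustration of fragment replacement with a Metropolis acceptance test, not Rosetta's code: `toy_score`, the starting torsion angles, the fragment lists and all parameter names are hypothetical stand-ins.

```python
import math
import random

def toy_score(model, target):
    """Hypothetical stand-in for a scoring function: squared torsion deviation."""
    return sum((phi - t_phi) ** 2 + (psi - t_psi) ** 2
               for (phi, psi), (t_phi, t_psi) in zip(model, target))

def fragment_assembly(n_residues, fragments, score, steps=1000,
                      temperature=1.0, window=9, seed=0):
    """Monte Carlo fragment replacement with a Metropolis acceptance test.

    fragments[start] lists candidate fragments for the window beginning at
    `start`; each fragment is a sequence of `window` (phi, psi) pairs.
    """
    rng = random.Random(seed)
    # (phi, psi) = (-180, 180) stands in for a fully extended chain (assumption).
    model = [(-180, 180)] * n_residues
    current = score(model)
    for _ in range(steps):
        start = rng.randrange(n_residues - window + 1)
        fragment = rng.choice(fragments[start])
        trial = model[:start] + list(fragment) + model[start + window:]
        trial_score = score(trial)
        delta = trial_score - current
        # Metropolis criterion: always accept downhill moves, accept uphill
        # moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            model, current = trial, trial_score
    return model, current

# Toy setup: a 12-residue chain whose "native" torsions are helical.
toy_target = [(-60, -45)] * 12
toy_fragments = {start: [[(-60, -45)] * 9, [(-120, 130)] * 9] for start in range(4)}
toy_model, toy_final_score = fragment_assembly(
    12, toy_fragments, lambda m: toy_score(m, toy_target), steps=200)
```

Here the stopping criterion is a fixed number of steps. A real run would use fragments picked from known structures for each sequence window and a far richer scoring function.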
29
The Rosetta Scoring Approach
The Rosetta scoring function takes into account:
- residue environment (solvation)
- residue pair interactions (electrostatics, disulfides)
- strand pairing (hydrogen bonding)
- strand arrangement into sheets
- helix-strand packing
- steric repulsion
Search strategy:
- the scoring function progressively adds terms during the search
  - initially only the steric overlap term is used
  - then all but the "compactness" terms are used
- the search is initiated from different random seeds
- for some applications, an atomic-level scoring function is used
30
Critical Assessment of protein Structure Prediction (CASP)
A community-wide, worldwide experiment for protein structure prediction that has been held every two years since 1994. Evaluation of the results is carried out in the following prediction categories:
- Tertiary structure prediction (all CASPs), divided into template-based and template-free methods
- Secondary structure prediction (dropped after CASP5)
- Prediction of structure complexes (CASP2 only; now a separate experiment, CAPRI)
- Residue-residue contact prediction (starting CASP4)
- Disordered regions prediction (starting CASP5)
- Domain boundary prediction (CASP6–CASP8)
- Function prediction (starting CASP6)
- Model quality assessment (starting CASP7)
- Model refinement (starting CASP7)
- High-accuracy template-based prediction (starting CASP7)