1
Structure Prediction in 1D
[Based on Structural Bioinformatics, chapter 28]
2
Protein Structure
Amino-acid chains fold to form 3D structures.
Proteins are sequences that have a (more or less) stable 3-dimensional configuration.
Structure is crucial for function:
- Area with a specific property
- Enzymatic pockets
- Firm structures
3
Levels of structure: primary structure
4
Levels of structure: secondary structure
α helix β sheet David Eisenberg, PNAS 100:
5
Levels of structure: tertiary and quaternary structure
Four levels of structure in hemagglutinin, which is a long multimeric molecule whose three identical subunits are each composed of two chains, HA1 and HA2. (a) Primary structure is illustrated by the amino acid sequence of residues 68 –195 of HA1. This region is used by influenza virus to bind to animal cells. The one-letter amino acid code is used. Secondary structure is represented diagrammatically beneath the sequence, showing regions of the polypeptide chain that are folded into α helices (light blue cylinders), β strands (light green arrows), and random coils (white strands). (b) Tertiary structure constitutes the folding of the helices and strands in each HA subunit into a compact structure that is 13.5 nm long and divided into two domains. The membrane-distal domain is folded into a globular conformation. The blue and green segments in this domain correspond to the sequence shown in part (a). The proximal domain, which lies adjacent to the viral membrane, has a stemlike conformation due to alignment of two long helices of HA2 (dark blue) with β strands in HA1. Short turns and longer loops, which usually lie at the surface of the molecule, connect the helices and strands in a given chain. (c) The quaternary structure comprises the three subunits of HA; the structure is stabilized by lateral interactions among the long helices (dark blue) in the subunit stems, forming a triple-stranded coiled-coil stalk. Each of the distal globular domains in trimeric hemagglutinin has a site (red) for binding sialic acid molecules on the surface of target cells. Like many membrane proteins, HA has several covalently bound carbohydrate (CHO) chains.
6
Ramachandran Plot
7
Determining structure: X-ray crystallography
8
Determining structure: NMR spectroscopy
9
Determining Structure
X-ray and NMR methods make it possible to determine the structure of proteins and protein complexes.
These methods are expensive and difficult (several months to process one protein).
A centralized database (PDB) contains all solved protein structures (XYZ coordinates of atoms within a specified precision).
~31,000 solved structures
10
Sequence from structure
All information about the native structure of a protein is coded in the amino acid sequence + its native solution environment. Can we decipher the code? No general prediction of 3d from sequence yet. Anfinsen, 1973
11
One dimensional prediction
Project 3D structure onto strings of structural assignments.
A simplification of the prediction problem.
Examples:
- Secondary structure state for each residue [α, β, L]
- Accessibility of each residue [buried, exposed]
- Transmembrane helix
12
Define secondary structure
3D protein coordinates may be converted into a 1D secondary structure representation using DSSP or STRIDE.
DSSP: EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
DSSP = Database of Secondary Structure in Proteins
STRIDE = Secondary STRucture IDEntification method
13
Labeling Secondary Structure
Use both hydrogen-bond patterns and backbone dihedral angles to assign secondary structure labels from the XYZ coordinates of the amino acids.
These methods do not lead to an absolute definition of secondary structure.
Why is it not absolutely defined?
14
Prediction of Secondary Structure
Input: amino-acid sequence
Output: annotation sequence of three classes [alpha, beta, other (sometimes called coil/turn)]
Measure of success: percentage of residues that were correctly labeled
15
Accuracy of 3-state predictions
True SS:    EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
Prediction: EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL
Q3 score = percentage of 3-state symbols that are correct, measured on a "test set".
Test set = an independent set of cases (proteins) that were not used to train, or in any way derive, the method being tested.
Best methods:
- PHD (Burkhard Rost): 72-74% Q3
- Psi-pred (David T. Jones): 76-78% Q3
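A minimal sketch of how a Q3 score could be computed. The reduction from DSSP symbols to three states (H, G, I to helix; E, B to strand; everything else to loop) is one common convention and is an assumption here, since the slides do not say which mapping was used. The function names are illustrative.

```python
# Assumed reduction of DSSP symbols to three states (not specified in the slides):
# H, G, I -> H (helix); E, B -> E (strand); any other symbol -> L (loop/other).
HELIX_SYMBOLS = set("HGI")
STRAND_SYMBOLS = set("EB")


def reduce_to_three_states(dssp: str) -> str:
    """Map a DSSP string onto the alphabet H/E/L."""
    return "".join(
        "H" if c in HELIX_SYMBOLS else "E" if c in STRAND_SYMBOLS else "L"
        for c in dssp
    )


def q3_score(true_dssp: str, predicted: str) -> float:
    """Percentage of residues whose predicted H/E/L state matches the reference."""
    if not true_dssp or len(true_dssp) != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    reference = reduce_to_three_states(true_dssp)
    correct = sum(r == p for r, p in zip(reference, predicted))
    return 100.0 * correct / len(reference)
```

The score is only meaningful when the proteins being scored were not used to build the predictor, as the slide stresses.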
16
What can you do with a secondary structure prediction?
- Find out if a homolog of unknown structure is missing any of the SS (secondary structure) units, i.e. a helix or a strand.
- Find out whether a helix or strand is extended or shortened in the homolog.
- Model a large insertion or terminal domain.
- Aid tertiary structure prediction.
17
Statistical Methods
From the PDB database, calculate the propensity for a given amino acid to adopt a certain ss-type.
Example: #Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 500
p(α,aa) = 500/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000
P = p(α,aa) / (p(α) p(aa)) = 500 / (4,000/10) = 1.25
Used in the Chou-Fasman algorithm (1974)
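The propensity in the example is the joint frequency divided by the product of the two marginal frequencies. A small sketch of that arithmetic, with a hypothetical function name and argument names:

```python
def propensity(n_aa_in_state: int, n_state: int, n_aa: int, n_total: int) -> float:
    """Ratio of the observed joint frequency to the frequency expected
    if amino-acid type and secondary-structure state were independent."""
    p_joint = n_aa_in_state / n_total   # p(state, aa)
    p_state = n_state / n_total         # p(state)
    p_aa = n_aa / n_total               # p(aa)
    return p_joint / (p_state * p_aa)


# Counts from the slide: 500 Ala in helix, 4,000 helix residues,
# 2,000 Ala, 20,000 residues in total.
ala_helix = propensity(500, 4_000, 2_000, 20_000)
```

With the slide's counts this gives approximately 1.25: alanine occurs in helices about 25% more often than independence would predict.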
19
Chou-Fasman: Initiation
Identify regions where 4 out of 6 contiguous residues have propensity P(H) > 1.00.
This forms an “alpha-helix nucleus”.
20
Chou-Fasman: Propagation
Extend the helix in both directions until a set of four contiguous residues has an average P(H) < 1.00.
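A rough sketch of the initiation and propagation steps, under stated assumptions: the propensity table is passed in by the caller, the function names are hypothetical, and the stopping test (the four-residue window that includes the next candidate residue) is one reading of the rule on the slides. Published variants of Chou-Fasman differ in these details.

```python
def helix_nuclei(seq, p_helix, window=6, min_formers=4, threshold=1.0):
    """Return (start, end) half-open spans where at least `min_formers`
    of `window` contiguous residues have P(H) above `threshold`."""
    nuclei = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        formers = sum(p_helix[aa] > threshold for aa in segment)
        if formers >= min_formers:
            nuclei.append((start, start + window))
    return nuclei


def extend_helix(seq, p_helix, start, end, probe=4, threshold=1.0):
    """Grow a nucleus in both directions while the `probe`-residue window
    that includes the next residue keeps an average P(H) >= `threshold`."""
    def mean(segment):
        return sum(p_helix[aa] for aa in segment) / len(segment)

    # Extend towards the C-terminus.
    while end < len(seq) and mean(seq[end + 1 - probe:end + 1]) >= threshold:
        end += 1
    # Extend towards the N-terminus.
    while start > 0 and mean(seq[start - 1:start - 1 + probe]) >= threshold:
        start -= 1
    return start, end


# Toy propensities, invented for illustration only (not the published table).
toy_p_helix = {"A": 1.4, "G": 0.5}
toy_seq = "GGGAAAAAAGGG"
toy_nuclei = helix_nuclei(toy_seq, toy_p_helix)
toy_helix = extend_helix(toy_seq, toy_p_helix, 3, 9)
```

A full implementation would also merge overlapping nuclei and resolve conflicts with strand predictions, which this sketch leaves out.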
21
Chou-Fasman Prediction
Predict as α-helix any segment with E[Pα] > 1.03 and E[Pα] > E[Pβ], not including proline.
Predict as β-strand any segment with E[Pβ] > 1.05 and E[Pβ] > E[Pα].
Others are labeled as turns/loops.
(Various extensions appear in the literature.)
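The decision rule can be sketched as a comparison of average propensities over a segment, using the usual reading of the thresholds (helix propensity above 1.03 and above the strand propensity; strand propensity above 1.05 and above the helix propensity). Treating "not including proline" as "a helix segment must not contain proline" is an assumption, and the propensity values below are invented for illustration.

```python
def classify_segment(segment, p_alpha, p_beta):
    """Label a segment as 'helix', 'strand' or 'loop' from mean propensities."""
    mean_alpha = sum(p_alpha[aa] for aa in segment) / len(segment)
    mean_beta = sum(p_beta[aa] for aa in segment) / len(segment)
    # Assumed reading of the proline condition: no proline inside a helix.
    if mean_alpha > 1.03 and mean_alpha > mean_beta and "P" not in segment:
        return "helix"
    if mean_beta > 1.05 and mean_beta > mean_alpha:
        return "strand"
    return "loop"


# Toy tables (illustrative values, not the published Chou-Fasman parameters).
toy_p_alpha = {"A": 1.4, "V": 1.0, "G": 0.6, "P": 0.6}
toy_p_beta = {"A": 0.8, "V": 1.7, "G": 0.75, "P": 0.55}
```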
22
Achieved accuracy: around 50%
Shortcoming of this method: it ignores the sequence context when predicting from individual amino acids.
We would like to use the sequence context as an input to a classifier.
There are many ways to address this. The most successful to date are based on neural networks.
23
A Neuron
24
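Artificial Neuron
[Diagram: inputs a1, a2, …, ak with weights W1, W2, …, Wk feeding one unit that produces a single output]
A neuron is a multiple-input, single-output unit: output = f(b + W1·a1 + W2·a2 + … + Wk·ak)
Wi = weights assigned to inputs; b = internal “bias”; f = output function (linear, sigmoid)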
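A minimal sketch of a single artificial neuron, assuming the standard form in which the output function f is applied to the bias plus the weighted sum of the inputs. The function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic output function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias, f=sigmoid):
    """Compute f(b + sum_i W_i * a_i) for one neuron."""
    if len(inputs) != len(weights):
        raise ValueError("one weight is needed per input")
    return f(bias + sum(w * a for w, a in zip(weights, inputs)))
```

Passing `f=lambda x: x` gives the linear output function mentioned on the slide.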
25
Artificial Neural Network
[Diagram: input layer a1, a2, …, ak; hidden layer; output layer o1, …, om]
Neurons in hidden layers compute “features” from outputs of previous layers.
Output neurons can be interpreted as a classifier.
26
Example: Fruit Classifier
[Diagram: network with inputs Color, Weight, Texture and Shape; labels shown in the figure: Yellow, Light, Soft, Round, Orange, Red, Heavy, Hard, Ellipse, Apple]
27
Qian-Sejnowski Architecture
[Diagram: input window S(i-w) … S(i) … S(i+w); hidden layer; output units oα, oβ, oo]
28
Neural Network Prediction
A neural network defines a function from inputs to outputs.
Inputs can be discrete or continuous valued.
In this case, the network defines a function from a window of size 2w+1 around a residue to a secondary structure label for it.
Structure element determined by max(oα, oβ, oo)
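A sketch of the two ends of such a network: turning a window of 2w+1 residues into an input vector, and picking the label from the three outputs. The one-hot encoding with an extra padding symbol for positions beyond the sequence ends is an assumption about the input representation, and all names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"                      # hypothetical symbol for positions past either end
ALPHABET = AMINO_ACIDS + PAD   # 21 symbols per window position


def encode_window(seq: str, i: int, w: int) -> list:
    """One-hot encode residues i-w .. i+w into a flat vector of (2w+1)*21 values."""
    vector = []
    for j in range(i - w, i + w + 1):
        symbol = seq[j] if 0 <= j < len(seq) else PAD
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(symbol)] = 1
        vector.extend(one_hot)
    return vector


def pick_state(o_alpha: float, o_beta: float, o_other: float) -> str:
    """Return the label of the output unit with the largest activation."""
    outputs = {"H": o_alpha, "E": o_beta, "L": o_other}
    return max(outputs, key=outputs.get)
```

The hidden and output layers that sit between these two functions are ordinary weighted sums passed through an output function.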
29
Training Neural Networks
By modifying the network weights, we change the function.
Training is performed by:
- Defining an error score for training pairs <input, output>
- Performing gradient-descent minimization of the error score
The back-propagation algorithm allows the gradient to be computed efficiently.
We have to be careful not to overfit the training data.
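To make the training loop concrete, here is a sketch of gradient descent on the simplest possible network, a single sigmoid neuron with a mean squared error score. The error function, learning rate, epoch count and toy data are all assumptions chosen for illustration; a real network adds hidden layers, with back-propagation applying the same chain rule layer by layer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    return sigmoid(bias + sum(w * a for w, a in zip(weights, inputs)))


def mean_squared_error(weights, bias, pairs):
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in pairs) / len(pairs)


def train(pairs, n_inputs, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on the mean squared error of one sigmoid neuron."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        for x, y in pairs:
            out = predict(weights, bias, x)
            # Chain rule: d(out - y)^2 / dz, where out = sigmoid(z).
            delta = 2.0 * (out - y) * out * (1.0 - out)
            for k in range(n_inputs):
                grad_w[k] += delta * x[k]
            grad_b += delta
        weights = [w - learning_rate * g / len(pairs) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(pairs)
    return weights, bias


# Toy training pairs <input, output>: the logical OR of two binary inputs.
toy_pairs = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
toy_weights, toy_bias = train(toy_pairs, n_inputs=2)
```

Overfitting is usually checked by tracking the same error score on proteins held out from training.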
30
Smoothing Outputs
Some sequences of secondary structure are impossible.
To smooth the output of the network, another layer is applied on top of the three output units for each residue.
Success rate: about 65% on unseen proteins
31
Breaking the 70% Threshold
An innovation that made a crucial difference uses evolutionary information to improve prediction.
Key idea:
- Structure is preserved more than sequence
- Surviving mutations are not random
Exploit evolutionary information, based on conservation analysis of multiple sequence alignments.
32
Nearest Neighbor Approach
Predict the secondary structure state based on the secondary structure of homologous segments from proteins with known 3D structure.
A key element: the choice of scoring table for evaluating segment similarity.
Use max(na, nb, nc), where na, nb, nc are the numbers of nearest neighbors with the helix, strand and coil types, respectively.
[NNSSP: Nearest-Neighbor Secondary Structure Prediction]
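A toy sketch of the nearest-neighbor vote. Counting identical positions stands in for the scoring table, which is a deliberate simplification and not what NNSSP uses; the database entries, the value of k and the names are all hypothetical.

```python
from collections import Counter


def identity_score(a: str, b: str) -> int:
    """Placeholder similarity: the number of identical aligned positions."""
    return sum(x == y for x, y in zip(a, b))


def predict_state(query, database, k=3, score=identity_score):
    """Vote among the k most similar segments of known structure.

    `database` holds (segment, state_of_central_residue) pairs with
    states 'H' (helix), 'E' (strand) or 'L' (coil)."""
    ranked = sorted(database, key=lambda entry: score(query, entry[0]), reverse=True)
    counts = Counter(state for _, state in ranked[:k])
    # max(na, nb, nc): ties resolve in the order helix, strand, coil.
    return max("HEL", key=lambda state: counts[state])


# Invented example segments, for illustration only.
toy_database = [
    ("AAAAA", "H"),
    ("AAAAG", "H"),
    ("VVVVV", "E"),
    ("GGPGG", "L"),
    ("AVAVA", "E"),
]
```

Swapping `identity_score` for a substitution-matrix score changes which neighbors are selected, which is why the choice of scoring table matters so much.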
33
[The PredictProtein server]
PHD Approach
- Perform a BLAST search to find local alignments
- Remove alignments that are “too close”
- Perform multiple alignment of the sequences
- Construct a profile (PSSM) of amino-acid frequencies at each residue
- Use this profile as input to the neural network
- A second network performs “smoothing”
- The third level computes a jury decision of several different instantiations of the first two levels

The first level is composed of ANNs almost identical to the architecture described above. However, there are differences in the input representation and in the fact that multiple ANNs are used, each with its input window shifted by one residue. Rather than simply using a binary representation to signify which amino acids are in a sequence, PHD encodes the input signal using aligned sequences. For example, by doing a multiple alignment on the query sequence, each position along the query sequence will have some number of aligned residues from matched sequences. PHD takes the frequency of occurrence of an amino acid in the aligned sequences and transforms that into an analog value that is input into the ANN.

The outputs from all of the level 1 ANNs are then fed into a single second-level ANN. The level 2 ANN also consists of a hidden layer and three output units giving the likelihood that the central residue is an α-helix, β-sheet, or loop/coil. (It should be noted that the use of the term “likelihood” does not imply statistical significance but simply some measure of conformational state strength on a scale from 0 to 1.) The second level, therefore, maps secondary structure information from its inputs to secondary structure information at its output.

The third level of PHD computes an arithmetic average or jury decision of several different instantiations of the first two levels. One instantiation added “conservation weights” to the sequence profiles. These weightings placed a higher emphasis on particularly well conserved positions. Another instantiation used a slightly different representation method for the input vectors. The third level, therefore, provides a jury decision on the likelihood of secondary structure based on many different versions of the level 1 and level 2 ANNs.
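Two pieces of the pipeline lend themselves to a short sketch: computing per-position amino-acid frequencies from a multiple alignment, and averaging the three-state outputs of several networks into a jury decision. This is a simplified illustration, not PHD's actual code; ignoring gap characters and using unweighted frequencies are assumptions, and the names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment):
    """Per-column amino-acid frequencies for equal-length aligned sequences.

    Gap characters (anything not in AMINO_ACIDS) are ignored."""
    profile = []
    for col in range(len(alignment[0])):
        residues = [row[col] for row in alignment if row[col] in AMINO_ACIDS]
        total = len(residues)
        profile.append(
            [residues.count(aa) / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return profile


def jury_average(predictions):
    """Arithmetic mean of several (helix, strand, loop) output triples."""
    n = len(predictions)
    return [sum(p[state] for p in predictions) / n for state in range(3)]


toy_alignment = ["ACD", "ACE", "A-D"]   # invented three-sequence alignment
toy_profile = column_frequencies(toy_alignment)
```

Each row of the profile replaces the binary one-hot vector for that position, so conserved columns produce sharply peaked inputs and variable columns produce flatter ones.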
34
Psi-pred: same idea
(Step 1) Run PSI-BLAST --> output sequence profile
(Step 2) 15-residue sliding window = 315 values, multiplied by hidden weights in the 1st neural net. Output is 3 values (a weight for each state H, E or L) per position.
(Step 3) 60 input values, multiplied by weights in the 2nd neural network, summed. Output is the final 3-state prediction.
315 = 15*21, because there are 20 possible amino acids + 1 value to indicate a “missing value” near the N- or C-terminus of the sequence.
60 = 15*4, where 4 is the 3 possible structure states + 1 “missing value” for windows at the edges.
Performs slightly better than PHD
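The input size follows from 15 window positions times 21 values each (20 profile entries plus one flag for positions that fall outside the sequence). A sketch of how such a vector could be assembled; the layout of the values within each position is an assumption, and the function name is hypothetical.

```python
def window_input(profile, i, half_width=7):
    """Flatten the profile rows i-7 .. i+7 into one input vector.

    Each position contributes its profile values plus a trailing flag that is
    1.0 when the position lies beyond either end of the sequence."""
    n_values = len(profile[0])
    vector = []
    for j in range(i - half_width, i + half_width + 1):
        if 0 <= j < len(profile):
            vector.extend(list(profile[j]) + [0.0])
        else:
            vector.extend([0.0] * n_values + [1.0])
    return vector


# Dummy 30-residue profile with 20 values per position, for illustration only.
toy_psi_profile = [[0.05] * 20 for _ in range(30)]
edge_vector = window_input(toy_psi_profile, 0)
inner_vector = window_input(toy_psi_profile, 15)
```

The second network's 60 inputs can be built the same way, with three state values per position instead of 20 profile values.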
35
Other Classification Methods
Neural networks were used as the classifier in the methods described.
We can apply the same idea with other classifiers, e.g. SVM.
Advantages:
- Effectively avoids over-fitting
- Supplies prediction confidence
[S. Hua and Z. Sun, (2001)]
36
Secondary Structure Prediction - Summary
1st Generation, 1970s: Chou & Fasman, Q3 = 50-55%
2nd Generation, 1980s: Qian & Sejnowski, Q3 = 60-65%
3rd Generation, 1990s: PHD, PSI-PRED, Q3 = 70-80%
Failures:
- Long-range effects: S-S bonds, parallel strands
- Chemical patterns
- Wrong prediction at the ends of H/E