Predicting Structural Features (Chapter 12)
Structural Features
–Phosphorylation sites
–Transmembrane helices
–Protein flexibility
Accuracy Measures Revisited
Level:
–Individual residues
–Complete helix or strand
Residue-Level Measures
Q3
–Percentage of residues predicted correctly
–If one state (e.g., Coil) is very common (e.g., 50%), blind guessing can give a large Q3!
Matthews correlation coefficient
–C = (TP×TN − FN×FP) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
–Defined for each state
–More balanced than Q3; in range ±1
–Random prediction: C = 0
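The coefficient is easy to compute from the four confusion-table counts; here is a minimal Python sketch (the function name and example counts are illustrative, not from the chapter):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for one state (e.g., Helix vs. not-Helix)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventionally 0 when any marginal count is zero
    return (tp * tn - fn * fp) / denom

# Illustrative counts for the Helix state:
print(mcc(tp=120, tn=300, fp=40, fn=60))  # ~0.57: above random (0), below perfect (1)
```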
Structural Element-Level Measures
SOV
–Based on the overlap of predicted "segments" of helix, strand, etc. with the observed segments of the same type
The N-score
–Specialized for transmembrane protein predictors
–Should TMHMM2 be changed? Should your model?
Predicting Helices
Residue propensities:
–A score for a given structure class: for each residue a, P(H | a) is proportional to P(a | H) / P(a)
Why? Bayes' Rule is your friend!
–P(H | a) = P(a | H) P(H) / P(a)
–P(H) doesn't depend on a, so
–P(H | a) is proportional to P(a | H) / P(a)
Can this be used to see how to group helix states?
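To make the propensity idea concrete, here is a minimal sketch of estimating P(a | H) / P(a) from counts over labeled residues (the toy data and all names are invented for illustration):

```python
from collections import Counter

# Toy labeled data: (residue, state) pairs; real data would come from
# DSSP-annotated structures.
data = [("A", "H"), ("A", "H"), ("L", "H"), ("G", "C"), ("P", "C"), ("A", "C"), ("L", "H")]

state_counts = Counter(s for _, s in data)    # for P(H)
residue_counts = Counter(a for a, _ in data)  # for P(a)
joint_counts = Counter(data)                  # for P(a, H)
n = len(data)

def propensity(a, state="H"):
    """P(a | H) / P(a): >1 means residue a favors the state, <1 means it avoids it."""
    p_a_given_h = joint_counts[(a, state)] / state_counts[state]
    p_a = residue_counts[a] / n
    return p_a_given_h / p_a

print(propensity("A"))  # Ala: a classic helix-former, propensity > 1 on real data
print(propensity("P"))  # Pro: a helix-breaker, propensity < 1 on real data
```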
Identical short segments rarely fold differently
Local sequence is highly important to secondary structure. But this sequence occurs in two proteins and takes on very different forms:
–KGVVPQLVK
Still, there is significant information about structure in local sequence.
I-sites Sequence Database
About 250 short segments (3-19 residues) that show strong correlation between sequence and structure
Example shows:
–phi and psi angles, log-odds matrix
–superimposed backbones
–representative structure
Nearest Neighbor Prediction Methods
Predict secondary structure based on:
–Local alignments of the query sequence to a database of sequences of known structure
–Alignment score functions are often special-purpose, and may include helix/sheet/coil "propensity" information
–Homologous sequences are often included in the database
Prediction is based on weighted votes of the nearest neighbors; usually only the central residue of the alignment is predicted (a sketch of the voting follows below)
Accuracy: Q3 = 73.5%
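A minimal sketch of the weighted-vote step, assuming the database search has already produced scored neighbor alignments (the names and toy scores are illustrative):

```python
from collections import defaultdict

# Each neighbor: (alignment_score, state_of_central_residue) from a database search.
neighbors = [(42.0, "H"), (40.5, "H"), (39.0, "C"), (37.5, "H"), (36.0, "E")]

def predict_central_residue(neighbors, k=5):
    """Weighted vote over the k best-scoring local alignments."""
    votes = defaultdict(float)
    for score, state in sorted(neighbors, reverse=True)[:k]:
        votes[state] += score  # weight each neighbor's vote by its alignment score
    return max(votes, key=votes.get)

print(predict_central_residue(neighbors))  # "H"
```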
A different application: prediction of misfolding
Diseases such as Alzheimer's involve protein misfolding. Usually, the misfolded region ends up as beta strands. How could we use secondary structure information to predict which proteins will potentially misfold?
Hidden Beta Propensity
Key idea: tertiary contacts (TC)
–TC is the number of contacts a residue has with others at least 4 residues away in sequence
–Alpha helices tend to be in regions of HIGH TC
–Beta strands tend to be in regions of LOW TC
Look for query residues whose nearest neighbors are "strange" with respect to TC and alpha/beta state (a TC sketch follows below):
–Low-TC regions with lots of alphas
–High-TC regions with lots of betas
Performance results?
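A minimal sketch of one way to count tertiary contacts from C-alpha coordinates; the 8 Å cutoff and all names are illustrative assumptions, not the method's exact definition:

```python
import math

def tertiary_contacts(ca_coords, cutoff=8.0, min_separation=4):
    """For each residue, count residues >= min_separation away in sequence
    whose C-alpha atoms lie within `cutoff` angstroms in space."""
    n = len(ca_coords)
    tc = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                tc[i] += 1
                tc[j] += 1
    return tc

# Toy (x, y, z) coordinates; real ones would come from a PDB file.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (11.4, 0.0, 0.0), (7.0, 3.0, 0.0), (1.0, 1.0, 1.0)]
print(tertiary_contacts(coords))
```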
Neural Nets
Each node computes a simple function of its inputs. The weighted sum of the inputs is added to a bias term and "squashed":
–I = Σj wj xj
–O = 1 / (1 + e^−(I+β))
The output, O, is then propagated to nodes in the next layer.
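A minimal sketch of one such node (the names are illustrative):

```python
import math

def node_output(inputs, weights, bias):
    """Weighted sum plus bias, then squashed through the logistic sigmoid."""
    i = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(i + bias)))

print(node_output(inputs=[0.2, 0.9, 0.4], weights=[1.5, -0.7, 2.0], bias=0.1))
```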
Training Neural Nets
Back-propagation
–Optimizes the weights and bias terms
–Minimizes the error function (difference between predicted and observed):
–RMS
–Relative entropy
Iterative process
–Final weights shown (in the figure) for a secondary structure NN alpha-helix output layer
–Over-fitting can be reduced by training for fewer iterations
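As a toy illustration of back-propagation's core idea, here is per-sample gradient descent on squared error for a single logistic node; nothing here is the chapter's actual network:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy task: learn OR from two inputs with one logistic node.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b, rate = [0.0, 0.0], 0.0, 1.0

for epoch in range(1000):
    for x, target in data:
        out = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        # Gradient of squared error through the sigmoid (chain rule):
        delta = (out - target) * out * (1 - out)
        w[0] -= rate * delta * x[0]
        w[1] -= rate * delta * x[1]
        b -= rate * delta
    # Stopping after fewer epochs is one crude guard against over-fitting.

print([round(sigmoid(w[0] * x[0] + w[1] * x[1] + b), 2) for x, _ in data])
```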
Adaptive Encoding and Weight Sharing
Orthogonal encoding
–Each residue feeds three hidden nodes
–The weights for all red nodes (in the figure) are tied together
–Each group of three nodes therefore learns the same "encoding" of the 20 amino acids
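For reference, a minimal sketch of the orthogonal (one-hot) input encoding that the tied weights then compress into a learned three-number code (names illustrative):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def orthogonal_encode(residue):
    """One 20-dimensional unit vector per amino acid: all zeros except a single 1."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(residue)] = 1
    return vec

# With weight sharing, every window position multiplies this vector by the SAME
# 20->3 weight matrix, so the network learns one 3-number code per residue type.
print(orthogonal_encode("A"))
```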
Engineering Intuition Into NNs
Alpha helices have a period of 3.6 residues per turn. A NN can be specially designed to reflect that.
Using this, plus adaptive encoding:
–Q3 = 66%
–Adding homology: Q3 = 73%
HMMs and Transmembrane Proteins (again)
HMMTOP Architecture
–TMHs: 17-25 residues
–Tails: 1-15 residues
–Blue letters (in the figure) show structural state labels
TMHMM Architecture
Helices are 5-25 residues; caps follow helices
Cytoplasmic:
–Loop: 0-20 residues
–Globular: 1 state
Extra-cellular:
–Long loop: 0-100 residues
–Globular: 3 states
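One common way such HMMs enforce a length range like 5-25 is to chain states rather than use a single self-looping state; a purely illustrative sketch of that idea (not TMHMM's actual state table):

```python
def helix_length_states(min_len=5, max_len=25):
    """Chain of helix states: the first min_len are mandatory; each later state
    can be skipped, so the total helix length falls between min_len and max_len."""
    states = []
    for i in range(max_len):
        states.append({"name": f"H{i + 1}", "optional": i >= min_len})
    return states

states = helix_length_states()
print(len(states), sum(s["optional"] for s in states))  # 25 states, 20 skippable
```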
Predicting Globular Proteins with "Hidden Neural Networks"
YASPIN
–A neural net predicts seven classes (He, H, Hb, C, Ee, E, Eb) using a 15-residue window of PSSM input
–An HMM "filters" this output
–Can you imagine how this is done? (One plausible reading is sketched below.)
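One plausible reading of the "filtering": treat the NN's per-residue class probabilities as emissions and run Viterbi over an HMM whose transitions favor plausible state successions. A toy 3-state (H/E/C) sketch with invented transition probabilities:

```python
import math

STATES = ["H", "E", "C"]
# Invented transitions favoring runs of the same state:
TRANS = {s: {t: (0.8 if s == t else 0.1) for t in STATES} for s in STATES}

def viterbi_filter(nn_probs):
    """nn_probs: one {state: probability} dict per residue, from the neural net.
    Returns the most probable state path under the smoothing HMM."""
    v = [{s: math.log(nn_probs[0][s]) for s in STATES}]
    back = []
    for probs in nn_probs[1:]:
        row, ptr = {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: v[-1][s] + math.log(TRANS[s][t]))
            row[t] = v[-1][best] + math.log(TRANS[best][t]) + math.log(probs[t])
            ptr[t] = best
        v.append(row)
        back.append(ptr)
    path = [max(STATES, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

# A lone "E" vote in a run of "H" votes gets smoothed away:
probs = [{"H": 0.7, "E": 0.2, "C": 0.1}] * 3
probs.insert(1, {"H": 0.4, "E": 0.5, "C": 0.1})
print(viterbi_filter(probs))  # "HHHH"
```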
Coiled-coil HMM
MARCOIL
–The design lets you start and end in any phase of the heptad repeat
Support Vector Machines: SVMs
Classifiers
–Basic "machine" is a 2-class classifier
–Training: a data set of labeled vectors {(xi, ci)}, with class ci = +1 or −1
–Supervised learning (like neural nets): learns from positive and negative examples
–Output: a function predicting the class of unlabeled vectors
SVM Example
Alpha helix predictor
–15-residue window
–21 numbers per residue
–PSI-BLAST PSSM: 20 numbers
–a "spacer" flag indicating "off the end" of the protein
–315 numbers total per window
–Training samples
–Non-helix samples: {(xi, −1)}
–Helix samples: {(xi, +1)}
–Training finds the function of x that best separates the non-helix from the helix samples (a sketch follows below)
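A minimal sketch of this setup with scikit-learn; the random vectors merely stand in for real 15 × 21 = 315-number PSSM windows:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for 15-residue windows x 21 numbers = 315 features per sample.
X_helix = rng.normal(loc=0.5, size=(50, 315))      # labeled +1
X_nonhelix = rng.normal(loc=-0.5, size=(50, 315))  # labeled -1
X = np.vstack([X_helix, X_nonhelix])
y = np.array([1] * 50 + [-1] * 50)

clf = SVC(kernel="rbf")  # the kernel choice is where the tailoring happens
clf.fit(X, y)
print(clf.predict(X[:3]))  # classes of the first three training windows
```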
SVM vs NN as Classifiers
Similarities
–Both compute a function on their inputs
–Both are trained to minimize error
Differences
–NNs find any hyperplane that separates the two classes
–SVMs find the maximum-margin hyperplane
–NNs can be engineered by designing their topology
–SVMs can be tailored by designing the kernel function
SVM Details
Separating hyperplanes:
–Choose w, b to minimize ||w||, subject to ci(w·xi + b) ≥ 1 for all training samples (xi, ci)
–Dual form: maximize Σi αi − ½ Σi,j αi αj ci cj (xi·xj), subject to αi ≥ 0 and Σi αi ci = 0; the samples with αi > 0 are the support vectors
–The resulting classifier is f(x) = sign(Σi αi ci (xi·x) + b), where w = Σi αi ci xi
Kernel trick: replace the dot products by a non-linear kernel function K(xi, xj)
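A small numeric illustration of the kernel trick: swap the plain dot product for an RBF kernel, which behaves like a dot product in an implicit non-linear feature space (values invented):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=0.5):
    """Non-linear kernel: an implicit dot product after a non-linear feature map."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

x1, x2 = [1.0, 2.0], [2.0, 0.5]
print(dot(x1, x2))         # linear similarity
print(rbf_kernel(x1, x2))  # kernelized similarity, in (0, 1]
```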
Dubious Statement
"In marked contrast to NN, SVMs have few explicit parameters to fit..."
–The vector of per-sample weights (the αi of the dual form) is as long as the number of training samples
–But for the maximum-margin hyperplane, most of those weights are zero; only the "support vectors" have non-zero weights