2° structure, TM regions, and solvent accessibility. Topic 13. Chapter 29, Gu and Bourne, “Structural Bioinformatics”
The Truth (Information) is Out (In) There
But we’re still having a tough time finding it.
Protein Secondary Structure Prediction
Given a protein sequence (primary structure), predict its secondary structure.
HWIATGQLIREAYEDYSS
GHWIATRGQLIREAYEDYRHFSSECPFIP
EEEEECCEEEEECCCHHHH
CEEEEECCCEEEEECCCHHHHHHCCCCCC
E: β-strand, H: α-helix, C: coil
Assumption: short stretches of residues have a propensity to adopt a certain conformation ⇒ the conformation of the central residue in a sequence fragment depends only on its flanking residues (sliding window).
Three-state reduction:
H: (H: α-helix, G: 3₁₀ helix, I: π-helix)
E: (E: β-strand, B: bridge)
C: (T: turn, S: bend, C: coil)
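The three-state reduction and the sliding-window assumption can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's implementation. The function names, the padding character `-`, and the choice to map unknown codes to coil are assumptions made here.

```python
# Three-state reduction as listed on the slide:
# H, G, I -> H; E, B -> E; T, S, C -> C.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}


def reduce_to_three_states(ss8, default="C"):
    """Map an eight-state string to H/E/C (unknown codes -> default; an assumption)."""
    return "".join(EIGHT_TO_THREE.get(state, default) for state in ss8)


def sliding_windows(sequence, half_width, pad="-"):
    """Return one window per residue, centred on that residue.

    The ends are padded so that every residue, including the terminal
    ones, sits at the centre of a window of size 2 * half_width + 1.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]
```

Each window returned by `sliding_windows` is what a window-based predictor would see when assigning a state to the residue at its centre.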
Why secondary structure prediction?
-- Because we can (kind of).
-- Because it could be a first step towards prediction of protein tertiary structure.
“Have solution, need problem.” Nearly every imaginable algorithm has been applied to secondary structure prediction.
Secondary Structure Prediction Methods
1. First generation: single amino acid propensities. Chou-Fasman method (1974), GOR I-IV. ~56-60% accuracy
2. Second generation: segments of 3-51 adjacent residues. NNSSP, SSPAL. ~65% accuracy
Third generation: methods using evolutionary information. ~76% accuracy
3. Neural network: PHD, Psi-Pred, J-Pred
4. Support vector machine (SVM)
5. Hidden Markov Models (HMM)
Secondary Structure Prediction Accuracy
1. Three-state per-residue prediction accuracy: $Q_3 = \frac{100}{N_{obs}} \sum_{i \in \{H, E, C\}} M_{ii}$
M_ii: number of residues observed in state i and predicted in state i
N_obs: total number of residues observed in the 3 states
2. Per-segment prediction accuracy (SOV, Segment OVerlap)
Per-state segment overlap. S1: observed SS segment; S2: predicted SS segment
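The per-residue measure is simple enough to compute directly. The sketch below counts the residues whose observed and predicted states agree (the sum of the M_ii) and divides by the number of observed residues. The function name and the input format (two equal-length strings over H/E/C) are assumptions for illustration.

```python
def q3_accuracy(observed, predicted, states="HEC"):
    """Three-state per-residue accuracy in percent.

    observed, predicted: equal-length strings of state letters.
    Only residues whose observed state is in `states` are counted.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    n_obs = sum(1 for o in observed if o in states)
    if n_obs == 0:
        raise ValueError("no residues observed in the given states")
    # Sum of the diagonal terms M_ii: observed in state i and predicted in state i.
    correct = sum(
        1 for o, p in zip(observed, predicted) if o in states and o == p
    )
    return 100.0 * correct / n_obs
```

Note that a per-residue score says nothing about whether whole segments are recovered, which is the gap the segment-overlap measure is meant to fill.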
Single Residue Propensity Methods
Calculate the propensity for a given amino acid to adopt a certain SS type:
$P_\alpha^i = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}$
i: amino acid; α: secondary structure state
Example: from a data set with 30 proteins, #Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 580
p(α, aa) = 580/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000
P = 580 / (4,000/10) = 1.45
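The worked example above can be checked with a short sketch. The first function takes the four counts directly. The second builds a full propensity table from paired sequence and secondary-structure strings. Both function names and the input format are hypothetical, chosen only for this illustration.

```python
from collections import Counter


def propensity(n_joint, n_state, n_aa, n_total):
    """Ratio p(state, aa) / (p(state) * p(aa)) computed from raw counts."""
    return (n_joint * n_total) / (n_state * n_aa)


def propensity_table(sequences, structures):
    """Propensity for every (amino acid, state) pair seen in the data.

    sequences, structures: parallel lists of equal-length strings.
    """
    joint, aa_counts, state_counts = Counter(), Counter(), Counter()
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure lengths differ")
        for aa, state in zip(seq, ss):
            joint[(aa, state)] += 1
            aa_counts[aa] += 1
            state_counts[state] += 1
    total = sum(aa_counts.values())
    return {
        (aa, state): propensity(n, state_counts[state], aa_counts[aa], total)
        for (aa, state), n in joint.items()
    }


# Slide example: 580 Ala in helix, 4,000 helix residues, 2,000 Ala, 20,000 residues.
ala_helix = propensity(580, 4000, 2000, 20000)
```

A value above 1 means the amino acid is found in that state more often than expected by chance; here `ala_helix` evaluates to about 1.45, matching the slide.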
Amino Acid Propensities to Secondary Structures Chou-Fasman method
Nearest Neighbor Methods
* The idea is simple: predict the SS of the central residue of a given segment from homologous segments (neighbors). For example, find in a database some number of the sequences closest to a subsequence defined by a window around the central residue, then use max(N_α, N_β, N_c) to assign the SS.
[Figure: window on RSTEVRASRQLAKEKVN (window size); homologous sequences: ECCHHCCECCHHCC; assigned: C]
Key parameters:
1. How to define similarity?
2. What size window of sequence should be examined?
3. How many close sequences should be selected?
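A minimal sketch of the nearest-neighbor idea follows. It makes one concrete choice for each of the three key parameters, and all three are assumptions for illustration: similarity is the number of identical positions, the window is 7 residues (`half=3`), and the `k=3` closest database windows vote. Published methods such as NNSSP use more elaborate similarity scores.

```python
from collections import Counter


def windows(seq, half, pad="-"):
    """One window of size 2 * half + 1 per residue, padded at both ends."""
    padded = pad * half + seq + pad * half
    return [padded[i:i + 2 * half + 1] for i in range(len(seq))]


def identity(a, b):
    """Number of positions where two windows carry the same residue (padding ignored)."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def predict_nearest_neighbor(query, database, half=3, k=3):
    """Predict a state for each query residue by voting among the k closest windows.

    database: list of (sequence, secondary_structure) pairs of equal length.
    """
    # Library of (window, state of its central residue) taken from known structures.
    library = []
    for seq, ss in database:
        for window, state in zip(windows(seq, half), ss):
            library.append((window, state))

    prediction = []
    for qwin in windows(query, half):
        ranked = sorted(library, key=lambda item: identity(qwin, item[0]), reverse=True)
        votes = Counter(state for _, state in ranked[:k])
        # The state with the largest count wins: max(N_alpha, N_beta, N_c).
        prediction.append(votes.most_common(1)[0][0])
    return "".join(prediction)
```

Scanning the whole library for every query window is quadratic and only suitable for toy data. It is written this way to keep the voting step visible.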
The Devil is in the details…
Psi-Pred Method
D. Jones, J. Mol. Biol. 292, 195 (1999).
Method: neural network
Input data: PSSM generated by PSI-BLAST
Bigger and better sequence database: combining several databases and data filtering
Training and test sets preparation:
SS prediction only makes sense for proteins with no homologous structure.
No sequence or structural homologues between training and test sets, checked by CATH and PSI-BLAST (mimicking the realistic situation).
Psi-Pred Method--Neural Network
Window size = 15
Two networks
First network (sequence-to-structure):
(20 + 1) × 15 = 315 inputs; the extra unit indicates where the window spans either the N or C terminus
Data are scaled to the [0, 1] range by using 1/[1+exp(-x)]
75 hidden units
3 outputs (H, E, L)
Second network (structure-to-structure): structural correlation between adjacent sequences
(3 + 1) × 15 = 60 inputs
60 hidden units
3 outputs
Accuracy ~76%
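The input layout of the first network can be sketched as follows: 15 window positions, each contributing 20 logistic-scaled PSSM scores plus one terminus flag, for 315 values in total. This is an illustration of the arithmetic on the slide, not the PSIPRED source code. The function names are hypothetical, and filling positions beyond the chain ends with zeros is an assumption.

```python
import math

WINDOW = 15   # window size from the slide
N_AA = 20     # PSSM columns, one per amino acid


def logistic(x):
    """Scale a raw PSSM score to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def encode_window(pssm, center, window=WINDOW):
    """Build the (20 + 1) * 15 = 315 inputs for the residue at `center`.

    pssm: list of rows, one per residue, each with 20 raw scores.
    """
    half = window // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(logistic(v) for v in pssm[pos])
            features.append(0.0)          # inside the chain
        else:
            features.extend([0.0] * N_AA)  # assumption: zeros beyond the termini
            features.append(1.0)          # extra unit: window spans a terminus
    return features
```

The second network would be encoded the same way, with the three outputs of the first network in place of the 20 PSSM columns, giving (3 + 1) × 15 = 60 inputs.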
Sample Psi-Pred Output
# PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: Confidence (0=low, 9=high) ---very important!!!!
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence

Conf:
Pred: CCHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHCCCCCEEECCCCEEEEEEECCCCCC
  AA: MMWEQFKKEKLRGYLEAKNQRKVDFDIVELLDLINSFDDFVTLSSCSGRIAVVDLEKPGD

Conf:
Pred: CCCCEEEEEECCCCCHHHHHHHHHCCCCCEEEEECCCEEEEECCCHHHHHHHHHHHHHCC
  AA: KASSLFLGKWHEGVEVSEVAEAALRSRKVAWLIQYPPIIHVACRNIGAAKLLMNAANTAG

Conf:
Pred: CCCCCCEECCCEEEEEECCCEEEEEECCCCCEEECHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: FRRSGVISLSNYVVEIASLERIELPVAEKGLMLVDDAYLSYVVRWANEKLLKGKEKLGRL

***Compare the prediction for residues 9 and 17***
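Since the confidence row matters as much as the prediction itself, it helps to read the three rows together. The sketch below parses text laid out like the sample above (`Conf:`, `Pred:` and `AA:` rows in repeated blocks) and lists low-confidence residues. The function names and the threshold of 5 are assumptions. Legend lines such as "Conf: Confidence (0=low, 9=high)" are skipped because their payload contains spaces.

```python
def parse_psipred_horiz(text):
    """Return (sequence, prediction, confidences) from horizontal-format text."""
    conf, pred, aa = [], [], []
    for line in text.splitlines():
        line = line.strip()
        for prefix, store in (("Conf:", conf), ("Pred:", pred), ("AA:", aa)):
            if line.startswith(prefix):
                payload = line[len(prefix):].strip()
                # Skip empty rows and legend lines (which contain spaces).
                if payload and " " not in payload:
                    store.append(payload)
                break
    digits = "".join(conf)
    if not digits.isdigit() and digits:
        raise ValueError("confidence row must contain digits only")
    return "".join(aa), "".join(pred), [int(c) for c in digits]


def low_confidence_positions(confidences, threshold=5):
    """1-based residue numbers whose confidence is below the threshold."""
    return [i for i, c in enumerate(confidences, start=1) if c < threshold]
```

Residues flagged by `low_confidence_positions` are the ones whose predicted state should be treated with the most caution.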
Sample Psi-Pred Output-II
Again, voting rules methods tend to be best

ATKAVCVLKGDGPVQGTIHFEAKGDTVVVTGSITGLTEGDHGFHVHQFGDNTQGCTSAGP 2SOD
CCCCCCCCCCCCCCCCEEHCCHHECEEEEEEEEEEEECCCCCCCCCCCCCCCCCCCCCCC BPS
CCHEEEEECCCCCCCCEEEHHHCCCEEEEEEEEECECCCCCCEEEECCCCCCCCCCCCCC D_R
CCCEEEEEECCCCCEEEEEEEECCCEEEEEEEEEEEECCCCCEEEEECCCCCCCCCCCCC DSC
CCCEEEEECCCCCCCEEEEEECCCCEEEEEEEEECCCCCCCCEEEEEECCCCCCCCCCCC GGR
HHHCEEEECCCCCCCEEEEEECCCCEEEEEECEEEEEECCCCEEEEECCCCCCEEECCCC GOR
CCCCEEEECCCCCCCCCEEECCCCCCEEEEECEEECCCCCCCEEEECCCCCCCCEEECCC H_K
CCCCEEEEECCCCCCCCCEEECCCCCEEEECCCCCCCCCCCEEEEEEEECCCCCCCCCCC K_S
CCCCEEEECCCCCCCCEEEEECCCCEEEEEEEEEEECCCCCCEEEEECCCCCCCCCCCCC JOI
---EEEEE------EEEEEEEEE--EEEEEEEEE-----EEEEEEEE SOD

HFNPLSKKHGGPKDEERHVGDLGNVTADKNGVAIVDIVDPLISLSGEYSIIGRTMVVHEK 2SOD
CCCCCCCCCCCCCCCCCCCCCCECCCCCCHEECCCCCCCCCECCEECEEEEEEEEEEECC BPS
CCCCCCCCCCCCCCCHHCECCCCCECCCCCCEEEEEEECCEEEECCCEEEEEEEEEEECC D_R
CCCCCCCCCCCCCCEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCCEEEEEEEEEEECC DSC
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEECCCCCCCCCCEEEECEEEEEECC GGR
CCCCCCCCCCCCCCHHEEECCCCCCCCCCCCEEEEEEECCEEECCCCEEEEEEEEEECCC GOR
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEECCCCCCCCCCCCCCHHHHHHEECCC H_K
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEEEEEEEEEECCCEEECCEEEEEEE K_S
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCEEEEEECCCCECCCCCEEEEEEEEEEECC JOI
EEEEEE------EEEEEEE EEEEE-- 2SOD
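A voting rule over several predictions can be sketched as a per-column majority vote. This is a generic illustration, not the rule used to produce the JOI row above. The function name and the tie-break order (coil, then strand, then helix) are assumptions.

```python
from collections import Counter


def consensus(predictions, tie_order="CEH"):
    """Per-residue majority vote over equal-length prediction strings.

    Ties are broken by the first state in `tie_order` (an assumption).
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        tied = [s for s in tie_order if counts.get(s, 0) == best]
        # Fall back to the most common symbol if it is not listed in tie_order.
        result.append(tied[0] if tied else counts.most_common(1)[0][0])
    return "".join(result)


# First eight residues of three of the rows above (BPS, D_R, DSC).
example = consensus(["CCCCCCCC", "CCHEEEEE", "CCCEEEEE"])
```

With these three rows, the isolated H in the third column is outvoted and the strand starting at the fourth residue is kept, so `example` is "CCCEEEEE".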
Prediction Accuracy (EVA) EVA: Automatic evaluation of prediction servers
How Far Can We Go?
Currently ~76%. Proteins with more than 100 homologues: 80%.
Assignment is ambiguous (5-15%). Recall DSSP vs STRIDE.
-- non-unique protein structures (dynamic), H-bond cutoff, etc.
Different secondary structures between homologues (~12%).
Non-locality. Secondary structure is influenced by long-range interactions.
-- Some segments can have multiple structure types (chameleon sequences).
Solvent accessibility
Conceptually similar problem to SS prediction: Buried vs. Exposed.
Weighted Ensemble Solvent Accessibility predictor:
E E E E E E B B B B B B (E: exposed, B: buried)
Why bother?
To provide structural context for putative mutations that one wants to characterize biochemically or biophysically.
Transmembrane Segment Prediction
Again, conceptually similar problem to SS prediction: TM vs. Not.