Structure Prediction (I): Secondary structure
DNA/Protein structure-function analysis and prediction
Lecture 7
Center for Integrative Bioinformatics VU, Faculty of Sciences
Protein secondary structure

[Figure: the 20 amino acid types, a generic residue, the peptide bond, an alpha-helix and beta strands/sheets]

Protein primary structure: SARS protein from Staphylococcus aureus

  1 MKYNNHDKIR DFIIIEAYMF RFKKKVKPEV
 31 DMTIKEFILL TYLFHQQENT LPFKKIVSDL
 61 CYKQSDLVQH IKVLVKHSYI SKVRSKIDER
 91 NTYISISEEQ REKIAERVTL FDQIIKQFNL
121 ADQSESQMIP KDSKEFLNLM MYTMYFKNII
151 KKHLTLSFVE FTILAIITSQ NKNIVLLKDL
181 IETIHHKYPQ TVRALNNLKK QGYLIKERST
211 EDERKILIHM DDAQQDHAEQ LLAQVNQLLA
241 DKDHLHLVFE

First two levels of protein structure: the same sequence with its secondary structure assignment (DSSP notation) below each line:

  1 MKYNNHDKIR DFIIIEAYMF RFKKKVKPEV DMTIKEFILL TYLFHQQENT
    SHHH HHHHHHHHHH HHHHHHTTT SS HHHHHHH HHHHS S SE
 51 LPFKKIVSDL CYKQSDLVQH IKVLVKHSYI SKVRSKIDER NTYISISEEQ
    EEHHHHHHHS SS GGGTHHH HHHHHHTTS EEEE SSSTT EEEE HHH
101 REKIAERVTL FDQIIKQFNL ADQSESQMIP KDSKEFLNLM MYTMYFKNII
    HHHHHHHHHH HHHHHHHHHH HTT SS S SHHHHHHHH HHHHHHHHHH
151 KKHLTLSFVE FTILAIITSQ NKNIVLLKDL IETIHHKYPQ TVRALNNLKK
    HHH SS HHH HHHHHHHHTT TT EEHHHH HHHSSS HHH HHHHHHHHHH
201 QGYLIKERST EDERKILIHM DDAQQDHAEQ LLAQVNQLLA DKDHLHLVFE
    HTSSEEEE S SSTT EEEE HHHHHHHHH HHHHHHHHTS SS TT SS
Why predict when we can get the real thing?

PDB: protein structures. UniProt Release 3.5 consists of Swiss-Prot (protein sequences) and TrEMBL (protein sequences); sequences far outnumber solved structures.

Levels of protein structure and how well we can predict them:
- Primary structure: no problems
- Secondary structure: overall 77% accurate at predicting
- Tertiary structure: overall 35% accurate at predicting
- Quaternary structure: no reliable means of predicting yet
- Function: do you feel like guessing?

Secondary structure is derived from tertiary coordinates, and to get to tertiary structure we need NMR or X-ray crystallography. We have an abundance of primary sequences, so why not use them?
Some SSE rules that help

ALPHA-HELIX: hydrophobic-hydrophilic residue periodicity patterns.
BETA-STRAND: edge and buried strands show different hydrophobic-hydrophilic residue periodicity patterns. [Figure: edge and buried strands in a sheet]
OTHER: loop regions contain a high proportion of small polar residues such as alanine, glycine, serine and threonine. Glycine is abundant because of its flexibility; proline is abundant for entropic reasons related to the rigidity with which it kinks the main-chain. Because proline kinks the main-chain in a way that is incompatible with helices and strands, it is normally not observed in these two structures, although it can occur in the N-terminal two positions of alpha-helices.
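The periodicity patterns above can be quantified. Below is a minimal sketch (not a method from this lecture) that scores a window's hydrophobic moment: the hydrophobicity profile projected at roughly 100 degrees per residue, the alpha-helical repeat of ~3.6 residues per turn. The hydrophobicity values are the standard Kyte-Doolittle scale; the test sequences are illustrative.

```python
import math

# Kyte-Doolittle hydrophobicity scale
HYDROPHOBICITY = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def hydrophobic_moment(window, delta_deg=100.0):
    """Magnitude of the hydrophobic moment of a residue window.

    delta_deg = 100 corresponds to the alpha-helical repeat;
    values around 160-180 would probe the alternating (edge-strand)
    periodicity of beta-strands instead.
    """
    delta = math.radians(delta_deg)
    sin_sum = sum(HYDROPHOBICITY[aa] * math.sin(i * delta)
                  for i, aa in enumerate(window))
    cos_sum = sum(HYDROPHOBICITY[aa] * math.cos(i * delta)
                  for i, aa in enumerate(window))
    return math.hypot(sin_sum, cos_sum)

# An amphipathic window scores high at helical periodicity:
print(hydrophobic_moment("LKALKEALKA"))          # helix probe
print(hydrophobic_moment("LKALKEALKA", 170.0))   # strand probe
```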
Historical background

Using computers to predict protein secondary structure had its onset some 30 years ago (Nagano (1973) J. Mol. Biol., 75, 401), working on single sequences. The accuracy of the computational methods devised early on was in the range of 50-56% (Q3). The highest accuracy was achieved by Lim, with a Q3 of 56% (Lim, V. I. (1974) J. Mol. Biol., 88, 857). The most widely used method was that of Chou and Fasman (Chou, P. Y. and Fasman, G. D. (1974) Biochemistry, 13, 211). Random prediction would yield about 40% (Q3) correctness, given the observed distribution of the three states H, E and C in globular proteins (generally about 30% helix, 20% strand and 50% coil).

Nagano 1973 – considered interactions of residues within a window of 6. The interactions were linearly combined to calculate interacting-residue propensities for each SSE type (H, E or C) over 95 crystallographically determined protein tertiary structures.

Lim 1974 – predictions are based on a set of complicated stereochemical prediction rules for helices and sheets, derived from their observed frequencies in globular proteins.

Chou-Fasman 1974 – predictions are based on differences in residue-type composition for three states of secondary structure: helix, strand and turn (i.e., neither helix nor strand). Neighbouring residues are checked for helices and strands; predicted types are selected according to the higher-scoring preference and extended as long as disfavoured residues (e.g. proline) are not encountered and the scores remain high (a propensity-based sketch follows below).
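To make the propensity idea concrete, here is a minimal sketch of Chou-Fasman-style window scoring. The few propensity values listed are commonly quoted Chou-Fasman table entries, but the decision rule is heavily simplified (no nucleation or extension rules, and missing residues default to a neutral 1.0), so treat it as illustrative only.

```python
# A few commonly quoted Chou-Fasman propensities; the full published
# tables cover all 20 residues and include turn propensities as well.
P_HELIX  = {'A': 1.42, 'E': 1.51, 'L': 1.21, 'G': 0.57, 'P': 0.57, 'V': 1.06, 'K': 1.16}
P_STRAND = {'A': 0.83, 'E': 0.37, 'L': 1.30, 'G': 0.75, 'P': 0.55, 'V': 1.70, 'K': 0.74}

def chou_fasman_like(seq, win=6):
    """Assign H/E/C per residue from windowed average propensities.

    The real Chou-Fasman method uses nucleation and extension rules;
    this sketch keeps only the core idea: compare averaged helix vs
    strand propensities and call coil when neither is favoured (>1.0).
    """
    half = win // 2
    states = []
    for i in range(len(seq)):
        window = seq[max(0, i - half):i + half + 1]
        ph = sum(P_HELIX.get(a, 1.0) for a in window) / len(window)
        pe = sum(P_STRAND.get(a, 1.0) for a in window) / len(window)
        if ph > pe and ph > 1.0:
            states.append('H')
        elif pe > ph and pe > 1.0:
            states.append('E')
        else:
            states.append('C')
    return ''.join(states)

print(chou_fasman_like("MKYNNHDKIRDFIIIEAYMF"))
```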
The older standard: GOR

The GOR method (version IV) was reported by its authors to achieve a single-sequence prediction accuracy of 64.4%, as assessed through jackknife testing over a database of 267 proteins with known structure (Garnier, J. G., Gibrat, J.-F. and Robson, B. (1996) In: Methods in Enzymology (Doolittle, R. F., Ed.), Vol. 266). The GOR method relies on the frequencies observed for residues in a 17-residue window (i.e. eight residues N-terminal and eight C-terminal of the central window position) for each of the three structural states.
The sliding window: GOR

[Figure: a constant window of n residues slides along a sequence of known structure; the frequencies of the residues in the window are converted to probabilities of observing an SSE type (e.g. E H H H E E E E) at the central residue.]

The amino acid frequencies are converted to secondary structure propensities for the central window position using an information function based on conditional probabilities. As it is not feasible to sample all possible 17-residue fragments directly from the PDB (there are 20^17 possibilities), increasingly complex approximations have been applied:
- In GOR I and GOR II, the 17 positions in the window were treated as independent, so single-position information could be summed over the 17-residue window.
- In GOR III, this approach was refined by including pair frequencies derived from the 16 pairs between each non-central residue and the central residue in the 17-residue window.
- The current version, GOR IV, combines pair-wise information over all possible paired positions in the window.
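A minimal sketch of the GOR I/II-style approximation: estimate single-position log-odds information values from a training set and sum them over the window at prediction time. The smoothing constants and the add-one counts are arbitrary choices, and this is a simplification of the published information function.

```python
import math
from collections import defaultdict

def train_gor1(sequences, structures, half=8):
    """GOR I-style training: single-position information values.

    For each (window offset, residue) pair, estimate the log-odds of
    each SSE state at the central position. Positions in the 17-residue
    window (half = 8) are treated as independent, as in GOR I/II.
    """
    counts = defaultdict(lambda: defaultdict(float))
    state_totals = defaultdict(float)
    for seq, ss in zip(sequences, structures):
        for i, state in enumerate(ss):
            state_totals[state] += 1
            for off in range(-half, half + 1):
                if 0 <= i + off < len(seq):
                    counts[(off, seq[i + off])][state] += 1
    total = sum(state_totals.values())
    info = {}
    for key, per_state in counts.items():
        obs = sum(per_state.values())
        # Add-one smoothed conditional probability vs the state prior
        info[key] = {s: math.log((per_state.get(s, 0.0) + 1.0) / (obs + 3.0))
                        - math.log(state_totals[s] / total)
                     for s in 'HEC'}
    return info

def predict_gor1(seq, info, half=8):
    """Sum single-position information over the window, per position."""
    pred = []
    for i in range(len(seq)):
        scores = {s: 0.0 for s in 'HEC'}
        for off in range(-half, half + 1):
            key = (off, seq[i + off]) if 0 <= i + off < len(seq) else None
            if key in info:
                for s in 'HEC':
                    scores[s] += info[key][s]
        pred.append(max(scores, key=scores.get))
    return ''.join(pred)
```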
Accuracy burst due to four separate improvements

1) Using multiple sequence alignments instead of single-sequence input
2) More advanced decision-making algorithms
3) Improvement of sequence database search tools:
   - PSI-BLAST (Altschul et al., 1997) – most widely used
   - SAM (Karplus et al., 1998)
4) Increasingly larger database sizes (more candidates)
Using multiple sequence alignments

Zvelebil et al. (1987) were the first to exploit multiple sequence alignments to predict secondary structure automatically, by extending the GOR method; they reported predictions improved by 9% compared to single-sequence prediction. Multiple alignments, as opposed to single sequences, offer a much better means of recognising positional physicochemical features such as hydrophobicity patterns. Moreover, they provide better insight into the positional constraints on amino acid composition. Finally, the placement of gaps in the alignment can be indicative of loop regions. Levin et al. (1993) also quantified the effect and observed 8% increased accuracy when multiple alignments of homologous sequences with sequence identities of 25% were used. As a consequence, the current state-of-the-art methods all use input information from multiple sequence alignments, but they are sensitive to alignment quality.

Example: PHD predictions over successive PSI-BLAST iterations for cheY (PDB code 3chy), compared with the DSSP assignment:

AA         |ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP|
INIT PHD   | EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HHHEEE |
Iter 1 PHD | EEEEEEEE HHHHHHHHHHHHHHH HHHHHHHH EEEEEE |
Iter 2 PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHH EEEEEE |
Iter 3 PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
Iter 4 PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEE |
Iter 5 PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
Iter 6 PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE |
Iter 7 PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
Iter 8 PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEEE |
Iter 9 PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHHH EEEEE |
DSSP       | TT EEEE S HHHHHHHHHHHHHHT EEEESSHHHHHHHHHH EEEEES S|

AA         |NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM|
INIT PHD   | HHHHHHEEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH |
Iter 1 PHD | HHHHHHEEEEEE HHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 2 PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 3 PHD | HHHHHHHHHHHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 4 PHD | HHHHH EEEEE HHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 5 PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 6 PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |
Iter 7 PHD | HHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 8 PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 9 PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |
DSSP       |SS HHHHHHHHHH TTTTT EEEEESS HHHHHHHHHTT SEEEESS HHHHHHHHHHHHHHHT |
Improved methods: k-nearest neighbour

Requires an initial training phase.

TRAINING: sequence fragments of a certain length, derived from a database of known structures, are used so that the central residue of each fragment can be assigned its true secondary structure state as a label. A window of the same length is then slid over the query sequence (or multiple alignment), and for each window the k most similar fragments are determined using a similarity criterion. The distribution of the k secondary structure labels thus obtained is then used to derive propensities for the three SSE states (H, E or C).

[Figure: a sliding window over the query sequence is compared against fragments from a database of known structures; fragments whose similarity is good enough contribute their central-residue labels (H, E or C).]
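A minimal sketch of this scheme. The similarity criterion here is a plain identity count; real implementations use substitution-matrix or profile-based scores, and the fragment database layout is an assumption of this sketch.

```python
def knn_predict(query, fragments, k=25, win=13):
    """k-NN secondary structure prediction sketch.

    `fragments` is a list of (fragment_sequence, central_state) pairs
    of length `win`, harvested from structures with known secondary
    structure. Each query window votes with the labels of its k most
    similar fragments.
    """
    half = win // 2
    pred = []
    for i in range(len(query)):
        window = query[max(0, i - half):i + half + 1]
        scored = sorted(
            fragments,
            key=lambda fs: sum(a == b for a, b in zip(window, fs[0])),
            reverse=True,
        )
        # Distribution of labels among the k nearest fragments acts
        # as the propensity for the three SSE states
        votes = [state for _, state in scored[:k]]
        pred.append(max('HEC', key=votes.count))
    return ''.join(pred)
```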
Improved methods: neural networks

Neural networks are learning systems based upon complex non-linear statistics. They are organised as interconnected layers of input and output units, and can also contain intermediate (or "hidden") unit layers (neurons). Each unit in a layer receives information from one or more connected units and determines its output signal based on the weights of the input signals (synapses).

A neural network has to be trained. TRAINING: as with k-NN, windows from sequences of known structure are used, but this time the information is used to adjust the weights of the internal connections, optimising the mapping of a set of input patterns onto a set of output patterns. It is normally difficult to understand the internal functioning of the network. Beware of overtraining the network.

[Figure: a sliding window over the query sequence feeds a neural network trained on a sequence database of known structures; the weights are adjusted according to the model used to handle the input data.]
Neural networks: training

Training an NN:
- Forward pass: the outputs are calculated and the error at the output units is determined.
- Backward pass: the output-unit error is used to alter the weights on the output units. Then the error at the hidden nodes is calculated (by back-propagating the error at the output units through the weights), and the weights on the hidden nodes are altered using these values.

For each data pair to be learned, a forward pass and a backward pass are performed. This is repeated over and over again until the error is at a low enough level (or we give up).

A common unit response function is

Y = 1 / (1 + exp(-k * Σ_in (W_in * X_in))),

where W_in is a weight and X_in an input.

[Figure: the unit output for k = 0.5, 1 and 10, as the activation varies from -10 to 10.]
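A minimal numpy sketch of the forward and backward passes described above, for a single-hidden-layer network mapping a one-hot-encoded 13-residue window to three output states (H, E, C). The layer sizes, learning rate and iteration count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 13-residue window, one-hot encoded (13 * 20 inputs),
# one hidden layer, 3 outputs (H, E, C). All sizes are arbitrary.
n_in, n_hidden, n_out = 13 * 20, 30, 3
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))

def sigmoid(a, k=1.0):
    # The unit response Y = 1 / (1 + exp(-k * sum(W * X))) from the slide
    return 1.0 / (1.0 + np.exp(-k * a))

def train_step(x, target, lr=0.1):
    """One forward pass plus one backward (back-propagation) pass."""
    global W1, W2
    # Forward pass: compute outputs and the error at the output units
    h = sigmoid(x @ W1)
    y = sigmoid(h @ W2)
    err_out = (y - target) * y * (1.0 - y)        # delta at the output units
    # Backward pass: propagate the output error through W2 to the hidden units
    err_hid = (err_out @ W2.T) * h * (1.0 - h)    # delta at the hidden units
    W2 -= lr * np.outer(h, err_out)
    W1 -= lr * np.outer(x, err_hid)
    return float(np.sum((y - target) ** 2))

# Repeat forward + backward passes until the error is low enough
x = rng.random(n_in)
target = np.array([1.0, 0.0, 0.0])                # true state: helix
for epoch in range(200):
    sse = train_step(x, target)
print(sse)
```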
Diversity and alignment size give better predictions

PSIPRED (Jones, 1999), the reigning secondary structure prediction method of the last five years, incorporates multiple sequence information from database searching and neural networks. The method exploits the position-specific scoring matrices (PSSMs) generated by the PSI-BLAST algorithm (Altschul et al., 1997) and feeds those to a two-layered neural network. Since the method invokes the PSI-BLAST database search engine to gather information from related sequences, it needs only a single sequence as input. The accuracy of PSIPRED is 76.5%, as evaluated by the author. An investigation into the effects of larger databases and more accurate sequence selection methods has shown that these improvements provide better and more diverse MSAs for secondary structure prediction (Przybylski, D. and Rost, B. (2002) Proteins, 46).
PHD, PHDpsi, PROFsec

The PHD method (Profile network from HeiDelberg) broke the 70% barrier of prediction accuracy (Rost and Sander, 1993). Since the original method, the BLAST search and MAXHOM alignment routines have been replaced by PSI-BLAST in PHDpsi, and more recently the use of complex bi-directional neural networks has given rise to PROFsec, which is a close competitor to, and in many cases better than, PSIPRED.

PHD uses three neural networks:
1) A 13-residue window slides over the alignment and produces raw three-state secondary structure predictions.
2) A 17-residue window filters the output of network 1. The output of the second network comprises, for each alignment position, three adjusted state probabilities. This post-processing step for the raw predictions of the first network is aimed at correcting unfeasible predictions; it would, for example, change HHHEEHH into HHHHHHH.
3) A network for a so-called jury decision between networks 1 and 2 and a set of independently trained networks (extra predictions to correct for training biases). The predictions obtained by the jury network undergo a final simple filtering step that deletes predicted helices of one or two residues, changing those into coil (a sketch of this step follows below).
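The final filtering step lends itself to a one-line sketch: convert every isolated run of one or two predicted helix residues into coil. This is a simplified stand-in for PHD's actual filter, assuming a plain H/E/C prediction string.

```python
import re

def filter_short_helices(pred):
    """Delete predicted helices of one or two residues, changing them
    into coil, as in PHD's final filtering step (simplified)."""
    return re.sub(r'(?<!H)H{1,2}(?!H)',
                  lambda m: 'C' * len(m.group()), pred)

print(filter_short_helices("CCHHCCHHHHHCCHCC"))  # -> CCCCCCHHHHHCCCCC
```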
How to develop a secondary structure prediction method

[Flow diagram:]
- Database of N sequences with known structure
- Training set of K < N sequences with known structure (for a jackknife test: K = N-1)
- Test set of T << N sequences with known structure (for a jackknife test: T = 1)
- Method → trained method → prediction
- Prediction + standard of truth → assessment method(s) → prediction accuracy
- Other method(s)' predictions → method benchmark
- For a full jackknife test: repeat the process N times and average the prediction scores
The jackknife test

A jackknife test is a test scenario for prediction methods that need to be tuned using a training database. In its simplest form: for a database containing N sequences with known tertiary (and hence secondary) structure, a prediction is made for one test sequence after training the method on the training database containing the N-1 remaining sequences (one-at-a-time jackknife testing). A complete jackknife test involves N such predictions. If N is large enough, meaningful statistics can be derived from the observed performance. For example, the mean prediction accuracy and associated standard deviation give a good indication of the sustained performance of the method tested. If this is computationally too expensive, the database can be split into larger groups, which are then jackknifed.
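A minimal sketch of one-at-a-time jackknifing. The callables `train`, `predict` and `assess` are hypothetical stand-ins for any trainable prediction method and scoring function.

```python
def jackknife(dataset, train, predict, assess):
    """One-at-a-time jackknife: train on N-1 entries, test on the one
    left out, repeat N times, then report mean accuracy and its spread.

    `dataset` is a list of (sequence, true_ss) pairs; `train`, `predict`
    and `assess` are hypothetical callables.
    """
    scores = []
    for i, (seq, true_ss) in enumerate(dataset):
        training_set = dataset[:i] + dataset[i + 1:]   # the N-1 others
        model = train(training_set)
        scores.append(assess(predict(model, seq), true_ss))
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / (n - 1)) ** 0.5
    return mean, sd   # sustained performance and its standard deviation
```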
Protein secondary structure: standards of truth

What is a standard of truth? – a structurally derived secondary structure assignment.
Why do we need one? – it dictates how accurate our prediction is.
How do we get it? – methods use the hydrogen-bonding patterns along the main-chain to define the secondary structure elements (SSEs):
1) DSSP (Kabsch and Sander, 1983) – most popular
2) STRIDE (Frishman and Argos, 1995)
3) DEFINE (Richards and Kundrot, 1988)

DSSP annotation:
- Helix: 3/10-helix (G), alpha-helix (H), pi-helix (I)
- Strand: beta-strand (E), beta-bulge (B)
- Turn: H-bonded turn (T), bend (S)
- Rest: coil (" ")
Assessing prediction accuracy

How do we decide how good a prediction is?
1) Qn: the number of correctly predicted residues over the total number of residues, for n SSE states (Q3 for the three-state case).
2) SOV: a segment-overlap measure of correctly predicted SSE states, with higher penalties for errors in core segment regions (Zemla et al., 1999).
3) MCC: the Matthews correlation coefficient over the predictions, which takes into account how many prediction errors were made for each state.

Which one would you use? Consider the biological impact of the information: what are you testing, and what is your prediction used for? To make sense of the scores, compare against your selected standard of truth, and use all three measures to get a better picture.
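Q3 is simple enough to state in a few lines. A minimal sketch, assuming plain H/E/C strings of equal length for the prediction and the standard of truth:

```python
def q3(predicted, observed):
    """Q3: correctly predicted residues over all residues (3 states)."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

print(q3("HHHECCC", "HHHHCCC"))  # 6 of 7 correct -> ~85.7%
```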
Automated evaluation initiatives

- The EVA server
- CASP (which also includes fold recognition assessments) and CAFASP: biennial experiments

With the number of methods freely available online, biologists are puzzled and have no way of knowing which one to use. These initiatives allow continual evaluation on sequences as they are added to the PDB, using DSSP as the standard of truth.

LET'S GO TO THE WEB…
The consensus superiority

Deriving a consensus from multiple methods is consistently more accurate than any one individual method used alone. Early on, Jpred (Cuff and Barton, 1998) investigated weighted and unweighted multiple-method majority voting, with an upper limit of a 4% increase. Nowadays, any three top-scoring methods can be improved by 1.5-2% through simple majority-voting consensus. It is the three-clocks-on-a-boat scenario: if one clock goes wrong, the likelihood that the other two will go wrong at the same time and in the same way is very low. We are currently completing a dynamic-programming consensus algorithm that produces an optimally segmented consensus, which is more biologically correct than simple majority voting, and we intend to set it as a standard on EVA for method-consensus evaluation.

[Figure: a set of predictions (e.g. HHHEEEECE) is stacked per position and the most-observed state at each position is kept as correct.]
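Simple majority-voting consensus fits in a few lines. A minimal sketch, assuming equal-length H/E/C prediction strings from several methods (the example inputs are invented):

```python
from collections import Counter

def consensus(predictions):
    """Per-position majority vote over several methods' predictions:
    at each position the most-observed state is kept as correct."""
    return ''.join(Counter(column).most_common(1)[0][0]
                   for column in zip(*predictions))

# Three hypothetical method outputs for the same sequence:
print(consensus(["HHHEEEECE",
                 "HHHEEECCE",
                 "HHCEEEECE"]))  # -> HHHEEEECE
```

Note that this per-position vote can break up segments; the dynamic-programming consensus mentioned above addresses exactly that by voting over optimally chosen segments.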
A stepwise hierarchy

1) Sequence database searching: PSI-BLAST, SAM-T2K
2) Multiple sequence alignment of the selected sequences: PSSMs, HMM models, MSAs
3) Secondary structure prediction of the query sequence based on the generated MSAs:
   - single methods: PHD, PROFsec, PSIPRED, SSPro, JNET, YASPIN
   - consensus methods
[Flow diagram 1, the standard pipeline:]
- Step 1, database sequence search: a single sequence is searched against a sequence database with PSI-BLAST or SAM-T2K, yielding homologous sequences (plus a PSSM/check file from PSI-BLAST, or an HMM model from SAM-T2K).
- Step 2, MSA: an MSA method aligns the homologous sequences.
- Step 3, SS prediction: trained machine-learning algorithm(s) turn the MSA into a secondary structure prediction.

[Flow diagram 2, the iterative variant:] the same three steps, but with iterative MSA/SS-prediction mutual optimisation and iterative homologue detection by the optimised information, yielding an optimised MSA and SS prediction. A sketch of both pipelines follows below.
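A minimal sketch of the two pipelines as plain function composition. All callables (`searcher`, `aligner`, `predictor`) and the `guide` parameter are hypothetical stand-ins, not real tool interfaces; wiring in PSI-BLAST/SAM-T2K, an MSA method and a trained predictor is left open.

```python
def predict_pipeline(sequence, database, searcher, aligner, predictor):
    """Standard three-step pipeline (hypothetical callables throughout)."""
    homologues = searcher(sequence, database)   # Step 1: database sequence search
    msa = aligner([sequence] + homologues)      # Step 2: MSA of the homologues
    return predictor(msa)                       # Step 3: SS prediction from the MSA

def iterative_pipeline(sequence, database, searcher, aligner, predictor,
                       rounds=3):
    """Iterative variant: each round's SS prediction guides the next
    round's homologue detection and alignment (mutual optimisation)."""
    prediction = None
    for _ in range(rounds):
        homologues = searcher(sequence, database, guide=prediction)
        msa = aligner([sequence] + homologues, guide=prediction)
        prediction = predictor(msa)
    return prediction
```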