1 "Using Sequence Motifs for Enhanced Neural Network Prediction of Protein Distance Constraints"
J. Gorodkin, O. Lund, C. A. Andersen, S. Brunak. ISMB 99.
Bio Learning Group Seminar. Speaker: Eom Jae-Hong, 2001/03/19
2 Abstract
The paper investigates the correlation between sequence separation and distance. For pairs of amino acids whose distance is smaller than the threshold, characteristic sequence motifs are found. The motifs change as the sequence separation increases. Correlations are also found between the residues in the center of the motif. These results, together with further statistical analysis, are used to design a new neural network (NN). The statistical analysis explains why neural networks perform better than simple statistical data-driven approaches (e.g., pair probability density functions).
3 Introduction
The ability to adopt structure from sequences depends on constructing an appropriate cost function for the native structure. To find this function, the authors concentrate on a method for predicting distance constraints that correlate well with the observed distances in proteins. The neural network approach is the only approach so far that includes sequence context for the considered pair of amino acids. It performs better and captures more features relating distance constraints to sequence composition.
4 Introduction (cont'd)
The analysis includes an investigation of the distances between amino acids, as well as of sequence motifs and correlations for separated residues. From this, a prediction scheme is constructed that significantly improves on an earlier approach (Lund et al. 1997). For each sequence separation, the corresponding distance threshold is computed as the average of all physical distances, in a large data set, between any two amino acids separated by that number of residues (Lund et al. 1997).
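The threshold definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the input format (one list of (x, y, z) coordinates per chain, one coordinate per residue) and the function name `mean_distance_thresholds` are assumptions made for the example.

```python
import math
from collections import defaultdict


def mean_distance_thresholds(chains, max_separation):
    """Return {separation: mean distance} over all residue pairs in all chains.

    `chains` is a list of chains; each chain is a list of (x, y, z)
    coordinates, one per residue (hypothetical input format).
    """
    total = defaultdict(float)  # sum of distances per sequence separation
    count = defaultdict(int)    # number of residue pairs per sequence separation
    for coords in chains:
        n = len(coords)
        for i in range(n):
            # Only pairs (i, j) with 1 <= j - i <= max_separation are visited.
            for j in range(i + 1, min(n, i + max_separation + 1)):
                total[j - i] += math.dist(coords[i], coords[j])
                count[j - i] += 1
    return {s: total[s] / count[s] for s in sorted(count)}


# Toy chain: four residues spaced 3.8 Angstrom apart on a straight line.
toy = [[(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]]
thresholds = mean_distance_thresholds(toy, max_separation=3)
```

A pair at separation s would then be labelled "below" or "above" by comparing its distance with `thresholds[s]`.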
5 Introduction (cont'd)
Here, the analysis also covers the distance distributions relative to these thresholds, which is used to explain the qualitative behavior of the neural network prediction scheme. Analysis of the network weight composition reveals intriguing properties of the distance constraints: "The sequence motifs can be decomposed into sub-motifs associated with each of the hidden units in the neural network." As the sequence separation increases, there is a clear correspondence in the change of the mean value, the distance distribution, and the sequence motifs describing the distance constraints of the separated amino acids. The predicted distance constraints may be used as inputs to threading or loop modeling algorithms.
6 Material and Method – Data extraction
The data were extracted from the Brookhaven Protein Data Bank (Bernstein et al. 1977), containing 5762 proteins.
Entries were excluded if: the secondary structure of the protein could not be assigned by the program DSSP (Kabsch & Sander 1983); the protein had any physical chain breaks; or the resolution value was greater than 2.5 Angstrom.
Individual chains of entries were discarded if: they had a length of less than 30 amino acids; they had less than 50% secondary structure assignment as defined by the program DSSP; they had more than 85% non-amino acids in the sequence; or they had more than 10% non-standard amino acids (B, X, Z).
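The exclusion rules above amount to two predicates, one per entry and one per chain. The sketch below is only illustrative: the `Chain` record and its field names are hypothetical, and the fractions are assumed to have been computed beforehand (for example from DSSP output).

```python
from dataclasses import dataclass

NON_STANDARD = set("BXZ")  # non-standard amino acid codes named on the slide


@dataclass
class Chain:
    sequence: str                        # one-letter residue codes
    secondary_structure_fraction: float  # fraction of residues with a DSSP assignment
    non_amino_fraction: float            # fraction of non-amino-acid positions


def keep_entry(resolution, has_chain_break, dssp_assigned):
    """Entry-level filter: DSSP must succeed, no chain breaks, resolution <= 2.5."""
    return dssp_assigned and not has_chain_break and resolution <= 2.5


def keep_chain(chain):
    """Chain-level filter following the four discard rules."""
    n = len(chain.sequence)
    if n < 30:
        return False
    if chain.secondary_structure_fraction < 0.50:
        return False
    if chain.non_amino_fraction > 0.85:
        return False
    non_standard = sum(1 for aa in chain.sequence if aa in NON_STANDARD) / n
    return non_standard <= 0.10


good = Chain("A" * 40, 0.6, 0.0)
too_short = Chain("A" * 20, 0.6, 0.0)
too_many_x = Chain("A" * 30 + "X" * 10, 0.6, 0.0)
```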
7 Material and Method – Data extraction
A representative set with low pairwise sequence similarity was selected by running algorithm #1 of Hobohm et al. (1992), implemented in the program RedHom (Lund et al. 1997). The sequences were sorted according to resolution (all NMR structures were assigned resolution 100); sequences with the same resolution were sorted so that higher priority was given to longer proteins. The sequences were aligned with the local alignment program ssearch (Myers & Miller 1988; Pearson 1990), using the pam120 amino acid substitution matrix (Dayhoff & Orcutt 1978) with gap penalties -12, -4. A cutoff threshold for sequence similarity was used, expressed in terms of the % of identity in the alignment.
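The selection procedure above can be illustrated as a greedy pass over the sorted list, in the spirit of algorithm #1 of Hobohm et al. (1992). This is a sketch under stated assumptions, not RedHom: the entry dictionaries are a hypothetical format, and `toy_similar` is a stand-in for the ssearch alignment and the identity cutoff, neither of which is reproduced here.

```python
def sort_key(entry):
    # Lower resolution value first; ties are broken in favour of longer chains.
    return (entry["resolution"], -len(entry["sequence"]))


def select_representatives(entries, is_similar):
    """Keep an entry only if it is not similar to any entry kept so far."""
    selected = []
    for entry in sorted(entries, key=sort_key):
        if not any(is_similar(entry["sequence"], kept["sequence"]) for kept in selected):
            selected.append(entry)
    return selected


def toy_similar(a, b, cutoff=0.5):
    """Stand-in similarity (NOT ssearch): ungapped identity over the shorter sequence."""
    n = min(len(a), len(b))
    identical = sum(1 for x, y in zip(a, b) if x == y)
    return identical / n > cutoff


entries = [
    {"id": "a", "resolution": 1.5, "sequence": "ACDEFGHIKL"},
    {"id": "b", "resolution": 2.0, "sequence": "ACDEFGHIKV"},    # near-copy of "a"
    {"id": "c", "resolution": 100.0, "sequence": "WWWWYYYYPP"},  # NMR structure
    {"id": "d", "resolution": 1.5, "sequence": "MNPQRSTVWYAA"},  # longer than "a"
]
kept_ids = [e["id"] for e in select_representatives(entries, toy_similar)]
```

In this toy run, "d" is visited before "a" because it is longer at the same resolution, and "b" is dropped as a near-copy of "a".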
8 Material and Method – Data extraction
Ten cross-validation sets were selected such that they all contain approximately the same number of residues and all have the same length distribution of the chains. All the data are made publicly available at http://www.dtu.dk/service/distanceP/.
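The slides do not say how the ten sets were built. One simple way to obtain sets with approximately equal residue counts, shown here purely as an assumption, is to visit the chains from longest to shortest and always add the next chain to the set that currently holds the fewest residues. Dealing chains out in length order also tends to spread long and short chains across the sets.

```python
import heapq


def partition_chains(chain_lengths, n_sets=10):
    """Split chains into n_sets groups with roughly equal total residue counts.

    `chain_lengths` maps a chain id to its length (hypothetical input format).
    """
    heap = [(0, k) for k in range(n_sets)]  # (residues so far, set index)
    heapq.heapify(heap)
    sets = [[] for _ in range(n_sets)]
    # Longest chains first; the chain id breaks ties so the result is deterministic.
    for chain_id, length in sorted(chain_lengths.items(), key=lambda kv: (-kv[1], kv[0])):
        residues, k = heapq.heappop(heap)  # set with the fewest residues so far
        sets[k].append(chain_id)
        heapq.heappush(heap, (residues + length, k))
    return sets


lengths = {f"c{i}": 30 + i for i in range(40)}  # toy chains of length 30..69
cv_sets = partition_chains(lengths, n_sets=10)
totals = [sum(lengths[c] for c in s) for s in cv_sets]
```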
9 - Information Content / Relative entropy measure
Relative entropy is used to measure the information content (Kullback & Leibler 1951) of aligned regions between separated residues. The information content at each position will sometimes be displayed as a sequence logo (Schneider & Stephens 1990); some symbols in the logos are shown turned 180 degrees. The position-dependent information content is defined from two quantities: the observed fraction of symbol k at position i, and the background probability of finding symbol k by chance in the sequences.
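The formula itself did not survive in this transcript. The standard Kullback-Leibler form consistent with the two quantities named above is I_i = sum_k q_ik log2(q_ik / p_k), where the symbols q_ik (observed fraction of symbol k at position i) and p_k (background probability of symbol k) are labels chosen here, not taken from the slide. A minimal sketch of that computation, using a toy four-letter alphabet and a flat background as assumptions:

```python
import math
from collections import Counter


def column_fractions(segments, position):
    """Observed fraction of each symbol at one position of equal-length aligned segments."""
    counts = Counter(seg[position] for seg in segments)
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}


def information_content(q, p):
    """Relative entropy sum_k q[k] * log2(q[k] / p[k]), in bits.

    Symbols with q[k] == 0 contribute nothing; p[k] must be > 0 wherever q[k] > 0.
    """
    return sum(qk * math.log2(qk / p[k]) for k, qk in q.items() if qk > 0)


# Toy alignment over a four-letter alphabet with a flat background.
segments = ["AC", "AC", "AD", "AE"]
background = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
info = [information_content(column_fractions(segments, i), background) for i in range(2)]
```

In this toy case the fully conserved first column carries 2 bits and the second column 0.5 bits.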
10 - Neural networks
As in previous work (Baldi & Brunak 1998, …), two-layer feed-forward neural networks trained by standard back-propagation are applied to predict whether two residues are below or above a given distance threshold in space. In Lund et al. 1997, the inputs were processed as two windows centered around each of the separated amino acids. Here, the previous scheme is extended by allowing the windows to grow towards each other, and even to merge into a single large window covering the complete sequence between the separated amino acids. This increases the computational requirements, but makes it possible to search for the optimal covering between the separated amino acids.
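The window scheme can be pictured with a small helper that lists which sequence positions feed the network. The sketch makes a simplifying assumption that the slides do not confirm: both windows grow symmetrically by the same half-width around residues i and j. The function names are hypothetical.

```python
def input_positions(i, j, half_width, seq_len):
    """Positions covered by two windows of +/- half_width around residues i and j."""
    left = range(max(0, i - half_width), min(seq_len, i + half_width + 1))
    right = range(max(0, j - half_width), min(seq_len, j + half_width + 1))
    # Overlapping windows collapse into one contiguous stretch.
    return sorted(set(left) | set(right))


def windows_merged(i, j, half_width):
    """True when the windows touch or overlap, so every residue between i and j is covered."""
    return i + half_width >= j - half_width - 1


# Residues 10 and 20 (sequence separation 10) in a chain of length 100.
small = input_positions(10, 20, half_width=2, seq_len=100)   # two separate windows
merged = input_positions(10, 20, half_width=5, seq_len=100)  # one merged window
```

With half-width 2 the input consists of two 5-residue windows; with half-width 5 the windows meet at position 15 and form a single window from 5 to 25.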
11 - Neural networks
The input windows are divided into positive (contact) and negative (non-contact) windows, and the balanced learning approach (Rost & Sander 1993) is applied. Training is done with a 10-set cross-validation approach (Bishop 1996), and the average performance over the partitions is calculated. The performance on each partition is evaluated by the Matthews correlation coefficient (Matthews 1975). The patterns stored in the weights of the network are analyzed through the saliency: the cost of removing a single weight while keeping the remaining ones.
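For reference, the Matthews correlation coefficient is computed from the four confusion-matrix counts. The sketch below uses the standard definition; returning 0.0 when the denominator vanishes is a common convention adopted here, not something stated on the slide.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from true/false positive/negative counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom


perfect = matthews_cc(tp=50, tn=50, fp=0, fn=0)
inverted = matthews_cc(tp=0, tn=0, fp=50, fn=50)
mixed = matthews_cc(tp=40, tn=30, fp=20, fn=10)
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), which makes it more informative than the percentage of correct predictions when contacts and non-contacts are unevenly represented.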
12 - Neural networks (cont'd)
Due to the sparse encoding, each weight connected to a hidden unit corresponds exactly to a particular amino acid at a given position in the sequence windows used as inputs. This gives a ranking of the symbols at each position in the input fields. To compute the saliencies, the approximation for two-layer one-output networks is used (Gorodkin et al. 1997). The saliencies for the weights between the input and hidden layers are written in terms of two quantities: the weight between input i and hidden unit j, and the weight between hidden unit j and the output.
13 Results
A statistical analysis of the data and of the distance constraints between amino acids is conducted. The results are used to design, and to explain the behavior of, a neural network prediction scheme with enhanced performance.
14 - Statistical analysis
For each sequence separation, the mean of all distances between pairs of atoms is derived, and these means are used as distance constraint thresholds. To analyze which pairs are above and below the threshold, it is relevant to compare: the distributions of distances between amino acid pairs below and above the threshold; and the sequence composition of segments where the pairs of amino acids are below and above the threshold. The length distribution of the distances is investigated as a function of the sequence separation.
15 - Statistical analysis (cont'd)
From figure 1: α-helices make a distinct peak up to separation 20; -helices make a distinct peak up to separation 5. The distance distribution for separation 3 is the most bimodal. It provides the most distinct partition of the data points, so the best prediction of distance constraints can be obtained there. The α-helix peak shifts relative to the mean when the separation is 11 to 13. Even for an optimized predictor, it can therefore be slightly harder to predict distance constraints for separation 12 than for 11 or 13.
16 Distance distribution approaches a universal shape
17 - Statistical analysis (cont'd)
Helices of 12 residues coincide with sequence separation 12, which makes the distance constraints hard to predict there. The distribution changes from bimodal to unimodal, so prediction of distance constraints becomes harder with increasing sequence separation. The universality only appears when the distribution is displaced by its mean distance, so the mean can be used as a threshold.
19 - Statistical analysis (cont'd)
To use the information available in the sequence: sequence segments above the threshold are used to calculate a position-dependent background distribution; sequence segments below the threshold are all aligned and displayed in sequence logos using the computed background distribution. The sequence information content curves are shown in Figure 3 and the corresponding sequence logos in Figure 4. For larger sequence separations, the motif consists of 3 peaks: 1 center peak and 2 peaks at the separated amino acids.
20 - Statistical analysis (cont ’ d) Sequence information content curves
21 - Statistical analysis (cont'd)
The motifs smear out at separations 20-30, where the distance distribution becomes universal; the sequence motif becomes "universal" as well.
24 Neural networks: prediction and behavior
Neural networks are used to predict distance constraints, which requires taking the sequence separation into account. A two-layer network with 1 output unit and 5 hidden units is used; the size of the input field may vary. (Quantitative) The relation between the sequence motifs in the logos and the amount of sequence context needed in the prediction scheme is investigated. The amount of sequence context is chosen by starting with local windows around the separated amino acids and extending the sequence region r. For all sequence separations from 2 to 99, 8000 networks are trained, using 10-fold cross-validation.
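The architecture described above (sparse-encoded input, 5 hidden units, 1 output unit) can be sketched as a plain forward pass. Several details are assumptions made only for illustration: a 20-letter alphabet for the sparse encoding, sigmoid activations, bias terms, and random untrained weights. Training by back-propagation is not shown.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def encode(window):
    """Sparse (one-hot) encoding: 20 inputs per residue, exactly one of them set to 1."""
    x = []
    for aa in window:
        x.extend(1.0 if aa == k else 0.0 for k in AMINO_ACIDS)
    return x


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Two-layer feed-forward pass: inputs -> hidden units -> single output in (0, 1)."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


rng = random.Random(0)
x = encode("ACDEFGH")  # a 7-residue input field gives 140 inputs
n_hidden = 5
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in x] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
y = forward(x, w_hidden, b_hidden, w_out, b_out=0.0)
```

Because the encoding is sparse, each input-to-hidden weight is tied to one amino acid at one window position, and a larger input field simply lengthens `x` and each row of `w_hidden`.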
25 Neural networks: prediction and behavior
26 Due to the lack of motif
27 Neural networks: prediction and behavior
The best performing network is the one that uses as much context as possible. Beyond 30-35 residues, the amount of context used is no longer a factor, and fluctuations appear in the performance curve. The networks can therefore be used as an indicator of when a sequence motif is well defined (via these fluctuations). In an independent prediction test on nine CASP3 targets, the previous method gave 64.5% correct predictions with a correlation coefficient of 0.224, compared with 70.3% and 0.249 here.
28 Neural networks: prediction and behavior
Prediction example of distance constraints for R0067. Result: predictions up to a sequence separation of 30 clearly capture the main part of the distance constraints.
29 Neural networks: prediction and behavior
(Qualitative) The relation between the network performance and the information content in the sequence logos is investigated. The two curves have the same qualitative behavior as the sequence separation increases (figure 7): a peak at separation 3, a drop at separation 12, and a plateau at sequence separation 30. Open question: the decreasing sample size.