
1 "Using Sequence Motifs for Enhanced Neural Network Prediction of Protein Distance Constraints" J. Gorodkin, O. Lund, C. A. Andersen, S. Brunak, ISMB '99. Bio Learning Group Seminar. Speaker: Eom Jae-Hong, 2001/03/19

2 Abstract Investigate the correlation  Between sequence separation and physical distance. For pairs of amino acids where the inter-atomic distance is smaller than a threshold   a characteristic sequence motif is found. The motifs change as the sequence separation increases. Find correlations between the residues in the center of the motif.   Used to design a new NN guided by statistical analysis. Statistical analysis  Explains why neural networks perform better than simple statistics-driven approaches (e.g., pair probability density functions).

3 Introduction The ability to predict structure from sequence  Depends on constructing an appropriate cost function for the native structure.  To find this function  Concentrate on finding a method to predict distance constraints  That correlate well with the observed distances in proteins. The neural network approach is  The only approach so far which includes sequence context for the considered pair of amino acids.  Performs better.  Captures more features relating distance constraints and sequence composition.

4 Introduction (cont'd) The analysis includes investigation of the distances  Between amino acids, as well as sequence motifs and correlations for separated residues. Construct a prediction scheme  That significantly improves on an earlier approach (Lund et al. 1997). For each particular sequence separation  The corresponding distance threshold is computed  As the avg. of all physical distances in a large data set between any two amino acids separated by that number of residues (Lund et al. 1997).
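The threshold computation described above can be sketched as follows. This is a minimal illustration, assuming the data set is given as one coordinate array per chain; the function name and input layout are illustrative, not the paper's actual code:

```python
import numpy as np

def distance_thresholds(coords_by_chain, max_sep=99):
    """Mean physical distance for each sequence separation.

    coords_by_chain: list of (L, 3) arrays of backbone-atom coordinates,
    one per protein chain (a simplification of the PDB-derived data set).
    Returns a dict mapping separation s -> mean distance over all residue
    pairs at that separation, used as the distance-constraint threshold.
    """
    sums, counts = {}, {}
    for coords in coords_by_chain:
        n = len(coords)
        for s in range(2, min(max_sep, n - 1) + 1):
            # distances between residue i and residue i+s, for all i
            d = np.linalg.norm(coords[s:] - coords[:-s], axis=1)
            sums[s] = sums.get(s, 0.0) + d.sum()
            counts[s] = counts.get(s, 0) + len(d)
    return {s: sums[s] / counts[s] for s in sums}
```

On an idealized straight chain with 3.8 Å between consecutive residues, the threshold at separation s is simply 3.8·s, which makes the averaging easy to sanity-check.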

5 Introduction (cont'd) Here, include an analysis of the distance distributions relative to these thresholds.  Use this to explain the qualitative behavior of the neural network prediction scheme. Analysis of the network weight composition  Reveals intriguing properties of the distance constraints.   "The sequence motifs can be decomposed into sub-motifs associated with each of the hidden units in the neural network."  As the sequence separation increases there is  A clear correspondence in the change of the mean value, distance distribution, and the sequence motifs describing the distance constraints of the separated amino acids.  The predicted distance constraints  May be used as inputs to threading or loop modeling algorithms.

6 Material and Method – Data extraction  Data extracted from the Brookhaven Protein Data Bank (Bernstein et al. 1977), containing 5762 proteins.  Entries were excluded if:  The secondary structure of the proteins could not be assigned by the program DSSP (Kabsch & Sander 1983).  The proteins had any physical chain breaks.  They had a resolution value greater than 2.5 Angstrom.  Individual chains of entries were discarded if:  They had a length of less than 30 amino acids.  They had less than 50% secondary structure assigned, as defined by the program DSSP.  They had more than 85% non-amino-acid symbols in the sequence.  They had more than 10% non-standard amino acids (B, X, Z).

7 Material and Method – Data extraction  A representative set with low pairwise sequence similarity was selected  By running algorithm #1 of Hobohm et al. (1992) as implemented in the program RedHom (Lund et al. 1997).  Sequence sorting  According to resolution (all NMR structures were assigned resolution 100).  Sequences with the same resolution were sorted so that higher priority was given to longer proteins.  Sequences aligned (local alignment program)  Ssearch (Myers & Miller 1988; Pearson 1990)  Using the PAM120 amino acid substitution matrix (Dayhoff & Orcutt 1978) with gap penalties -12, -4.  A cutoff threshold on sequence similarity (% identity in the alignment) was used.
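Algorithm #1 of Hobohm et al. is a greedy pass over the priority-sorted list: keep a sequence only if it is not too similar to any sequence already kept. A minimal sketch, where the similarity test (in the paper, an Ssearch alignment against a length-dependent %-identity cutoff) is abstracted into a caller-supplied predicate:

```python
def hobohm1(entries, similar):
    """Greedy redundancy reduction (algorithm #1 of Hobohm et al. 1992).

    entries: sequences sorted by priority (here: resolution, then length).
    similar: predicate deciding whether two sequences exceed the
    similarity cutoff; a stand-in for the alignment-based test.
    """
    selected = []
    for e in entries:
        # keep e only if it is dissimilar to everything already selected
        if not any(similar(e, s) for s in selected):
            selected.append(e)
    return selected
```

Because the list is pre-sorted, the highest-resolution (or longest) member of each similarity cluster is the one that survives.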

8 Material and Method – Data extraction  Ten cross-validation sets were selected such that  They all contain approximately the same number of residues.  And all have the same length distribution of the chains.  All the data are made publicly available at  http://www.dtu.dk/service/distanceP/
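A simple way to obtain sets with roughly equal residue counts is greedy bin packing: assign the longest remaining chain to the currently smallest set. This is a hedged stand-in for the paper's (unspecified) balancing procedure:

```python
def split_cv_sets(chains, n_sets=10):
    """Greedy partition of chains into n_sets cross-validation sets
    with roughly equal residue counts: longest chains first, each
    assigned to the currently smallest set.  A sketch, not the
    paper's actual selection procedure."""
    sets = [[] for _ in range(n_sets)]
    sizes = [0] * n_sets
    for c in sorted(chains, key=len, reverse=True):
        i = sizes.index(min(sizes))  # smallest set so far
        sets[i].append(c)
        sizes[i] += len(c)
    return sets
```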

9 - Information Content / Relative Entropy Measure Relative entropy  Used to measure the information content (Kullback & Leibler 1951) of aligned regions between separated residues.  Information content  The information content at each position will sometimes be displayed as a sequence logo (Schneider & Stephens 1990).  The position-dependent information content is given by I_i = Σ_k q_ik log2(q_ik / p_k), where q_ik is the observed fraction of symbol k at position i and p_k is the background probability of finding symbol k by chance in the sequence.  Symbols occurring less often than the background are turned 180 degrees in the logos.
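The per-position relative entropy above translates directly into code. A minimal sketch, assuming the alignment columns are given as symbol-frequency dicts:

```python
import math

def information_content(columns, background):
    """Position-dependent information content (relative entropy,
    Kullback & Leibler 1951), as plotted in the sequence logos.

    columns: list of dicts q_ik mapping symbol -> observed fraction
    at position i; background: dict p_k of background probabilities.
    Returns I_i = sum_k q_ik * log2(q_ik / p_k) for each position.
    """
    out = []
    for q in columns:
        out.append(sum(f * math.log2(f / background[k])
                       for k, f in q.items() if f > 0))
    return out
```

With a uniform background over two symbols, a fully conserved column yields exactly 1 bit, and a column matching the background yields 0.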

10 - Neural networks In previous work (Baldi & Brunak 1998, …)  Applied two-layer feed-forward neural networks  Trained by standard back-propagation  To predict whether two residues are below or above a given distance threshold in space. Lund et al. 1997  The inputs were processed as two windows centered around each of the separated amino acids. Here, extend the previous scheme  By allowing the windows to grow towards each other, and even merge into a single large window covering the complete sequence between the separated amino acids.  This increases the computational requirements,  But allows us to search for the optimal covering between the separated amino acids.
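The growing-and-merging windows can be sketched as follows; the function name, padding symbol, and merge rule are illustrative assumptions about the idea described above, not the paper's exact input encoding:

```python
def input_windows(seq, i, j, w):
    """Two input windows of width w centred on the separated residues
    i and j; if they touch or overlap they merge into one window
    covering the whole segment.  Positions outside the sequence are
    padded with '-'."""
    lo1, hi1 = i - w // 2, i + w // 2
    lo2, hi2 = j - w // 2, j + w // 2
    pad = lambda a, b: ''.join(seq[k] if 0 <= k < len(seq) else '-'
                               for k in range(a, b + 1))
    if hi1 >= lo2:  # windows meet -> single merged window
        return [pad(lo1, hi2)]
    return [pad(lo1, hi1), pad(lo2, hi2)]
```

As w grows past the separation, the two windows collapse into one segment spanning both residues, which is the limiting case mentioned on the slide.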

11 - Neural networks  Positive (contact) and negative (non-contact) windows.  Apply the balanced learning approach (Rost & Sander 1993).  Training  Done by a 10-set cross-validation approach (Bishop 1996).  Calculate the average performance over the partitions.  The performance on each partition is evaluated by the Matthews correlation coefficient (Matthews 1975).  The analysis of the patterns stored in the weights of the network is done through the salience:  The cost of removing a single weight while keeping the remaining ones.
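The Matthews correlation coefficient used for the per-partition evaluation is a standard quantity computed from the confusion counts:

```python
import math

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient (Matthews 1975): +1 for
    perfect prediction, 0 for random, -1 for total disagreement.
    Returns 0.0 when a marginal count is zero (undefined case)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike raw accuracy, it stays meaningful when the contact/non-contact classes are unbalanced, which is why it pairs naturally with the balanced-learning setup above.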

12 - Neural networks (cont'd)  Each weight  Connected to a hidden unit corresponds exactly to a particular amino acid at a given position in the sequence windows used as inputs.  Due to the sparse encoding.  Obtain a ranking of symbols  At each position in the input fields. To compute the saliencies  Use the approximation for two-layer one-output networks (Gorodkin et al. 1997).  The saliencies for the weights between input and hidden layer can be written in terms of w_ij, the weight between input i and hidden unit j, and W_j, the weight between hidden unit j and the output.
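The sparse encoding that makes this weight-to-residue correspondence possible is the usual one-hot scheme; a minimal sketch (the 20-letter alphabet ordering is an assumption):

```python
import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet ordering

def sparse_encode(window):
    """Sparse (one-hot) encoding of an input window: each position
    contributes 20 inputs, exactly one of which is 1, so every
    input->hidden weight corresponds to one amino acid at one
    position -- the property exploited in the weight analysis."""
    x = np.zeros(len(window) * 20)
    for p, aa in enumerate(window):
        k = AMINO.find(aa)
        if k >= 0:          # padding/unknown symbols stay all-zero
            x[p * 20 + k] = 1.0
    return x
```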

13 Results Conduct statistical analysis of the data and distance constraints between amino acids. Use the results to  Design and explain the behavior of a neural network prediction scheme with enhanced performance.

14 - Statistical analysis Derive the mean of all the distances  Between pairs of atoms.  Use these means as distance-constraint thresholds. To analyze which pairs are above and below the threshold, it is relevant to compare:  The distribution of distances between amino acid pairs below and above the threshold.  The sequence composition of segments  Where the pairs of amino acids are below and above the threshold. Investigate the length distribution of the distances  As a function of the sequence separation.

15 - Statistical analysis (cont'd) From figure 1.  α-helices: make a distinct peak up to separation 20.  β-sheets: make a distinct peak up to separation 5.  The distance distribution at separation 3  Data is most bimodal.  Provides the most distinct partition of the data points.   the best prediction of distance constraints can be obtained.  α-helix peak shifts relative to the mean  When separation is 11 to 13.  For an optimized predictor, it can be slightly harder  To predict distance constraints for separation 12 than for 11 or 13.

16 Distance distribution approaches a universal shape

17 - Statistical analysis (cont'd) Helices of 12 residues coincide with seq. sep. 12.   hard to predict distance constraints. Bimodal distribution  unimodal distribution   prediction of distance constraints becomes harder with increasing sequence separation. The universality only appears  When the distribution is displaced by its mean distance.   we can use the mean as a threshold.

18

19 - Statistical analysis (cont'd) To use the information available in the sequence  Sequence segments above the threshold   used to calculate a position-dependent background distribution.  Sequence segments below the threshold   all aligned and displayed in a sequence logo using the computed background distribution.  Sequence information content curves  Figure 3.  Corresponding sequence logos  Figure 4.  For larger sequence separations, the motif consists of 3 peaks  1 center peak, and 2 peaks at the separated amino acids.

20 - Statistical analysis (cont'd) Sequence information content curves

21 - Statistical analysis (cont'd) Smears out at separations 20-30 (universal distance distribution)  the sequence motif becomes "universal".

22

23

24 Neural networks: prediction and behavior Use NN to predict optimal distance constraints  Have to consider the sequence separation. Two-layer network  1 output unit, 5 hidden units.  The size of the input field may vary. (Quantitative) Investigate the relation between  The seq. motifs in the logos and the amount of sequence context needed in the prediction scheme.  Choose the amount of seq. context with local windows around the separated amino acids  extend the seq. region.  For all seq. separations 2 to 99, train 8000 networks, using 10-fold cross-validation.
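The architecture above (two layers, 5 hidden units, 1 output) can be sketched in a few lines. Only the forward pass is shown; the back-propagation training, balanced learning, and input sizing are omitted, and the sigmoid activations and weight initialization are assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

class TwoLayerNet:
    """Minimal two-layer feed-forward network of the kind described
    above: sparse-encoded window -> 5 hidden units -> 1 sigmoid
    output, interpreted as P(pair below the distance threshold)."""

    def __init__(self, n_inputs, n_hidden=5):
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, n_hidden)
        self.b2 = 0.0

    def forward(self, x):
        # hidden layer, then single sigmoid output in (0, 1)
        h = 1.0 / (1.0 + np.exp(-(self.W1 @ x + self.b1)))
        return 1.0 / (1.0 + np.exp(-(self.W2 @ h + self.b2)))
```

The input size n_inputs varies with the window configuration (20 inputs per window position under the sparse encoding), which is the "size of the input field may vary" point on the slide.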

25 Neural networks: prediction and behavior

26 Due to the lack of motif

27 Neural networks: prediction and behavior The best performing network is that  Which uses as much context as possible.  Beyond 30-35 residues  The amount of context used is not a factor anymore.  Performance-curve fluctuations occur. We can use the networks as an indicator for  When a sequence motif is well defined (using the fluctuations). Independent prediction test on nine CASP3 targets  Prev. method: 64.5% correct prediction  With 0.224 correlation coefficient.   Here: 70.3% with 0.249.

28 Neural networks: prediction and behavior Prediction example of distance constraints for target R0067. Result  Predictions up to a sequence separation of 30 clearly capture the main part of the distance constraints.

29 Neural networks: prediction and behavior (Qualitative) Investigate the relation between  The network performance and the information content in the sequence logos.   the two curves have the same qualitative behavior as the sequence separation increases (figure 7).  Peak at separation 3.  Drop at separation 12.  Plateau for seq. separations beyond 30.  ? Decreasing sampling size!!

30

31

