Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction Rajkumar Bondugula, Ognen Duzlevski and Dong Xu Digital Biology.

1 Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction Rajkumar Bondugula, Ognen Duzlevski and Dong Xu Digital Biology Laboratory, Dept. of Computer Science University of Missouri – Columbia, MO 65211, USA

2 Outline
- Introduction
  - Protein secondary structure prediction
  - Popular methods
  - K-Nearest Neighbor method
  - Fuzzy K-Nearest Neighbor method
- Methods
- Filtering the prediction
- Results and discussion
- Summary and future work

3 Introduction
- Goal: Given a sequence of amino acids, predict which of the eight possible secondary structure states {H, G, I, B, E, C, S, T} each residue will fold into.
- CASP convention
  - {H, G, I} → H
  - {B, E} → E
  - {C, S, T} → C
- Example:
  Amino acid sequence: VKDGYIVDXVNCTYFCGRNAYCNEECTKLXGEQWASPYYCYXLPDHVRTKGPGRCH
  Secondary structure: CEEEEEECCCCCCCCCCCHHHHHHHHHHCCCCEEEECCEEEEECCCCCCCCCCCCC
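The CASP reduction from eight states to three can be sketched as a simple lookup. This is a minimal illustration; the names `CASP_REDUCTION` and `reduce_to_three_states` are hypothetical and not taken from the authors' software.

```python
# Eight-state to three-state mapping, following the CASP convention on the slide.
CASP_REDUCTION = {
    "H": "H", "G": "H", "I": "H",   # helix-like states
    "B": "E", "E": "E",             # strand-like states
    "C": "C", "S": "C", "T": "C",   # everything else becomes coil
}

def reduce_to_three_states(eight_state: str) -> str:
    """Map a string of eight-state labels to the three classes H, E, C."""
    return "".join(CASP_REDUCTION[state] for state in eight_state)
```

For example, `reduce_to_three_states("HGIBECST")` maps each of the eight labels to its reduced class in order.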

4 Protein 3-Dimensional structure

5 Importance of Secondary Structure
- An intermediate step in 3D structure prediction
  - structure → function
- Classification
  - Ex: α, β, α/β, α+β
- Helps in protein folding pathway determination

6 Existing Methods
- Popular methods
  - Neural network methods
    - Ex: PSIPRED, PHD
  - Nearest neighbor methods
    - Ex: NNSSP
  - Hidden Markov model methods

7 Why the K-Nearest Neighbors method?
- Methods based on neural networks and hidden Markov models
  - perform well if the query protein has many homologs in the sequence database
  - are not easily expandable
- The error rate of the 1-Nearest Neighbor rule is bounded above by twice the optimal Bayes error rate [Keller et al., 1985]
- K-NN will work better and better as more structures are solved

8 K-Nearest Neighbor Algorithm [Figure: instances to be classified; classified instances]

9 K-Nearest Neighbor Algorithm [Figure: instances to be classified; classified instances]

10 [Figure: instances to be classified; class B; class F]

11 K-Nearest Neighbor Algorithm
- Advantages of nearest neighbor methods
  - Simple and transparent model
  - New structures can be added without re-training
  - Linear complexity
- Disadvantage
  - Slower than other models, as processing is delayed until a prediction is needed
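A crisp K-NN classifier can be sketched in a few lines, which also shows why no training step is needed: all work happens at query time, with one distance per stored instance. This is a generic illustration rather than the authors' code; the function name, the Manhattan distance, and the toy points are assumptions, while the class names B and F echo the figure on slide 10.

```python
from collections import Counter

def knn_classify(query, labeled_instances, k=3):
    """Return the majority class among the k stored instances closest to query.

    labeled_instances is a list of (feature_vector, class_label) pairs.
    """
    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    # One distance per stored instance: cost grows linearly with the database.
    nearest = sorted(labeled_instances, key=lambda item: manhattan(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: adding a new labeled instance is just appending to this list.
training = [((0, 0), "B"), ((0, 1), "B"), ((5, 5), "F"), ((6, 5), "F"), ((5, 6), "F")]
```

Calling `knn_classify((1, 0), training, k=3)` picks two B neighbors and one F neighbor, so the vote goes to B.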

12 Why Fuzzy K-NN?
- Disadvantages of crisp K-NN
  - Atypical examples are given as much weight as those that truly represent a particular class
  - Once an instance is assigned to a class, there is no indication of the "strength" of its membership in that class

13 - - - N L G A G N S G L N L G H V A L T F

14 [- - - N L G A] - - - N L G A G N S G L N L G H V A L T F

15 [- - - N L G A] [- - N L G A G] - - - N L G A G N S G L N L G H V A L T F

16 [- - - N L G A] [- - N L G A G] [- N L G A G N] - - - N L G A G N S G L N L G H V A L T F

17 [- - - N L G A] [- - N L G A G] [- N L G A G N] [N L G A G N S] - - - N L G A G N S G L N L G H V A L T F

18 [- - - N L G A] [- - N L G A G] [- N L G A G N] [N L G A G N S] [L G A G N S G] - - - N L G A G N S G L N L G H V A L T F
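The sliding window in the preceding slides can be sketched as follows. The window width of 7 and the "-" padding match the slides; padding the C-terminus in the same way is an assumption, since the slides only show the N-terminal end, and the function name is hypothetical.

```python
def sequence_windows(sequence: str, width: int = 7, pad: str = "-"):
    """Return one window per residue, centered on that residue.

    The sequence is padded on both ends so that the first and last residues
    also sit at the center of a full-width window (width is assumed odd).
    """
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]

windows = sequence_windows("NLGAGNSGLNLGHVALTF")
```

The first five entries of `windows` correspond to the five bracketed windows on slide 18.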

19 Position Specific Scoring Matrix [Figure: PSI-BLAST converts the sequence ... N L G A G N S G L N L G H V A L T F ... into a position-specific scoring matrix with one row per residue (length of protein, l) and 20 columns, one per amino acid in the order A R N D C Q E G H I L K M F P S T W Y V]

20 Why Profile-FKNN?
- Evolutionary information has been shown to increase the accuracy of secondary structure prediction by many popular methods
- An attempt to combine the advantages of incorporating evolutionary information, fuzzy set theory and nearest neighbor methods

21 Methods
- Calculate profiles using PSI-BLAST
  - The popular Rost and Sander database of 126 representative proteins (<25% sequence identity)
- Find the K nearest neighbors
- Calculate the membership values of the neighbors
- Calculate the membership values of the current residue
- Assign classes
- Filter the output

22 Profile Calculation
- The profiles of both the query protein and the test protein are calculated using the program PSI-BLAST
- Parameters for PSI-BLAST
  - Expectation value (e) = 0.1
  - Maximum number of passes (j) = 3
  - E-value threshold for inclusion in multi-pass model (h) = 5
  - Default values for the rest of the parameters
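As a rough illustration of how these settings could map onto a command line, the sketch below assembles an invocation of the legacy NCBI `blastpgp` executable, whose `-e`, `-j` and `-h` options carry the same descriptions as the slide. The slide does not name the executable, so the choice of `blastpgp`, the database name `nr`, the `-Q` option for writing the ASCII matrix, and the file names are all assumptions.

```python
def psiblast_command(query_fasta: str, pssm_out: str, database: str = "nr"):
    """Build (but do not run) a blastpgp command with the slide's parameters."""
    return [
        "blastpgp",
        "-i", query_fasta,   # query sequence in FASTA format (hypothetical path)
        "-d", database,      # sequence database to search (assumed, not from the slide)
        "-e", "0.1",         # expectation value
        "-j", "3",           # maximum number of passes
        "-h", "5",           # e-value threshold for inclusion in multi-pass model
        "-Q", pssm_out,      # write the position-specific scoring matrix as ASCII
    ]
```

The returned list can be passed to `subprocess.run` on a machine where the legacy BLAST tools are installed.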

23 K-Nearest Neighbors
- For each profile-window in the query protein, the position-weighted absolute distance 'd' is calculated from all profile-windows of all proteins in the database.
- The profile-windows corresponding to the K smallest distances are retained as the K nearest neighbors
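One way to read "position-weighted absolute distance" is a weighted sum of absolute differences between corresponding profile rows. The sketch below follows that reading; the exact formula and the weights are not given in the transcript, so both are assumptions, as are the function names.

```python
def window_distance(query_window, target_window, position_weights):
    """Weighted sum of absolute differences between two profile-windows.

    Each window is a list of rows (one per window position), and each row
    holds the profile scores for that position.
    """
    return sum(
        weight * sum(abs(q - t) for q, t in zip(query_row, target_row))
        for weight, query_row, target_row
        in zip(position_weights, query_window, target_window)
    )

def k_nearest(query_window, database_windows, position_weights, k):
    """Return the k (distance, label) pairs with the smallest distances.

    database_windows is a list of (profile_window, label) pairs.
    """
    scored = [
        (window_distance(query_window, window, position_weights), label)
        for window, label in database_windows
    ]
    return sorted(scored, key=lambda pair: pair[0])[:k]
```

With toy two-position windows, `window_distance([[1, 0], [0, 1]], [[0, 0], [0, 3]], [0.5, 1.0])` adds 0.5 × 1 for the first position and 1.0 × 2 for the second.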

24 Distance Calculation [Figure: two PSSM profiles shown side by side, labeled with the residues ... N L G A G N S G L T F ... and ... N L G A G N S G L N L G H V A L T F ...]

25-29 Distance Calculation [Animation frames repeating the figure from slide 24]

30 Membership Values of the Neighbors
- The memberships of the nearest neighbors are assigned based on their corresponding secondary structures at the various positions in the window
- Residues nearer the center are weighted more than residues farther away

31 Membership Values of the Neighbors
Position weights:  0.067  0.133  0.200  0.200  0.200  0.133  0.067
Neighbor window:   N      L      G      A      G      N      S
Structure:         C      C      E      E      E      C      C
H = 0
E = 0.200×1 + 0.200×1 + 0.200×1 = 0.6
C = 0.067×1 + 0.133×1 + 0.133×1 + 0.067×1 = 0.4
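The worked example on this slide can be reproduced with a short sketch that sums the position weights of each class in the neighbor's window. The weights are the ones shown on the slide; the function and constant names are hypothetical.

```python
# Position weights from the slide: central residues count more, total is 1.0.
WINDOW_WEIGHTS = (0.067, 0.133, 0.200, 0.200, 0.200, 0.133, 0.067)

def neighbor_membership(window_states: str, weights=WINDOW_WEIGHTS):
    """Membership of one neighbor in H, E and C from its window's structure."""
    membership = {"H": 0.0, "E": 0.0, "C": 0.0}
    for state, weight in zip(window_states, weights):
        membership[state] += weight
    return membership

example = neighbor_membership("CCEEECC")
```

For the window with structure `CCEEECC`, `example` holds approximately H = 0, E = 0.6 and C = 0.4, matching the slide up to floating-point rounding.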

32 Membership Value
- The membership values of each residue in the classes Helix, Sheet and Coil are calculated from the corresponding neighbors using the Fuzzy K-NN algorithm
- Each residue is assigned to the class in which it has the highest membership value
Helix = ... 15 22 61 91 95 96 26 21 23 18 29 30 24 17  5  8 ...
Sheet = ... 22 28 13  1  1  2  8  8 12 11 42 44 46 29 14 10 ...
Coil  = ... 63 50 26  8  4  2 65 71 65 71 29 26 31 53 81 82 ...
Final = ...  C  C  H  H  H  H  C  C  C  C  E  E  E  C  C  C ...
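The class assignment on this slide is an arg-max over the three membership rows. A minimal sketch using the slide's own numbers follows; the function name is hypothetical, and ties (which do not occur in this example) would go to the first of H, E, C.

```python
# Membership values copied from the slide, one entry per residue.
helix = [15, 22, 61, 91, 95, 96, 26, 21, 23, 18, 29, 30, 24, 17, 5, 8]
sheet = [22, 28, 13, 1, 1, 2, 8, 8, 12, 11, 42, 44, 46, 29, 14, 10]
coil = [63, 50, 26, 8, 4, 2, 65, 71, 65, 71, 29, 26, 31, 53, 81, 82]

def assign_states(helix, sheet, coil) -> str:
    """Pick, for each residue, the class with the highest membership value."""
    labels = "HEC"
    return "".join(
        labels[max(range(3), key=lambda i: column[i])]
        for column in zip(helix, sheet, coil)
    )

final = assign_states(helix, sheet, coil)
```

The resulting string reproduces the "Final" row of the slide.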

33 Fuzzy K-Nearest Neighbor Algorithm
BEGIN
  Initialize i = 1.
  DO UNTIL (r assigned membership in all classes)
    Compute u_i(r) using [equation missing from the transcript]
    Increment i.
  END DO UNTIL
END
where
  u_i(r) = membership value of residue r in class i, with i = Helix, Sheet or Coil
  d(r, r_j) = distance between the query window centered on residue r and its j-th neighbor
  m = 2 (fuzzifier)
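The equation on this slide did not survive in the transcript. The sketch below uses the standard fuzzy K-NN rule from Keller et al. (1985), in which each neighbor's class membership is weighted by the inverse distance raised to the power 2/(m-1) and the weights are normalized to sum to one. Treat this as an assumption about what the slide showed, not a confirmed copy of the authors' formula; the function name and the small `eps` guard against zero distances are also additions.

```python
def fuzzy_knn_memberships(neighbors, m: float = 2.0, eps: float = 1e-9):
    """Combine neighbor memberships into memberships for the query residue.

    neighbors is a list of (distance, membership) pairs, where membership is a
    dict giving that neighbor's membership in the classes "H", "E" and "C".
    """
    exponent = 2.0 / (m - 1.0)  # equals 2 for the fuzzifier m = 2
    weights = [1.0 / max(distance, eps) ** exponent for distance, _ in neighbors]
    total = sum(weights)
    return {
        cls: sum(w * u.get(cls, 0.0) for w, (_, u) in zip(weights, neighbors)) / total
        for cls in ("H", "E", "C")
    }

u = fuzzy_knn_memberships([
    (1.0, {"H": 0.0, "E": 0.6, "C": 0.4}),
    (2.0, {"H": 1.0, "E": 0.0, "C": 0.0}),
])
```

In this toy call the closer neighbor gets weight 1 and the farther one weight 0.25, so the query's memberships lean toward the closer neighbor while still summing to one.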

34 Structure Filtration
- In the basic setting, the secondary structure state is the class with the highest membership value
- Unrealistic structures may be present
- Popular methods of structure filtration
  - Neural network
  - Heuristic based

35 Heuristic Filter
1. Smooth the membership values
2. Filter unrealistic structures
   - Helix > 3 amino acids, β-sheet > 2 amino acids
3. Calculate the thresholds to filter noise
4. Mark the possible helix and sheet regions
   - Resolve conflicts based on the average membership value in the overlap region
5. Fill the rest of the structure with coil
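Step 2 alone can be illustrated on a string of predicted states: runs of helix shorter than 4 residues and runs of sheet shorter than 3 residues are treated as unrealistic. Replacing such runs with coil is an assumption made for this sketch; the authors' filter works on the smoothed membership values and can also extend segments, which this simplified version does not attempt.

```python
from itertools import groupby

# Minimum realistic run lengths: helix > 3 residues, sheet > 2 residues.
MIN_RUN_LENGTH = {"H": 4, "E": 3}

def drop_short_segments(states: str) -> str:
    """Replace helix and sheet runs that are too short with coil."""
    filtered = []
    for state, run in groupby(states):
        length = len(list(run))
        if length < MIN_RUN_LENGTH.get(state, 1):
            state = "C"
        filtered.append(state * length)
    return "".join(filtered)
```

For instance, `drop_short_segments("CCHCCHHHHCEECEEEC")` keeps the four-residue helix and the three-residue sheet but turns the isolated `H` and the two-residue `EE` into coil.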

36 Filter: Final Structure
Unfiltered: CCCCCHCCCCCHHHHHHHHCCCCCCEEEEECCCCCCCCCCCCCEEEEEECCCCCCHHHCCCCC
Target:     CCCHHHCCCCHHHHHHHHHHHCCCCEEEEEECCCCEECCCCCCEEEEEEECCCCEECCCCEEC
Filtered:   CCHHHHCCCHHHHHHHHHHHHHCCCEEEEEECCCCCCCCCCCCEEEEEEECCCCCCCCCCCCC

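37 Metrics
- Seven commonly used metrics
  - $Q_3 = \dfrac{\text{number of correctly predicted residues}}{\text{total number of residues}} \times 100$
  - $Q_i = \dfrac{\text{number of residues correctly predicted in state } i}{\text{total number of residues in state } i} \times 100$, for $i \in \{H, E, C\}$
  - Matthews correlation coefficient, computed per state: $\mathrm{MCC} = \dfrac{p\,n - u\,o}{\sqrt{(p+u)(p+o)(n+u)(n+o)}}$
    where p = true positives, n = true negatives, u = false negatives, o = false positives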

38 Results
             Q3 (%)  QH (%)  QE (%)  QC (%)  MH, ME, MC
Unfiltered   74.0    69.6    55.8    79.9    0.58, 0.61, 0.54
Filtered     76.2    68.1    66.1    80.4    0.64, 0.56 (only two of the three values survive in the transcript)
Performance on a database of 1973 proteins (<25% sequence identity) generated by the PISCES [1] server.
[1] G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.

39 Relative Performance
Method       Accuracy (%)
MBR [1]      66.40
NN [2]       68.00
NNSSP [3]    72.20
PFKNN        76.20
[1] X. Zhang, J. P. Mesirov and D. L. Waltz. Hybrid System for Protein Secondary Structure Prediction. J. Mol. Biol., 225:1049-1063, 1992.
[2] Tau-Mu Yi and E. S. Lander. Protein Secondary Structure Prediction using Nearest-Neighbor Methods. J. Mol. Biol., 232:1117-1129, 1993.
[3] A. A. Salamov and V. V. Solovyev. Prediction of Protein Secondary Structure by Combining Nearest-neighbor Algorithm and Multiple Sequence Alignments. J. Mol. Biol., 247:11-15, 1995.

40 Summary
- A novel approach for PSSP
  - Evolutionary information
  - K-Nearest Neighbor algorithm
  - Fuzzy set theory
- Most accurate KNN approach to date
- Easily expandable
- Accuracy increases with new structures
- Average computing time < 1 min on a single-CPU machine

41 Future Work
- System with faster search capabilities
  - Efficient search for neighbors
- Accurate prediction system

42 Acknowledgements
- Dr. James Keller for insight into the Fuzzy K-Nearest Neighbor Algorithm
- Oak Ridge National Laboratory for providing the supercomputing facilities
- Members of the Digital Biology Laboratory for their support

43 Software The enhanced version of the software is coded in C and is available upon request. Please e-mail your requests to Raj@mizzou.edu or XuDong@missouri.edu

44 Thank you for Participation!

