1
Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction
Rajkumar Bondugula, Ognen Duzlevski and Dong Xu
Digital Biology Laboratory, Dept. of Computer Science, University of Missouri – Columbia, MO 65211, USA
2
Outline
- Introduction
  - Protein secondary structure prediction
  - Popular methods
  - K-Nearest Neighbor method
  - Fuzzy K-Nearest Neighbor method
- Methods
- Filtering the prediction
- Results and discussion
- Summary and future work
3
Introduction
- Goal: given a sequence of amino acids, predict which of the eight possible secondary structure states {H, G, I, B, E, C, S, T} each residue will fold into.
- CASP convention
  - {H, G, I} → H
  - {B, E} → E
  - {C, S, T} → C
- Example:
  Amino Acid:          VKDGYIVDXVNCTYFCGRNAYCNEECTKLXGEQWASPYYCYXLPDHVRTKGPGRCH
  Secondary Structure: CEEEEEECCCCCCCCCCCHHHHHHHHHHCCCCEEEECCEEEEECCCCCCCCCCCCC
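The CASP reduction above is a plain lookup table. A minimal sketch follows; the names `EIGHT_TO_THREE` and `reduce_states` are made up for illustration and do not come from the authors' software.

```python
# Reduction table copied from the CASP convention listed on the slide.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helix-like states
    "B": "E", "E": "E",             # strand-like states
    "C": "C", "S": "C", "T": "C",   # everything else becomes coil
}

def reduce_states(ss8: str) -> str:
    """Map an 8-state secondary structure string to the 3 states H, E, C."""
    return "".join(EIGHT_TO_THREE[state] for state in ss8)

print(reduce_states("CSTHGIBE"))  # prints CCCHHHEE
```

A state letter outside the eight listed raises `KeyError`, which is probably the safer behaviour for a sketch like this than guessing a class.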
4
Protein 3-Dimensional structure
5
Importance of Secondary Structure
- An intermediate step in 3D structure prediction
  - structure → function
- Classification
  - Ex: α, β, α/β, α+β
- Helps in protein folding pathway determination
6
Existing Methods
- Popular methods
  - Neural Network methods
    - Ex: PSIPRED, PHD
  - Nearest Neighbor methods
    - Ex: NNSSP
  - Hidden Markov Model methods
7
Why K-Nearest Neighbors method?
- Methods based on Neural Networks and Hidden Markov Models
  - perform well if the query protein has many homologs in the sequence database
  - are not easily expandable
- The error rate of the 1-Nearest Neighbor rule is bounded above by twice the optimal Bayes error rate [Keller et al., 1985]
- K-NN will work better and better as more structures are solved
8
K-Nearest Neighbor Algorithm
[Figure, shown over three animation frames: classified instances (class B, class F) and the instances to be classified.]
11
K-Nearest Neighbor Algorithm
- Advantages of Nearest Neighbor methods
  - Simple and transparent model
  - New structures can be added without re-training
  - Linear complexity
- Disadvantage
  - Slower than other models at prediction time, because all processing is deferred until a prediction is needed
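For reference, a crisp K-NN classifier can be sketched in a few lines. This is a generic illustration, not the authors' code: the toy coordinates, the absolute-difference distance and the function name `knn_classify` are assumptions, and only the class labels B and F are borrowed from the figure.

```python
from collections import Counter

def knn_classify(query, examples, k=3):
    """Crisp K-NN: majority vote among the k closest labelled examples.

    `examples` is a list of (feature_vector, label) pairs. The distance is
    the sum of absolute coordinate differences (an illustrative choice).
    """
    def distance(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    # Linear scan over every stored example; there is no training step.
    nearest = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data with two classes.
examples = [((0.0, 0.0), "B"), ((0.2, 0.1), "B"), ((1.0, 1.0), "F"),
            ((0.9, 1.1), "F"), ((1.2, 0.8), "F")]
print(knn_classify((0.1, 0.1), examples, k=3))
```

Adding a new solved structure amounts to appending to `examples`, which is the "no re-training" advantage listed above; the cost is the full scan at query time.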
12
Why Fuzzy K-NN?
- Disadvantages of crisp K-NN
  - Atypical examples are given as much weight as those that truly represent a particular class
  - Once an instance is assigned to a class, there is no indication of the strength of its membership in that class
13
[Figure, shown over six animation frames: a window of seven residues slides along the padded sequence, one position per frame.]
Sequence: - - - N L G A G N S G L N L G H V A L T F
Window 1: - - - N L G A
Window 2: - - N L G A G
Window 3: - N L G A G N
Window 4: N L G A G N S
Window 5: L G A G N S G
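The sliding-window frames can be reproduced with a short generator. This is a sketch under assumptions: the window width of 7 is inferred from the frames, padding is applied at both ends (the slides only show the left end), and `windows` is an illustrative name.

```python
def windows(sequence: str, width: int = 7, pad: str = "-"):
    """Yield one window per residue, centred on that residue.

    The sequence is padded with `pad` on both sides so that the first and
    last residues also sit at the centre of a full-width window.
    """
    half = width // 2
    padded = pad * half + sequence + pad * half
    for i in range(len(sequence)):
        yield padded[i:i + width]

seq = "NLGAGNSGLNLGHVALTF"
for w in list(windows(seq))[:5]:
    print(w)
```

The first five windows printed are `---NLGA`, `--NLGAG`, `-NLGAGN`, `NLGAGNS` and `LGAGNSG`, matching the frames on the slides.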
19
Position Specific Scoring Matrix
[Figure: PSI-BLAST turns the sequence ... N L G A G N S G L N L G H V A L T F ... into a position specific scoring matrix with one row of integer scores per residue. Axis labels: "Length of protein (l)" and "20". Column order: A R N D C Q E G H I L K M F P S T W Y V.]
20
Why Profile-FKNN?
- Evolutionary information has been shown to increase the accuracy of secondary structure prediction in many popular methods
- Profile-FKNN is an attempt to combine the advantages of evolutionary information, fuzzy set theory and nearest neighbor methods
21
Methods
- Calculate profiles using PSI-BLAST
  - The popular Rost and Sander database of 126 representative proteins (<25% sequence identity)
- Find the K-Nearest Neighbors
- Calculate the membership values of the neighbors
- Calculate the membership values of the current residue
- Assign classes
- Filter the output
22
Profile Calculation
- The profiles of both the query protein and the test protein are calculated using the program PSI-BLAST
- Parameters for PSI-BLAST
  - Expectation value (e) = 0.1
  - Maximum number of passes (j) = 3
  - E-value threshold for inclusion in multi-pass model (h) = 5
  - Default values for the rest of the parameters
23
K-Nearest Neighbors
- For each profile-window in the query protein, the position-weighted absolute distance d is calculated from all profile-windows of all proteins in the database.
- The profile-windows corresponding to the K smallest distances are retained as the K-Nearest Neighbors
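A position-weighted absolute distance and the K-smallest selection could look like the sketch below. The slide does not give the weights used for the distance, so the `weights` argument, the toy two-column profiles and the names `window_distance` and `k_nearest` are all assumptions made for illustration.

```python
import heapq

def window_distance(a, b, weights):
    """Position-weighted absolute distance between two profile windows.

    `a` and `b` are lists of PSSM rows (one row of scores per window
    position); `weights` holds one weight per window position.
    """
    return sum(
        w * sum(abs(x - y) for x, y in zip(row_a, row_b))
        for w, row_a, row_b in zip(weights, a, b)
    )

def k_nearest(query, database, weights, k):
    """Return the k (distance, index) pairs with the smallest distances."""
    scored = ((window_distance(query, win, weights), i)
              for i, win in enumerate(database))
    return heapq.nsmallest(k, scored)

# Hypothetical toy data: windows of 3 positions, 2 score columns each.
weights = [0.25, 0.5, 0.25]
query = [[1, 0], [2, -1], [0, 0]]
database = [
    [[1, 0], [2, -1], [0, 0]],   # identical to the query
    [[0, 0], [2, -1], [0, 1]],   # differs at the two edge positions
    [[1, 0], [0, 1], [0, 0]],    # differs at the centre position
]
print(k_nearest(query, database, weights, k=2))
```

Real profile windows would have 20 score columns per position; the scan over `database` is linear in the number of stored windows, which is where the prediction-time cost of K-NN comes from.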
24
Distance Calculation
[Figure, repeated over six animation frames: two PSSM profiles shown together, one labeled with the sequence ... N L G A G N S G L T F ... and the other with the sequence ... N L G A G N S G L N L G H V A L T F ...]
30
Membership Values of the Neighbors
- The memberships of the nearest neighbors are assigned based on the secondary structures at the various positions in their windows
- Residues nearer the center are weighted more heavily than residues farther away
31
Membership values of the Neighbors
Weight:     0.067  0.133  0.20  0.20  0.20  0.133  0.067
Residue:    N      L      G     A     G     N      S
Structure:  C      C      E     E     E     C      C
H:
E:                        1     1     1
C:          1      1                        1      1
H = 0
E = 0.20×1 + 0.20×1 + 0.20×1 = 0.6
C = 0.067×1 + 0.133×1 + 0.133×1 + 0.067×1 = 0.4
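The worked example on this slide is a weighted count of the states in the neighbor's window. A minimal sketch, using the seven position weights shown on the slide; the function name `neighbor_membership` is illustrative only.

```python
# Position weights as printed on the slide; they sum to 1.
WEIGHTS = [0.067, 0.133, 0.20, 0.20, 0.20, 0.133, 0.067]

def neighbor_membership(window_ss, weights=WEIGHTS):
    """Sum the position weights of each state found in a neighbor window."""
    memberships = {"H": 0.0, "E": 0.0, "C": 0.0}
    for state, weight in zip(window_ss, weights):
        memberships[state] += weight
    return memberships

m = neighbor_membership("CCEEECC")
print({state: round(value, 3) for state, value in m.items()})
```

For the window `CCEEECC` this gives H = 0, E = 0.6 and C = 0.4 after rounding, the same values as the slide.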
32
Membership Value
- The membership values of each residue in the classes Helix, Sheet and Coil are calculated from the corresponding neighbors using the Fuzzy K-NN algorithm
- Each residue is assigned to the class in which it has the highest membership value
Helix = ... 15 22 61 91 95 96 26 21 23 18 29 30 24 17  5  8 ...
Sheet = ... 22 28 13  1  1  2  8  8 12 11 42 44 46 29 14 10 ...
Coil  = ... 63 50 26  8  4  2 65 71 65 71 29 26 31 53 81 82 ...
Final = ...  C  C  H  H  H  H  C  C  C  C  E  E  E  C  C  C ...
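Assigning each residue to its highest-membership class is a per-position argmax. The sketch below uses the numbers from the slide; the tie-breaking order (H, then E, then C) is an assumption, since the slide does not say how ties are resolved.

```python
helix = [15, 22, 61, 91, 95, 96, 26, 21, 23, 18, 29, 30, 24, 17, 5, 8]
sheet = [22, 28, 13, 1, 1, 2, 8, 8, 12, 11, 42, 44, 46, 29, 14, 10]
coil = [63, 50, 26, 8, 4, 2, 65, 71, 65, 71, 29, 26, 31, 53, 81, 82]

def assign_classes(helix, sheet, coil):
    """Pick the class with the highest membership at every position."""
    labels = []
    for h, e, c in zip(helix, sheet, coil):
        scores = {"H": h, "E": e, "C": c}
        # max() returns the first maximal key, so ties go H, then E, then C.
        labels.append(max("HEC", key=scores.get))
    return "".join(labels)

print(assign_classes(helix, sheet, coil))  # prints CCHHHHCCCCEEECCC
```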
33
Fuzzy K-Nearest Neighbor Algorithm
BEGIN
  Initialize i = 1.
  DO UNTIL (r assigned membership in all classes)
    Compute u_i(r) using [equation not preserved in this transcript]
    Increment i.
  END DO UNTIL
END
where
  u_i(r) = membership value of residue r in class i, i = Helix, Sheet or Coil
  d(r, r_j) = distance between the query window centered on residue r and its j-th neighbor
  m = 2 (fuzzifier)
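The membership equation itself is missing from this transcript. The sketch below assumes the standard fuzzy K-NN rule of Keller et al. (1985), which fits the symbols defined on the slide: u_i(r) = sum_j [u_ij * d(r, r_j)^(-2/(m-1))] / sum_j [d(r, r_j)^(-2/(m-1))], where u_ij is the membership of the j-th neighbor in class i. The function name and the small `eps` guard are illustrative additions, not taken from the authors' software.

```python
def fuzzy_knn_membership(distances, neighbor_memberships, m=2.0):
    """Fuzzy K-NN membership of a query in each class (assumed Keller form).

    `distances[j]` is d(r, r_j) for the j-th neighbor, and
    `neighbor_memberships[j]` maps each class to that neighbor's membership.
    """
    exponent = 2.0 / (m - 1.0)
    eps = 1e-9  # assumption: avoids division by zero for identical windows
    inverse = [1.0 / (d ** exponent + eps) for d in distances]
    total = sum(inverse)
    return {
        cls: sum(w * u[cls] for w, u in zip(inverse, neighbor_memberships)) / total
        for cls in ("H", "E", "C")
    }

# Hypothetical example with K = 2 neighbors.
neighbors = [{"H": 0.0, "E": 0.6, "C": 0.4}, {"H": 1.0, "E": 0.0, "C": 0.0}]
u = fuzzy_knn_membership([1.0, 2.0], neighbors, m=2.0)
print({cls: round(value, 3) for cls, value in u.items()})
```

With m = 2 each neighbor is weighted by the inverse square of its distance, so the neighbor at distance 1 counts four times as much as the one at distance 2, and the three memberships still sum to 1.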
34
Structure Filtration
- In the basic setting, the secondary structure state is the class with the highest membership value
- Unrealistic structures may be present
- Popular methods of structure filtration
  - Neural network
  - Heuristic based
35
Heuristic Filter
1. Smooth the membership values
2. Filter unrealistic structures
   - Helix > 3 amino acids, β-sheet > 2 amino acids
3. Calculate the thresholds to filter noise
4. Mark the possible Helix and Sheet regions
   - Resolve conflicts based on the average membership value in the overlap region
5. Fill the rest of the structure with Coil
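Step 2 alone can be illustrated on a plain label string. This is a deliberately simplified sketch: it only relabels helix and sheet runs that are too short as coil, and it ignores the smoothing, thresholding and conflict resolution of the other steps. The names `MIN_RUN` and `drop_short_segments` are illustrative.

```python
from itertools import groupby

# Minimum run lengths implied by "Helix > 3" and "sheet > 2" amino acids.
MIN_RUN = {"H": 4, "E": 3}

def drop_short_segments(ss: str) -> str:
    """Relabel helix or sheet runs that are too short to be realistic as coil."""
    out = []
    for state, run in groupby(ss):
        length = len(list(run))
        if length < MIN_RUN.get(state, 1):
            state = "C"
        out.append(state * length)
    return "".join(out)

print(drop_short_segments("CCHCCHHHHCEECEEEC"))
```

In this example the isolated `H` and the two-residue `EE` run become coil, while the four-residue helix and three-residue sheet are kept.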
36
Filter: Final Structure
Unfiltered: CCCCCHCCCCCHHHHHHHHCCCCCCEEEEECCCCCCCCCCCCCEEEEEECCCCCCHHHCCCCC
Target:     CCCHHHCCCCHHHHHHHHHHHCCCCEEEEEECCCCEECCCCCCEEEEEEECCCCEECCCCEEC
Filtered:   CCHHHHCCCHHHHHHHHHHHHHCCCEEEEEECCCCCCCCCCCCEEEEEEECCCCCCCCCCCCC
37
Metrics
- Seven commonly used metrics
  - Q_3 = (number of correctly predicted residues / total number of residues) × 100
  - Q_i = (number of residues correctly predicted in state i / total number of residues in state i) × 100, for i in {H, E, C}
  - Matthews Correlation Coefficient, computed per state:
    $\mathrm{MCC} = \dfrac{pn - uo}{\sqrt{(p+u)(p+o)(n+u)(n+o)}}$
    where p = true positives, n = true negatives, u = false negatives, o = false positives
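The metrics can be computed directly from a predicted and a target label string. In this sketch the MCC uses the standard Matthews formula, (p·n − u·o) / sqrt((p+u)(p+o)(n+u)(n+o)), with the slide's symbols; the function names and the toy strings are assumptions for illustration.

```python
from math import sqrt

def q3(pred, target):
    """Percentage of residues whose predicted state equals the target state."""
    correct = sum(a == b for a, b in zip(pred, target))
    return 100.0 * correct / len(target)

def q_state(pred, target, state):
    """Percentage of target residues in `state` that were predicted correctly."""
    total = sum(b == state for b in target)
    correct = sum(a == b == state for a, b in zip(pred, target))
    return 100.0 * correct / total if total else float("nan")

def mcc(pred, target, state):
    """Matthews correlation coefficient for one state against the rest."""
    pairs = list(zip(pred, target))
    p = sum(a == state and b == state for a, b in pairs)  # true positives
    n = sum(a != state and b != state for a, b in pairs)  # true negatives
    u = sum(a != state and b == state for a, b in pairs)  # false negatives
    o = sum(a == state and b != state for a, b in pairs)  # false positives
    denom = sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0

# Hypothetical toy strings, not taken from the results.
pred, target = "CHHHCEEC", "CHHCCEEE"
print(q3(pred, target), q_state(pred, target, "H"), round(mcc(pred, target, "H"), 3))
```

Returning 0.0 when the denominator vanishes is a common convention rather than something stated on the slide.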
38
Results
             Q3 (%)  QH (%)  QE (%)  QC (%)  MH    ME    MC
Unfiltered   74.0    69.6    55.8    79.9    0.58  0.61  0.54
Filtered     76.2    68.1    66.1    80.4    0.64, 0.56 (only two of the three MCC values survive in this transcript)
Performance on a database of 1973 proteins (<25% sequence identity) generated by the PISCES [1] server
1. G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.
39
Relative Performance
Method      Accuracy (%)
MBR [1]     66.40
NN [2]      68.00
NNSSP [3]   72.20
PFKNN       76.20
1. X. Zhang, J. P. Mesirov and D. L. Waltz. Hybrid system for Protein Secondary Structure Prediction. J. Mol. Biol., 225:1049-1063, 1992.
2. Tau-Mu Yi and E. S. Lander. Protein Secondary Structure Prediction using Nearest-Neighbor Methods. J. Mol. Biol., 232:1117-1129, 1993.
3. A. A. Salamov and V. V. Solovyev. Prediction of Protein Secondary Structure by Combining Nearest-neighbor Algorithm and Multiple Sequence Alignments. J. Mol. Biol., 247:11-15, 1995.
40
Summary
- A novel approach for protein secondary structure prediction (PSSP)
  - Evolutionary information
  - K-Nearest Neighbor algorithm
  - Fuzzy set theory
- Most accurate KNN approach to date
- Easily expandable
- Accuracy increases with new structures
- Average computing time < 1 min on a single-CPU machine
41
Future Work
- System with faster search capabilities
  - Efficient search for neighbors
- Accurate prediction system
42
Acknowledgements
- Dr. James Keller for insight into the Fuzzy K-Nearest Neighbor Algorithm
- Oak Ridge National Laboratory for providing the supercomputing facilities
- Members of the Digital Biology Laboratory for their support
43
Software The enhanced version of the software is coded in C and is available upon request. Please e-mail your requests to Raj@mizzou.edu or XuDong@missouri.edu
44
Thank you for your participation!