Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction Rajkumar Bondugula, Ognen Duzlevski and Dong Xu Digital Biology Laboratory, Dept. of Computer Science University of Missouri – Columbia, MO 65211, USA
Outline l Introduction å Protein secondary structure prediction å Popular methods å K-Nearest Neighbor method å Fuzzy K-Nearest Neighbor method l Methods l Filtering the prediction l Results and discussion l Summary and Future work
Introduction l Goal: Given a sequence of amino acids, predict which of the eight possible secondary structure states {H, G, I, B, E, C, S, T} each residue will fold into. l CASP convention å {H, G, I} → H å {B, E} → E å {C, S, T} → C l Example: Amino Acid VKDGYIVDXVNCTYFCGRNAYCNEECTKLXGEQWASPYYCYXLPDHVRTKGPGRCH Secondary Structure CEEEEEECCCCCCCCCCCHHHHHHHHHHCCCCEEEECCEEEEECCCCCCCCCCCCC
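The CASP reduction from eight states to three can be sketched as a simple lookup. This is a minimal illustration; the names `EIGHT_TO_THREE` and `reduce_states` are hypothetical and not taken from the authors' software.

```python
# Mapping of the eight states to the three CASP classes, as listed on the slide.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helix-like states
    "B": "E", "E": "E",             # strand-like states
    "C": "C", "S": "C", "T": "C",   # everything else is coil
}


def reduce_states(eight_state: str) -> str:
    """Collapse an eight-state secondary structure string to H/E/C."""
    return "".join(EIGHT_TO_THREE[s] for s in eight_state)


print(reduce_states("CSTHGIBE"))  # prints CCCHHHEE
```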
Protein 3-Dimensional structure
Importance of Secondary Structure l An intermediate step in 3D structure prediction å structure → function l Classification å Ex: α, β, α/β, α+β l Helps in protein folding pathway determination
Existing Methods l Popular Methods å Neural Network methods X Ex: PSIPRED, PHD å Nearest Neighbor methods X Ex: NNSSP å Hidden Markov Model methods
Why K-Nearest Neighbors method? l Methods based on Neural Networks and Hidden Markov Models å perform well if the query protein has many homologs in the sequence database å are not easily expandable l The asymptotic error rate of the 1-Nearest Neighbor rule is bounded above by twice the optimal Bayes error rate [Keller et al., 1985] l K-NN will work better and better as more structures are solved
K-Nearest Neighbor Algorithm [Figure: instances to be classified shown among classified instances of class B and class F]
K-Nearest Neighbor Algorithm l Advantages of Nearest Neighbor methods å Simple and transparent model å New structures can be added without re-training å Linear complexity l Disadvantage å Slower compared to other models as processing is delayed until prediction is needed
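For contrast with the fuzzy variant, a crisp K-NN classifier can be sketched as a majority vote over the K closest labeled instances. This is a generic illustration on toy 2-D points, not the authors' code; the function name, the absolute-difference distance on toy data, and the class labels "B" and "F" (borrowed from the figure) are assumptions.

```python
import heapq
from collections import Counter


def knn_classify(query, labeled_points, k=3):
    """Crisp K-NN: majority vote among the k closest labeled points."""
    def dist(a, b):
        # Sum of absolute coordinate differences (assumed distance for this toy case).
        return sum(abs(x - y) for x, y in zip(a, b))

    # One pass over the labeled set; nothing is trained in advance.
    nearest = heapq.nsmallest(k, labeled_points, key=lambda p: dist(query, p[0]))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


training = [
    ((0.0, 0.0), "B"), ((0.1, 0.2), "B"),
    ((1.0, 1.0), "F"), ((0.9, 1.1), "F"), ((1.2, 0.8), "F"),
]
print(knn_classify((0.2, 0.1), training, k=3))  # prints B
```

Adding a newly solved structure amounts to appending to `training`, which is why no re-training step is needed.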
Why Fuzzy K-NN? l Disadvantages of Crisp K-NN å Atypical examples are given as much weight as those that truly represent a particular class å Once an instance is assigned to a class, there is no indication of the “strength” of its membership in that class
[Figure: a window slides along the sequence N L G A G N S G L N L G H V A L T F, with gap padding (- - -) before the first residue, giving the successive windows - - - N L G A, - - N L G A G, - N L G A G N, N L G A G N S, L G A G N S G, ...]
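Position Specific Scoring Matrix [Figure: PSI-BLAST converts the sequence ... N L G A G N S G L N L G H V A L T F ... into a matrix of size (length of protein, l) × 20, with one column per amino acid in the order A R N D C Q E G H I L K M F P S T W Y V]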
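Cutting a profile into per-residue windows can be sketched as follows. The window width of 7 matches the figure but is an assumption about the actual software, as is padding the termini with all-zero rows; `profile_windows` is a hypothetical helper name.

```python
def profile_windows(pssm, window=7):
    """Return one window of profile rows per residue.

    pssm   : list of rows, one per residue, each with 20 scores
    window : odd window width (assumed to be 7 here)
    Rows that would fall beyond either terminus are filled with zeros.
    """
    half = window // 2
    width = len(pssm[0])
    pad = [[0] * width for _ in range(half)]
    padded = pad + list(pssm) + pad
    # Window i is centered on residue i of the original profile.
    return [padded[i:i + window] for i in range(len(pssm))]


# Toy profile: 4 residues, 20 columns, row r filled with the value r + 1.
toy_pssm = [[r + 1] * 20 for r in range(4)]
windows = profile_windows(toy_pssm)
print(len(windows), len(windows[0]))  # prints 4 7
```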
Why Profile-FKNN? l Evolutionary information has been shown to increase the accuracy of secondary structure prediction by many popular methods l An attempt to combine the advantages of incorporating the evolutionary information, fuzzy set theory and nearest neighbor methods
Methods l Calculate profiles using PSI-BLAST å The popular Rost and Sander database of 126 representative proteins (<25% sequence identity) l Find the K-Nearest Neighbors l Calculate the membership values of the neighbors l Calculate the membership values of the current residue l Assign classes l Filter the output
Profile Calculation l The profiles of both the query protein and the test protein are calculated using the program PSI-BLAST l Parameters for PSI-BLAST å Expectation Value (e) = 0.1 å Maximum number of passes (j) = 3 å E-value threshold for inclusion in multi-pass model (h) = 5 å Default values for the rest of the parameters
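The parameter letters on this slide match the flags of the legacy NCBI `blastpgp` program, so a profile run could plausibly be assembled as below. This is a sketch under that assumption: the slides do not say which executable was used, and the file paths and database name are hypothetical placeholders. The command is only built here, not executed.

```python
def psiblast_command(fasta_path, database, pssm_path):
    """Build a blastpgp argument list with the parameters listed on the slide."""
    return [
        "blastpgp",
        "-i", fasta_path,   # query sequence in FASTA format
        "-d", database,     # sequence database to search (placeholder name)
        "-e", "0.1",        # expectation value
        "-j", "3",          # maximum number of passes
        "-h", "5",          # E-value threshold for inclusion in multi-pass model
        "-Q", pssm_path,    # write the position specific scoring matrix as ASCII
    ]


cmd = psiblast_command("query.fasta", "nr", "query.pssm")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where the program and a formatted database are installed.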
K-Nearest Neighbors l For each profile-window in the query protein, the position-weighted absolute distance ‘d’ is calculated from all profile-windows of all proteins in the database. l The profile-windows corresponding to K smallest distances are retained as the K-Nearest Neighbors
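A position-weighted absolute distance between two profile windows, followed by selection of the K smallest distances, can be sketched as below. The slides do not give the position weights, so `position_weights` is an assumed input; the function names are hypothetical.

```python
import heapq


def weighted_abs_distance(win_a, win_b, position_weights):
    """Sum of |a - b| over all profile entries, scaled per window position."""
    total = 0.0
    for w, row_a, row_b in zip(position_weights, win_a, win_b):
        total += w * sum(abs(a - b) for a, b in zip(row_a, row_b))
    return total


def k_nearest(query_win, database, position_weights, k):
    """database: list of (window, label) pairs. Returns k (distance, label) pairs."""
    scored = [
        (weighted_abs_distance(query_win, win, position_weights), label)
        for win, label in database
    ]
    return heapq.nsmallest(k, scored, key=lambda t: t[0])


# Toy windows: 3 positions x 2 columns instead of 7 x 20.
win_a = [[0, 0], [1, 1], [0, 0]]
win_b = [[1, 0], [0, 0], [0, 2]]
print(weighted_abs_distance(win_a, win_b, [1, 2, 1]))  # prints 7.0
```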
Distance Calculation [Figure: animation comparing windows of the sequences N L G A G N S G L T F and N L G A G N S G L N L G H V A L T F ...]
Membership Values of the Neighbors l The memberships of the nearest neighbors are assigned based on the secondary structure states at the various positions in the window l Residues nearer the center of the window are weighted more heavily than residues farther away
Membership Values of the Neighbors [Figure: neighbor window N L G A G N S with secondary structure C C E E E C C; resulting memberships H = 0, E = 0.6 (from terms of 0.200), C = 0.4 (from terms of 0.067)]
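One weighting scheme consistent with the numbers in this example is a set of position weights 1, 2, 3, 3, 3, 2, 1 normalized to sum to 1 (so the outermost positions contribute 1/15 ≈ 0.067 and the central ones 3/15 = 0.200). The slides do not state the weights, so this choice is an assumption made only because it reproduces H = 0, E = 0.6, C = 0.4; `neighbor_membership` is a hypothetical name.

```python
def neighbor_membership(window_ss, position_weights, classes="HEC"):
    """Class memberships of one neighbor window from its per-position states."""
    total = sum(position_weights)
    raw = {c: 0 for c in classes}
    for state, w in zip(window_ss, position_weights):
        raw[state] += w  # central positions carry larger weights
    # Normalize so the memberships over all classes sum to 1.
    return {c: raw[c] / total for c in classes}


membership = neighbor_membership("CCEEECC", [1, 2, 3, 3, 3, 2, 1])
print(membership)  # prints {'H': 0.0, 'E': 0.6, 'C': 0.4}
```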
Membership Value l The membership values of each residue in the classes Helix, Sheet and Coil are calculated from the corresponding neighbors using the Fuzzy K-NN algorithm l Each residue is assigned to the class in which it has the highest membership value [Figure: per-residue membership tracks for Helix, Sheet and Coil, and the final assignment ... C C H H H H C C C C E E E C C C ...]
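Fuzzy K-Nearest Neighbor Algorithm
BEGIN
  Initialize i = 1.
  DO UNTIL (r assigned membership in all classes)
    Compute u_i(r) using
      $u_i(r) = \dfrac{\sum_{j=1}^{K} u_{ij}\, d(r, r_j)^{-2/(m-1)}}{\sum_{j=1}^{K} d(r, r_j)^{-2/(m-1)}}$
    Increment i.
  END DO UNTIL
END
Where u_i(r) = membership value of residue ‘r’ in class ‘i’ (i = Helix, Sheet or Coil), u_ij = membership value of the j-th neighbor in class i, d(r, r_j) = distance between the query window centered on residue ‘r’ and its j-th neighbor, K = number of neighbors, m = 2 (fuzzifier)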
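The membership update of the fuzzy K-NN rule of Keller et al. (1985) weights each neighbor's class memberships by the inverse distance raised to 2/(m − 1). A minimal sketch follows; the function names, the data layout, and the small `eps` guard against zero distances are assumptions, not details from the authors' C implementation.

```python
def fuzzy_knn_memberships(neighbors, m=2.0, eps=1e-9):
    """Compute class memberships of a query residue.

    neighbors : list of (distance, {class: membership}) for the K nearest windows
    m         : fuzzifier (the slides use m = 2)
    eps       : assumed lower bound on distances to avoid division by zero
    """
    exponent = 2.0 / (m - 1.0)
    weights = [1.0 / (max(d, eps) ** exponent) for d, _ in neighbors]
    norm = sum(weights)
    classes = list(neighbors[0][1].keys())
    return {
        c: sum(w * u[c] for w, (_, u) in zip(weights, neighbors)) / norm
        for c in classes
    }


def assign_class(memberships):
    """Pick the class with the highest membership value."""
    return max(memberships, key=memberships.get)


neighbors = [
    (1.0, {"H": 1.0, "E": 0.0, "C": 0.0}),
    (2.0, {"H": 0.0, "E": 0.6, "C": 0.4}),
]
u = fuzzy_knn_memberships(neighbors)
print(assign_class(u))  # prints H
```

With m = 2 the neighbor at distance 1 gets weight 1 and the neighbor at distance 2 gets weight 1/4, so the closer neighbor dominates the result.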
Structure Filtration l In the basic setting, the secondary structure state is the class with the highest membership value l Unrealistic structures may be present l Popular methods of structure filtration å Neural Network å Heuristic based
Heuristic Filter 1. Smooth the membership values 2. Filter unrealistic structures l Helix > 3 amino acids, β-sheet > 2 amino acids 3. Calculate the thresholds to filter noise 4. Mark the possible Helix and Sheet regions l Resolve conflicts based on the average membership value in the overlap region 5. Fill the rest of the structure with Coil
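Step 2 of the filter can be illustrated in a much simplified form by rewriting helix and sheet runs that are too short as coil. The filter described above works on membership values and thresholds that the slides do not specify, so this sketch operates on the predicted string only; `drop_short_segments` and `MIN_LENGTH` are hypothetical names.

```python
from itertools import groupby

# Helix must be longer than 3 residues, sheet longer than 2 (from the slide).
MIN_LENGTH = {"H": 4, "E": 3}


def drop_short_segments(prediction):
    """Replace helix/sheet runs shorter than the minimum length with coil."""
    out = []
    for state, run in groupby(prediction):
        length = len(list(run))
        if length < MIN_LENGTH.get(state, 1):
            out.append("C" * length)   # unrealistic segment becomes coil
        else:
            out.append(state * length)
    return "".join(out)


print(drop_short_segments("CCHCCHHHHCEECEEEC"))  # prints CCCCCHHHHCCCCEEEC
```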
Filter: Final Structure
Unfiltered CCCCCHCCCCCHHHHHHHHCCCCCCEEEEECCCCCCCCCCCCCEEEEEECCCCCCHHHCCCCC
Target     CCCHHHCCCCHHHHHHHHHHHCCCCEEEEEECCCCEECCCCCCEEEEEEECCCCEECCCCEEC
Filtered   CCHHHHCCCHHHHHHHHHHHHHCCCEEEEEECCCCCCCCCCCCEEEEEEECCCCCCCCCCCCC
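Metrics l Seven commonly used metrics å $Q_3 = \dfrac{\text{number of correctly predicted residues}}{\text{total number of residues}} \times 100$ å $Q_i = \dfrac{\text{number of residues correctly predicted in state } i}{\text{total number of residues in state } i} \times 100$, for $i \in \{H, E, C\}$ å Matthews Correlation Coefficient, computed per state: $MCC = \dfrac{pn - uo}{\sqrt{(p+u)(p+o)(n+u)(n+o)}}$ where p = true positives, n = true negatives, u = false negatives, o = false positives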
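The seven metrics (Q3, one per-state Q and one Matthews coefficient for each of H, E and C) can be computed from a predicted and an observed string as sketched below. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
import math


def q3(predicted, observed):
    """Percentage of residues whose three-state prediction is correct."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def q_state(predicted, observed, state):
    """Percentage of residues observed in `state` that are predicted correctly."""
    preds = [p for p, o in zip(predicted, observed) if o == state]
    return 100.0 * preds.count(state) / len(preds) if preds else 0.0


def mcc(predicted, observed, state):
    """Matthews correlation coefficient for one state."""
    p = n = u = o = 0
    for pred, obs in zip(predicted, observed):
        if pred == state and obs == state:
            p += 1          # true positive
        elif pred != state and obs != state:
            n += 1          # true negative
        elif obs == state:
            u += 1          # false negative (under-predicted)
        else:
            o += 1          # false positive (over-predicted)
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0


predicted = "HHHCCEEC"
observed = "HHCCCEEE"
print(q3(predicted, observed))  # prints 75.0
```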
Results l Performance on a database of 1973 proteins (<25% sequence identity) generated by the PISCES server [1] l Table columns: Q3 (%), QH (%), QE (%), QC (%), MH, ME, MC; rows: Unfiltered, Filtered l [1] G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19, 2003.
Relative Performance l Table columns: Method, Accuracy; rows: MBR, NN, NNSSP, PFKNN l References: X. Zhang, J. P. Mesirov and D. L. Waltz. Hybrid System for Protein Secondary Structure Prediction. J. Mol. Biol., 225. Tau-Mu Yi and E. S. Lander. Protein Secondary Structure Prediction using Nearest-Neighbor Methods. J. Mol. Biol., 232. A. A. Salamov and V. V. Solovyev. Prediction of Protein Secondary Structure by Combining Nearest-neighbor Algorithm and Multiple Sequence Alignments. J. Mol. Biol., 247:11-15, 1995.
Summary l A novel approach for PSSP å Evolutionary information å K-Nearest Neighbor algorithm å Fuzzy set theory l Most accurate KNN approach to date l Easily expandable l Accuracy increases with new structures l Average computing time < 1 min on a single CPU machine
Future Work l System with faster search capabilities å Efficient search for neighbors l Accurate prediction system
Acknowledgements l Dr. James Keller for insight into the Fuzzy K-Nearest Neighbor Algorithm l Oak Ridge National Laboratory for providing the supercomputing facilities l Members of Digital Biology Laboratory for their support
Software The enhanced version of the software is coded in C and is available upon request from the authors.
Thank you for your participation!