Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction Rajkumar Bondugula, Ognen Duzlevski and Dong Xu Digital Biology.

Slides:



Advertisements
Similar presentations
Secondary structure prediction from amino acid sequence.
Advertisements

Data Mining Classification: Alternative Techniques
Three-Stage Prediction of Protein Beta-Sheets Using Neural Networks, Alignments, and Graph Algorithms Jianlin Cheng and Pierre Baldi Institute for Genomics.
Hidden Markov models for detecting remote protein homologies Kevin Karplus, Christian Barrett, Richard Hughey Georgia Hadjicharalambous.
Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
Protein Secondary Structures
CS 590M Fall 2001: Security Issues in Data Mining Lecture 3: Classification.
Instance Based Learning
Garnier-Osguthorpe-Robson
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
Author: Jason Weston et., al PANS Presented by Tie Wang Protein Ranking: From Local to global structure in protein similarity network.
Structure Prediction in 1D
Introduction to Bioinformatics - Tutorial no. 8 Predicting protein structure PSI-BLAST.
INSTANCE-BASE LEARNING
CS Instance Based Learning1 Instance Based Learning.
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Protein Tertiary Structure Prediction Structural Bioinformatics.
Protein Structures.
Template-based Prediction of Protein 8-state Secondary Structures June 12 th 2013 Ashraf Yaseen and Yaohang Li DEPARTMENT OF COMPUTER SCIENCE OLD DOMINION.
Predicting Protein Solvent Accessibility with Sequence, Evolutionary Information and Context-based Features 12/05/2013 Ashraf Yaseen Department of Mathematics.
Protein Tertiary Structure Prediction
Lecture 11, CS5671 Secondary Structure Prediction Progressive improvement –Chou-Fasman rules –Qian-Sejnowski –Burkhard-Rost PHD –Riis-Krogh Chou-Fasman.
Protein Sequence Alignment and Database Searching.
Rising accuracy of protein secondary structure prediction Burkhard Rost
Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.
Protein Secondary Structure Prediction Some of the slides are adapted from Dr. Dong Xu’s lecture notes.
Protein Secondary Structure Prediction. Input: protein sequence Output: for each residue its associated Secondary structure (SS): alpha-helix, beta-strand,
Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method Wen-Lian Hsu Institute of Information Science Academia Sinica, Taiwan.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices Yan Liu Sep 29, 2003.
Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha.
Secondary structure prediction
2 o structure, TM regions, and solvent accessibility Topic 13 Chapter 29, Du and Bourne “Structural Bioinformatics”
A Study of Residue Correlation within Protein Sequences and its Application to Sequence Classification Christopher Hemmerich Advisor: Dr. Sun Kim.
Protein Classification II CISC889: Bioinformatics Gang Situ 04/11/2002 Parts of this lecture borrowed from lecture given by Dr. Altman.
Web Servers for Predicting Protein Secondary Structure (Regular and Irregular) Dr. G.P.S. Raghava, F.N.A. Sc. Bioinformatics Centre Institute of Microbial.
Protein secondary structure Prediction Why 2 nd Structure prediction? The problem Seq: RPLQGLVLDTQLYGFPGAFDDWERFMRE Pred:CCCCCHHHHHCCCCEEEECCHHHHHHCC.
Protein Secondary Structure Prediction G P S Raghava.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
1 Protein Structure Prediction (Lecture for CS397-CXZ Algorithms in Bioinformatics) April 23, 2004 ChengXiang Zhai Department of Computer Science University.
Meng-Han Yang September 9, 2009 A sequence-based hybrid predictor for identifying conformationally ambivalent regions in proteins.
Study of Protein Prediction Related Problems Ph.D. candidate Le-Yi WEI 1.
Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.
Protein Structure Prediction ● Why ? ● Type of protein structure predictions – Sec Str. Pred – Homology Modelling – Fold Recognition – Ab Initio ● Secondary.
Protein motif extraction with neuro-fuzzy optimization Bill C. H. Chang and Author : Bill C. H. Chang and Saman K. Halgamuge Saman K. Halgamuge Adviser.
1 Improve Protein Disorder Prediction Using Homology Instructor: Dr. Slobodan Vucetic Student: Kang Peng.
Neural Text Categorizer for Exclusive Text Categorization Journal of Information Processing Systems, Vol.4, No.2, June 2008 Taeho Jo* 報告者 : 林昱志.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
Feature Extraction Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and.
Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors by Dennis DeCoste and Dominic Mazzoni International.
Comparative methods Basic logics: The 3D structure of the protein is deduced from: 1.Similarities between the protein and other proteins 2.Statistical.
Combining Evolutionary Information Extracted From Frequency Profiles With Sequence-based Kernels For Protein Remote Homology Detection Name: ZhuFangzhi.
Final Report (30% final score) Bin Liu, PhD, Associate Professor.
Meta-learning for Algorithm Recommendation Meta-learning for Algorithm Recommendation Background on Local Learning Background on Algorithm Assessment Algorithm.
Computational Biology, Part C Family Pairwise Search and Cobbling Robert F. Murphy Copyright  2000, All rights reserved.
Machine Learning Methods of Protein Secondary Structure Prediction Presented by Chao Wang.
CS 8751 ML & KDDInstance Based Learning1 k-Nearest Neighbor Locally weighted regression Radial basis functions Case-based reasoning Lazy and eager learning.
Sparse nonnegative matrix factorization for protein sequence motifs information discovery Presented by Wooyoung Kim Computer Science, Georgia State University.
Predicting Structural Features Chapter 12. Structural Features Phosphorylation sites Transmembrane helices Protein flexibility.
Improved Protein Secondary Structure Prediction. Secondary Structure Prediction Given a protein sequence a 1 a 2 …a N, secondary structure prediction.
Using the Fisher kernel method to detect remote protein homologies Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16)
BIOINFORMATION A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation - - 王红刚 14S
MIRA, SVM, k-NN Lirong Xia. MIRA, SVM, k-NN Lirong Xia.
Extra Tree Classifier-WS3 Bagging Classifier-WS3
Introduction to Bioinformatics II
Yuchun Tang (1), Preeti Singh (1), Yanqing Zhang (1),
Protein Structures.
Generalizations of Markov model to characterize biological sequences
MIRA, SVM, k-NN Lirong Xia. MIRA, SVM, k-NN Lirong Xia.
Presentation transcript:

Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction Rajkumar Bondugula, Ognen Duzlevski and Dong Xu Digital Biology Laboratory, Dept. of Computer Science University of Missouri – Columbia, MO 65211, USA

Outline l Introduction å Protein secondary structure prediction å Popular methods å K-Nearest Neighbor method å Fuzzy K-Nearest Neighbor method l Methods l Filtering the prediction l Results and discussion l Summary and Future work

Introduction l Goal: Given a sequence of amino acids, predict in which one of the eight possible secondary structures states {H, G, I, B, E, C, S,T} will each residue fold in to. l CASP convention å {H,G,I} → H å {B,E} → E å {C,S,T} → C l Example: Amino Acid VKDGYIVDXVNCTYFCGRNAYCNEECTKLXGEQWASPYYCYXLPDHVRTKGPGRCH Secondary Structure CEEEEEECCCCCCCCCCCHHHHHHHHHHCCCCEEEECCEEEEECCCCCCCCCCCCC

Protein 3-Dimensional structure

Importance of Secondary Structure l An intermediate step in 3D structure prediction å structure → function l Classification å Ex: α, β, α/β, α+β l Helps in protein folding pathway determination

Existing Methods l Popular Methods å Neural Network methods X Ex: PSIPRED, PHD å Nearest Neighbor methods X Ex: NNSSP å Hidden Markov Model methods

Why K-Nearest Neighbors method? l Methods based on Neural Networks and Hidden Markov models å perform well if the query protein have many homologs in the sequence database å not easily expandable l The 1-Nearest Neighbor rule is bound above by no more than twice the optimal Baye’s error rate [Keller et. al, 1985] l K-NN will work better and better as more and more structures are being solved

K-Nearest Neighbor Algorithm Instances to be classified Classified instances

Instances to be classified Classified instances K-Nearest Neighbor Algorithm

Instances to be classified class B class F

K-Nearest Neighbor Algorithm l Advantages of Nearest Neighbor methods å Simple and transparent model å New structures can be added without re-training å Linear complexity l Disadvantage å Slower compared to other models as processing is delayed until prediction is needed

Why Fuzzy K-NN? l Disadvantages of Crisp K-NN å Atypical examples are given as much as weight as those that truly represent a particular class å Once instance is assigned to a class, there is no indication of its “strength” of its membership in that class

- - - N L G A G N S G L N L G H V A L T F

- - - N L G A N L G A G N S G L N L G H V A L T F

- - - N L G A- - N L G A G N L G A G N S G L N L G H V A L T F

- - - N L G A- - N L G A G- N L G A G N N L G A G N S G L N L G H V A L T F

- - - N L G A- - N L G A G- - N L G A NN L G A G N S N L G A G N S G L N L G H V A L T F

- - - N L G A- - N L G A G-N L G A G N S L G A G N S G N L G A G N S G L N L G H V A L T F

Position Specific Scoring Matrix Length of protein(l) 20 PSI-BLAST... N L G A G N S G L N L G H V A L T F... ARNDCQEGHILKMFPSTWYVARNDCQEGHILKMFPSTWYV

Why Profile-FKNN? l Evolutionary information has been shown to increase the accuracy of secondary structure prediction by many popular methods l An attempt to combine the advantages of incorporating the evolutionary information, fuzzy set theory and nearest neighbor methods

Methods l Calculate profiles using PSI-BLAST å The popular Rost and Sander database of 126 representative proteins (<25% sequence Identity) l Find K-Nearest Neighbors l Calculate the membership values of the neighbors l Calculate the membership values of the current residue l Assign classes l Filter the output

Profile Calculation l The profiles of both the query protein and the test protein are calculated using the program PSI-BLAST l Parameters for PSI-BLAST å Expectation Value (e) = 0.1 å Maximum number of passes (j) = 3 å E-value threshold for inclusion in multi-pass model (h) = 5 å Default values for the rest of the parameters

K-Nearest Neighbors l For each profile-window in the query protein, the position-weighted absolute distance ‘d’ is calculated from all profile-windows of all proteins in the database. l The profile-windows corresponding to K smallest distances are retained as the K-Nearest Neighbors

Distance Calculation N L G A G N S G L T F N L G A G N S G L N L G H V A L T F...

N L G A G N S G L T F N L G A G N S G L N L G H V A L T F... Distance Calculation

N L G A G N S G L T F N L G A G N S G L N L G H V A L T F... Distance Calculation

N L G A G N S G L T F N L G A G N S G L N L G H V A L T F... Distance Calculation

N L G A G N S G L T F N L G A G N S G L N L G H V A L T F... Distance Calculation

N L G A G N S G L T F N L G A G N S G L N L G H V A L T F... Distance Calculation

Membership Values of the Neighbors l The memberships of the nearest neighbors are assigned based on their corresponding secondary structures in various positions in the window l The residues near to the center are weighed more than the residues that are farther away

Membership values of the Neighbors H E C C C E E E C C H = 0 E = 0.200x x x1 = 0.6 C = 0.067x x x x1 = 0.4 C C E E E C CE N L G A G N SA

Membership Value l The membership values of each residue in classes Helix, Sheet and Coil is calculated from the corresponding neighbors using the Fuzzy K-NN algorithm l Each residue is assigned to class in which it has the highest membership value Helix = Sheet = Coil = Final =... C C H H H H C C C C E E E C C C...

Fuzzy K-Nearest neighbor Algorithm BEGIN Initialize i=1. DO UNTIL(r assigned membership in all classes) Compute u i (r) using Increment i. END DO UNTIL END Where, u i = membership value of residue ‘r’ in class ‘i’, i = Helix, Sheet or Coil d(r,r j )= distance between query window centered in residue ‘r’ its j th neighbor m = 2 (Fuzzifier)

Structure Filtration l In the basic setting, the secondary structure state is class with highest membership value l Unrealistic structures may be present l Popular methods of structure filtration å Neural Network å Heuristic based

Heuristic Filter 1. Smoothen the memberships values 2. Filter unrealistic structures l Helix > 3 amino acids,  -sheet > 2 amino acids 3. Calculate the thresholds to filter noise 4. Mark the possible Helix and Sheet regions l Resolve conflicts based on average membership value in overlap region 5. Fill the rest of the structure with Coil

Filter: Final Structure Unfiltered CCCCCHCCCCCHHHHHHHHCCCCCCEEEEECCCCCCCCCCCCCEEEEEECCCCCCHHHCCCCC Target CCCHHHCCCCHHHHHHHHHHHCCCCEEEEEECCCCEECCCCCCEEEEEEECCCCEECCCCEEC Filtered CCHHHHCCCHHHHHHHHHHHHHCCCEEEEEECCCCCCCCCCCCEEEEEEECCCCCCCCCCCCC

Metrics l Seven commonly used metrics å Q 3 = Number of correctly predicted residues x 100 Total number of residues å Q = Number of residues correctly predicted X100 Total number of residues in å Matthew’s Correlation Coefficient MCC = where, p – true positives n – true negatives u – false negatives o – false positives

Results Q 3 (%)Q H (%)Q E (%)Q C (%)MHMH MEME MCMC Unfiltered Filtered Performance on database of 1973 proteins (<25% sequence identity) generated by the PISCES 1 server 1. G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19: , 2003.

Relative Performance MethodAccuracy MBR NN NNSSP PFKNN X. Zhang, J. P. Mesirov and D.L Waltz. Hybrid system for Protein Secondary Structure Prediction. J. Mol. Biol., 225: , Tau-Mu Yi and E. S. Lander. Protein Secondary Structure Prediction using Nearest-Neighbor Methods. J. Mol. Biol., 232: , A. A. Salamov and V. V. Solovyev. Prediction of Protein Secondary Structure by Combining Nearest-neighbor Algorithm and Multiple Sequence Alignments. J. Mol. Biol., 247:11-15, 1995

Summary l A novel approach for PSSP å Evolutionary information å K-Nearest Neighbor algorithm å Fuzzy set theory l Most accurate KNN approach to date l Easily expandable l Accuracy increases with new structures l Average computing time < 1 min on a single CPU machine

Future Work l System with faster search capabilities å Efficient search for neighbors l Accurate prediction system

Acknowledgements l Dr. James Keller for insight into the Fuzzy K-Nearest Neighbor Algorithm l Oak Ridge National Laboratory for providing the supercomputing facilities l Members of Digital Biology Laboratory for their support

Software The enhanced version of the software is coded in C and is available upon request. Please your requests to or

Thank you for Participation!