Prediction of catalytic residues in proteins using machine learning techniques

Presentation transcript:

Prediction of catalytic residues in proteins using machine learning techniques

Natalia V. Petrova and Cathy H. Wu
Protein Information Resource, Georgetown University, Washington, DC

INTRODUCTION

The growing gap between experimentally characterized and uncharacterized proteins necessitates the development of new computational methods for functional prediction. Although computational methods to predict catalytic residues and active sites are developing rapidly, their accuracy remains low ( %), with a significant number of false positives. We present a novel method for the prediction of catalytic sites, using a machine learning approach, and analyze the results in a case study of a large, evolutionarily diverse group of proteins.

METHODS & RESULTS

We used a dataset of 79 enzymes with experimentally identified catalytic sites from the CATRES database as our benchmarking dataset [Table 1] for the initial analysis. In the 10-fold cross-validation analysis, the best result was achieved with SMO [Figure 1], a support vector machine algorithm that builds an optimal hyperplane in the multidimensional space of attributes in order to achieve maximal separation of the positively and negatively labeled samples. The Scorecons conservation score is a key attribute in the prediction, as shown by the performance of the SMO algorithm using individual attributes [Figure 2]. Seven out of 24 attributes were chosen by the Wrapper Subset Selection algorithm as an optimal subset of attributes for the SMO algorithm, and no further reduction of the set is possible [Table 2].

Figure 2. Predictive accuracy of the SMO algorithm using each attribute separately. Accuracy scale: 50%, guessing; 100%, all correct.

ACKNOWLEDGEMENTS: This work would not have been completed without the help and guidance of our colleagues at PIR: W. C. Barker, H. Huang, A. Nikolskaya, S. Vasudevan, and C. R. Vinayaka.

CONCLUSIONS

1. The prediction accuracy of our method is > 86% [Table 2].
2. An additional analytical step correctly identified the catalytic triad of α/β hydrolases and reduced false positives to 1.06%.
3. The method can be used to identify candidate catalytic residues for proteins with known structure but unknown function.

CASE STUDY: α/β hydrolases (Further Optional Analytical Step)

The prediction capabilities of our method were tested on a diverse superfamily of hydrolytic enzymes with the α/β hydrolase fold and different catalytic functions (Pfam domain PF00561). All enzymes have a catalytic triad and conserved structural features [Figure 3A]. Even though the algorithm predicted a large number of false positives for each individual protein [Figure 3C], further improvement can be achieved by merging the results for a group of related proteins. For 16 out of 17 enzymes, the method correctly predicted all 3 residues of the triad, with on average 3 false positives out of 282 residues (1.06%). For one protein, 1cv2, the method missed a substituted catalytic residue of the triad [Figure 3]. Two of the 3 false positive residues (His and Gly) are believed to be important for enzymatic activity, and all three are essential for protein structural stability.
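The merging step in the case study can be illustrated with a minimal sketch. The poster does not state how the per-protein results were combined, so the rule below (keep an alignment column only if it is predicted catalytic in a chosen fraction of the related proteins) is an assumption, and the function names, protein labels, and column numbers are hypothetical. The last helper only reproduces the arithmetic behind the reported 1.06% figure (3 false positives out of 282 residues).

```python
from collections import Counter


def merge_predictions(per_protein, min_fraction=1.0):
    """Return alignment columns predicted catalytic in at least
    `min_fraction` of the proteins (an assumed merging rule)."""
    n_proteins = len(per_protein)
    # Count, for every alignment column, how many proteins predict it.
    counts = Counter(col for cols in per_protein.values() for col in set(cols))
    return sorted(col for col, n in counts.items() if n / n_proteins >= min_fraction)


def false_positive_percent(n_false_positives, n_residues):
    """False positives as a percentage of all residues in the protein."""
    return 100.0 * n_false_positives / n_residues


# Toy input: predicted alignment columns for three made-up proteins.
toy = {
    "protA": {12, 40, 88, 131},
    "protB": {12, 57, 88, 131},
    "protC": {12, 88, 99, 131},
}

consensus = merge_predictions(toy)            # columns shared by all three
fp_percent = false_positive_percent(3, 282)   # figures quoted in the case study

print(consensus)
print(round(fp_percent, 2))
```

With a strict threshold, columns predicted in only one protein (the per-protein false positives in this toy input) drop out, while columns shared across the group survive. Lowering `min_fraction` trades that filtering for sensitivity.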
Table 1. Benchmarking Dataset

# proteins: 79
# residues: 23,664
# catalytic residues: 254
PIR: 100% curated
PDB: 100% X-ray crystallography
SCOP: 10.1% all α; 10.1% all β; 30.4% [label missing]; 48.1% [label missing]; 1.3% small proteins
EC number: 20.3% oxidoreductases; 26.6% transferases; 27.8% hydrolases; 17.7% lyases; 2.5% isomerases; 5.1% ligases

Table 2. Final Attribute Set

Figure 1. Performance of each algorithm measured by MCC (panel title: Algorithm Performance Measurements). MCC scale: 0, guessing; 1, all correct.

Figure 3. Prediction results for 9 curated protein families of α/β hydrolases (panels A, B, C). Legend: LID; aligned regions; catalytic triad, true positives (TP); consensus of false positives (FP); not aligned regions, length (FP); false negative (FN). Residue labels: Asp/Asn, His, Gly, Sep/Asp, His, Asp, Glu.

CITATION: Petrova NV, Wu CH: Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties. BMC Bioinformatics, 7:312,
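The two performance scales used in the figures (MCC: 0 for guessing, 1 for all correct; accuracy: 50% for guessing, 100% for all correct) can be reproduced with the standard textbook definitions. The formulas on the poster itself did not survive in this transcript, so the sketch below assumes the usual confusion-matrix definitions rather than the authors' exact expressions. The 50% guessing baseline for accuracy holds when positive and negative samples are equally represented.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of residues labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Balanced toy counts (not data from the poster).
perfect = (50, 50, 0, 0)     # every residue labeled correctly
chance = (25, 25, 25, 25)    # labels unrelated to the truth

print(accuracy(*perfect), mcc(*perfect))
print(accuracy(*chance), mcc(*chance))
```

On the balanced toy counts, a perfect classifier reaches both ceilings and a chance-level classifier sits at the guessing baseline of each scale.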