PREDICTION OF CATALYTIC RESIDUES IN PROTEINS USING MACHINE-LEARNING TECHNIQUES

Natalia V. Petrova (Ph.D. Student, Georgetown University, Biochemistry Department), Cathy H. Wu (Protein Information Resource, Georgetown University, Departments of Biochemistry and Molecular Biology and Oncology)

ABSTRACT

We present a method for predicting catalytic residues in proteins using machine-learning techniques. Using a benchmarking dataset of enzymes with known catalytic sites, we identified the best-performing machine-learning algorithm (the support vector classifier SMO) and the features of protein residues that are relevant for the prediction. For proteins of unknown function, the method can predict catalytic residues and the 3D location of the active site with an accuracy > 86%, provided that the structure of the protein is known.

INTRODUCTION

One of the major goals of proteomics is to assign a function to every protein. Knowledge of a protein's function is key to determining the role it plays in the cell. The number of proteins whose functions have been experimentally characterized is growing linearly every year. Experimental data provide reliable (in most cases) information about functional residues as well as the possible mechanism of protein function. However, the analytical methods used for experimental characterization of protein function involve many man-hours. This effort can be reduced by improving existing methods or, perhaps, by developing new methods in experimental biology. But since the sizes of the protein sequence and protein structure databases are growing exponentially, the gap between experimentally characterized and uncharacterized proteins is also growing exponentially.

As a result, two major groups of computational methods are being developed: homology transfer of known experimental data, and prediction of protein function using various properties of proteins and amino acids. Prediction of functional residues is a challenging and interesting task. The results of such prediction could be used in many research areas, such as drug design, experimental biology, and protein database annotation.

METHODS

Using the "Catalytic Residue Dataset", we built a dataset in which each instance is represented as a list of attribute values and a class label {+1 / -1} indicating whether the residue is catalytic (+1) or not (-1). Each attribute in this dataset is a property of a protein residue. The list of attributes was chosen mostly on the basis of the work of Bartlett et al. and of other authors who pointed out the importance of particular residue properties (2).

Since for a complex dataset it is almost impossible to know a priori which classification algorithm will perform best, our first goal was to determine one of the best-performing algorithms among the machine-learning techniques built into WEKA, a Java software package (3, 4).

Different authors seem to focus on different features of the protein in order to predict catalytic residues. We therefore identified the relevant features of protein residues using our benchmarking dataset of enzymes with known catalytic sites and the machine-learning attribute selection algorithm "Wrapper" (5). The selected attributes, combined with the best-performing algorithm, were used to build a model for the prediction of catalytic residues (6).

BENCHMARKING DATASET

To train the machine-learning algorithms we used a benchmarking dataset that is a subset of the "Catalytic Residue Dataset". Every protein in the benchmarking dataset is a member of a manually curated protein family in the PIR iProClass database. The dataset contains 254 catalytic residues from 79 proteins, out of the 178 enzymes in the Catalytic Residue Dataset (1).

RESULTS

Poster panels: PRELIMINARY ATTRIBUTE SET; FINAL ATTRIBUTE SET; BEST-PERFORMING CLASSIFIER – 'SMO'; WRAPPER ATTRIBUTE SELECTION ALGORITHM; Model.

SMO is the best-performing algorithm (among those tested) for the prediction of catalytic residues.

The performance of the support vector classifier suggests that linear separation along one dimension, corresponding to a single feature, is not sufficient for the prediction of catalytic residues.

Reducing the number of attributes increases the prediction accuracy of the SMO algorithm. 8 out of 24 attributes are selected as relevant for the prediction of catalytic residues.

The SMO algorithm trained on the dataset represented by the selected attributes has:
Prediction accuracy: > 86%
TP rate: 0.898
FP rate: 0.126

EXAMPLES OF PREDICTION

Acetyl-CoA acetyltransferase, 1afw. Catalytic residues: C125, H375, C403, G405.
Acylphosphatase, 2acy. Catalytic residues: R23, N41.

Figure legends, in the order they appear on the poster:
True Positive (TP): red; False Positive (FP): yellow
True Positive (TP): red; False Positive (FP): blue

CONCLUSIONS

SMO (the support vector classifier) was found to be the best-performing algorithm among those tested for the prediction of catalytic residues (4). 8 attributes out of 24 were selected as relevant. As anticipated, attribute selection improved the performance of the SMO classifier (5). We measured the prediction accuracy with each individual attribute removed in turn and found that no attribute can be excluded from the final list without reducing the performance of the SMO classifier (5).

REFERENCES

GenBank database statistics.
PDB database statistics.
Bartlett G. J., Porter C. T., Borkakoti N., Thornton J. M. Analysis of Catalytic Residues in Enzyme Active Sites. J. Mol. Biol., 324, 2002.
Campbell S. J., Gold N. D., Jackson R. M., Westhead D. R. Ligand binding: functional site location, similarity and docking. Current Opinion in Structural Biology, 13, 2003.
Sjolander K., Karplus K., Brown M., Hughey R., Krogh A., Mian S., Haussler D. Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology, 1996.
Smith D. K., Radivojac P., Obradovic Z., Dunker A. K., Zhu G. Improved amino acid flexibility parameters. Protein Science, 12, 2003.

ACKNOWLEDGEMENTS

This work would not have been complete without the wise help and guidance provided by our colleagues at PIR:
Hongzhan Huang, Ph.D. (PIR: Team Lead, Bioinformatics and Research Assistant Professor)
Sona Vasudevan, Ph.D. (PIR: Senior Bioinformatics Scientist)
C.R. Vinayaka, Ph.D. (PIR: Senior Research Scientist)
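
SUPPLEMENTARY ILLUSTRATION: EVALUATION MEASURES

The poster reports prediction accuracy, TP rate and FP rate for residues labelled +1 (catalytic) or -1 (non-catalytic). The Python sketch below shows one conventional way to compute these three measures from true and predicted labels. It is an illustration only, not the authors' WEKA workflow: the names `Metrics` and `evaluate` are hypothetical, and the toy labels are invented for the example and are not taken from the benchmarking dataset.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Metrics:
    accuracy: float  # fraction of residues classified correctly
    tp_rate: float   # TP / (TP + FN): catalytic residues that were recovered
    fp_rate: float   # FP / (FP + TN): non-catalytic residues flagged as catalytic


def evaluate(true_labels: Sequence[int], predicted_labels: Sequence[int]) -> Metrics:
    """Compute accuracy, TP rate and FP rate for +1 / -1 class labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")

    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == -1 and pred == 1:
            fp += 1
        elif truth == -1 and pred == -1:
            tn += 1
        elif truth == 1 and pred == -1:
            fn += 1
        else:
            raise ValueError("labels must be +1 or -1")

    positives = tp + fn
    negatives = fp + tn
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present to compute the rates")

    return Metrics(
        accuracy=(tp + tn) / (positives + negatives),
        tp_rate=tp / positives,
        fp_rate=fp / negatives,
    )


if __name__ == "__main__":
    # Invented toy labels (assumption): 4 catalytic and 6 non-catalytic residues.
    truth = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
    predicted = [1, 1, 1, -1, -1, -1, -1, -1, 1, -1]
    m = evaluate(truth, predicted)
    print(f"accuracy={m.accuracy:.2f} TP rate={m.tp_rate:.2f} FP rate={m.fp_rate:.2f}")
```

On the toy labels this gives an accuracy of 0.80, a TP rate of 0.75 and an FP rate of about 0.17. Both rates are fractions between 0 and 1, which is how the values 0.898 and 0.126 in the results are best read.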