PREDICTION OF CATALYTIC RESIDUES IN PROTEINS USING MACHINE-LEARNING TECHNIQUES

Natalia V. Petrova (Ph.D. Student, Georgetown University, Biochemistry Department), Cathy H. Wu (Protein Information Resource, Georgetown University, Departments of Biochemistry and Molecular Biology and Oncology)

ABSTRACT

We present a method for the prediction of catalytic residues in proteins using machine learning techniques. Using a benchmarking dataset of enzymes with known catalytic sites, we identified the best-performing machine learning algorithm (the support vector classifier SMO) and the features of protein residues that are relevant for the prediction of catalytic residues. For proteins of unknown function, the method predicts catalytic residues and the 3D location of the active site with an accuracy > 86%, provided that the structure of the protein is known.

CONCLUSIONS

SMO (the support vector classifier) was found to be the best-performing algorithm among those tested for the prediction of catalytic residues (4). Eight of the 24 attributes were selected as relevant. As anticipated, attribute selection improved the performance of the SMO classifier (5). We measured prediction accuracy with each individual attribute removed in turn and found that no attribute can be excluded from the final list without reducing the performance of the SMO classifier (5).

BENCHMARKING DATASET

To train a machine learning algorithm, we used a benchmarking dataset that is a subset of the “Catalytic Residue Dataset” database. Every protein in the benchmarking dataset is a member of a manually curated protein family of the PIR iProClass database. The dataset has 254 catalytic residues from 79 proteins, out of the 178 enzymes in the Catalytic Residue Dataset (1).
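The attribute-removal check described in CONCLUSIONS can be sketched as follows. This is a minimal illustration under stated assumptions, not the poster's WEKA workflow: a simple nearest-centroid rule stands in for SMO, the toy data and the attribute names `attr_a` and `attr_b` are hypothetical, and accuracy is estimated by leave-one-instance-out evaluation, which the poster does not specify.

```python
from statistics import mean

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]

def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT SMO): return the label of the closer class centroid."""
    pos = centroid([r for r, label in zip(train_X, train_y) if label == +1])
    neg = centroid([r for r, label in zip(train_X, train_y) if label == -1])
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
    return +1 if d_pos <= d_neg else -1

def loo_accuracy(X, y):
    """Leave-one-instance-out accuracy of the stand-in classifier."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
    return hits / len(X)

def attribute_ablation(X, y, names):
    """Return the baseline accuracy and the accuracy change when each attribute is dropped."""
    baseline = loo_accuracy(X, y)
    report = {}
    for j, name in enumerate(names):
        reduced = [row[:j] + row[j + 1:] for row in X]  # remove column j
        report[name] = loo_accuracy(reduced, y) - baseline
    return baseline, report

# Hypothetical toy data: +1 = catalytic residue, -1 = non-catalytic residue.
names = ["attr_a", "attr_b"]
X = [[1.0, 0.0], [0.9, 1.0], [1.1, 0.5],    # catalytic
     [0.0, 0.0], [0.1, 1.0], [-0.1, 0.5]]   # non-catalytic
y = [+1, +1, +1, -1, -1, -1]

baseline, report = attribute_ablation(X, y, names)
print("baseline accuracy:", baseline)
for name, delta in report.items():
    print(f"without {name}: accuracy change {delta:+.3f}")
```

A negative change means the attribute carries information the classifier needs. In this toy data only `attr_a` matters, whereas the poster reports that every attribute in its final list was needed.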
METHODS

Using the “Catalytic Residue Dataset”, we built a dataset in which each instance is represented as a list of attribute values and a class label {+1 / -1} indicating whether the residue is catalytic (+1) or not (-1). Each attribute in this dataset is a property of a protein residue. The list of attributes was chosen based mostly on the work of Bartlett et al. and of other authors who pointed out the importance of particular residue properties (2).

Since for a complex dataset it is almost impossible to know a priori which classification algorithm will perform best, our first goal was to determine one of the best-performing algorithms among the machine learning techniques built into WEKA, a Java software package (3, 4).

Different authors focus on different features of the protein in order to predict catalytic residues. We therefore identified the relevant features of protein residues using our benchmarking dataset of enzymes with known catalytic sites and the machine learning attribute selection algorithm “Wrapper” (5). The selected attributes, combined with the best-performing algorithm, were used to build a model for the prediction of catalytic residues (6).

INTRODUCTION

One of the major goals of proteomics is to assign a function to every protein. Knowledge of a protein's function is key to determining the role it plays in the cell. The number of proteins whose functions have been experimentally characterized grows linearly every year. Experimental data provide reliable (in most cases) information about protein functional residues as well as the possible mechanism of protein function. However, the analytical methods used for experimental characterization of protein function involve many man-hours. This effort can be reduced either by improving existing methods or, perhaps, by developing new methods in experimental biology.
But since the protein sequence and protein structure databases are growing exponentially, the gap between experimentally characterized and uncharacterized proteins is also growing exponentially. As a result, two major groups of computational methods are being actively developed: homology transfer of known experimental data, and prediction of protein function from various properties of proteins and amino acids. Prediction of functional residues is a challenging and interesting task. The results of such prediction could be used in many research areas, such as drug design, experimental biology, and protein database annotation.

EXAMPLES OF PREDICTION

Acetyl-CoA Acetyltransferase, 1afw. Catalytic Residues: C125, H375, C403, G405
Acylphosphatase, 2acy. Catalytic Residues: R23, N41
Figure legends: True Positive (TP): red, False Positive (FP): yellow; True Positive (TP): red, False Positive (FP): blue

RESULTS

The performance of the support vector classifier suggests that linear separation using one dimension, corresponding to one feature, is not sufficient for the prediction of catalytic residues.
SMO is the best-performing algorithm (among those tested) for the prediction of catalytic residues.
Reducing the number of attributes increases the prediction accuracy of the SMO algorithm.
8 out of 24 attributes are selected as relevant for the prediction of catalytic residues.
The SMO algorithm trained on the dataset represented by the selected attributes has:
Prediction Accuracy: > 86%
TP Rate: 0.898
FP Rate: 0.126

Figure panels: Model; PRELIMINARY ATTRIBUTE SET; FINAL ATTRIBUTE SET; BEST-PERFORMING CLASSIFIER – ‘SMO’; WRAPPER ATTRIBUTE SELECTION ALGORITHM

REFERENCES

GenBank database statistics
PDB database statistics
Bartlett G. J., Porter C. T., Borkakoti N., Thornton J. M., Analysis of Catalytic Residues in Enzyme Active Sites. J. Mol. Biol., 324: , 2002
Campbell S. J., Gold N. D., Jackson R. M., Westhead D. R., Ligand binding: functional site location, similarity and docking. Current Opinion in Structural Biology, 13: , 2003
Sjolander K., Karplus K., Brown M., Hughey R., Krogh A., Mian S., Haussler D., Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology, 1996
Smith D. K., Radivojac P., Obradovic Z., Dunker A. K., Zhu G., Improved amino acid flexibility parameters. Protein Science, 12: , 2003

ACKNOWLEDGEMENTS

This work would not have been complete without the wise help and guidance provided by our colleagues at PIR:
Hongzhan Huang, Ph.D. (PIR: Team Lead, Bioinformatics and Research Assistant Professor)
Sona Vasudevan, Ph.D. (PIR: Senior Bioinformatics Scientist)
C.R. Vinayaka, Ph.D. (PIR: Senior Research Scientist)
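For readers unfamiliar with the TP rate, FP rate and accuracy quoted in RESULTS, the sketch below shows the standard confusion-matrix definitions, using the poster's {+1 / -1} class labels. The label vectors are hypothetical toy values, not the poster's data, and the function names are illustrative only. Both rates are fractions between 0 and 1.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, FP, TN, FN), treating +1 (catalytic) as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == +1 and truth == +1:
            tp += 1
        elif pred == +1 and truth == -1:
            fp += 1
        elif pred == -1 and truth == -1:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """TP rate = TP/(TP+FN), FP rate = FP/(FP+TN), accuracy = (TP+TN)/total."""
    total = tp + fp + tn + fn
    return {
        "tp_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "fp_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }

# Hypothetical toy labels: 4 catalytic and 6 non-catalytic residues.
y_true = [+1, +1, +1, +1, -1, -1, -1, -1, -1, -1]
y_pred = [+1, +1, +1, -1, +1, -1, -1, -1, -1, -1]

counts = confusion_counts(y_true, y_pred)
metrics = rates(*counts)
print(counts, metrics)
```

Because catalytic residues are a small minority of all residues in a protein, accuracy alone can look high for a classifier that rarely predicts +1, which is why the TP and FP rates are worth reading alongside it.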