Prediction of catalytic residues in proteins using machine learning techniques

Natalia V. Petrova and Cathy H. Wu
Protein Information Resource, Georgetown University, Washington, DC
Contact:

INTRODUCTION

The growing gap between experimentally characterized and uncharacterized proteins necessitates the development of new computational methods for functional prediction. Although computational methods to predict catalytic residues/active sites are rapidly developing, their accuracy remains low ( %), with a significant number of false positives. We present a novel method for the prediction of catalytic sites, using a machine learning approach, and analyze the results in a case study of a large, evolutionarily diverse group of proteins.

METHODS & RESULTS

We used a dataset of 79 enzymes with experimentally identified catalytic sites from the CATRES database as our benchmarking dataset [Table 1] for the initial analysis. In the 10-fold cross-validation analysis, the best result was achieved with SMO [Figure 1], a support vector machine algorithm that builds an optimal hyperplane in the multidimensional space of attributes in order to achieve maximal separation of the positively and negatively labeled samples. The Scorecons conservation score is a key attribute in the prediction, as shown by the performance of the SMO algorithm using individual attributes [Figure 2]. Seven out of 24 attributes were chosen by the Wrapper Subset Selection algorithm as an optimal subset of attributes for the SMO algorithm, and no further reduction of the set is possible [Table 2].

Figure 2. Predictive accuracy of the SMO algorithm using each attribute separately (Accuracy = 50%, guessing; 100%, all correct).

ACKNOWLEDGEMENTS: This work would not have been complete without the wise help and guidance that was provided by our colleagues at PIR: W. C. Barker, H. Huang, A. Nikolskaya, S. Vasudevan, and C. R. Vinayaka.

CONCLUSIONS

1. The prediction accuracy of our method is > 86% [Table 2].
2. An additional analytical step correctly identified the catalytic triad of hydrolases and reduced false positives to 1.06%.
3. The method can be used to identify candidate catalytic residues for proteins with known structure but unknown function.

CASE STUDY: hydrolases (Further Optional Analytical Step)

The prediction capabilities of our method were tested on a diverse superfamily of hydrolytic enzymes with hydrolase fold and different catalytic functions (Pfam domain PF00561). All enzymes have a catalytic triad and conserved structural features [Figure 3A]. Even though the algorithm predicted a large number of false positives for each individual protein [Figure 3C], further improvement can be achieved by merging the results for a group of related proteins. For 16 out of 17 enzymes, the method correctly predicted all 3 residues of the triad, with an average of 3 false positives (1.06%) out of 282 residues. For one protein, 1cv2, the method missed a substituted catalytic residue of the triad [Figure 3]. Two out of 3 false positive residues (His and Gly) are believed to be important for enzymatic activity, while all three of them are essential for protein structural stability.
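The merging step described in the case study can be illustrated with a minimal sketch. The poster does not state the exact merging rule, so the consensus vote over alignment columns below is an assumption, as are the function names, the `min_fraction` parameter, the input format, and the toy data. The second helper only reproduces the arithmetic behind the reported false positive rate (3 out of 282 residues).

```python
from collections import Counter


def consensus_columns(predictions, min_fraction=0.9):
    """Return alignment columns predicted as catalytic in at least
    min_fraction of the proteins.

    predictions: dict mapping a protein ID to the set of alignment
    column indices predicted as catalytic (hypothetical input format;
    the voting rule itself is an assumption, not the poster's method).
    """
    if not predictions:
        return set()
    # Count in how many proteins each alignment column was predicted.
    counts = Counter(col for cols in predictions.values() for col in cols)
    threshold = min_fraction * len(predictions)
    return {col for col, n in counts.items() if n >= threshold}


def false_positive_percent(n_false_positives, n_residues):
    """Percentage of residues in a protein that are false positives."""
    return 100.0 * n_false_positives / n_residues


# Toy example with made-up protein IDs and column indices
# (illustrative only, not data from the poster).
toy = {
    "protA": {10, 42, 77, 130},
    "protB": {10, 42, 77, 95},
    "protC": {10, 42, 77},
}
print(sorted(consensus_columns(toy, min_fraction=1.0)))
print(round(false_positive_percent(3, 282), 2))
```

In this toy run, columns predicted in only one protein drop out of the consensus, which mirrors the idea that isolated false positives are unlikely to recur at the same aligned position across related proteins.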
Table 1. Benchmarking Dataset

# proteins                79
# residues                23,664
# catalytic residues      254
PIR                       100% curated
PDB                       100% X-ray crystallography
SCOP
  all α                   10.1%
  all β                   10.1%
  (label missing)         30.4%
  (label missing)         48.1%
  small proteins          1.3%
EC number
  oxidoreductases         20.3%
  transferases            26.6%
  hydrolases              27.8%
  lyases                  17.7%
  isomerases              2.5%
  ligases                 5.1%

Table 2. Final Attribute Set

Algorithm Performance Measurements: MCC = 0, guessing; 1, all correct.

Figure 1. Performance of each algorithm measured by MCC.

Figure 3 (panels A, B, C). Prediction Results for 9 Curated Protein Families of Hydrolases. Legend: LID; Aligned Regions; Not Aligned Regions, length (FP); Catalytic Triad, True Positives (TP); Consensus of False Positives (FP); False Negative (FN). Residue labels: Asp/Asn, His, Gly, Sep/Asp, His, Asp, Glu.

CITATION: Petrova NV, Wu CH: Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties. BMC Bioinformatics, 7:312,
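The MCC and accuracy scales quoted with Figures 1 and 2 can be made concrete with a short sketch using the standard definitions of both measures. The function names and the example counts are illustrative assumptions; the 50% "guessing" baseline for accuracy holds for a balanced two-class set, which is assumed here.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up counts on a balanced set of 100 samples (not poster data).
perfect = dict(tp=50, tn=50, fp=0, fn=0)      # all correct
guessing = dict(tp=25, tn=25, fp=25, fn=25)   # chance-level predictions

print(accuracy(**perfect), mcc(**perfect))
print(accuracy(**guessing), mcc(**guessing))
```

On these counts the "all correct" case sits at the top of both scales and the chance-level case at the bottom, matching the legends: accuracy runs from 50% to 100%, MCC from 0 to 1.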