Identification of amino acid residues in protein-protein interaction interfaces using machine learning and a comparative analysis of the generalized sequence- and structure-based features employed
Importance of protein-protein interactions (PPIs)
Crucial for understanding biological pathways, such as cell signalling
PPI dysfunction may lead to disease
Important targets for therapy
Angshuman Bagchi –
Aim of the Present Research
To extract features of PPIs from known protein-protein hetero-complex structures and use them to predict PPIs with machine learning tools
To build machine learning classifiers (Support Vector Machine and Random Forest) from the training dataset
To set up an online server that predicts PPI residues from protein sequence and structural information
To build a web service plug-in for UCSF Chimera to visualize the PPI residues
Overview of Support Vector Machine (SVM)
Support vector machines (SVMs) are a set of related supervised learning methods from statistics and computer science that analyze data and recognize patterns, and are used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
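The widest-gap idea can be illustrated with a minimal sketch of a linear SVM trained by stochastic subgradient descent on the hinge loss. This is not the algorithm LibSVM uses internally; the function names, the toy points, and the values of `lam`, `eta` and `epochs` are assumptions chosen only for illustration.

```python
import random

def train_linear_svm(points, labels, lam=0.01, eta=0.01, epochs=500, seed=0):
    """Fit w, b by subgradient descent on hinge loss plus an L2 penalty.

    labels must be +1 or -1. lam, eta and epochs are illustrative values.
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    b = 0.0
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = points[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # The L2 penalty shrinks the weights, which favours a wide gap.
            w = [wj * (1.0 - eta * lam) for wj in w]
            # Examples inside the margin (or misclassified) pull the boundary.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, x):
    """Assign a new example to a category by the side of the boundary it is on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy two-category data (hypothetical, not from the study).
POINTS = [(2, 2), (3, 3), (2, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3), (-3, -2)]
LABELS = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(POINTS, LABELS)
print(predict(w, b, (2.5, 2.5)), predict(w, b, (-2.5, -2.5)))
```

An RBF kernel SVM replaces the dot product with a kernel function so that the separating gap can be non-linear in the original feature space.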
Overview of Random Forest (RF)
A Random Forest (RF) is an ensemble classifier that consists of many decision trees. Given a set of training examples, it grows many randomized decision trees. The output of the forest is the class that receives the most votes from the individual trees. RF can estimate the importance of each variable, and it handles missing data efficiently.
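The voting scheme can be sketched with a toy forest. To keep the example short, each "tree" here is a single-split stump fitted on a bootstrap sample with one randomly chosen feature, which is a simplification of the full decision trees a real RF grows. All names and data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Choose the midpoint threshold and orientation with the fewest errors."""
    values = sorted({r[feature] for r in rows})
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    best = None
    for threshold in thresholds:
        for above in (0, 1):  # class predicted when the value exceeds the threshold
            errors = sum(
                (above if r[feature] > threshold else 1 - above) != y
                for r, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    return feature, best[1], best[2]

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(n_features)
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest

def predict_forest(forest, row):
    """Return the class with the most votes across all stumps."""
    votes = Counter(
        above if row[feature] > threshold else 1 - above
        for feature, threshold, above in forest
    )
    return votes.most_common(1)[0][0]

# Toy data: two features, two well-separated classes (hypothetical).
ROWS = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.3, 0.25),
        (0.8, 0.9), (0.9, 0.7), (0.85, 0.8), (0.7, 0.95)]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(ROWS, LABELS)
print(predict_forest(forest, (0.2, 0.2)), predict_forest(forest, (0.8, 0.8)))
```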
Assumptions employed
Surface residue: an amino acid whose accessible surface area (ASA) is > 15% of its total area
Interface residue: a surface residue with at least one heavy atom located within 5 Å of any heavy atom of its interacting partner
Dataset: 274 high-resolution X-ray hetero-complex structure files with interface residues (+ve) and non-interface surface residues (-ve) (Jo-Lan et al., Proteins, 2006)
Features
Sequence based: obtained from sequence conservation using PSI-BLAST
Structure based (secondary structure, charge, solvent accessibility, B-factor, etc.): obtained using S-BLEST (Mooney et al., Proteins, 2005), DSSP (Kabsch & Sander, Biopolymers, 1983), and PDB files
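The two definitions above can be written as a short labelling routine. This is a minimal sketch that assumes heavy-atom coordinates and relative ASA values have already been extracted (for example from PDB and DSSP output); the data layout, function names, and residue identifiers are hypothetical.

```python
import math

def surface_residues(relative_asa, cutoff=0.15):
    """Residues whose ASA exceeds 15% of their total area."""
    return {res for res, fraction in relative_asa.items() if fraction > cutoff}

def interface_residues(chain_atoms, partner_atoms, relative_asa,
                       asa_cutoff=0.15, distance_cutoff=5.0):
    """Surface residues with any heavy atom within 5 Å of a partner heavy atom.

    chain_atoms / partner_atoms: dict of residue id -> list of (x, y, z)
    heavy-atom coordinates in Å. relative_asa: dict of residue id -> fraction.
    """
    surface = surface_residues(relative_asa, asa_cutoff)
    partner = [xyz for atoms in partner_atoms.values() for xyz in atoms]
    found = []
    for res, atoms in chain_atoms.items():
        if res not in surface:
            continue  # buried residues are never labelled as interface
        if any(math.dist(a, p) <= distance_cutoff for a in atoms for p in partner):
            found.append(res)
    return sorted(found)

# Toy coordinates (hypothetical): A10 is exposed and close to the partner,
# A11 is exposed but far away, A12 is close but buried.
chain_a = {"A10": [(0, 0, 0), (1, 0, 0)], "A11": [(10, 0, 0)], "A12": [(3, 0, 0)]}
chain_b = {"B5": [(4, 0, 0)], "B6": [(30, 0, 0)]}
rel_asa = {"A10": 0.40, "A11": 0.60, "A12": 0.05}

print(interface_residues(chain_a, chain_b, rel_asa))
```

The all-pairs distance check is quadratic in the number of atoms; for full complexes a spatial index would be a natural replacement.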
Development of PPI predictor
The dataset was divided into two categories, sequence based and structure based, each with an equal number of PPI (positive) and non-PPI (negative) examples. This balanced dataset was used for training.
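The slides do not say how the equal class sizes were obtained. One common way, shown here purely as an assumption, is to randomly undersample the larger class; the function name and toy data are hypothetical.

```python
import random

def balance_by_undersampling(examples, labels, seed=0):
    """Keep an equal number of positive (1) and negative (0) examples.

    Assumption: balance is reached by randomly discarding examples from the
    larger class. The source does not state the procedure actually used.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(positives), len(negatives))
    keep = sorted(rng.sample(positives, n) + rng.sample(negatives, n))
    return [examples[i] for i in keep], [labels[i] for i in keep]

# Toy data: 3 interface residues and 7 non-interface residues.
examples = list(range(10))
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
balanced_x, balanced_y = balance_by_undersampling(examples, labels)
print(balanced_y.count(1), balanced_y.count(0))
```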
Development of PPI predictor (continued)
The RF package in R and the LibSVM package were used to implement separate RF and SVM predictors on each of the aforementioned datasets, with 10-fold cross-validation. Two SVM predictors, one with a linear kernel and the other with a Radial Basis Function (RBF) kernel, were created from each dataset. Throughout the experiments, the default values of the regularization parameter (C) and of γ were used for the linear and RBF kernel SVMs. For RF, we generated 1000 trees, keeping the other parameters at their default values.
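The 10-fold cross-validation procedure can be sketched as an index splitter: the examples are shuffled once, cut into ten folds, and each fold is held out in turn while the model is trained on the other nine. This is an illustration of the general technique, not the code used in the study; the function name is hypothetical.

```python
import random

def k_fold_indices(n_examples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Performance figures are then averaged over the ten held-out folds, so every example is used for testing exactly once.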
Best features ranked on the basis of their AUC

Description (in rank order)                                    AUC
B-factor                                                       0.91
PSSM                                                           0.85
Frequency of Lys residues in a 20 amino acid sequence window   0.83
Solvent accessibility                                          0.80
Number of neighboring charged residues (Arg, Asp, Glu, Lys)    0.78
Acidic residue                                                 0.75
Atomic charge                                                  0.71
Hydrophobicity                                                 0.70

AUC: area under the Receiver Operating Characteristic (ROC) curve
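A per-feature AUC of this kind can be computed directly from its probabilistic meaning: the chance that a randomly chosen positive example has a higher feature value than a randomly chosen negative one, with ties counted as half. The sketch below uses that definition; the feature names and values are hypothetical and unrelated to the figures in the table.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

def rank_features_by_auc(feature_columns, labels):
    """Sort features from highest to lowest single-feature AUC."""
    scored = {name: roc_auc(values, labels) for name, values in feature_columns.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Toy feature columns (hypothetical) for four residues.
labels = [1, 0, 1, 0]
columns = {
    "feature_a": [0.9, 0.8, 0.4, 0.3],
    "feature_b": [0.9, 0.2, 0.8, 0.1],
}
ranking = rank_features_by_auc(columns, labels)
print(ranking)
```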
Machine learning results
Dataset used: sequence (interface residues as positives; all non-interface surface and core residues as negatives)
Table: Method (SVM linear, SVM RBF, RF) against Accuracy (%), Sensitivity (%), Specificity (%), and AUC
TPR = True Positive Rate, FPR = False Positive Rate
Machine learning results (continued)
Dataset used: structure (interface residues as positives; non-interface surface residues as negatives)
Table: Method (SVM linear, SVM RBF, RF) against Accuracy (%), Sensitivity (%), Specificity (%), and AUC
Dataset used: sequence (interface residues as positives; non-interface surface residues as negatives)
Table: Method (SVM linear, SVM RBF, RF) against Accuracy (%), Sensitivity (%), Specificity (%), and AUC
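The reported measures follow from the confusion matrix of predicted against actual labels. The sketch below shows the standard definitions, with sensitivity equal to the true positive rate and the false positive rate equal to 100 minus specificity; the function names and toy predictions are hypothetical.

```python
def confusion_counts(predicted, actual):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return tp, fp, tn, fn

def summarize(predicted, actual):
    """Accuracy, sensitivity (TPR), specificity and FPR, all in percent."""
    tp, fp, tn, fn = confusion_counts(predicted, actual)
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": 100.0 - specificity,
    }

# Toy predictions for six residues (1 = interface, 0 = non-interface).
predicted = [1, 1, 0, 0, 1, 0]
actual = [1, 0, 0, 0, 1, 1]
metrics = summarize(predicted, actual)
print(metrics)
```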
Case Study
Top-scoring amino acid residues from the crystal structure of the antibody N10-staphylococcal nuclease complex (PDB ID: 1NSN). The backbone of the antibody N10 is shown in black, whereas the staphylococcal nuclease is shown as a cyan surface. The top-scoring amino acid residues are highlighted.
Conclusion
We have developed and evaluated several classification models (RF, SVM-linear and SVM-RBF) for identifying PPI interfaces using either a combination of sequence- and structure-based features or sequence-based features alone.
The wider application of our classifier could have important consequences for the prediction, prognosis and treatment of inherited disease states brought about by disruption of PPI sites.
Since we have developed a sequence-only predictor of PPI interfaces, researchers can use our method to get a quick idea of the probable function of proteins for which no structures are available.
Finally, we have constructed a web resource that can be used to predict PPI sites using either sequence alone, or structure and sequence together. This resource can be found at
