Published by Veronika Gunawan. Modified over 6 years ago.
Sequence and Structure based Protein Peptide Binding Residue Prediction
Suraj Gattani§, Avdesh Mishra§, Md Tamjidul Hoque*
§These authors contributed equally to this work. *Corresponding author.
{sggattan, amishra2,
Department of Computer Science, University of New Orleans, New Orleans, LA, USA

Introduction
Protein-peptide interactions are among the most important types of biological interactions and play a significant role in cellular signaling and many other cellular processes. These interactions have special properties such as promiscuity: the conformational flexibility of peptides allows multiple bindings. Protein-peptide complex structures can be predicted using docking techniques. We develop a computational method for predicting protein-peptide binding residues that combines sequence-based and structure-based features.

Explored features are:
- Position-specific scoring matrix (PSSM) profile
- Accessible surface area (ASA) and secondary structure probabilities (SS)
- Half-sphere exposure (HSE) and torsion angles (phi and psi)
- Close neighbor correlation coefficient (CNCC)
- Monogram and bigram
- Terminal indicator, sPSEE, and disorder probability

The best results were obtained using the Extra Tree Classifier.

Results
[Figures 1, 2, and 3: images and captions not preserved.]
Figure 4: Comparison of the Extra Tree Classifier, Gradient Boosting Classifier, and Logistic Regression methods on sensitivity, specificity, accuracy, and MCC for window size 3. The Extra Tree Classifier achieves the highest accuracy of the three methods.

Datasets
The protein-peptide dataset contains 261,797 residues in total, of which 14,187 are binding and the rest are non-binding. We built a balanced dataset by randomly selecting as many non-binding residues as there are binding residues. The final balanced dataset, used for 10-fold cross-validation, consists of 14,187 binding and 14,187 non-binding residues.
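The balancing step described for the dataset can be sketched as random undersampling of the majority class. This is a minimal illustration, not the authors' code: the function name `balance_by_undersampling`, the 0/1 label encoding, and the fixed seed are assumptions made for the example, and the toy label list only reproduces the class counts reported on the poster.

```python
import random

def balance_by_undersampling(labels, seed=0):
    """Return sorted indices of a class-balanced subset.

    labels: sequence of 0/1 values (assumed encoding: 1 = binding,
    0 = non-binding). Every binding residue is kept, and an equally
    sized random sample of non-binding residues is drawn without
    replacement. Assumes binding is the minority class.
    """
    binding = [i for i, y in enumerate(labels) if y == 1]
    non_binding = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)  # fixed seed only for reproducibility
    sampled = rng.sample(non_binding, len(binding))
    return sorted(binding + sampled)

# Toy labels with the same class counts as the poster's dataset.
N_TOTAL, N_BINDING = 261_797, 14_187
labels = [1] * N_BINDING + [0] * (N_TOTAL - N_BINDING)

balanced = balance_by_undersampling(labels)
print(len(balanced))  # 28374 = 14,187 binding + 14,187 non-binding
```

Undersampling discards most of the non-binding residues, so results on the balanced set may not reflect performance at the natural class ratio of the full dataset.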
Feature Extraction

PSSM scores
We obtained a 20-dimensional position-specific scoring matrix (PSSM) profile for each amino acid using the PSI-BLAST program.

CNCC - Close Neighbor Correlation Coefficient
The close neighbor correlation coefficient is calculated from the 20-dimensional PSSM scores:

$$\mathrm{CNCC} = \frac{\sum_{j} P_{i,j}\, P_{k,j}}{\sqrt{\sum_{j} P_{i,j}^{2}}\,\sqrt{\sum_{j} P_{k,j}^{2}}}$$

where k is the sliding window. (The normalization in the denominator is only partly legible in the source.)

Accessible Surface Area and Secondary Structure
These two features were obtained by running the Spider 3.0 tool. We also collected half-sphere exposure (HSE), i.e., HSE_up and HSE_down, as well as the phi and psi angles from Spider 3.0.

Dispredict_V2.0 Features
Monogram and bigram features were collected from the Dispredict_V2.0 software, which provided 21 features in total. We also collected 14 other features from Dispredict_V2.0, such as sPSEE, angle fluctuations, and 7 physicochemical properties.

Residue-wise Contact Energy Matrix
This is the contact potential energy among the 20 different amino acids. It provides 20 features per amino acid.

Methods
The performance of all machine learning methods was examined using 10-fold cross-validation on the balanced dataset. Feature ranking was carried out using the maximum relevance, minimum redundancy (mRMR) method. Out of 100 features, the 91 features with the highest mRMR scores were used for validating the proposed method.

Explored machine learning methods are:
- Gradient Boosting Classifier
- Bagging Classifier
- Logistic Regression
- Extra Tree Classifier
- Support Vector Machine

Table 1: Results for Extra Tree Classifier and Bagging Classifier for Window Size 3

Parameters     Extra Tree Classifier-WS3     Bagging Classifier-WS3
MCC            33.5                          33.2
Accuracy       66.7                          66.6
Sensitivity    65.2                          64.9
Specificity    68.2

Discussions
Among the methods compared, the Extra Tree Classifier and the Bagging Classifier with window size 3 gave the best accuracy and sensitivity. For all other methods, either the specificity or the sensitivity was better, but the overall accuracy was lower. Incremental feature selection was used to assess how much individual features contribute to prediction accuracy.

Conclusions
We have developed a method for protein-peptide binding prediction from sequence using various useful features. Feature windowing was found to improve the overall accuracy of the classifier. Evaluation shows that sensitivity and accuracy are comparable with the existing state-of-the-art technique.

Future Study
Optimization of the Support Vector Machine (SVM) with a radial basis function (RBF) kernel takes longer to converge, so the SVM simulations are still in progress. We would also like to run other promising machine learning methods to identify the best-performing one. We plan to add more features to improve the accuracy of our method.

Acknowledgements
We gratefully acknowledge the Louisiana Board of Regents through the Board of Regents Support Fund, LEQSF ( )-RD-B-07
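For readers who want to relate the four evaluation measures reported for window size 3 to one another, the sketch below computes sensitivity, specificity, accuracy, and MCC from confusion-matrix counts using their standard definitions. The function name `binary_metrics` and the counts are hypothetical: the counts assume 1,000 binding and 1,000 non-binding residues, chosen so that sensitivity and specificity equal the 65.2 and 68.2 reported for the Extra Tree Classifier. They are not the poster's actual confusion matrix.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion counts."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical balanced counts (1,000 residues per class).
m = binary_metrics(tp=652, fn=348, tn=682, fp=318)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the accuracy is 0.667, the mean of sensitivity and specificity, as expected on a balanced set, and the MCC is about 0.334. That is close to the reported 33.5 if the poster's values are read as percentages; the small gap may come from rounding or from averaging across cross-validation folds.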