Identification of amino acid residues in protein-protein interaction interfaces using machine learning and a comparative analysis of the generalized sequence-

Presentation transcript:

Identification of amino acid residues in protein-protein interaction interfaces using machine learning and a comparative analysis of the generalized sequence- and structure-based features employed
Angshuman Bagchi, Ph.D.
Assistant Professor of Biochemistry, Department of Biochemistry and Biophysics, University of Kalyani
Formerly postdoctoral fellow in Buck Institute, Stanford University, California, USA
Purdue University, Indianapolis, USA

Importance of protein-protein interactions (PPIs)
- Crucial for understanding biological pathways such as cell signalling
- PPI dysfunctions may lead to disease
- Important targets for therapy

Aim of the Present Research
- To extract features of PPIs from known protein-protein hetero-complex structures and use them to predict PPIs with machine learning tools
- To build machine learning classifiers (Support Vector Machine and Random Forest) using the training dataset
- To set up an online server that predicts PPI residues from protein sequence and structural information
- To build a web service plug-in for UCSF Chimera to visualize the PPI residues

Overview of Support Vector Machine (SVM)
A support vector machine (SVM) is a set of related supervised learning methods from statistics and computer science that analyze data and recognize patterns; SVMs are used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall on.
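The "side of the gap" idea can be made concrete for the linear case. The sketch below is illustrative only: it assumes a separating hyperplane (weights `w`, bias `b`) has already been found, and the numbers are hand-picked for a toy 2-D problem, not taken from the presentation. It shows how a new point is classified and how the width of the gap, 2/||w||, follows from the weights.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the two margin hyperplanes w.x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane for a toy 2-D problem (assumed, not learned here).
w, b = [1.0, 1.0], -3.0

print(predict(w, b, [3.0, 2.0]))   # a point on the positive side
print(predict(w, b, [0.5, 1.0]))   # a point on the negative side
print(margin_width(w))             # width of the gap for this w
```

Training an SVM amounts to choosing `w` and `b` so that `margin_width(w)` is as large as possible while the training examples stay on their correct sides.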

Overview of Random Forest (RF)
A Random Forest (RF) is an ensemble classifier that consists of many decision trees. Given a set of training examples, it generates random decision trees. The output of the forest is the class that receives the most votes from the individual trees. RF can estimate the importance of the variables. It also handles missing data efficiently.
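The voting step can be sketched in a few lines. This is a minimal illustration, not the randomForest implementation: the three "trees" are hypothetical one-split rules, and their feature names and thresholds are made up purely to demonstrate majority voting.

```python
from collections import Counter

def forest_predict(trees, x):
    """Return the class that receives the most votes across all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical one-split trees; thresholds are invented for illustration.
trees = [
    lambda x: "interface" if x["rel_asa"] > 0.30 else "non-interface",
    lambda x: "interface" if x["b_factor"] < 25.0 else "non-interface",
    lambda x: "interface" if x["charged_neighbors"] >= 2 else "non-interface",
]

# A made-up residue described by three made-up feature values.
residue = {"rel_asa": 0.45, "b_factor": 30.0, "charged_neighbors": 3}
print(forest_predict(trees, residue))
```

In a real forest each tree is grown on a bootstrap sample of the training data with a random subset of features considered at each split; only the final aggregation is shown here.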

Assumptions employed
- Surface residue: an amino acid with accessible surface area (ASA) > 15% of its total area
- Interface residue: a surface residue with at least one heavy atom located within 5 Å of any heavy atom of its interacting partner
- Dataset: 274 high-resolution X-ray hetero-complex structure files with interface residues (+ve) and non-interface surface residues (-ve) (Jo-Lan et al., Proteins, 2006)
- Features
  - Sequence based: obtained from sequence conservation using PSI-BLAST
  - Structure based (secondary structure, charge, solvent accessibility, B-factor, etc.): obtained using S-BLEST (Mooney et al., Proteins, 2005), DSSP (Kabsch & Sander, Biopolymers, 1983), and PDB files
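The two residue definitions translate directly into code. The sketch below assumes that ASA values and heavy-atom coordinates have already been extracted (for example from DSSP output and PDB files); the function names and the treatment of "within 5 Å" as "less than or equal to" are assumptions made for illustration.

```python
import math

SURFACE_ASA_FRACTION = 0.15  # ASA > 15% of total area, as stated on the slide
INTERFACE_CUTOFF = 5.0       # distance cutoff in angstroms, as stated on the slide

def is_surface(asa, total_area):
    """Surface residue: accessible surface area exceeds 15% of the total area."""
    return asa / total_area > SURFACE_ASA_FRACTION

def near_partner(residue_atoms, partner_atoms, cutoff=INTERFACE_CUTOFF):
    """True if any heavy atom of the residue is within cutoff of any partner heavy atom."""
    return any(
        math.dist(a, p) <= cutoff
        for a in residue_atoms
        for p in partner_atoms
    )

def label_residue(asa, total_area, residue_atoms, partner_atoms):
    """Return 'interface', 'surface' (non-interface) or 'core'."""
    if not is_surface(asa, total_area):
        return "core"
    if near_partner(residue_atoms, partner_atoms):
        return "interface"
    return "surface"

# Toy coordinates (x, y, z) in angstroms; values are invented.
residue_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
partner_atoms = [(4.0, 3.0, 0.0)]
print(label_residue(60.0, 200.0, residue_atoms, partner_atoms))
```

The all-pairs distance check is quadratic in the number of atoms; for whole complexes a spatial index would be the usual optimisation.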

Development of PPI predictor
The dataset was divided into two categories, sequence based and structure based, each with an equal number of PPI (positive) and non-PPI (negative) examples. This balanced dataset was used for training.
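The presentation does not say how the equal class sizes were obtained. One common approach, shown here only as an assumed illustration, is random undersampling of the larger class; the function name and the 0/1 label encoding are hypothetical.

```python
import random

def balance_by_undersampling(examples, labels, seed=0):
    """Keep an equal number of positives (1) and negatives (0) by random undersampling."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    keep = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(keep)  # avoid all positives appearing before all negatives
    return [examples[i] for i in keep], [labels[i] for i in keep]

# Toy data: 3 interface residues and 10 non-interface residues.
examples = list(range(13))
labels = [1] * 3 + [0] * 10
balanced_x, balanced_y = balance_by_undersampling(examples, labels)
print(balanced_y.count(1), balanced_y.count(0))
```

Balancing keeps a classifier from scoring well simply by predicting the majority class, at the cost of discarding some negative examples.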

Development of PPI predictor (continued)
The RF package in R and the LibSVM package were used to implement separate RF and SVM predictors on each of the aforementioned datasets, with 10-fold cross-validation. Two SVM predictors, one with a linear kernel and one with a Radial Basis Function (RBF) kernel, were created from each dataset. Throughout the experiments, the default values of the regularization parameter (C) and of γ were used for the linear and RBF kernel SVMs. For RF, we generated 1000 trees, keeping the other parameters at their default values.
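To make the two kernels and the cross-validation scheme concrete, here is a small standalone sketch. It does not call LibSVM; it only writes out the kernel formulas and a simple fold assignment. The choice γ = 1/(number of features) mirrors what LibSVM documents as its default, but treat that, and the unshuffled fold assignment, as assumptions of this sketch.

```python
import math

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; sample i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test

x, z = [0.0, 0.0], [1.0, 1.0]
gamma = 1.0 / len(x)  # assumed default: 1 / number of features
print(linear_kernel([1.0, 2.0], [3.0, 4.0]))
print(rbf_kernel(x, z, gamma))

folds = list(k_fold_indices(25, k=10))
print(len(folds))
```

Each sample is held out exactly once, so the reported performance is averaged over predictions on data the model did not see during training. In practice the samples are usually shuffled before folds are assigned.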

Best features ranked on the basis of their AUC
Rank | Description | AUC
1 | B-factor | 0.91
2 | PSSM | 0.85
3 | Frequency of Lys residues in a 20 amino acid sequence window | 0.83
4 | Solvent accessibility | 0.80
5 | Number of neighboring charged residues (Arg, Asp, Glu, Lys) | 0.78
6 | Acidic residue | 0.75
7 | Atomic charge | 0.71
8 | Hydrophobicity | 0.70
AUC: area under the Receiver Operating Characteristic (ROC) curve
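A per-feature AUC of this kind can be computed by treating the raw feature value as a score. The sketch below uses the pairwise (Mann-Whitney) definition of AUC, the probability that a randomly chosen positive has a higher value than a randomly chosen negative. The feature names and values are placeholders, and whether the authors computed their ranking this way is not stated in the presentation.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def rank_features(feature_values, labels):
    """Return (feature name, AUC) pairs sorted from highest to lowest AUC."""
    results = []
    for name, values in feature_values.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        results.append((name, auc(pos, neg)))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Placeholder data: four residues, two hypothetical features.
labels = [1, 1, 0, 0]
feature_values = {
    "feature_b": [0.6, 0.2, 0.5, 0.1],
    "feature_a": [0.9, 0.8, 0.2, 0.1],
}
for name, value in rank_features(feature_values, labels):
    print(name, value)
```

An AUC of 0.5 corresponds to a feature with no discriminating power. A feature whose low values indicate the positive class would score below 0.5 under this definition, so in practice one may rank by max(AUC, 1 - AUC).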

Machine learning results
Table columns: Method, Accuracy (%), Sensitivity (%), Specificity (%), AUC
Methods compared: SVM linear, SVM RBF, RF
TPR = True Positive Rate, FPR = False Positive Rate
The dataset used is sequence (interface residues as positives and all non-interface surface and core residues as negatives)

Machine learning results (continued)
Two tables, each with columns Method, Accuracy (%), Sensitivity (%), Specificity (%), AUC and rows SVM linear, SVM RBF, RF:
- The dataset used is structure (interface residues as positives and non-interface surface residues as negatives)
- The dataset used is sequence (interface residues as positives and non-interface surface residues as negatives)
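For reference, the accuracy, sensitivity and specificity columns follow from the confusion-matrix counts as sketched below. The labels are toy values chosen only to exercise the formulas; they are not results from the presentation.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity (TPR) and specificity (1 - FPR) as percentages."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": 100.0 * tp / (tp + fn),  # true positive rate
        "specificity": 100.0 * tn / (tn + fp),  # 100 minus false positive rate
    }

# Toy labels: 1 = interface residue, 0 = non-interface residue.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Sensitivity is the TPR and specificity is 100% minus the FPR, which ties these columns to the ROC curve from which the AUC is computed.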

Case Study
Top-scoring amino acid residues from the crystal structure of the antibody N10-staphylococcal nuclease complex (PDB ID: 1NSN). The backbone of antibody N10 is shown in black, whereas the staphylococcal nuclease is shown as a cyan surface. The top-scoring amino acid residues are highlighted.

Conclusion
- We have developed and evaluated several classification models (RF, SVM-linear and SVM-RBF) for identifying PPI interfaces, using either a combination of sequence- and structure-based features or sequence-based features alone.
- The wider application of our classifier could have important consequences for the prediction, prognosis and treatment of inherited disease states brought about by disruption of PPI sites.
- Because we have developed a sequence-only predictor of PPI interfaces, researchers can use our method to get a quick idea of the probable function of a protein for which no structure is available.
- Finally, we have constructed a web resource that can be used to predict PPI sites from sequence alone or from structure and sequence together. This resource can be found at

Acknowledgement