Presentation transcript:

StackDPPred: A Stacking-based Prediction of DNA Binding Proteins from Sequences
Pujan Pokhrel, Avdesh Mishra, Md Tamjidul Hoque
Email: {ppokhrel, amishra2, thoque}@uno.edu
Department of Computer Science, University of New Orleans, New Orleans, LA, USA

Introduction

Identification of DNA binding proteins is one of the most challenging problems in genome annotation. We develop a new stacking-based predictor for DNA binding proteins that:
- contains four learners (SVM, RF, Logistic Regression and kNN) in the base layer,
- contains an SVM in the meta layer, and
- optimizes each base learner and the meta-learner for better predictive performance.
On both the benchmark and the independent test datasets, the proposed stacking-based machine learning method significantly outperforms existing state-of-the-art approaches.

Figure 2: Comparison of ROC curves and AUC scores given by StackDPPred and PSSM-DT on the benchmark dataset. StackDPPred performs better than PSSM-DT, the best-performing existing predictor.

Machine Learning Methods

We tested the following machine learning methods in various configurations for this study (a code sketch of the resulting stack follows this list).

i) Support Vector Machine (SVM): We used SVM with the radial basis function (RBF) kernel both as a base-classifier and as the meta-classifier. SVM classifies by maximizing the margin of the separating hyperplane between the two classes and penalizes instances on the wrong side of the decision boundary through a cost parameter, C. The best parameters for the base-classifier SVM were found to be C = 2^3.50 and γ = 2^-11.25; for the meta-classifier SVM, C = 2^11 and γ = 2^-16.25.

ii) Logistic Regression (LogReg): We used LogReg with L2 regularization as one of the base-classifiers. LogReg models the relationship between the categorical dependent variable (here, whether a protein is DNA-binding) and the independent variables by estimating class probabilities with the logistic function. The parameter C, which controls the regularization strength, was optimized by grid search for the best jackknife validation accuracy; in our implementation, C = 0.1 gave the best accuracy.

iii) Extra Trees (ET) Classifier: We explored extremely randomized trees (ET), an ensemble method, as a base-learner. ET fits a number of randomized decision trees on the original learning sample and averages their predictions to improve predictive accuracy and control over-fitting. We constructed the ET model with 1,000 trees, measuring the quality of a split by the Gini impurity index.

iv) Random Decision Forest (RDF) Classifier: We used RDF as one of the base-classifiers. It constructs a multitude of decision trees on sub-samples of the dataset and outputs the mean prediction of the trees to improve predictive accuracy and control over-fitting. In our implementation of the RDF ensemble learner, we used bootstrap samples to construct 1,000 trees in the forest.

v) K Nearest Neighbor (KNN) Classifier: We used the KNN classifier as one of the base-classifiers. KNN classifies a target point from the K training samples closest to it in the feature space, with the decision made by majority vote among the neighbors. In this work, K is set to 9 and all neighbors are weighted uniformly.

vi) Bagging (BAG) Classifier: We explored bootstrap aggregation (BAG) as one of the base-classifiers in this study. Bagging builds several instances of a classifier/estimator on random subsets of the original training set and then aggregates their individual predictions to form a final prediction. In this study, the bagging classifier is fit on multiple resampled subsets of the data using 1,000 decision trees, and the outputs are combined by weighted averaging.
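The stacking architecture described above maps naturally onto scikit-learn. The sketch below is only an illustration, not the authors' released code: the feature matrix X_train/X_test and labels y_train are assumed to already encode the PSSM- and contact-energy-based features described on this poster; scikit-learn's StackingClassifier, the StandardScaler wrapping, and the 5-fold scheme used to generate out-of-fold meta-features are our assumed choices (the poster itself reports jackknife validation). The hyperparameters are the ones quoted above.

```python
# Minimal sketch of an SM1-style stack: SVM + LogReg + RDF + kNN in the base layer,
# SVM in the meta layer. Hyperparameters are taken from the poster; everything else
# (scaling, CV folds, StackingClassifier itself) is an illustrative assumption.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

base_learners = [
    # Base-layer SVM with the quoted RBF parameters (C = 2^3.50, gamma = 2^-11.25).
    ("svm", make_pipeline(StandardScaler(),
                          SVC(kernel="rbf", C=2**3.50, gamma=2**-11.25))),
    # L2-regularized logistic regression with C = 0.1.
    ("logreg", make_pipeline(StandardScaler(),
                             LogisticRegression(penalty="l2", C=0.1, max_iter=1000))),
    # Random decision forest with 1,000 bootstrapped trees.
    ("rdf", RandomForestClassifier(n_estimators=1000, bootstrap=True,
                                   n_jobs=-1, random_state=0)),
    # k-nearest neighbours with K = 9 and uniform weighting.
    ("knn", make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=9, weights="uniform"))),
]

# Meta-layer SVM with the quoted parameters (C = 2^11, gamma = 2^-16.25).
meta_learner = SVC(kernel="rbf", C=2**11, gamma=2**-16.25, probability=True)

stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=meta_learner,
    cv=5,        # base learners' out-of-fold predictions feed the meta learner
    n_jobs=-1,
)

# stack.fit(X_train, y_train)
# y_pred = stack.predict(X_test)
```

Swapping the "rdf" entry for an ExtraTreesClassifier or a BaggingClassifier (both also in sklearn.ensemble) gives the SM2 and SM3 variants listed with Table 2 below, and the quoted C and γ values can be recovered with a grid search over powers of two, as described for LogReg above.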
Feature Extraction

Figure 1: The StackDPPred framework.

1) Position Specific Scoring Matrix (PSSM) based features:
- PSSM Distance Transformation (PSSM-DT): calculated from the PSSM matrix; the value of the distance parameter d is optimized to give the best performance.
- Evolutionary Distance Transformation (EDT): calculated from the PSSM matrix, where d is set to the length of the shortest protein sequence in the dataset.
- Residue Probing Transformation (RPT): 20 probes are applied to the PSSM matrix of each protein to obtain a 20 x 20 feature matrix.

2) Residue Wise Contact Energy Matrix based features:
- Residue Wise Contact Energy Matrix Transformation (RCEMT): calculated by summing the contact-potential values of the residues and dividing by the total sequence length.

An illustrative sketch of two of these transformations follows.
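The poster lists these transformations without their defining equations, so the sketch below only illustrates the general shape of two of them: a lagged PSSM co-occurrence feature in the spirit of PSSM-DT, and a per-residue-type contact-energy average in the spirit of RCEMT. The exact definitions, normalizations, the optimized distance d, and the RPT probes follow the StackDPPred paper and are not reproduced here; pssm (an L x 20 array) and contact_energy (a length-L vector of per-residue contact potentials) are assumed to be precomputed, e.g. from a sequence profile and a contact-potential table.

```python
# Hedged sketch of PSSM distance-transformation-style and residue-wise contact-energy
# features; intended only to show the general shape, not the paper's exact formulas.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residue types

def pssm_distance_features(pssm: np.ndarray, d: int) -> np.ndarray:
    """For every ordered pair of residue types (a1, a2), sum the products of PSSM
    scores at positions separated by d, normalized by the number of such pairs.
    pssm has one row per sequence position and one column per residue type."""
    L = pssm.shape[0]
    if L <= d:
        return np.zeros(20 * 20)
    pairs = pssm[:-d].T @ pssm[d:]        # 20 x 20 matrix of lagged co-occurrence scores
    return (pairs / (L - d)).ravel()

def rcem_features(sequence: str, contact_energy: np.ndarray) -> np.ndarray:
    """Residue-wise contact-energy summary: for each residue type, sum its
    per-residue contact potential over the sequence and divide by the length."""
    L = len(sequence)
    feats = np.zeros(len(AMINO_ACIDS))
    for idx, aa in enumerate(AMINO_ACIDS):
        mask = np.array([c == aa for c in sequence])
        feats[idx] = contact_energy[mask].sum() / L
    return feats
```

Concatenating such blocks (PSSM-DT, EDT, RPT, RCEMT) per protein yields the feature vector fed to the stack sketched under Machine Learning Methods.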
Stacking configurations tested:
- SM1: SVM + kNN + LogReg + RDF in the base layer and SVM in the meta layer.
- SM2: SVM + kNN + LogReg + ET in the base layer and SVM in the meta layer.
- SM3: SVM + kNN + LogReg + Bagging in the base layer and SVM in the meta layer.

Table 2: Different configurations of ensemble learning methods tested on the benchmark dataset. Values that did not survive the transcript are marked "—"; the column assignment of the surviving values is inferred from the metric definitions.

Method  Sensitivity  Specificity  Fall-out rate  Miss rate  Balanced accuracy  Accuracy  Precision  F1 score  MCC
SM1     0.911        0.888        0.111          0.088      0.899              0.898     —          —         0.799
SM2     —            0.884        0.115          0.100      0.892              0.891     0.880      0.890     0.783
SM3     0.901        0.882        0.117          0.098      0.879              —         —          —         —

Discussions

In this study, we constructed several stacked machine learning models and tested how well they perform DNA-binding protein prediction from sequence information alone. Using only two kinds of information extracted from protein sequences, the stacking-based predictors achieved significant accuracy:
- Position Specific Scoring Matrix based features, and
- the Residue Wise Contact Energy Potential, which approximates the structural stability of proteins.
Since the information space is large, we used various combinations of classifiers so that useful information could be combined effectively from the feature space. The similar performance of base learners built on related ensemble methods (RDF, BAG and ET) shows that they extract similar kinds of information from the features when performing the classification.

Figure 3: Comparison of ROC curves and AUC scores given by StackDPPred and PSSM-DT on the independent test dataset. StackDPPred performs better than PSSM-DT, the highest-performing existing method.

Conclusions

We propose an improved predictor which combines two kinds of useful features:
- several PSSM feature transformations found most important for predicting protein functions, and
- the Residue-wise Contact Energy Potential, which estimates protein stability.
The similar performance of the different stacking models allows us to draw two conclusions:
- when predictors give similar performance on a dataset, a stack can be assembled by combining predictors based on how they work rather than on how they correlate with one another, and
- carefully selected ensemble-based methods significantly outperform traditional machine learning methods.

Results

Table 1: Various metrics used to evaluate the quality of the predictors (a sketch of how these metrics and the "imp. %" figures are computed appears at the end of this transcript).

Table 3: Comparison of StackDPPred with other state-of-the-art methods on the benchmark dataset through jackknife validation. Here, imp. stands for improvement.

Method       ACC              Sensitivity      Specificity      MCC             AUC
PSSM-DT      0.7996 (12.50%)  0.8191 (11.24%)  0.7800 (13.9%)   0.6220 (28.5%)  0.8650 (11.56%)
iDNA-Prot    0.7540 (19.21%)  0.8381 (8.72%)   0.6473 (37.2%)   0.5000 (59.8%)  0.7610 (24.1%)
DNAbinder    0.7358 (22.26%)  0.6647 (37.08%)  0.8036 (10.5%)   0.4700 (70%)    0.8150 (15.95%)
DNA-Prot     0.7255 (24.0%)   0.8267 (10.22%)  0.5976 (48.6%)   0.4400 (81.5%)  0.7890 (19.8%)
StackDPPred  0.8996 (19.5%)   0.9112 (36.4%)   0.8880 (27.6%)   0.7990 (60.0%)  0.9449 (17.8%)

Table 3 shows that StackDPPred significantly outperforms all the other predictors in terms of all the metrics employed.

Table 4: Comparison of StackDPPred with state-of-the-art methods on the independent dataset, PDB186. Here, imp. stands for improvement.

Method       ACC              Sensitivity      Specificity      MCC              AUC
PSSM-DT      0.8000 (8.20%)   0.8709 (6.18%)   0.7283 (10.73%)  0.6470 (13.8%)   0.8740 (1.58%)
iDNA-Prot    0.6720 (28.8%)   0.6770 (36.6%)   0.6670 (20.9%)   0.3440 (114.1%)  0.8330 (6.58%)
DNA-Prot     0.6180 (40.06%)  0.6990 (32.3%)   0.5380 (49.9%)   0.2400 (206.8%)  0.7960 (11.5%)
DNAbinder    0.6080 (42.4%)   0.5700 (95.7%)   0.6450 (25.0%)   0.2160 (240.9%)  0.6070 (46.3%)
DNA-BIND     0.6770 (27.8%)   0.6670 (62.2%)   0.6880 (17.21%)  0.3550 (115.1%)  0.6940 (27.9%)
DBPPred      0.7690 (12.6%)   0.7960 (16.2%)   0.7420 (8.7%)    0.5380 (36.9%)   0.7910 (12.2%)
StackDPPred  0.8655 (26.7%)   0.9247 (41.5%)   0.8064 (22.1%)   0.7363 (121.3%)  0.8878 (11.7%)

Table 4 shows that StackDPPred beats all the other machine learning methods on the independent dataset.

Acknowledgements
We gratefully acknowledge the Louisiana Board of Regents through the Board of Regents Support Fund, LEQSF (2016-19)-RD-B-07.
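The evaluation metrics reported in Tables 2-4 follow their usual binary-classification definitions. As a reference only (this is not the authors' evaluation code), the sketch below computes them from confusion-matrix counts, together with the relative-improvement figure that the "imp. %" columns appear to report.

```python
# Standard binary-classification metrics used in Tables 2-4, computed from
# true/false positive/negative counts; illustrative only.
import math

def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    sens = tp / (tp + fn)                    # sensitivity (recall)
    spec = tn / (tn + fp)                    # specificity
    fallout = fp / (fp + tn)                 # fall-out rate = 1 - specificity
    miss = fn / (fn + tp)                    # miss rate = 1 - sensitivity
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy
    bal_acc = (sens + spec) / 2              # balanced accuracy
    prec = tp / (tp + fp)                    # precision
    f1 = 2 * prec * sens / (prec + sens)     # F1 score
    mcc = (tp * tn - fp * fn) / math.sqrt(   # Matthews correlation coefficient
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"sensitivity": sens, "specificity": spec, "fall-out rate": fallout,
            "miss rate": miss, "accuracy": acc, "balanced accuracy": bal_acc,
            "precision": prec, "F1": f1, "MCC": mcc}

def improvement_percent(stackdppred_value: float, other_value: float) -> float:
    """Relative improvement of StackDPPred over a competing method, in percent."""
    return 100.0 * (stackdppred_value - other_value) / other_value
```

For example, improvement_percent(0.8655, 0.8000) is about 8.2%, matching the ACC entry for PSSM-DT in Table 4.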