StackDPPred: A Stacking based Prediction of DNA Binding Proteins from Sequences

Pujan Pokhrel, Avdesh Mishra, Md Tamjidul Hoque
{ppokhrel, amishra2, …}
Department of Computer Science, University of New Orleans, New Orleans, LA, USA
Introduction

- Identification of DNA-binding proteins is one of the most challenging problems in genome annotation.
- We develop a new stacking-based predictor for the prediction of DNA-binding proteins.
- The stack contains four learners (SVM, RF, Logistic Regression, and kNN) in the base layer and an SVM in the meta layer.
- Each base learner and the meta-learner are optimized for better predictive performance.
- On both the benchmark and the independent test dataset, the proposed stacking-based machine learning method significantly outperforms existing state-of-the-art approaches.

Figure 2: Comparison of ROC curves and AUC scores given by StackDPPred and PSSM-DT on the benchmark dataset. StackDPPred performs better than PSSM-DT, the best-performing existing predictor.

Machine Learning Methods

We tested the following machine learning methods in various configurations for this study.

i) Support Vector Machine (SVM): We used SVM with the radial basis function (RBF) kernel both as one of the base classifiers and as the meta-classifier. SVM classifies by maximizing the margin of the separating hyperplane between the two classes and penalizes instances on the wrong side of the decision boundary through a cost parameter, C. The best parameter values for the base-classifier SVM were found to be C = … and γ = …; likewise, the best values for the meta-classifier SVM are C = 211 and γ = ….

ii) Logistic Regression (LogReg): We used LogReg with L2 regularization as one of the base classifiers. LogReg models the relationship between the categorical dependent variable (in our case, whether a protein is DNA-binding or not) and one or more independent variables by estimating class probabilities with the logistic function. The parameter C, which controls the regularization strength, was optimized by grid search to achieve the best jackknife validation accuracy; in our implementation, C = 0.1 gives the best accuracy.

iii) Extra Trees (ET) Classifier: We explored extremely randomized trees (ET), an ensemble method, as a base learner. ET fits a number of randomized decision trees on the original learning sample and uses averaging to improve predictive accuracy and control over-fitting. We constructed the ET model with 1,000 trees, measuring the quality of a split by the Gini impurity index.

iv) Random Decision Forest (RDF) Classifier: We used RDF as one of the base classifiers. It operates by constructing a multitude of decision trees on various sub-samples of the dataset and outputs the mean prediction of the individual trees to improve predictive accuracy and control over-fitting. In our implementation of the RDF ensemble learner, we used bootstrap samples to construct 1,000 trees in the forest.

v) K Nearest Neighbor (KNN) Classifier: We used the KNN classifier as one of the base classifiers. KNN operates by learning from the K training samples closest in distance to the target point in the feature space; the classification decision is a majority vote of these neighbors. In this work, K is set to 9 and all neighbors are weighted uniformly.

vi) Bagging (BAG) Classifier: We explored bootstrap aggregation (BAG) as one of the base classifiers in this study. The BAG method covers a class of algorithms that build several instances of a classifier/estimator on random subsets of the original training set and then aggregate their individual predictions into a final prediction. In this study, the bagging classifier is fit on multiple subsets of the data, drawn with repetition, using 1,000 decision trees, and the outputs are combined by weighted averaging.
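As a rough sketch of the stacked architecture described above (SVM, RF, LogReg, and kNN in the base layer feeding an SVM meta-classifier), the configuration below uses scikit-learn's StackingClassifier as a stand-in for the authors' stacking procedure. Hyperparameters reported on the poster (C = 0.1 for LogReg, K = 9, 1,000 trees) are used where available; the SVM C and γ values shown are placeholders, not the tuned settings.

```python
# Minimal sketch of a StackDPPred-style stack: four base learners feed an
# RBF-kernel SVM meta-classifier via out-of-fold probability predictions.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

base_learners = [
    # RBF-kernel SVM; C and gamma are placeholders, not the poster's tuned values.
    ("svm", make_pipeline(StandardScaler(),
                          SVC(kernel="rbf", C=1.0, gamma="scale", probability=True))),
    ("rdf", RandomForestClassifier(n_estimators=1000, bootstrap=True)),  # 1,000 bootstrapped trees
    ("logreg", make_pipeline(StandardScaler(),
                             LogisticRegression(penalty="l2", C=0.1, max_iter=1000))),
    ("knn", KNeighborsClassifier(n_neighbors=9, weights="uniform")),     # K = 9, uniform weights
]

# SVM meta-classifier trained on the base learners' cross-validated predictions.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=SVC(kernel="rbf", probability=True),
    cv=5,                             # out-of-fold predictions for the meta layer
    stack_method="predict_proba",     # meta layer sees class probabilities
)

# Usage: stack.fit(X_train, y_train); y_pred = stack.predict(X_test)
```

Swapping the "rdf" entry for ExtraTreesClassifier or BaggingClassifier yields the SM2 and SM3 variants listed with Table 2 below.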
Feature Extraction

Figure 1: The StackDPPred framework.

1) Position Specific Scoring Matrix (PSSM) based features:

A. PSSM Distance Transformation (PSSM-DT): computed from the PSSM matrix; the distance parameter d is optimized to give the best performance.

B. Evolutionary Distance Transformation (EDT): computed from the PSSM matrix, where d ranges up to the length of the shortest protein in the dataset.

C. Residue Probing Transformation (RPT): 20 probes are applied to the PSSM matrix of each protein to obtain a 20×20 feature matrix.

2) Residue Wise Contact Energy Matrix based features:

A. Residue Wise Contact Energy Matrix Transformation (RCEMT): computed by summing the contact-potential values of the residues and dividing by the total sequence length.
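The poster does not reproduce the transformation equations, so the sketch below shows one common way a PSSM distance transformation of this kind can be computed: for each pair of amino-acid columns (i, j) and each separation d, it accumulates products of PSSM scores d residues apart. Treat the exact normalization and feature layout as assumptions, not the authors' formulation.

```python
import numpy as np

def pssm_distance_transform(pssm: np.ndarray, d_max: int) -> np.ndarray:
    """Sketch of a PSSM distance transformation (assumed form, not the poster's exact equation).

    pssm  : (L, 20) matrix of position-specific scores for a protein of length L.
    d_max : largest residue separation d considered (d is tuned in the poster).

    Returns a (d_max, 20, 20) array whose [d-1, i, j] entry couples amino-acid
    column i at position k with column j at position k + d, averaged over k.
    """
    length = pssm.shape[0]
    features = np.zeros((d_max, 20, 20))
    for d in range(1, d_max + 1):
        if d >= length:
            break  # protein too short for this separation
        # Sum_k pssm[k, i] * pssm[k + d, j], normalized by the number of pairs.
        features[d - 1] = pssm[:-d].T @ pssm[d:] / (length - d)
    return features

# Usage sketch: feature_vector = pssm_distance_transform(pssm, d_max=5).ravel()
```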
SM1: SVM + kNN + LogReg + RDF in the base layer and SVM in the meta layer.
SM2: SVM + kNN + LogReg + ET in the base layer and SVM in the meta layer.
SM3: SVM + kNN + LogReg + Bagging in the base layer and SVM in the meta layer.

Table 2: Different configurations of ensemble learning methods tested on the benchmark dataset.

Method | Sensitivity | Specificity | Fall-out rate | Miss rate | Balanced accuracy | Accuracy | Precision | F1 score | MCC
SM1 | 0.911 | 0.888 | 0.111 | 0.088 | 0.899 | 0.898 | … | … | 0.799
SM2 | … | 0.884 | 0.115 | 0.100 | 0.892 | 0.891 | 0.880 | 0.890 | 0.783
SM3 | 0.901 | 0.882 | 0.117 | 0.098 | … | … | 0.879 | … | …

Discussions

In this study, we constructed several stacked machine learning models and tested how well they perform the task of DNA-binding protein prediction from sequence information only. In terms of features, the stacking-based predictors achieved significant accuracy using only two kinds of information from protein sequences:
- Position Specific Scoring Matrix based features, and
- the Residue-wise Contact Energy Potential, which approximates the structural stability of proteins.
Since the information space is huge, we used various combinations of classifiers so that useful information can be combined effectively from the feature space. The similar performance of base learners built on similar ensemble-based learning methods (RDF, BAG, and ET) shows that they acquire similar kinds of information from the features to perform the classification.

Figure 3: Comparison of ROC curves and AUC scores given by StackDPPred and PSSM-DT on an independent test dataset. StackDPPred performs better than PSSM-DT, the highest-performing existing method.

Conclusions

We propose an improved predictor which combines useful features from:
- several PSSM feature transformations found most important for predicting protein function, and
- the Residue-wise Contact Energy Potential, which estimates protein stability.
The similar performance of the different stacking models allows us to draw two conclusions:
- when predictors give similar performance on a dataset, stacking can combine them based on how they work rather than on how they correlate with each other, and
- carefully selected ensemble-based methods significantly outperform traditional machine learning methods.

Results

Table 1: Various metrics used to evaluate the quality of the predictors.

Table 3: Comparison of StackDPPred with other state-of-the-art methods on the benchmark dataset through jackknife validation. Here, imp. stands for improvement.

Method | ACC | Sensitivity | Specificity | MCC | AUC
PSSM-DT (imp. %) | (8.20%) | (6.18%) | (10.73%) | (13.8%) | (1.58%)
iDNA-Prot (imp. %) | (28.8%) | (36.6%) | (20.9%) | (114.1%) | (6.58%)
DNA-Prot (imp. %) | (40.06%) | (32.3%) | (49.9%) | (206.8%) | (11.5%)
DNAbinder (imp. %) | (42.4%) | (95.7%) | (25.0%) | (240.9%) | (46.3%)
DNA-BIND (imp. %) | (27.8%) | (62.2%) | (17.21%) | … | (27.9%)
DBPPred (imp. %) | (12.6%) | (16.2%) | (8.7%) | (36.9%) | (12.2%)
StackDPPred | (26.7%) | (41.5%) | (22.1%) | (121.3%) | (11.7%)

Table 3 shows that StackDPPred significantly outperforms all the other predictors in terms of all the metrics employed.

Table 4: Comparison of StackDPPred with state-of-the-art methods on the independent dataset, PDB186. Here, imp. stands for improvement.

Method | ACC | Sensitivity | Specificity | MCC | AUC
PSSM-DT (imp. %) | (12.50%) | (11.24%) | (13.9%) | (28.5%) | (11.56%)
iDNA-Prot (imp. %) | (19.21%) | (8.72%) | (37.2%) | (59.8%) | (24.1%)
DNAbinder (imp. %) | (22.26%) | (37.08%) | (10.5%) | (70%) | (15.95%)
DNA-Prot (imp. %) | (24.0%) | (10.22%) | (48.6%) | (81.5%) | (19.8%)
StackDPPred | (19.5%) | (36.4%) | (27.6%) | (60.0%) | (17.8%)

Table 4 shows that StackDPPred beats all other machine learning methods on the independent dataset.

Acknowledgements

We gratefully acknowledge the Louisiana Board of Regents through the Board of Regents Support Fund, LEQSF (…)-RD-B-07.
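For reference, the evaluation metrics named in Table 1 follow their standard confusion-matrix definitions, sketched below; the improvement percentages in Tables 3 and 4 are assumed to take the usual relative form 100 × (StackDPPred − other) / other. This is a generic restatement of standard formulas, not code from the study.

```python
import math

def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics (as named in Table 1)."""
    sensitivity = tp / (tp + fn)            # true positive rate (recall)
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fallout_rate": fp / (fp + tn),     # 1 - specificity
        "miss_rate": fn / (fn + tp),        # 1 - sensitivity
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "f1_score": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        ),
    }

def improvement_pct(ours: float, theirs: float) -> float:
    """Relative improvement of one score over another, as assumed for Tables 3-4."""
    return 100.0 * (ours - theirs) / theirs
```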

