Support Vector Machine (SVM)

Presentation transcript:

A Machine Learning based Approach for Disulfide Bond Prediction
Avdesh Mishra, Md Tamjidul Hoque
Email: {amishra2, thoque}@uno.edu
Department of Computer Science, University of New Orleans, New Orleans, LA, USA

Motivation

Accurate prediction of disulfide bonds can help improve the accuracy of ab initio protein structure prediction (aiPSP), because the bonds impose geometrical constraints on the protein backbone, which greatly reduces the search space. We are motivated to apply the results of disulfide bond prediction to improve the accuracy of our existing ab initio protein structure prediction method, 3DIGARS-PSP.

Results

Table 2: Name and definition of the performance measures.

| Performance measure | Definition |
|---|---|
| Recall/Sensitivity (%) | $\frac{TP}{TP+FN}$ |
| Specificity (%) | $\frac{TN}{TN+FP}$ |
| False Positive Rate | $\frac{FP}{FP+TN}$ |
| False Negative Rate | $\frac{FN}{FN+TP}$ |
| Precision (%) | $\frac{TP}{TP+FP}$ |
| F-measure | $\frac{2TP}{2TP+FP+FN}$ |
| MCC | $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FN)(TP+FP)(TN+FP)(TN+FN)}}$ |
| Accuracy Balanced (%) | $\frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)$ |
| Accuracy Overall (%) | $\frac{TP+TN}{TP+FP+TN+FN}$ |

Table 3: Performance of individual cysteine bonding prediction obtained by the SVM based machine learning method.

| Performance measure | Support Vector Machine (SVM) |
|---|---|
| Recall/Sensitivity (%) | 93.15 |
| Specificity (%) | 60.70 |
| False Positive Rate | 0.393 |
| False Negative Rate | 0.068 |
| Precision (%) | 83.9 |
| F-measure | 0.883 |
| MCC | 0.587 |
| Accuracy Balanced (%) | 76.93 |
| Accuracy Overall (%) | 83.0 |

Introduction

Disulfide bonds are covalent bonds formed during post-translational modification by the oxidation of a pair of cysteines.
These bonds between cysteines are among the major forces responsible for stabilizing protein conformations and for post-translational modification, and they play an important role in ab initio protein structure prediction (aiPSP) and protein folding.

In this study, we established a machine learning based method for disulfide bond prediction using a support vector machine (SVM). For effective training, various useful features are extracted:
- Conservation profile
- Solvent accessibility
- Torsion angle flexibility
- Disorder probability
- Sequential distance between cysteines, etc.

The process of disulfide bond prediction is carried out in two stages:
- First, individual cysteines are predicted as either bonding or non-bonding.
- Second, the cysteine pairs are predicted as either bonding or non-bonding. This stage includes the results from individual cysteine bonding as a feature.

The comparison of our method with the state-of-the-art methods shows that the proposed method attains higher prediction accuracy.

Figure 1: The best window size, 33, obtained for individual cysteine prediction through 10-fold cross-validation over the dataset of 2303 proteins consisting of 25488 cysteine residues.

Table 4: Comparison of the performance of SVM on the balanced and imbalanced datasets.

| Performance measure | SVM, balanced set | SVM, imbalanced set |
|---|---|---|
| Recall/Sensitivity (%) | 88.76 | 49.52 |
| Specificity (%) | 73.93 | 97.47 |
| False Positive Rate | 0.2607 | 0.0253 |
| False Negative Rate | 0.1124 | 0.5048 |
| Precision (%) | 77.29 | 79.63 |
| F-measure | 0.8263 | 0.6106 |
| MCC | 0.6338 | 0.5745 |
| Accuracy Balanced (%) | 81.34 | 73.49 |
| Accuracy Overall (%) | | 89.47 |

Methods

Training Data Sets

We collected a dataset of protein sequences containing disulfide bonds, established previously by Shen et al. This dataset was filtered to remove inconsistencies. Furthermore, a dataset of 4120 FASTA sequences containing at least one disulfide bond was collected from the UniProt database.
The FASTA sequences from the two sources mentioned above were combined, and only the sequences with < 25% sequence similarity were selected as the final set for this study. The final dataset consisted of 2303 non-redundant proteins. Next, we created two datasets:
- A set with a balanced number of bonding and non-bonding cysteines
- A set with bonding and non-bonding cysteines in a ratio of 1:5

Feature Construction

The residues of the primary protein sequence are encoded by the 59 features of the residue, chemical, conservation, structural, flexibility and energy profiles. For individual cysteine bond prediction, these 59 features are used. For cysteine pair prediction, we used a total of 61 features: the 59 features used for individual cysteine prediction and 2 additional features, the sequence distance between the cysteines and the individual cysteine bonding probability. Next, feature windowing is applied to include the features of neighboring residues. After feature windowing, the absolute values of the sum and the difference of the features are used to train the machine learning method.

Machine Learning Method: Support Vector Machine (SVM)

SVM is a machine learning method that classifies by maximizing the margin of the separating hyperplane between two classes, and it penalizes the instances on the wrong side of the decision boundary using a cost parameter, C. SVM supports several kernel functions, among which we used the radial basis function (RBF) kernel. The RBF kernel has a "gamma" parameter that is inversely related to the width (standard deviation) of the Gaussian used as the similarity measure between two points; in the common formulation, $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$ with $\gamma = 1/(2\sigma^2)$. The RBF kernel parameter gamma and the cost parameter C are optimized for best accuracy using a grid search.

Figure 2: The best window size, 1, obtained for cysteine pair prediction on the balanced dataset, excluding individual CYS probability features, through 10-fold cross-validation over the dataset of 2303 proteins consisting of 17574 cysteine pairs.
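The feature windowing, the absolute sum and difference encoding of a cysteine pair, and the RBF similarity described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, zero-padding at the sequence ends is an assumption, and the sum and difference are assumed to be taken element-wise over the two windowed cysteine vectors.

```python
import math
from typing import List, Sequence

Vector = List[float]


def window_features(residues: Sequence[Sequence[float]], center: int, window: int) -> Vector:
    """Concatenate the feature vectors of `window` residues centered on `center`.

    Positions outside the sequence are filled with zeros (an assumption;
    the poster does not say how sequence ends are handled).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    n_features = len(residues[0])
    half = window // 2
    out: Vector = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(residues):
            out.extend(residues[pos])
        else:
            out.extend([0.0] * n_features)
    return out


def pair_features(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Encode a cysteine pair as |a + b| followed by |a - b|, element-wise.

    Both parts are symmetric in a and b, so the encoding does not depend
    on the order in which the two cysteines are given.
    """
    sums = [abs(x + y) for x, y in zip(a, b)]
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return sums + diffs


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF similarity exp(-gamma * ||x - y||^2) between two feature vectors."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

For example, with a toy sequence of four residues and two features per residue, `window_features(residues, 0, 3)` pads one position on the left, and `pair_features` applied to the windowed vectors of two cysteines gives a vector twice as long as each input. A larger gamma makes the kernel value fall off faster with distance, which is why gamma is tuned together with C.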
Figure 3: The best window size, 5, obtained for cysteine pair prediction on the imbalanced dataset, excluding individual CYS probability features, through 10-fold cross-validation over the dataset of 2303 proteins consisting of 17574 cysteine pairs.

Figure 4: The best window size, 1, obtained for cysteine pair prediction on the balanced dataset, including individual CYS probability features, through 10-fold cross-validation over the dataset of 2303 proteins consisting of 17574 cysteine pairs.

Figure 5: The best window size, 5, obtained for cysteine pair prediction on the imbalanced dataset, including individual CYS probability features, through 10-fold cross-validation over the dataset of 2303 proteins consisting of 17574 cysteine pairs.

Table 5: Comparison of the performance of SVM on the balanced and imbalanced datasets.

| Performance measure | SVM, balanced set | SVM, imbalanced set |
|---|---|---|
| Recall/Sensitivity (%) | 88.67 | 53.01 |
| Specificity (%) | 80.11 | 97.48 |
| False Positive Rate | 0.1989 | 0.0252 |
| False Negative Rate | 0.1135 | 0.4699 |
| Precision (%) | 81.67 | 80.81 |
| F-measure | 0.8503 | 0.6402 |
| MCC | 0.6903 | 0.603 |
| Accuracy Balanced (%) | 84.39 | 75.25 |
| Accuracy Overall (%) | | 90.07 |

Feature profiles extracted from the protein sequence (example: ⋯GSMYQLQFINLVYDT⋯):
- Residue Profile: amino acid type and terminal indicator (2 features)
- Chemical Profile: polarity score, secondary structure score, molecular volume score, codon diversity score and electrostatic charge score (5 features)
- Conservation Profile: PSSM scores, monogram and bigram (41 features)
- Structural Profile: secondary structure probability and accessible surface area (7 features)
- Flexibility Profile: phi angle fluctuation, psi angle fluctuation and disorder probability (3 features)
- Energy Profile: position specific estimated energy score (1 feature)
- Distance Profile: sequential distance between cysteines

Comparative Study Based on Features

Table 1: Performance of the existing nearest neighbor algorithm (NNA) depending on the features employed to train the model.

| Performance measure | Features used in NNA | Features proposed in this study |
|---|---|---|
| Sensitivity | 45.24 | 58.08 |
| Specificity | 87.34 | 89.90 |
| Balanced Accuracy | 66.29 | 73.99 |
| Overall Accuracy (% improvement) | 79.38 | 83.88 (5.68%) |

The accuracies in Table 1 were obtained using the jackknife validation approach on the dataset established previously by Shen et al., after filtering the samples. Obsolete proteins, as well as samples that did not contain cysteine residues, were discarded from further consideration.

Comparative Study Based on ML-Methods

Figure 6: Comparison of the proposed SVM based method with the existing NNA based method in terms of sensitivity, specificity, balanced accuracy and overall accuracy. The figure shows that the proposed method attains an overall accuracy of 90.07%, which is 13.48% better than the NNA based method.

Conclusions

We propose an accurate predictor that incorporates novel structural, flexibility and energy features and uses an optimized machine learning method, SVM. The improved predictor can be used to:
- Annotate sequences whose structures are unknown
- Aid experimental studies of disulfide bonds and structure determination
- Improve the prediction accuracy of ab initio protein structure prediction
- Improve the accuracy of fold recognition

Altogether, the proposed predictor achieves an overall improvement of 13.48% in comparison to the state-of-the-art approaches.

Discussions

Prediction of disulfide bonds plays a crucial role in ab initio protein structure prediction and protein folding. Improved prediction of disulfide bonds can help improve the accuracy of ab initio protein structure prediction, since the bonds impose geometrical constraints on the protein backbone and thus greatly reduce the search space. We propose disulfide bond prediction from protein sequence. We introduce several novel features: structural profile, flexibility profile, energy profile, etc. We optimized the C and gamma parameters of the SVM for improved accuracy. Two-stage prediction, in which individual cysteine bonding prediction is followed by cysteine pair bonding prediction, helped improve the accuracy of cysteine pair prediction when the individual cysteine bonding probabilities were used as features. Our motivation is to apply the results of disulfide bond prediction to improve the accuracy of our existing ab initio protein structure prediction method, 3DIGARS-PSP.

Acknowledgements

We gratefully acknowledge the Louisiana Board of Regents through the Board of Regents Support Fund, LEQSF (2016-19)-RD-B-07.

References

Niu, S., Huang, T., Feng, K.Y., He, Z., Cui, W. Inter- and intra-chain disulfide bond prediction based on optimal feature selection. Protein Pept Lett. 2013; 20: 324–35.
Mis, A., Hoque, T. Next Generation Evolutionary Sampling and Energy Function Guided ab initio Protein Structure Prediction. Biophysical Journal. DOI: https://doi.org/10.1016/j.bpj.2016.11.335
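The performance measures named in Table 2 can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name and dictionary keys are hypothetical, every denominator is assumed to be non-zero, and balanced accuracy is taken as the mean of sensitivity and specificity, which is consistent with the values reported in Table 3 ((93.15 + 60.70) / 2 ≈ 76.93).

```python
import math
from typing import Dict


def performance_measures(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Return the Table 2 measures as fractions (multiply by 100 for %).

    Assumes every denominator is non-zero, i.e. both classes occur in the
    data and the classifier predicts both classes at least once.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc_denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "precision": tp / (tp + fp),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        # Mean of the per-class recalls; insensitive to class imbalance.
        "balanced_accuracy": 0.5 * (sensitivity + specificity),
        # Fraction of all instances classified correctly.
        "overall_accuracy": (tp + tn) / (tp + fp + tn + fn),
    }
```

On a balanced set the two accuracies stay close to each other, whereas on the 1:5 imbalanced set the overall accuracy is dominated by the majority (non-bonding) class. This is why the imbalanced columns of Tables 4 and 5 show a high overall accuracy alongside a much lower balanced accuracy.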