Reecha Khanal Mentor: Avdesh Mishra Supervisor: Dr. Md Tamjidul Hoque

Slides:



Advertisements
Similar presentations
Hidden Markov models for detecting remote protein homologies Kevin Karplus, Christian Barrett, Richard Hughey Georgia Hadjicharalambous.
Advertisements

Bioinformatics “Other techniques raise more questions than they answer. Bioinformatics is what answers the questions those techniques generate.” SheAvery
Structural bioinformatics
Training a Neural Network to Recognize Phage Major Capsid Proteins Author: Michael Arnoult, San Diego State University Mentors: Victor Seguritan, Anca.
Predicting Protein Interactions HERPES! Team Question Mark Jeff Brown Dante Kappotis Robert Vanderley Anthony Biasella.
Training a Neural Network to Recognize Phage Major Capsid Proteins Author: Michael Arnoult, San Diego State University Mentors: Victor Seguritan, Anca.
Predicting Matchability - CVPR 2014 Paper -
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Template-based Prediction of Protein 8-state Secondary Structures June 12 th 2013 Ashraf Yaseen and Yaohang Li DEPARTMENT OF COMPUTER SCIENCE OLD DOMINION.
Protein Tertiary Structure Prediction
Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization Shibiao WAN and Man-Wai MAK The Hong Kong Polytechnic University.
Overcoming the Curse of Dimensionality in a Statistical Geometry Based Computational Protein Mutagenesis Majid Masso Bioinformatics and Computational Biology.
1 Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine Chenghai Xue, Fei Li, Tao He,
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Frontiers in the Convergence of Bioscience and Information Technologies 2007 Seyed Koosha Golmohammadi, Lukasz Kurgan, Brendan Crowley, and Marek Reformat.
Associating Biomedical Terms: Case Study for Acetylation Aaron Buechlein Indiana University School of Informatics Advisor: Dr. Predrag Radivojac.
Meng-Han Yang September 9, 2009 A sequence-based hybrid predictor for identifying conformationally ambivalent regions in proteins.
Study of Protein Prediction Related Problems Ph.D. candidate Le-Yi WEI 1.
1 Web Site: Dr. G P S Raghava, Head Bioinformatics Centre Institute of Microbial Technology, Chandigarh, India Prediction.
Identification of amino acid residues in protein-protein interaction interfaces using machine learning and a comparative analysis of the generalized sequence-
Identification of Helix-Turn-Helix (HTH) DNA-Binding Motifs
1 Improve Protein Disorder Prediction Using Homology Instructor: Dr. Slobodan Vucetic Student: Kang Peng.
DDPIn Distance and Density Based Protein Indexing David Hoksza Charles University in Prague Department of Software Engineering Czech Republic.
LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino.
Linked Data Profiling Andrejs Abele National University of Ireland, Galway Supervisor: Paul Buitelaar.
Feature Extraction Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and.
COMP24111: Machine Learning Ensemble Models Gavin Brown
Typically, classifiers are trained based on local features of each site in the training set of protein sequences. Thus no global sequence information is.
Competition II: Springleaf Sha Li (Team leader) Xiaoyan Chong, Minglu Ma, Yue Wang CAMCOS Fall 2015 San Jose State University.
Combining Evolutionary Information Extracted From Frequency Profiles With Sequence-based Kernels For Protein Remote Homology Detection Name: ZhuFangzhi.
Final Report (30% final score) Bin Liu, PhD, Associate Professor.
We propose an accurate potential which combines useful features HP, HH and PP interactions among the amino acids Sequence based accessibility obtained.
Modeling Cell Proliferation Activity of Human Interleukin-3 (IL-3) Upon Single Residue Replacements Majid Masso Bioinformatics and Computational Biology.
Kelci J. Miclaus, PhD Advanced Analytics R&D Manager JMP Life Sciences
Experience Report: System Log Analysis for Anomaly Detection
Hyunghoon Cho, Bonnie Berger, Jian Peng  Cell Systems 
Can-CSC-GBE: Developing Cost-sensitive Classifier with Gentleboost Ensemble for breast cancer classification using protein amino acids and imbalanced data.
Source: Procedia Computer Science(2015)70:
COMP61011 : Machine Learning Ensemble Models
Three lessons I learned
Cost-Sensitive Learning
Feature Extraction Introduction Features Algorithms Methods
Avdesh Mishra, Manisha Panta, Md Tamjidul Hoque, Joel Atallah
Introduction Feature Extraction Discussions Conclusions Results
Brain Hemorrhage Detection and Classification Steps
Prediction of RNA Binding Protein Using Machine Learning Technique
CIKM Competition 2014 Second Place Solution
Extra Tree Classifier-WS3 Bagging Classifier-WS3
חיזוי ואפיון אתרי קישור של חלבון לדנ"א מתוך הרצף
Support Vector Machine (SVM)
Cost-Sensitive Learning
CIKM Competition 2014 Second Place Solution

KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Machine Learning to Predict Experimental Protein-Ligand Complexes
iSRD Spam Review Detection with Imbalanced Data Distributions
Biochemistry Vocabulary
Basic Local Alignment Search Tool
Somi Jacob and Christian Bach
謝孫源 (Sun-Yuan Hsieh) 成功大學 電機資訊學院 資訊工程系
Analysis for Predicting the Selling Price of Apartments Pratik Nikte
Lecture 10 – Introduction to Weka
Detection of Sand Boils using Machine Learning Approaches
Predicting Loan Defaults
Pooja Pun, Avdesh Mishra, Simon Lailvaux, Md Tamjidul Hoque
Hyunghoon Cho, Bonnie Berger, Jian Peng  Cell Systems 
Manisha Panta, Avdesh Mishra, Md Tamjidul Hoque, Joel Atallah
Results Motivation Introduction Methods Conclusions Acknowledgements
Outlines Introduction & Objectives Methodology & Workflow
Presenter: Donovan Orn
Presentation transcript:

Effective Prediction of RNA-Binding Proteins Using Machine Learning Techniques Reecha Khanal Mentor: Avdesh Mishra Supervisor: Dr. Md Tamjidul Hoque University of New Orleans Department of Computer Science 4/24/2019

Presentation Overview Brief Background Objective Dataset Collection Feature Extraction Methods Performance Evaluation Conclusions 4/24/2019

Brief Background Amino Acids: Organic Compounds with key elements of carbon(C), hydrogen(H), oxygen(O), and Nitrogen(N) 20 Standard Amino Acids Proteins: Sequence of Amino Acids 4/24/2019

Brief Background Amino Acids: Organic Compounds with key elements of carbon(C), hydrogen(H), oxygen(O), and Nitrogen(N) 20 Standard Amino Acids Proteins: Sequence of Amino Acids 4/24/2019

Objective To develop a computational approach for prediction of RNA Binding Proteins (RBP) from Non-RNA Binding Proteins (NRBP). 4/24/2019

Why RNA Binding Protein Prediction? RNA Binding Proteins are important in many biological functions: mRNA stability Stress response Cell cycle Tumor Differentiation RNA Binding Proteins have been linked to some critical diseases: Cancer Neuro-degenerative diseases Immunological disorders Muscular atrophies and Metabolic disorders 4/24/2019

NON-RBP Protein Chains Data Collection PISCES UniProt Database Sequence Identity : 25% 68084 RBPs X-ray resolution: 3Å Sequence Length: 50 – 10,000 amino acids 14389 NON-RBP Protein Chains 4/24/2019

Data Collection CD-HIT: Sequence Identity Cutoff >= 25% 14389 Protein Chains 68084 RBPs CD-HIT: Sequence Identity Cutoff >= 25% Protein Length: 50 – 10,000 amino acids 7077 NonRBPs 2770 RBPs 4/24/2019

Feature Extraction 4/24/2019

Composition, Transition, and Distribution (C-T-D) 4/24/2019

Feature Set 1 … 2602 2603 12.496 0.579 97.89 13.587 0.6786 96.65 . 8.78678 0.123123 56.87 7.78927 0.061447 60.761 4/24/2019

Incremental Feature Selection Removed 21 features 2582 features were selected out of 2603 features Incremental Feature Selection Removed 1257 features 1346 features were selected out of 2603 features Genetic Algorithm 4/24/2019

Machine Learning 4/24/2019

Machine Learning Approaches Comparison of various machine learning algorithms on the benchmark dataset through 10-fold CV. Metric/ Methods Bagging K Nearest Neighbours Logistic Regression Random Forest Extreme Gradient Boosting Classifier (XGBC) Extremely Randomized Trees (ET) Sensitivity (%) 82.18 57.54 82.00 72.24 89.09 67.44 Specificity (%) 96.84 89.17 96.39 98.47 97.48 98.58 Balanced Accuracy (%) 89.51 73.35 89.20 85.36 93.28 83.01 Accuracy (%) 92.68 80.19 92.31 91.03 95.10 89.75 False Positive Rate 0.032 0.108 0.036 0.015 0.025 0.014 False Negative Rate 0.178 0.425 0.180 0.278 0.109 0.326 Precision (%) 91.14 67.77 90.00 94.92 93.34 94.96 F1-score 0.866 0.622 0.858 0.820 0.912 0.789 MCC 0.816 0.492 0.807 0.775 0.878 0.742 4/24/2019

Stacking: 4/24/2019

Stacking Models SM1: RDF, XGBC, LogReg, KNN in base-level and XGBoost in meta- level, SM2: Bag, XGBC, LogReg, KNN in base-level and XGBoost in meta-level and SM3: ET, XGBC, LogReg, KNN in base-level and XGBoost in meta-level. 4/24/2019

Performance Comparison of Stacked Models Metric/Methods SM1 SM2 SM3 Sensitivity (%) 90.17 89.99 90.53 Specificity (%) 97.44 97.15 97.29 Balanced Accuracy (%) 93.80 93.57 93.91 Overall Accuracy (%) 95.38 95.12 False Positive Rate 0.026 0.028 0.027 False Negative Rate 0.098 0.100 0.095 Precision (%) 93.31 92.59 92.98 F1-score 0.917 0.912 MCC 0.885 0.879 4/24/2019

Stacking: 4/24/2019

FrameWork: 4/24/2019

Performance Comparison Metric/Methods RBPPred AIRBP (% imp.) Sensitivity (%) 83.07 90.17 (8.55%) Specificity (%) 96.00 97.44 (1.50%) Balanced Accuracy(%) - 93.80 (-) Overall Accuracy (%) 92.36 95.38 (3.26%) False Positive Rate 0.026 (-) False Negative Rate 0.098 (-) Precision (%) 89.00 93.31 (4.84%) F1-score 0.859 0.917 (6.75%) MCC 0.808 0.885 (9.53%) 4/24/2019

Performance Comparison on Test Dataset Methods Dataset Evaluation Metrics Sensitivity (%) Specificity (%) Balanced Accuracy (%) Overall Accuracy (%) FPR FNR Precision (%) F1-score MCC RBPPred Human 84.28 96.65 - 89.00 97.65 0.905 0.788 S. cerevisiae 86.16 94.59 87.73 96.52 0.910 0.729 A. thaliana 86.40 87.02 0.925 0.537 AIRBP (% imp.) 92.14 (9.32%) 94.52 (-2.21%) 93.33 (-) 93.04 (4.54%) 0.055 0.079 96.53 (-1.14%) 0.943 (4.19%) 0.855 (8.50%) 94.35 (9.51%) 84.33 (-10.85%) 89.34 91.60 (4.41%) 0.157 0.057 94.09 (-2.52%) 0.942 (3.52%) 0.789 (8.23%) 92.11 (6.61%) 86.11 (-8.97%) 89.11 91.67 (5.34%) 0.139 98.82 (4.28%) 0.953 (3.03%) 0.594 (10.61%)   (avg. % imp.) (8.48%) (-7.34%) (4.76) (0.21%) (3.58%) (9.11%)

4/24/2019