Avdesh Mishra, Manisha Panta, Md Tamjidul Hoque, Joel Atallah

The Prediction of the Hierarchical Classification of Transposable Elements using a Machine Learning Approach

Emails: amishra2@uno.edu, mpanta1@uno.edu, thoque@uno.edu, jattallah@uno.edu
Department of Computer Science & Department of Biological Sciences, University of New Orleans, New Orleans, LA, USA

Introduction

Transposable Elements (TEs), or jumping genes, are DNA sequences that have an intrinsic capability to move within a host genome from one genomic location to another. The new location can be on the same chromosome or on a different one. Since Barbara McClintock's discovery of transposable elements in maize in 1948, numerous research efforts have addressed the identification and classification of TEs and their effects on the genome. These studies show that TEs play a role in genome function and evolution, as their presence can modify the functionality of genes and increase the size of the genome. Proper classification of the jumping genes identified in a genome is therefore important for understanding their particular role in germline and somatic evolution.

The classification of TEs is usually based on the mode of transposition, the number and type of genes they contain, and sequence similarity. For the hierarchical classification of TEs, the unified hierarchical classification system proposed by Wicker et al. has been popular. In this system, the classes are organized in a tree structure whose top level consists of Class I (retrotransposons) and Class II (DNA transposons). Class I is further divided into five orders (LTR, DIRS, PLE, LINE, SINE) and Class II into Subclass 1 and Subclass 2. Each order is divided into several superfamilies. In this work, we studied publicly available hierarchical datasets.
We developed a machine-learning-based method that improves the prediction of the hierarchical classification of transposable elements using support vector machines (SVMs). To build an effective classifier, we used k-mers as features, a common practice in bioinformatics. Our major contribution is identifying an appropriate advanced machine learning method for predicting the hierarchical classes of TEs. Furthermore, we performed a comparative study of different machine learning methods on the datasets, including a comparison of the proposed SVM with the existing neural-network-based method. The comparative results indicate that the proposed method outperforms the state-of-the-art method (for example, on PGSB with nLLCPN, hF is 0.874 for SVM versus 0.838 for ANN; see Table II).

Motivation

TEs play an important role in modifying the functionality of genes. Hence, proper classification of the identified jumping genes (TEs) in a genome is important for understanding their particular role in germline and somatic evolution. The existing machine learning method for hierarchical classification of transposable elements does not achieve a satisfactory f-measure (the balanced mean of precision and recall).

Results

Table I

                     PGSB            Repbase
Fasta Sequences      18680           34561
Features             336             336
Classes Per Level    2 / 4 / 3 / 5   2 / 5 / 12 / 9

Table II

PGSB - nLLCPN
      SVM           ANN           GBC           ExtraTree     Random Forest   LogReg
hP    88.21%        82.13%        86.75%        76.03%        76.98%          76%
hR    86.51%        85.51%        86.25%        78.94%        79.55%          78.89%
hF    0.873518029   0.837699065   0.864972486   0.774524643   0.782458818     0.774172489

PGSB - LCPNB
      SVM           ANN           GBC           ExtraTree     Random Forest   LogReg
hP    87.34%        82.93%        86.11%        84.50%        84.12%          83.55%
hR    86.10%        83.44%        86.45%        85%           84.69%          84.21%
hF    0.867151847   0.831846433   0.862758219   0.847494297   0.844037783     0.838769007

Methods

Fasta sequence extraction: The DNA sequences come from two public repositories of repetitive DNA sequences, the Repbase and PGSB (Plant Genome and Systems Biology) repeat element databases.

Feature Encoding: K-mers are often used as features in bioinformatics. Here, the frequency counts of substrings are used as features.
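The k-mer feature encoding can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the four-letter DNA alphabet A, C, G, T, overlapping windows, raw counts without normalization, and that windows containing any other symbol (such as N) are skipped. The function names `kmer_vocabulary` and `kmer_features` are hypothetical. With k = 2, 3, 4, as used in this work, the vocabulary has 4^2 + 4^3 + 4^4 = 336 entries, which matches the feature count in Table I.

```python
from itertools import product

# Assumption: standard DNA alphabet; the poster does not state how
# ambiguous symbols (e.g. N) are handled, so they are skipped here.
ALPHABET = "ACGT"


def kmer_vocabulary(ks=(2, 3, 4)):
    """List every possible k-mer over ALPHABET for each k, in a fixed order."""
    return ["".join(p) for k in ks for p in product(ALPHABET, repeat=k)]


def kmer_features(sequence, ks=(2, 3, 4)):
    """Count overlapping k-mers in `sequence`; return counts in vocabulary order."""
    vocab = kmer_vocabulary(ks)
    counts = dict.fromkeys(vocab, 0)
    seq = sequence.upper()
    for k in ks:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:  # skips windows containing non-ACGT symbols
                counts[kmer] += 1
    return [counts[kmer] for kmer in vocab]


if __name__ == "__main__":
    features = kmer_features("ACGTACGTNA")
    print(len(features))  # one feature per k-mer in the vocabulary
```

Keeping the vocabulary order fixed means every sequence maps to a vector of the same length and layout, which is what a classifier such as an SVM expects. Dividing each count by the number of windows of that size would give relative frequencies instead of raw counts.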
For each TE (collected from the two public repositories, Repbase and PGSB), all k-mers of sizes k = 2, 3, 4 are extracted, so the total number of features is 336 (4^2 + 4^3 + 4^4 = 16 + 64 + 256). The dataset is then organized hierarchically, with a class label per level for each TE. The datasets were taken from previously completed, publicly available work.

Hierarchical classification methods: The hierarchical classification methods follow the local approach. Two top-down strategies for the hierarchical classification of TEs are used: non-Leaf Local Classifier per Parent Node (nLLCPN) and Local Classifier per Parent Node and Branch (LCPNB). nLLCPN attaches a multi-class classifier to each internal node of the hierarchy, which learns to distinguish among that node's subclasses, and it allows classification to end at a non-leaf node. LCPNB allows possible mistakes made at a higher level to be corrected, because the final classification is the path to a leaf node with the highest average probability.

Application of Machine Learning Methods: Different machine learning methods are applied in order to determine the best approach for the prediction. The following classifiers are used: Support Vector Machines, Gradient Boosting Classifier, Neural Networks, Extra Tree, Random Forest, LogReg, and Bagging. A 3-fold cross-validation strategy is used, and the average hF (hierarchical f-measure, the balanced mean of hierarchical precision and recall) over the 3 iterations is reported.

Table and Figure Legends

Table I: Dataset statistics. PGSB is a public repository of plant repeat sequences. Repbase is a public repository of repetitive DNA sequences from different eukaryotic species.
Table II: Comparative results of different machine learning approaches on the PGSB hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.
Fig. 1: Hierarchical f-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the PGSB dataset.
Fig. 2: Hierarchical f-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the Repbase dataset.
Fig. 3: Comparison of hierarchical precision (hP) and hierarchical recall (hR) between ANN and the proposed SVM for the two classification methods (nLLCPN and LCPNB) on the PGSB dataset.
Fig. 4: Comparison of hierarchical precision (hP) and hierarchical recall (hR) between ANN and the proposed SVM for the two classification methods (nLLCPN and LCPNB) on the Repbase dataset.

Discussion

Table I gives the total number of instances and the number of features extracted as k-mer frequencies. The PGSB dataset contains fewer TEs than the Repbase dataset. Table II presents the hP (hierarchical precision), hR (hierarchical recall), and hF (hierarchical f-measure) obtained by different machine learning methods on the PGSB and Repbase datasets for the two hierarchical classification approaches. The state-of-the-art method used an artificial neural network (ANN) for the hierarchical classification of TEs. We analyzed the performance of six classifiers: SVM, GBC, Random Forest, ExtraTree, ANN, and LogReg. For the PGSB dataset (Fig. 1 and Fig. 3) and the Repbase dataset (Fig. 2 and Fig. 4), the figures show the higher precision, recall, and f-measure obtained by our proposed SVM under both classification methods (nLLCPN and LCPNB). Our proposed SVM approach, with parameters optimized for both classification methods, produced better balanced accuracy.
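The hierarchical measures used in this discussion can be sketched as follows. The poster describes hF only as the balanced mean of hierarchical precision and recall, so this sketch assumes the commonly used set-based definition: each label is expanded to itself plus all of its ancestors, and the overlap between predicted and true sets is pooled over all instances. The function names and the toy hierarchy are hypothetical. The helper `f_measure` reproduces the hF values in Table II from the reported hP and hR up to rounding.

```python
def ancestors_inclusive(label, parent):
    """Return the label together with all of its ancestors in the hierarchy."""
    path = set()
    while label is not None:
        path.add(label)
        label = parent.get(label)
    return path


def f_measure(hp, hr):
    """Harmonic mean of hierarchical precision and recall."""
    return 2 * hp * hr / (hp + hr) if hp + hr else 0.0


def hierarchical_prf(true_labels, pred_labels, parent):
    """Pooled hierarchical precision, recall, and f-measure (assumed definition)."""
    overlap = pred_total = true_total = 0
    for t, p in zip(true_labels, pred_labels):
        t_set = ancestors_inclusive(t, parent)
        p_set = ancestors_inclusive(p, parent)
        overlap += len(t_set & p_set)
        pred_total += len(p_set)
        true_total += len(t_set)
    hp = overlap / pred_total
    hr = overlap / true_total
    return hp, hr, f_measure(hp, hr)


# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {
    "ClassI": None, "ClassII": None,
    "LTR": "ClassI", "LINE": "ClassI",
    "Copia": "LTR", "Gypsy": "LTR",
}

if __name__ == "__main__":
    # A Copia element predicted as Gypsy still earns credit for ClassI and LTR.
    print(hierarchical_prf(["Copia", "LINE"], ["Gypsy", "LINE"], PARENT))
    # hF for SVM with nLLCPN on PGSB, from the hP and hR reported in Table II.
    print(round(f_measure(0.8821, 0.8651), 4))
```

Because ancestors count toward the overlap, a prediction that is wrong only at the deepest level is penalized less than one that is wrong at the class level, which is why these measures suit top-down hierarchical classifiers.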
Conclusions and Future Work

Proper classification of Transposable Elements (TEs) is crucial for identifying their roles in the genome.
Machine learning enables rapid annotation of the likeliest class of a transposable element.
An advanced machine learning approach improves the prediction accuracy in the hierarchical classification of TEs.
Optimization of the cost and gamma parameters of a support vector machine (SVM) with a radial basis function (RBF) kernel leads to a better hierarchical classification of transposable elements.
In the future, we would like to explore different features, as well as other advanced machine learning techniques and hierarchical classification approaches.

Acknowledgements

We gratefully acknowledge the Louisiana Board of Regents through two Board of Regents Support Funds: LEQSF(2016-19)-RD-B-07 & LEQSF(2017-20)-RD-A-26. Start-up funds from the University of New Orleans to Joel Atallah also provided support for this project.

References

[1] Nakano, Felipe Kenji, et al. "Top-down strategies for hierarchical classification of transposable elements with neural networks." Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017.
[2] Wicker, Thomas, et al. "A unified classification system for eukaryotic transposable elements." Nature Reviews Genetics 8.12 (2007): 973.
[3] Melsted, Pall, and Jonathan K. Pritchard. "Efficient counting of k-mers in DNA sequences using a bloom filter." BMC Bioinformatics 12.1 (2011): 333.