Published by Siska Johan. Modified over 6 years ago.
The Prediction of the Hierarchical Classification of Transposable Elements using a Machine Learning Approach
Avdesh Mishra, Manisha Panta, Md Tamjidul Hoque, Joel Atallah
Department of Computer Science & Department of Biological Sciences, University of New Orleans, New Orleans, LA, USA

Introduction
Transposable Elements (TEs), or jumping genes, are DNA sequences with an intrinsic capability to move from one genomic location to another within a host genome. The new location can be on the same chromosome or on a different one. Since Barbara McClintock's discovery of transposable elements in maize in 1948, numerous research efforts have addressed the identification and classification of TEs and their effects on the genome. These studies show that TEs play a role in genome function and evolution, as their presence can modify the functionality of genes and increase the size of the genome. Proper classification of the jumping genes identified in a genome is therefore important for understanding their particular role in germline and somatic evolution.

The classification of TEs is usually based on the mode of transposition, the number and type of genes they contain, and sequence similarity. For hierarchical classification, the unified system proposed by Wicker et al. has been popular. In this system the classes are organized in a tree: Class I (retrotransposons) and Class II (DNA transposons). Class I is further divided into five orders (LTR, DIRS, PLE, LINE, SINE) and Class II into Subclass 1 and Subclass 2. Each order is divided into several superfamilies.

In this work, we studied publicly available hierarchical datasets. We developed a machine learning based method that uses support vector machines (SVMs) to improve the prediction of the hierarchical classification of transposable elements.
To generate an effective classifier, we used k-mers as features, a common practice in bioinformatics. Our major contribution is identifying an appropriate advanced machine learning method for predicting the hierarchical classes of Transposable Elements. Furthermore, we performed a comparative study of different machine learning methods on the datasets and compared the proposed SVM with the existing methods based on Neural Networks. The comparative results indicate that the proposed method significantly outperforms the state-of-the-art methods.

Motivation
TEs play an important role in modifying the functionality of genes. Hence, proper classification of the identified jumping genes (TEs) in a genome is important for understanding their particular role in germline and somatic evolution. The existing machine learning method for the hierarchical classification of transposable elements does not achieve a satisfactory f-measure (the balanced mean of precision and recall).

Results
Table I
                     PGSB            Repbase
Fasta Sequences      18680           34561
Features             336             336
Classes Per Level    2 / 4 / 3 / 5   2 / 5 / 12 / 9

Table II
PGSB - nLLCPN
      SVM      ANN      GBC      ExtraTree   Random Forest   LogReg
hP    88.21%   82.13%   86.75%   76.03%      76.98%          76%
hR    86.51%   85.51%   86.25%   78.94%      79.55%          78.89%
hF
PGSB - LCPNB
      87.34%   82.93%   86.11%   84.50%      84.12%          83.55%
      86.10%   83.44%   86.45%   85%         84.69%          84.21%

Methods
Fasta sequence extraction: DNA sequences are available in public repositories of repetitive DNA sequences, namely the Repbase and PGSB (Plant Genome and Systems Biology) repeat element databases.
Feature Encoding: K-mers are often used as features in bioinformatics. Here, the frequency counts of substrings are used as features. For each TE (collected from the two public repositories, Repbase and PGSB), all k-mers of sizes k = 2, 3, 4 are extracted, giving 4^2 + 4^3 + 4^4 = 16 + 64 + 256 = 336 features in total. The dataset is then organized as a hierarchical dataset with classes per level for each TE.
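The feature encoding can be illustrated with a minimal sketch. It assumes overlapping windows, raw (unnormalized) counts, a single strand, and that any window containing a symbol other than A, C, G or T is skipped; the poster does not specify these details, and the function names `kmer_vocabulary` and `kmer_counts` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"
K_VALUES = (2, 3, 4)


def kmer_vocabulary(k_values=K_VALUES, alphabet=ALPHABET):
    """Return every possible k-mer for each k, in a fixed order."""
    return ["".join(p) for k in k_values for p in product(alphabet, repeat=k)]


def kmer_counts(sequence, k_values=K_VALUES, alphabet=ALPHABET):
    """Count overlapping k-mers and return them as a fixed-length vector.

    Assumption: windows containing symbols outside the alphabet
    (for example N) are ignored rather than counted.
    """
    vocab = kmer_vocabulary(k_values, alphabet)
    counts = dict.fromkeys(vocab, 0)
    seq = sequence.upper()
    for k in k_values:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return [counts[kmer] for kmer in vocab]


features = kmer_counts("ACGTACGTNNAC")
print(len(features))  # 336 = 16 + 64 + 256
```

Keeping the vocabulary in a fixed order means that every sequence maps to a vector of the same length and column meaning, which is what a classifier such as an SVM expects as input.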
The dataset was extracted from previously completed, publicly available work.
Hierarchical classification methods: The hierarchical classification methods follow the local approach. Two top-down strategies for the hierarchical classification of TEs are used: non-Leaf Local Classifier per Parent Node (nLLCPN) and Local Classifier per Parent Node and Branch (LCPNB). nLLCPN permits classification at non-leaf nodes: a multi-class classifier is attached to each internal node of the hierarchy and learns to distinguish among that node's subclasses. LCPNB allows possible mistakes made at a higher level to be corrected, because the final classification is given by the path to a leaf node with the highest average probability.
Application of Machine Learning Methods: Different machine learning methods are compared to determine the best approach for the prediction. The following classifiers are used: Support Vector Machines (SVM), Gradient Boosting Classifier (GBC), Neural Networks (ANN), Extra Tree, Random Forest, Logistic Regression (LogReg), and Bagging. A 3-fold cross-validation strategy is used, and the average hF (hierarchical f-measure, the balanced mean of hierarchical precision and recall) over the 3 iterations is reported.

Table and Figure Legends
Table I – Dataset statistics. PGSB is a public repository of plant repeat sequences. Repbase is a public repository of repetitive DNA sequences from different eukaryotic species.
Table II – Comparative results of different machine learning approaches on the PGSB hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.
Fig. 1 – Hierarchical f-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the PGSB dataset.
Fig. 2 – Hierarchical f-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the Repbase dataset.
Fig. 3 – Comparison of hierarchical precision (hP) and hierarchical recall (hR) between the ANN and the proposed SVM for the two classification methods (nLLCPN and LCPNB) on the PGSB dataset.
Fig. 4 – Comparison of hierarchical precision (hP) and hierarchical recall (hR) between the ANN and the proposed SVM for the two classification methods (nLLCPN and LCPNB) on the Repbase dataset.

Discussion
Table I reports the total number of instances and the number of features extracted as k-mer frequencies. The PGSB dataset contains fewer TEs than the Repbase dataset. Table II presents the hP (hierarchical precision), hR (hierarchical recall) and hF (hierarchical f-measure) obtained by the different machine learning methods on the PGSB and Repbase datasets for the two hierarchical classification approaches. The state-of-the-art method used an artificial neural network (ANN) for the hierarchical classification of TEs. We analyzed the performance of six classifiers: SVM, GBC, Random Forest, ExtraTree, ANN, and LogReg. Fig. 1 and Fig. 3 for the PGSB dataset, and Fig. 2 and Fig. 4 for the Repbase dataset, show that our proposed SVM achieves higher precision, recall and f-measure with both classification methods (nLLCPN and LCPNB). With optimized parameters, the proposed SVM approach generated better balanced accuracy for both classification methods.
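The poster does not spell out how hP, hR and hF are computed. The sketch below assumes the commonly used set-based definitions, in which each label is extended with all of its ancestors and the overlap between the true and predicted sets is summed over all instances. The path-style label strings, the class names in the example and the function names are illustrative only.

```python
def with_ancestors(label, sep="/"):
    """Expand a path label such as 'ClassI/LTR/Copia' into the set
    {'ClassI', 'ClassI/LTR', 'ClassI/LTR/Copia'} (the root is excluded)."""
    parts = label.split(sep)
    return {sep.join(parts[:i]) for i in range(1, len(parts) + 1)}


def hierarchical_scores(true_labels, predicted_labels):
    """Return (hP, hR, hF) under the assumed set-based definitions:
    hP = sum |P_i & T_i| / sum |P_i|
    hR = sum |P_i & T_i| / sum |T_i|
    hF = 2 * hP * hR / (hP + hR)
    """
    overlap = n_pred = n_true = 0
    for true, pred in zip(true_labels, predicted_labels):
        t_set = with_ancestors(true)
        p_set = with_ancestors(pred)
        overlap += len(t_set & p_set)
        n_pred += len(p_set)
        n_true += len(t_set)
    hp = overlap / n_pred if n_pred else 0.0
    hr = overlap / n_true if n_true else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf


true = ["ClassI/LTR/Copia", "ClassII/Subclass1"]
pred = ["ClassI/LTR", "ClassII/Subclass2"]
hp, hr, hf = hierarchical_scores(true, pred)
print(f"hP={hp:.4f} hR={hr:.4f} hF={hf:.4f}")  # hP=0.7500 hR=0.6000 hF=0.6667
```

In this example the first prediction stops one level early and the second takes a wrong branch below the correct class, so both still earn partial credit for the ancestors they share with the true label. A flat accuracy score would count both as plain errors.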
Conclusions and Future Work
Proper classification of Transposable Elements (TEs) is crucial for identifying their roles in the genome.
Machine learning enables rapid annotation of the likeliest class of a transposable element.
An advanced machine learning approach improves the prediction accuracy of the hierarchical classification of TEs.
Optimizing the cost and gamma parameters of a support vector machine (SVM) with a radial basis function (RBF) kernel leads to a better hierarchical classification of transposable elements.
In the future, we would like to explore different features, as well as further advanced machine learning techniques and hierarchical classification approaches.

Acknowledgements
We gratefully acknowledge the Louisiana Board of Regents through two Board of Regents Support Funds: LEQSF ( )-RD-B-07 & LEQSF( )-RD-A-26. Start-up funds from the University of New Orleans to Joel Atallah also provided support for this project.

References
[1] Nakano, Felipe Kenji, et al. "Top-down strategies for hierarchical classification of transposable elements with neural networks." Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017.
[2] Wicker, Thomas, et al. "A unified classification system for eukaryotic transposable elements." Nature Reviews Genetics 8.12 (2007): 973.
[3] Melsted, Pall, and Jonathan K. Pritchard. "Efficient counting of k-mers in DNA sequences using a bloom filter." BMC bioinformatics 12.1 (2011): 333.