K.U.Leuven Department of Computer Science Predicting gene functions using hierarchical multi-label decision tree ensembles Celine Vens, Leander Schietgat,

Presentation transcript:

Predicting gene functions using hierarchical multi-label decision tree ensembles
Celine Vens, Leander Schietgat, Jan Struyf, Hendrik Blockeel, Dragi Kocev, Sašo Džeroski
K.U.Leuven Department of Computer Science

Hierarchical Multi-Label Classification (HMC) for Gene Function Prediction
Classification: a common machine learning task, e.g.,
– Given: genes with known function
– Task: predict function for new genes
Special case: hierarchical multi-label classification (HMC)
– A gene can have multiple functions
– Functions are organized in a hierarchy: a tree (e.g., MIPS FunCat) or a DAG (e.g., Gene Ontology)
Hierarchy constraint: if a gene is labeled with function X, then it is also labeled with all parents of X (and so, transitively, with all ancestors of X)
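The hierarchy constraint can be made concrete with a small sketch. The code below closes a label set upward so that every labeled function also carries its parents, and it works for both trees and DAGs. The function names and the `PARENTS` fragment are illustrative assumptions (the class identifiers follow the FunCat-style notation used elsewhere in the slides); this is not code from the Clus system.

```python
from typing import Dict, Iterable, Set


def close_under_hierarchy(labels: Iterable[str],
                          parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the labels plus every ancestor reachable through `parents`.

    `parents` maps a class to its direct parents; a class may have
    several parents, so the same code covers trees and DAGs.
    """
    closed: Set[str] = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label in closed:
            continue
        closed.add(label)
        stack.extend(parents.get(label, ()))
    return closed


def satisfies_constraint(labels: Iterable[str],
                         parents: Dict[str, Set[str]]) -> bool:
    """True if the label set already contains all of its ancestors."""
    label_set = set(labels)
    return close_under_hierarchy(label_set, parents) == label_set


# Hypothetical FunCat-style fragment (assumed for illustration only):
# "40/3" and "40/16" are children of "40", "5/1" is a child of "5".
PARENTS = {"40/3": {"40"}, "40/16": {"40"}, "5/1": {"5"}}

if __name__ == "__main__":
    print(sorted(close_under_hierarchy({"40/3", "5/1"}, PARENTS)))
```

A label set such as {"5/1"} violates the constraint, while {"5", "5/1"} satisfies it.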

Predictions in Functional Genomics
S. cerevisiae (13 datasets) and A. thaliana (12 datasets)
– two of biology’s model organisms
– most genes are annotated, ideal for testing purposes
– the method can be applied to other organisms
Data based on sequence statistics, phenotype, secondary structure, homology, microarray data, …

Predictive Clustering Trees
Our focus is on decision trees
Advantages: fast to build, noise-resistant, fast to apply, accurate predictions, easy to interpret, …
General framework: predictive clustering trees (PCTs)
[Figure: Input (table of genes G1, G2, …, G6, … with features A1, A2, …, An and known functions 1, …, 5, 5/1, …, 40, 40/3, 40/16, …, marked with x) → Algorithm (PCT-algo: top-down induction of PCTs) → Output (PCT)]
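The slide does not state how a split is chosen during top-down induction. A common reading of predictive clustering trees is that each test is picked to reduce the variance of the target vectors in the resulting subsets. The sketch below assumes exactly that, with an unweighted squared Euclidean distance over 0/1 label vectors; the function names and the toy data are hypothetical, and the real Clus-HMC heuristic may differ in its details.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def variance(rows: List[Vector]) -> float:
    """Mean squared Euclidean distance of the label vectors to their mean."""
    if not rows:
        return 0.0
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return sum(sum((r[j] - mean[j]) ** 2 for j in range(d)) for r in rows) / n


def best_split(features: List[Vector],
               labels: List[Vector]) -> Tuple[int, float, float]:
    """Return (feature index, threshold, variance reduction) for the best
    test of the form x[f] <= t. Returns (-1, 0.0, 0.0) if no test helps."""
    n = len(features)
    total = variance(labels)
    best = (-1, 0.0, 0.0)
    for f in range(len(features[0])):
        # Every observed value except the largest is a candidate threshold.
        for t in sorted({x[f] for x in features})[:-1]:
            left = [y for x, y in zip(features, labels) if x[f] <= t]
            right = [y for x, y in zip(features, labels) if x[f] > t]
            gain = (total
                    - len(left) / n * variance(left)
                    - len(right) / n * variance(right))
            if gain > best[2]:
                best = (f, t, gain)
    return best


if __name__ == "__main__":
    # Toy data (assumed): 4 genes, 2 features, 3 classes as 0/1 vectors.
    X = [[0.1, 5.0], [0.2, 1.0], [0.9, 5.0], [0.8, 1.0]]
    Y = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
    print(best_split(X, Y))
```

Applying `best_split` recursively to the two subsets until a stopping criterion holds would yield a single tree whose leaves predict a whole label vector.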

Decision Trees for HMC: Different Approaches
Clus-SC: standard approach, learns one tree per class
Clus-HSC: special-purpose approach, learns one tree per class + hierarchy constraint
Clus-HMC: our approach, learns one single tree for all classes
Criteria compared: hierarchy constraint, identifies global features, predictive performance, model size, efficiency

Predictive Clustering Forests
Ensembles
– Less interpretability
– Better performance
Algorithm: Clus-HMC-Ens
[Figure: Training set → 50 bootstrap replicates (1, 2, 3, …, n) → Clus-HMC on each replicate → 50 PCTs → 50 predictions on the test set (L1, L2, L3, …, Ln) → combined prediction L]
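The ensemble procedure in the figure (bootstrap replicates, one model per replicate, combined prediction) can be sketched as follows. Two things here are assumptions rather than statements from the slides: the per-tree predictions are combined by averaging, and the base learner `fit_mean_predictor` is only a placeholder standing in for Clus-HMC so that the example runs on its own.

```python
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], List[float]]


def fit_mean_predictor(X: List[Vector], Y: List[Vector]) -> Model:
    """Placeholder base learner (NOT Clus-HMC): always predicts the mean
    label vector of its training sample."""
    n, d = len(Y), len(Y[0])
    mean = [sum(y[j] for y in Y) / n for j in range(d)]
    return lambda x: list(mean)


def bagged_ensemble(X: List[Vector], Y: List[Vector],
                    fit: Callable[[List[Vector], List[Vector]], Model] = fit_mean_predictor,
                    n_models: int = 50, seed: int = 0) -> Model:
    """Train `n_models` base models on bootstrap replicates and return a
    predictor that averages their outputs per class (assumed combination)."""
    rng = random.Random(seed)
    n = len(X)
    models: List[Model] = []
    for _ in range(n_models):
        # Bootstrap replicate: n draws with replacement from the training set.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit([X[i] for i in idx], [Y[i] for i in idx]))

    def predict(x: Vector) -> List[float]:
        votes = [m(x) for m in models]
        return [sum(v[j] for v in votes) / len(votes)
                for j in range(len(votes[0]))]

    return predict


if __name__ == "__main__":
    X = [[0.0], [1.0], [2.0], [3.0]]
    Y = [[1, 0], [1, 1], [0, 0], [1, 0]]
    model = bagged_ensemble(X, Y, n_models=50, seed=1)
    print(model([0.0]))
```

Any function with the same `fit(X, Y)` signature that returns a per-class score vector could be passed in place of the placeholder learner.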

Decision Trees for HMC: Different Approaches
Clus-SC: standard approach, learns one tree per class
Clus-HSC: special-purpose approach, learns one tree per class + hierarchy constraint
Clus-HMC: our approach, learns one single tree for all classes
Clus-HMC-Ens: variant of our approach, learns a forest
Criteria compared: hierarchy constraint, identifies global features, predictive performance, model size, efficiency

Evaluation
Evaluation: precision-recall
– precision: percentage of predicted functions that are correct (TP/(TP+FP))
– recall: percentage of actual functions predicted by the algorithm (TP/(TP+FN))
Average PR curve
– Consider (instance, class) couples
– A couple is true if the instance has the class, and predicted true if the instance is predicted to have the class
[Figure: confusion matrix with cells TP, FN, FP, TN]
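As a minimal sketch of the precision and recall definitions above, the code below pools all (instance, class) couples and counts TP, FP and FN at a given threshold on the predicted scores. Sweeping the threshold to trace a curve is an assumption about how the average PR curve is obtained, and returning 1.0 when a denominator is empty is just one possible convention. All names are illustrative.

```python
from typing import List, Sequence, Tuple


def precision_recall(scores: List[Sequence[float]],
                     truth: List[Sequence[int]],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall over all (instance, class) couples.

    A couple is predicted true when its score is >= threshold.
    """
    tp = fp = fn = 0
    for score_row, truth_row in zip(scores, truth):
        for score, actual in zip(score_row, truth_row):
            predicted = score >= threshold
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
    # Convention (assumed): an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def pr_curve(scores: List[Sequence[float]],
             truth: List[Sequence[int]],
             thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """One (precision, recall) point per threshold."""
    return [precision_recall(scores, truth, t) for t in thresholds]


if __name__ == "__main__":
    # Toy data (assumed): 2 genes, 3 classes.
    scores = [[0.9, 0.2, 0.6], [0.4, 0.8, 0.1]]
    truth = [[1, 0, 0], [1, 1, 0]]
    for point in pr_curve(scores, truth, [0.0, 0.5, 1.0]):
        print(point)
```

Pooling the couples in this way means frequent classes contribute more to the curve than rare ones.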

[Figure: four result panels: S. cerevisiae-FunCat (hom), A. thaliana-GO (seq), S. cerevisiae-FunCat (expr), A. thaliana-GO (interpro)]
Clus-HMC-Ens better than Clus-HMC (average AUC improvement of 7%)
Clus-HMC better than C4.5H (state-of-the-art system for HMC) (for the same recall as C4.5H, average precision improvement of 20.9%)

Comparison with SVMs (Barutcuoglu et al.)
– Learn one SVM per class
– Correct for hierarchy constraint (HC) violations with a Bayesian model

Conclusions
Clus-HMC outperforms (or is comparable to) state-of-the-art methods on functional genomics tasks
Ensembles of Clus-HMC are able to boost performance, if the user is willing to give up on interpretability
“Revenge of the decision trees”