Hierarchical multilabel classification trees for gene function prediction Leander Schietgat Hendrik Blockeel Jan Struyf Katholieke Universiteit Leuven.


1 Hierarchical multilabel classification trees for gene function prediction
  Leander Schietgat, Hendrik Blockeel, Jan Struyf (Katholieke Universiteit Leuven, Belgium)
  Amanda Clare (University of Aberystwyth, Wales)
  Sašo Džeroski (Jožef Stefan Institute, Ljubljana, Slovenia)
  Probabilistic Modeling and Machine Learning in Structural and Systems Biology, Tuusula, Finland, 17-18 June 2006

2 Overview
  The application
    gene function prediction
  The machine learning context
    hierarchical multilabel classification
  Decision trees for HMC
    the algorithm: Clus-HMC
  Experimental results
  Conclusions
  PMSB 2006

3 Gene Function Prediction
  Task
    Given a data set with descriptions of genes and the functions they have
    Learn a model that can predict for a new gene what functions it performs
  Genes can have multiple functions
  These functions are hierarchically organised
  [Figure: example class hierarchy over the classes c1, c2, c3, c21, c22]

4 Machine Learning
  Classifier
    predicts for unseen instances the class to which they belong
    learned with already classified training examples
  Different techniques
    decision trees
    support vector machines
    Bayesian networks
    …

5 Hierarchical Multilabel Classification
  Normal classification setting
    only predicts a single class
  HMC
    predicts multiple classes at once
    classes are organized in a hierarchy
  Hierarchy constraint
    instances of a class must be instances of its superclasses
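
The hierarchy constraint can be illustrated with a minimal Python sketch. The parent map below is a hypothetical toy hierarchy (the assumption that c21 and c22 sit under c2 is made only for illustration), and the function names are not taken from Clus or any other system.

```python
# Hypothetical toy hierarchy: class -> parent class (None for top-level classes).
# The placement of c21 and c22 under c2 is an assumption for illustration.
PARENT = {"c1": None, "c2": None, "c3": None, "c21": "c2", "c22": "c2"}


def ancestors(cls, parent):
    """Return all superclasses of cls, nearest first."""
    result = []
    current = parent[cls]
    while current is not None:
        result.append(current)
        current = parent[current]
    return result


def satisfies_hierarchy_constraint(labels, parent):
    """True if every class in labels comes with all of its superclasses."""
    return all(a in labels for c in labels for a in ancestors(c, parent))


def close_under_ancestors(labels, parent):
    """Add the missing superclasses so that the constraint holds."""
    closed = set(labels)
    for c in labels:
        closed.update(ancestors(c, parent))
    return closed


if __name__ == "__main__":
    print(satisfies_hierarchy_constraint({"c2", "c21"}, PARENT))
    print(satisfies_hierarchy_constraint({"c21"}, PARENT))
    print(sorted(close_under_ancestors({"c21"}, PARENT)))
```

A label set such as {c21} alone violates the constraint in this toy hierarchy; closing it under ancestors adds c2.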

6 Two HMC approaches
  1. Learn a model for each class and combine the predictions
    Advantage
      a lot of machine learning algorithms available
    Disadvantages
      efficiency
      skewed class distributions
      hierarchical relationships
  [Figure: separate models m1, m2, …, mn, each answering one question: c1?, c2?, …, cn?]

7 Two HMC approaches (cont'd)
  2. Learn a single model that predicts all the classes together
    Advantages
      faster to learn
      easier to interpret
      hierarchy constraint automatically imposed
      selection of features relevant for all classes
    Disadvantage
      may have worse predictive performance
  [Figure: a single model M predicting the vector [c1, c2, …, cn]]

8 Related work on HMC
  Barutcuoglu et al. (2006)
    learn classes separately with SVMs and combine the predictions with Naïve Bayes
  Clare (2003)
    extension of the C4.5 decision tree method that learns all classes together
  A lot of work in the area of text classification
    Rousu et al. (2005) give an overview of SVM methods that learn a single model for all classes

              Gene function prediction   Text classification
  Approach 1  Barutcuoglu et al.         …
  Approach 2  Clare                      …

9 Why decision trees?
  fast to build
  fast to use
  accurate predictions
  easy to interpret

  Training examples:
  Gene  ND  HS  …  MF?
  G1    25  29  …  -
  G2    32  40  …  +
  G3    19  0   …  -
  G4    44  45  …  +
  …     …   …   …  …

  [Figure: decision tree with the tests "Nitrogen depletion <= -2.74?" and "Heat shock > 1.28?" (yes/no branches), separating positive (+) from negative (-) training examples]

10 Decision trees for HMC
  The Clus system
    created by Jan Struyf
    propositional DT learner, implemented in Java
    uses ideas of:
      C4.5 [Quinlan93] and CART [Breiman84]
      Predictive Clustering Trees [Blockeel98]
  Heuristic for HMC
    look for the test that minimizes the intra-cluster variance (= generalisation of CART)
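
The variance heuristic can be sketched as follows. This is a simplified illustration and not the Clus implementation: it assumes each example carries a 0/1 class vector, uses the plain (unweighted) squared Euclidean distance to the mean vector as variance, and scores a test by the size-weighted variance of the two subsets it induces. The data, attribute names and thresholds are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def variance(vectors):
    """Mean squared Euclidean distance to the mean vector (0.0 if empty)."""
    if not vectors:
        return 0.0
    m = mean_vector(vectors)
    total = sum(sum((x - mx) ** 2 for x, mx in zip(v, m)) for v in vectors)
    return total / len(vectors)


def split_score(examples, attribute, threshold):
    """Size-weighted variance of the subsets induced by attribute > threshold."""
    yes = [y for x, y in examples if x[attribute] > threshold]
    no = [y for x, y in examples if x[attribute] <= threshold]
    n = len(examples)
    return (len(yes) * variance(yes) + len(no) * variance(no)) / n


def best_test(examples, candidates):
    """Return the candidate (attribute, threshold) with the lowest score."""
    return min(candidates, key=lambda t: split_score(examples, t[0], t[1]))


# Hypothetical toy data: (attribute values, class vector over two classes).
EXAMPLES = [
    ({"ND": 25, "HS": 29}, [1, 0]),
    ({"ND": 32, "HS": 40}, [0, 1]),
    ({"ND": 19, "HS": 0}, [1, 0]),
    ({"ND": 44, "HS": 45}, [0, 1]),
]

if __name__ == "__main__":
    print(best_test(EXAMPLES, [("ND", 30), ("HS", 20)]))
```

On this toy data the test ND > 30 yields two pure subsets (score 0), so it is preferred over HS > 20.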

11 Decision trees for HMC (cont'd)
  can be used for HMC (Clus-HMC) …
  … as well as binary classification (Clus-SC ~ CART)
  [Figure: a single HMC tree with sets of classes at its leaves, next to n single-classification trees (1, 2, …, n) answering c1?, c2?, …, cn?]

12 Experiments in yeast functional genomics
  Saccharomyces cerevisiae or baker's/brewer's yeast
  MIPS FunCat hierarchy
    250 functions of yeast genes
  12 datasets [Clare03]
    Sequence structure (seq)
    Phenotype growth (pheno)
    Secondary structure (struc)
    Homology search (hom)
    Microarray data
      cellcycle, church, derisi, eisen, gasch1, gasch2, spo, expr (all)
  Excerpt of the FunCat hierarchy:
    1 METABOLISM
      1/1 amino acid metabolism
      1/2 nitrogen and sulfur metabolisms
      …
    2 ENERGY
      2/1 glycolysis and gluconeogenesis
      …

13 Example run
  each leaf contains multiple classes
    which classes to predict?
    problem: different class frequencies
    use of threshold
    precision-recall curves: independent of a specific threshold

  [Figure: example tree with the tests "nitrogen_depletion > 5" and "37C_to_25C_shock > 1.28". Each training gene (G1, G2, …) has a description (attributes A1, A2, …, An) and a set of functions (1, …, 5, 5/1, …, 40, 40/3, 40/16, …), e.g. {1,5,5/1,3,3/5}, {5,5/1,40,40/3}, {1,5}, {40,40/3,40/16}, {5,5/1,40}, {40,40/3,40/16}]

  Predictions of the three leaves at different thresholds p:
  Threshold  Leaf 1           Leaf 2            Leaf 3
  p=0%       40, 40/3, 40/16  5, 5/1, 40, 40/3  1, 5, 5/1, 3, 3/5
  p=50%      40, 40/3, 40/16  5, 5/1, 40        1, 5
  p=100%     40, 40/16        5, 5/1, 40        1, 5
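
Thresholded prediction in a leaf can be sketched as below. This is a minimal illustration under assumptions: the slide does not say whether a class must reach or strictly exceed the threshold, so the sketch uses "relative frequency >= threshold", and the leaf contents are hypothetical rather than taken from the slide.

```python
from collections import Counter


def predict_leaf(label_sets, threshold):
    """Predict each class whose relative frequency in the leaf is >= threshold.

    label_sets: one set of classes per training example in the leaf.
    Only classes that occur in the leaf are considered, so a threshold of
    0.0 returns the union of all label sets (an assumption of this sketch).
    """
    counts = Counter(c for labels in label_sets for c in labels)
    n = len(label_sets)
    return {c for c, k in counts.items() if k / n >= threshold}


# Hypothetical leaf with three training genes.
LEAF = [
    {"5", "5/1", "40"},
    {"5", "5/1", "40", "40/3"},
    {"5", "5/1"},
]

if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(p, sorted(predict_leaf(LEAF, p)))
```

If every training label set satisfies the hierarchy constraint, a superclass is at least as frequent as each of its subclasses, so the thresholded prediction satisfies the constraint as well.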

14 Comparison of Clus-HMC with [Clare03]
  Average precision-recall curves
  PRECISION = proportion of (instance, class) predictions that are correct
  RECALL = proportion of true (instance, class) cases that are predicted
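
The two definitions above count (instance, class) pairs, which can be sketched as follows. The gene names and label sets are hypothetical, and returning 1.0 when a denominator is empty is a convention chosen for this sketch, not something stated in the slides.

```python
def precision_recall(predicted, actual):
    """Precision and recall over (instance, class) pairs.

    predicted, actual: dict mapping an instance to its set of classes.
    """
    pred_pairs = {(i, c) for i, classes in predicted.items() for c in classes}
    true_pairs = {(i, c) for i, classes in actual.items() for c in classes}
    correct = len(pred_pairs & true_pairs)
    # Assumed convention: an empty denominator gives a value of 1.0.
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Hypothetical example with two genes.
ACTUAL = {"G1": {"1", "5"}, "G2": {"5", "5/1", "40"}}
PREDICTED = {"G1": {"1", "5"}, "G2": {"5", "40/3"}}

if __name__ == "__main__":
    print(precision_recall(PREDICTED, ACTUAL))
```

Here 3 of the 4 predicted pairs are correct (precision 0.75) and 3 of the 5 true pairs are found (recall 0.6). Repeating the computation for a range of thresholds gives the points of a precision-recall curve.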

15 Extracting rules
  e.g. predictions for class 40/3 in "gasch1" dataset
  IF Nitrogen_Depletion_8_h <= -2.74
  AND Nitrogen_Depletion_2_h > -1.94
  AND 1point5_mM_diamide_5_min > -0.03
  AND 1M_sorbitol___45_min_ > -0.36
  AND 37C_to_25C_shock___60_min > 1.28
  THEN 40, 40/3
  Precision: 0.97
  Recall: 0.15
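
As an illustration, the extracted rule can be written as a small Python predicate. The attribute names and thresholds are copied from the rule above; the representation as a list of conditions, the function name and the example gene values are assumptions of this sketch.

```python
import operator

# Conditions of the rule above: (attribute, comparison, threshold).
RULE_CONDITIONS = [
    ("Nitrogen_Depletion_8_h", operator.le, -2.74),
    ("Nitrogen_Depletion_2_h", operator.gt, -1.94),
    ("1point5_mM_diamide_5_min", operator.gt, -0.03),
    ("1M_sorbitol___45_min_", operator.gt, -0.36),
    ("37C_to_25C_shock___60_min", operator.gt, 1.28),
]
RULE_PREDICTION = {"40", "40/3"}


def apply_rule(gene):
    """Return the predicted classes if all conditions hold, else an empty set."""
    if all(op(gene[attr], value) for attr, op, value in RULE_CONDITIONS):
        return set(RULE_PREDICTION)
    return set()


# Hypothetical expression values, chosen so that the rule fires.
EXAMPLE_GENE = {
    "Nitrogen_Depletion_8_h": -3.0,
    "Nitrogen_Depletion_2_h": -1.0,
    "1point5_mM_diamide_5_min": 0.1,
    "1M_sorbitol___45_min_": 0.0,
    "37C_to_25C_shock___60_min": 1.5,
}

if __name__ == "__main__":
    print(sorted(apply_rule(EXAMPLE_GENE)))
```

A gene that fails any single condition receives no prediction from this rule, which is consistent with a rule that has high precision but low recall.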

16 HMC vs. single classification
  Tree sizes
    on average
      HMC tree: 24 nodes
      SC tree: 33 nodes (250 such trees)
  Time to grow trees
    a single SC tree is grown faster than a single HMC tree
    but 250 single trees have to be built
    HMC is on average 37 times faster
  Predictive performance
    next slide

17 HMC vs. single classification
  Average precision-recall curves

18 Explanation of the results
  The classes are not independent
    different trees for different classes actually share structure
      explains some complexity reduction achieved by Clus-HMC
    one class carries information on other classes
      this increases the signal-to-noise ratio
      provides better guidance when learning the tree (explaining good predictive performance)
      avoids overfitting (explaining further reduction of tree size)
      this was confirmed empirically

19 Conclusions
  HMC decision trees are a useful tool for gene function prediction
    fast to learn
    high interpretability
  Compared to regular tree learning, HMC tree learning:
    is even faster
    yields trees that:
      are smaller
      are easier to interpret
      have equal or better predictive performance

20 Further work
  Comparison to other HMC learning algorithms
    kernel methods studied by Rousu et al. and Barutcuoglu et al.
    other suggestions are welcome!
  Use a more advanced hierarchy such as Gene Ontology
    thousands of classes, spread over 19 levels
    how to handle the part_of relationship?
      if a function A is part_of a function B, does a gene with function A also have function B?
      gene "has" function B vs. gene "is involved in" function B
  [Figure: excerpt of the Gene Ontology (GO) with terms such as cellular component, molecular function, biological process, catalytic activity, 3-isopropylmalate dehydratase activity, cell, cytosol, physiological process, amino acid metabolism, amino acid biosynthesis, branched chain family amino acid metabolism, branched chain family amino acid biosynthesis, leucine metabolism, leucine biosynthesis]

21 Questions?

