1 Decision Trees for Hierarchical Multilabel Classification: A Case Study in Functional Genomics  Hendrik Blockeel (1), Leander Schietgat (1), Jan Struyf (1,2), Saso Dzeroski (3), Amanda Clare (4)  (1) Katholieke Universiteit Leuven  (2) University of Wisconsin, Madison  (3) Jozef Stefan Institute, Ljubljana  (4) University of Wales, Aberystwyth

2 Overview  The task: Hierarchical multilabel classification (HMC)  Applied to functional genomics  Decision trees for HMC  Multiple prediction with decision trees  HMC decision trees  Experiments  How does HMC tree learning compare to learning multiple standard trees?  Conclusions

3 Classification settings  Normally, in classification, we assign one class label c_i from a set C = {c_1, …, c_k} to each example  In multilabel classification, we have to assign a subset S ⊆ C to each example  i.e., one example can belong to multiple classes  Some applications:  Text classification: assign subjects (newsgroups) to texts  Functional genomics: assign functions to genes  In hierarchical multilabel classification (HMC), the classes C form a hierarchy (C, ≤)  The partial order ≤ expresses “is a superclass of”

4 Hierarchical multilabel classification  Hierarchy constraint:  c_i ≤ c_j ⇒ coverage(c_j) ⊆ coverage(c_i)  Elements of a class must be elements of its superclasses  Should hold for given data as well as predictions  Straightforward way to learn an HMC model:  Learn k binary classifiers, one for each class  Disadvantages:  1. difficult to guarantee hierarchy constraint  2. skewed class distributions (few pos, many neg)  3. relatively slow  4. no simple interpretable model  Alternative: learn one classifier that predicts a vector of classes  Quite natural for, e.g., neural networks  We will do this with (interpretable) decision trees
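The hierarchy constraint on this slide can be illustrated with a minimal Python sketch. The toy hierarchy `PARENT` and the function names are hypothetical and chosen only for illustration; they do not come from the slides or from any particular system.

```python
# Hypothetical toy hierarchy: each class maps to its direct superclass
# (None for a top-level class). Not taken from the slides.
PARENT = {"c1": None, "c2": None, "c2/1": "c2", "c2/2": "c2"}

def ancestors(cls, parent=PARENT):
    """Return all superclasses of cls, nearest first."""
    out = []
    p = parent[cls]
    while p is not None:
        out.append(p)
        p = parent[p]
    return out

def satisfies_hierarchy(labels, parent=PARENT):
    """True if every class in labels comes with all of its superclasses."""
    return all(a in labels for c in labels for a in ancestors(c, parent))

def close_upwards(labels, parent=PARENT):
    """Add the missing superclasses so the label set meets the constraint."""
    closed = set(labels)
    for c in labels:
        closed.update(ancestors(c, parent))
    return closed
```

With k independent binary classifiers nothing forces the predicted label set to pass `satisfies_hierarchy`, which is the first disadvantage listed on the slide; a repair step such as `close_upwards` would be one possible workaround.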

5 Goal of this work  There has been work on extending decision tree learning to the HMC case  Multiple prediction trees: Blockeel et al., ICML 1998; Clare and King, ECML 2001; …  HMC trees: Blockeel et al., 2002; Clare, 2003; Struyf et al., 2005  HMC trees were evaluated in functional genomics, with good results (a proof of concept)  But: no comparison with learning multiple single classification trees has been made  Size of trees, predictive accuracy, runtimes…  Previous work focused on the knowledge discovery aspect  We compare both approaches for functional genomics

6 Functional genomics  Task: Given a data set with descriptions of genes and the functions they have, learn a model that can predict for a new gene what functions it performs  A gene can have multiple functions (out of 250 possible functions, in our case)  Could be done with decision trees, with all the advantages that brings (fast, interpretable)… But:  Decision trees predict only one class, not a set of classes  Should we learn a separate tree for each function?  250 functions = 250 trees: not so fast and interpretable anymore!
[Table: one row per gene (G1, G2, G3, …); the “description” columns A_1 … A_n hold the gene's attributes, and the “functions” columns 1 … 250 carry an x for each function the gene has]

7 Multiple prediction trees  A multiple prediction tree (MPT) makes multiple predictions at once  Basic idea: (Blockeel, De Raedt, Ramon, 1998)  A decision tree learner prefers tests that yield much information on the “class” attribute (measured using information gain (C4.5) or variance reduction (CART))  MPT learner prefers tests that reduce variance for all target variables together  Variance = mean squared distance of vectors to mean vector, in k-D space
[Table: the gene table from the previous slide (description columns A_1 … A_n, function columns 1 … 250), with the functions of each gene now treated as a single target vector]
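The variance definition on this slide (mean squared distance of the target vectors to their mean vector) can be written out as a short Python sketch. The function names are illustrative, and plain unweighted Euclidean distance is assumed here.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def variance(vectors):
    """Mean squared Euclidean distance of the vectors to their mean vector."""
    m = mean_vector(vectors)
    return sum(
        sum((x - mu) ** 2 for x, mu in zip(v, m)) for v in vectors
    ) / len(vectors)
```

A set of identical class vectors has variance 0, so a test that separates examples into groups with similar class vectors reduces the variance for all target variables at once.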

8 The algorithm
Procedure MPTree(T) returns tree
    (t*, h*, P*) = (none, ∞, ∅)
    For each possible test t
        P = partition induced by t on T
        h = Σ_{T_k ∈ P} |T_k|/|T| · Var(T_k)
        if (h < h*) and acceptable(t, P)
            (t*, h*, P*) = (t, h, P)
    If t* <> none
        for each T_k ∈ P*
            tree_k = MPTree(T_k)
        return node(t*, ∪_k {tree_k})
    Else
        return leaf(v)
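The procedure can be sketched in runnable Python as follows. This is an illustration under stated assumptions, not the Clus implementation: tests are restricted to `attribute <= value` on numeric attributes, `acceptable(t, P)` is replaced by a minimum subset size, pure nodes are not split, and the leaf value v is taken to be the mean target vector. All names are hypothetical.

```python
def variance(Y):
    """Mean squared Euclidean distance of the target vectors to their mean."""
    n = len(Y)
    mean = [sum(col) / n for col in zip(*Y)]
    return sum(sum((y - m) ** 2 for y, m in zip(row, mean)) for row in Y) / n

def mp_tree(X, Y, min_size=1):
    """X: list of numeric attribute vectors; Y: list of target vectors."""
    n = len(X)
    best = (None, float("inf"), None)              # (t*, h*, P*)
    if variance(Y) > 0:                            # assumption: pure nodes become leaves
        for a in range(len(X[0])):
            for v in sorted({x[a] for x in X})[:-1]:   # candidate tests: X[a] <= v
                left = [i for i in range(n) if X[i][a] <= v]
                right = [i for i in range(n) if X[i][a] > v]
                if min(len(left), len(right)) < min_size:  # stand-in for acceptable(t, P)
                    continue
                # Weighted variance of the partition induced by the test
                h = sum(len(p) / n * variance([Y[i] for i in p])
                        for p in (left, right))
                if h < best[1]:
                    best = ((a, v), h, (left, right))
    test, _, parts = best
    if test is None:
        # Leaf: store the mean target vector (assumed meaning of v)
        return {"leaf": [sum(col) / n for col in zip(*Y)]}
    children = [mp_tree([X[i] for i in p], [Y[i] for i in p], min_size)
                for p in parts]
    return {"test": test, "children": children}

def predict(tree, x):
    """Follow the tests down to a leaf and return its stored vector."""
    while "leaf" not in tree:
        a, v = tree["test"]
        tree = tree["children"][0 if x[a] <= v else 1]
    return tree["leaf"]
```

For example, `mp_tree([[0.0], [1.0], [2.0], [3.0]], [[1, 0], [1, 0], [0, 1], [0, 1]])` splits once on the single attribute and produces two pure leaves.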

9 HMC tree learning  A special case of MPT learning  Class vector contains all classes in hierarchy  Main characteristics:  Errors higher up in the hierarchy are more important  Use weighted Euclidean distance (higher weight for higher classes)  Need to ensure hierarchy constraint  Normally, leaf predicts c_i iff proportion of c_i examples in leaf is above some threshold t_i (often 0.5)  We will let t_i vary (see further)  To ensure compliance with hierarchy constraint:  c_i ≤ c_j ⇒ t_i ≤ t_j  Automatically fulfilled if all t_i equal
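The threshold rule can be made concrete with a small Python sketch. The hierarchy, the leaf proportions, and the use of `>=` for "above the threshold" are assumptions made for illustration only.

```python
# Hypothetical hierarchy: class -> direct superclass (None for top level).
PARENT = {"c1": None, "c2": None, "c2/1": "c2"}

def thresholds_comply(t, parent=PARENT):
    """True if every superclass has a threshold no higher than its subclass."""
    return all(p is None or t[p] <= t[c] for c, p in parent.items())

def predict_leaf(proportions, t):
    """Predict class c iff its proportion in the leaf reaches threshold t[c]."""
    return {c for c, p in proportions.items() if p >= t[c]}

# Proportions in a leaf; a subclass can never be more frequent than its superclass.
leaf = {"c1": 0.2, "c2": 0.6, "c2/1": 0.55}
equal = {"c1": 0.5, "c2": 0.5, "c2/1": 0.5}
bad = {"c1": 0.5, "c2": 0.7, "c2/1": 0.5}
```

With the equal thresholds the leaf predicts `c2` together with `c2/1`. With `bad`, where the superclass has the higher threshold, the leaf predicts `c2/1` without `c2`, which violates the hierarchy constraint.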

10 Example  x_1: {c_1, c_3, c_5} = [1,0,1,0,1,0,0]  x_2: {c_1, c_3, c_7} = [1,0,1,0,0,0,1]  x_3: {c_1, c_2, c_5} = [1,1,0,0,1,0,0]  d²(x_1, x_2) = 0.25 + 0.25 = 0.5  d²(x_1, x_3) = 1 + 1 = 2  x_1 is more similar to x_2 than to x_3  DT tries to create leaves with “similar” examples, i.e., relatively pure w.r.t. class sets
[Figure: the class hierarchy over c_1 … c_7, drawn once for each of x_1, x_2, x_3, with the levels labelled “Weight 1” and “Weight 0.5”]
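The two distances on this slide can be reproduced with a short Python sketch. The class vectors are built from the class sets listed for x_1, x_2, x_3. The weight vector is partly an assumption: the slide's arithmetic implies weight 1 for c_2 and c_3 and weight 0.5 for c_5 and c_7, while the weights given here to c_1, c_4, and c_6 are guesses that do not affect the result.

```python
# Assumed weights: 1 for c1..c3, 0.5 for c4..c7 (only c2, c3, c5, c7 matter here).
WEIGHTS = [1, 1, 1, 0.5, 0.5, 0.5, 0.5]

def weighted_sq_distance(a, b, w=WEIGHTS):
    """Squared weighted Euclidean distance between two 0/1 class vectors."""
    return sum((wi * (ai - bi)) ** 2 for wi, ai, bi in zip(w, a, b))

x1 = [1, 0, 1, 0, 1, 0, 0]   # {c1, c3, c5}
x2 = [1, 0, 1, 0, 0, 0, 1]   # {c1, c3, c7}
x3 = [1, 1, 0, 0, 1, 0, 0]   # {c1, c2, c5}

print(weighted_sq_distance(x1, x2))   # 0.5
print(weighted_sq_distance(x1, x3))   # 2.0
```

Disagreements on the weight-0.5 classes contribute 0.25 each, while disagreements on the weight-1 classes contribute 1 each, so x_1 ends up closer to x_2 than to x_3.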

11 Evaluating HMC trees  Original work by Clare et al.:  Derive rules with high “accuracy” and “coverage” from the tree  Quality of individual rules was assessed  No simple overall criterion to assess quality of tree  In this work: using precision-recall curves  Precision = P(pos | predicted pos)  Recall = P(predicted pos | pos)  The precision and recall of a tree depend on the thresholds t_i used  By changing the threshold t_i from 1 to 0, a precision-recall curve emerges  For 250 classes:  Precision = P(X | predicted X) [with X any of the 250 classes]  Recall = P(predicted X | X)  This gives a PR curve that is a kind of “average” of the individual PR curves for each class
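One way to read the pooled definitions above is to count (example, class) pairs over all classes together. The Python sketch below does this for a single threshold shared by all classes; the function names, the `>=` comparison, and the convention of returning precision 1.0 when nothing is predicted are assumptions, not details taken from the slides.

```python
def pooled_precision_recall(proportions, truth, t):
    """proportions[e][c]: predicted proportion of class c for example e;
    truth[e][c]: 1 if example e has class c, else 0; t: shared threshold."""
    tp = fp = fn = 0
    for p_row, y_row in zip(proportions, truth):
        for p, y in zip(p_row, y_row):
            predicted = p >= t
            if predicted and y == 1:
                tp += 1
            elif predicted and y == 0:
                fp += 1
            elif not predicted and y == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0   # convention when nothing is predicted
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def pr_curve(proportions, truth, thresholds):
    """One (threshold, precision, recall) point per threshold value."""
    return [(t, *pooled_precision_recall(proportions, truth, t))
            for t in thresholds]
```

Calling `pr_curve` with thresholds running from 1 down to 0 traces the curve: high thresholds give few, mostly correct predictions, and low thresholds give high recall at lower precision.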

12 The Clus system  Created by Jan Struyf  Propositional DT learner, implemented in Java  Implements ideas from  C4.5 (Quinlan, ’93)  CART (Breiman et al., ’84)  predictive clustering trees (Blockeel et al., ’98)  includes multiple prediction trees and hierarchical multilabel classification trees  Reads data in ARFF format (Weka)  We used two versions for our experiments:  Clus-HMC: HMC version as explained  Clus-SC: single classification version, +/- CART

13 The datasets  12 datasets from functional genomics  Each with a different description of the genes  Sequence statistics (1)  Phenotype (2)  Predicted secondary structure (3)  Homology (4)  Micro-array data (5-12)  Each with the same class hierarchy  250 classes distributed over 4 levels  Number of examples: 1592 to 3932  Number of attributes: 52 to 47034

14 Our expectations…  How does HMC tree learning compare to the “straightforward” approach of learning 250 trees?  We expect:  Faster learning: Learning 1 HMCT is slower than learning 1 SPT (single prediction tree), but faster than learning 250 SPTs  Much faster prediction: Using 1 HMCT for prediction is as fast as using 1 SPT for prediction, and hence 250 times faster than using 250 SPTs  Larger trees: HMCT is larger than the average tree for 1 class, but smaller than the set of 250 trees  Less accurate: HMCT is less accurate than the set of 250 SPTs (but hopefully not much less accurate)  So how much faster / simpler / less accurate are our HMC trees?

15 The results  The HMCT is on average less complex than one single SPT  HMCT has 24 nodes, SPTs have on average 33 nodes  … but you’d need 250 of the latter to do the same job  The HMCT is on average slightly more accurate than a single SPT  Measured using “average precision-recall curves” (see graphs)  Surprising, as each SPT is tuned for one specific prediction task  Expectations w.r.t. efficiency are confirmed  Learning: min. speedup factor = 4.5x, max 65x, average 37x  Prediction: >250 times faster (since the tree is not larger)  Faster to learn, much faster to apply

16 Precision-recall curves  Precision: proportion of predictions that are correct, P(X | predicted X)  Recall: proportion of class memberships correctly identified, P(predicted X | X)

17 An example rule
IF Nitrogen_Depletion_8_h <= -2.74
AND Nitrogen_Depletion_2_h > -1.94
AND 1point5_mM_diamide_5_min > -0.03
AND 1M_sorbitol___45_min_ > -0.36
AND 37C_to_25C_shock___60_min > 1.28
THEN 40, 40/3, 5, 5/1
 High interpretability: IF-THEN rules extracted from the HMCT are quite simple  For class 40/3: Recall = 0.15; precision = 0.97 (the rule covers 15% of all class 40/3 cases, and 97% of the cases fulfilling these conditions are indeed 40/3)

18 The effect of merging…  The single merged tree, optimized for c_1, c_2, …, c_250, is:  Smaller than the average individual tree  More accurate than the average individual tree
[Figure: 250 separate trees, optimized for c_1, for c_2, …, for c_250, next to one tree optimized for c_1, c_2, …, c_250]

19 Any explanation for these results?  Seems too good to be true… how is it possible?  Answer: the classes are not independent  Different trees for different classes actually share structure  Explains some of the complexity reduction achieved by the HMC tree, but not all!  One class carries information on other classes  This increases the signal-to-noise ratio  Provides better guidance when learning the tree (explaining good accuracy)  Avoids overfitting (explaining further reduction of tree size)  This was confirmed empirically

20 Overfitting  To check our “overfitting” hypothesis:  Compared area under PR curve on training set (A_tr) and test set (A_te)  For SPT: A_tr - A_te = 0.219  For HMCT: A_tr - A_te = 0.024  (to verify, we tried Weka’s M5’ too: 0.387)  So HMCT clearly overfits much less

21 Conclusions  Surprising discovery: a single tree can be found that  predicts 250 different functions with, on average, equal or better accuracy than special-purpose trees for each function  is not more complex than a single special-purpose tree (hence, 250 times simpler than the whole set)  is (much) more efficient to learn and to apply  The reason for this is to be found in the dependencies between the gene functions  They provide better guidance when learning the tree  They help to avoid overfitting  Multiple prediction / HMC trees have a lot of potential and should be used more often!

22 Ongoing work  More extensive experimentation  Predicting classes in a lattice instead of a tree-shaped hierarchy

