Decision Trees for Hierarchical Multilabel Classification: A Case Study in Functional Genomics
Hendrik Blockeel (1), Leander Schietgat (1), Jan Struyf (1,2), Saso Dzeroski (3), Amanda Clare (4)
(1) Katholieke Universiteit Leuven  (2) University of Wisconsin, Madison  (3) Jozef Stefan Institute, Ljubljana  (4) University of Wales, Aberystwyth
Overview
- The task: hierarchical multilabel classification (HMC), applied to functional genomics
- Decision trees for HMC: multiple prediction with decision trees; HMC decision trees
- Experiments: how does HMC tree learning compare to learning multiple standard trees?
- Conclusions
Classification settings
- Normally, in classification, we assign one class label c_i from a set C = {c_1, ..., c_k} to each example.
- In multilabel classification, we have to assign a subset S ⊆ C to each example, i.e., one example can belong to multiple classes. Some applications: text classification (assign subjects/newsgroups to texts), functional genomics (assign functions to genes).
- In hierarchical multilabel classification (HMC), the classes form a hierarchy (C, ≤), where the partial order ≤ expresses "is a superclass of".
Hierarchical multilabel classification
- Hierarchy constraint: c_i ≤ c_j ⇒ coverage(c_j) ⊆ coverage(c_i), i.e., elements of a class must be elements of its superclasses. This should hold for the given data as well as for the predictions (a small sketch of enforcing this on predictions follows after this slide).
- Straightforward way to learn an HMC model: learn k binary classifiers, one for each class. Disadvantages:
  1. difficult to guarantee the hierarchy constraint
  2. skewed class distributions (few positives, many negatives)
  3. relatively slow
  4. no simple interpretable model
- Alternative: learn one classifier that predicts a vector of classes. Quite natural for, e.g., neural networks; we will do this with (interpretable) decision trees.
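As an illustration of how the hierarchy constraint can be respected at prediction time, here is a minimal Python sketch (not part of the original work; the class names and parent map are invented) that closes a predicted label set under the superclass relation:

    # Minimal sketch: propagate every predicted class to all of its superclasses,
    # so that the prediction satisfies the hierarchy constraint.
    def enforce_hierarchy(predicted, parent):
        """Return the predicted label set closed under the superclass relation."""
        closed = set(predicted)
        for label in predicted:
            p = parent.get(label)
            while p is not None:        # walk up to the root, adding ancestors
                closed.add(p)
                p = parent.get(p)
        return closed

    # Toy hierarchy (invented): 40/3 is a subclass of 40, 5/1 is a subclass of 5.
    parent = {"40/3": "40", "40": None, "5/1": "5", "5": None}
    print(enforce_hierarchy({"40/3", "5/1"}, parent))   # {'40/3', '40', '5/1', '5'}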
Goal of this work
- There has been work on extending decision tree learning to the HMC case:
  - multiple prediction trees: Blockeel et al., ICML 1998; Clare and King, ECML 2001; ...
  - HMC trees: Blockeel et al., 2002; Clare, 2003; Struyf et al., 2005
- HMC trees were evaluated in functional genomics, with good results (a proof of concept).
- But no comparison with learning multiple single-classification trees has been made (size of trees, predictive accuracy, runtimes, ...); previous work focused on the knowledge discovery aspect.
- We compare both approaches for functional genomics.
Functional genomics
- Task: given a data set with descriptions of genes and the functions they have, learn a model that can predict, for a new gene, which functions it performs.
- A gene can have multiple functions (out of 250 possible functions, in our case).
- This could be done with decision trees, with all the advantages that brings (fast, interpretable)...
- But decision trees predict only one class, not a set of classes. Should we learn a separate tree for each function? 250 functions = 250 trees: not so fast and interpretable anymore!
- [Slide table: rows are genes G1, G2, G3, ...; the columns give descriptive attributes A1...An plus indicator columns for the 250 functions, an 'x' marking each function a gene has. A toy sketch of this layout follows below.]
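A toy illustration of this data layout in Python, with invented attribute values and function labels rather than the real datasets:

    # Toy illustration (invented values): each gene is a vector of descriptive
    # attributes plus a set of function labels.
    genes = {
        "G1": {"attributes": [0.12, -1.40, 3.0], "functions": {"1", "3", "4", "250"}},
        "G2": {"attributes": [0.50,  2.10, 0.7], "functions": {"2", "5", "249"}},
    }
    # A single HMC tree maps the attribute vector to the whole function set at once,
    # instead of training 250 separate binary trees, one per function column.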
Multiple prediction trees
- A multiple prediction tree (MPT) makes multiple predictions at once.
- Basic idea (Blockeel, De Raedt, Ramon, 1998):
  - A standard decision tree learner prefers tests that yield much information on the "class" attribute, measured using information gain (C4.5) or variance reduction (CART).
  - An MPT learner prefers tests that reduce the variance of all target variables together, where variance = mean squared distance of the class vectors to their mean vector in k-dimensional space (a sketch of this computation follows below).
- [Slide table: the same gene table as before, with the 250 indicator columns summarized as one set-valued "function" column per gene.]
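A minimal Python sketch of that variance computation, assuming the class vectors of a node are collected in a 0/1 NumPy array (the helper names are illustrative, not the Clus code):

    import numpy as np

    def multi_target_variance(Y):
        """Mean squared Euclidean distance of the class vectors in Y
        (shape: n_examples x k_classes) to their mean vector."""
        Y = np.asarray(Y, dtype=float)
        mean = Y.mean(axis=0)
        return float(np.mean(np.sum((Y - mean) ** 2, axis=1)))

    def split_score(partition):
        """Weighted variance h = sum_k (|T_k| / |T|) * Var(T_k) of a partition,
        given as a list of class-vector arrays, one per subset T_k."""
        n = sum(len(Yk) for Yk in partition)
        return sum(len(Yk) / n * multi_target_variance(Yk) for Yk in partition)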
The algorithm

    procedure MPTree(T) returns tree
        (t*, h*, P*) = (none, ∞, ∅)
        for each possible test t:
            P = partition induced by t on T
            h = Σ_{T_k ∈ P} (|T_k| / |T|) · Var(T_k)
            if h < h* and acceptable(t, P):
                (t*, h*, P*) = (t, h, P)
        if t* ≠ none:
            for each T_k ∈ P*:
                tree_k = MPTree(T_k)
            return node(t*, ∪_k {tree_k})
        else:
            return leaf(v)    # v: the prediction stored in the leaf (the mean class vector of T)
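A self-contained Python sketch of this procedure for numeric attributes and axis-parallel threshold tests, with a simple minimum-leaf-size check standing in for acceptable(t, P); the names and defaults are illustrative, not the Clus implementation:

    import numpy as np

    class Leaf:
        def __init__(self, Y):
            self.prediction = Y.mean(axis=0)   # mean class vector = per-class proportions

    class Node:
        def __init__(self, attr, threshold, left, right):
            self.attr, self.threshold, self.left, self.right = attr, threshold, left, right

    def variance(Y):
        return float(np.mean(np.sum((Y - Y.mean(axis=0)) ** 2, axis=1)))

    def build_mptree(X, Y, min_size=5):
        """X: (n, m) numeric attributes; Y: (n, k) 0/1 class vectors."""
        best = (None, None, np.inf)            # (attribute, threshold, weighted variance h)
        for a in range(X.shape[1]):
            for thr in np.unique(X[:, a]):
                left, right = X[:, a] <= thr, X[:, a] > thr
                if left.sum() < min_size or right.sum() < min_size:
                    continue                    # stand-in for acceptable(t, P)
                h = (left.sum() * variance(Y[left]) + right.sum() * variance(Y[right])) / len(Y)
                if h < best[2]:
                    best = (a, thr, h)
        if best[0] is None or best[2] >= variance(Y):
            return Leaf(Y)                      # no acceptable, variance-reducing test found
        a, thr, _ = best
        left, right = X[:, a] <= thr, X[:, a] > thr
        return Node(a, thr, build_mptree(X[left], Y[left], min_size),
                            build_mptree(X[right], Y[right], min_size))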
HMC tree learning
- A special case of MPT learning: the class vector contains all classes in the hierarchy.
- Main characteristics:
  - Errors higher up in the hierarchy are more important: use a weighted Euclidean distance, with higher weights for higher classes.
  - Need to ensure the hierarchy constraint.
- Normally, a leaf predicts c_i iff the proportion of c_i examples in the leaf is above some threshold t_i (often 0.5); we will let t_i vary (see further).
- To ensure compliance with the hierarchy constraint: c_i ≤ c_j ⇒ t_i ≤ t_j, which is automatically fulfilled if all t_i are equal.
Example
- Class vectors over the hierarchy c1, ..., c7:
  x1: {c1, c3, c5} = [1,0,1,0,1,0,0]
  x2: {c1, c3, c7} = [1,0,1,0,0,0,1]
  x3: {c1, c2, c5} = [1,1,0,0,1,0,0]
- With weight 1 for the top-level classes and weight 0.5 for their subclasses:
  d²(x1, x2) = 0.25 + 0.25 = 0.5   (they differ only in the subclasses c5 and c7)
  d²(x1, x3) = 1 + 1 = 2           (they differ in the higher-level classes c2 and c3)
- So x1 is more similar to x2 than to x3.
- The decision tree tries to create leaves with "similar" examples, i.e., leaves that are relatively pure w.r.t. class sets.
- [Slide figure: the class hierarchy c1...c7 with the two weight levels, shown once and then three more times with the class memberships of x1, x2 and x3 highlighted.]
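A small Python sketch reproducing these numbers, under the assumption (as on the slide) that top-level classes get weight 1 and their subclasses weight 0.5:

    import numpy as np

    def weighted_sq_distance(v1, v2, weights):
        """Squared weighted Euclidean distance between two class vectors."""
        v1, v2, weights = (np.asarray(a, dtype=float) for a in (v1, v2, weights))
        return float(np.sum((weights * (v1 - v2)) ** 2))

    w  = [1, 1, 1, 0.5, 0.5, 0.5, 0.5]      # weights for c1..c7 (assumed depth levels)
    x1 = [1, 0, 1, 0, 1, 0, 0]              # {c1, c3, c5}
    x2 = [1, 0, 1, 0, 0, 0, 1]              # {c1, c3, c7}
    x3 = [1, 1, 0, 0, 1, 0, 0]              # {c1, c2, c5}

    print(weighted_sq_distance(x1, x2, w))  # 0.5
    print(weighted_sq_distance(x1, x3, w))  # 2.0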
Evaluating HMC trees
- Original work by Clare et al.: derive rules with high "accuracy" and "coverage" from the tree. The quality of individual rules was assessed, but there was no simple overall criterion to assess the quality of the tree.
- In this work: precision-recall curves.
  - Precision = P(pos | predicted pos); Recall = P(predicted pos | pos).
  - The precision and recall of a tree depend on the thresholds t_i used; by varying the threshold t_i from 1 to 0, a precision-recall curve emerges.
  - For 250 classes: Precision = P(X | predicted X) and Recall = P(predicted X | X), with X ranging over the 250 classes. This gives a PR curve that is a kind of "average" of the individual PR curves for each class (a sketch of this pooled curve is given below).
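A minimal sketch of such a pooled precision-recall curve, with one threshold swept over all classes at once (an assumed implementation, not necessarily how Clus computes it):

    import numpy as np

    def pooled_pr_curve(proportions, labels, thresholds=np.linspace(1, 0, 101)):
        """proportions, labels: (n_examples, n_classes) arrays of predicted
        class proportions and true 0/1 memberships."""
        curve = []
        for t in thresholds:
            predicted = proportions >= t
            tp = np.sum(predicted & (labels == 1))
            precision = tp / predicted.sum() if predicted.sum() else 1.0
            recall = tp / (labels == 1).sum()
            curve.append((recall, precision))
        return curve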
The Clus system
- Created by Jan Struyf.
- A propositional decision tree learner, implemented in Java.
- Implements ideas from C4.5 (Quinlan, '93), CART (Breiman et al., '84) and predictive clustering trees (Blockeel et al., '98); includes multiple prediction trees and hierarchical multilabel classification trees.
- Reads data in ARFF format (Weka).
- We used two versions for our experiments:
  - Clus-HMC: the HMC version as explained
  - Clus-SC: a single-classification version, roughly comparable to CART
The datasets
- 12 datasets from functional genomics, each with a different description of the genes:
  - sequence statistics (1)
  - phenotype (2)
  - predicted secondary structure (3)
  - homology (4)
  - microarray data (5-12)
- Each with the same class hierarchy: 250 classes distributed over 4 levels.
- Number of examples: 1592 to 3932; number of attributes: 52 to 47034.
Our expectations...
- How does HMC tree learning compare to the "straightforward" approach of learning 250 trees? We expect:
  - Faster learning: learning 1 HMCT is slower than learning 1 SPT (single prediction tree), but faster than learning 250 SPTs.
  - Much faster prediction: using 1 HMCT for prediction is as fast as using 1 SPT, and hence 250 times faster than using 250 SPTs.
  - Larger trees: an HMCT is larger than the average tree for 1 class, but smaller than the set of 250 trees.
  - Less accurate: an HMCT is less accurate than a set of 250 SPTs (but hopefully not much less accurate).
- So how much faster / simpler / less accurate are our HMC trees?
The results
- The HMCT is on average less complex than one single SPT: the HMCT has 24 nodes, the SPTs on average 33 nodes... but you would need 250 of the latter to do the same job.
- The HMCT is on average slightly more accurate than a single SPT, measured using "average precision-recall curves" (see graphs). This is surprising, as each SPT is tuned for one specific prediction task.
- The expectations w.r.t. efficiency are confirmed:
  - Learning: minimum speedup factor 4.5x, maximum 65x, average 37x.
  - Prediction: more than 250 times faster (since the tree is not larger).
- Faster to learn, much faster to apply.
Precision-recall curves
- Precision: the proportion of predictions that is correct, P(X | predicted X).
- Recall: the proportion of class memberships that is correctly identified, P(predicted X | X).
An example rule

    IF   Nitrogen_Depletion_8_h <= -2.74
    AND  Nitrogen_Depletion_2_h > -1.94
    AND  1point5_mM_diamide_5_min > -0.03
    AND  1M_sorbitol___45_min_ > -0.36
    AND  37C_to_25C_shock___60_min > 1.28
    THEN 40, 40/3, 5, 5/1

- High interpretability: IF-THEN rules extracted from the HMCT are quite simple.
- For class 40/3: recall = 0.15, precision = 0.97 (the rule covers 15% of all class 40/3 cases, and 97% of the cases fulfilling these conditions are indeed 40/3).
The effect of merging...
- [Slide figure: 250 separate trees, each optimized for a single class c1, c2, ..., c250, merged into one tree optimized for c1, c2, ..., c250 jointly.]
- The merged tree is smaller than the average individual tree.
- The merged tree is more accurate than the average individual tree.
Any explanation for these results?
- It seems too good to be true... how is it possible?
- Answer: the classes are not independent.
  - Different trees for different classes actually share structure. This explains some of the complexity reduction achieved by the HMC tree, but not all!
  - One class carries information on other classes. This increases the signal-to-noise ratio, provides better guidance when learning the tree (explaining the good accuracy), and avoids overfitting (explaining the further reduction in tree size).
- This was confirmed empirically.
Overfitting
- To check our "overfitting" hypothesis, we compared the area under the PR curve on the training set (A_tr) with that on the test set (A_te):
  - For the single-prediction trees: A_tr - A_te = 0.219
  - For the HMCT: A_tr - A_te = 0.024
  - (To verify, we also tried Weka's M5': 0.387)
- So the HMCT clearly overfits much less (a sketch of this check is given below).
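A sketch of this check, using scikit-learn's average_precision_score as a stand-in for the area under the pooled PR curve (an assumption; the exact area computation used in the paper may differ):

    import numpy as np
    from sklearn.metrics import average_precision_score

    def pooled_auprc(labels, proportions):
        """Flatten all (example, class) pairs into one binary problem and score it."""
        return average_precision_score(np.ravel(labels), np.ravel(proportions))

    def overfitting_gap(y_train, p_train, y_test, p_test):
        """A_tr - A_te: a large gap suggests overfitting."""
        return pooled_auprc(y_train, p_train) - pooled_auprc(y_test, p_test)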
Conclusions
- Surprising discovery: a single tree can be found that predicts 250 different functions and that
  - has, on average, equal or better accuracy than special-purpose trees for each function,
  - is not more complex than a single special-purpose tree (hence, 250 times simpler than the whole set),
  - is (much) more efficient to learn and to apply.
- The reason for this is to be found in the dependencies between the gene functions: they provide better guidance when learning the tree and help to avoid overfitting.
- Multiple prediction / HMC trees have a lot of potential and should be used more often!
Ongoing work
- More extensive experimentation.
- Predicting classes in a lattice instead of a tree-shaped hierarchy.