Solving the Fragmentation Problem of Decision Trees by Discovering Boundary Emerging Patterns
Jinyan Li and Limsoon Wong
Speaker: Sarah Chan
CSIS DB Seminar, Nov 15, 2002
Presentation Outline
  - Introduction
  - Decision trees: single, bagged and boosted
  - Emerging patterns (EPs)
  - EP-based classifier: PCL
  - Comparisons using gene expression data
  - Conclusions
Introduction: Decision tree
  - Introduced by Hunt et al. in 1966
  - A compact tree structure consisting of classification rules learnt from training data
  - Example rule: “if outlook = overcast then Play”
  - Sharp discrimination power
  - Advantage over black-box learning models: rules are easily comprehensible
Introduction: Decision tree
  - Problems (explained later)
      1. Fragmentation
      2. Single coverage constraint
  - Consequence: loss of accuracy
  - Inferior to Neural Networks and Support Vector Machines (SVMs)
Introduction: Emerging patterns (EPs)
  - Introduced by Dong and Li in 1999 (SIGKDD)
  - EP: condition(s) that occur more frequently in one class than in others
      - e.g. “outlook = overcast” is an EP in the “Play” class
  - Sharp discrimination power
  - EP-based classifiers (e.g. CAEP, JEP-C, PCL)
      - Overcome the fragmentation and single coverage problems
      - Accuracy is competitive with the best classifiers
Decision Trees – Example
Play tennis? 9 “Play”s and 5 “Don’t Play”s

Record ID  Outlook   Temp (°F)  Humidity (%)  Windy  Class
1          sunny     75         70            true   Play
2          sunny     80         90            true   Don’t Play
3          sunny     85                       false  Don’t Play
4          sunny     72         95            true   Don’t Play
5          sunny     69         70            false  Play
6          overcast  72         90            true   Play
7          overcast  83         78            false  Play
8          overcast  64         65            true   Play
9          overcast  81         75            false  Play
10         rain      71         80            true   Don’t Play
11         rain      65         70            true   Don’t Play
12         rain      75         80            false  Play
13         rain      68         80            false  Play
14         rain      70         96            false  Play
Decision Trees – Example
Play tennis? Decision tree induced by C4.5 from the 14 records (9 “Play”s and 5 “Don’t Play”s)

outlook?
  sunny: Humidity?
    ≤ 75: Play (records 1, 5)
    > 75: Don’t Play (records 2, 3, 4)
  overcast: Play (records 6, 7, 8, 9)
  rain: Windy?
    false: Play (records 12, 13, 14)
    true: Don’t Play (records 10, 11)

  - Internal node: a test on an attribute
  - Branch: an outcome of the test
  - Sub-tree: a subset of training data
  - Leaf: a class label
Decision Trees – Example
  - Classification rule: path from root to leaf
  - Five C4.5 rules found

Rule No.  C4.5 rule                        Coverage in Play class  Coverage in Don’t Play class
1         {outlook = overcast}             4 (44%)                 0
2         {outlook = rain, windy = false}  3 (33%)                 0
3         {outlook = sunny, Humi ≤ 75}     2 (22%)                 0
4         {outlook = sunny, Humi > 75}     0                       3 (60%)
5         {outlook = rain, windy = true}   0                       2 (40%)
Total     5 rules                          9                       5

  - Single coverage constraint: the total coverage cannot exceed the number of records in the class.
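The single coverage constraint can be checked directly on the example. The following is a minimal Python sketch, assuming the 14 records of the weather table; the humidity of record 3 is missing on the slide, so 85 is assumed here (the rules only require it to be above 75). The names RECORDS, RULES, matching_rules and coverage are illustrative.

```python
# Weather records: (id, outlook, temp, humidity, windy, label).
# ASSUMPTION: record 3's humidity is missing on the slide; 85 is used here.
RECORDS = [
    (1, "sunny", 75, 70, True, "Play"),
    (2, "sunny", 80, 90, True, "Don't Play"),
    (3, "sunny", 85, 85, False, "Don't Play"),
    (4, "sunny", 72, 95, True, "Don't Play"),
    (5, "sunny", 69, 70, False, "Play"),
    (6, "overcast", 72, 90, True, "Play"),
    (7, "overcast", 83, 78, False, "Play"),
    (8, "overcast", 64, 65, True, "Play"),
    (9, "overcast", 81, 75, False, "Play"),
    (10, "rain", 71, 80, True, "Don't Play"),
    (11, "rain", 65, 70, True, "Don't Play"),
    (12, "rain", 75, 80, False, "Play"),
    (13, "rain", 68, 80, False, "Play"),
    (14, "rain", 70, 96, False, "Play"),
]

# The five C4.5 rules as predicates over (outlook, temp, humidity, windy).
RULES = {
    1: lambda o, t, h, w: o == "overcast",
    2: lambda o, t, h, w: o == "rain" and not w,
    3: lambda o, t, h, w: o == "sunny" and h <= 75,
    4: lambda o, t, h, w: o == "sunny" and h > 75,
    5: lambda o, t, h, w: o == "rain" and w,
}


def matching_rules(record):
    """Return the numbers of all rules whose conditions the record satisfies."""
    _, o, t, h, w, _ = record
    return [n for n, rule in RULES.items() if rule(o, t, h, w)]


def coverage(rule_no):
    """Count the records of each class covered by one rule."""
    counts = {"Play": 0, "Don't Play": 0}
    for record in RECORDS:
        if rule_no in matching_rules(record):
            counts[record[5]] += 1
    return counts


# Every record falls into exactly one leaf, so it matches exactly one rule.
for record in RECORDS:
    print(record[0], matching_rules(record))
```

Because each record matches exactly one rule, the per-class coverages necessarily sum to the class sizes (9 and 5).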
Decision Trees – Induction
  - Divide-and-conquer splitting strategy
  - Key question: how to choose the most discriminatory attribute to split the training instances?
  - Splitting criteria
      - Machine learning: ID3 [Quinlan, 1986], C4.5 [Quinlan, 1993]: choose the attribute that maximizes entropy reduction
      - Statistics: CART [Breiman et al., 1984]: choose the attribute that minimizes the Gini index
      - Pattern recognition: CHAID [Magidson, 1994]: choose the attribute whose values correlate most strongly with the class labels
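To make the first two criteria concrete, here is a small Python sketch that scores a split on outlook against a split on windy for the weather example. The class counts are read off the example table; the function names are illustrative, and C4.5's gain-ratio refinement is deliberately left out.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (bits) of a class-count tuple such as (9, 5)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini index of a class-count tuple."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted(impurity, partitions):
    """Size-weighted average impurity of the partitions produced by a split."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * impurity(p) for p in partitions)


# (Play, Don't Play) counts taken from the 14-record weather table.
ROOT = (9, 5)
BY_OUTLOOK = [(2, 3), (4, 0), (3, 2)]  # sunny, overcast, rain
BY_WINDY = [(6, 1), (3, 4)]            # windy = false, windy = true

gain_outlook = entropy(ROOT) - weighted(entropy, BY_OUTLOOK)
gain_windy = entropy(ROOT) - weighted(entropy, BY_WINDY)
gini_outlook = weighted(gini, BY_OUTLOOK)
gini_windy = weighted(gini, BY_WINDY)

print(f"entropy reduction: outlook={gain_outlook:.3f}, windy={gain_windy:.3f}")
print(f"weighted Gini:     outlook={gini_outlook:.3f}, windy={gini_windy:.3f}")
```

On these counts both criteria prefer outlook, which is the attribute at the root of the example tree.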
Decision Trees – Problems
  - Fragmentation: at the lower levels of the tree, feature selection makes use of fewer and fewer training data
      - Result: locally important but globally insignificant rules
  - Single coverage constraint: each training instance is covered by one and only one rule
      - Result: loss of significant rules
Decision Trees
  - Single
  - Bagged [Breiman, 1994] and Boosted [Schapire, 1989]: multiple decision trees (sub-classifiers)
  - Assumptions
      - B_0 = training dataset
      - R = number of trials in which the base classifier is applied
Decision Trees – Bagging
  - Bagging = bootstrap aggregating
  - For each trial i (1 to R)
      - Generate a bootstrapped training set B_i from B_0, where |B_i| = |B_0| and instances in B_0 may appear repeatedly or not at all in B_i
      - Build a classifier C_i using B_i
  - Build a bagged classifier C* by aggregating C_1, C_2, …, C_R
  - Class predicted by C* = class predicted most often by its sub-classifiers (ties broken arbitrarily)
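The bagging procedure can be sketched in a few lines of Python. This is an illustrative sketch only: the base learner below is a toy one-attribute stump standing in for a decision tree, and bootstrap, bagged_classifier and train_outlook_stump are hypothetical names.

```python
import random
from collections import Counter


def bootstrap(data, rng):
    """Draw |data| instances from data with replacement (B_i from B_0)."""
    return [rng.choice(data) for _ in data]


def bagged_classifier(train, data, trials, seed=0):
    """Train `trials` sub-classifiers on bootstrap samples and vote.

    `train(sample)` must return a function mapping an instance to a label.
    """
    rng = random.Random(seed)
    members = [train(bootstrap(data, rng)) for _ in range(trials)]

    def predict(instance):
        votes = Counter(member(instance) for member in members)
        return votes.most_common(1)[0][0]  # ties are broken arbitrarily

    return predict


def train_outlook_stump(sample):
    """Toy base learner (an assumption, not a real decision tree).

    Predicts the majority label seen for an outlook value, falling back to
    the overall majority label for values absent from the sample.
    """
    overall = Counter(label for _, label in sample).most_common(1)[0][0]
    per_value = {}
    for outlook, label in sample:
        per_value.setdefault(outlook, Counter())[label] += 1

    def classify(outlook):
        if outlook in per_value:
            return per_value[outlook].most_common(1)[0][0]
        return overall

    return classify


# B_0: (outlook, label) pairs from the weather example.
B0 = (
    [("sunny", "Play")] * 2 + [("sunny", "Don't Play")] * 3
    + [("overcast", "Play")] * 4
    + [("rain", "Play")] * 3 + [("rain", "Don't Play")] * 2
)

predict = bagged_classifier(train_outlook_stump, B0, trials=25)
print(predict("overcast"))
```

Any function with the same `train(sample)` contract, such as a real tree inducer, could be passed in place of the stump.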
Decision Trees – Boosting
  - Builds a committee of classifiers sequentially
  - Each new classifier C_i is influenced by the performance of those built previously (C_1, …, C_{i-1})
      - Data misclassified by previous models are emphasized in the new model
  - Weighs individual classifiers’ outputs differently, depending on their performance
  - AdaBoost [Freund & Schapire, 1995] = Adaptive Boosting
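The slide does not give the reweighting rule, so the sketch below assumes the usual binary AdaBoost formulation: the classifier's vote weight is alpha = 0.5 ln((1 - e) / e) for weighted error e, and instance weights are scaled by exp(±alpha) and renormalized. The function name adaboost_round is illustrative.

```python
from math import exp, log


def adaboost_round(weights, correct):
    """One AdaBoost reweighting step (standard binary formulation).

    weights: current instance weights, summing to 1
    correct: booleans, True where the newly built classifier was right
    Returns (alpha, new_weights), where alpha is the classifier's vote weight.
    Assumes the weighted error lies strictly between 0 and 0.5.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * log((1.0 - error) / error)
    # Shrink the weights of correctly classified instances, grow the others.
    raw = [w * exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


# Hypothetical round: 14 equally weighted instances, the last 3 misclassified.
weights = [1 / 14] * 14
correct = [True] * 11 + [False] * 3
alpha, new_weights = adaboost_round(weights, correct)
print(alpha, new_weights[0], new_weights[-1])
```

After the update the misclassified instances carry half of the total weight, which is how they come to be emphasized when the next classifier is built.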
Emerging Patterns (EPs)
  - Itemset: a set of attribute-value pairs (discretized)
  - EP: an itemset that occurs more frequently in one class than in any other class
  - JEP (Jumping EP): an itemset that is found in only one class
Emerging Patterns (EPs)
  - Boundary EPs: JEPs whose proper subsets are not JEPs
      - JEPs that are maximally frequent
      - They separate EPs with finite growth rate from JEPs with lower frequencies
      - Can be mined efficiently by border-based algorithms (MBD-LLBORDER)
  - JEPs can be regarded as classification rules
      - e.g. “outlook = overcast” is a JEP in the “Play” class
      - Equivalent to the rule “if outlook = overcast then Play”
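These definitions can be tried out on the two categorical attributes of the weather example. The Python sketch below is a brute-force illustration, not the border-based mining algorithm; it assumes the usual definition of growth rate as the ratio of an itemset's supports in the two classes, and all function names are illustrative.

```python
from itertools import combinations

# (outlook, windy, label) for records 1 to 14 of the weather example.
DATA = [
    ("sunny", True, "Play"), ("sunny", True, "Don't Play"),
    ("sunny", False, "Don't Play"), ("sunny", True, "Don't Play"),
    ("sunny", False, "Play"), ("overcast", True, "Play"),
    ("overcast", False, "Play"), ("overcast", True, "Play"),
    ("overcast", False, "Play"), ("rain", True, "Don't Play"),
    ("rain", True, "Don't Play"), ("rain", False, "Play"),
    ("rain", False, "Play"), ("rain", False, "Play"),
]


def support(itemset, label):
    """Fraction of the class's records that contain every item of itemset."""
    rows = [{"outlook": o, "windy": w} for o, w, l in DATA if l == label]
    hits = sum(all(r[a] == v for a, v in itemset.items()) for r in rows)
    return hits / len(rows)


def growth_rate(itemset, home, other):
    """Support in the home class divided by support in the other class."""
    s_home, s_other = support(itemset, home), support(itemset, other)
    if s_other == 0:
        return float("inf") if s_home > 0 else 0.0
    return s_home / s_other


def is_jep(itemset, home, other):
    """A JEP occurs in the home class and never in the other class."""
    return growth_rate(itemset, home, other) == float("inf")


def is_boundary(itemset, home, other):
    """A boundary EP is a JEP none of whose proper subsets is a JEP."""
    if not is_jep(itemset, home, other):
        return False
    items = list(itemset.items())
    proper_subsets = (dict(c) for r in range(1, len(items))
                      for c in combinations(items, r))
    return not any(is_jep(s, home, other) for s in proper_subsets)


print(growth_rate({"outlook": "sunny"}, "Don't Play", "Play"))
print(is_boundary({"outlook": "rain", "windy": False}, "Play", "Don't Play"))
```

Here {outlook = sunny} is an ordinary EP of the “Don’t Play” class with finite growth rate, while {outlook = rain, windy = false} is a boundary EP of the “Play” class because neither of its single items is a JEP on its own.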
Emerging Patterns (EPs) – Example
  - Twelve EPs found

EP No.  EP                                  Coverage in Play class  Coverage in Don’t Play class
1       { Humi ≤ 80, windy = false }        5 (56%)                 0
2       { Temp ≤ 75, windy = false }        4 (44%)                 0
3 *     { outlook = overcast }              4 (44%)                 0
4 *     { outlook = rain, windy = false }   3 (33%)                 0
5 *     { outlook = sunny, Humi ≤ 80 }      2 (22%)                 0
6       { Temp > 75, Humi ≤ 80 }            2 (22%)                 0
7       { outlook = rain, Humi > 80 }       1 (11%)                 0
8 *     { outlook = sunny, Humi > 80 }      0                       3 (60%)
9       { Temp > 75, Humi > 80 }            0                       2 (40%)
10      { outlook = sunny, Temp > 75 }      0                       2 (40%)
11 *    { outlook = rain, windy = true }    0                       2 (40%)
12      { Temp > 75, windy = true }         0                       1 (20%)
Total                                       21                      10

  * C4.5 rules
  - No single coverage constraint!
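The coverage figures in the EP table can be recomputed from the 14 weather records. In this Python sketch the humidity of record 3, which is missing on the slide, is assumed to be 85 (the table only requires a value above 80), and the thresholds are read as Humi ≤ 80 and Temp ≤ 75, the complements of the > conditions. RECORDS, EPS and coverage are illustrative names.

```python
# Weather records: (id, outlook, temp, humidity, windy, label).
# ASSUMPTION: record 3's humidity is missing on the slide; 85 is used here.
RECORDS = [
    (1, "sunny", 75, 70, True, "Play"),
    (2, "sunny", 80, 90, True, "Don't Play"),
    (3, "sunny", 85, 85, False, "Don't Play"),
    (4, "sunny", 72, 95, True, "Don't Play"),
    (5, "sunny", 69, 70, False, "Play"),
    (6, "overcast", 72, 90, True, "Play"),
    (7, "overcast", 83, 78, False, "Play"),
    (8, "overcast", 64, 65, True, "Play"),
    (9, "overcast", 81, 75, False, "Play"),
    (10, "rain", 71, 80, True, "Don't Play"),
    (11, "rain", 65, 70, True, "Don't Play"),
    (12, "rain", 75, 80, False, "Play"),
    (13, "rain", 68, 80, False, "Play"),
    (14, "rain", 70, 96, False, "Play"),
]

# The twelve EPs as predicates over (outlook, temp, humidity, windy).
EPS = {
    1: lambda o, t, h, w: h <= 80 and not w,
    2: lambda o, t, h, w: t <= 75 and not w,
    3: lambda o, t, h, w: o == "overcast",
    4: lambda o, t, h, w: o == "rain" and not w,
    5: lambda o, t, h, w: o == "sunny" and h <= 80,
    6: lambda o, t, h, w: t > 75 and h <= 80,
    7: lambda o, t, h, w: o == "rain" and h > 80,
    8: lambda o, t, h, w: o == "sunny" and h > 80,
    9: lambda o, t, h, w: t > 75 and h > 80,
    10: lambda o, t, h, w: o == "sunny" and t > 75,
    11: lambda o, t, h, w: o == "rain" and w,
    12: lambda o, t, h, w: t > 75 and w,
}


def coverage(ep_no):
    """Return (Play count, Don't Play count) of records containing the EP."""
    play = dont = 0
    for _, o, t, h, w, label in RECORDS:
        if EPS[ep_no](o, t, h, w):
            if label == "Play":
                play += 1
            else:
                dont += 1
    return play, dont


total_play = sum(coverage(n)[0] for n in EPS)
total_dont = sum(coverage(n)[1] for n in EPS)
print(total_play, total_dont)
```

The totals exceed the class sizes of 9 and 5 because a record may contain several EPs, which is exactly what the absence of a single coverage constraint means.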
EP Approach – Characteristics
  - Aggregate the discriminating power of EPs
  - A cluster of trees
      - Each EP is a tree with only one branch (labeled ‘true’)
      - Some EPs can be integrated into a bigger tree
  - No single coverage constraint
      - e.g. Record 5 satisfies EPs 1, 2 and 5
      - Record 5: { outlook = sunny, Temp = 69, Humi = 70, windy = false }
      - EP 1: { Humi ≤ 80, windy = false }
      - EP 2: { Temp ≤ 75, windy = false }
      - EP 5: { outlook = sunny, Humi ≤ 80 }
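Viewed as itemsets, an EP is satisfied by a record when it is a subset of the record's discretized items. The Python sketch below illustrates this for record 5, assuming the cut points 75 (Temp) and 80 (Humi) that appear in the twelve-EP example; discretize and contained_eps are illustrative names.

```python
def discretize(outlook, temp, humidity, windy):
    """Map a raw record to a set of items.

    ASSUMPTION: the cut points 75 (Temp) and 80 (Humi) are those used by the
    twelve example EPs.
    """
    return {
        f"outlook={outlook}",
        "Temp<=75" if temp <= 75 else "Temp>75",
        "Humi<=80" if humidity <= 80 else "Humi>80",
        "windy=true" if windy else "windy=false",
    }


# The twelve example EPs as itemsets.
EPS = {
    1: {"Humi<=80", "windy=false"},
    2: {"Temp<=75", "windy=false"},
    3: {"outlook=overcast"},
    4: {"outlook=rain", "windy=false"},
    5: {"outlook=sunny", "Humi<=80"},
    6: {"Temp>75", "Humi<=80"},
    7: {"outlook=rain", "Humi>80"},
    8: {"outlook=sunny", "Humi>80"},
    9: {"Temp>75", "Humi>80"},
    10: {"outlook=sunny", "Temp>75"},
    11: {"outlook=rain", "windy=true"},
    12: {"Temp>75", "windy=true"},
}


def contained_eps(items):
    """Numbers of all EPs that are subsets of the record's items."""
    return sorted(n for n, ep in EPS.items() if ep <= items)


record5 = discretize("sunny", 69, 70, False)
print(contained_eps(record5))  # [1, 2, 5]
```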
EP Approach – Characteristics
  - Globally significant rules
      - Some C4.5 rules are discovered based only on a fraction of one class and a fraction of another class
      - Boundary EPs differentiate one class from another using many instances; the distinction is in terms of a whole class
  - Exponential number of EPs
      - Through its greedy heuristic and the single coverage constraint, C4.5 produces trees with small numbers of rules
      - The number of boundary EPs may increase exponentially with the number of attributes and the number of discretized values
      - Solution: select the most important attributes for EP discovery
PCL: Prediction by Collective Likelihood of EPs
  - Motivation
      - A test instance should contain many top-ranked EPs from its own (home) class, and few (or no) low-ranked EPs from the opposite class
      - It may nevertheless contain some top-ranked EPs from the opposite class
      - PCL uses multiple highly frequent EPs of the home class to avoid confusing signals from counterpart EPs
PCL: Training phase
  - Given two training datasets D_P and D_N, discover the boundary EPs of each of them
  - Rank the EPs of each dataset in descending order of their frequency:
      - EPs of the +ve class: EP_P_1, EP_P_2, …, EP_P_i
      - EPs of the -ve class: EP_N_1, EP_N_2, …, EP_N_j
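PCL: Testing phase
  - Suppose a test instance T contains the following EPs:
      - EP_P_{i_1}, EP_P_{i_2}, …, EP_P_{i_x} (where $1 \le i_1 < i_2 < \dots < i_x \le i$)
      - EP_N_{j_1}, EP_N_{j_2}, …, EP_N_{j_y} (where $1 \le j_1 < j_2 < \dots < j_y \le j$)
  - The score measures how far the top k EPs contained in T are from the top k EPs of a class:

    $$\mathrm{score}(T)_{D_P} = \frac{\mathrm{freq}(\mathrm{EP\_P}_{i_1})}{\mathrm{freq}(\mathrm{EP\_P}_{1})} + \frac{\mathrm{freq}(\mathrm{EP\_P}_{i_2})}{\mathrm{freq}(\mathrm{EP\_P}_{2})} + \dots + \frac{\mathrm{freq}(\mathrm{EP\_P}_{i_k})}{\mathrm{freq}(\mathrm{EP\_P}_{k})}$$

  - score(T)_{D_N} is computed in the same way from the EP_N lists
  - T is assigned to the class with the highest score
  - Ideal case: one score is k, the rest are 0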
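The scoring rule can be sketched in Python as follows. The input frequencies are those of the twelve weather-example EPs and are used purely for illustration; the handling of an instance that contains fewer than k EPs of a class (summing only over the EPs present) is an assumption, since the slide does not cover that case. pcl_score and pcl_classify are illustrative names.

```python
def pcl_score(contained_ranks, ranked_freqs, k):
    """Collective-likelihood score of a test instance for one class.

    contained_ranks: 1-based ranks of the class's EPs contained in T
    ranked_freqs:    frequencies of the class's EPs, in descending order
    ASSUMPTION: if T contains fewer than k EPs, only those present are summed.
    """
    top = sorted(contained_ranks)[:k]
    # The m-th best EP found in T is compared with the m-th best EP overall.
    return sum(ranked_freqs[rank - 1] / ranked_freqs[m]
               for m, rank in enumerate(top))


def pcl_classify(contained, ranked, k):
    """Return the class whose score is highest, together with all scores."""
    scores = {c: pcl_score(contained[c], ranked[c], k) for c in ranked}
    return max(scores, key=scores.get), scores


# Illustrative input: EP frequencies per class from the weather example.
ranked = {
    "Play": [5 / 9, 4 / 9, 4 / 9, 3 / 9, 2 / 9, 2 / 9, 1 / 9],
    "Don't Play": [3 / 5, 2 / 5, 2 / 5, 2 / 5, 1 / 5],
}
# Record 5 contains the 1st, 2nd and 5th ranked "Play" EPs and no others.
contained = {"Play": [1, 2, 5], "Don't Play": []}

label, scores = pcl_classify(contained, ranked, k=3)
print(label, scores)
```

With k = 3 the “Play” score is 1 + 1 + 0.5 = 2.5 and the “Don’t Play” score is 0, so the instance is assigned to “Play”.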
Experiment 1 – Dataset
  - Prostate disease
  - 102 instances (52 Tumor and 50 Normal)
  - 12,600 attributes (each the expression level of a gene)
Experiment 1 – EPs
  - EP approach
      - Many globally significant rules
      - The top 20 rules have ~70% coverage in their home class
  - Attributes were discretized by an entropy-based discretization method, and 20 genes were chosen
  - EPs were mined using border-based algorithms
Experiment 1 – C4.5 Rules
  - Decision tree approach
      - 5 leaves, so 5 rules found, 3 of them globally insignificant (coverage ≤ 6)
      - Among the 4 features (genes) selected by C4.5, only 1 is in common with the 20 most important genes (minor rules resulted, important rules missed)
      - The next two highest coverage genes selected by C4.5 are in the 47th and 869th positions among EPs
Experiment 1 – Accuracy and Error Rates
  - All experiments were conducted on the discretized data of the 20 most important genes.
  - [Table: LOOCV and fold cross-validation results (accuracy in %, #misclassified) for PCL (k = 5, 10, 15), C4.5 (single, bagged, boosted), SVM and 3-NN]
  - Without feature selection, the LOOCV accuracies of the 3 forms of trees are 87%, 92% and 92% (13, 8 and 8 misclassified instances) respectively.
  - 3-NN has the highest accuracy
  - PCL’s performance is comparable to 3-NN
  - Both are better than C4.5 and SVM
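The relation between the quoted accuracies and misclassification counts is simple arithmetic over the 102 instances of the prostate dataset. A short Python check, with loocv_accuracy as an illustrative helper name:

```python
def loocv_accuracy(n_instances, n_misclassified):
    """Accuracy in percent when each instance is tested exactly once."""
    return 100.0 * (n_instances - n_misclassified) / n_instances


N_INSTANCES = 102  # 52 Tumor + 50 Normal

for name, errors in [("single", 13), ("bagged", 8), ("boosted", 8)]:
    print(f"{name}: {loocv_accuracy(N_INSTANCES, errors):.1f}%")
```

Rounded to whole percentages, these reproduce the 87%, 92% and 92% reported for the single, bagged and boosted trees without feature selection.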
Experiment 2 – Dataset
  - Acute Lymphoblastic Leukemia (ALL) disease
  - 327 instances with 12,558 attributes
  - 215 training and 112 testing instances
  - 6 sub-types and a ‘mini-type’ (the rest)
Experiment 2 – LOOCV Error Rates
  - All experiments were conducted on the discretized data of the 20 most important genes.
  - Columns: dataset (test data size in each class); PCL (k = 20, 25, 30); C4.5 (single, bagged, boosted); SVM; 3-NN

BCR-ABL vs others (6:106)     (5) 6 (2) 8 (5) 2 1
E2A-PBX1 vs others (9:103)    (0) 0 (0) 0 (0) 0 0
HyperL50 vs others (22:90)    (9) 6 (6) 11 (5) 3 5
MLL vs others (6:106)         (2) 1 (0) 4 (2) 0 0
T-ALL vs others (15:97)       (1) 1 (1) 1 (1) 0 0
TEL-AML1 vs others (27:85)    (4) 3 (1) 4 (1) 2 2
mini-type vs others (27:85)   (12) 18 (13) 11 (8)
Parallel (7 classes)

  - PCL’s performance is competitive with SVM and 3-NN, and sometimes better
  - The C4.5 single tree approach does not work well; it is improved by bagging and boosting
Conclusions
  - EPs preserve the comprehensibility of classification rules derived from decision trees.
  - EPs overcome the fragmentation problem and single coverage constraint of decision trees.
  - With gene expression datasets, PCL is better than C4.5 (single, bagged or boosted) on accuracy and rules, and is competitive with SVM and k-NN in terms of accuracy.
References
  - L. Breiman. Bagging Predictors. Machine Learning, Vol. 24, No. 2.
  - Jinyan Li and Limsoon Wong. Solving the Fragmentation Problem of Decision Trees by Discovering Boundary Emerging Patterns. ICDM-02.
  - J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.