Solving the Fragmentation Problem of Decision Trees by Discovering Boundary Emerging Patterns
Jinyan Li and Limsoon Wong
Speaker: Sarah Chan
CSIS DB Seminar, Nov 15, 2002

Presentation Outline
• Introduction
• Decision trees – single, bagged and boosted
• Emerging patterns (EPs)
• EP-based classifier – PCL
• Comparisons using gene expression data
• Conclusions

Introduction
• Decision tree
  - Introduced by Hunt et al. in 1966
  - A compact tree structure consisting of classification rules learnt from training data
  - Example rule: “if outlook = sunny then Play”
  - Sharp discrimination power
  - Advantage over black-box learning models: rules are easily comprehensible

Introduction
• Decision tree
  - Problems (explained later)
    1. Fragmentation
    2. Single coverage constraint
  - Loss of accuracy
    → Inferior to Neural Networks and Support Vector Machines (SVMs)

Introduction
• Emerging patterns (EPs)
  - Introduced by Dong and Li in 1999 (SIGKDD)
  - EP: condition(s) that occur more frequently in one class than in others
    e.g. “outlook = overcast” is an EP in the “Play” class
  - Sharp discrimination power
  - EP-based classifiers (e.g. CAEP, JEP-C, PCL)
    - Overcome the fragmentation and single coverage problems
    - Accuracy is competitive with the best

Decision Trees – Example
• Play tennis? 9 “Play”s and 5 “Don’t Play”s
Record ID  Outlook   Temp (°F)  Humidity (%)  Windy  Class
1          sunny     75         70            true   Play
2          sunny     80         90            true   Don’t Play
3          sunny     85         85            false  Don’t Play
4          sunny     72         95            true   Don’t Play
5          sunny     69         70            false  Play
6          overcast  72         90            true   Play
7          overcast  83         78            false  Play
8          overcast  64         65            true   Play
9          overcast  81         75            false  Play
10         rain      71         80            true   Don’t Play
11         rain      65         70            true   Don’t Play
12         rain      75         80            false  Play
13         rain      68         80            false  Play
14         rain      70         96            false  Play

Decision Trees – Example
• Play tennis? 9 “Play”s and 5 “Don’t Play”s
• Decision tree induced by C4.5 (each leaf lists the record IDs it covers)
  outlook?
    sunny    → Humidity?
                 ≤ 75 → Play (1, 5)
                 > 75 → Don’t Play (2, 3, 4)
    overcast → Play (6, 7, 8, 9)
    rain     → Windy?
                 true  → Don’t Play (10, 11)
                 false → Play (12, 13, 14)
• Internal node: a test on an attribute
• Branch: an outcome of the test
• Sub-tree: a subset of training data
• Leaf: a class label

Decision Trees – Example
• Classification rule: path from root to leaf
• Five C4.5 rules found
Rule No.  C4.5 rule                        Coverage in Play class  Coverage in Don’t Play class
1         {outlook = overcast}             4 (44%)                 0
2         {outlook = rain, windy = false}  3 (33%)                 0
3         {outlook = sunny, Humi ≤ 75}     2 (22%)                 0
4         {outlook = sunny, Humi > 75}     0                       3 (60%)
5         {outlook = rain, windy = true}   0                       2 (40%)
Total     5                                9                       5
• Single coverage constraint: the total coverage cannot exceed the no. of records in the class.
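
The coverage figures in the rule table can be checked with a few lines of Python. This is only a sketch: the records are transcribed from the example table, record 3 is read as Temp = 85 and Humidity = 85 (an assumption, matching Quinlan's golf data), and the names RECORDS, RULES and coverage are illustrative.

```python
# (id, outlook, temp_F, humidity_pct, windy, class)
# ASSUMPTION: record 3 has Temp = 85 and Humidity = 85, as in Quinlan's golf
# data; any humidity above 80 gives the same counts.
RECORDS = [
    (1, "sunny", 75, 70, True, "Play"),
    (2, "sunny", 80, 90, True, "Don't Play"),
    (3, "sunny", 85, 85, False, "Don't Play"),
    (4, "sunny", 72, 95, True, "Don't Play"),
    (5, "sunny", 69, 70, False, "Play"),
    (6, "overcast", 72, 90, True, "Play"),
    (7, "overcast", 83, 78, False, "Play"),
    (8, "overcast", 64, 65, True, "Play"),
    (9, "overcast", 81, 75, False, "Play"),
    (10, "rain", 71, 80, True, "Don't Play"),
    (11, "rain", 65, 70, True, "Don't Play"),
    (12, "rain", 75, 80, False, "Play"),
    (13, "rain", 68, 80, False, "Play"),
    (14, "rain", 70, 96, False, "Play"),
]

# The five C4.5 rules, each written as a predicate over one record.
RULES = [
    lambda r: r[1] == "overcast",
    lambda r: r[1] == "rain" and not r[4],
    lambda r: r[1] == "sunny" and r[3] <= 75,
    lambda r: r[1] == "sunny" and r[3] > 75,
    lambda r: r[1] == "rain" and r[4],
]

def coverage(rule):
    """Return (records covered in Play, records covered in Don't Play)."""
    play = sum(1 for r in RECORDS if rule(r) and r[5] == "Play")
    dont = sum(1 for r in RECORDS if rule(r) and r[5] == "Don't Play")
    return play, dont

rule_coverage = [coverage(rule) for rule in RULES]
# Single coverage constraint: count how many rules fire for each record.
rules_per_record = [sum(1 for rule in RULES if rule(r)) for r in RECORDS]
```

Every entry of rules_per_record is 1, which is the single coverage constraint in action: the per-class totals can never exceed 9 and 5.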

Decision Trees - Induction
• Divide-and-conquer splitting strategy
• Key: how to choose the most discriminatory attribute to split the training instances?
• Splitting criteria
  - ML: ID3 [Quinlan, 1986], C4.5 [Quinlan, 1993]
    → Choose the attribute that maximizes entropy reduction
  - Statistics: CART [Breiman et al., 1984]
    → Choose the attribute that minimizes the Gini index
  - Pattern recognition: CHAID [Magidson, 1994]
    → Choose the attribute with max. correlation between itself and the class labels
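
As a worked example of the first two criteria, the sketch below scores the split on outlook in the tennis data (9 Play, 5 Don't Play). The per-branch class counts are read off the example table; the function names are illustrative, and the slides do not show how C4.5 or CART implement these measures internally.

```python
from math import log2

# Class counts (Play, Don't Play) in each branch of the split on outlook.
SPLIT = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}

def entropy(counts):
    """Shannon entropy in bits of a tuple of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini index of a tuple of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted(impurity, split):
    """Impurity after the split: size-weighted average over the branches."""
    n = sum(sum(c) for c in split.values())
    return sum(sum(c) / n * impurity(c) for c in split.values())

parent = tuple(sum(col) for col in zip(*SPLIT.values()))  # (9, 5)
info_gain = entropy(parent) - weighted(entropy, SPLIT)    # entropy reduction
gini_after = weighted(gini, SPLIT)                        # Gini index of the split
```

Here info_gain is about 0.247 bits and gini_after about 0.343 (down from about 0.459 before the split); an inducer would compute the same quantities for every candidate attribute and keep the best one.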

Decision Trees – Problems
• Fragmentation: at the lower levels of the tree, feature selection makes use of fewer and fewer training data
  → Locally important but globally insignificant rules
• Single coverage constraint: each training instance is covered by one and only one rule
  → Loss of significant rules

Decision Trees
• Single
• Bagged [Breiman, 1994] and Boosted [Schapire, 1989]: multiple decision trees (sub-classifiers)
• Assumptions
  - B_0 = training dataset
  - R = no. of trials in which the base classifier is applied

Decision Trees – Bagging
• Bagging = bootstrap aggregating
• For each trial i (1 to R)
  - Generate a bootstrapped training set B_i from B_0, where |B_i| = |B_0| and an instance of B_0 may appear several times or not at all in B_i
  - Build a classifier C_i using B_i
• Build a bagged classifier C* by aggregating C_1, C_2, … C_R
• Class predicted by C* = class predicted most often by its sub-classifiers (break ties arbitrarily)
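
The bagging loop can be sketched as below. To keep the example short, the base learner is a hypothetical 1-nearest-neighbour classifier on a single number rather than a decision tree, and the toy dataset B0 is made up; only the bootstrap and voting steps follow the slide.

```python
import random
from collections import Counter

def bootstrap(b0, rng):
    """Draw |B_0| instances from B_0 uniformly, with replacement."""
    return [rng.choice(b0) for _ in b0]

def bagging(b0, learn, trials, rng):
    """Train one sub-classifier per trial and return the voting classifier C*."""
    classifiers = [learn(bootstrap(b0, rng)) for _ in range(trials)]

    def c_star(x):
        votes = Counter(c(x) for c in classifiers)
        return votes.most_common(1)[0][0]  # ties are broken arbitrarily

    return c_star

def learn_1nn(sample):
    """Hypothetical base learner (stand-in for a tree): 1-nearest-neighbour."""
    def classify(x):
        return min(sample, key=lambda inst: abs(inst[0] - x))[1]
    return classify

rng = random.Random(0)
# Made-up training set B_0 of (value, class) pairs.
B0 = [(0, "a"), (1, "a"), (2, "a"), (10, "b"), (11, "b"), (12, "b")]
c_star = bagging(B0, learn_1nn, trials=25, rng=rng)
```

Any function that maps a training set to a classifier can be passed as learn, so a tree inducer would slot in without changing the loop.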

Decision Trees – Boosting
• Builds a committee of classifiers sequentially
• Each new classifier C_i is influenced by the performance of those built previously (C_1, … C_i-1)
• Data misclassified by previous models are emphasized in the new model
• Weighs individual classifiers’ output differently, depending on their performance
• AdaBoost [Freund & Schapire, 1995] = Adaptive Boosting
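
The slide does not give the update formula, so the sketch below uses a common textbook form of two-class AdaBoost as an assumption: a classifier with weighted error e gets voting weight alpha = 0.5 * ln((1 - e) / e), and instance weights are multiplied by exp(alpha) when misclassified and exp(-alpha) otherwise, then renormalized. The function name adaboost_round is illustrative.

```python
from math import exp, log

def adaboost_round(weights, correct):
    """One reweighting step of two-class AdaBoost (textbook form).

    weights: current instance weights, summing to 1
    correct: booleans, True where the new classifier C_i was right
    Returns (alpha, new_weights); alpha is the classifier's voting weight.
    Assumes the weighted error lies strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * log((1 - error) / error)
    # Emphasize misclassified instances, de-emphasize the rest.
    raw = [w * exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

# Four equally weighted instances, the last one misclassified (error = 0.25).
alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
```

After this round the misclassified instance carries half of the total weight, so the next classifier is pushed to get it right.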

Emerging Patterns (EPs)
• Itemset: set of attribute-value pairs (discretized)
• EP: an itemset which occurs more frequently in one class than in the other classes
• JEP (Jumping EP): an itemset that is found in only one class

Emerging Patterns (EPs)
• Boundary EPs: JEPs whose proper subsets are not JEPs
  - JEPs that are maximally frequent
  - Separate EPs with finite growth rate from JEPs with lower frequencies
  - Can be mined efficiently by border-based algorithms (MBD-LLBORDER)
• JEPs can be regarded as classification rules
  - E.g. “outlook = overcast” is a JEP in the “Play” class
  - Equivalent to rule “if outlook = overcast then Play”
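
The definitions above can be tried out on the tennis data. The sketch below is a brute-force check, not a border-based miner such as MBD-LLBORDER; it uses only the outlook and windy attributes, takes the growth rate to be the ratio of an itemset's supports in the two classes (the usual definition, assumed here), and all function names are illustrative.

```python
from itertools import combinations

# (outlook, windy, class) for the 14 tennis records, in record-ID order.
ROWS = [
    ("sunny", True, "Play"), ("sunny", True, "Don't Play"),
    ("sunny", False, "Don't Play"), ("sunny", True, "Don't Play"),
    ("sunny", False, "Play"), ("overcast", True, "Play"),
    ("overcast", False, "Play"), ("overcast", True, "Play"),
    ("overcast", False, "Play"), ("rain", True, "Don't Play"),
    ("rain", True, "Don't Play"), ("rain", False, "Play"),
    ("rain", False, "Play"), ("rain", False, "Play"),
]
ATTRS = {"outlook": 0, "windy": 1}

def support(itemset, cls):
    """Fraction of the records of class `cls` that contain every item."""
    rows = [r for r in ROWS if r[2] == cls]
    hits = [r for r in rows if all(r[ATTRS[a]] == v for a, v in itemset)]
    return len(hits) / len(rows)

def growth_rate(itemset, home, other):
    """Support in `home` over support in `other`; inf if absent from `other`."""
    s_home, s_other = support(itemset, home), support(itemset, other)
    if s_other == 0:
        return float("inf") if s_home > 0 else 0.0
    return s_home / s_other

def is_jep(itemset, home, other):
    """A jumping EP occurs in `home` and never in `other`."""
    return growth_rate(itemset, home, other) == float("inf")

def is_boundary_ep(itemset, home, other):
    """A JEP none of whose non-empty proper subsets is a JEP."""
    items = list(itemset)
    subsets = (s for k in range(1, len(items)) for s in combinations(items, k))
    return is_jep(itemset, home, other) and not any(
        is_jep(s, home, other) for s in subsets)
```

On this data {outlook = rain, windy = false} is a boundary EP of the Play class, while {outlook = overcast, windy = false} is a JEP but not a boundary EP, because its subset {outlook = overcast} is already a JEP.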

Emerging Patterns (EPs) – Example
• Twelve EPs found (* = C4.5 rules)
EP No.  EP                                 Coverage in Play class  Coverage in Don’t Play class
1       { Humi ≤ 80, windy = false }       5 (56%)                 0
2       { Temp ≤ 75, windy = false }       4 (44%)                 0
3 *     { outlook = overcast }             4 (44%)                 0
4 *     { outlook = rain, windy = false }  3 (33%)                 0
5 *     { outlook = sunny, Humi ≤ 80 }     2 (22%)                 0
6       { Temp > 75, Humi ≤ 80 }           2 (22%)                 0
7       { outlook = rain, Humi > 80 }      1 (11%)                 0
8 *     { outlook = sunny, Humi > 80 }     0                       3 (60%)
9       { Temp > 75, Humi > 80 }           0                       2 (40%)
10      { outlook = sunny, Temp > 75 }     0                       2 (40%)
11 *    { outlook = rain, windy = true }   0                       2 (40%)
12      { Temp > 75, windy = true }        0                       1 (20%)
Total   12                                 21                      10
• No single coverage constraint!

EP Approach – Characteristics
• Aggregate the discriminating power of EPs
• A cluster of trees
  - Each EP is a tree with only one branch (labeled ‘true’)
  - Some EPs can be integrated into a bigger tree
• No single coverage constraint
  - e.g. Record 5 satisfies EPs 1, 2 and 5
    Record 5: { outlook = sunny, Temp = 69, Humi = 70, windy = false }
    EP 1: { Humi ≤ 80, windy = false }
    EP 2: { Temp ≤ 75, windy = false }
    EP 5: { outlook = sunny, Humi ≤ 80 }
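
The multiple-coverage claim can be verified directly against the twelve EPs. This is a sketch under the same assumption as the tennis table reading used throughout (record 3 taken as Temp = 85, Humidity = 85); EPS and eps_satisfied are illustrative names.

```python
# (id, outlook, temp_F, humidity_pct, windy, class)
# ASSUMPTION: record 3 has Temp = 85 and Humidity = 85, as in Quinlan's golf
# data; any humidity above 80 gives the same counts.
RECORDS = [
    (1, "sunny", 75, 70, True, "Play"),
    (2, "sunny", 80, 90, True, "Don't Play"),
    (3, "sunny", 85, 85, False, "Don't Play"),
    (4, "sunny", 72, 95, True, "Don't Play"),
    (5, "sunny", 69, 70, False, "Play"),
    (6, "overcast", 72, 90, True, "Play"),
    (7, "overcast", 83, 78, False, "Play"),
    (8, "overcast", 64, 65, True, "Play"),
    (9, "overcast", 81, 75, False, "Play"),
    (10, "rain", 71, 80, True, "Don't Play"),
    (11, "rain", 65, 70, True, "Don't Play"),
    (12, "rain", 75, 80, False, "Play"),
    (13, "rain", 68, 80, False, "Play"),
    (14, "rain", 70, 96, False, "Play"),
]

# The twelve EPs of the example, keyed by EP number.
EPS = {
    1: lambda r: r[3] <= 80 and not r[4],
    2: lambda r: r[2] <= 75 and not r[4],
    3: lambda r: r[1] == "overcast",
    4: lambda r: r[1] == "rain" and not r[4],
    5: lambda r: r[1] == "sunny" and r[3] <= 80,
    6: lambda r: r[2] > 75 and r[3] <= 80,
    7: lambda r: r[1] == "rain" and r[3] > 80,
    8: lambda r: r[1] == "sunny" and r[3] > 80,
    9: lambda r: r[2] > 75 and r[3] > 80,
    10: lambda r: r[1] == "sunny" and r[2] > 75,
    11: lambda r: r[1] == "rain" and r[4],
    12: lambda r: r[2] > 75 and r[4],
}

def eps_satisfied(record):
    """Numbers of all EPs whose conditions hold for the record."""
    return [n for n, ep in EPS.items() if ep(record)]

record5_eps = eps_satisfied(RECORDS[4])
total_play = sum(len(eps_satisfied(r)) for r in RECORDS if r[5] == "Play")
total_dont = sum(len(eps_satisfied(r)) for r in RECORDS if r[5] == "Don't Play")
```

The totals come out as 21 and 10, above the class sizes of 9 and 5, which is only possible because a record may be covered by several EPs.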

EP Approach – Characteristics
• Globally significant rules
  - Some C4.5 rules are discovered based only on a fraction of a class and a fraction of another class
  - Boundary EPs differentiate one class from another using many instances; the distinction is in terms of a whole class
• Exponential number of EPs
  - By greedy heuristic and single coverage constraint, C4.5 produces trees with small numbers of rules
  - No. of boundary EPs may increase exponentially with no. of attributes and no. of discretized values
    → Solution: select the most important attributes for EP discovery

PCL
• Prediction by Collective Likelihood of EPs
• Motivation
  - A test instance should contain
    - many top-ranked EPs from its own (home) class, and
    - a few (or no) low-ranked EPs from its opposite class
  - May contain top-ranked EPs from its opposite class
  - Uses multiple highly frequent EPs of the home class to avoid confusing signals from counterpart EPs

PCL
• Training phase
  - Given two training datasets D_P and D_N, discover boundary EPs from each of them
  - Rank the EPs of each dataset in descending order of their frequency:
    - EPs of +ve class: EP_P_1, EP_P_2, … EP_P_i
    - EPs of -ve class: EP_N_1, EP_N_2, … EP_N_j

PCL
• Testing phase
  - Suppose a test instance T contains the following EPs:
    - EP_P_i_1, EP_P_i_2, … EP_P_i_x (where 1 ≤ i_1 < i_2 < … < i_x ≤ i)
    - EP_N_j_1, EP_N_j_2, … EP_N_j_y (where 1 ≤ j_1 < j_2 < … < j_y ≤ j)
  - Measures how far the top k EPs contained in T are away from the top k EPs of a class
    score(T)_D_P = freq(EP_P_i_1) / freq(EP_P_1) + freq(EP_P_i_2) / freq(EP_P_2) + … + freq(EP_P_i_k) / freq(EP_P_k)
    (score(T)_D_N is defined in the same way over the EP_N list)
  - Assigns T to the class with the highest score
  - Ideal case: one score is k, the rest are 0
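
A minimal sketch of the scoring step, using the EP frequencies of the tennis example as stand-in data (the slides do not run PCL on that example). The function names are illustrative, and summing only the available terms when T contains fewer than k EPs of a class is an assumption, since the slide does not cover that case.

```python
def pcl_score(class_freqs, contained_ranks, k):
    """Collective-likelihood score of a test instance T for one class.

    class_freqs: frequencies of the class's EPs in descending order,
                 so class_freqs[0] is freq(EP_1)
    contained_ranks: 1-based ranks of the class's EPs contained in T, ascending
    Sums freq(EP_{i_m}) / freq(EP_m) for m = 1..k.
    ASSUMPTION: if T contains fewer than k EPs, only those terms are summed.
    """
    top = contained_ranks[:k]
    return sum(class_freqs[rank - 1] / class_freqs[m]
               for m, rank in enumerate(top))

def pcl_predict(freqs_by_class, ranks_by_class, k):
    """Return (class with the highest score, all scores)."""
    scores = {c: pcl_score(freqs_by_class[c], ranks_by_class[c], k)
              for c in freqs_by_class}
    return max(scores, key=scores.get), scores

# EP frequencies from the tennis example, sorted within each class.
FREQS = {"Play": [5, 4, 4, 3, 2, 2, 1], "Don't Play": [3, 2, 2, 2, 1]}
# Record 5 contains the Play EPs ranked 1, 2 and 5 and no Don't Play EP.
label, scores = pcl_predict(FREQS, {"Play": [1, 2, 5], "Don't Play": []}, k=2)
```

With k = 2 this instance hits the ideal case: the Play score equals k and the Don't Play score is 0. An instance containing only the Play EPs ranked 3 and 6 would score 4/5 + 2/4 = 1.3 instead.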

Experiment 1 – Dataset
• Prostate disease
• 102 instances (52 Tumor and 50 Normal)
• 12,600 attributes (each the expression level of a gene)

Experiment 1 – EPs
• EP approach
  - Many globally significant rules
  - The top 20 rules have ~70% coverage in home class
  - Attributes were discretized by an entropy-based discretization method, and 20 attributes (genes) were chosen
  - EPs were mined using border-based algorithms
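
The slides do not say which entropy-based discretization method was used. As a rough illustration of the idea only, the sketch below picks the single cut point on one attribute that minimizes the size-weighted class entropy of the two sides; the expression levels are made-up numbers and best_cut is a hypothetical helper, not the authors' procedure.

```python
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def best_cut(values, labels):
    """Return (cut, weighted_entropy) for the best binary split value <= cut.

    Candidate cuts are midpoints between consecutive distinct sorted values.
    Returns None when all values are identical.
    """
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v <= cut]
        right = [c for v, c in pairs if v > cut]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (cut, score)
    return best

# Hypothetical expression levels of one gene (made-up numbers).
levels = [1.0, 1.5, 2.0, 6.0, 7.0, 8.0]
classes = ["Normal", "Normal", "Normal", "Tumor", "Tumor", "Tumor"]
cut, score = best_cut(levels, classes)
```

A gene whose best cut leaves little entropy separates the classes well, which is one plausible way to rank genes before keeping the top 20.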

Experiment 1 – C4.5 Rules
• Decision tree approach
  - 5 leaves
  - 5 rules found, 3 globally insignificant (coverage ≤ 6)
  - Among the 4 features (genes) selected by C4.5, only 1 is in common with the 20 most important genes (minor rules resulted, important rules missed)
  - The next two highest coverage genes selected by C4.5 are in the 47th and 869th positions among EPs

Experiment 1 – Accuracy and Error Rates
• All experiments were conducted on the discretized data of the 20 most important genes.
Evaluation  Performance     PCL k=5  PCL k=10  PCL k=15  C4.5 single  C4.5 bagged  C4.5 boosted  SVM   3-NN
LOOCV       Accuracy (%)    95.1     95.1      95.1      91.2         94.1         93.1          90.2  96.1
LOOCV       #misclassified  5        5         5         9            6            7             10    4
10-fold     Accuracy (%)    97.1     97.1      95.1      92.2         92.2         93.1          90.2  96.1
10-fold     #misclassified  3        3         5         8            8            7             10    4
• Without feature selection, the LOOCV accuracies of the 3 forms of trees are 87%, 92% and 92% (13, 8 and 8 misclassified), respectively.
• 3-NN has the highest LOOCV accuracy (under 10-fold cross-validation, PCL with k = 5 or 10 is highest)
• PCL’s performance is comparable to 3-NN
• Both are better than C4.5 and SVM
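
The accuracy and #misclassified rows are two views of the same numbers, since the prostate dataset has 102 instances. A small sketch of the conversion (accuracy_pct is an illustrative helper; the counts are the LOOCV row of the table, with PCL identical for all three values of k):

```python
N_INSTANCES = 102  # prostate dataset: 52 Tumor + 50 Normal

def accuracy_pct(misclassified, n=N_INSTANCES):
    """Accuracy in percent, rounded to one decimal place as in the table."""
    return round(100 * (n - misclassified) / n, 1)

# LOOCV misclassification counts from the table, keyed by method.
loocv_errors = {"PCL": 5, "C4.5 single": 9, "C4.5 bagged": 6,
                "C4.5 boosted": 7, "SVM": 10, "3-NN": 4}
loocv_accuracy = {m: accuracy_pct(e) for m, e in loocv_errors.items()}
```

For example, 5 misclassified instances give (102 - 5) / 102 = 95.1%, and the gap between PCL and 3-NN under LOOCV is a single instance.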

Experiment 2 – Dataset
• Acute Lymphoblastic Leukemia (ALL) disease
• 327 instances with 12,558 attributes
• 215 training and 112 testing instances
• 6 sub-types and a ‘mini-type’ (the rest)

Experiment 2 – LOOCV Error Rates
• All experiments were conducted on the discretized data of the 20 most important genes.
Dataset (test data size in each class)  PCL k=20  PCL k=25  PCL k=30  C4.5 single  C4.5 bagged  C4.5 boosted  SVM  3-NN
BCR-ABL vs others (6:106)               1         1         1         8 (5)        6 (2)        8 (5)         2    1
E2A-PBX1 vs others (9:103)              0         0         0         0 (0)        0 (0)        0 (0)         0    0
HyperL50 vs others (22:90)              4         4         4         11 (9)       6 (6)        11 (5)        3    5
MLL vs others (6:106)                   0         0         0         4 (2)        1 (0)        4 (2)         0    0
T-ALL vs others (15:97)                 0         0         0         1 (1)        1 (1)        1 (1)         0    0
TEL-AML1 vs others (27:85)              2         2         2         4 (4)        3 (1)        4 (1)         2    2
mini-type vs others (27:85)             21        21        12        26 (12)      18 (13)      11 (8)        11   10
Parallel (7 classes)                    6         7         8         27           20           10            26   11
• PCL’s performance is competitive with SVM and 3-NN, sometimes better
• The C4.5 single-tree approach does not work well; it is improved by bagging and boosting

Conclusions
• EPs preserve the comprehensibility of classification rules derived from decision trees.
• EPs overcome the fragmentation problem and single coverage constraint of decision trees.
• With gene expression datasets, PCL is better than C4.5 (single, bagged or boosted) on accuracy and rules, and is competitive with SVM and k-NN in terms of accuracy.

References
• L. Breiman. Bagging Predictors. Machine Learning, Vol. 24, No. 2, pp. 123-140, 1996.
• Jinyan Li and Limsoon Wong. Solving the Fragmentation Problem of Decision Trees by Discovering Boundary Emerging Patterns. ICDM-02.
• J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
