Solving the Fragmentation Problem of Decision Trees by Discovering Boundary Emerging Patterns Jinyan Li and Limsoon Wong Speaker: Sarah Chan CSIS DB Seminar.

Solving the Fragmentation Problem of Decision Trees by Discovering Boundary Emerging Patterns Jinyan Li and Limsoon Wong Speaker: Sarah Chan CSIS DB Seminar Nov 15, 2002

Presentation Outline
• Introduction
• Decision trees – single, bagged and boosted
• Emerging patterns (EPs)
• EP-based classifier – PCL
• Comparisons using gene expression data
• Conclusions

Introduction
• Decision tree
  - Introduced by Hunt et al. in 1966
  - A compact tree structure consisting of classification rules learnt from training data
  - Example rule: “if outlook = sunny then Play”
  - Sharp discrimination power
  - Advantage over black-box learning models: rules are easily comprehensible

Introduction
• Decision tree
  - Problems (explained later)
    1. Fragmentation
    2. Single coverage constraint
  - Result: loss of accuracy; decision trees are inferior to neural networks and support vector machines (SVMs)

Introduction
• Emerging patterns (EPs)
  - Introduced by Dong and Li in 1999 (SIGKDD)
  - EP: condition(s) that occur more frequently in one class than in others, e.g. “outlook = sunny” is an EP in the “Play” class
  - Sharp discrimination power
  - EP-based classifiers (e.g. CAEP, JEP-C, PCL)
    - Overcome the fragmentation and single coverage problems
    - Accuracy is competitive with the best classifiers

Decision Trees – Example
• Play tennis? 9 “Play”s and 5 “Don’t Play”s

Record ID  Outlook   Temp (°F)  Humidity (%)  Windy  Class
1          sunny     75         70            true   Play
2          sunny     80         90            true   Don’t Play
3          sunny     85         85            false  Don’t Play
4          sunny     72         95            true   Don’t Play
5          sunny     69         70            false  Play
6          overcast  72         90            true   Play
7          overcast  83         78            false  Play
8          overcast  64         65            true   Play
9          overcast  81         75            false  Play
10         rain      71         80            true   Don’t Play
11         rain      65         70            true   Don’t Play
12         rain      75         80            false  Play
13         rain      68         80            false  Play
14         rain      70         96            false  Play

Decision Trees – Example
• Play tennis? Decision tree induced by C4.5 from the 14 training records (9 “Play”s and 5 “Don’t Play”s)

outlook?
  sunny: Humidity?
    ≤ 75: Play (records 1, 5)
    > 75: Don’t Play (records 2, 3, 4)
  overcast: Play (records 6, 7, 8, 9)
  rain: Windy?
    false: Play (records 12, 13, 14)
    true: Don’t Play (records 10, 11)

• Internal node: a test on an attribute
• Branch: an outcome of the test
• Sub-tree: a subset of training data
• Leaf: a class label

Decision Trees – Example
• Classification rule: path from root to leaf
• Five C4.5 rules found

Rule No.  C4.5 rule                        Coverage in Play class  Coverage in Don’t Play class
1         {outlook = overcast}             4 (44%)                 0
2         {outlook = rain, windy = false}  3 (33%)                 0
3         {outlook = sunny, Humi ≤ 75}     2 (22%)                 0
4         {outlook = sunny, Humi > 75}     0                       3 (60%)
5         {outlook = rain, windy = true}   0                       2 (40%)
Total     5                                9                       5

(Single coverage constraint) The total coverage cannot exceed the no. of records in the class.
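The single coverage constraint can be checked mechanically. The following is a minimal sketch, assuming the 14 Play-tennis records and the five C4.5 rules listed above. The humidity of record 3 is garbled in the slide and is assumed to be 85 here (any value above 75 gives the same result). The names ROWS, C45_RULES, matching_rules and coverage are illustrative and do not come from the talk.

```python
FIELDS = ("outlook", "temp", "humidity", "windy", "cls")
ROWS = [
    ("sunny", 75, 70, True, "Play"),
    ("sunny", 80, 90, True, "Don't Play"),
    ("sunny", 85, 85, False, "Don't Play"),   # humidity 85 is an assumption
    ("sunny", 72, 95, True, "Don't Play"),
    ("sunny", 69, 70, False, "Play"),
    ("overcast", 72, 90, True, "Play"),
    ("overcast", 83, 78, False, "Play"),
    ("overcast", 64, 65, True, "Play"),
    ("overcast", 81, 75, False, "Play"),
    ("rain", 71, 80, True, "Don't Play"),
    ("rain", 65, 70, True, "Don't Play"),
    ("rain", 75, 80, False, "Play"),
    ("rain", 68, 80, False, "Play"),
    ("rain", 70, 96, False, "Play"),
]
RECORDS = [dict(zip(FIELDS, row)) for row in ROWS]

# The five root-to-leaf paths of the tree, written as predicates.
C45_RULES = {
    1: lambda r: r["outlook"] == "overcast",
    2: lambda r: r["outlook"] == "rain" and not r["windy"],
    3: lambda r: r["outlook"] == "sunny" and r["humidity"] <= 75,
    4: lambda r: r["outlook"] == "sunny" and r["humidity"] > 75,
    5: lambda r: r["outlook"] == "rain" and r["windy"],
}


def matching_rules(record):
    """Return the numbers of all rules whose conditions the record satisfies."""
    return [no for no, rule in C45_RULES.items() if rule(record)]


def coverage(rule, cls):
    """Count the records of class `cls` covered by `rule`."""
    return sum(1 for r in RECORDS if r["cls"] == cls and rule(r))


if __name__ == "__main__":
    for idx, rec in enumerate(RECORDS, start=1):
        print(idx, matching_rules(rec))
    for cls in ("Play", "Don't Play"):
        print(cls, sum(coverage(rule, cls) for rule in C45_RULES.values()))
```

Because every record follows exactly one path of the tree, each list printed by the first loop has a single element, and the per-class totals equal the class sizes.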

Decision Trees - Induction
• Divide-and-conquer splitting strategy
• Key: how to choose the most discriminatory attribute to split the training instances?
• Splitting criteria
  - ML: ID3 [Quinlan, 1986], C4.5 [Quinlan, 1993]: choose the attribute that maximizes entropy reduction
  - Statistics: CART [Breiman et al., 1984]: choose the attribute that minimizes the Gini index
  - Pattern recognition: CHAID [Magidson, 1994]: choose the attribute with max. correlation between its corresponding labels and itself
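The two impurity-based criteria can be illustrated on the Play-tennis records. This is a minimal sketch that computes plain information gain (entropy reduction) and the weighted Gini index for nominal attributes only. It is not the full C4.5 or CART procedure (for example, it omits C4.5's gain ratio and the handling of numeric attributes), and the function names are illustrative.

```python
from collections import Counter
from math import log2

# (outlook, windy, class) for the 14 records of the Play-tennis table.
RECORDS = [
    ("sunny", True, "Play"), ("sunny", True, "Don't Play"),
    ("sunny", False, "Don't Play"), ("sunny", True, "Don't Play"),
    ("sunny", False, "Play"), ("overcast", True, "Play"),
    ("overcast", False, "Play"), ("overcast", True, "Play"),
    ("overcast", False, "Play"), ("rain", True, "Don't Play"),
    ("rain", True, "Don't Play"), ("rain", False, "Play"),
    ("rain", False, "Play"), ("rain", False, "Play"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(records, attr_index, impurity):
    """Weighted impurity of the partitions induced by one nominal attribute."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[attr_index], []).append(rec[-1])
    n = len(records)
    return sum(len(g) / n * impurity(g) for g in groups.values())


def information_gain(records, attr_index):
    """Entropy reduction obtained by splitting on one nominal attribute."""
    labels = [rec[-1] for rec in records]
    return entropy(labels) - split_impurity(records, attr_index, entropy)


if __name__ == "__main__":
    for name, idx in (("outlook", 0), ("windy", 1)):
        print(name,
              round(information_gain(RECORDS, idx), 3),
              round(split_impurity(RECORDS, idx, gini), 3))
```

On these records the gain for outlook (about 0.247 bits) is larger than the gain for windy (about 0.152 bits), which is consistent with outlook being the root test of the tree.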

Decision Trees – Problems
• Fragmentation: at the lower levels of the tree, feature selection makes use of fewer and fewer training data
  → Locally important but globally insignificant rules
• Single coverage constraint: each training instance is covered by one and only one rule
  → Loss of significant rules

Decision Trees
• Single
• Bagged [Breiman, 1994] and Boosted [Schapire, 1989]: multiple decision trees (sub-classifiers)
• Assumptions
  - B_0 = training dataset
  - R = no. of trials in which the base classifier is applied

Decision Trees – Bagging
• For each trial i (1 to R)
  - Generate a bootstrapped training set B_i from B_0, where |B_i| = |B_0| and instances in B_0 may appear repeatedly or not at all in B_i
  - Build a classifier C_i using B_i
• Build a bagged classifier C* by aggregating C_1, C_2, … C_R
• Class predicted by C* = class predicted most often by its sub-classifiers (break ties arbitrarily)
• Bagging = bootstrap aggregating
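The bagging loop above can be sketched in a few lines of Python. To keep the example self-contained, the base learner is a toy one-attribute rule (majority class per outlook value), which is an assumption standing in for a real decision tree learner such as C4.5. The names bootstrap, train_one_rule, bag and predict_bagged are illustrative.

```python
import random
from collections import Counter

# B_0 reduced to (outlook, class) pairs for the 14 Play-tennis records.
B0 = [
    ("sunny", "Play"), ("sunny", "Don't Play"), ("sunny", "Don't Play"),
    ("sunny", "Don't Play"), ("sunny", "Play"), ("overcast", "Play"),
    ("overcast", "Play"), ("overcast", "Play"), ("overcast", "Play"),
    ("rain", "Don't Play"), ("rain", "Don't Play"), ("rain", "Play"),
    ("rain", "Play"), ("rain", "Play"),
]


def bootstrap(data, rng):
    """Draw |data| instances with replacement, so |B_i| = |B_0|."""
    return [rng.choice(data) for _ in data]


def train_one_rule(data):
    """Toy base learner: predict the majority class seen for each value."""
    by_value = {}
    for value, label in data:
        by_value.setdefault(value, Counter())[label] += 1
    default = Counter(label for _, label in data).most_common(1)[0][0]
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return lambda value: table.get(value, default)


def bag(data, trials, rng):
    """Build R sub-classifiers, one per bootstrapped training set."""
    return [train_one_rule(bootstrap(data, rng)) for _ in range(trials)]


def predict_bagged(classifiers, value):
    """Return the class predicted most often; ties fall to the first seen."""
    votes = Counter(clf(value) for clf in classifiers)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    committee = bag(B0, trials=11, rng=random.Random(0))
    print(predict_bagged(committee, "overcast"))
```

Passing an explicit random.Random instance keeps the sampling reproducible. An odd number of trials avoids ties in a two-class problem.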

Decision Trees – Boosting
• Builds a committee of classifiers sequentially
• Each new classifier C_i is influenced by the performance of those built previously (C_1, … C_i-1)
• Data misclassified by previous models are emphasized in the new model
• Weighs individual classifiers’ output differently, depending on their performance
• AdaBoost [Freund & Schapire, 1995] = Adaptive Boosting
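The emphasis on misclassified data can be made concrete with one reweighting step. The sketch below follows the commonly taught two-class AdaBoost formulation with labels in {-1, +1}. The slides do not give the formulas, so this may differ in detail from the boosting variant used in the experiments, and the function names are illustrative.

```python
from math import exp, log


def adaboost_round(weights, labels, predictions):
    """One AdaBoost reweighting step.

    weights      -- current instance weights, assumed to sum to 1
    labels       -- true classes, each -1 or +1
    predictions  -- outputs of the newly built classifier, each -1 or +1
    Returns (alpha, new_weights), where alpha is the classifier's vote weight.
    """
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    if not 0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * log((1 - error) / error)
    # Misclassified instances (y * h == -1) are scaled up, the rest down.
    raised = [w * exp(-alpha * y * h)
              for w, y, h in zip(weights, labels, predictions)]
    total = sum(raised)
    return alpha, [w / total for w in raised]


def committee_predict(alphas, votes):
    """Weighted vote of the committee; each vote is -1 or +1."""
    score = sum(a * v for a, v in zip(alphas, votes))
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    alpha, new_w = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
    print(round(alpha, 4), [round(w, 4) for w in new_w])
```

With four equally weighted instances and one mistake, the misclassified instance ends up carrying half of the total weight, so the next classifier is pushed to get it right.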

Emerging Patterns (EPs)
• Itemset: set of attribute-value pairs (discretized)
• EP: an itemset which occurs more frequently in one class than in any other class
• JEP (Jumping EP): an itemset that is only found in one class

Emerging Patterns (EPs)
• Boundary EPs: JEPs whose proper subsets are not JEPs
  - JEPs that are maximally frequent
  - Separate EPs with finite growth rate and JEPs with lower frequencies
  - Can be mined efficiently by border-based algorithms (MBD-LLBORDER)
• JEPs can be regarded as classification rules
  - E.g. “outlook = sunny” is a JEP in the “Play” class
  - Equivalent to rule “if outlook = sunny then Play”
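The definitions of EP, JEP and boundary EP can be tested directly against data. The sketch below is a brute-force check, not the border-based mining algorithm. It uses only the outlook and windy attributes of the Play-tennis records, split by class, and the function names are illustrative. Growth rate is taken here as the ratio of an itemset's frequency in the home class to its frequency in the other class.

```python
from itertools import combinations

ATTRS = ("outlook", "windy")
# (outlook, windy) for the 9 "Play" and 5 "Don't Play" records.
PLAY = [("sunny", True), ("sunny", False), ("overcast", True),
        ("overcast", False), ("overcast", True), ("overcast", False),
        ("rain", False), ("rain", False), ("rain", False)]
DONT = [("sunny", True), ("sunny", False), ("sunny", True),
        ("rain", True), ("rain", True)]


def frequency(itemset, records):
    """Fraction of records containing every (attribute, value) pair."""
    hits = sum(all(rec[ATTRS.index(a)] == v for a, v in itemset)
               for rec in records)
    return hits / len(records)


def growth_rate(itemset, home, other):
    """Frequency in the home class divided by frequency in the other class."""
    f_home, f_other = frequency(itemset, home), frequency(itemset, other)
    if f_other == 0:
        return float("inf") if f_home > 0 else 0.0
    return f_home / f_other


def is_jep(itemset, home, other):
    """A jumping EP occurs in the home class and never in the other class."""
    return frequency(itemset, home) > 0 and frequency(itemset, other) == 0


def is_boundary_ep(itemset, home, other):
    """A JEP none of whose proper subsets is itself a JEP."""
    items = tuple(itemset)
    if not is_jep(items, home, other):
        return False
    return not any(is_jep(sub, home, other)
                   for size in range(len(items))
                   for sub in combinations(items, size))


if __name__ == "__main__":
    pattern = (("outlook", "rain"), ("windy", False))
    print(is_jep(pattern, PLAY, DONT), is_boundary_ep(pattern, PLAY, DONT))
```

Here {outlook = rain, windy = false} is a boundary EP of the Play class because neither {outlook = rain} nor {windy = false} alone is a JEP, whereas {outlook = overcast, windy = false} is a JEP but not a boundary EP, since its subset {outlook = overcast} is already a JEP.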

Emerging Patterns (EPs) – Example
• Twelve EPs found

EP No.  EP                                 Coverage in Play class  Coverage in Don’t Play class
1       { Humi ≤ 80, windy = false }       5 (56%)                 0
2       { Temp ≤ 75, windy = false }       4 (44%)                 0
3 *     { outlook = overcast }             4 (44%)                 0
4 *     { outlook = rain, windy = false }  3 (33%)                 0
5 *     { outlook = sunny, Humi ≤ 80 }     2 (22%)                 0
6       { Temp > 75, Humi ≤ 80 }           2 (22%)                 0
7       { outlook = rain, Humi > 80 }      1 (11%)                 0
8 *     { outlook = sunny, Humi > 80 }     0                       3 (60%)
9       { Temp > 75, Humi > 80 }           0                       2 (40%)
10      { outlook = sunny, Temp > 75 }     0                       2 (40%)
11 *    { outlook = rain, windy = true }   0                       2 (40%)
12      { Temp > 75, windy = true }        0                       1 (20%)

* C4.5 rules
No single coverage constraint!

EP Approach – Characteristics
• Aggregate the discriminating power of EPs
• A cluster of trees
  - Each EP is a tree with only one branch (labeled ‘true’)
  - Some EPs can be integrated into a bigger tree
• No single coverage constraint, e.g. Record 5 satisfies EPs 1, 2 and 5
  - Record 5 { outlook = sunny, Temp = 69, Humi = 70, windy = false }
  - EP 1 { Humi ≤ 80, windy = false }
  - EP 2 { Temp ≤ 75, windy = false }
  - EP 5 { outlook = sunny, Humi ≤ 80 }
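The absence of a single coverage constraint can be checked by matching records against the twelve EPs of the example. This is a minimal sketch: the EPs are written as predicates, the humidity of record 3 is garbled in the slide and assumed to be 85 (any value above 80 gives the same counts), and the names EPS, matching_eps and coverage are illustrative.

```python
FIELDS = ("outlook", "temp", "humidity", "windy", "cls")
ROWS = [
    ("sunny", 75, 70, True, "Play"),
    ("sunny", 80, 90, True, "Don't Play"),
    ("sunny", 85, 85, False, "Don't Play"),   # humidity 85 is an assumption
    ("sunny", 72, 95, True, "Don't Play"),
    ("sunny", 69, 70, False, "Play"),
    ("overcast", 72, 90, True, "Play"),
    ("overcast", 83, 78, False, "Play"),
    ("overcast", 64, 65, True, "Play"),
    ("overcast", 81, 75, False, "Play"),
    ("rain", 71, 80, True, "Don't Play"),
    ("rain", 65, 70, True, "Don't Play"),
    ("rain", 75, 80, False, "Play"),
    ("rain", 68, 80, False, "Play"),
    ("rain", 70, 96, False, "Play"),
]
RECORDS = [dict(zip(FIELDS, row)) for row in ROWS]

# EPs 1-7 belong to the Play class, EPs 8-12 to the Don't Play class.
EPS = {
    1: lambda r: r["humidity"] <= 80 and not r["windy"],
    2: lambda r: r["temp"] <= 75 and not r["windy"],
    3: lambda r: r["outlook"] == "overcast",
    4: lambda r: r["outlook"] == "rain" and not r["windy"],
    5: lambda r: r["outlook"] == "sunny" and r["humidity"] <= 80,
    6: lambda r: r["temp"] > 75 and r["humidity"] <= 80,
    7: lambda r: r["outlook"] == "rain" and r["humidity"] > 80,
    8: lambda r: r["outlook"] == "sunny" and r["humidity"] > 80,
    9: lambda r: r["temp"] > 75 and r["humidity"] > 80,
    10: lambda r: r["outlook"] == "sunny" and r["temp"] > 75,
    11: lambda r: r["outlook"] == "rain" and r["windy"],
    12: lambda r: r["temp"] > 75 and r["windy"],
}


def matching_eps(record):
    """Return the numbers of all EPs contained in the record."""
    return [no for no, ep in EPS.items() if ep(record)]


def coverage(ep, cls):
    """Count the records of class `cls` that contain the EP."""
    return sum(1 for r in RECORDS if r["cls"] == cls and ep(r))


if __name__ == "__main__":
    print("record 5 contains EPs", matching_eps(RECORDS[4]))
    for cls in ("Play", "Don't Play"):
        print(cls, sum(coverage(ep, cls) for ep in EPS.values()))
```

Under the stated assumption, the summed coverage is 21 for the Play class and 10 for the Don’t Play class, both larger than the class sizes of 9 and 5, because a record may contain several EPs.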

EP Approach – Characteristics
• Globally significant rules
  - Some C4.5 rules are discovered based only on a fraction of a class and a fraction of another class
  - Boundary EPs differentiate one class from another using many instances; the distinction is in terms of a whole class
• Exponential number of EPs
  - By greedy heuristic and single coverage constraint, C4.5 produces trees with small numbers of rules
  - No. of boundary EPs may increase exponentially with no. of attributes and no. of discretized values
  - Solution: select the most important attributes for EP discovery

PCL
• Prediction by Collective Likelihood of EPs
• Motivation
  - A test instance should contain
    - many top-ranked EPs from its own (home) class, and
    - a few (or no) low-ranked EPs from its opposite class
  - It may contain top-ranked EPs from its opposite class
  - PCL uses multiple highly frequent EPs of the home class to avoid confusing signals from counterpart EPs

PCL
• Training phase
  - Given two training datasets D_P and D_N, discover boundary EPs from each of them
  - Rank the EPs of each dataset in descending order of their frequency:
    - EPs of +ve class: EP_P_1, EP_P_2, … EP_P_i
    - EPs of -ve class: EP_N_1, EP_N_2, … EP_N_j

PCL
• Testing phase
  - Suppose a test instance T contains the following EPs:
    - EP_P_i_1, EP_P_i_2, … EP_P_i_x (where 1 ≤ i_1 < i_2 < … < i_x ≤ i)
    - EP_N_j_1, EP_N_j_2, … EP_N_j_y (where 1 ≤ j_1 < j_2 < … < j_y ≤ j)
  - Measure how far the top k EPs contained in T are from the top k EPs of a class:
    score(T)_D_P = freq(EP_P_i_1) / freq(EP_P_1) + freq(EP_P_i_2) / freq(EP_P_2) + … + freq(EP_P_i_k) / freq(EP_P_k)
    (score(T)_D_N is defined in the same way from the EP_N_j_m)
  - Assign T to the class with the highest score
  - Ideal case: one score is k, the rest are 0
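The scoring rule can be written out as a short function. The sketch below assumes that each class supplies the frequencies of its EPs in descending order and that the EPs contained in T are given by their 1-based ranks. When T contains fewer than k EPs of a class, the sum simply runs over those available, which is an assumption the slide does not address. The frequencies are the Play and Don’t Play coverages from the twelve-EP example, k = 2 is chosen only for illustration, and the function names are illustrative.

```python
def pcl_score(ranked_freqs, contained_ranks, k):
    """Score of a test instance T against one class.

    ranked_freqs    -- frequencies of the class's EPs, in descending order
    contained_ranks -- 1-based ranks of the class's EPs that T contains
    k               -- number of top EPs to compare
    """
    top_contained = sorted(contained_ranks)[:k]
    # The m-th best EP contained in T is compared with the m-th best EP overall.
    return sum(ranked_freqs[rank - 1] / ranked_freqs[m]
               for m, rank in enumerate(top_contained))


def pcl_predict(freqs_by_class, contained_by_class, k):
    """Return (predicted class, scores), choosing the highest-scoring class."""
    scores = {cls: pcl_score(freqs_by_class[cls], contained_by_class[cls], k)
              for cls in freqs_by_class}
    return max(scores, key=scores.get), scores


FREQS = {
    "Play": [5 / 9, 4 / 9, 4 / 9, 3 / 9, 2 / 9, 2 / 9, 1 / 9],
    "Don't Play": [3 / 5, 2 / 5, 2 / 5, 2 / 5, 1 / 5],
}

if __name__ == "__main__":
    # Record 5 contains the Play EPs ranked 1, 2 and 5 and no Don't Play EP.
    contained = {"Play": [1, 2, 5], "Don't Play": []}
    print(pcl_predict(FREQS, contained, k=2))
```

For record 5 this reproduces the ideal case from the slide: the Play score equals k and the Don’t Play score is 0.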

Experiment 1 – Dataset
• Prostate disease
• 102 instances (52 Tumor and 50 Normal)
• 12,600 attributes (each the expression level of a gene)

Experiment 1 – EPs
• EP approach
  - Many globally significant rules
  - The top 20 rules have ~70% coverage in their home class
  - Attributes were discretized by an entropy-based discretization method, and 20 genes were chosen
  - EPs were mined using border-based algorithms

Experiment 1 – C4.5 Rules
• Decision tree approach
  - 5 leaves: 5 rules found, 3 of them globally insignificant (coverage ≤ 6)
  - Among the 4 features (genes) selected by C4.5, only 1 is in common with the 20 most important genes (minor rules resulted, important rules missed)
  - The next two highest-coverage genes selected by C4.5 are in the 47th and 869th positions among EPs

Experiment 1 – Accuracy and Error Rates
All experiments were conducted on the discretized data of the 20 most important genes.
[Table: accuracy (%) and #misclassified under LOOCV and under fold-based cross-validation, for PCL (k = 5, 10, 15), C4.5 (single, bagged, boosted), SVM and 3-NN; values not recoverable]
Without feature selection, the LOOCV accuracies of the 3 forms of trees are 87%, 92% and 92% (13, 8 and 8 misclassified instances) respectively.
• 3-NN has the highest accuracy
• PCL’s performance is comparable to 3-NN
• Both are better than C4.5 and SVM

Experiment 2 – Dataset
• Acute Lymphoblastic Leukemia (ALL) disease
• 327 instances with 12,558 attributes
• 215 training and 112 testing instances
• 6 sub-types and a ‘mini-type’ (the rest)

Experiment 2 – LOOCV Error Rates
All experiments were conducted on the discretized data of the 20 most important genes.
Columns: dataset (test data size in each class); PCL (k = 20, 25, 30); C4.5 (single, bagged, boosted); SVM; 3-NN. Surviving entries per row:
BCR-ABL vs others (6:106): (5) 6 (2) 8 (5) 2 1
E2A-PBX1 vs others (9:103): (0) 0 (0) 0 (0) 0 0
HyperL50 vs others (22:90): (9) 6 (6) 11 (5) 3 5
MLL vs others (6:106): (2) 1 (0) 4 (2) 0 0
T-ALL vs others (15:97): (1) 1 (1) 1 (1) 0 0
TEL-AML1 vs others (27:85): (4) 3 (1) 4 (1) 2 2
mini-type vs others (27:85): (12) 18 (13) 11 (8)
Parallel (7 classes):
• PCL’s performance is competitive with SVM and 3-NN, sometimes better
• The C4.5 single-tree approach does not work well; it is improved by bagging and boosting

Conclusions
• EPs preserve the comprehensibility of classification rules derived from decision trees.
• EPs overcome the fragmentation problem and single coverage constraint of decision trees.
• With gene expression datasets, PCL is better than C4.5 (single, bagged or boosted) on accuracy and rules, and is competitive with SVM and k-NN in terms of accuracy.

References
• L. Breiman. Bagging Predictors. Machine Learning, Vol. 24, No. 2.
• Jinyan Li and Limsoon Wong. Solving the Fragmentation Problem of Decision Trees by Discovering Boundary Emerging Patterns. ICDM-02.
• J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.