Classification with Decision Trees and Rules
Copyright © Andrew W. Moore
Density Estimation – looking ahead. Compare the density estimator against the two other major kinds of models:
– Density Estimator: input attributes → a probability
– Regressor: input attributes → a prediction of a real-valued output
– Classifier: input attributes → a prediction of a categorical output or class (one of a few discrete values)
DECISION TREE LEARNING: OVERVIEW
Decision tree learning
A decision tree
Another format: a set of rules, with one rule per leaf in the tree:
  if O=sunny and H<=70 then PLAY
  else if O=sunny and H>70 then DON'T_PLAY
  else if O=overcast then PLAY
  else if O=rain and windy then DON'T_PLAY
  else if O=rain and !windy then PLAY
A simpler rule set:
  if O=sunny and H>70 then DON'T_PLAY
  else if O=rain and windy then DON'T_PLAY
  else PLAY
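As a hypothetical illustration, the simpler rule set could be written directly as an if/else chain; the function and argument names below (spelling out the slide's O, H, and windy) are my own, not part of the lecture:

  def play(outlook, humidity, windy):
      # Simpler rule set from the slide: first matching rule wins.
      if outlook == "sunny" and humidity > 70:
          return "DON'T_PLAY"
      elif outlook == "rain" and windy:
          return "DON'T_PLAY"
      else:
          return "PLAY"

  # Example: a sunny, dry, calm day is a PLAY day under these rules.
  print(play("sunny", 65, False))   # -> PLAY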
A regression tree: each leaf predicts the mean of the Play times (in minutes) of the training examples that reach it – e.g., a leaf containing Play = 45, 45, 60, 40 predicts Play ≈ 48, and a leaf containing Play = 0, 0, 15 predicts Play ≈ 5.
Motivations for trees and rules
Often you can find a fairly accurate classifier which is small and easy to understand.
– Sometimes this gives you useful insight into a problem, or helps you debug a feature set.
Sometimes features interact in complicated ways.
– Trees can find interactions (e.g., "sunny and humid"). Again, sometimes this gives you some insight into the problem.
Trees are very inexpensive at test time.
– You don't always even need to compute all the features of an example.
– You can even build classifiers that take this into account.
– Sometimes that's important (e.g., "bloodPressure<100" vs. "MRIScan=normal" might have different costs to compute).
An example: "Is it the Onion?"
Dataset: 200 Onion articles and ~500 Economist articles.
Accuracy: almost 100% with Naïve Bayes!
On the Onion data, I used a rule-learning method called RIPPER.
Translation:
– if "enlarge" is in the set-valued attribute wordsArticle, then class = fromOnion. This rule is correct 173 times, and never wrong.
– if "added" is in the set-valued attribute wordsArticle and "play" is in the set-valued attribute wordsArticle, then class = fromOnion. This rule is correct 6 times, and wrong once.
…
After cleaning the 'Enlarge Image' lines from the articles, the estimated test error rate increases from 1.4% to 6%.
Different Subcategories of Economist Articles
Motivations for trees and rules (recap)
– You can often find a fairly accurate classifier which is small and easy to understand, which can give useful insight into a problem or help debug a feature set.
– Trees can find interactions (e.g., "sunny and humid") that linear classifiers can't.
– Trees are very inexpensive at test time; you don't always even need to compute all the features of an example (e.g., "bloodPressure<100" vs. "MRIScan=normal" might have different costs to compute).
Rest of the class: the algorithms. But first, some background: decision tree learning algorithms are based on information-gain heuristics.
BACKGROUND: ENTROPY AND OPTIMAL CODES
Information theory
Problem: design an efficient coding scheme for leaf colors (green, yellow, gold, red, orange, brown) – i.e., an encoder that turns a stream like "yellow, green, …" into bits and a decoder that recovers "yellow, green, …" from those bits.
The entropy of a two-valued distribution as a function of p: H(p) = -p log2 p - (1-p) log2(1-p), the best achievable expected number of bits per symbol under an optimal code.
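A minimal sketch connecting entropy to code length, assuming a made-up probability distribution over the leaf colors from the slide (the numbers here are illustrative only, not from the lecture):

  import math

  # Hypothetical distribution over leaf colors (illustrative numbers only).
  p = {"green": 0.5, "yellow": 0.25, "gold": 0.125,
       "red": 0.0625, "orange": 0.03125, "brown": 0.03125}

  # Entropy: the best achievable expected number of bits per symbol.
  H = -sum(pi * math.log2(pi) for pi in p.values())

  # A code that gives color c a codeword of ceil(-log2 p[c]) bits
  # (as a Huffman-style code would) matches this bound here, because
  # the chosen probabilities are exact powers of two.
  expected_len = sum(pi * math.ceil(-math.log2(pi)) for pi in p.values())

  print(f"entropy = {H:.3f} bits, expected code length = {expected_len:.3f} bits")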
DECISION TREE LEARNING: THE ALGORITHM(S)
Most decision tree learning algorithms
1. Given dataset D:
– return leaf(y) if all examples are in the same class y … or nearly so
– otherwise pick the best split, on the best attribute a:
  a=c1 or a=c2 or …
  a<θ or a≥θ
  a or not(a)
  a in {c1,…,ck} or not
– split the data into D1, D2, …, Dk and recursively build trees for each subset
2. "Prune" the tree
Popular splitting criterion: try to lower the entropy of the y labels on the resulting partition – i.e., prefer splits that give skewed distributions of labels (see the worked example below).
"Pruning" a tree: avoid overfitting by removing subtrees somehow.
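A tiny worked example of this splitting criterion, with made-up counts: a node with 8 positive and 8 negative labels, and a candidate split that produces two skewed subsets:

  import math

  def H(pos, neg):
      """Entropy of a label set with pos positive and neg negative examples."""
      n = pos + neg
      return sum(-k / n * math.log2(k / n) for k in (pos, neg) if k > 0)

  # Before the split: 8 positive and 8 negative labels -> entropy 1.0 bit.
  before = H(8, 8)

  # A split sending (7 pos, 1 neg) left and (1 pos, 7 neg) right: much more skewed labels.
  after = (8 / 16) * H(7, 1) + (8 / 16) * H(1, 7)

  print(round(before - after, 2))   # entropy reduction ("information gain") ≈ 0.46 bits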
Another view of a decision tree: the tests (e.g., Sepal_length<5.7, Sepal_width>2.8) carve the feature space into axis-parallel regions, one per leaf.
Overfitting and k-NN
A small tree gives a smooth decision boundary; a large tree gives a complicated shape. What's the best size of decision tree?
(Plot: error/loss on the training set D and on an unseen test set D_test, as the tree size ranges from small to large.)
DECISION TREE LEARNING: BREAKING IT DOWN
Breaking down decision tree learning
First: how to classify – assume everything is binary.
  function prPos = classifyTree(T, x)
    if T is a leaf node with counts n,p
      prPos = (p + 1) / (p + n + 2)   -- Laplace smoothing
    else
      j = T.splitAttribute
      if x(j)==0 then prPos = classifyTree(T.leftSon, x)
      else prPos = classifyTree(T.rightSon, x)
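A minimal runnable Python sketch of the same classification routine, assuming trees are stored as nested dicts; the field names (leaf, n, p, split, left, right) are my own convention, not the lecture's:

  def classify_tree(T, x):
      """Return P(y=1) for example x (a list of 0/1 attribute values)."""
      if T["leaf"]:
          n, p = T["n"], T["p"]          # counts of negative / positive examples
          return (p + 1) / (p + n + 2)   # Laplace smoothing
      j = T["split"]                     # index of the attribute tested at this node
      child = T["left"] if x[j] == 0 else T["right"]
      return classify_tree(child, x)

  # Example: a one-split stump on attribute 0.
  stump = {"leaf": False, "split": 0,
           "left":  {"leaf": True, "n": 9, "p": 1},
           "right": {"leaf": True, "n": 2, "p": 8}}
  print(classify_tree(stump, [1]))   # -> (8+1)/(8+2+2) = 0.75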
Breaking down decision tree learning
Reduced-error pruning with information gain:
– Split the data D (2/3, 1/3) into D_grow and D_prune
– Build the tree recursively with D_grow: T = growTree(D_grow)
– Prune the tree with D_prune: T' = pruneTree(D_prune, T)
– Return T'
Breaking down decision tree learning
First: divide & conquer to build the tree with D_grow.
  function T = growTree(X, Y)
    if |X|<10 or allOneLabel(Y) then
      T = leafNode(|Y==0|, |Y==1|)     -- counts for n,p
    else
      for i = 1,…,n                    -- for each attribute i
        ai = X(:, i)                   -- column i of X
        gain(i) = infoGain( Y, Y(ai==0), Y(ai==1) )
      j = argmax(gain)                 -- the best attribute
      aj = X(:, j)
      T = splitNode( growTree(X(aj==0), Y(aj==0)),   -- left son
                     growTree(X(aj==1), Y(aj==1)),   -- right son
                     j )
Breaking down decision tree learning
  function e = entropy(Y)
    n = |Y|; p0 = |Y==0|/n; p1 = |Y==1|/n
    e = - p0*log(p0) - p1*log(p1)
Breaking down decision tree learning
  function g = infoGain(Y, leftY, rightY)
    n = |Y|; nLeft = |leftY|; nRight = |rightY|
    g = entropy(Y) - (nLeft/n)*entropy(leftY) - (nRight/n)*entropy(rightY)

  function e = entropy(Y)
    n = |Y|; p0 = |Y==0|/n; p1 = |Y==1|/n
    e = - p0*log(p0) - p1*log(p1)
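For concreteness, here is a rough Python translation of growTree and its helpers, assuming X is a list of 0/1 feature vectors and Y a list of 0/1 labels; min_size mirrors the |X|<10 test above, and this is a sketch under those assumptions rather than the lecture's exact code:

  import math

  def entropy(Y):
      """Entropy (in bits) of a list of 0/1 labels; 0*log(0) is treated as 0."""
      n = len(Y)
      if n == 0:
          return 0.0
      e = 0.0
      for c in (0, 1):
          p = sum(1 for y in Y if y == c) / n
          if p > 0:
              e -= p * math.log2(p)
      return e

  def info_gain(Y, left_Y, right_Y):
      n = len(Y)
      return (entropy(Y)
              - len(left_Y) / n * entropy(left_Y)
              - len(right_Y) / n * entropy(right_Y))

  def grow_tree(X, Y, min_size=10):
      """Recursively build a tree over 0/1 feature vectors X and 0/1 labels Y."""
      # Stop and make a leaf if the node is small or pure.
      if len(X) < min_size or len(set(Y)) == 1:
          return {"leaf": True, "n": Y.count(0), "p": Y.count(1)}
      n_attrs = len(X[0])
      def partition(j):
          left  = [(x, y) for x, y in zip(X, Y) if x[j] == 0]
          right = [(x, y) for x, y in zip(X, Y) if x[j] == 1]
          return left, right
      # Pick the attribute with the highest information gain.
      gains = []
      for j in range(n_attrs):
          left, right = partition(j)
          gains.append(info_gain(Y, [y for _, y in left], [y for _, y in right]))
      j = gains.index(max(gains))
      left, right = partition(j)
      if not left or not right:
          # Degenerate split (no attribute separates the data): fall back to a leaf.
          return {"leaf": True, "n": Y.count(0), "p": Y.count(1)}
      # Internal nodes also keep their counts so that pruning can turn them into leaves.
      return {"leaf": False, "split": j, "n": Y.count(0), "p": Y.count(1),
              "left":  grow_tree([x for x, _ in left],  [y for _, y in left],  min_size),
              "right": grow_tree([x for x, _ in right], [y for _, y in right], min_size)}

The leaf and split dictionaries use the same fields as the classify_tree sketch above, so the two pieces compose.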
Breaking down decision tree learning (recap)
Reduced-error pruning with information gain:
– Split the data D (2/3, 1/3) into D_grow and D_prune
– Build the tree recursively with D_grow: T = growTree(D_grow)
– Prune the tree with D_prune: T' = pruneTree(D_prune, T)
– Return T'
Breaking down decision tree learning
Next: how to prune the tree with D_prune.
– Estimate the error rate of every subtree on D_prune.
– Recursively traverse the tree:
  – reduce the error of the left and right subtrees of T;
  – if T would have lower error if it were converted to a leaf, convert T to a leaf.
A decision tree. We're using the fact that the examples for sibling subtrees are disjoint.
Breaking down decision tree learning
To estimate error rates, classify the whole pruning set, and keep some counts.
  function classifyPruneSet(T, X, Y)
    T.pruneN = |Y==0|; T.pruneP = |Y==1|
    if T is not a leaf then
      j = T.splitAttribute
      aj = X(:, j)
      classifyPruneSet( T.leftSon,  X(aj==0), Y(aj==0) )
      classifyPruneSet( T.rightSon, X(aj==1), Y(aj==1) )

  function e = errorsOnPruneSetAsLeaf(T): min(T.pruneN, T.pruneP)
Breaking down decision tree learning
Next: how to prune the tree with D_prune – estimate the error rate of every subtree on D_prune, then recursively traverse the tree:
  function T1 = pruned(T)
    if T is a leaf then               -- copy T, adding an error estimate T.minErrors
      T1 = leaf(T, errorsOnPruneSetAsLeaf(T))
    else
      e1 = errorsOnPruneSetAsLeaf(T)
      TLeft = pruned(T.leftSon); TRight = pruned(T.rightSon)
      e2 = TLeft.minErrors + TRight.minErrors
      if e1 <= e2 then T1 = leaf(T, e1)       -- copy + add error estimate
      else T1 = splitNode(T, e2)              -- copy + add error estimate
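A rough Python version of the two pruning steps, reusing the dict-based trees from the earlier sketches (the prune_n, prune_p, and min_errors field names are my own):

  def classify_prune_set(T, X, Y):
      """Record, at every node, how many pruning-set negatives/positives reach it."""
      T["prune_n"] = Y.count(0)
      T["prune_p"] = Y.count(1)
      if not T["leaf"]:
          j = T["split"]
          left  = [(x, y) for x, y in zip(X, Y) if x[j] == 0]
          right = [(x, y) for x, y in zip(X, Y) if x[j] == 1]
          classify_prune_set(T["left"],  [x for x, _ in left],  [y for _, y in left])
          classify_prune_set(T["right"], [x for x, _ in right], [y for _, y in right])

  def errors_as_leaf(T):
      # Errors this node would make on the pruning set if it predicted its majority class.
      return min(T["prune_n"], T["prune_p"])

  def pruned(T):
      """Return a copy of T in which any subtree that doesn't help on the pruning set is a leaf."""
      if T["leaf"]:
          return dict(T, min_errors=errors_as_leaf(T))
      e1 = errors_as_leaf(T)
      left, right = pruned(T["left"]), pruned(T["right"])
      e2 = left["min_errors"] + right["min_errors"]
      if e1 <= e2:
          # Collapsing to a leaf is no worse on the pruning set: do it,
          # keeping this node's grow-set counts for later classification.
          return {"leaf": True, "n": T["n"], "p": T["p"], "min_errors": e1}
      return dict(T, left=left, right=right, min_errors=e2)

  # Usage: classify_prune_set(T, X_prune, Y_prune); T_pruned = pruned(T)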
Decision trees: plus and minus
Simple and fast to learn.
Arguably easy to understand (if compact).
Very fast to use:
– often you don't even need to compute all attribute values.
Can find interactions between variables (play if it's cool and sunny or …) and hence non-linear decision boundaries.
Don't need to worry about how numeric values are scaled.
Decision trees: plus and minus
Hard to prove things about.
Not well-suited to probabilistic extensions.
Sometimes fail badly on problems that seem easy:
– the IRIS dataset is an example.
Fixing decision trees…
Hard to prove things about.
Don't (typically) improve over linear classifiers when you have lots of features.
Sometimes fail badly on problems that linear classifiers perform well on.
– One solution is to build ensembles of decision trees – more on this later.
RULE LEARNING: OVERVIEW
Rules for Subcategories of Economist Articles
Trees vs Rules
For every tree with L leaves, there is a corresponding rule set with L rules.
– So one way to learn rules is to extract them from trees.
But:
– Sometimes the extracted rules can be drastically simplified.
– For some rule sets, there is no tree that is nearly the same size.
– So rules are more expressive given a size constraint.
This motivated learning rules directly.
Separate-and-conquer rule learning
Start with an empty rule set. Iteratively:
– Find a rule that works well on the data.
– Remove the examples "covered by" the rule (i.e., those that satisfy the "if" part) from the data, so on later iterations the data is different.
Stop when all the data is covered by some rule.
Possibly prune the rule set.
Separate-and-conquer rule learning
Start with an empty rule set. Iteratively:
– Find a rule that works well on the data:
  – start with an empty rule;
  – iteratively add a condition that is true for many positive and few negative examples;
  – stop when the rule covers no negative examples (or almost no negative examples).
– Remove the examples "covered by" the rule.
Stop when all the data is covered by some rule.
Separate-and-conquer rule learning
  function Rules = separateAndConquer(X, Y)
    Rules = empty rule set
    while there are positive examples left in X,Y do
      R = empty list of conditions
      CoveredX = X; CoveredY = Y
      -- specialize R until it covers only positive examples
      while CoveredY contains some negative examples
        -- compute the "gain" for each condition x(j)==1 …
        j = argmax(gain); aj = CoveredX(:, j)
        R = R conjoined with condition "x(j)==1"        -- add best condition
        -- remove examples not covered by R from CoveredX, CoveredY
        CoveredX = CoveredX(aj==1); CoveredY = CoveredY(aj==1)
      Rules = Rules plus new rule R
      -- remove examples covered by R from X,Y
      …
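A simplified runnable sketch of the separate-and-conquer loop for binary 0/1 features. The condition-scoring here is a smoothed precision, a stand-in for the FOIL-style gain that RIPPER actually uses, so this illustrates the control flow rather than RIPPER itself:

  def separate_and_conquer(X, Y, max_conditions=5):
      """Learn a list of rules; each rule is a list of attribute indices that must equal 1."""
      rules = []
      X, Y = list(X), list(Y)
      while any(y == 1 for y in Y):                     # uncovered positives remain
          rule = []
          cov_X, cov_Y = X, Y
          # Specialize the rule until it covers (almost) only positive examples.
          while any(y == 0 for y in cov_Y) and len(rule) < max_conditions:
              n_attrs = len(cov_X[0])
              def score(j):
                  covered = [y for x, y in zip(cov_X, cov_Y) if x[j] == 1]
                  # Smoothed precision of the condition x[j]==1 on the current covered set.
                  return (sum(covered) + 1) / (len(covered) + 2)
              j = max(range(n_attrs), key=score)
              rule.append(j)                            # add best condition
              kept = [(x, y) for x, y in zip(cov_X, cov_Y) if x[j] == 1]
              cov_X = [x for x, _ in kept]
              cov_Y = [y for _, y in kept]
          covered = [all(x[j] == 1 for j in rule) for x in X]
          if not any(c and y == 1 for c, y in zip(covered, Y)):
              break                                     # new rule covers no positives: stop
          rules.append(rule)
          # Remove the examples covered by the new rule from the data.
          X = [x for x, c in zip(X, covered) if not c]
          Y = [y for y, c in zip(Y, covered) if not c]
      return rules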