CSE 446: Decision Trees Winter 2012 Slides adapted from Carlos Guestrin and Andrew Moore by Luke Zettlemoyer & Dan Weld

Logistics
– Office Hours – see website
– Change in dataset
– Learning DTs with attributes that have:
  – Numerous Possible Values
  – Continuous Values
  – Missing Values
– Loss Functions

Why is Learning Possible?
Experience alone never justifies any conclusion about any unseen instance.
Learning occurs when PREJUDICE meets DATA!

Some Typical Biases
– Occam's razor: "It is needless to do more when less will suffice" – William of Occam, died 1349 of the Black Plague
– MDL – minimum description length
– Concepts can be approximated by... conjunctions of predicates... by linear functions... by short decision trees

Overview of Learning
Rows: what is being learned; columns: type of supervision (eg, Experience, Feedback)
– Discrete function: Classification (labeled examples), Clustering (nothing)
– Continuous function: Regression (labeled examples)
– Policy: Apprenticeship Learning (labeled examples), Reinforcement Learning (reward)

A learning problem: predict fuel efficiency
From the UCI repository (thanks to Ross Quinlan)
– 40 records
– Discrete data (for now)
– Predict MPG

Past Insight
Any ML problem may be cast as the problem of FUNCTION APPROXIMATION

A learning problem: predict fuel efficiency
– 40 records, discrete data (for now), predict MPG
– The attribute columns form X; the MPG column forms Y
Need to find "Hypothesis": f : X → Y

How Represent Function?
Conjunctions in propositional logic? e.g., maker=asia ∧ weight=low
Need to find "Hypothesis": f : X → Y

Restricted Hypothesis Space
Many possible representations
Natural choice: conjunction of attribute constraints
For each attribute:
– Constrain to a specific value: eg maker=asia
– Don't care: ?
For example:
  maker  cyl  displace  weight  accel  ….
  asia   ?    ?         low     ?
Represents maker=asia ∧ weight=low

Consistency
Say an "example is consistent with a hypothesis" when the example logically satisfies the hypothesis
Hypothesis: maker=asia ∧ weight=low
  maker  cyl  displace  weight  accel  ….
  asia   ?    ?         low     ?
Examples:
  asia  5  …  low  …
  usa   4  …  low  …

Ordering on Hypothesis Space
Examples:  x1 = (asia, 5, low),  x2 = (usa, 4, med)
Hypotheses:
  h1: maker=asia ∧ accel=low
  h2: maker=asia
  h3: maker=asia ∧ weight=low

Ok, so how does it perform? Version Space Algorithm

How Represent Function?
General propositional logic? e.g., maker=asia ∧ weight=low
Need to find "Hypothesis": f : X → Y

Hypotheses: decision trees, f : X → Y
– Each internal node tests an attribute x_i
– Each branch assigns an attribute value x_i = v
– Each leaf assigns a class y
– To classify input x: traverse the tree from root to leaf, output the labeled y
[Figure: example tree – root splits on Cylinders; subtrees split on Maker (america/asia/europe) and Horsepower (low/med/high); leaves labeled good/bad]

Hypothesis space
How many possible hypotheses?
[Figure: the example decision tree from the previous slide]

Hypothesis space
How many possible hypotheses?
What functions can be represented?
[Figure: the example decision tree]

What functions can be represented?
cyl=3 ∨ (cyl=4 ∧ (maker=asia ∨ maker=europe)) ∨ …
[Figure: the example decision tree]

Hypothesis space
How many possible hypotheses?
What functions can be represented?
How many will be consistent with a given dataset?
[Figure: the example decision tree]

Hypothesis space
How many possible hypotheses?
What functions can be represented?
How many will be consistent with a given dataset?
How will we choose the best one?
[Figure: the example decision tree]

Are all decision trees equal?
– Many trees can represent the same concept
– But, not all trees will have the same size!
e.g., φ = (A ∧ B) ∨ (¬A ∧ C)
Which tree do we prefer?
[Figure: two equivalent trees for φ – a small one that splits on A at the root (then B or C), and a larger one that splits on B at the root]

Learning decision trees is hard!!!
Finding the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]
Resort to a greedy heuristic:
– Start from empty decision tree
– Split on next best attribute (feature)
– Recurse

What is the Simplest Tree?
predict mpg=bad
Is this a good tree? [22+, 18-]
Means: correct on 22 examples, incorrect on 18 examples

Improving Our Tree predict mpg=bad

Recursive Step
Take the original dataset… and partition it according to the value of the attribute we split on:
– Records in which cylinders = 4
– Records in which cylinders = 5
– Records in which cylinders = 6
– Records in which cylinders = 8

Recursive Step
Build a tree from each partition: the records with cylinders = 4, cylinders = 5, cylinders = 6, and cylinders = 8.

Second level of tree Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia (Similar recursion in the other cases)

A full tree

Two Questions
1. Which attribute gives the best split?
2. When to stop recursion?
Greedy Algorithm:
– Start from empty decision tree
– Split on the best attribute (feature)
– Recurse

Which attribute gives the best split?
A1: The one with the highest information gain – defined in terms of entropy
A2: Actually many alternatives, eg, accuracy – seeks to reduce the misclassification rate

Splitting: choosing a good attribute
  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F
  F   T   F
  F   F   F
Split on X1:  X1=t → Y=t: 4, Y=f: 0;   X1=f → Y=t: 1, Y=f: 3
Split on X2:  X2=t → Y=t: 3, Y=f: 1;   X2=f → Y=t: 2, Y=f: 2
Would we prefer to split on X1 or X2?
Idea: use counts at leaves to define probability distributions so we can measure uncertainty!

Measuring uncertainty
Good split if we are more certain about classification after split
– Deterministic: good (all true or all false)
– Uniform distribution: bad
– What about distributions in between?
  P(Y=A) = 1/3,  P(Y=B) = 1/4,  P(Y=C) = 1/4,  P(Y=D) = 1/6
  P(Y=A) = 1/2,  P(Y=B) = 1/4,  P(Y=C) = 1/8,  P(Y=D) = 1/8

Entropy
Entropy H(Y) of a random variable Y
More uncertainty, more entropy!
Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)
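The definition itself appears only as an image in the original deck; the standard form (stated here for reference, not slide text) is:

H(Y) = -\sum_{i} P(Y = y_i)\,\log_2 P(Y = y_i)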

Entropy Example
  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F
P(Y=t) = 5/6, P(Y=f) = 1/6
H(Y) = -(5/6) log2(5/6) - (1/6) log2(1/6) ≈ 0.65

Conditional Entropy
Conditional entropy H(Y|X) of a random variable Y conditioned on a random variable X
Example (same six records):
  Split on X1:  X1=t → Y=t: 4, Y=f: 0;   X1=f → Y=t: 1, Y=f: 1
  P(X1=t) = 4/6, P(X1=f) = 2/6
  H(Y|X1) = -(4/6)(1 log2 1 + 0 log2 0) - (2/6)((1/2) log2(1/2) + (1/2) log2(1/2)) = 2/6 ≈ 0.33
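For reference, the general definition (shown only as an image on the slide; standard form, not slide text):

H(Y \mid X) = \sum_{x} P(X = x)\, H(Y \mid X = x) = -\sum_{x} P(X = x) \sum_{y} P(Y = y \mid X = x)\, \log_2 P(Y = y \mid X = x)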

Information Gain
Advantage of attribute = decrease in entropy (uncertainty) after splitting
In our running example:
  IG(X1) = H(Y) – H(Y|X1) = 0.65 – 0.33 = 0.32
  IG(X1) > 0 → we prefer the split!
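A minimal Python sketch (mine, not from the slides) that reproduces the running example's numbers; the data layout and helper names are assumptions:

from collections import Counter
from math import log2

data = [  # (X1, X2, Y) – the six records above
    (True, True, True), (True, False, True), (True, True, True),
    (True, False, True), (False, True, True), (False, False, False),
]

def entropy(labels):
    # H(Y) = -sum_y P(y) log2 P(y), estimated from counts
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(rows, attr_index):
    # H(Y|X) = sum_x P(X=x) H(Y|X=x)
    n = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

labels = [r[-1] for r in data]
h_y = entropy(labels)                     # ≈ 0.65
h_y_x1 = conditional_entropy(data, 0)     # ≈ 0.33
print(round(h_y, 2), round(h_y_x1, 2), round(h_y - h_y_x1, 2))  # IG(X1) ≈ 0.32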

Alternate Splitting Criteria
Misclassification Impurity: minimum probability that a training pattern will be misclassified
  M(Y) = 1 - max_j P(Y=y_j)
Misclassification Gain:
  IG_M(X) = [1 - max_j P(Y=y_j)] - Σ_i P(X=x_i) [1 - max_j P(Y=y_j | X=x_i)]
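A hedged sketch of the same criterion in code (my own function names; same record format as the earlier sketch):

from collections import Counter

def misclassification(labels):
    # M(Y) = 1 - max_j P(Y=y_j): error rate of always predicting the majority class
    return 1 - max(Counter(labels).values()) / len(labels)

def misclassification_gain(rows, attr_index):
    # IG_M(X) = M(Y) - sum_i P(X=x_i) M(Y | X=x_i)
    n = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    before = misclassification([r[-1] for r in rows])
    after = sum(len(ys) / n * misclassification(ys) for ys in groups.values())
    return before - after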

Learning Decision Trees
– Start from empty decision tree
– Split on next best attribute (feature)
  – Use information gain (or…?) to select attribute
– Recurse

Suppose we want to predict MPG
Now, look at all the information gains…
predict mpg=bad

Tree After One Iteration

When to Terminate?

Base Case One Don’t split a node if all matching records have the same output value

Base Case Two Don’t split a node if none of the attributes can create multiple [non-empty] children

Base Case Two: No attributes can distinguish

Base Cases: An Idea
– Base Case One: If all records in current data subset have the same output then don't recurse
– Base Case Two: If all records have exactly the same set of input attributes then don't recurse
– Proposed Base Case 3: If all attributes have zero information gain then don't recurse
Is this a good idea?

The problem with Base Case 3
y = a XOR b
[Figure: the information gains of a and b are both zero, so the resulting decision tree is a single leaf]

But Without Base Case 3:
y = a XOR b
[Figure: the resulting decision tree splits on a, then on b, and classifies perfectly]
So: Base Case 3? Include or omit?
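A quick check (my own sketch) of the point above: on XOR data both attributes have zero information gain, so Proposed Base Case 3 would stop, even though splitting anyway gives pure leaves one level down.

from collections import Counter
from math import log2

rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]  # (a, b, y = a XOR b)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, i):
    n = len(rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r[-1])
    return entropy([r[-1] for r in rows]) - sum(
        len(ys) / n * entropy(ys) for ys in groups.values())

print(info_gain(rows, 0), info_gain(rows, 1))  # both 0.0 -> Base Case 3 refuses to split
for value in (0, 1):                           # but after splitting on a anyway...
    child = [r for r in rows if r[0] == value]
    print(info_gain(child, 1))                 # ...b splits each child perfectly (gain 1.0)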

Building Decision Trees
BuildTree(DataSet, Output)
– If all output values are the same in DataSet, return a leaf node that says "predict this unique output"
– If all input values are the same, return a leaf node that says "predict the majority output"
– Else find attribute X with highest Info Gain
  – Suppose X has n_X distinct values (i.e. X has arity n_X)
  – Create and return a non-leaf node with n_X children
  – The i-th child is built by calling BuildTree(DS_i, Output), where DS_i consists of all those records in DataSet for which X = the i-th distinct value of X
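A hedged Python translation of the pseudocode above; the record format (attribute dict plus label), the nested-tuple tree representation, and the tiny dataset are my own choices for illustration:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    n = len(rows)
    groups = {}
    for x, y in rows:
        groups.setdefault(x[attr], []).append(y)
    return entropy([y for _, y in rows]) - sum(
        len(ys) / n * entropy(ys) for ys in groups.values())

def build_tree(rows, attrs):
    labels = [y for _, y in rows]
    if len(set(labels)) == 1:                    # all output values the same
        return labels[0]
    if not attrs or all(x == rows[0][0] for x, _ in rows):  # all inputs the same
        return Counter(labels).most_common(1)[0][0]         # predict majority output
    best = max(attrs, key=lambda a: info_gain(rows, a))     # highest info gain
    children = {}
    for value in {x[best] for x, _ in rows}:     # one child per distinct value of X
        subset = [(x, y) for x, y in rows if x[best] == value]
        children[value] = build_tree(subset, [a for a in attrs if a != best])
    return (best, children)

# Usage on a toy (hypothetical) dataset:
data = [({"cyl": 4, "maker": "asia"}, "good"),
        ({"cyl": 8, "maker": "america"}, "bad"),
        ({"cyl": 4, "maker": "europe"}, "good")]
print(build_tree(data, ["cyl", "maker"]))        # splits on 'cyl', then two pure leaves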

General View of a Classifier
Hypothesis: decision boundary for labeling function
[Figure: 2-D feature space with a decision boundary separating the "+" region from the "−" region; several unlabeled "?" points]


Ok, so how does it perform?

MPG Test set error

MPG test set error The test set error is much worse than the training set error… …why?

Decision trees will overfit
Our decision trees have no learning bias
– Training set error is always zero! (if there is no label noise)
– Lots of variance
– Will definitely overfit!!!
– Must introduce some bias towards simpler trees
Why might one pick simpler trees?

Occam's Razor
Why favor short hypotheses?
Arguments for:
– Fewer short hypotheses than long ones
  → A short hypothesis is less likely to fit the data by coincidence
  → A longer hypothesis that fits the data may be a coincidence
Arguments against:
– The argument above really uses the fact that the hypothesis space is small!!!
– What is so special about small sets based on the complexity of each hypothesis?

How to Build Small Trees
Several reasonable approaches:
Stop growing tree before overfit
– Bound depth or # leaves
– Base Case 3
– Doesn't work well in practice
Grow full tree; then prune
– Optimize on a held-out (development) set
  – If growing the tree hurts performance, then cut back
  – Con: requires a larger amount of data…
– Use statistical significance testing
  – Test if the improvement for any split is likely due to noise
  – If so, then prune the split!
– Convert to logical rules
  – Then simplify rules

Reduced Error Pruning
Split data into training & validation sets (10–33%)
Train on training set (overfitting)
Do until further pruning is harmful:
1) Evaluate effect on validation set of pruning each possible node (and tree below it)
2) Greedily remove the node that most improves accuracy on the validation set
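A rough sketch of the procedure in code (mine, and simplified: it prunes bottom-up and keeps any prune that doesn't hurt validation accuracy, rather than repeatedly removing the single best node as described above); it assumes the (attribute, children) tree representation from the earlier sketch:

from collections import Counter

def classify(tree, x, default):
    while isinstance(tree, tuple):               # internal node: (attr, children)
        attr, children = tree
        tree = children.get(x.get(attr), default)
    return tree                                  # leaf: a predicted label

def accuracy(tree, rows, default):
    return sum(classify(tree, x, default) == y for x, y in rows) / len(rows)

def reduced_error_prune(tree, train_rows, val_rows, default):
    if not isinstance(tree, tuple):
        return tree
    attr, children = tree
    for v in children:                           # prune subtrees first (bottom-up)
        subset = [(x, y) for x, y in train_rows if x.get(attr) == v]
        children[v] = reduced_error_prune(children[v], subset or train_rows,
                                          val_rows, default)
    leaf = Counter(y for _, y in train_rows).most_common(1)[0][0]
    # Replace this node by a majority leaf if validation accuracy does not drop.
    if accuracy(leaf, val_rows, default) >= accuracy(tree, val_rows, default):
        return leaf
    return tree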

Effect of Reduced-Error Pruning
[Figure: annotated "Cut tree back to here"]

Alternatively: Chi-squared pruning
– Grow tree fully
– Consider leaves in turn
  – Is the parent split worth it?
Compared to Base Case 3?

Consider this split

A chi-square test
Suppose that mpg was completely uncorrelated with maker. What is the chance we'd have seen data of at least this apparent level of association anyway?
By using a particular kind of chi-square test, the answer is 13.5%.
Such hypothesis tests are relatively easy to compute, but involved.
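The slides don't spell out which chi-square test is used; one concrete and common choice is an independence test on the contingency table of attribute value versus class at the node. A sketch using scipy, with purely hypothetical counts:

from scipy.stats import chi2_contingency

# Rows: maker = america, asia, europe; columns: mpg good, mpg bad (made-up counts).
contingency = [[0, 10],
               [5, 2],
               [2, 3]]
chi2, p_chance, dof, _ = chi2_contingency(contingency)
print(p_chance)   # prune the split if p_chance > MaxPchance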

Using Chi-squared to avoid overfitting
Build the full decision tree as before
But when you can grow it no more, start to prune:
– Beginning at the bottom of the tree, delete splits in which p_chance > MaxPchance
– Continue working your way up until there are no more prunable nodes
MaxPchance is a magic parameter you must specify to the decision tree, indicating your willingness to risk fitting noise

Pruning example
With MaxPchance = 0.05, you will see the following MPG decision tree:
[Figure: pruned tree]
When compared with the unpruned tree:
– improved test set accuracy
– worse training accuracy

Regularization
Note for future: MaxPchance is a regularization parameter that helps us bias towards simpler models
[Figure: expected test-set error vs. MaxPchance – decreasing MaxPchance gives smaller trees, increasing it gives larger trees]
We'll learn to choose the value of magic parameters like this one later!

ML as Optimization
Greedy search for best scoring hypothesis
Where score =
– Fits training data most accurately?
– Sum: training accuracy – complexity penalty (regularization)
To be continued…

Advanced Decision Trees
Attributes with:
– Numerous Possible Values
– Continuous (Ordered) Values
– Missing Values

Information Gain IG of attribute = Decrease in entropy (uncertainty) after splitting

Many Attribute Values
What if we split on S/N? H(MPG | S/N) = ?
[Table: every record has a distinct S/N value; MPG labels alternate good/bad]

Attributes with many values
So many values that splitting on the attribute…
– Divides examples into tiny sets
– Each set likely homogeneous → high info gain
– But poor predictor…
Need to penalize these attributes
– S/N is the worst case, but correction is often needed

One Approach: Gain Ratio
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation ≡ entropy of S wrt the values of A (contrast with entropy of S wrt the target value)
→ Penalizes attributes with many evenly distributed values
For n evenly distributed values, SplitInformation = log2(n)… = 1 for a Boolean attribute
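A small sketch of the correction in code (helper names are mine; info_gain is any information-gain function over the same record format used earlier):

from collections import Counter
from math import log2

def split_information(rows, attr):
    # Entropy of the attribute's own value distribution (not of the target).
    n = len(rows)
    counts = Counter(x[attr] for x, _ in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def gain_ratio(rows, attr, info_gain):
    si = split_information(rows, attr)
    return info_gain(rows, attr) / si if si > 0 else 0.0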

Ordinal (Real-Valued) Inputs
What should we do if some of the inputs are real-valued? One branch for each value? Hopeless overfitting
Good News: GainRatio will reject these!
Bad News: They might have useful info! Eg, weight of car might → mpg

So…?

Threshold Splits
Binary tree, split on attribute X at value t
– One branch: X < t
– Other branch: X ≥ t
[Figure: example split – Weight ≤ 3200 vs. Weight > 3200]

The set of possible thresholds
Binary tree, split on attribute X
– One branch: X < t
– Other branch: X ≥ t
Search through possible values of t
– Seems hard!!!
But only a finite number of t's are important
– Sort data according to X into {x_1, …, x_m}
– Consider split points of the form x_i + (x_{i+1} – x_i)/2
Consider all such splits?

Only Certain Splits Make Sense
[Table: records sorted by weight (2600 … 7995) with MPG labels good/bad; candidate thresholds lie between consecutive weights]

Picking the best threshold
Suppose X is real valued
Define IG(Y|X:t) as H(Y) - H(Y|X:t)
Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X ≥ t) P(X ≥ t)
IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t
Then define IG*(Y|X) = max_t IG(Y|X:t)
For each real-valued attribute, use IG*(Y|X) for assessing its suitability as a split
Note: may split on an attribute multiple times, with different thresholds
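A sketch of that search in Python (names and the toy data are mine): sort on X, try midpoints between consecutive distinct values, and keep the threshold with the largest gain.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(xs, ys):
    # Returns (IG*(Y|X), best t) over binary splits X < t vs. X >= t.
    pairs = sorted(zip(xs, ys))
    h_y, n = entropy(ys), len(ys)
    best_gain, best_t = 0.0, None
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                             # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        h_cond = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if h_y - h_cond > best_gain:
            best_gain, best_t = h_y - h_cond, t
    return best_gain, best_t

weights = [2600, 3100, 3500, 4200, 4400, 4600]       # hypothetical values
mpg =     ["good", "good", "good", "bad", "bad", "bad"]
print(best_threshold(weights, mpg))                   # -> (1.0, 3850.0)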

Example Tree for MPG treating values as ordinal

Missing Data
What to do??
[Figure: a record with a missing attribute value]

Options
– Use "missing" as a value
– Don't use the attribute at all
– Assign the most common value to the missing attribute
– Fractional values

Fractional Values
e.g., send the record down both branches: 66% low and 33% high
Training:
– Use fractional counts in gain calculations
– Further subdivide if other attributes are also missing
Test:
– Classification is the most probable classification
– Averaging over the leaves where the record got divided

What you need to know about decision trees
Decision trees are one of the most popular ML tools
– Easy to understand, implement, and use
– Computationally cheap (to solve heuristically)
Information gain to select attributes (ID3, C4.5, …)
Presented for classification; can be used for regression and density estimation too
Decision trees will overfit!!!
– Must use tricks to find "simple trees", e.g.:
  – Fixed depth / early stopping
  – Pruning
  – Hypothesis testing

Loss Functions
How do we measure the quality of a hypothesis?

Loss Functions
How do we measure the quality of a hypothesis?
L(x, y, ŷ) = utility(result of using the true value y given input x) - utility(result of using the prediction ŷ given input x)
Costs can be asymmetric, e.g., L(edible, poison) vs. L(poison, edible)

Common Loss Functions
– 0/1 loss: 0 if y = ŷ, else 1
– Absolute value loss: |y − ŷ|
– Squared error loss: (y − ŷ)²
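As code (a trivial sketch, my own naming), with y the true value and y_hat the prediction:

def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1

def absolute_loss(y, y_hat):
    return abs(y - y_hat)

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2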

Overview of Learning
Rows: what is being learned; columns: type of supervision (eg, Experience, Feedback)
– Discrete function: Classification (labeled examples), Clustering (nothing)
– Continuous function: Regression (labeled examples)
– Policy: Apprenticeship Learning (labeled examples), Reinforcement Learning (reward)

Polynomial Curve Fitting
Hypothesis Space
sin(2πx)

Sum-of-Squares Error Function
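The error function itself appears only as a figure; in Bishop's standard notation (not slide text) it is:

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{\, y(x_n, \mathbf{w}) - t_n \,\}^2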

1st Order Polynomial

3rd Order Polynomial

9th Order Polynomial

Over-fitting Root-Mean-Square (RMS) Error:
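The RMS formula is shown only as an image; the standard form (Bishop's notation, not slide text) is:

E_{\mathrm{RMS}} = \sqrt{\, 2 E(\mathbf{w}^{*}) / N \,}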

Polynomial Coefficients

Data Set Size: 9th Order Polynomial

Data Set Size: 9th Order Polynomial

Regularization Penalize large coefficient values
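The regularized error function appears only as an image; Bishop's standard form (not slide text) is:

\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{\, y(x_n, \mathbf{w}) - t_n \,\}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2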

Regularization:

Regularization: vs.

Polynomial Coefficients

Acknowledgements
Some of the material in the decision trees presentation is courtesy of Andrew Moore, from his excellent collection of ML tutorials.
Improved by Carlos Guestrin & Luke Zettlemoyer.