CSE 446: Decision Trees Winter 2012 Slides adapted from Carlos Guestrin and Andrew Moore by Luke Zettlemoyer; tweaked by Dan Weld
Logistics: PS 1 is due in two weeks. Did we already suggest that you … start early? See the website for the reading.
Why is Learning Possible? Experience alone never justifies any conclusion about any unseen instance. Learning occurs when PREJUDICE meets DATA!
Some Typical Biases
– Occam's razor: "It is needless to do more when less will suffice" – William of Occam, died 1349 of the Black Plague
– MDL: minimum description length
– Concepts can be approximated by … conjunctions of predicates, by linear functions, or by short decision trees
Overview of Learning: what is being learned, by type of supervision (e.g., experience, feedback)
– Discrete function: Classification (labeled examples), Clustering (nothing)
– Continuous function: Regression (labeled examples)
– Policy: Apprenticeship Learning (labeled examples), Reinforcement Learning (reward)
A learning problem: predict fuel efficiency. From the UCI repository (thanks to Ross Quinlan): 40 records, discrete data (for now); predict MPG.
Past Insight: any ML problem may be cast as the problem of FUNCTION APPROXIMATION.
A learning problem: predict fuel efficiency. 40 records, discrete data (for now); predict MPG. The input attributes form X and the MPG label is Y. Need to find a "hypothesis": f : X → Y.
How to Represent the Function? Conjunctions in propositional logic? For example: maker=asia ∧ weight=low. Need to find a "hypothesis": f : X → Y.
Restricted Hypothesis Space
Many possible representations. A natural choice: a conjunction of attribute constraints. For each attribute, either:
– constrain it to a specific value, e.g. maker=asia, or
– don't care: ?
For example, (maker=asia, cyl=?, displace=?, weight=low, accel=?, …) represents maker=asia ∧ weight=low.
Consistency: say an example is consistent with a hypothesis when the example logically satisfies the hypothesis.
Hypothesis: maker=asia ∧ weight=low, i.e. (maker=asia, cyl=?, displace=?, weight=low, accel=?, …).
Examples: an example like (asia, 5, …, low, …) satisfies it; one like (usa, 4, …, low, …) does not, since its maker is not asia.
Ordering on the Hypothesis Space
Examples: x1 = (asia, 5, low, …), x2 = (usa, 4, med, …)
Hypotheses: h1: maker=asia ∧ accel=low; h2: maker=asia; h3: maker=asia ∧ weight=low
h2 is more general than h1 and h3: any example consistent with h1 or h3 is also consistent with h2.
OK, so how does it perform? The Version Space algorithm.
How to Represent the Function? General propositional logic? For example, an arbitrary formula over atoms such as maker=asia and weight=low. Need to find a "hypothesis": f : X → Y.
Hypotheses: decision trees, f : X → Y. Each internal node tests an attribute x_i. Each branch assigns an attribute value x_i = v. Each leaf assigns a class y. To classify input x: traverse the tree from root to leaf and output the leaf's label y. [Example tree: the root tests Cylinders (3, 4, 5, 6, 8); one branch leads to a Maker test (america, asia, europe), another to a Horsepower test (low, med, high); leaves are labeled good or bad.]
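To make the traversal concrete, here is a minimal Python sketch (not from the slides) of one way a learned tree could be represented and classified against; the nested-dict representation, attribute names, and leaf labels are assumptions for illustration only.

```python
# Minimal sketch of decision-tree classification (illustration, not course code).
# A tree is either a leaf label (e.g. "good"/"bad") or a dict of the form
#   {"attribute": name, "children": {value: subtree, ...}}.

def classify(tree, example):
    """Traverse the tree from root to leaf and return the leaf's label."""
    while isinstance(tree, dict):          # internal node: test an attribute
        attribute = tree["attribute"]
        value = example[attribute]         # the example's value for that attribute
        tree = tree["children"][value]     # follow the branch labeled with it
    return tree                            # leaf: the predicted class y

# Hypothetical tree mirroring the slide's shape (leaf labels invented):
tree = {
    "attribute": "cylinders",
    "children": {
        3: "good",
        4: {"attribute": "maker",
            "children": {"america": "bad", "asia": "good", "europe": "good"}},
        8: {"attribute": "horsepower",
            "children": {"low": "bad", "med": "good", "high": "bad"}},
    },
}
print(classify(tree, {"cylinders": 4, "maker": "asia"}))  # -> good
```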
Hypothesis space
– How many possible hypotheses?
– What functions can be represented? For example: cyl=3 ∨ (cyl=4 ∧ (maker=asia ∨ maker=europe)) ∨ …
– How many will be consistent with a given dataset?
– How will we choose the best one?
Are all decision trees equal? Many trees can represent the same concept, but not all trees will have the same size! e.g., Y = (A ∧ B) ∨ (¬A ∧ C). [The same concept drawn two ways: a small tree that tests A at the root and then only B or C, and a larger tree that tests B at the root and needs additional tests in several branches.] Which tree do we prefer?
Learning decision trees is hard!!! Finding the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest ’76] Resort to a greedy heuristic: – Start from empty decision tree – Split on next best attribute (feature) – Recurse
What is the Simplest Tree? A single leaf: predict mpg=bad. Is this a good tree? [22+, 18−] means: correct on 22 examples, incorrect on 18 examples.
Improving Our Tree (starting from the single leaf predict mpg=bad).
Recursive Step: take the original dataset and partition it according to the value of the attribute we split on: the records in which cylinders = 4, cylinders = 5, cylinders = 6, and cylinders = 8.
Recursive Step: build a subtree from each partition (the records with cylinders = 4, with cylinders = 5, with cylinders = 6, and with cylinders = 8), as sketched below.
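A minimal sketch of that partition step, assuming records are Python dicts keyed by attribute name (the record contents below are made up for illustration):

```python
from collections import defaultdict

def partition(records, attribute):
    """Group records by their value for the attribute we split on."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record)
    return dict(groups)

# Hypothetical records, keyed by attribute name:
records = [{"cylinders": 4, "mpg": "good"},
           {"cylinders": 8, "mpg": "bad"},
           {"cylinders": 4, "mpg": "good"}]
by_cyl = partition(records, "cylinders")
# by_cyl[4] holds the cylinders=4 records, by_cyl[8] the cylinders=8 records;
# the recursive step then builds one subtree from each group.
```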
Second level of tree Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia (Similar recursion in the other cases)
A full tree
Splitting: choosing a good attribute. Dataset (X1, X2, Y): (T,T,T), (T,F,T), (T,T,T), (T,F,T), (F,T,T), (F,F,F), (F,T,F), (F,F,F).
Split on X1: X1=t gives Y=t: 4, Y=f: 0; X1=f gives Y=t: 1, Y=f: 3.
Split on X2: X2=t gives Y=t: 3, Y=f: 1; X2=f gives Y=t: 2, Y=f: 2.
Would we prefer to split on X1 or X2? Idea: use the counts at the leaves to define probability distributions so we can measure uncertainty!
Measuring uncertainty. A split is good if we are more certain about the classification after the split:
– deterministic is good (all true or all false)
– a uniform distribution is bad
– what about distributions in between? For example, P(Y=A)=1/3, P(Y=B)=1/4, P(Y=C)=1/4, P(Y=D)=1/6 is close to uniform (high uncertainty), while P(Y=A)=1/2, P(Y=B)=1/4, P(Y=C)=1/8, P(Y=D)=1/8 is more peaked (lower uncertainty).
Entropy H(Y) of a random variable Y with values y_1, …, y_k: H(Y) = −Σ_i P(Y=y_i) log₂ P(Y=y_i). More uncertainty, more entropy! Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).
Entropy Example. Dataset (X1, X2, Y): (T,T,T), (T,F,T), (T,T,T), (T,F,T), (F,T,T), (F,F,F). P(Y=t) = 5/6, P(Y=f) = 1/6. H(Y) = −(5/6) log₂(5/6) − (1/6) log₂(1/6) ≈ 0.65.
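As a quick arithmetic check, a few lines of Python computing the same entropy from the class counts (a standard implementation, not course code):

```python
from math import log2

def entropy(counts):
    """H(Y) = -sum_i p_i log2 p_i, computed from per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Y = t in 5 of the 6 records, Y = f in 1 of them:
print(round(entropy([5, 1]), 2))  # -> 0.65
```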
Conditional Entropy H(Y|X) of a random variable Y conditioned on a random variable X: H(Y|X) = −Σ_j P(X=x_j) Σ_i P(Y=y_i | X=x_j) log₂ P(Y=y_i | X=x_j).
Example (same six records): splitting on X1 gives X1=t → Y=t: 4, Y=f: 0 and X1=f → Y=t: 1, Y=f: 1, with P(X1=t) = 4/6 and P(X1=f) = 2/6.
H(Y|X1) = −(4/6)(1 log₂ 1 + 0 log₂ 0) − (2/6)((1/2) log₂(1/2) + (1/2) log₂(1/2)) = 2/6 ≈ 0.33.
Information Gain: the advantage of an attribute is the decrease in entropy (uncertainty) after splitting on it: IG(X) = H(Y) − H(Y|X). In our running example (the same six records): IG(X1) = H(Y) − H(Y|X1) = 0.65 − 0.33 ≈ 0.32. Since IG(X1) > 0, we prefer the split!
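Continuing in the same spirit, a self-contained sketch that computes H(Y), H(Y|X1), and the information gain for these six records (again my own illustration, not course code):

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X), where H(Y|X) averages the entropy of each X-group."""
    h_y_given_x = 0.0
    for x_value in set(xs):
        group = [y for x, y in zip(xs, ys) if x == x_value]
        h_y_given_x += (len(group) / len(ys)) * entropy(group)
    return entropy(ys) - h_y_given_x

# The six (X1, Y) pairs from the slides' running example:
x1 = ["t", "t", "t", "t", "f", "f"]
y  = ["t", "t", "t", "t", "t", "f"]
print(round(info_gain(x1, y), 2))  # 0.65 - 0.33 -> 0.32
```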
Learning Decision Trees: start from an empty decision tree; split on the next best attribute (feature), using, for example, information gain to select it: choose arg max_i IG(X_i) = arg max_i [H(Y) − H(Y|X_i)]; recurse.
Suppose we want to predict MPG. Now, look at all the information gains for the root split… (the current tree is still the single leaf predict mpg=bad).
Tree After One Iteration
When to Terminate?
Base Case One Don’t split a node if all matching records have the same output value
Base Case Two Don’t split a node if none of the attributes can create multiple [non-empty] children
Base Case Two: No attributes can distinguish
Base Cases: An Idea
– Base Case One: if all records in the current data subset have the same output, don't recurse.
– Base Case Two: if all records have exactly the same set of input attributes, don't recurse.
– Proposed Base Case 3: if all attributes have zero information gain, don't recurse. Is this a good idea?
The problem with Base Case 3: consider y = a XOR b. Every attribute has zero information gain at the root, so Base Case 3 stops immediately and the resulting decision tree is a single majority-vote leaf.
If we omit Base Case 3 (y = a XOR b): the resulting decision tree splits on a and then on b and classifies the data perfectly. So is it OK to omit Base Case 3?
Decision Tree Building Summarized: BuildTree(DataSet, Output)
– If all output values are the same in DataSet, return a leaf node that says "predict this unique output."
– If all input values are the same, return a leaf node that says "predict the majority output."
– Else find the attribute X with the highest information gain. Suppose X has n_X distinct values (i.e., X has arity n_X). Create and return a non-leaf node with n_X children, where the i-th child is built by calling BuildTree(DS_i, Output) and DS_i consists of all those records in DataSet for which X = the i-th distinct value of X.
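Below is a compact Python sketch of this procedure, under a few assumptions of mine: records are dicts, the output column name is passed in as `output`, and (as in ID3-style variants) an attribute is dropped from consideration once it has been used. It is an illustration of the summary above, not the course's reference implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(records, attribute, output):
    """Decrease in entropy of the output from splitting on the attribute."""
    gain = entropy([r[output] for r in records])
    for value in set(r[attribute] for r in records):
        subset = [r[output] for r in records if r[attribute] == value]
        gain -= (len(subset) / len(records)) * entropy(subset)
    return gain

def build_tree(records, attributes, output):
    """BuildTree(DataSet, Output) as summarized above."""
    labels = [r[output] for r in records]
    if len(set(labels)) == 1:              # all outputs identical: pure leaf
        return labels[0]
    if not attributes:                     # nothing left to split on: majority leaf
        return Counter(labels).most_common(1)[0][0]
    # Split on the attribute X with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(records, a, output))
    children = {}
    for value in set(r[best] for r in records):
        subset = [r for r in records if r[best] == value]
        remaining = [a for a in attributes if a != best]
        children[value] = build_tree(subset, remaining, output)
    return {"attribute": best, "children": children}
```

Called as, for example, build_tree(records, ["maker", "cylinders"], "mpg"), it returns the nested dicts used by the classify sketch shown earlier.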
General View of a Classifier. Hypothesis: a function for labeling examples. [Figure: a 2-D feature space containing training points labeled + and −, regions marked "Label: +" and "Label: −", and several unlabeled "?" query points.]
OK, so how does it perform?
MPG Test set error
MPG test set error The test set error is much worse than the training set error… …why?
Decision trees will overfit Standard decision trees have no learning bias – Training set error is always zero! (If there is no label noise) – Lots of variance – Will definitely overfit!!! – Must introduce some bias towards simpler trees Many strategies for picking simpler trees: – Fixed depth – Fixed number of leaves – Or something smarter…
Occam's Razor: Why Favor Short Hypotheses?
Arguments for: there are fewer short hypotheses than long ones, so a short hypothesis is less likely to fit the data by coincidence, while a long hypothesis that fits the data may well do so by coincidence.
Arguments against: the argument above really only uses the fact that the hypothesis space is small; what is so special about small sets defined by the size of each hypothesis?
How to Build Small Trees. Two reasonable approaches:
– Optimize on a held-out (development) set: if growing the tree larger hurts held-out performance, stop growing. Requires a larger amount of data… (see the sketch below)
– Use statistical significance testing: test whether the improvement from a split is likely due to noise; if so, don't do the split!
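As a sketch of the first approach, one could train several candidate trees (say, with increasing depth limits) and keep whichever does best on the held-out development set. The `fit` and `error` callables here are hypothetical stand-ins for your own training and evaluation code:

```python
def pick_on_dev_set(candidate_settings, train, dev, fit, error):
    """Fit one tree per candidate setting (e.g. a maximum depth) and keep
    the one with the lowest error on the held-out development set.
    `fit(train, setting)` and `error(model, data)` are hypothetical hooks
    for whatever training and evaluation code is already available."""
    best_model, best_error = None, float("inf")
    for setting in candidate_settings:
        model = fit(train, setting)
        err = error(model, dev)
        if err < best_error:
            best_model, best_error = model, err
    return best_model, best_error
```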
Consider this split
A chi-square test. Suppose that mpg were completely uncorrelated with maker. What is the chance we'd have seen data with at least this apparent level of association anyway? By using a particular kind of chi-square test, the answer is 13.5%. (Such hypothesis tests are easy to compute; unfortunately there isn't enough time to cover them in lecture, but you'll have fun with them in your homework! :))
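One common way to obtain such a p-value is a chi-square test of independence on the maker-versus-mpg contingency table at the node; the counts below are invented for illustration, and the exact test used in the homework may differ:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of records at this node, rows = maker (america, asia,
# europe), columns = mpg (bad, good). Invented numbers, for illustration only.
table = [[10, 5],
         [ 4, 8],
         [ 3, 5]]
chi2, p_chance, dof, expected = chi2_contingency(table)
print(p_chance)  # chance of seeing at least this much association if uncorrelated
```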
Using chi-squared to avoid overfitting: build the full decision tree as before, but when you can grow it no more, start to prune:
– Beginning at the bottom of the tree, delete splits for which p_chance > MaxPchance.
– Continue working your way up until there are no more prunable nodes.
MaxPchance is a magic parameter you must specify to the decision tree, indicating your willingness to risk fitting noise.
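A rough sketch of that bottom-up pass, reusing the nested-dict trees from the earlier sketches and assuming a caller-supplied p_chance(node, records) function that returns the chance probability for a node's split (both are my assumptions, not the course's code):

```python
from collections import Counter

def prune(tree, records, output, p_chance, max_p_chance=0.05):
    """Bottom-up pruning: collapse a split into a majority leaf whenever the
    chance of its apparent association arising by luck exceeds MaxPchance.
    `p_chance(node, records)` is assumed to return that probability."""
    if not isinstance(tree, dict):                     # already a leaf
        return tree
    attribute = tree["attribute"]
    # Prune the children first, so we work from the bottom of the tree up.
    for value, child in tree["children"].items():
        subset = [r for r in records if r[attribute] == value]
        tree["children"][value] = prune(child, subset, output,
                                        p_chance, max_p_chance)
    # If every child is now a leaf and the split looks like noise, delete it.
    only_leaves = all(not isinstance(c, dict) for c in tree["children"].values())
    if only_leaves and p_chance(tree, records) > max_p_chance:
        return Counter(r[output] for r in records).most_common(1)[0][0]
    return tree
```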
Pruning example: with MaxPchance = 0.05, you will see the following MPG decision tree. Compared with the unpruned tree, it has improved test set accuracy but worse training accuracy.
MaxPchance. Technical note: MaxPchance is a regularization parameter that helps us bias towards simpler models. [Plot: expected test set error as a function of MaxPchance; decreasing it yields smaller trees, increasing it yields larger trees.] We'll learn how to choose the value of magic parameters like this one later!
Real-Valued inputs What should we do if some of the inputs are real-valued? Infinite number of possible split values!!! Finite dataset, only finite number of relevant splits!
"One branch for each numeric value" idea: hopeless. With such a high branching factor we will shatter the dataset and overfit.
Threshold splits Binary tree, split on attribute X at value t – One branch: X < t – Other branch: X ≥ t
The set of possible thresholds. Binary tree, split on attribute X: one branch X < t, the other branch X ≥ t. Search through possible values of t: seems hard!!! But only a finite number of t's are important: sort the data according to X into {x_1, …, x_m} and consider split points of the form x_i + (x_{i+1} − x_i)/2.
Picking the best threshold. Suppose X is real-valued. Define IG(Y|X:t) as H(Y) − H(Y|X:t), where H(Y|X:t) = H(Y | X < t) P(X < t) + H(Y | X ≥ t) P(X ≥ t). IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t. Then define IG*(Y|X) = max_t IG(Y|X:t). For each real-valued attribute, use IG*(Y|X) for assessing its suitability as a split. Note: we may split on an attribute multiple times, with different thresholds.
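Putting the pieces together for one real-valued attribute, a sketch (my own illustration) that sorts the values, forms the midpoint candidates x_i + (x_{i+1} − x_i)/2, scores each with IG(Y|X:t), and returns the best:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(xs, ys):
    """Return (IG*(Y|X), t*) over the midpoint candidates between sorted X values."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    h_y = entropy(ys)
    best_ig, best_t = -1.0, None
    for t in candidates:
        below = [y for x, y in zip(xs, ys) if x < t]
        above = [y for x, y in zip(xs, ys) if x >= t]
        h_cond = (len(below) / len(ys)) * entropy(below) \
               + (len(above) / len(ys)) * entropy(above)
        if h_y - h_cond > best_ig:
            best_ig, best_t = h_y - h_cond, t
    return best_ig, best_t

# Made-up weights and labels: lighter cars tend to be "good".
ig_star, t_star = best_threshold([1800, 2100, 2600, 3100, 3500],
                                 ["good", "good", "good", "bad", "bad"])
print(round(ig_star, 2), t_star)  # -> 0.97 2850.0
```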
Example with MPG
Example tree for our continuous dataset
What you need to know about decision trees
– Decision trees are one of the most popular ML tools: easy to understand, implement, and use, and computationally cheap (to solve heuristically).
– Information gain is used to select attributes (ID3, C4.5, …).
– Presented here for classification, but they can be used for regression and density estimation too.
– Decision trees will overfit!!! You must use tricks to find "simple trees", e.g., fixed depth / early stopping, pruning, or hypothesis testing.
Acknowledgements: some of the material in the decision trees presentation is courtesy of Andrew Moore, from his excellent collection of ML tutorials: http://www.cs.cmu.edu/~awm/tutorials