1
Decision Tree Learning CMPT 463
2
Reminders
Homework 7 is due on Tuesday, May 10
Projects are due on Tuesday, May 10
o Moodle submission: readme.doc and project.zip
Final Exam Review
o Monday, May 9
3
Learning from Examples
An agent is learning if it improves its performance on future tasks after making observations about the world.
One class of learning problem:
o from a collection of input-output pairs, learn a function that predicts the output for new inputs
o e.g., weather forecasting, Google image search
4
Why learning?
The designer cannot anticipate all changes
o A program designed to predict tomorrow's stock market prices must learn to adapt when conditions change.
Programmers sometimes have no idea how to program a solution
o e.g., recognizing faces
5
Types of Learning
Supervised learning
o learns a function from example input-output pairs
o e.g., spam detector
Unsupervised learning
o correct answers are not given
o e.g., clustering
Reinforcement learning
o learns from rewards or punishments
o e.g., a taxi agent learns from the lack of a tip
7
Supervised Learning
Learning a function/rule from specific input-output pairs is also called inductive learning.
Given a training set of N example pairs:
o (x1, y1), (x2, y2), ..., (xN, yN)
o each yi was generated by an unknown target function y = f(x)
Problem: find a hypothesis h such that h ≈ f
h generalizes well if it correctly predicts the value of y for novel examples (the test set).
8
Supervised Learning
When the output y is one of a finite set of values (e.g., sunny, cloudy, rainy), the learning problem is called classification.
o Boolean or binary classification when there are only two values
o e.g., spam detection, male/female face recognition
When y is a number (e.g., tomorrow's temperature), the problem is called regression.
9
Inductive learning method
The examples are points in the (x, y) plane, where y = f(x).
We approximate f with h, selected from a hypothesis space H.
Construct/adjust h to agree with f on the training set.
10
Inductive learning method
Construct/adjust h to agree with f on the training set.
E.g., linear fitting:
11
Inductive learning method
Construct/adjust h to agree with f on the training set.
E.g., curve fitting:
12
Inductive learning method
Construct/adjust h to agree with f on the training set.
(h is consistent if it agrees with f on all examples)
E.g., curve fitting:
13
Inductive learning method
Construct/adjust h to agree with f on the training set.
(h is consistent if it agrees with f on all examples)
E.g., curve fitting:
How do we choose from among multiple consistent hypotheses? (See the sketch below.)
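A quick illustration of the problem (my own sketch, not from the slides; the sample points and NumPy calls are assumptions): two polynomial hypotheses fit the same hypothetical training points, and the high-degree one passes through every point, yet the two generalize very differently on a novel input.

# Hypothetical training set: 8 sample points of some unknown f.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 4.9, 6.2, 6.8])

h_linear = np.polynomial.Polynomial.fit(x, y, deg=1)  # simple hypothesis
h_wiggly = np.polynomial.Polynomial.fit(x, y, deg=7)  # interpolates all 8 points

x_new = 8.0  # a novel input outside the training set
print("linear h(8) =", h_linear(x_new))  # follows the linear trend
print("wiggly h(8) =", h_wiggly(x_new))  # extrapolation can be wildly off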
14
Inductive learning method
Ockham's razor: prefer the simplest hypothesis consistent with the data.
(after the 14th-century English philosopher William of Ockham)
15
Learning decision trees
One of the simplest and yet most successful forms of machine learning.
A decision tree represents a function that takes as input a vector of attribute values and returns a "decision" – a single output.
o Here: discrete inputs, Boolean classification.
16
Learning decision trees
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60 minutes)
17
Attribute-based representations
Examples are described by attribute values.
A training set of 12 examples, e.g., situations where I will/won't wait for a table.
Classification of each example is positive (T) or negative (F).
18
Decision trees
One possible representation for hypotheses.
The "true" tree for deciding whether to wait (Price and Type do not appear):
19
Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, each truth table row maps to a path from root to leaf.
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example.
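As a small illustration (my own sketch; the function name is made up): a decision tree for XOR written as nested attribute tests, with one root-to-leaf path per truth table row. Such a tree is trivially consistent with its "training set" but has one leaf per example, so it does not compress.

def xor_tree(a: bool, b: bool) -> bool:
    if a:                # test attribute A at the root
        if b:            # path A=T, B=T -> leaf F
            return False
        return True      # path A=T, B=F -> leaf T
    else:
        if b:            # path A=F, B=T -> leaf T
            return True
        return False     # path A=F, B=F -> leaf F

for a in (False, True):
    for b in (False, True):
        assert xor_tree(a, b) == (a != b)  # matches the XOR truth table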
20
Goal: to find the most compact decision tree
21
Constructing the Decision Tree
Goal: find the smallest decision tree consistent with the examples.
Divide-and-conquer:
o Test the most important attribute first; this divides the problem into smaller subproblems that can be solved recursively.
o "Most important" = the attribute that best splits the examples.
25
Constructing the Decision Tree
Form a tree with root = best attribute.
For each value vi (or range) of the best attribute:
o Select those examples with best = vi.
o Construct subtree_i by recursively calling decision-tree learning on that subset of examples, with all attributes except best.
o Add a branch to the tree with label = vi and subtree = subtree_i.
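A runnable sketch of the procedure above (my own rendering, not the course's reference code). It assumes each example is a dict of attribute values plus a Boolean "label" key, and that importance(examples, attr) scores an attribute; the slides instantiate that score later as information gain.

from collections import Counter

def learn_tree(examples, attributes, importance):
    labels = [e["label"] for e in examples]
    if len(set(labels)) == 1:          # all positive or all negative
        return labels[0]               # -> leaf
    if not attributes:                 # no attributes left
        return Counter(labels).most_common(1)[0][0]  # -> majority leaf
    # Form tree with root = best attribute.
    best = max(attributes, key=lambda a: importance(examples, a))
    tree = {}
    # For each observed value v of the best attribute:
    for v in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == v]   # examples with best = v
        rest = [a for a in attributes if a != best]      # all attributes except best
        tree[v] = learn_tree(subset, rest, importance)   # branch label v -> subtree
    return (best, tree)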
26
Decision tree learning
Aim: find a small tree consistent with the training examples.
Idea: (recursively) choose the "most significant" attribute as the root of the (sub)tree.
27
Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
Which is a better choice?
28
Choosing the Best Attribute: Binary Classification
We want a formal measure that returns its maximum value when an attribute makes a perfect split and its minimum when it makes no distinction.
Information theory (Shannon and Weaver, 1949):
o Entropy: a measure of the uncertainty of a random variable
  - A coin that always comes up heads --> 0 bits
  - A flip of a fair coin (heads or tails) --> 1 bit
  - The roll of a fair four-sided die --> 2 bits
o Information gain: the expected reduction in entropy caused by partitioning the examples according to this attribute
29
Formula for Entropy
H(p1, ..., pn) = - Σi pi log2(pi)
Examples:
Suppose we have a collection of 10 examples, 5 positive and 5 negative:
H(1/2, 1/2) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1 bit
Suppose we have a collection of 100 examples, 1 positive and 99 negative:
H(1/100, 99/100) = -0.01 log2(0.01) - 0.99 log2(0.99) ≈ 0.08 bits
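A direct transcription of the formula as a small Python helper (a sketch; it reproduces the slide's numbers):

import math

def entropy(probs):
    """H(p1, ..., pn) = -sum_i pi * log2(pi), in bits; 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin: 1.0 bit
print(entropy([0.01, 0.99]))   # 1 positive in 100: ~0.08 bits
print(entropy([0.25] * 4))     # fair four-sided die: 2.0 bits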
31
Information gain
Information gain (from an attribute test) = the difference between the original information requirement and the new requirement.
32
Information gain
Information gain (from an attribute test) = the difference between the original information requirement and the new requirement.
Information Gain (IG), the reduction in entropy from the attribute test:
IG(A) = H(p/(p+n), n/(p+n)) - Σv (pv + nv)/(p + n) · H(pv/(pv+nv), nv/(pv+nv))
where the sum runs over the values v of attribute A, and pv, nv count the positive and negative examples with A = v.
Choose the attribute with the largest IG.
33
Information gain
For the training set, p = n = 6, so H(6/12, 6/12) = 1 bit.
Consider the attributes Patrons and Type (and others too):
Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root.
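A worked check of these two gains (a sketch; the per-value positive/negative counts below are taken from the standard 12-example restaurant training set and should be checked against the table on the earlier slide):

import math

def H(p, n):
    """Entropy of a Boolean split with p positives and n negatives, in bits."""
    total = p + n
    return -sum(q * math.log2(q) for q in (p / total, n / total) if q > 0)

def gain(splits, p=6, n=6):
    """IG(A) = H(p, n) - sum_v (pv + nv)/(p + n) * H(pv, nv)."""
    remainder = sum((pv + nv) / (p + n) * H(pv, nv) for pv, nv in splits)
    return H(p, n) - remainder

# Patrons: None -> 0+/2-, Some -> 4+/0-, Full -> 2+/4-
print(gain([(0, 2), (4, 0), (2, 4)]))            # ~0.541 bits
# Type: French 1+/1-, Italian 1+/1-, Thai 2+/2-, Burger 2+/2-
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)]))    # 0.0 bits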
34
Example contd.
Decision tree learned from the 12 examples:
Substantially simpler than the "true" tree.
35
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
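Applying the same gain computation to the PlayTennis table above (a sketch; the rows are transcribed from the table, comma-separated as Outlook, Temperature, Humidity, Wind, PlayTennis) shows Outlook would be chosen as the root:

import math
from collections import Counter

rows = [r.split(",") for r in [
    "Sunny,Hot,High,Weak,No",          "Sunny,Hot,High,Strong,No",
    "Overcast,Hot,High,Weak,Yes",      "Rain,Mild,High,Weak,Yes",
    "Rain,Cool,Normal,Weak,Yes",       "Rain,Cool,Normal,Strong,No",
    "Overcast,Cool,Normal,Strong,Yes", "Sunny,Mild,High,Weak,No",
    "Sunny,Cool,Normal,Weak,Yes",      "Rain,Mild,Normal,Weak,Yes",
    "Sunny,Mild,Normal,Strong,Yes",    "Overcast,Mild,High,Strong,Yes",
    "Overcast,Hot,Normal,Weak,Yes",    "Rain,Mild,High,Strong,No",
]]
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]  # column order

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [r[-1] for r in rows]
    rem = 0.0
    for v in set(r[col] for r in rows):
        sub = [r[-1] for r in rows if r[col] == v]
        rem += len(sub) / len(rows) * entropy(sub)
    return entropy(labels) - rem

for i, name in enumerate(ATTRS):
    print(name, round(gain(i), 3))
# Outlook has the largest gain (~0.247 bits), so it becomes the root.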