Machine Learning: Decision Tree Learning CMPT 420 / CMPG 720
Learning from Examples An agent is learning if it improves its performance on future tasks after making observations about the world. One class of learning problem: from a collection of input-output pairs, learn a function that predicts the output for new inputs. e.g., weather forecast, games
Why learning? The designer cannot anticipate all changes: a program designed to predict tomorrow’s stock market prices must learn to adapt when conditions change. Programmers sometimes have no idea how to program a solution themselves, e.g., recognizing faces.
Types of Learning Supervised learning: the agent observes example input-output pairs and learns a function that maps inputs to outputs, e.g., spam detector. Unsupervised learning: correct answers are not given, e.g., clustering.
Supervised Learning Learning a function/rule from specific input-output pairs is also called inductive learning. Given a training set of N example pairs (x1,y1), (x2,y2), ..., (xN,yN), where each yi was generated by an unknown target function y = f(x). Problem: find a hypothesis h such that h ≈ f. h generalizes well if it correctly predicts the value of y for novel examples (the test set).
Supervised Learning When the output y is one of a finite set of values (e.g., sunny, cloudy, rainy), the learning problem is called classification. With only two possible values it is Boolean or binary classification, e.g., spam detector, male/female face. When y is a number (e.g., tomorrow’s temperature), the problem is called regression.
Inductive learning method The points are in the (x,y) plane, where y = f(x). We approximate f with h. Construct/adjust h to agree with f on training set
Inductive learning method Construct/adjust h to agree with f on training set E.g., linear fitting:
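A minimal sketch of linear fitting, assuming ordinary least squares as the way h is adjusted to the training set. The data points and the names `fit_line` and `h` are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training set, roughly following y = 2x + 1 with small noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(train_x, train_y)


def h(x):
    """The learned hypothesis: a straight line."""
    return slope * x + intercept


print(f"h(x) = {slope:.2f} * x + {intercept:.2f}")
print("prediction for the unseen input x = 5:", round(h(5.0), 2))
```

With noisy data the fitted line approximates the training points rather than passing through all of them, so this h is a good fit but not a consistent hypothesis.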
Inductive learning method Construct/adjust h to agree with f on training set E.g., curve fitting:
Inductive learning method Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: How to choose from among multiple consistent hypotheses?
Inductive learning method Ockham’s razor: prefer the simplest hypothesis consistent with the data (named after the 14th-century English philosopher William of Ockham).
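To make the choice among consistent hypotheses concrete, here is a small sketch with two hypotheses that both agree with a made-up training set but disagree on a new input. The data and the functions `h_line` and `h_cubic` are illustrative assumptions.

```python
# Hypothetical training set generated by f(x) = 2x + 1.
training_set = [(0, 1), (1, 3), (2, 5)]


def h_line(x):
    """Simple hypothesis: a straight line."""
    return 2 * x + 1


def h_cubic(x):
    """More complex hypothesis: the extra term is zero at x = 0, 1 and 2."""
    return 2 * x + 1 + x * (x - 1) * (x - 2)


def is_consistent(h, examples):
    """A hypothesis is consistent if it agrees with every training example."""
    return all(h(x) == y for x, y in examples)


print(is_consistent(h_line, training_set), is_consistent(h_cubic, training_set))
print(h_line(3), h_cubic(3))  # the two hypotheses disagree on the unseen x = 3
```

Both hypotheses fit the training set perfectly, so the data alone cannot separate them. Ockham's razor would pick the line.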
Learning decision trees One of the simplest and yet most successful forms of machine learning. A decision tree represents a function that takes as input a vector of attribute values and returns a “decision” – a single output.
Learning decision trees Problem: decide whether to wait for a table at a restaurant, based on the following attributes: Alternate: is there an alternative restaurant nearby? Bar: is there a comfortable bar area to wait in? Fri/Sat: is today Friday or Saturday? Hungry: are we hungry? Patrons: number of people in the restaurant (None, Some, Full) Price: price range ($, $$, $$$) Raining: is it raining outside? Reservation: have we made a reservation? Type: kind of restaurant (French, Italian, Thai, Burger) WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
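One way to picture a decision tree as a function is a nested structure that is walked from the root to a leaf. The sketch below uses a small made-up tree over three of the attributes above. It is not the tree from the slides, and the tuple/dict encoding and the name `decide` are assumptions for illustration.

```python
# An internal node is (attribute, {value: subtree}).
# A leaf is True (wait) or False (do not wait).
# Hypothetical tree for illustration only.
tree = ("Patrons", {
    "None": False,
    "Some": True,
    "Full": ("Hungry", {
        "No": False,
        "Yes": ("WaitEstimate", {
            "0-10": True,
            "10-30": True,
            "30-60": False,
            ">60": False,
        }),
    }),
})


def decide(node, example):
    """Follow the branch matching the example's attribute value until a leaf."""
    while not isinstance(node, bool):
        attribute, branches = node
        node = branches[example[attribute]]
    return node


example = {"Patrons": "Full", "Hungry": "Yes", "WaitEstimate": "10-30"}
print(decide(tree, example))
```

Each path from the root to a leaf tests only some of the attributes, which is why a tree can ignore attributes that turn out to be irrelevant.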
Attribute-based representations Examples are described by attribute values. E.g., a training set of 12 examples of situations where I will/won't wait for a table. The classification of each example is positive (T) or negative (F).
Decision tree
Decision tree Note that the tree does not use the Price and Type attributes.
Goal: to find the most compact decision tree
Constructing the Decision Tree Recursion: test the most important attribute first. Each outcome of the test divides the problem into a smaller subproblem that can be solved recursively.
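The recursion can be sketched as follows. This is a minimal illustration, not a definitive implementation: the measure of "most important" is passed in as a function, and the placeholder `pure_count` used here (how many examples end up in single-label branches) is an assumption, as are the toy examples.

```python
from collections import Counter


def plurality(examples):
    """Most common label among (attributes, label) examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def learn_tree(examples, attributes, importance, default=None):
    """Recursively build a tree of (attribute, {value: subtree}) nodes."""
    if not examples:
        return default                      # no data: fall back to the parent's majority
    labels = {label for _, label in examples}
    if len(labels) == 1:
        return labels.pop()                 # all examples agree: make a leaf
    if not attributes:
        return plurality(examples)          # nothing left to test: majority vote
    best = max(attributes, key=lambda a: importance(a, examples))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({x[best] for x, _ in examples}):
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = learn_tree(subset, remaining, importance, plurality(examples))
    return (best, branches)


def pure_count(attribute, examples):
    """Placeholder importance: number of examples landing in single-label branches."""
    by_value = {}
    for x, y in examples:
        by_value.setdefault(x[attribute], []).append(y)
    return sum(len(ys) for ys in by_value.values() if len(set(ys)) == 1)


# Hypothetical toy examples: (attribute values, will wait?).
examples = [
    ({"Patrons": "None", "Hungry": "No"}, False),
    ({"Patrons": "Some", "Hungry": "Yes"}, True),
    ({"Patrons": "Full", "Hungry": "Yes"}, True),
    ({"Patrons": "Full", "Hungry": "No"}, False),
    ({"Patrons": "Some", "Hungry": "No"}, True),
]

learned = learn_tree(examples, ["Patrons", "Hungry"], pure_count)
print(learned)
```

Keeping the importance measure as a parameter separates the recursive structure from the question of how attributes are ranked.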
Choosing a good attribute Which is a better choice?
Choosing the Best Attribute: Information theory (Shannon and Weaver, 1949) Entropy: a measure of the uncertainty of a random variable. A coin that always comes up heads: 0 bits. A flip of a fair coin (heads or tails): 1 bit. The roll of a fair four-sided die: 2 bits.
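Formula for Entropy For a random variable $V$ with values $v_k$: $H(V) = -\sum_k P(v_k)\log_2 P(v_k)$. Suppose we have a collection of 10 examples, 5 positive and 5 negative: $H(1/2, 1/2) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit. Suppose we have a collection of 100 examples, 1 positive and 99 negative: $H(1/100, 99/100) = -0.01\log_2 0.01 - 0.99\log_2 0.99 \approx 0.08$ bits.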
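The entropy values quoted on these slides can be checked with a few lines of Python. The function name `entropy` and its list-of-probabilities interface are choices made for this sketch.

```python
import math


def entropy(probabilities):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Terms with p == 0 contribute nothing, so they are skipped.
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


print(entropy([1.0]))                     # coin that always comes up heads
print(entropy([0.5, 0.5]))                # fair coin
print(entropy([0.25] * 4))                # fair four-sided die
print(round(entropy([0.01, 0.99]), 2))    # 1 positive, 99 negative examples
```

The more lopsided the distribution, the closer the entropy gets to zero, because the outcome is nearly certain.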
Information gain Information gain (IG) from an attribute test = the entropy of the examples before the test minus the expected entropy that remains after the test. Choose the attribute with the largest IG.
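Written out for a Boolean classification with p positive and n negative examples, where attribute A splits them into subsets with p_k positive and n_k negative examples, the definition can be sketched as below. The split counts in the worked lines are an assumption taken from the usual textbook version of the restaurant example (12 examples, 6 positive and 6 negative), since they are not given in the text here.

```latex
\[
\mathrm{Gain}(A) = H\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right)
  - \sum_{k} \frac{p_k + n_k}{p + n}\,
    H\!\left(\frac{p_k}{p_k + n_k}, \frac{n_k}{p_k + n_k}\right)
\]

% Assumed split on Patrons: None (0+, 2-), Some (4+, 0-), Full (2+, 4-)
\[
\mathrm{Gain}(\mathit{Patrons}) = 1 - \left[ \tfrac{2}{12} H(0, 1) + \tfrac{4}{12} H(1, 0)
  + \tfrac{6}{12} H\!\left(\tfrac{2}{6}, \tfrac{4}{6}\right) \right] \approx 0.541 \text{ bits}
\]

% Assumed split on Type: French (1+, 1-), Italian (1+, 1-), Thai (2+, 2-), Burger (2+, 2-)
\[
\mathrm{Gain}(\mathit{Type}) = 1 - \left[ \tfrac{2}{12} H\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right)
  + \tfrac{2}{12} H\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right)
  + \tfrac{4}{12} H\!\left(\tfrac{2}{4}, \tfrac{2}{4}\right)
  + \tfrac{4}{12} H\!\left(\tfrac{2}{4}, \tfrac{2}{4}\right) \right] = 0 \text{ bits}
\]
```

Under these assumed counts, Patrons is the better attribute to test first: Type leaves every subset as mixed as the original set.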
Example contd. Decision tree learned from the 12 examples:
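PlayTennis training examples:
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No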
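As a sketch of how the first attribute would be chosen for this table, the code below computes the information gain of each attribute. It assumes the table is the widely used 14-row PlayTennis dataset, so the rows typed in here are that assumption rather than values read from the text. The names `ROWS`, `entropy` and `information_gain` are illustrative.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]

# Assumed data: the standard 14-example PlayTennis set (D1 to D14, in order).
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Entropy in bits of the label distribution in a list of labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


def information_gain(rows, attribute_index, label_index=-1):
    """Entropy before the split minus the weighted entropy after it."""
    labels = [row[label_index] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute_index], []).append(row[label_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name:12s} {gain:.3f}")
print("best attribute to test first:", best)
```

The Day column is left out on purpose: it has a different value for every example, so splitting on it would look perfect on the training set while saying nothing about new days.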