Machine Learning: Decision Tree Learning
CMPT 420 / CMPG 720
Learning from Examples
An agent is learning if it improves its performance on future tasks after making observations about the world. One class of learning problem: from a collection of input-output pairs, learn a function that predicts the output for new inputs, e.g., weather forecasting or game playing.
Why learning? The designer cannot anticipate all changes
A program designed to predict tomorrow’s stock market prices must learn to adapt when conditions change. Programmers sometimes have no idea how to program a solution themselves, e.g., recognizing faces.
Types of Learning
Supervised learning: observes example input-output pairs and learns a function, e.g., a spam detector
Unsupervised learning: correct answers are not given, e.g., clustering
Supervised Learning
Learning a function/rule from specific input-output pairs is also called inductive learning. Given a training set of N example pairs (x1,y1), (x2,y2), ..., (xN,yN), where each yi was generated by an unknown target function y = f(x). Problem: find a hypothesis h such that h ≈ f. h generalizes well if it correctly predicts the value of y for novel examples (the test set).
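As a toy illustration (a sketch, not from the slides; the target f, hypothesis h, and data below are all made up), a hypothesis is checked for generalization on examples it never saw during training:

# The "unknown" target function (known here only so we can check h)
def f(x):
    return 2 * x + 1

# Training set of (x, y) example pairs generated by f
training_set = [(x, f(x)) for x in range(10)]

# A hypothesis; here it happens to match f exactly
def h(x):
    return 2 * x + 1

# h generalizes well if it predicts y = f(x) on novel examples (the test set)
test_set = [(x, f(x)) for x in (42, 99, -7)]
print(all(h(x) == y for x, y in test_set))  # True: h agrees with f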
Supervised Learning
When the output y is one of a finite set of values (sunny, cloudy, rainy), the learning problem is called classification; with only two values it is Boolean or binary classification, e.g., a spam detector or male/female faces. When y is a number (tomorrow’s temperature), the problem is called regression.
Inductive learning method
The points are in the (x, y) plane, where y = f(x). We approximate f with h: construct/adjust h to agree with f on the training set.
Inductive learning method
Construct/adjust h to agree with f on the training set, e.g., linear fitting:
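A minimal sketch of linear fitting with NumPy (the data points are hypothetical):

import numpy as np

# Hypothetical training points that are roughly linear
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

# Least-squares fit of h(x) = a*x + b (a degree-1 polynomial)
a, b = np.polyfit(xs, ys, deg=1)
print(f"h(x) = {a:.2f}*x + {b:.2f}")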
Inductive learning method
Construct/adjust h to agree with f on the training set, e.g., curve fitting:
Inductive learning method
Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples), e.g., curve fitting. How do we choose from among multiple consistent hypotheses?
Inductive learning method
Ockham’s razor: prefer the simplest hypothesis consistent with the data (after the 14th-century English philosopher William of Ockham)
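To make the choice concrete, here is a small sketch with made-up data: a degree-4 polynomial passes through all five points exactly (it is consistent), while the much simpler degree-1 line leaves only small residuals and will usually generalize better:

import numpy as np

# Hypothetical, noisy but nearly linear data
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.0, 1.1, 1.9, 3.2, 3.9])

# Degree 4 interpolates all 5 points (max error ~0);
# degree 1 is simpler but leaves small residuals.
for deg in (1, 4):
    coeffs = np.polyfit(xs, ys, deg)
    max_err = np.abs(ys - np.polyval(coeffs, xs)).max()
    print(f"degree {deg}: max training error = {max_err:.3f}")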
Learning decision trees
One of the simplest and yet most successful forms of machine learning. A decision tree represents a function that takes as input a vector of attribute values and returns a “decision” – a single output.
Learning decision trees
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
Alternate: is there an alternative restaurant nearby?
Bar: is there a comfortable bar area to wait in?
Fri/Sat: is today Friday or Saturday?
Hungry: are we hungry?
Patrons: number of people in the restaurant (None, Some, Full)
Price: price range ($, $$, $$$)
Raining: is it raining outside?
Reservation: have we made a reservation?
Type: kind of restaurant (French, Italian, Thai, Burger)
WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
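In code, each situation can be represented as a mapping from attribute names to values (a sketch; the layout and the sample values below are my own, not from the slides):

# One training example: attribute name -> value, plus its classification
example = {
    "Alternate": True, "Bar": False, "Fri/Sat": False, "Hungry": True,
    "Patrons": "Some", "Price": "$$$", "Raining": False,
    "Reservation": True, "Type": "French", "WaitEstimate": "0-10",
}
label = True  # True = we wait for a table in this situation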
Attribute-based representations
Examples are described by attribute values. A training set of 12 examples, e.g., situations where I will/won't wait for a table. The classification of each example is positive (T) or negative (F).
Decision tree
Decision tree without the Price and Type attributes
Goal: to find the most compact decision tree
Constructing the Decision Tree
Recursion: test the most important attribute first, then divide the problem up into smaller subproblems that can be solved recursively.
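A sketch of this recursion in Python (the function names and the (attribute-dict, label) data layout are my own). Choosing the "most important" attribute is stubbed out here; the information-gain version defined a few slides below replaces it:

from collections import Counter

def plurality_value(labels):
    # Majority label, used when examples or attributes run out
    return Counter(labels).most_common(1)[0][0]

def choose_attribute(attributes, examples):
    # Placeholder: just take the first attribute. Replaced later by
    # the information-gain version once entropy has been defined.
    return attributes[0]

def learn_tree(examples, attributes, parent_examples=()):
    # examples: list of (attribute_dict, label) pairs
    if not examples:                     # no examples left: inherit majority
        return plurality_value([y for _, y in parent_examples])
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:            # all examples agree: leaf node
        return labels[0]
    if not attributes:                   # no tests left: majority leaf
        return plurality_value(labels)
    a = choose_attribute(attributes, examples)  # most important attribute first
    branches = {}
    for v in {ex[a] for ex, _ in examples}:     # one subtree per value of a
        subset = [(ex, y) for ex, y in examples if ex[a] == v]
        rest = [b for b in attributes if b != a]
        branches[v] = learn_tree(subset, rest, examples)
    return (a, branches)                 # internal node: (attribute, subtrees)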
Choosing a good attribute
Which is a better choice?
Choosing the Best Attribute:
Information theory (Shannon and Weaver, 1949)
Entropy: a measure of the uncertainty of a random variable
A coin that always comes up heads --> 0 bits
A flip of a fair coin (heads or tails) --> 1 bit
The roll of a fair four-sided die --> 2 bits
Formula for Entropy
H(p1, ..., pn) = -p1 log2 p1 - ... - pn log2 pn
Suppose we have a collection of 10 examples, 5 positive, 5 negative:
H(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
Suppose we have a collection of 100 examples, 1 positive and 99 negative:
H(1/100, 99/100) = -0.01 log2(0.01) - 0.99 log2(0.99) ≈ 0.08 bits
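The same calculations in a few lines of Python (a sketch):

import math

def entropy(probs):
    # Shannon entropy in bits; terms with p = 0 contribute nothing
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))          # coin that always comes up heads -> 0.0
print(entropy([0.5, 0.5]))     # fair coin -> 1.0 bit
print(entropy([0.25] * 4))     # fair four-sided die -> 2.0 bits
print(entropy([0.01, 0.99]))   # 1 positive in 100 -> ~0.08 bits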
Choosing a good attribute
Which is a better choice?
Information gain
Information gain (from an attribute test) = the difference between the original information requirement and the new requirement after the test: Gain(A) = H(before the split) - expected H(after splitting on A). Choose the attribute with the largest IG.
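Continuing the earlier sketch, information gain can be computed from label counts, replacing the placeholder choose_attribute:

import math
from collections import Counter

def label_entropy(labels):
    # Entropy (bits) of the empirical label distribution
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute, examples):
    # Original entropy minus expected entropy after testing `attribute`
    labels = [y for _, y in examples]
    n = len(examples)
    by_value = {}
    for ex, y in examples:               # partition the labels by value
        by_value.setdefault(ex[attribute], []).append(y)
    remainder = sum(len(ls) / n * label_entropy(ls)
                    for ls in by_value.values())
    return label_entropy(labels) - remainder

def choose_attribute(attributes, examples):
    # The most important attribute is the one with the largest gain
    return max(attributes, key=lambda a: information_gain(a, examples))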
Example contd. Decision tree learned from the 12 examples:
Day Outlook  Temperature Humidity Wind   PlayTennis
D1  Sunny    Hot         High     Weak   No
D2  Sunny    Hot         High     Strong No
D3  Overcast Hot         High     Weak   Yes
D4  Rain     Mild        High     Weak   Yes
D5  Rain     Cool        Normal   Weak   Yes
D6  Rain     Cool        Normal   Strong No
D7  Overcast Cool        Normal   Strong Yes
D8  Sunny    Mild        High     Weak   No
D9  Sunny    Cool        Normal   Weak   Yes
D10 Rain     Mild        Normal   Weak   Yes
D11 Sunny    Mild        Normal   Strong Yes
D12 Overcast Mild        High     Strong Yes
D13 Overcast Hot         Normal   Weak   Yes
D14 Rain     Mild        High     Strong No
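Encoding this table as (attribute-dict, label) pairs and applying the information_gain sketch above reproduces the standard gain values for the root split:

rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
attrs = ("Outlook", "Temperature", "Humidity", "Wind")
examples = [(dict(zip(attrs, r[:4])), r[4]) for r in rows]

# Outlook ~0.247, Humidity ~0.152, Wind ~0.048, Temperature ~0.029,
# so Outlook is tested at the root
for a in attrs:
    print(a, round(information_gain(a, examples), 3))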