Decision Trees MSE 2400 EaLiCaRA Dr. Tom Way

Decision Tree
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.

What is a Decision Tree?
An inductive learning task: use particular facts to make more generalized conclusions.
A predictive model based on a branching series of Boolean tests; each of these small Boolean tests is less complex than a one-stage classifier.
Let's look at a sample decision tree...

Predicting Commute Time
If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?
Leave At?
  8 AM: Long
  9 AM: Accident? (No: Medium, Yes: Long)
  10 AM: Stall? (No: Short, Yes: Long)

Inductive Learning
In this decision tree, we made a series of Boolean decisions and followed the corresponding branch:
Did we leave at 10 AM?
Did a car stall on the road?
Is there an accident on the road?
By answering each of these yes/no questions, we came to a conclusion about how long our commute might take.

Decision Trees as Rules
We did not have to represent this tree graphically; we could have represented it as a set of rules. However, rules may be much harder to read...

Decision Tree as a Rule Set
if hour equals 8 AM: commute time is long
else if hour equals 9 AM:
    if accident equals yes: commute time is long
    else: commute time is medium
else if hour equals 10 AM:
    if stall equals yes: commute time is long
    else: commute time is short
Notice that not all attributes have to be used on each path of the decision; some attributes may not even appear in the tree.
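As an illustration, this rule set is small enough to write as a short function. The sketch below is only illustrative (the function and value names are not from the original slides):

```python
def predict_commute(hour, accident, stall):
    """Hand-coded version of the rule set above; returns the predicted commute time."""
    if hour == "8am":
        return "long"
    elif hour == "9am":
        return "long" if accident == "yes" else "medium"
    elif hour == "10am":
        return "long" if stall == "yes" else "short"
    return None  # an hour value the rules do not cover

# Leaving at 10 AM with no stalled cars on the road -> "short"
print(predict_commute("10am", accident="no", stall="no"))
```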

How to Create a Decision Tree
First, make a list of attributes that we can measure; these attributes (for now) must be discrete.
We then choose a target attribute that we want to predict.
Then create an experience table that lists what we have seen in the past.

Sample Experience Table
Thirteen past commutes (examples D1 through D13), each recorded with the attributes Hour (8 AM, 9 AM, 10 AM), Weather (Sunny, Cloudy, Rainy), Accident (Yes/No) and Stall (Yes/No), plus the target attribute Commute (Short, Medium, Long).

Choosing Attributes
The previous experience table showed 4 attributes: hour, weather, accident and stall.
But the decision tree only showed 3 attributes: hour, accident and stall.
Why is that?

Choosing Attributes (1)
Methods for selecting attributes (described later) show that weather is not a discriminating attribute.
We use the principle of Occam's Razor: given a number of competing hypotheses, the simplest one is preferable.

Occam's Razor
A "razor" is a maxim or rule of thumb: "Entities should not be multiplied unnecessarily" (entia non sunt multiplicanda praeter necessitatem).

Choosing Attributes (2)
The basic structure of creating a decision tree is the same for most decision tree algorithms; the difference lies in how we select the attributes for the tree.
We will focus on the ID3 algorithm, developed by Ross Quinlan in 1975.

Decision Tree Algorithms
The basic idea behind any decision tree algorithm is as follows (sketched in code below):
Choose the best attribute to split the remaining instances and make that attribute a decision node.
Repeat this process recursively for each child.
Stop when: all the instances have the same target attribute value, there are no more attributes, or there are no more instances.
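Here is a minimal sketch of that recursion in Python. The examples-as-dicts representation and the choose_best_attribute parameter are assumptions for illustration; ID3's actual selection heuristic is described on the following slides.

```python
from collections import Counter

def build_tree(examples, attributes, target, choose_best_attribute):
    """Grow a decision tree as nested dicts; leaves are target-attribute values.

    examples: list of dicts mapping attribute names to values
    attributes: attribute names still available for splitting
    target: the attribute we want to predict
    choose_best_attribute: the selection heuristic (e.g. ID3's entropy test)
    """
    labels = [ex[target] for ex in examples]
    # Stopping conditions: no instances, all instances agree, or no attributes left.
    if not labels or len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0] if labels else None

    best = choose_best_attribute(examples, attributes, target)
    node = {best: {}}
    for value in sorted(set(ex[best] for ex in examples)):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, remaining, target, choose_best_attribute)
    return node
```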

Identifying the Best Attributes
Refer back to our original decision tree: the root splits on Leave At (8 AM, 9 AM, 10 AM), then on Accident? and Stall?, leading to the Long, Medium and Short leaves.
How did we know to split on Leave At, then Stall and Accident, and not Weather?

ID3 Heuristic
To determine the best attribute, we look at the ID3 heuristic.
ID3 chooses which attribute to split on based on entropy. Entropy is a measure of uncertainty, or disorder, in the data...

This isn't Entropy from Physics
In physics, entropy describes the concept that all things tend to move from an ordered state to a disordered state:
Hot is more organized than cold.
Tidy is more organized than messy.
Life is more organized than death.
Society is more organized than anarchy.

Entropy in the real world (1)
Entropy has to do with how rare or common a piece of information is.
It's natural for us to want to use fewer bits (send shorter messages) for common events and more bits for rare ones.
How often do we hear about a safe plane landing making the evening news? How about a crash? Why? The crash is rarer!

Entropy in the real world (2)
Morse code was designed using entropy: more common letters get shorter codes, rarer letters get longer ones.

Entropy in the real world (3)
Morse code shown as a decision tree.

How Entropy Works
Entropy is minimized when all values of the target attribute are the same. If we know that commute time will always be short, then entropy = 0.
Entropy is maximized when there is an equal chance of all values for the target attribute (i.e. the result is random). If commute time is short in 3 instances, medium in 3 instances and long in 3 instances, entropy is maximized.
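To put numbers on this: in the maximally mixed case each outcome has probability 3/9 = 1/3, so the entropy is 3 x (1/3) x log2(3) = log2(3), about 1.58 bits, the largest value possible for three outcomes; in the all-short case the entropy is exactly 0 bits.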

Calculating Entropy
Entropy(S) = - sum(i = 1 to l) (|Si|/|S|) * log2(|Si|/|S|)
where
S = the set of examples
Si = the subset of S with value vi under the target attribute
l = the size of the range of the target attribute
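This formula translates directly into a few lines of Python. The sketch below is illustrative (the function name and the list-of-labels input are assumptions):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over target values of (|Si|/|S|) * log2(|Si|/|S|)."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

# A pure set has entropy 0; three equally likely outcomes give log2(3) ~ 1.585 bits.
print(entropy(["short"] * 9))                                   # 0.0
print(entropy(["short"] * 3 + ["medium"] * 3 + ["long"] * 3))   # ~1.585
```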

ID3
ID3 splits on the attribute with the lowest expected entropy.
We calculate the expected entropy of an attribute as the weighted sum of subset entropies:
sum(i = 1 to k) (|Si|/|S|) * Entropy(Si)
where k is the number of values of the attribute we are testing.
Equivalently, we can measure information gain (which is largest when the expected entropy is smallest):
Entropy(S) - sum(i = 1 to k) (|Si|/|S|) * Entropy(Si)
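Continuing the sketch, the weighted-sum and information-gain formulas above might be coded like this (self-contained for clarity; function and parameter names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def expected_entropy(examples, attribute, target):
    """Weighted sum of subset entropies: sum over values of (|Si|/|S|) * Entropy(Si)."""
    total = len(examples)
    weighted = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        weighted += len(subset) / total * entropy(subset)
    return weighted

def information_gain(examples, attribute, target):
    """Entropy(S) minus the expected entropy after splitting on the attribute."""
    return entropy([ex[target] for ex in examples]) - expected_entropy(examples, attribute, target)
```

ID3 then picks the attribute with the smallest expected_entropy, which is the same as picking the one with the largest information_gain.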

ID3
Given our commute time sample set, we can calculate the expected entropy and information gain of each attribute at the root node:
Attribute   Expected Entropy   Information Gain
Hour        0.6511             0.768449
Weather     1.28884            0.130719
Accident    0.92307            0.496479
Stall       1.17071            0.248842
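Reading the table, each row is consistent with information gain = Entropy(S) - expected entropy: for example 0.6511 + 0.768449 ≈ 0.92307 + 0.496479 ≈ 1.42 bits, the entropy of the whole sample set. Hour is chosen because its expected entropy is the lowest (equivalently, its information gain is the highest).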

Problems with ID3
ID3 is not optimal: it uses expected entropy reduction, not actual reduction.
It must use discrete (or discretized) attributes. What if we left for work at 9:30 AM?
We could break the attributes down into smaller values...

Problems with Decision Trees
While decision trees classify quickly, the time for building a tree may be higher than for other types of classifiers.
Decision trees suffer from a problem of errors propagating throughout the tree, a problem that becomes more serious as the number of classes increases.

Error Propagation
Since decision trees work by a series of local decisions, what happens when one of these local decisions is wrong?
Every decision from that point on may be wrong, and we may never return to the correct path of the tree.

Error Propagation Example

Problems with ID3
If we broke leave time down to the minute, we might get something like this: one branch for each exact time seen (8:02 AM, 8:03 AM, 9:05 AM, 9:07 AM, 9:09 AM, 10:02 AM), each leading directly to its own leaf labeled Long, Medium or Short.
Since entropy is very low (in fact zero) for each branch, we end up with n branches and n leaves. This would not be helpful for predictive modeling.

Problems with ID3
We can use a technique known as discretization: we choose cut points, such as 9 AM, for splitting continuous attributes.
These cut points generally lie in a subset of boundary points, where a boundary point lies between two adjacent instances (in a sorted list) that have different target attribute values.

Problems with ID3
Consider leave times labeled with their commute time:
8:00 (L), 8:02 (L), 8:07 (M), 9:00 (S), 9:20 (S), 9:25 (S), 10:00 (S), 10:02 (M)
When we split only at the boundary points rather than at every distinct value, the branches are no longer pure (entropy increases), but we avoid a decision tree with as many leaves as there are cut points. A sketch of finding these boundary cut points follows.
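A sketch of locating those boundary cut points in Python (the helper name and the minutes-after-midnight encoding are illustrative assumptions):

```python
def boundary_cut_points(examples):
    """Return candidate cut points for a numeric attribute.

    examples: list of (value, label) pairs. A boundary lies between two
    adjacent values (in sorted order) whose labels differ; the cut point
    is taken as the midpoint between them.
    """
    ordered = sorted(examples)
    return [(v1 + v2) / 2
            for (v1, label1), (v2, label2) in zip(ordered, ordered[1:])
            if label1 != label2]

# Leave times as minutes after midnight, labeled with the commute lengths above
times = [(480, "L"), (482, "L"), (487, "M"), (540, "S"),
         (560, "S"), (565, "S"), (600, "S"), (602, "M")]
print(boundary_cut_points(times))  # [484.5, 513.5, 601.0]
```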

A Decision Tree Example
The weather data example: should we play tennis?
Fourteen instances (ID codes a through n), each described by Outlook (Sunny, Overcast, Rainy), Temperature (Hot, Mild, Cool), Humidity (High, Normal) and Windy (False, True), plus the class attribute Play (Yes/No).

Decision tree for weather data
Outlook?
  sunny: Humidity? (high: no, normal: yes)
  overcast: yes
  rainy: Windy? (false: yes, true: no)
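The same tree written as data, with a tiny classifier that walks it (a sketch; the nested-dict representation is an assumption, not from the original slides):

```python
weather_tree = {"outlook": {
    "sunny":    {"humidity": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rainy":    {"windy": {"false": "yes", "true": "no"}},
}}

def classify(tree, instance):
    """Follow the branches matching the instance until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree

example = {"outlook": "sunny", "temperature": "mild",
           "humidity": "normal", "windy": "false"}
print(classify(weather_tree, example))  # yes
```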

The Process of Constructing a Decision Tree
Select an attribute to place at the root of the decision tree and make one branch for every possible value.
Repeat the process recursively for each branch.

Which Attribute Should Be Placed at a Certain Node?
One common approach is based on the information gained by placing a certain attribute at this node.

Information Gained by Knowing the Result of a Decision
In the weather data example, there are 9 instances in which the decision to play is "yes" and 5 instances in which it is "no".
The information gained by knowing the result of the decision is therefore
-(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits.

The General Form for Calculating the Information Gain
Entropy of a decision = -P1 log2(P1) - P2 log2(P2) - ... - Pn log2(Pn)
where P1, P2, ..., Pn are the probabilities of the n possible outcomes.
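As a quick check in Python, plugging the 9 "yes" / 5 "no" counts from the previous slide into this formula reproduces the 0.940 bits used in the gain calculations below (the info helper name is illustrative):

```python
import math

def info(counts):
    """Entropy of a decision, in bits, given the counts of each outcome."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c)

print(round(info([9, 5]), 3))  # 0.94, i.e. 0.940 bits
```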

Information Further Required If "Outlook" Is Placed at the Root
Outlook splits the instances into three branches (sunny, overcast, rainy), each holding its own "yes"/"no" counts; the information still required after the split is the weighted sum of the branch entropies.

Information Gained by Placing Each of the 4 Attributes
Gain(outlook) = 0.940 bits - 0.693 bits = 0.247 bits
Gain(temperature) = 0.029 bits
Gain(humidity) = 0.152 bits
Gain(windy) = 0.048 bits

The Strategy for Selecting an Attribute to Place at a Node
Select the attribute that gives us the largest information gain. In this example, it is the attribute "Outlook".
Splitting on Outlook gives: sunny: 2 "yes", 3 "no"; overcast: 4 "yes"; rainy: 3 "yes", 2 "no".
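With the branch counts above, the Outlook numbers on the earlier slide can be verified directly (a sketch; info is the same entropy-of-counts helper as before):

```python
import math

def info(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c)

# Class counts in each Outlook branch: sunny 2 yes / 3 no, overcast 4 / 0, rainy 3 / 2
branches = [(2, 3), (4, 0), (3, 2)]
total = sum(sum(b) for b in branches)                        # 14 instances
expected = sum(sum(b) / total * info(b) for b in branches)   # weighted branch entropy
gain = info([9, 5]) - expected                               # information gain for Outlook
print(round(expected, 3), round(gain, 3))  # 0.694 0.247 (cf. the 0.693 and 0.247 bits above)
```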

The Recursive Procedure for Constructing a Decision Tree
The operation discussed above is applied recursively to each branch to construct the decision tree.
For example, for the branch "Outlook = sunny", we evaluate the information gained by applying each of the remaining 3 attributes:
Gain(Outlook=sunny; Temperature) = 0.971 - 0.4 = 0.571
Gain(Outlook=sunny; Humidity) = 0.971 - 0 = 0.971
Gain(Outlook=sunny; Windy) = 0.971 - 0.951 = 0.02

Recursive Procedure (cont'd)
Similarly, we evaluate the information gained by applying each of the remaining 3 attributes for the branch "Outlook = rainy":
Gain(Outlook=rainy; Temperature) = 0.971 - 0.951 = 0.02
Gain(Outlook=rainy; Humidity) = 0.971 - 0.951 = 0.02
Gain(Outlook=rainy; Windy) = 0.971 - 0 = 0.971

The Over-fitting Issue
Over-fitting is caused by creating decision rules that work accurately on the training set but are based on an insufficient quantity of samples.
As a result, these decision rules may not work well in more general cases.
This is also called the "training effect".

Example of the Over-fitting Problem in Decision Tree Construction
Parent node: 11 "Yes" and 9 "No" samples; prediction = "Yes".
Splitting on attribute Ai:
  Ai=0 branch: 8 "Yes" and 9 "No" samples; prediction = "No".
  Ai=1 branch: 3 "Yes" and 0 "No" samples; prediction = "Yes".
The Ai=1 branch is based on only 3 samples, so its rule may not generalize.

Summary
Decision trees can be used to help predict the future based on past experience; that is an example of machine learning.
Trees are easy to understand.
Decision trees work more efficiently with discrete attributes.
Trees may suffer from error propagation.