Artificial Intelligence 6. Decision Tree Learning


Artificial Intelligence 6. Decision Tree Learning. Course IAT813, Simon Fraser University, Steve DiPaola. Material adapted from S. Colton, Imperial College.

What to do this Weekend? If my parents are visiting, we'll go to the cinema. If not: if it's sunny, I'll play tennis; if it's windy and I'm rich, I'll go shopping; if it's windy and I'm poor, I'll go to the cinema; if it's rainy, I'll stay in.
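Read as nested conditionals, this policy could be sketched as follows (the function and attribute names are illustrative choices, not from the slides):

```python
def weekend_decision(parents_visiting, weather, rich):
    """Illustrative nested-conditional version of the weekend policy above."""
    if parents_visiting:
        return "cinema"
    if weather == "sunny":
        return "tennis"
    if weather == "windy":
        return "shopping" if rich else "cinema"
    return "stay in"  # rainy
```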

Written as a Decision Tree (figure: the decision tree, with the root of the tree at the top and the leaves at the bottom)

Using the Decision Tree (No parents on a Sunny Day)

From Decision Trees to Logic. Decision trees can be written as Horn clauses in first order logic: read from the root to every tip, "if this and this and this … and this, then do this". In our example: if no_parents and sunny_day, then play_tennis, i.e. no_parents ∧ sunny_day → play_tennis.

Decision Tree Learning Overview. A decision tree can be seen as rules for performing a categorisation, e.g. "what kind of weekend will this be?" Remember that we are learning from examples, not turning thought processes into decision trees. We need examples put into categories, and we also need attributes for the examples: attributes describe examples (background knowledge), and each attribute takes only a finite set of values.

The ID3 Algorithm - Overview. The major question in decision tree learning is which nodes to put in which positions, including the root node and the leaf nodes. ID3 uses a measure called information gain, based on a notion of entropy ("impurity in the data"), to choose which node to put in next: the node with the highest information gain is chosen. When there are no choices, a leaf node is put on.

Entropy – General Idea. From Tom Mitchell's book: "In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples." We want a notion of impurity in data. Imagine a set of boxes with balls in them: if all the balls are in one box, this is nicely ordered, so it scores low for entropy. We calculate entropy by summing over all boxes; a box with very few examples in it scores low, and a box with almost all the examples in it also scores low.

Entropy - Formulae. Given a set of examples S: for examples in a binary categorisation, Entropy(S) = -p+log2(p+) - p-log2(p-), where p+ is the proportion of positives and p- is the proportion of negatives. For examples in categorisations c1 to cn, Entropy(S) = -Σi pi log2(pi) (summing i from 1 to n), where pi is the proportion of examples in ci.
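As a minimal sketch (not from the slides), the multi-category formula can be computed directly from the list of category labels of the examples:

```python
from collections import Counter
from math import log2

def entropy(examples):
    """Entropy of a collection, given the category label of each example.

    Empty categories never appear in the counts, which matches the
    convention that 0 * log2(0) is taken to be 0.
    """
    total = len(examples)
    counts = Counter(examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# The binary example used later on the slides: 1 positive, 3 negatives.
print(entropy(["+", "-", "-", "-"]))  # ~0.811
```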

Entropy - Explanation. Each category adds its own term to the whole measure. When pi is near to 1, (nearly) all the examples are in this category, so it should score low for its bit of the entropy: log2(pi) gets closer and closer to 0 and dominates the term, so the contribution comes to nearly 0 (which is good). When pi is near to 0, (very) few examples are in this category: log2(pi) gets larger (more negative), but it does not dominate the factor pi, so again the contribution comes to nearly 0 (which is good).

Information Gain. Given a set of examples S and an attribute A, where A can take values v1 … vm, let Sv = {examples which take value v for attribute A}. Then calculate Gain(S,A) = Entropy(S) - Σv (|Sv|/|S|) Entropy(Sv), summing over the values v that A can take. This estimates the reduction in entropy we get if we know the value of attribute A for the examples in S.
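Continuing the sketch (again not from the slides), information gain can be built on top of the entropy helper, with each example represented as an (attribute-values, category) pair:

```python
def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of (|Sv|/|S|) * Entropy(Sv).

    Each example is an (attributes_dict, category) pair.
    """
    categories = [category for _, category in examples]
    gain = entropy(categories)
    for value in {attrs[attribute] for attrs, _ in examples}:
        subset = [category for attrs, category in examples
                  if attrs[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain
```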

An Example Calculation of Information Gain. Suppose we have a set of examples S = {s1, s2, s3, s4} in a binary categorisation, with one positive example and three negative examples; the positive example is s1. Suppose also an attribute A which takes values v1, v2, v3: s1 takes value v2 for A, s2 takes value v2 for A, s3 takes value v3 for A, and s4 takes value v1 for A.

First Calculate Entropy(S). Recall that Entropy(S) = -p+log2(p+) - p-log2(p-). From the binary categorisation we know that p+ = 1/4 and p- = 3/4, hence Entropy(S) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.811. Note for users of old calculators: you may need to use the fact that log2(x) = ln(x)/ln(2). Also note that, by convention, 0*log2(0) is taken to be 0.

Calculate Gain for each Value of A. Remember that Gain(S,A) = Entropy(S) - Σv (|Sv|/|S|) Entropy(Sv), and that Sv = {set of examples with value v for A}. So Sv1 = {s4}, Sv2 = {s1, s2}, Sv3 = {s3}. Now, (|Sv1|/|S|) * Entropy(Sv1) = (1/4) * (-(0/1)*log2(0/1) - (1/1)*log2(1/1)) = (1/4) * (0 - (1)log2(1)) = (1/4)(0 - 0) = 0. Similarly, (|Sv2|/|S|) * Entropy(Sv2) = 0.5 and (|Sv3|/|S|) * Entropy(Sv3) = 0.

Final Calculation. So, we add up the three calculations and subtract them from the overall entropy of S. Final answer for the information gain: Gain(S,A) = 0.811 - (0 + 1/2 + 0) = 0.311.
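Using the hypothetical helpers sketched above, the worked example can be checked directly:

```python
S = [({"A": "v2"}, "+"),   # s1
     ({"A": "v2"}, "-"),   # s2
     ({"A": "v3"}, "-"),   # s3
     ({"A": "v1"}, "-")]   # s4

print(entropy([c for _, c in S]))   # ~0.811
print(information_gain(S, "A"))     # ~0.311
```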

The ID3 Algorithm. Given a set of examples S, described by a set of attributes Ai and categorised into categories cj: 1. Choose the root node to be the attribute A that scores highest for information gain relative to S, i.e. Gain(S,A) is the highest over all attributes. 2. For each value v that A can take, draw a branch and label it with the corresponding v. Then see the options in the next slide!

The ID3 Algorithm (continued). For each branch you have just drawn (for value v): if Sv only contains examples in category c, then put that category as a leaf node in the tree; if Sv is empty, then find the default category (the one which contains the most examples from S) and put this default category as a leaf node in the tree; otherwise, remove A from the attributes which can be put into nodes, replace S with Sv, find the new attribute A scoring best for Gain(S, A), and start again at step 2 (making sure you have replaced S with Sv).
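A compact recursive sketch of these two slides, assuming the entropy and information_gain helpers above and the same example representation (tie-breaking and other details are left out):

```python
from collections import Counter

def id3(examples, attributes):
    """Recursive sketch of ID3: returns a category (leaf) or a nested dict
    {attribute: {value: subtree}}.

    Note: this sketch only branches on attribute values that occur in the
    examples; the slides' "Sv is empty" case would need the full list of
    values each attribute can take, and would return the default category.
    """
    categories = [category for _, category in examples]
    # If all examples are in one category, that category becomes a leaf.
    if len(set(categories)) == 1:
        return categories[0]
    # If no attributes are left, fall back to the default (majority) category.
    if not attributes:
        return Counter(categories).most_common(1)[0][0]
    # Otherwise choose the attribute with the highest information gain...
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {best: {}}
    # ...and recurse on each branch, with S replaced by Sv and A removed.
    for value in {attrs[best] for attrs, _ in examples}:
        subset = [(attrs, cat) for attrs, cat in examples if attrs[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best])
    return tree
```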

Explanatory Diagram

A Worked Example

Weekend  Weather  Parents  Money  Decision (Category)
W1       Sunny    Yes      Rich   Cinema
W2       Sunny    No       Rich   Tennis
W3       Windy    Yes      Rich   Cinema
W4       Rainy    Yes      Poor   Cinema
W5       Rainy    No       Rich   Stay in
W6       Rainy    Yes      Poor   Cinema
W7       Windy    No       Poor   Cinema
W8       Windy    No       Rich   Shopping
W9       Windy    Yes      Rich   Cinema
W10      Sunny    No       Rich   Tennis
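For the sketches above (not part of the original slides), the same table can be transcribed into the (attribute-values, category) representation used earlier; the lower-case names are illustrative choices:

```python
weekend_examples = [
    ({"weather": "sunny", "parents": "yes", "money": "rich"}, "cinema"),   # W1
    ({"weather": "sunny", "parents": "no",  "money": "rich"}, "tennis"),   # W2
    ({"weather": "windy", "parents": "yes", "money": "rich"}, "cinema"),   # W3
    ({"weather": "rainy", "parents": "yes", "money": "poor"}, "cinema"),   # W4
    ({"weather": "rainy", "parents": "no",  "money": "rich"}, "stay in"),  # W5
    ({"weather": "rainy", "parents": "yes", "money": "poor"}, "cinema"),   # W6
    ({"weather": "windy", "parents": "no",  "money": "poor"}, "cinema"),   # W7
    ({"weather": "windy", "parents": "no",  "money": "rich"}, "shopping"), # W8
    ({"weather": "windy", "parents": "yes", "money": "rich"}, "cinema"),   # W9
    ({"weather": "sunny", "parents": "no",  "money": "rich"}, "tennis"),   # W10
]
```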

Information Gain for All of S. S = {W1, W2, …, W10}. Firstly, we need to calculate Entropy(S) = … = 1.571 (see notes). Next, we need to calculate the information gain for all the attributes we currently have available (which is all of them at the moment): Gain(S, weather) = … = 0.7, Gain(S, parents) = … = 0.61, Gain(S, money) = … = 0.2816. Hence weather is the first attribute to split on, because it gives us the biggest information gain.
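These numbers can be reproduced with the earlier sketches (values rounded):

```python
print(entropy([category for _, category in weekend_examples]))  # ~1.571
for attribute in ("weather", "parents", "money"):
    print(attribute, information_gain(weekend_examples, attribute))
# weather ~0.70, parents ~0.61, money ~0.2816
```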

Top of the Tree. So, this is the top of our tree: weather at the root, with branches for sunny, windy and rainy. Now we look at each branch in turn; in particular, we look at the examples with the attribute value prescribed by the branch. Ssunny = {W1, W2, W10}, and the categorisations are cinema, tennis and tennis for W1, W2 and W10. What does the algorithm say? The set is neither empty nor a single category, so we have to replace S by Ssunny and start again.

Working with Ssunny

Weekend  Weather  Parents  Money  Decision
W1       Sunny    Yes      Rich   Cinema
W2       Sunny    No       Rich   Tennis
W10      Sunny    No       Rich   Tennis

We need to choose a new attribute to split on. It cannot be weather, of course: we have already had that. So, calculate the information gain again: Gain(Ssunny, parents) = … = 0.918 and Gain(Ssunny, money) = … = 0. Hence we choose to split on parents.
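Again checkable with the earlier sketches:

```python
s_sunny = [(attrs, category) for attrs, category in weekend_examples
           if attrs["weather"] == "sunny"]
print(information_gain(s_sunny, "parents"))  # ~0.918
print(information_gain(s_sunny, "money"))    # 0.0
```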

Getting to the leaf nodes. If it's sunny and the parents have turned up then, looking at the table on the previous slide, there's only one answer: go to the cinema. If it's sunny and the parents haven't turned up then, again, there's only one answer: play tennis. Hence our decision tree so far has the sunny branch splitting on parents, with cinema and tennis as its leaf nodes.
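As a final illustration (a sketch, not from the slides), a tree in the nested-dict form returned by the id3 sketch can be used for new decisions by walking it from the root; here only the sunny branch worked out on the slides is filled in:

```python
def classify(tree, attrs):
    """Walk a nested-dict decision tree until a leaf (a category string) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][attrs[attribute]]
    return tree

# Only the branch worked out on the slides; the windy and rainy
# subtrees would be filled in by continuing the algorithm.
partial_tree = {"weather": {"sunny": {"parents": {"yes": "cinema", "no": "tennis"}}}}

print(classify(partial_tree, {"weather": "sunny", "parents": "no"}))  # tennis
```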

Avoiding Overfitting. Decision trees can be learned to fit the given data perfectly, but this is probably overfitting: the answer is a memorisation rather than a generalisation. Avoidance method 1: stop growing the tree before it reaches perfection. Avoidance method 2: grow the tree to perfection, then prune it back afterwards; this is the more useful of the two methods in practice.

Appropriate Problems for Decision Tree Learning. From Tom Mitchell's book: background concepts describe examples in terms of attribute-value pairs, and values are always finite in number; the concept to be learned (the target function) has discrete values; disjunctive descriptions might be required in the answer. Decision tree algorithms are fairly robust to errors: in the actual classifications, in the attribute-value pairs, and in missing information.