Decision Trees and Decision Tree Learning Philipp Kärger

Presentation transcript:

Outline: Decision Trees; Decision Tree Learning (ID3 Algorithm; Which attribute to split on?; Some examples); Overfitting; Where to use Decision Trees?

Decision tree representation for PlayTennis (each inner node tests an attribute, each branch is a value of that attribute, each leaf gives the classification):

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

PlayTennis: Other representations. Logical expression for PlayTennis=Yes: (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak). If-then rules:
IF Outlook=Sunny ∧ Humidity=Normal THEN PlayTennis=Yes
IF Outlook=Overcast THEN PlayTennis=Yes
IF Outlook=Rain ∧ Wind=Weak THEN PlayTennis=Yes
IF Outlook=Sunny ∧ Humidity=High THEN PlayTennis=No
IF Outlook=Rain ∧ Wind=Strong THEN PlayTennis=No
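The equivalence between the tree and the if-then rules can be checked mechanically. The following is a minimal sketch, assuming the tree is stored as nested dictionaries; that layout and the names `PLAY_TENNIS_TREE`, `classify` and `extract_rules` are illustrative choices, not part of the slides.

```python
# PlayTennis tree from the slides. Assumed layout (not from the slides):
# an inner node is {"attribute": name, "branches": {value: subtree}},
# a leaf is simply the class label as a string.
PLAY_TENNIS_TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind",
                 "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}


def classify(tree, instance):
    """Follow one root-to-leaf path: test an attribute, take the matching branch."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree


def extract_rules(tree, conditions=()):
    """Turn every root-to-leaf path into one (conditions, class) if-then rule."""
    if not isinstance(tree, dict):
        return [(conditions, tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += extract_rules(subtree, conditions + ((tree["attribute"], value),))
    return rules


for conditions, label in extract_rules(PLAY_TENNIS_TREE):
    body = " AND ".join(f"{attr}={value}" for attr, value in conditions)
    print(f"IF {body} THEN PlayTennis={label}")
```

Each leaf yields exactly one rule, so a tree with five leaves produces five rules, and the disjunction of the rules ending in Yes is the logical expression for PlayTennis=Yes.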

Decision Trees - Summary: a model of a part of the world; allows us to classify instances (by performing a sequence of tests); allows us to predict classes of (unseen) instances; understandable by humans (unlike many other representations).

Decision Tree Learning

Goal: Learn from known instances how to classify unseen instances, by means of building and exploiting a Decision Tree. Supervised or unsupervised learning?

Classification Task. Application: classification of medical patients by their disease. From the seen patients, a Decision Tree is learned: rules telling which attributes of a patient indicate a disease. For an unseen patient, the tree is applied by checking the attributes of that patient.

Exercise: create two decision trees over the attributes sunny and odd day for the class play tennis (yes, no).

Basic algorithm: ID3 (simplified). ID3 = Iterative Dichotomiser 3.
- given a goal class to build the tree for
- create a root node for the tree
- if all examples from the training set belong to the same goal class C, then label the root with C
- else select the ‘most informative’ attribute A, split the training set according to the values V1..Vn of A, recursively build the resulting subtrees T1 … Tn, and generate the decision tree T with root A, branches v1 ... vn and subtrees T1 ... Tn
[Figure: example table with columns A1=weather, A2=day and happy, next to a tree that tests Humidity with branches Low and High and leaves No and Yes]
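The recursion can be sketched in a few lines of Python. This is a simplified illustration, not the author's code. It assumes that "most informative" means highest information gain (the heuristic developed later in the deck). It uses the well-known 14-example PlayTennis table from Mitchell (1997), which this transcript does not reproduce. It also adds a majority-class fallback for the case where no attributes are left, a safeguard the simplified slide version does not mention.

```python
import math
from collections import Counter


def entropy(labels):
    """Impurity of a list of class labels, in bits."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the set minus the weighted entropy of its partitions."""
    n = len(rows)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attributes):
    if len(set(labels)) == 1:      # all examples belong to one class C
        return labels[0]           # -> leaf labelled C
    if not attributes:             # assumption: fall back to the majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted(set(row[best] for row in rows)):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in keep],
                              [labels[i] for i in keep], remaining)
    return {"attribute": best, "branches": branches}


# PlayTennis examples D1..D14 (Mitchell 1997); last column is the class.
COLUMNS = ("Outlook", "Temperature", "Humidity", "Wind")
TABLE = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
rows, labels = [], []
for line in TABLE.strip().splitlines():
    *values, label = line.split()
    rows.append(dict(zip(COLUMNS, values)))
    labels.append(label)

tree = id3(rows, labels, list(COLUMNS))
print(tree)
```

On this data the sketch selects Outlook at the root, then Humidity under Sunny and Wind under Rain, which is the PlayTennis tree shown at the start of the deck.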

Finding the right attribute A to split on is tricky. Lessons learned: there is always more than one decision tree; finding the “best” one is NP-complete; all the known algorithms use heuristics.

Decision trees - Binary decision trees: Since each inequality used to split the input space is based on only one input variable, each node draws a boundary that can be geometrically interpreted as a hyperplane perpendicular to the axis of that variable.

Search heuristics in ID3: Which attribute should we split on? We need a heuristic: some function that gives big numbers for “good” splits. We want to get to “pure” sets. How can we measure “pure”? [Figure: the example days grouped by day (odd, even) and by weather (sunny, rain)]

Entropy: S - example set; C1,...,CN - classes. Entropy E(S) is a measure of the impurity of the training set S: $E(S) = -\sum_{c=1}^{N} p_c \log_2 p_c$, where $p_c$ = probability of class $C_c$. Entropy in binary classification problems: $E(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$.

Entropy: $E(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$. [Figure: the entropy function relative to a Boolean classification, as the proportion $p_+$ of positive examples varies between 0 and 1]

What is entropy? Entropy E(S) = expected amount of information (in bits) needed to assign a class to a randomly drawn object in S, under the optimal, shortest-length code. Information theory: an optimal-length code assigns $-\log_2 p$ bits to a message having probability p. So, in binary classification problems, the expected number of bits to encode + or - of a random member of S is: $p_+ (-\log_2 p_+) + p_- (-\log_2 p_-) = -p_+ \log_2 p_+ - p_- \log_2 p_-$.
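The entropy definitions above translate directly into code. A minimal sketch follows; the function names are illustrative, and terms with probability 0 are taken to contribute 0 bits (the usual convention).

```python
import math


def entropy(probabilities):
    """E(S) = -sum_c p_c * log2(p_c); classes with p_c = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def binary_entropy(p_pos):
    """Entropy of a two-class set with a proportion p_pos of positive examples."""
    return entropy([p_pos, 1.0 - p_pos])


# Trace the curve of the Boolean entropy function at a few points.
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p+ = {p:.2f}  E(S) = {binary_entropy(p):.3f} bits")
```

The function is 0 for a pure set (p+ = 0 or p+ = 1) and reaches its maximum of 1 bit at p+ = 0.5, where the two classes are equally likely.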

PlayTennis: Training examples

PlayTennis: Entropy. Training set S: 14 examples (9 pos., 5 neg.). Notation: S = [9+, 5-]. $E(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$. Computing entropy, if probability is estimated by relative frequency: $E([9+,5-]) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940$.

PlayTennis: Entropy of the subsets produced by each candidate split. $E(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$, $E([9+,5-]) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940$.
Outlook? Sunny: {D1,D2,D8,D9,D11} [2+, 3-] E=0.971; Overcast: {D3,D7,D12,D13} [4+, 0-] E=0; Rain: {D4,D5,D6,D10,D14} [3+, 2-] E=0.971
Humidity? High: [3+, 4-] E=0.985; Normal: [6+, 1-] E=0.592
Wind? Weak: [6+, 2-] E=0.811; Strong: [3+, 3-] E=1.00
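From these class counts alone one can already rank the three candidate splits. The sketch below is an illustration with made-up function names. It computes, for each attribute, the entropy of S minus the size-weighted entropy of the subsets, which is the information gain defined on the following slides.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits of a set described by its per-class counts, e.g. (9, 5)."""
    total = sum(counts)
    return sum(-(k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(parent_counts, partitions):
    """Parent entropy minus the weighted average entropy of the partitions."""
    n = sum(parent_counts)
    residual = sum(sum(part) / n * entropy_from_counts(part) for part in partitions)
    return entropy_from_counts(parent_counts) - residual


S = (9, 5)  # [9+, 5-]
SPLITS = {  # (positive, negative) counts per attribute value, as on the slide
    "Outlook": [(2, 3), (4, 0), (3, 2)],   # Sunny, Overcast, Rain
    "Humidity": [(3, 4), (6, 1)],          # High, Normal
    "Wind": [(6, 2), (3, 3)],              # Weak, Strong
}
for attribute, partitions in SPLITS.items():
    print(f"Gain(S, {attribute}) = {information_gain(S, partitions):.3f}")
```

Outlook yields by far the largest reduction in entropy, so it is the attribute to test at the root. Values computed at full precision can differ in the third decimal from figures that were obtained by subtracting already rounded entropies.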

Maximizing Purity and Minimizing Disorder: Select the attribute which partitions the learning set into subsets that are as “pure” as possible. Knowing the “when” attribute values provides larger information gain than “where”. Therefore the “when” attribute should be chosen for testing prior to the “where” attribute.

Splitting Based on INFO... Information Gain: $GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$, where the parent node p with n records is split into k partitions and $n_i$ is the number of records in partition i. Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN). Used in ID3 and C4.5. Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

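Splitting Based on INFO... Gain Ratio: $GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}$ with $SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$, where $GAIN_{split}$ is the information gain of the split, the parent node p with n records is split into k partitions, and $n_i$ is the number of records in partition i. Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized! Used in C4.5. Designed to overcome the disadvantage of Information Gain.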
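To see the penalty at work, the sketch below compares the PlayTennis split on Outlook with a hypothetical identifier-like attribute that puts each of the 14 examples into its own partition. The identifier attribute and all function names are assumptions made for this illustration.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits of a distribution given as counts."""
    total = sum(counts)
    return sum(-(k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(parent_counts, partitions):
    n = sum(parent_counts)
    residual = sum(sum(part) / n * entropy_from_counts(part) for part in partitions)
    return entropy_from_counts(parent_counts) - residual


def split_info(partitions):
    """Entropy of the partition sizes themselves (class labels are ignored)."""
    return entropy_from_counts([sum(part) for part in partitions])


def gain_ratio(parent_counts, partitions):
    info = split_info(partitions)
    # A split into a single partition carries no information; report ratio 0.
    return information_gain(parent_counts, partitions) / info if info > 0 else 0.0


S = (9, 5)
outlook = [(2, 3), (4, 0), (3, 2)]            # three partitions
unique_id = [(1, 0)] * 9 + [(0, 1)] * 5       # hypothetical: 14 pure singletons

for name, parts in (("Outlook", outlook), ("unique id", unique_id)):
    print(f"{name:10s} gain = {information_gain(S, parts):.3f}  "
          f"split info = {split_info(parts):.3f}  "
          f"gain ratio = {gain_ratio(S, parts):.3f}")
```

The identifier attribute has the maximal possible gain because every singleton is pure, but its SplitINFO is log2(14), so dividing by it shrinks its lead over Outlook considerably. On a data set this small the penalty narrows the gap without necessarily reversing the ranking.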

Measuring Information: Entropy. The average amount of information I needed to classify an object is given by the entropy measure $I = -\sum_{c} p(c) \log_2 p(c)$, where p(c) = probability of class $C_c$ and the sum runs over all classes. It tells us how to measure the information needed for classification. Intuition: it measures the disorder in the dataset, i.e. the impurity of the training set S. [Figure: entropy for a two-class problem as a function of p(c)]

Information-Theoretic Approach: To classify an object, a certain amount of information I is needed. After we have learned the value of attribute A, we only need some remaining amount of information to classify an object: Ires, the residual information. The most ‘informative’ attribute is the one that minimizes Ires, i.e., maximizes the gain: Gain(A) = I - Ires(A). What does it mean to have Ires = 0?

What is the entropy of the set of happy/unhappy days? [Table: days described by A1=weather (sun, rain) and A2=day (odd, even), with class happy (yes, no)]

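Residual Information: After applying attribute A, S is partitioned into subsets according to the values v of A. Ires represents the amount of information still needed to classify an instance. Ires is equal to the weighted sum of the amounts of information for the subsets: $I_{res}(A) = -\sum_{v} p(v) \sum_{c} p(c|v) \log_2 p(c|v)$, where p(v) = probability that an instance has value v for A, p(c|v) = probability that an instance belongs to class C given that it belongs to v, and the inner sum $-\sum_{c} p(c|v) \log_2 p(c|v)$ is the information I(v) of subset v.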

What is Ires(A) if I split for “weather” and if I split for “day”? [Table: days described by A1=weather (sun, rain) and A2=day (odd, even), with class happy (yes, no)] Ires(weather) = 0, Ires(day) = 1.
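These two answers can be reproduced with a direct implementation of the residual information. The transcript does not preserve the example table, so the four rows below are a hypothetical reconstruction chosen to be consistent with the stated results (weather fully determines happy, day says nothing about it). The function names are likewise illustrative.

```python
import math
from collections import Counter, defaultdict


def information(labels):
    """I = -sum_c p(c) log2 p(c), with p(c) estimated by relative frequency."""
    n = len(labels)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())


def residual_information(rows, attribute, target):
    """Ires(A) = sum_v p(v) * I(v): weighted information of the subsets."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    n = len(rows)
    return sum(len(labels) / n * information(labels) for labels in subsets.values())


# Hypothetical table (assumption): happy depends on weather only.
DAYS = [
    {"weather": "sun", "day": "odd", "happy": "yes"},
    {"weather": "rain", "day": "odd", "happy": "no"},
    {"weather": "sun", "day": "even", "happy": "yes"},
    {"weather": "rain", "day": "even", "happy": "no"},
]

total = information([row["happy"] for row in DAYS])
for attribute in ("weather", "day"):
    ires = residual_information(DAYS, attribute, "happy")
    print(f"Ires({attribute}) = {ires:.1f}   Gain({attribute}) = {total - ires:.1f}")
```

Ires = 0 means every subset after the split is pure, so no further information is needed to classify an instance. Ires equal to the original information means the split has told us nothing.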

Information Gain = the amount of information I rule out by splitting on attribute A: Gain(A) = I - Ires(A) = information in the current set minus the residual information after splitting. The most ‘informative’ attribute is the one that minimizes Ires, i.e., maximizes the Gain.

Triangles and Squares

Data Set: A set of classified objects. [Figure: the triangles and squares]

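Entropy: 5 triangles, 9 squares. Class probabilities: p(triangle) = 5/14 ≈ 0.357, p(square) = 9/14 ≈ 0.643. Entropy of the data set: $I = -(5/14) \log_2(5/14) - (9/14) \log_2(9/14) = 0.940$ bits.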

Entropy reduction by data set partitioning. [Figure: the data set split by Color? into red, yellow and green subsets]

Residual information. [Figure: the Color? split (red, green, yellow) annotated with the residual information]

Information Gain. [Figure: the Color? split (red, green, yellow) annotated with the information gain]

Information Gain of the Attributes: Gain(Color) = 0.246, Gain(Outline) = 0.151, Gain(Dot) = 0.048. Heuristic: the attribute with the highest gain is chosen. This heuristic is local (local minimization of impurity).

Within one impure subset of the Color? split (red, green, yellow): Gain(Outline) = 0.971 - 0 = 0.971 bits, Gain(Dot) = 0.971 - 0.951 = 0.020 bits. That subset is therefore split on Outline? (solid, dashed).

Within the other impure subset of the Color? split: Gain(Outline) = 0.971 - 0.951 = 0.020 bits, Gain(Dot) = 0.971 - 0 = 0.971 bits. That subset is therefore split on Dot? (yes, no).

[Figure: the data set partitioned by Color? (red, green, yellow), then by Dot? (yes, no) and by Outline? (solid, dashed)]

Decision Tree:

Color
  red -> Dot
    yes -> triangle
    no -> square
  yellow -> square
  green -> Outline
    dashed -> triangle
    solid -> square

A Defect of Ires: Ires favors attributes with many values. Such an attribute splits S into many subsets, and if these are small, they will tend to be pure anyway. One way to rectify this is through a corrected measure, the information gain ratio. [Table: the happy-days example with columns A1=weather, A2=day and happy, where day is now a date (17.1.08, 18.1.08, 19.1.08, 20.1.08, 21.1.08)]

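Information Gain Ratio: I(A) is the amount of information needed to determine the value of an attribute A: $I(A) = -\sum_{v} p(v) \log_2 p(v)$. Information gain ratio: $GainRatio(A) = \frac{Gain(A)}{I(A)} = \frac{I - I_{res}(A)}{I(A)}$.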

Information Gain Ratio. [Figure: the Color? split (red, green, yellow) annotated with its information gain ratio]

Information Gain and Information Gain Ratio

Overfitting (Example)

Overfitting. Underfitting: when the model is too simple, both training and test errors are large.

Overfitting due to Noise: The decision boundary is distorted by a noise point.

Overfitting due to Insufficient Examples: Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region. An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Notes on Overfitting: Overfitting results in decision trees that are more complex than necessary. Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.

How to Address Overfitting. Idea: prune the tree so that it is not too specific. Two possibilities: Pre-Pruning (prune while building the tree) and Post-Pruning (prune after building the tree).

How to Address Overfitting: Pre-Pruning (Early Stopping Rule). Stop the algorithm before the tree is fully grown. More restrictive stopping conditions: stop if the number of instances is less than some user-specified threshold; stop if expanding the current node does not improve impurity measures (e.g., information gain). Not successful in practice.

How to Address Overfitting… Post-pruning: Grow the decision tree to its entirety. Trim the nodes of the decision tree in a bottom-up fashion. If the generalization error improves after trimming, replace the sub-tree by a leaf node. The class label of the leaf node is determined from the majority class of instances in the sub-tree.

Example of Post-Pruning. Node before splitting: Class = Yes 20, Class = No 10, Error = 10/30. Training Error (Before splitting) = 10/30; Pessimistic error = (10 + 0.5)/30 = 10.5/30. After splitting into four leaves (Class = Yes 8, Class = No 4; Class = Yes 3, Class = No 4; Class = Yes 4, Class = No 1; Class = Yes 5, Class = No 1): Training Error (After splitting) = 9/30; Pessimistic error (After splitting) = (9 + 4 × 0.5)/30 = 11/30. PRUNE!

Examples of Post-pruning. Case 1: a node split into two leaves with (C0: 11, C1: 3) and (C0: 2, C1: 4). Case 2: a node split into two leaves with (C0: 14, C1: 3) and (C0: 2, C1: 2). Optimistic error? Don’t prune for both cases. Pessimistic error? Don’t prune case 1, prune case 2. Reduced error pruning? Depends on validation set.
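The pruning decisions for these cases, and for the 30-instance example before them, follow from a few lines of arithmetic. The sketch below is an illustration with assumed function names. Each leaf is a tuple of class counts, the pessimistic estimate adds a penalty of 0.5 per leaf as in the example, and a sub-tree is pruned only if collapsing it into one leaf strictly lowers the estimated error.

```python
def optimistic_error(leaves):
    """Training error: every leaf predicts its majority class."""
    errors = sum(sum(leaf) - max(leaf) for leaf in leaves)
    return errors / sum(sum(leaf) for leaf in leaves)


def pessimistic_error(leaves, penalty=0.5):
    """Training errors plus a fixed penalty per leaf, divided by the instances."""
    errors = sum(sum(leaf) - max(leaf) for leaf in leaves)
    return (errors + penalty * len(leaves)) / sum(sum(leaf) for leaf in leaves)


def should_prune(children, error):
    """Prune if replacing the sub-tree by a single leaf improves the estimate."""
    parent = [tuple(map(sum, zip(*children)))]   # merge the children's class counts
    return error(parent) < error(children)


EXAMPLE = [(8, 4), (3, 4), (4, 1), (5, 1)]   # (Yes, No) per leaf, 30 instances
CASE_1 = [(11, 3), (2, 4)]                   # (C0, C1) per leaf
CASE_2 = [(14, 3), (2, 2)]

for name, leaves in (("example", EXAMPLE), ("case 1", CASE_1), ("case 2", CASE_2)):
    print(f"{name}: optimistic -> prune={should_prune(leaves, optimistic_error)}, "
          f"pessimistic -> prune={should_prune(leaves, pessimistic_error)}")
```

With the optimistic estimate a split can never look worse than its parent, so nothing is pruned. The per-leaf penalty is what makes the four-leaf example and case 2 come out in favour of pruning.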

Oblique Decision Trees. Test condition x + y < 1: Class = + if it holds, Class = - otherwise. The test condition may involve multiple attributes. More expressive representation. Finding the optimal test condition is computationally expensive.
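The difference from an ordinary node is only in the test itself. The sketch below uses illustrative function names and assumes that the + class lies on the side where the condition holds. It contrasts a test on a single attribute with the oblique test x + y < 1.

```python
def axis_parallel_test(point, attribute, threshold):
    """Ordinary tree node: compares a single attribute with a threshold."""
    return point[attribute] < threshold


def oblique_test(point, weights, threshold):
    """Oblique node: compares a weighted sum of several attributes with a threshold."""
    return sum(weight * point[name] for name, weight in weights.items()) < threshold


def classify(point):
    """One-node oblique tree for the condition x + y < 1 (assumed: '+' if it holds)."""
    return "+" if oblique_test(point, {"x": 1.0, "y": 1.0}, 1.0) else "-"


for point in ({"x": 0.2, "y": 0.3}, {"x": 0.9, "y": 0.3}, {"x": 0.3, "y": 0.9}):
    print(point, "->", classify(point),
          "| x < 0.5:", axis_parallel_test(point, "x", 0.5))
```

A single oblique node separates the two regions along a diagonal line, whereas axis-parallel tests would need a staircase of several nodes to approximate the same boundary.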

Occam’s Razor: Given two models with similar generalization errors, one should prefer the simpler model over the more complex model. For complex models, there is a greater chance that the model was fitted accidentally by errors in the data. Therefore, one should prefer less complex models in general.

When to use Decision Tree Learning?

Appropriate problems for decision tree learning: Classification problems. Characteristics: instances are described by attribute-value pairs; the target function has discrete output values; the training data may be noisy; the training data may contain missing attribute values.

Strengths: can generate understandable rules; perform classification without much computation; can handle continuous and categorical variables; provide a clear indication of which fields are most important for prediction or classification.

Weaknesses: Not suitable for predicting continuous attributes. Perform poorly with many classes and little data. Computationally expensive to train: at each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive, since many potential sub-trees must be formed and compared.