Decision Tree Learning


Decision Tree Learning Lehrstuhl für Informatik 2 Gabriella Kókai: Machine Learning

Contents Introduction Decision Tree Representation Appropriate Problems for Decision Tree Learning The Basic Decision Tree Learning Algorithm (ID3) Hypothesis Space Search in Decision Tree Learning Inductive Bias in Decision Tree Learning Issues in Decision Tree Learning Summary

Introduction A widely used practical method for inductive inference Approximates discrete-valued target functions Searches a completely expressive hypothesis space Inductive bias: a preference for small trees over large ones Robust to noisy data and capable of learning disjunctive expressions Learned trees can also be re-represented as sets of if-then rules Algorithms: ID3, ASSISTANT, C4.5

Contents Introduction Decision Tree Representation Appropriate Problems for Decision Tree Learning The Basic Decision Tree Learning Algorithm (ID3) Hypothesis Space Search in Decision Tree Learning Inductive Bias in Decision Tree Learning Issues in Decision Tree Learning Summary

Decision Tree Representation A tree classifies instances: Node: an attribute that describes an instance Branch: a possible value of that attribute Leaf: the class to which the instances belong Procedure (for classifying): an instance is classified by starting at the root node of the tree, then repeatedly: - test the attribute specified by the node - move down the tree branch corresponding to the attribute's value in the given example Example: the instance is classified as a negative example In general: a decision tree represents a disjunction of conjunctions of constraints on the attribute values of the instances
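To make the classification procedure concrete, here is a minimal Python sketch (not from the original slides). It assumes a nested-dict tree representation and hand-codes the usual PlayTennis tree; both the representation and the attribute names are illustrative assumptions.

def classify(tree, instance):
    # Walk from the root: test the attribute named at the current node,
    # follow the branch matching the instance's value, repeat until a leaf.
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

# Illustrative tree: internal nodes carry an attribute and one branch per
# attribute value; leaves are class labels.
play_tennis_tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind",
                 "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

# An instance with Outlook = Sunny and Humidity = High reaches the "No" leaf,
# i.e. it is classified as a negative example.
print(classify(play_tennis_tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))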

Decision Tree Representation 2

Contents Introduction Decision Tree Representation Appropriate Problems for Decision Tree Learning The Basic Decision Tree Learning Algorithm (ID3) Hypothesis Space Search in Decision Tree Learning Inductive Bias in Decision Tree Learning Issues in Decision Tree Learning Summary

Appropriate Problems for Decision Tree Learning Decision tree learning is generally best suited to problems with the following characteristics: Instances are represented by attribute-value pairs: easiest case: each attribute takes on a small number of disjoint possible values extension: handling real-valued attributes The target function has discrete output values: extension: learning functions with more than two possible output values Disjunctive descriptions may be required The training data may contain errors: errors in the classification of the training examples errors in the attribute values that describe these examples The training data may contain missing attribute values Classification problems: problems in which the task is to classify examples into one of a discrete set of possible categories

Contents Introduction Decision Tree Representation Appropriate Problems for Decision Tree Learning The Basic Decision Tree Learning Algorithm (ID3) Hypothesis Space Search in Decision Tree Learning Inductive Bias in Decision Tree Learning Issues in Decision Tree Learning Summary

The Basic Decision Tree Learning Algorithm Top-down, greedy search through the space of possible decision trees ID3 (Quinlan 1986), C4.5 (Quinlan 1993) and other variants Question: Which attribute should be tested at a node of the tree? Answer: A statistical test selects the best attribute (how well it alone classifies the training examples) Descendants of the root node are created (one for each possible value of this attribute) and the training examples are sorted to the appropriate descendant node The process is then repeated for each descendant The algorithm never backtracks to reconsider earlier choices

The Basic Decision Tree Learning Algorithm 2
ID3(examples, target_attr, attributes)
  Create a root node for the tree
  if all examples are positive, return root with label +
  if all examples are negative, return root with label -
  if attributes is empty, return root with label = most common value of target_attr in examples
  otherwise:
    A = the attribute in attributes with the highest information gain on examples
    the decision attribute for root = A
    for each possible value vi of A:
      add a new branch below root for A = vi
      examples_vi = the subset of examples with value vi for A
      if examples_vi is empty: below this branch add a leaf with label = most common value of target_attr in examples
      else: below this branch add the subtree ID3(examples_vi, target_attr, attributes - {A})
  return root
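A runnable Python sketch of the pseudocode above (not part of the original slides): examples are assumed to be dicts mapping attribute names to values, and target names the class attribute; these representational choices are illustrative.

import math
from collections import Counter

def entropy(examples, target):
    # Entropy(S) = -sum_i p_i log2(p_i) over the class proportions p_i.
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v).
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset, target)
    return entropy(examples, target) - remainder

def most_common_label(examples, target):
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def id3(examples, attributes, target):
    labels = {ex[target] for ex in examples}
    if len(labels) == 1:              # all examples positive, or all negative
        return labels.pop()
    if not attributes:                # no attributes left to test
        return most_common_label(examples, target)
    # A = the attribute with the highest information gain on these examples
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        node["branches"][value] = id3(subset, remaining, target)
    # (Attribute values that occur in no training example would, per the
    #  pseudocode, get a leaf labelled with the most common class instead.)
    return node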

Which Attribute Is the Best Classifier INFORMATION GAIN: a measure of how well a given attribute separates the training examples ENTROPY: characterizes the (im)purity of an arbitrary collection of examples Given a collection S of positive and negative examples, with p+ the proportion of positive examples in S and p- the proportion of negative examples in S: Entropy(S) = -p+ log2(p+) - p- log2(p-) Example: Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 Notice: Entropy is 0 if all members belong to the same class Entropy is 1 when the collection contains an equal number of positive and negative examples

Which Attribute Is the Best Classifier 2 Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S Generally, for c classes: Entropy(S) = -sum_{i=1..c} p_i log2(p_i), where p_i is the proportion of S belonging to class i The entropy function relative to a boolean classification varies between 0 and 1 as the proportion of positive examples varies from 0 to 1
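A small Python check of the entropy definition (illustrative, not from the slides); it reproduces the 0.940 value quoted above for the [9+, 5-] collection.

import math

def entropy(class_counts):
    # Entropy = -sum_i p_i log2(p_i), with 0 log 0 taken as 0.
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy([9, 5]), 3))  # 0.94, the [9+, 5-] collection above
print(entropy([7, 7]))            # 1.0 when + and - are equally frequent
print(entropy([14, 0]))           # 0.0 when all members belong to one class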

Information Gain Measures the Expected Reduction in Entropy INFORMATION GAIN Gain(S, A): the expected reduction in entropy caused by partitioning the examples according to attribute A: Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) Entropy(S_v), where Values(A) is the set of all possible values for A and S_v is the subset of S for which A has value v Example: Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-], so Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048
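A quick numeric check of Gain(S, Wind) in Python (illustrative; the subset counts [6+, 2-] and [3+, 3-] are taken from Mitchell's PlayTennis data, since the slide's table is not reproduced here).

import math

def binary_entropy(pos, neg):
    # Two-class entropy from positive/negative counts.
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

# S = [9+, 5-]; splitting on Wind gives S_Weak = [6+, 2-] and S_Strong = [3+, 3-].
gain_wind = (binary_entropy(9, 5)
             - (8 / 14) * binary_entropy(6, 2)
             - (6 / 14) * binary_entropy(3, 3))
print(round(gain_wind, 3))  # 0.048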

Information Gain Measures the Expected Reduction in Entropy 2

An Illustrative Example ID3 determines the information gain for each candidate attribute: Gain(S, Outlook) = 0.246 Gain(S, Humidity) = 0.151 Gain(S, Wind) = 0.048 Gain(S, Temperature) = 0.029 Outlook provides the best prediction and is chosen as the root; for Outlook = Overcast all examples are positive, so that branch becomes a leaf
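As a usage note (illustrative), choosing the root attribute is just an argmax over these gains:

gains = {"Outlook": 0.246, "Humidity": 0.151, "Wind": 0.048, "Temperature": 0.029}
root_attribute = max(gains, key=gains.get)
print(root_attribute)  # Outlook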

An Illustrative Example 2

An Illustrative Example 3 The process continues for each new leaf node until either: Every attribute has already been included along the path through the tree, or The training examples associated with the leaf node all have the same target attribute value

Contents Introduction Decision Tree Representation Appropriate Problems for Decision Tree Learning The Basic Decision Tree Learning Algorithm (ID3) Hypothesis Space Search in Decision Tree Learning Inductive Bias in Decision Tree Learning Issues in Decision Tree Learning Summary

Hypothesis Space Search in Decision Tree Learning Hypothesis space for ID3: the set of possible decision trees ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space Beginning: the empty tree Progressively considering: more elaborate hypotheses Evaluation function guiding the search: information gain

Hypothesis Space Search in Decision Tree Learning 2 Capabilities and limitations: ID3's hypothesis space of all decision trees is the complete space of finite discrete-valued functions, relative to the available attributes => every finite discrete-valued function can be represented by some decision tree => avoids the risk that the hypothesis space might not contain the target function Maintains only a single current hypothesis (in contrast to the Candidate-Elimination algorithm) No backtracking in the search => may converge to a locally optimal solution Uses all training examples at each step => the resulting search is much less sensitive to errors in individual training examples

Contents Introduction Decision Tree Representation Appropriate Problems for Decision Tree Learning The Basic Decision Tree Learning Algorithm (ID3) Hypothesis Space Search in Decision Tree Learning Inductive Bias in Decision Tree Learning Issues in Decision Tree Learning Summary

Inductive Bias in Decision Tree Learning INDUCTIVE BIAS: the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances In ID3: the basis is how it chooses one consistent hypothesis over the others ID3's search strategy: favours shorter trees over larger ones selects trees that place the attributes with highest information gain closest to the root The bias is difficult to characterise precisely, but approximately: shorter trees are preferred over larger ones One could imagine an algorithm like ID3 that performs a breadth-first search for the shortest consistent tree (BFS-ID3) ID3 can be viewed as an efficient approximation of BFS-ID3, but it exhibits a more complex bias: it does not always find the shortest consistent tree

Inductive Bias in Decision Tree Learning 2 A closer approximation to the inductive bias of ID3: Shorter trees are preferred over longer trees. Trees that place high-information-gain attributes close to the root are preferred over those that do not. Occam's razor: prefer the simplest hypothesis that fits the data

Restriction Biases and Preference Biases Difference between the inductive bias exhibited by ID3 and by Candidate-Elimination: ID3 searches a complete hypothesis space incompletely; Candidate-Elimination searches an incomplete hypothesis space completely The inductive bias of ID3 follows from its search strategy; the inductive bias of the Candidate-Elimination algorithm follows from the definition of its search space The inductive bias of ID3 is thus a preference for certain hypotheses over others (a preference bias); the bias of the Candidate-Elimination algorithm takes the form of a categorical restriction on the set of hypotheses considered (a restriction bias) Typically a preference bias is more desirable than a restriction bias (the learner can work within a complete hypothesis space) A restriction bias (strictly limiting the set of potential hypotheses) is generally less desirable, because it may exclude the unknown target function

Contents Introduction Decision Tree Representation Appropriate Problems for Decision Tree Learning The Basic Decision Tree Learning Algorithm (ID3) Hypothesis Space Search in Decision Tree Learning Inductive Bias in Decision Tree Learning Issues in Decision Tree Learning Summary

Issues in Decision Tree Learning Include: Determining how deeply to grow the decision tree Handling continuous attributes Choosing an appropriate attribute-selection measure Handling training data with missing attribute values Handling attributes with differing costs Improving computational efficiency

Avoiding Overfitting the Data Definition: Given a hypothesis space H, a hypothesis h in H is said to overfit the training data if there exists some alternative hypothesis h' in H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances

Avoiding Overfitting the Data 2 How can it be possible that a tree h fits the training examples better than h' but performs more poorly over subsequent examples? The training examples contain random errors or noise example: adding a positive training example incorrectly labeled as negative result: the new example is sorted to the leaf that previously held D9 and D11, and ID3 searches for further refinements below this node Small numbers of examples are associated with leaf nodes (coincidental regularities) An experimental study of ID3 involving five different learning tasks (noisy, nondeterministic data) found that overfitting decreased accuracy by 10-20% APPROACHES: Stop growing the tree early, before it reaches the point where it perfectly classifies the training data Allow the tree to overfit the data and then post-prune it

Avoiding Overfitting the Data 3 Criteria for determining the correct final tree size: Training and validation set: use a set of examples held out from the training examples to evaluate the utility of post-pruning nodes from the tree Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set Use an explicit measure of the complexity of encoding the training examples and the decision tree

Reduced Error Pruning How exactly might a validation set be used to prevent overfitting? Reduced-error pruning: Consider each of the decision nodes as a candidate for pruning Pruning a node means removing the subtree rooted at that node and replacing it with a leaf whose label is the most common class of the training examples associated with that node Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set Nodes are pruned iteratively, always choosing the node whose removal most increases the accuracy of the decision tree over the validation set Continue until further pruning is harmful (i.e. decreases accuracy over the validation set)
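A minimal Python sketch of reduced-error pruning (not from the slides), assuming the nested-dict tree representation used in the earlier snippets; the helper names and the greedy loop structure are illustrative.

import copy
from collections import Counter

def classify(tree, instance):
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def candidate_prunes(tree, examples, target, path=()):
    # Yield (path, leaf_label) for every decision node reached by training
    # examples; the label is the most common class among those examples.
    if not isinstance(tree, dict) or not examples:
        return
    yield path, Counter(ex[target] for ex in examples).most_common(1)[0][0]
    for value, subtree in tree["branches"].items():
        subset = [ex for ex in examples if ex[tree["attribute"]] == value]
        yield from candidate_prunes(subtree, subset, target, path + (value,))

def replace_subtree(tree, path, leaf):
    # Return a copy of the tree with the node at `path` replaced by `leaf`.
    if not path:
        return leaf
    pruned = copy.deepcopy(tree)
    node = pruned
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = leaf
    return pruned

def reduced_error_prune(tree, train, validation, target):
    # Greedily prune the node whose replacement by a majority-class leaf most
    # improves (or at least preserves) accuracy on the validation set; stop
    # when no further pruning helps.
    while isinstance(tree, dict):
        best_tree, best_acc = None, accuracy(tree, validation, target)
        for path, label in candidate_prunes(tree, train, target):
            candidate = replace_subtree(tree, path, label)
            acc = accuracy(candidate, validation, target)
            if acc >= best_acc:
                best_tree, best_acc = candidate, acc
        if best_tree is None:
            break
        tree = best_tree
    return tree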

Reduced Error Pruning 2 Here the validation set used for pruning is distinct from both the training and test sets Disadvantage: when data is limited, withholding part of it for the validation set further reduces the number of examples available for training Many additional techniques have been proposed

Rule Post-Pruning Steps: 1. Infer the decision tree from the training data, growing the tree until the training data are fit as well as possible and allowing overfitting to occur 2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root to a leaf node 3. Prune each rule by removing any preconditions whose removal improves its estimated accuracy 4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances
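Step 2 of the procedure above (one rule per root-to-leaf path) can be sketched in Python as follows; the nested-dict tree form is the same illustrative assumption as in the earlier snippets.

def tree_to_rules(tree, path=()):
    # Each root-to-leaf path becomes one rule: the antecedent is the list of
    # (attribute, value) tests along the path, the consequent the leaf label.
    if not isinstance(tree, dict):
        return [(list(path), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules.extend(tree_to_rules(subtree, path + ((tree["attribute"], value),)))
    return rules

tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind",
                 "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

for antecedent, label in tree_to_rules(tree):
    tests = " AND ".join(f"({attr} = {val})" for attr, val in antecedent)
    print(f"IF {tests} THEN PlayTennis = {label}")
    # e.g.  IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No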

Rule Post-Pruning 2 One rule is generated for each leaf node in the tree Antecedent: each attribute test along the path from the root to the leaf Consequent: the classification at the leaf Example (from the PlayTennis tree): IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No Pruning: remove any antecedent whose removal does not worsen the rule's estimated accuracy, e.g. consider removing (Outlook = Sunny) or (Humidity = High) C4.5 evaluates performance by using a pessimistic estimate: calculate the rule accuracy over the training examples calculate the standard deviation of this estimated accuracy assuming a binomial distribution for a given confidence level, the lower-bound estimate is then taken as the measure of rule performance Advantage: for large data sets the pessimistic estimate is very close to the observed accuracy
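The pessimistic estimate described above can be illustrated with a small Python sketch (the z-value and function name are assumptions for illustration; C4.5's exact formula differs in detail):

import math

def pessimistic_accuracy(correct, total, z=1.96):
    # Observed rule accuracy minus z standard deviations of a binomial
    # proportion; z = 1.96 corresponds roughly to a 95% confidence level.
    acc = correct / total
    return acc - z * math.sqrt(acc * (1.0 - acc) / total)

print(round(pessimistic_accuracy(14, 16), 3))      # small sample: well below 14/16 = 0.875
print(round(pessimistic_accuracy(1400, 1600), 3))  # large sample: close to the observed 0.875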

Rule Post-Pruning 3 Why is it good to convert the decision tree to rules before pruning? It allows distinguishing among the different contexts in which a decision node is used: 1 path = 1 rule => the pruning decision can be made differently for each path It removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves: this avoids having to reorganise the tree if the root node is pruned Converting to rules improves readability: rules are often easier for people to understand

Summary Decision tree learning is a practical method for concept learning and for learning other discrete-valued functions ID3 infers decision trees by growing them from the root downward, greedily selecting the next best attribute ID3 searches a complete hypothesis space => avoids the risk that the target function might not be present in the hypothesis space The inductive bias implicit in ID3 includes a preference for smaller trees Overfitting the training data is an important issue, addressed by pruning methods such as reduced-error pruning and rule post-pruning Extensions of the basic algorithm handle issues such as continuous attributes, missing attribute values, and attribute costs