1
Decision Tree Learning
Lehrstuhl für Informatik 2, Gabriella Kókai: Machine Learning
2
Contents
Introduction
Decision Tree representation
Appropriate problems for Decision Tree learning
The basic Decision Tree learning algorithm (ID3)
Hypothesis space search in Decision Tree learning
Inductive bias in Decision Tree learning
Issues in Decision Tree learning
Summary
3
Introduction
Decision tree learning is one of the most widely used practical methods for inductive inference:
Approximates discrete-valued target functions
Searches a completely expressive hypothesis space
Inductive bias: preferring small trees over large ones
Robust to noisy data and capable of learning disjunctive expressions
Learned trees can also be re-represented as sets of if-then rules
Algorithms: ID3, ASSISTANT, C4.5
4
Contents
Introduction
Decision Tree representation
Appropriate problems for Decision Tree learning
The basic Decision Tree learning algorithm (ID3)
Hypothesis space search in Decision Tree learning
Inductive bias in Decision Tree learning
Issues in Decision Tree learning
Summary
5
Decision Tree representation
A decision tree classifies instances:
Node: an attribute that describes an instance
Branch: a possible value of that attribute
Leaf: the class to which the instances belong
Procedure (classifying an instance):
Start at the root node of the tree
Repeat: test the attribute specified by the node, then move down the branch corresponding to that attribute's value in the given example, until a leaf is reached (see the sketch below)
Example: an instance reaching a leaf labelled negative is classified as a negative example
In general: a decision tree represents a disjunction of conjunctions of constraints on the attribute values of the instances
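To make this classification procedure concrete, here is a minimal Python sketch (not part of the original slides); the nested-dict tree encoding and the PlayTennis-style attribute names are illustrative assumptions.

```python
# A decision tree as nested dicts: an internal node maps an attribute name to
# a dict of {attribute value: subtree}; a leaf is simply a class label.
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Start at the root, test the node's attribute, follow the branch that
    matches the instance's value, and stop when a leaf (class label) is reached."""
    while isinstance(node, dict):
        attribute = next(iter(node))                  # attribute tested at this node
        node = node[attribute][instance[attribute]]   # move down the matching branch
    return node

example = {"Outlook": "Sunny", "Humidity": "High", "Wind": "Strong"}
print(classify(tree, example))  # -> "No": the instance is classified as negative
```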
6
Decision Tree Representation 2
7
Contents
Introduction
Decision Tree Representation
Appropriate Problems for Decision Tree Learning
The Basic Decision Tree Learning Algorithm (ID3)
Hypothesis Space Search in Decision Tree Learning
Inductive Bias in Decision Tree Learning
Issues in Decision Tree Learning
Summary
8
Appropriate Problems for Decision Tree Learning
Decision tree learning is generally best suited to problems with the following characteristics:
Instances are represented by attribute-value pairs; easiest case: each attribute takes on a small number of disjoint possible values; extension: handling real-valued attributes
The target function has discrete output values; extension: learning functions with more than two possible output values
Disjunctive descriptions may be required
The training data may contain errors: errors in the classification of the training examples, or errors in the attribute values that describe these examples
The training data may contain missing attribute values
Classification problems: problems in which the task is to classify examples into one of a discrete set of possible categories
9
Contents
Introduction
Decision Tree Representation
Appropriate Problems for Decision Tree Learning
The Basic Decision Tree Learning Algorithm (ID3)
Hypothesis Space Search in Decision Tree Learning
Inductive Bias in Decision Tree Learning
Issues in Decision Tree Learning
Summary
10
The Basic Decision Tree Learning Algorithm
Top-down, greedy search through the space of possible decision trees
ID3 (Quinlan 1986), C4.5 (Quinlan 1993) and other variations
Question: Which attribute should be tested at a node of the tree?
Answer: A statistical test is used to select the best attribute (how well it alone classifies the training examples)
Descendants of the root node are created, one for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node
The process is then repeated for each descendant
The algorithm never backtracks to reconsider earlier choices
11
The Basic Decision Tree Learning Algorithm 2
ID3(examples, target_attr, attributes)
  Create a root node for the tree
  if all examples are positive, return root with label +
  if all examples are negative, return root with label -
  if attributes is empty, return root with label = most common value of target_attr in examples
  otherwise:
    A = the attribute in attributes with the highest Gain(examples, A)
    set the decision attribute of root to A
    for each possible value vi of A:
      add a new branch below root for the test A = vi
      examples_vi = the subset of examples with value vi for A
      if examples_vi is empty:
        below this branch add a leaf with label = most common value of target_attr in examples
      else:
        below this branch add the subtree ID3(examples_vi, target_attr, attributes - {A})
  return root
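A runnable Python sketch of this algorithm is given below as an illustration, not the lecture's own code; it assumes each example is a dict mapping attribute names to values, returns trees in the nested-dict format of the earlier classification sketch, and uses entropy and information-gain helpers that anticipate the definitions on the following slides.

```python
import math
from collections import Counter

def entropy(examples, target_attr):
    """Entropy of the class distribution in `examples` (defined on the next slides)."""
    counts = Counter(ex[target_attr] for ex in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, attr, target_attr):
    """Expected reduction in entropy from partitioning `examples` on `attr`."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target_attr)
    return entropy(examples, target_attr) - remainder

def id3(examples, target_attr, attributes):
    """Grow a tree top-down; leaves are class labels, internal nodes are dicts."""
    labels = [ex[target_attr] for ex in examples]
    if len(set(labels)) == 1:                 # all examples have the same class
        return labels[0]
    if not attributes:                        # no attributes left: majority leaf
        return Counter(labels).most_common(1)[0][0]
    # A = the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(examples, a, target_attr))
    node = {best: {}}
    # one branch per value observed in the examples (branches for unseen values
    # of `best` would receive a majority-class leaf, as in the pseudocode)
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        node[best][value] = id3(subset, target_attr, remaining)
    return node
```

With a training set such as the PlayTennis examples, id3(examples, "PlayTennis", ["Outlook", "Temperature", "Humidity", "Wind"]) would return a tree like the one used in the classification sketch above.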
12
Which Attribute Is the Best Classifier
INFORMATION GAIN: measures how well a given attribute separates the training examples
ENTROPY: characterizes the (im)purity of an arbitrary collection of examples
Given a collection S of positive and negative examples:
p+ : the proportion of positive examples in S
p- : the proportion of negative examples in S
Entropy(S) = -p+ log2(p+) - p- log2(p-)
Example: S = [9+, 5-]: Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Notice:
Entropy is 0 if all members belong to the same class
Entropy is 1 when the collection contains an equal number of positive and negative examples
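The worked example above can be checked in a couple of lines of Python (a numeric verification, not slide content):

```python
import math

p_pos, p_neg = 9 / 14, 5 / 14   # S = [9+, 5-]
entropy_S = -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
print(round(entropy_S, 3))       # 0.94
```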
13
Which Attribute Is the Best Classifier
Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S
More generally, if the target attribute can take on c different values:
Entropy(S) = Σ_{i=1..c} -p_i log2(p_i), where p_i is the proportion of S belonging to class i
The entropy function relative to a boolean classification varies as the proportion p+ of positive examples varies between 0 and 1: it is 0 at p+ = 0 and p+ = 1, and reaches its maximum of 1 at p+ = 0.5
14
Information Gain Measures the Expected Reduction in Entropy
INFORMATION GAIN Gain(S, A): the expected reduction in entropy caused by partitioning the examples according to attribute A:
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
where Values(A) is the set of all possible values for A, and S_v is the subset of S for which attribute A has value v
Example (PlayTennis data): Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong) = 0.940 - (8/14)(0.811) - (6/14)(1.000) = 0.048
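As a worked check of this formula (an illustration, not slide material), the snippet below recomputes Gain(S, Wind); the S_Weak and S_Strong counts are those of the standard PlayTennis example.

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum(n / total * math.log2(n / total) for n in (pos, neg) if n)

# S = [9+, 5-]; partitioning on Wind gives S_Weak = [6+, 2-], S_Strong = [3+, 3-]
gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain_wind, 3))  # 0.048
```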
15
Information Gain Measures the Expected Reduction in Entropy 2
16
An Illustrative Example
ID3 determines the information gain for each candidate attribute (Outlook, Temperature, Humidity, Wind):
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Outlook provides the best prediction and is selected as the decision attribute for the root node; for Outlook = Overcast all examples are positive
17
An Illustrative Example 2
18
An Illustrative Example 3
The process continues for each new leaf node until either:
Every attribute has already been included along the path through the tree, or
The training examples associated with this leaf node all have the same target attribute value
19
Contents
Introduction
Decision Tree Representation
Appropriate Problems for Decision Tree Learning
The Basic Decision Tree Learning Algorithm (ID3)
Hypothesis Space Search in Decision Tree Learning
Inductive Bias in Decision Tree Learning
Issues in Decision Tree Learning
Summary
20
Hypothesis Space Search in Decision Tree Learning
Hypothesis space for ID3: the set of possible decision trees
ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space
Beginning: the empty tree
Considering: progressively more elaborate hypotheses
Evaluation function: information gain
21
Hypothesis Space Search in Decision Tree Learning 2
Capabilities and limitations:
ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes => every finite discrete-valued function can be represented by some decision tree => this avoids the risk that the hypothesis space might not contain the target function
Maintains only a single current hypothesis (in contrast to the Candidate-Elimination algorithm)
No backtracking in the search => it may converge to a locally optimal solution
Uses all training examples at each step (statistically based decisions) => the resulting search is much less sensitive to errors in individual training examples
22
Contents
Introduction
Decision Tree Representation
Appropriate Problems for Decision Tree Learning
The Basic Decision Tree Learning Algorithm (ID3)
Hypothesis Space Search in Decision Tree Learning
Inductive Bias in Decision Tree Learning
Issues in Decision Tree Learning
Summary
23
Inductive Bias in Decision Tree Learning
INDUCTIVE BIAS: the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances
In ID3: the basis is how it chooses one consistent hypothesis over the others
ID3's search strategy:
Selects shorter trees in favour of larger ones
Selects trees that place the attributes with the highest information gain closest to the root
The bias is difficult to characterise precisely, but approximately: shorter trees are preferred over larger ones
One could imagine an algorithm like ID3 that performs a breadth-first search for the shortest consistent tree (BFS-ID3); ID3 can be viewed as an efficient approximation of BFS-ID3, but it exhibits a more complex bias and does not always find the shortest consistent tree
24
Inductive Bias in Decision Tree Learning
A closer approximation to the inductive bias of ID3: shorter trees are preferred over longer trees, and trees that place high-information-gain attributes close to the root are preferred over those that do not
Occam's razor: prefer the simplest hypothesis that fits the data
25
Restriction Biases and Preference Biases
Difference between the inductive bias exhibited by ID3 and by the Candidate-Elimination algorithm:
ID3 searches a complete hypothesis space incompletely
Candidate-Elimination searches an incomplete hypothesis space completely
The inductive bias of ID3 follows from its search strategy; the inductive bias of the Candidate-Elimination algorithm follows from the definition of its search space
The inductive bias of ID3 is thus a preference for certain hypotheses over others (a preference bias)
The bias of the Candidate-Elimination algorithm is instead a categorical restriction on the set of hypotheses considered (a restriction bias)
A preference bias is typically more desirable than a restriction bias, because the learner can work within a complete hypothesis space
A restriction bias, which strictly limits the set of potential hypotheses, is generally less desirable because it risks excluding the unknown target function
26
Contents
Introduction
Decision Tree Representation
Appropriate Problems for Decision Tree Learning
The Basic Decision Tree Learning Algorithm (ID3)
Hypothesis Space Search in Decision Tree Learning
Inductive Bias in Decision Tree Learning
Issues in Decision Tree Learning
Summary
27
Issues in Decision Tree Learning
These include:
Determining how deeply to grow the decision tree
Handling continuous attributes
Choosing an appropriate attribute-selection measure
Handling training data with missing attribute values
Handling attributes with differing costs
Improving computational efficiency
28
Avoiding Overfitting the Data
Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances
29
Avoiding Overfitting the Data 2
How can it be possible that a tree h fits the training examples better than h', yet performs more poorly over subsequent examples?
The training examples contain random errors or noise: for example, adding a positive training example that is incorrectly labeled as negative; result: the noisy example is sorted to the node containing D9 and D11, and ID3 then searches for further refinements below that node, producing a more complex tree
Small numbers of examples are associated with leaf nodes (coincidental regularities)
An experimental study of ID3 involving five different learning tasks (with noisy, nondeterministic data) found that overfitting decreased accuracy by 10-20%
APPROACHES:
Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data
Allow the tree to overfit the data and then post-prune it
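To illustrate overfitting and the first approach (stopping tree growth early), here is a hedged sketch using scikit-learn's DecisionTreeClassifier on synthetic noisy data; note that scikit-learn builds CART-style trees rather than ID3 trees, so this is an analogy, not the lecture's algorithm, and the dataset parameters are illustrative assumptions.

```python
# Requires scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (flip_y) to mimic classification errors.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)                 # grown fully
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)  # early stop

for name, clf in [("fully grown", full), ("depth-limited", pruned)]:
    print(name,
          "train acc:", round(clf.score(X_train, y_train), 2),
          "test acc:", round(clf.score(X_test, y_test), 2))
# The fully grown tree typically scores near 1.0 on the training data but worse
# than the depth-limited tree on the test data: the signature of overfitting.
```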
30
Avoiding Overfitting the Data 3
Criteria for determining the correct final tree size:
Training and validation set: use a set of examples separate from the training examples to evaluate the utility of post-pruning nodes from the tree
Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set
Use an explicit measure of the complexity of encoding the training examples and the decision tree (e.g. a minimum description length criterion)
31
Reduced Error Pruning
How exactly might a validation set be used to prevent overfitting?
Reduced-error pruning:
Consider each of the decision nodes as a candidate for pruning
Pruning a node means replacing the subtree rooted at that node with a leaf whose label is the most common class of the training examples assigned to that node
Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set
Nodes are pruned iteratively, always choosing the node whose removal most increases the accuracy of the decision tree over the validation set
Continue until further pruning is harmful (a sketch of the procedure follows below)
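A minimal Python sketch of this procedure over the nested-dict trees used in the earlier sketches is shown below; the helper names and the tree encoding are illustrative assumptions, not the lecture's implementation.

```python
import copy
from collections import Counter

def classify(node, instance):
    while isinstance(node, dict):
        attr = next(iter(node))
        node = node[attr].get(instance[attr])   # unseen value -> None (counts as wrong)
    return node

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def internal_nodes(tree, path=()):
    """Yield the (attribute, value) path that leads to every internal node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, subtree in tree[attr].items():
            yield from internal_nodes(subtree, path + ((attr, value),))

def prune_at(tree, path, leaf):
    """Return a copy of `tree` with the subtree reached via `path` replaced by `leaf`."""
    if not path:
        return leaf
    tree = copy.deepcopy(tree)
    node = tree
    for attr, value in path[:-1]:
        node = node[attr][value]
    attr, value = path[-1]
    node[attr][value] = leaf
    return tree

def majority_at(path, train, target):
    """Most common class among the training examples sorted to the node at `path`."""
    subset = [ex for ex in train if all(ex[a] == v for a, v in path)] or train
    return Counter(ex[target] for ex in subset).most_common(1)[0][0]

def reduced_error_prune(tree, train, validation, target):
    """Iteratively prune the node whose removal most improves validation accuracy."""
    while isinstance(tree, dict):
        base = accuracy(tree, validation, target)
        candidates = [prune_at(tree, p, majority_at(p, train, target))
                      for p in internal_nodes(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation, target))
        if accuracy(best, validation, target) < base:
            break                                # further pruning would hurt: stop
        tree = best
    return tree
```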
32
Reduced Error Pruning 2
Here the validation set used for pruning is distinct from both the training and the test sets
Disadvantage: data is limited, and withholding part of it for the validation set further reduces the number of examples available for training
Many additional pruning techniques have been proposed
33
Rule Post-Pruning
Steps:
Infer the decision tree from the training data, growing the tree until the training data fit as well as possible and allowing overfitting to occur
Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node
Prune each rule by removing any preconditions whose removal improves its estimated accuracy
Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances
(A sketch of the tree-to-rules conversion step is given below)
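The second step, converting the tree into one rule per root-to-leaf path, can be sketched as follows (an illustration using the nested-dict tree format from the earlier sketches):

```python
def tree_to_rules(node, conditions=()):
    """One rule per root-to-leaf path: antecedent = the attribute tests along
    the path, consequent = the class label at the leaf."""
    if not isinstance(node, dict):                       # leaf: emit one rule
        return [(list(conditions), node)]
    attr = next(iter(node))
    rules = []
    for value, subtree in node[attr].items():
        rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules

tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}

for antecedent, consequent in tree_to_rules(tree):
    tests = " AND ".join(f"{a} = {v}" for a, v in antecedent)
    print(f"IF {tests} THEN PlayTennis = {consequent}")
# e.g. IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No
```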
34
Rule Post-Pruning 2
One rule is generated for each leaf node in the tree:
Antecedent: each attribute test along the path from the root to the leaf
Consequent: the classification at the leaf
Each rule is pruned by removing any antecedent whose removal does not worsen its estimated accuracy
C4.5 evaluates rule performance using a pessimistic estimate:
Calculate the rule accuracy over the training examples
Calculate the standard deviation of this estimated accuracy, assuming a binomial distribution
For a given confidence level, the lower-bound estimate is then taken as the measure of rule performance
Advantage: for large data sets the pessimistic estimate is very close to the observed accuracy (a sketch of such a lower-bound estimate follows below)
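As a hedged sketch of such a pessimistic estimate (C4.5's exact formula differs in its details), the snippet below computes a normal-approximation lower confidence bound on a rule's accuracy under a binomial assumption.

```python
import math

def pessimistic_accuracy(correct, total, z=1.96):
    """Observed rule accuracy minus z standard deviations of a binomial
    proportion (normal approximation); z = 1.96 is roughly a 95% level."""
    acc = correct / total
    std = math.sqrt(acc * (1 - acc) / total)
    return acc - z * std

# A rule that classifies 18 of the 20 training examples it covers correctly:
print(round(pessimistic_accuracy(18, 20), 3))      # observed 0.9, lower bound ~0.77
# With many more covered examples the bound approaches the observed accuracy:
print(round(pessimistic_accuracy(1800, 2000), 3))  # ~0.89
```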
35
Rule Post-Pruning 3
Why is it useful to convert the decision tree to rules before pruning?
It allows distinguishing among the different contexts in which a decision node is used: one path = one rule, so pruning decisions can be made differently for each path
It removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves, avoiding the need to reorganise the tree if a test near the root is pruned
Converting to rules improves readability: rules are often easier for people to understand
36
Summary
Decision tree learning is a practical method for concept learning and for learning other discrete-valued functions
ID3 infers decision trees by growing them from the root downward, greedily selecting the next attribute by information gain
ID3 searches a complete hypothesis space, thereby avoiding the risk that the target function might not be present in the hypothesis space
The inductive bias implicit in ID3 includes a preference for smaller trees
Overfitting the training data is an important issue, addressed by pruning techniques such as reduced-error pruning and rule post-pruning
Extensions of the basic algorithm handle real-valued attributes, missing attribute values, and attributes with differing costs