1
Decision Tree Learning
Lehrstuhl für Informatik 2, Gabriella Kókai: Machine Learning
2
Contents
Introduction
Decision Tree representation
Appropriate problems for Decision Tree learning
The basic Decision Tree learning algorithm (ID3)
Hypothesis space search in Decision Tree learning
Inductive bias in Decision Tree learning
Issues in Decision Tree learning
Summary
3
Introduction
Decision tree learning is one of the most widely used practical methods for inductive inference:
Approximates discrete-valued target functions
Searches a completely expressive hypothesis space
Inductive bias: preferring small trees over large ones
Robust to noisy data and capable of learning disjunctive expressions
Learned trees can also be re-represented as sets of if-then rules
Algorithms: ID3, ASSISTANT, C4.5
4
Contents
Introduction
Decision Tree representation
Appropriate problems for Decision Tree learning
The basic Decision Tree learning algorithm (ID3)
Hypothesis space search in Decision Tree learning
Inductive bias in Decision Tree learning
Issues in Decision Tree learning
Summary
5
Decision Tree representation
A decision tree classifies instances:
Node: an attribute that describes an instance
Branch: a possible value of that attribute
Leaf: the class to which the instances belong
Procedure (classifying an instance):
Start at the root node of the tree
Repeat: test the attribute specified by the node, then move down the branch corresponding to that attribute's value in the given example, until a leaf is reached (see the sketch below)
Example: an instance reaching a leaf labelled negative is classified as a negative example
In general: a decision tree represents a disjunction of conjunctions of constraints on the attribute values of the instances
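To make this classification procedure concrete, here is a minimal Python sketch (not part of the original slides); the nested-dict tree encoding and the PlayTennis-style attribute names are illustrative assumptions.

```python
# A decision tree as nested dicts: an internal node maps an attribute name to
# a dict of {attribute value: subtree}; a leaf is simply a class label.
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Start at the root, test the node's attribute, follow the branch that
    matches the instance's value, and stop when a leaf (class label) is reached."""
    while isinstance(node, dict):
        attribute = next(iter(node))                  # attribute tested at this node
        node = node[attribute][instance[attribute]]   # move down the matching branch
    return node

example = {"Outlook": "Sunny", "Humidity": "High", "Wind": "Strong"}
print(classify(tree, example))  # -> "No": the instance is classified as negative
```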
6
Decision Tree Representation 2
7
Contents
Introduction
Decision Tree Representation
Appropriate Problems for Decision Tree Learning
The Basic Decision Tree Learning Algorithm (ID3)
Hypothesis Space Search in Decision Tree Learning
Inductive Bias in Decision Tree Learning
Issues in Decision Tree Learning
Summary
8
Appropriate Problems for Decision Tree Learning
Decision tree learning is generally best suited to problems with the following characteristics:
Instances are represented by attribute-value pairs; easiest case: each attribute takes on a small number of disjoint possible values; extension: handling real-valued attributes
The target function has discrete output values; extension: learning functions with more than two possible output values
Disjunctive descriptions may be required
The training data may contain errors: errors in the classification of the training examples, or errors in the attribute values that describe these examples
The training data may contain missing attribute values
Classification problems: problems in which the task is to classify examples into one of a discrete set of possible categories
9
Contents
Introduction
Decision Tree Representation
Appropriate Problems for Decision Tree Learning
The Basic Decision Tree Learning Algorithm (ID3)
Hypothesis Space Search in Decision Tree Learning
Inductive Bias in Decision Tree Learning
Issues in Decision Tree Learning
Summary
10
The Basic Decision Tree Learning Algorithm
Top-down, greedy search through the space of possible decision trees
ID3 (Quinlan 1986), C4.5 (Quinlan 1993) and other variations
Question: Which attribute should be tested at a node of the tree?
Answer: A statistical test is used to select the best attribute (how well it alone classifies the training examples)
Descendants of the root node are created, one for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node
The process is then repeated for each descendant
The algorithm never backtracks to reconsider earlier choices
11
The Basic Decision Tree Learning Algorithm 2
ID3(examples, target_attr, attributes)
  Create a root node for the tree
  if all examples are positive, return root with label +
  if all examples are negative, return root with label -
  if attributes is empty, return root with label = most common value of target_attr in examples
  otherwise:
    A = the attribute in attributes with the highest Gain(examples, A)
    set the decision attribute of root to A
    for each possible value vi of A:
      add a new branch below root for the test A = vi
      examples_vi = the subset of examples with value vi for A
      if examples_vi is empty:
        below this branch add a leaf with label = most common value of target_attr in examples
      else:
        below this branch add the subtree ID3(examples_vi, target_attr, attributes - {A})
  return root
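A runnable Python sketch of this algorithm is given below as an illustration, not the lecture's own code; it assumes each example is a dict mapping attribute names to values, returns trees in the nested-dict format of the earlier classification sketch, and uses entropy and information-gain helpers that anticipate the definitions on the following slides.

```python
import math
from collections import Counter

def entropy(examples, target_attr):
    """Entropy of the class distribution in `examples` (defined on the next slides)."""
    counts = Counter(ex[target_attr] for ex in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, attr, target_attr):
    """Expected reduction in entropy from partitioning `examples` on `attr`."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target_attr)
    return entropy(examples, target_attr) - remainder

def id3(examples, target_attr, attributes):
    """Grow a tree top-down; leaves are class labels, internal nodes are dicts."""
    labels = [ex[target_attr] for ex in examples]
    if len(set(labels)) == 1:                 # all examples have the same class
        return labels[0]
    if not attributes:                        # no attributes left: majority leaf
        return Counter(labels).most_common(1)[0][0]
    # A = the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(examples, a, target_attr))
    node = {best: {}}
    # one branch per value observed in the examples (branches for unseen values
    # of `best` would receive a majority-class leaf, as in the pseudocode)
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        node[best][value] = id3(subset, target_attr, remaining)
    return node
```

With a training set such as the PlayTennis examples, id3(examples, "PlayTennis", ["Outlook", "Temperature", "Humidity", "Wind"]) would return a tree like the one used in the classification sketch above.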
12
Which Attribute Is the Best Classifier
INFORMATION GAIN: measures how well a given attribute separates the training examples
ENTROPY: characterizes the (im)purity of an arbitrary collection of examples
Given a collection S of positive and negative examples:
p+ : the proportion of positive examples in S
p- : the proportion of negative examples in S
Entropy(S) = -p+ log2(p+) - p- log2(p-)
Example: S = [9+, 5-]: Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Notice:
Entropy is 0 if all members belong to the same class
Entropy is 1 when the collection contains an equal number of positive and negative examples
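The worked example above can be checked in a couple of lines of Python (a numeric verification, not slide content):

```python
import math

p_pos, p_neg = 9 / 14, 5 / 14   # S = [9+, 5-]
entropy_S = -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
print(round(entropy_S, 3))       # 0.94
```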
13
Which Attribute Is the Best Classifier
Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S
More generally, if the target attribute can take on c different values:
Entropy(S) = Σ_{i=1..c} -p_i log2(p_i), where p_i is the proportion of S belonging to class i
The entropy function relative to a boolean classification varies as the proportion p+ of positive examples varies between 0 and 1: it is 0 at p+ = 0 and p+ = 1, and reaches its maximum of 1 at p+ = 0.5
14
Information Gain Measures the Expected Reduction in Entropy
INFORMATION GAIN Gain(S, A): the expected reduction in entropy caused by partitioning the examples according to attribute A:
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
where Values(A) is the set of all possible values for A, and S_v is the subset of S for which attribute A has value v
Example (PlayTennis data): Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong) = 0.940 - (8/14)(0.811) - (6/14)(1.000) = 0.048
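As a worked check of this formula (an illustration, not slide material), the snippet below recomputes Gain(S, Wind); the S_Weak and S_Strong counts are those of the standard PlayTennis example.

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum(n / total * math.log2(n / total) for n in (pos, neg) if n)

# S = [9+, 5-]; partitioning on Wind gives S_Weak = [6+, 2-], S_Strong = [3+, 3-]
gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain_wind, 3))  # 0.048
```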
15
Information Gain Measures the Expected Reduction in Entropy 2
16
An Illustrative Example
ID3 determines the information gain for each candidate attribute (Outlook, Temperature, Humidity, Wind):
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Outlook provides the best prediction and is selected as the decision attribute for the root node; for Outlook = Overcast all examples are positive
17
An Illustrative Example 2
18
An Illustrative Example 3
The process continues for each new leaf node until either:
Every attribute has already been included along the path through the tree, or
The training examples associated with this leaf node all have the same target attribute value
19
Contents
Introduction
Decision Tree Representation
Appropriate Problems for Decision Tree Learning
The Basic Decision Tree Learning Algorithm (ID3)
Hypothesis Space Search in Decision Tree Learning
Inductive Bias in Decision Tree Learning
Issues in Decision Tree Learning
Summary
20
Hypothesis Space Search in Decision Tree Learning
Hypothesis space for ID3: the set of possible decision trees
ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space
Beginning: the empty tree
Considering: progressively more elaborate hypotheses
Evaluation function: information gain
21
Hypothesis Space Search in Decision Tree Learning 2
Capabilities and limitations:
ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes => every finite discrete-valued function can be represented by some decision tree => this avoids the risk that the hypothesis space might not contain the target function
Maintains only a single current hypothesis (in contrast to the Candidate-Elimination algorithm)
No backtracking in the search => it may converge to a locally optimal solution
Uses all training examples at each step (statistically based decisions) => the resulting search is much less sensitive to errors in individual training examples
22
Contents
Introduction
Decision Tree Representation
Appropriate Problems for Decision Tree Learning
The Basic Decision Tree Learning Algorithm (ID3)
Hypothesis Space Search in Decision Tree Learning
Inductive Bias in Decision Tree Learning
Issues in Decision Tree Learning
Summary
23
Inductive Bias in Decision Tree Learning
INDUCTIVE BIAS: the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances
In ID3: the basis is how it chooses one consistent hypothesis over the others
ID3's search strategy:
Selects shorter trees in favour of larger ones
Selects trees that place the attributes with the highest information gain closest to the root
The bias is difficult to characterise precisely, but approximately: shorter trees are preferred over larger ones
One could imagine an algorithm like ID3 that performs a breadth-first search for the shortest consistent tree (BFS-ID3); ID3 can be viewed as an efficient approximation of BFS-ID3, but it exhibits a more complex bias and does not always find the shortest consistent tree
24
Inductive Bias in Decision Tree Learning
A closer approximation to the inductive bias of ID3: shorter trees are preferred over longer trees, and trees that place high-information-gain attributes close to the root are preferred over those that do not
Occam's razor: prefer the simplest hypothesis that fits the data
25
Restriction Biases and Preference Biases
Difference between the inductive bias exhibited by ID3 and by the Candidate-Elimination algorithm:
ID3 searches a complete hypothesis space incompletely
Candidate-Elimination searches an incomplete hypothesis space completely
The inductive bias of ID3 follows from its search strategy; the inductive bias of the Candidate-Elimination algorithm follows from the definition of its search space
The inductive bias of ID3 is thus a preference for certain hypotheses over others (a preference bias)
The bias of the Candidate-Elimination algorithm is instead a categorical restriction on the set of hypotheses considered (a restriction bias)
A preference bias is typically more desirable than a restriction bias, because the learner can work within a complete hypothesis space
A restriction bias, which strictly limits the set of potential hypotheses, is generally less desirable because it risks excluding the unknown target function
26
Contents
Introduction
Decision Tree Representation
Appropriate Problems for Decision Tree Learning
The Basic Decision Tree Learning Algorithm (ID3)
Hypothesis Space Search in Decision Tree Learning
Inductive Bias in Decision Tree Learning
Issues in Decision Tree Learning
Summary
27
Issues in Decision Tree Learning
These include:
Determining how deeply to grow the decision tree
Handling continuous attributes
Choosing an appropriate attribute-selection measure
Handling training data with missing attribute values
Handling attributes with differing costs
Improving computational efficiency
28
Avoiding Overfitting the Data
Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances
29
Avoiding Overfitting the Data 2
How can it be possible that a tree h fits the training examples better than h', yet performs more poorly over subsequent examples?
The training examples contain random errors or noise: for example, adding a positive training example that is incorrectly labeled as negative; result: the noisy example is sorted to the node containing D9 and D11, and ID3 then searches for further refinements below that node, producing a more complex tree
Small numbers of examples are associated with leaf nodes (coincidental regularities)
An experimental study of ID3 involving five different learning tasks (with noisy, nondeterministic data) found that overfitting decreased accuracy by 10-20%
APPROACHES:
Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data
Allow the tree to overfit the data and then post-prune it
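To illustrate overfitting and the first approach (stopping tree growth early), here is a hedged sketch using scikit-learn's DecisionTreeClassifier on synthetic noisy data; note that scikit-learn builds CART-style trees rather than ID3 trees, so this is an analogy, not the lecture's algorithm, and the dataset parameters are illustrative assumptions.

```python
# Requires scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (flip_y) to mimic classification errors.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)                 # grown fully
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)  # early stop

for name, clf in [("fully grown", full), ("depth-limited", pruned)]:
    print(name,
          "train acc:", round(clf.score(X_train, y_train), 2),
          "test acc:", round(clf.score(X_test, y_test), 2))
# The fully grown tree typically scores near 1.0 on the training data but worse
# than the depth-limited tree on the test data: the signature of overfitting.
```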
30
Avoiding Overfitting the Data 3
Criteria for determining the correct final tree size:
Training and validation set: use a set of examples separate from the training examples to evaluate the utility of post-pruning nodes from the tree
Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set
Use an explicit measure of the complexity of encoding the training examples and the decision tree (e.g. a minimum description length criterion)
31
Reduced Error Pruning
How exactly might a validation set be used to prevent overfitting?
Reduced-error pruning:
Consider each of the decision nodes as a candidate for pruning
Pruning a node means replacing the subtree rooted at that node with a leaf whose label is the most common class of the training examples assigned to that node
Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set
Nodes are pruned iteratively, always choosing the node whose removal most increases the accuracy of the decision tree over the validation set
Continue until further pruning is harmful (a sketch of the procedure follows below)
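A minimal Python sketch of this procedure over the nested-dict trees used in the earlier sketches is shown below; the helper names and the tree encoding are illustrative assumptions, not the lecture's implementation.

```python
import copy
from collections import Counter

def classify(node, instance):
    while isinstance(node, dict):
        attr = next(iter(node))
        node = node[attr].get(instance[attr])   # unseen value -> None (counts as wrong)
    return node

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def internal_nodes(tree, path=()):
    """Yield the (attribute, value) path that leads to every internal node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, subtree in tree[attr].items():
            yield from internal_nodes(subtree, path + ((attr, value),))

def prune_at(tree, path, leaf):
    """Return a copy of `tree` with the subtree reached via `path` replaced by `leaf`."""
    if not path:
        return leaf
    tree = copy.deepcopy(tree)
    node = tree
    for attr, value in path[:-1]:
        node = node[attr][value]
    attr, value = path[-1]
    node[attr][value] = leaf
    return tree

def majority_at(path, train, target):
    """Most common class among the training examples sorted to the node at `path`."""
    subset = [ex for ex in train if all(ex[a] == v for a, v in path)] or train
    return Counter(ex[target] for ex in subset).most_common(1)[0][0]

def reduced_error_prune(tree, train, validation, target):
    """Iteratively prune the node whose removal most improves validation accuracy."""
    while isinstance(tree, dict):
        base = accuracy(tree, validation, target)
        candidates = [prune_at(tree, p, majority_at(p, train, target))
                      for p in internal_nodes(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation, target))
        if accuracy(best, validation, target) < base:
            break                                # further pruning would hurt: stop
        tree = best
    return tree
```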
32
Reduced Error Pruning 2
Here the validation set used for pruning is distinct from both the training and the test sets
Disadvantage: data is limited, and withholding part of it for the validation set further reduces the number of examples available for training
Many additional pruning techniques have been proposed
33
Rule Post-Pruning
Steps:
Infer the decision tree from the training data, growing the tree until the training data fit as well as possible and allowing overfitting to occur
Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node
Prune each rule by removing any preconditions whose removal improves its estimated accuracy
Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances
(A sketch of the tree-to-rules conversion step is given below)
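The second step, converting the tree into one rule per root-to-leaf path, can be sketched as follows (an illustration using the nested-dict tree format from the earlier sketches):

```python
def tree_to_rules(node, conditions=()):
    """One rule per root-to-leaf path: antecedent = the attribute tests along
    the path, consequent = the class label at the leaf."""
    if not isinstance(node, dict):                       # leaf: emit one rule
        return [(list(conditions), node)]
    attr = next(iter(node))
    rules = []
    for value, subtree in node[attr].items():
        rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules

tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}

for antecedent, consequent in tree_to_rules(tree):
    tests = " AND ".join(f"{a} = {v}" for a, v in antecedent)
    print(f"IF {tests} THEN PlayTennis = {consequent}")
# e.g. IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No
```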
34
Rule Post-Pruning 2
One rule is generated for each leaf node in the tree:
Antecedent: each attribute test along the path from the root to the leaf
Consequent: the classification at the leaf
Each rule is pruned by removing any antecedent whose removal does not worsen its estimated accuracy
C4.5 evaluates rule performance using a pessimistic estimate:
Calculate the rule accuracy over the training examples
Calculate the standard deviation of this estimated accuracy, assuming a binomial distribution
For a given confidence level, the lower-bound estimate is then taken as the measure of rule performance
Advantage: for large data sets the pessimistic estimate is very close to the observed accuracy (a sketch of such a lower-bound estimate follows below)
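As a hedged sketch of such a pessimistic estimate (C4.5's exact formula differs in its details), the snippet below computes a normal-approximation lower confidence bound on a rule's accuracy under a binomial assumption.

```python
import math

def pessimistic_accuracy(correct, total, z=1.96):
    """Observed rule accuracy minus z standard deviations of a binomial
    proportion (normal approximation); z = 1.96 is roughly a 95% level."""
    acc = correct / total
    std = math.sqrt(acc * (1 - acc) / total)
    return acc - z * std

# A rule that classifies 18 of the 20 training examples it covers correctly:
print(round(pessimistic_accuracy(18, 20), 3))      # observed 0.9, lower bound ~0.77
# With many more covered examples the bound approaches the observed accuracy:
print(round(pessimistic_accuracy(1800, 2000), 3))  # ~0.89
```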
35
Rule Post-Pruning 3
Why is it useful to convert the decision tree to rules before pruning?
It allows distinguishing among the different contexts in which a decision node is used: one path = one rule, so pruning decisions can be made differently for each path
It removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves, avoiding the need to reorganise the tree if a test near the root is pruned
Converting to rules improves readability: rules are often easier for people to understand
36
Summary
Decision tree learning is a practical method for concept learning and for learning other discrete-valued functions
ID3 infers decision trees by growing them from the root downward, greedily selecting the next attribute by information gain
ID3 searches a complete hypothesis space, thereby avoiding the risk that the target function might not be present in the hypothesis space
The inductive bias implicit in ID3 includes a preference for smaller trees
Overfitting the training data is an important issue, addressed by pruning techniques such as reduced-error pruning and rule post-pruning
Extensions of the basic algorithm handle real-valued attributes, missing attribute values, and attributes with differing costs