INTRODUCTION TO Machine Learning, 2nd Edition
Lecture Slides for Introduction to Machine Learning, 2nd Edition, Ethem Alpaydın © The MIT Press, 2010
CHAPTER 9: Decision Trees
Decision Trees
A decision tree is an efficient nonparametric method for supervised learning. It is a hierarchical model that follows a divide-and-conquer strategy.
Tree Uses Nodes and Leaves
Divide and Conquer
Internal decision nodes:
- Univariate: uses a single attribute, xi
  - Numeric xi: binary split, xi > wm
  - Discrete xi: n-way split over the n possible values
- Multivariate: uses all attributes, x
Leaves:
- Classification: class labels, or class proportions
- Regression: a numeric value; the average r, or a local fit
Learning is greedy: find the best split, then recurse (Breiman et al., 1984; Quinlan, 1986, 1993). A minimal node structure is sketched below.
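The following is a minimal sketch (not code from the book; class and attribute names are illustrative) of a univariate binary tree node for numeric attributes: internal nodes test x[attr] > threshold, leaves store a prediction.

```python
# Minimal univariate decision-tree node: internal nodes test x[attr] > threshold,
# leaves store a prediction (class label for classification, mean for regression).
class Node:
    def __init__(self, attr=None, threshold=None, left=None, right=None, value=None):
        self.attr = attr            # index of the attribute tested at this node
        self.threshold = threshold  # split threshold w_m for a numeric attribute
        self.left = left            # subtree taken when x[attr] > threshold
        self.right = right          # subtree taken when x[attr] <= threshold
        self.value = value          # prediction stored at a leaf

    def predict(self, x):
        if self.value is not None:  # leaf: return the stored prediction
            return self.value
        branch = self.left if x[self.attr] > self.threshold else self.right
        return branch.predict(x)

# Example: (x0 > 2.5) -> class "C1", otherwise class "C2"
tree = Node(attr=0, threshold=2.5,
            left=Node(value="C1"), right=Node(value="C2"))
print(tree.predict([3.1]))   # C1
print(tree.predict([1.0]))   # C2
```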
Classification Trees
For node m, Nm instances reach m, and Nim of them belong to class Ci. The estimated probability of class Ci at node m is
$\hat{p}^i_m = N^i_m / N_m$
Node m is pure if $p^i_m$ is 0 or 1 for all classes.
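A tiny illustrative snippet (not from the book) showing this proportion estimate computed from the labels of the instances that reach node m:

```python
from collections import Counter

# Class-proportion estimate at node m: p^i_m = N^i_m / N_m,
# computed from the labels of the N_m training instances reaching m.
def class_proportions(labels_at_node):
    counts = Counter(labels_at_node)
    n_m = len(labels_at_node)
    return {c: k / n_m for c, k in counts.items()}

print(class_proportions(["C1", "C1", "C1", "C2"]))   # {'C1': 0.75, 'C2': 0.25}
```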
Entropy
A measure of impurity is entropy:
$\mathcal{I}_m = -\sum_{i=1}^{K} p^i_m \log_2 p^i_m$ (with $0 \log 0 \equiv 0$)
In information theory, entropy specifies the minimum number of bits needed to encode the class of an instance.
Example of Entropy
In a two-class problem:
- If p1 = 1 and p2 = 0, all examples belong to C1; we do not need to send anything, and the entropy is 0.
- If p1 = p2 = 0.5, we need to send one bit to signal which of the two cases occurred, and the entropy is 1.
- In between these two extremes, we can devise codes that use less than a bit per message by giving shorter codes to the more likely class and longer codes to the less likely one.
(Figure: entropy function for a two-class problem.)
The Properties of Measure Functions
The properties of a function $\phi(p, 1-p)$ measuring the impurity of a split:
- $\phi(1/2, 1/2) \ge \phi(p, 1-p)$ for any $p \in [0, 1]$: impurity is highest when the two classes are equally likely.
- $\phi(0, 1) = \phi(1, 0) = 0$: there is no impurity when all instances belong to one class.
- $\phi(p, 1-p)$ is increasing in p on [0, 1/2] and decreasing on [1/2, 1].
Examples
Examples of two-class impurity measures are:
- Entropy: $\phi(p, 1-p) = -p \log_2 p - (1-p) \log_2 (1-p)$
- Gini index: $\phi(p, 1-p) = 2p(1-p)$
- Misclassification error: $\phi(p, 1-p) = 1 - \max(p, 1-p)$
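For concreteness, here is a small sketch of the three measures as Python functions of $p = p^1_m$ (the function names are our own). All three are maximal at p = 0.5 and zero at p = 0 or p = 1.

```python
import math

def entropy(p):
    if p in (0.0, 1.0):          # pure node: zero impurity by convention 0*log(0) = 0
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)

def misclassification_error(p):
    return 1 - max(p, 1 - p)

for p in (0.0, 0.2, 0.5):
    print(p, entropy(p), gini(p), misclassification_error(p))
```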
Best Split
If node m is pure, generate a leaf and stop; otherwise, split and continue recursively.
Impurity after the split: Nmj of the Nm instances take branch j, and Nimj of them belong to Ci, so $\hat{p}^i_{mj} = N^i_{mj} / N_{mj}$ and the total impurity after the split is
$\mathcal{I}'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p^i_{mj} \log_2 p^i_{mj}$
Find the variable and split that minimize impurity (over all variables, and over all split positions for numeric variables). A sketch of this greedy search is given below.
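The following is a rough sketch (names and structure are our own, not the book's) of this greedy search for a binary split on numeric attributes, minimizing the weighted entropy of the two children:

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(X, y):
    """Search all attributes and candidate thresholds for the binary split
    (attr, threshold) that minimizes the weighted entropy of the children."""
    n = len(y)
    best = (None, None, float("inf"))
    for attr in range(len(X[0])):
        values = sorted(set(row[attr] for row in X))
        # candidate thresholds: midpoints between consecutive observed values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][attr] > t]
            right = [y[i] for i in range(n) if X[i][attr] <= t]
            impurity = (len(left) / n) * entropy_of(left) + (len(right) / n) * entropy_of(right)
            if impurity < best[2]:
                best = (attr, t, impurity)
    return best

X = [[1.0], [2.0], [3.0], [4.0]]
y = ["C1", "C1", "C2", "C2"]
print(best_split(X, y))   # best split is at x0 = 2.5 (zero child impurity)
```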
Feature type: Hu moments. Internal nodes: 16. Leaf nodes: 17. Height: 8.
The decision tree is constructed from the correlation coefficients of Hu moments (classes: crying, yawning, laughing, dazing, vomiting). Each internal node contains a decision rule, and each leaf node corresponds to one facial expression class; only one image of the sequences in each class is shown. Two leaf nodes may represent the same facial expression: for example, one leaf represents a turn-right dazing class and another represents a similar class.
Feature type: R moments. Internal nodes: 15. Leaf nodes: 17. Height: 10.
The decision tree is constructed from the correlation coefficients of R moments. Two leaf nodes may represent the same facial expression: for example, one leaf represents a dazing class and another represents a similar class.
Feature type: Zernike moments. Internal nodes: 19. Leaf nodes: 20. Height: 7.
The decision tree is constructed from the correlation coefficients of Zernike moments.
Regression Trees
Error at node m, using the mean $g_m$ of the required outputs $r^t$ reaching m as the prediction:
$E_m = \frac{1}{N_m} \sum_t (r^t - g_m)^2\, b_m(\mathbf{x}^t)$, where $b_m(\mathbf{x}^t) = 1$ if $\mathbf{x}^t$ reaches node m (0 otherwise) and $g_m = \frac{\sum_t b_m(\mathbf{x}^t)\, r^t}{\sum_t b_m(\mathbf{x}^t)}$.
After splitting:
$E'_m = \frac{1}{N_m} \sum_j \sum_t (r^t - g_{mj})^2\, b_{mj}(\mathbf{x}^t)$
(the error should decrease after the split).
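A small numeric sketch of these error terms (illustrative only): the node's prediction is the mean of the responses that reach it, and a good split lowers the weighted error of the children.

```python
# Regression-tree errors: g_m is the mean response at the node, E_m the mean
# squared error; after a split, the error is the weighted sum of child errors.
def node_error(r):
    g = sum(r) / len(r)                      # g_m: mean response at the node
    return sum((t - g) ** 2 for t in r) / len(r)

def split_error(branches):
    n = sum(len(r) for r in branches)
    return sum(node_error(r) * len(r) for r in branches) / n

r = [1.0, 1.2, 3.0, 3.4]
print(node_error(r))                          # error before splitting
print(split_error([[1.0, 1.2], [3.0, 3.4]]))  # smaller error after splitting
```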
Model Selection in Trees
Pruning Trees
Remove subtrees for better generalization (to decrease variance).
- Prepruning: early stopping.
- Postpruning: grow the whole tree, then prune the subtrees that overfit on the pruning set.
Prepruning is faster; postpruning is more accurate (but requires a separate pruning set).
Rule Extraction from Trees
C4.5Rules (Quinlan, 1993)
Learning Rules
Rule induction is similar to tree induction, but tree induction is breadth-first while rule induction is depth-first: one rule is learned at a time.
- A rule set contains rules; each rule is a conjunction of terms.
- A rule covers an example if all terms of the rule evaluate to true for the example.
- Sequential covering: generate rules one at a time until all positive examples are covered (a toy sketch is given below).
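Below is a toy sketch of sequential covering; it is not Ripper and not the book's pseudo-code. Rules are conjunctions of attribute = value tests over discrete attributes, and a crude greedy score (covered negatives minus covered positives) is used when adding conditions.

```python
# Toy sequential covering: grow one rule until it covers no negatives,
# add it to the rule set, drop the positives it covers, and repeat.
def covers(rule, x):
    return all(x[a] == v for a, v in rule)

def learn_one_rule(pos, neg):
    rule = []
    while neg:                                   # keep adding conditions while negatives remain
        best = None
        for a in range(len(pos[0])):
            for v in set(x[a] for x in pos):
                cand = rule + [(a, v)]
                p = [x for x in pos if covers(cand, x)]
                n = [x for x in neg if covers(cand, x)]
                if p and (best is None or len(n) - len(p) < best[0]):
                    best = (len(n) - len(p), cand, p, n)
        if best is None:                          # give up if no condition helps
            break
        _, rule, pos, neg = best
    return rule

def sequential_covering(pos, neg):
    rules = []
    while pos:
        rule = learn_one_rule(pos, neg)
        rules.append(rule)
        pos = [x for x in pos if not covers(rule, x)]
    return rules

# Toy data: attribute 0 in {"a", "b"}, attribute 1 in {"x", "y"}
pos = [("a", "x"), ("a", "y")]
neg = [("b", "x"), ("b", "y")]
print(sequential_covering(pos, neg))   # e.g. [[(0, 'a')]]
```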
Ripper Algorithm
There are two kinds of loops in the Ripper algorithm (Cohen, 1995):
- Outer loop: add one rule at a time to the rule base.
- Inner loop: add one condition at a time to the current rule.
Conditions are added to the rule to maximize an information gain measure, until the rule covers no negative example. The pseudo-code of the outer loop of Ripper is given in Figure 9.7.
Ripper's training time scales as O(N log²N) with the number of training examples. DL: description length of the rule base.
The description length of a rule base = (the sum of the description lengths of all the rules in the rule base) + (the description length of the instances not covered by the rule base).
Ripper Algorithm
In Ripper, conditions are added to the rule to maximize an information gain measure:
$\mathrm{Gain}(R', R) = s \cdot \left( \log_2 \frac{N'_+}{N'} - \log_2 \frac{N_+}{N} \right)$
- R: the original rule
- R': the candidate rule after adding a condition
- N (N'): the number of instances covered by R (R')
- N+ (N'+): the number of true positives of R (R')
- s: the number of true positives of both R and R' (after adding the condition)
Conditions are added until the rule covers no negative example. Rule value metric:
$\mathrm{rvm}(R) = \frac{p - n}{p + n}$
where p and n are the numbers of true and false positives, respectively.
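A small sketch of these two quantities as Python functions; the argument names are our own, and the example numbers are made up.

```python
import math

def ripper_gain(s, n_pos_new, n_new, n_pos_old, n_old):
    """Gain of candidate rule R' over R: s * (log2(N'+ / N') - log2(N+ / N))."""
    return s * (math.log2(n_pos_new / n_new) - math.log2(n_pos_old / n_old))

def rule_value_metric(p, n):
    """Rule value metric (p - n) / (p + n): p true positives, n false positives."""
    return (p - n) / (p + n)

# R covers 100 instances, 50 of them positive; after adding a condition,
# R' covers 40 instances, 35 positive, 35 of which were also covered by R.
print(ripper_gain(s=35, n_pos_new=35, n_new=40, n_pos_old=50, n_old=100))
print(rule_value_metric(p=35, n=5))
```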
Multivariate Trees
At a multivariate node m, the decision uses all attributes through a linear test: take one branch if $\mathbf{w}_m^T \mathbf{x} + w_{m0} > 0$, and the other branch otherwise.
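A one-line illustrative version of such a linear (multivariate) test, with made-up weights:

```python
# Multivariate (linear) node test: take the "yes" branch if w^T x + w0 > 0.
def multivariate_test(x, w, w0):
    return sum(wi * xi for wi, xi in zip(w, x)) + w0 > 0

print(multivariate_test(x=[1.0, 2.0], w=[0.5, -1.0], w0=0.8))   # False (0.5 - 2.0 + 0.8 < 0)
```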