1
Decision Tree Learning
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 25, 2014
2
Example: Age, Income and Owning a flat
[Figure: scatter plot of the training set – monthly income (thousand rupees) vs. age; points are labelled "owns a house" or "does not own a house", with two separating lines L1 and L2.]
If the training data were as above, could we define some simple rules by observation?
– Any point above the line L1: owns a house
– Any point to the right of L2: owns a house
– Any other point: does not own a house
3
Example: Age, Income and Owning a flat
[Figure: the same scatter plot, now partitioned by the splits corresponding to L1 and L2.]
Root node: split at Income = 101
– Income ≥ 101: Label = Yes
– Income < 101: split at Age = 54
  – Age ≥ 54: Label = Yes
  – Age < 54: Label = No
In general, the data will not be as cleanly separable as above (a small code sketch of this hand-built tree follows below).
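The hand-built tree above can be written directly as nested conditionals. A minimal Python sketch, assuming the thresholds 101 (income, in thousand rupees) and 54 (age) read off the example plot; the function name owns_house is hypothetical.

    def owns_house(income_thousands: float, age: float) -> bool:
        """Hand-built decision tree from the example:
        split on income first, then on age."""
        if income_thousands >= 101:   # root split
            return True               # Label = Yes
        if age >= 54:                 # split on the Income < 101 branch
            return True               # Label = Yes
        return False                  # Label = No

    print(owns_house(120, 30))  # True  (high income)
    print(owns_house(80, 60))   # True  (older, moderate income)
    print(owns_house(80, 40))   # False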
4
Example: Age, Income and Owning a flat
[Figure: the training set scatter plot – monthly income (thousand rupees) vs. age, with points labelled "owns a house" or "does not own a house".]
Approach: recursively split the data into partitions so that each partition becomes purer, until …
– How to decide the split?
– How to measure purity?
– When to stop?
5
Approach for splitting
What are the possible lines for splitting?
– For each variable, the midpoints between pairs of consecutive values of that variable
– How many? If N = number of points in the training set and m = number of variables, about O(N × m) candidate splits in total (see the sketch below)
How to choose which line to use for splitting?
– The line that reduces impurity (≈ heterogeneity of composition) the most
How to measure impurity?
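A minimal sketch of enumerating the candidate split points, assuming NumPy is available; the helper name candidate_splits is hypothetical. For each variable it returns the midpoints between consecutive distinct values, giving roughly O(N × m) candidates overall.

    import numpy as np

    def candidate_splits(X: np.ndarray) -> dict:
        """Candidate split points per variable: midpoints between
        consecutive distinct sorted values of that variable."""
        splits = {}
        for j in range(X.shape[1]):
            values = np.unique(X[:, j])                 # sorted distinct values
            splits[j] = (values[:-1] + values[1:]) / 2  # midpoints
        return splits

    # Example: two variables (income in thousand rupees, age) for a tiny training set
    X = np.array([[40, 25], [80, 60], [120, 35], [101, 54]], dtype=float)
    print(candidate_splits(X))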
6
Gini Index for Measuring Impurity
Suppose there are C classes
Let p(i|t) = fraction of observations belonging to class i in rectangle (node) t
Gini index: Gini(t) = 1 − Σ_{i=1..C} [p(i|t)]²
If all observations in t belong to one single class, Gini(t) = 0
When is Gini(t) maximum?
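A minimal sketch of the Gini computation on a list of class labels; the function name gini is hypothetical.

    from collections import Counter

    def gini(labels) -> float:
        """Gini index of a node: 1 - sum_i p(i|t)^2."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    print(gini(["Y", "Y", "Y", "Y"]))  # 0.0 – pure node
    print(gini(["Y", "Y", "N", "N"]))  # 0.5 – maximum for C = 2 (1 - 1/C in general)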
7
Entropy
Average amount of information contained
From another point of view – the average amount of information expected, hence the amount of uncertainty
– We will study this in more detail later
Entropy: Entropy(t) = − Σ_{i=1..C} p(i|t) log₂ p(i|t), where 0 log₂ 0 is defined to be 0
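A matching sketch for entropy; the function name entropy is hypothetical. Classes with zero count simply do not appear in the sum, which matches the convention 0 log₂ 0 = 0.

    import math
    from collections import Counter

    def entropy(labels) -> float:
        """Entropy of a node: -sum_i p(i|t) * log2(p(i|t))."""
        n = len(labels)
        return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

    print(entropy(["Y", "Y", "Y", "Y"]))  # 0.0 – pure node
    print(entropy(["Y", "Y", "N", "N"]))  # 1.0 – maximum for C = 2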
8
Classification Error
What if we stop the tree building at a node?
– That is, do not create any further branches for that node
– Make that node a leaf
– Classify the node with the most frequent class present in the node
Classification error as a measure of impurity: Error(t) = 1 − max_i p(i|t)
[Figure: a rectangle (node) that is still impure.]
Intuitively, this is the fraction of points in the rectangle (node) not belonging to its most frequent class
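A corresponding sketch for the classification-error impurity; the function name classification_error is hypothetical.

    from collections import Counter

    def classification_error(labels) -> float:
        """Classification error of a node: 1 - max_i p(i|t),
        i.e. the error of predicting the node's most frequent class."""
        n = len(labels)
        return 1.0 - max(Counter(labels).values()) / n

    print(classification_error(["Y", "Y", "Y", "N"]))  # 0.25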
9
The Full Blown Tree
Recursive splitting: suppose we don't stop until all nodes are pure
The result is a large decision tree whose leaf nodes have very few data points
– Does not represent the classes well
– Overfitting
Solution:
– Stop earlier, or
– Prune back the tree
[Figure: a fully grown tree; the number of points per node shrinks from 1000 at the root (split into 400 and 600) down to leaves containing only a handful of points (1, 2 or 5) – statistically not significant.]
10
Prune back
Pruning step: collapse leaf nodes and make the immediate parent a leaf node
Effect of pruning
– We lose some purity of the nodes
– But were they really pure, or was that noise? Too many nodes ≈ fitting noise
Trade-off between loss of purity and reduction in complexity
[Figure: a decision node (Freq = 7) with two leaves – label = Y, Freq = 5 and label = B, Freq = 2 – is pruned into a single leaf node with label = Y, Freq = 7.]
11
Prune back: cost complexity
Cost complexity of a (sub)tree: the classification error (based on training data) plus a penalty for the size of the tree
CC(T) = Err(T) + α L(T)
– Err(T) is the classification error
– L(T) = number of leaves in T
– Penalty factor α is between 0 and 1; if α = 0, there is no penalty for a bigger tree
[Figure: the same pruning step as before – the subtree with leaves (Y, Freq = 5) and (B, Freq = 2) under a decision node (Freq = 7) is collapsed into a single leaf (Y, Freq = 7).]
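A small numerical sketch of this trade-off on the figure's subtree, assuming the form CC(T) = Err(T) + α L(T) above, measuring the error only on the 7 training points that reach this node, and using a hypothetical α = 0.3; the function name cost_complexity is illustrative, not from the slides.

    def cost_complexity(err: float, n_leaves: int, alpha: float) -> float:
        """Cost complexity CC(T) = Err(T) + alpha * L(T)."""
        return err + alpha * n_leaves

    # Figure's numbers: 7 points reach the decision node (5 labelled Y, 2 labelled B).
    alpha = 0.3                                                  # hypothetical penalty factor
    keep  = cost_complexity(err=0 / 7, n_leaves=2, alpha=alpha)  # keep subtree: both leaves pure
    prune = cost_complexity(err=2 / 7, n_leaves=1, alpha=alpha)  # prune to a single leaf labelled Y
    print(keep, prune)   # prune when the pruned version's cost complexity is not higher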
12
Different Decision Tree Algorithms
Chi-square Automatic Interaction Detector (CHAID)
– Gordon Kass (1980)
– Stop subtree creation if it is not statistically significant by a chi-square test
Classification and Regression Trees (CART)
– Breiman et al.
– Decision tree building by the Gini index
Iterative Dichotomizer 3 (ID3)
– Ross Quinlan (1986)
– Splitting by information gain (difference in entropy)
C4.5
– Quinlan's next algorithm, an improvement over ID3
– Bottom-up pruning, handles both categorical and continuous variables
– Handling of incomplete data points
C5.0
– Ross Quinlan's commercial version
13
Properties of Decision Trees
Non-parametric approach
– Does not require any prior assumptions about the probability distributions of the class and the attributes
Finding an optimal decision tree is an NP-complete problem
– Heuristics used: greedy, recursive partitioning, top-down construction, bottom-up pruning
Fast to generate, fast to classify
Easy to interpret or visualize
Error propagation
– An error at the top of the tree propagates all the way down
14
References
Introduction to Data Mining, by Tan, Steinbach, Kumar
– Chapter 4 is available online: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf