Slide 1: COMP527: Data Mining - Classification: Trees
M. Sulaiman Khan
Dept. of Computer Science, University of Liverpool
February 11, 2009

Slide 2: COMP527: Data Mining (Course Outline)
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam

Slide 3: Today's Topics
Trees
Tree Learning Algorithm
Attribute Splitting Decisions: Random, 'Purity Count', Entropy (aka ID3), Information Gain Ratio

Slide 4: Trees
Anything can be made better by storing it in a tree structure! (Not really!) Instead of having lists or sets of rules, why not have a tree of rules? Then there is no problem with ordering, or with repeating the same test over and over again in different conjunctive rules. Each node in the tree is an attribute test, and the branches from that node are the different outcomes of the test. Instead of 'separate and conquer', decision trees take the more typical 'divide and conquer' approach. Once the tree is built, new instances can be classified by simply stepping through the tests from root to leaf.

Slide 5: Here's our example data again
How do we construct a tree from it, instead of rules?
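
The data table itself is not reproduced in this text version of the slide. Judging from the class counts used on the following slides (9 yes / 5 no overall; outlook = sunny 2/3, overcast 4/0, rainy 3/2; windy = false 6/2, true 3/3), it appears to be the standard 14-instance 'weather' data set. A sketch of it in Python, under that assumption, for use in the later examples:

```python
# The 14-instance weather data set (assumed; it matches every count quoted on later slides).
weather = [
    {"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "windy": True,  "play": "no"},
    {"outlook": "overcast", "temp": "hot",  "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temp": "mild", "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temp": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temp": "cool", "humidity": "normal", "windy": True,  "play": "no"},
    {"outlook": "overcast", "temp": "cool", "humidity": "normal", "windy": True,  "play": "yes"},
    {"outlook": "sunny",    "temp": "mild", "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "sunny",    "temp": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temp": "mild", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "sunny",    "temp": "mild", "humidity": "normal", "windy": True,  "play": "yes"},
    {"outlook": "overcast", "temp": "mild", "humidity": "high",   "windy": True,  "play": "yes"},
    {"outlook": "overcast", "temp": "hot",  "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temp": "mild", "humidity": "high",   "windy": True,  "play": "no"},
]
```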

Slide 6: Tree Learning Algorithm
Trivial Tree Learner:
  1. create empty tree T
  2. select attribute A
  3. create branches in T for each value v of A
  4. for each branch, recurse with the instances where A=v, and add the resulting tree as a branch node
The most interesting part of this algorithm is line 2, the attribute selection. Let's start with a random selection, then look at how it might be improved.
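
A minimal runnable sketch of this trivial learner in Python (not the lecturer's code). Instances are assumed to be dicts like the `weather` rows sketched above, and the attribute-selection step (line 2) is left as a plug-in function, random by default, which is exactly the part the rest of the lecture refines:

```python
import random
from collections import Counter

def majority_class(rows, target="play"):
    # Fallback label when no attributes are left to test.
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, target="play", choose=None):
    """Trivial tree learner: pick an attribute, branch on each observed value, recurse."""
    choose = choose or (lambda rows, attrs: random.choice(attrs))   # line 2: attribute selection
    classes = {r[target] for r in rows}
    if len(classes) == 1:            # pure node: stop and return the class
        return classes.pop()
    if not attributes:               # no tests left: majority vote
        return majority_class(rows, target)
    a = choose(rows, attributes)
    tree = {"attribute": a, "branches": {}}
    remaining = [x for x in attributes if x != a]
    for v in {r[a] for r in rows}:   # one branch per value of A seen in these rows
        subset = [r for r in rows if r[a] == v]
        tree["branches"][v] = build_tree(subset, remaining, target, choose)
    return tree

# e.g. build_tree(weather, ["outlook", "temp", "humidity", "windy"])
```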

Slide 7: Random method: Let's pick 'windy'
Splitting on windy gives two branches:
  windy = false: 6 yes, 2 no
  windy = true:  3 yes, 3 no
We need to split again, looking at only the 8 and 6 instances respectively. For windy=false, we'll randomly select outlook:
  sunny: no, no, yes | overcast: yes, yes | rainy: yes, yes, yes
As all instances down overcast and rainy are yes, those branches stop; sunny continues.
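
For example, the class distribution down each branch of the 'windy' split can be checked with a couple of lines (reusing the `weather` list from the earlier sketch):

```python
from collections import Counter

for value in (False, True):
    counts = Counter(r["play"] for r in weather if r["windy"] == value)
    print(f"windy={value}: {counts['yes']} yes, {counts['no']} no")
# windy=False: 6 yes, 2 no
# windy=True:  3 yes, 3 no
```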

Slide 8: Attribute Selection
As we may have thousands of attributes and/or values to test, we want to construct small decision trees. Think back to RIPPER's description length: the smallest decision tree will have the smallest description length. So how can we reduce the number of nodes in the tree? We want all paths through the tree to be as short as possible. Nodes containing a single class stop a path, so we want them to appear early in the tree; otherwise the same tests will occur in multiple branches. Think back: the first rule we generated was outlook=overcast, because it was pure.

Slide 9: Attribute Selection: Purity
'Purity' count: select the attribute that has the most 'pure' nodes, breaking equal counts randomly. For example, splitting on outlook:
  sunny:    2 yes, 3 no
  overcast: 4 yes
  rainy:    3 yes, 2 no
Still mediocre: most data sets won't have pure nodes for several levels. We need a measure of purity instead of the simple count.

Slide 10: Attribute Selection: Entropy
For each test:
  Maximal purity: all values are the same.
  Minimal purity: equal numbers of each value.
We need a scale between maximal and minimal purity, which we can then merge across all of the attribute's tests. One function that calculates this is the entropy function:
  entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
p1 ... pn are the numbers of instances of each class, expressed as fractions of the total number of instances at that point in the tree. log is base 2.
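
A direct transcription of this function in Python, as a small sketch; it takes raw class counts and normalises them, matching the slide's info(.,.) notation:

```python
from math import log2

def entropy(*counts):
    """Entropy (base 2) of a class distribution given as raw counts."""
    total = sum(counts)
    # Terms with a zero count are skipped: 0*log(0) is treated as 0 (see the next slide).
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy(2, 3))   # ~0.971
print(entropy(9, 5))   # ~0.940
# entropy(4, 0) is 0: a pure node contributes no information.
```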

Slide 11: Attribute Selection: Entropy
  entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
This calculates one test. For outlook there are three tests:
  sunny:    info(2,3) = -2/5 log(2/5) - 3/5 log(3/5) = 0.529 + 0.442 = 0.971
  overcast: info(4,0) = -(4/4 * log(4/4)) - (0 * log(0))
Uh-oh! log(0) is undefined. But note that we're multiplying it by 0, so whatever it is, the final result will be 0.

Slide 12: Attribute Selection: Entropy
  sunny:    info(2,3) = 0.971
  overcast: info(4,0) = 0.0
  rainy:    info(3,2) = 0.971
But we have 14 instances to divide down those paths, so the total for outlook is:
  (5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971) = 0.693
Now, to calculate the gain, we work out the entropy for the top node and subtract the entropy for outlook:
  info(9,5) = 0.940
  gain(outlook) = 0.940 - 0.693 = 0.247
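
The same arithmetic as a short Python check, working directly from the class counts quoted on the slide (same entropy helper as the earlier sketch):

```python
from math import log2

def info(*counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

outlook_info = 5/14 * info(2, 3) + 4/14 * info(4, 0) + 5/14 * info(3, 2)
gain_outlook = info(9, 5) - outlook_info
print(f"{outlook_info:.3f} {gain_outlook:.3f}")   # 0.694 0.247 (the slide rounds the first value to 0.693)
```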

Slide 13: Attribute Selection: Entropy
Now we calculate the gain for all of the attributes:
  gain(outlook)     = 0.247
  gain(humidity)    = 0.152
  gain(windy)       = 0.048
  gain(temperature) = 0.029
and select the maximum, which is outlook. This is (also!) called information gain. The total is the information, measured in 'bits'. Equally, we could select the attribute needing the minimum amount of information -- the minimum description length issue in RIPPER. Let's do the next level, where outlook=sunny.

Slide 14: Attribute Selection: Entropy
Now we calculate the gain for each remaining attribute within outlook=sunny:
  Temp:     hot info(0,2), mild info(1,1), cool info(1,0)
  Humidity: high info(0,3), normal info(2,0)
  Windy:    false info(1,2), true info(1,1)
We don't even need to do the math: humidity is the obvious choice, as it predicts all 5 instances correctly. Thus the information will be 0, and the gain will be maximal.

Slide 15: Attribute Selection: Entropy
Now our tree looks like:
  Outlook
    sunny    -> Humidity
                  high   -> no
                  normal -> yes
    overcast -> yes
    rainy    -> ?
This algorithm is called ID3, developed by Quinlan.

Slide 16: Entropy: Issues
A nasty side effect of entropy: it prefers attributes with a large number of branches. For example, if there were an 'identifier' attribute with a unique value for each instance, it would uniquely determine the class but be useless for classification (over-fitting!):
  Eg: info(0,1), info(0,1), info(1,0), ...
The attribute doesn't even need to be unique. If we assign 1 to the first two instances, 2 to the next two, and so forth, we still get a 'better' split.

Slide 17: Entropy: Issues
Half-identifier 'attribute':
  info(0,2), info(2,0), info(1,1), info(1,1), info(2,0), info(2,0), info(1,1)
2/14 of the instances go down each route, so:
  = 0*2/14 + 0*2/14 + 1*2/14 + 1*2/14 + 0*2/14 + 0*2/14 + 1*2/14
  = 3 * (2/14 * 1.0) = 3/14 = 0.214
Gain is: 0.940 - 0.214 = 0.726
Remember that the gain for outlook was only 0.247! Urgh. Once more we run into over-fitting.
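
A quick numerical check of this over-fitting effect, again working only from class counts:

```python
from math import log2

def info(*counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

top = info(9, 5)                                        # 0.940 bits at the root
half_id = [(0, 2), (2, 0), (1, 1), (1, 1), (2, 0), (2, 0), (1, 1)]
half_id_info = sum(2/14 * info(*g) for g in half_id)    # 3/14 = 0.214
print(round(top - half_id_info, 3))                     # gain(half-identifier) = 0.726
print(round(top, 3))                                    # gain(identifier) = 0.94, even 'better'
```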

Slide 18: Gain Ratio
Solution: use a gain ratio. Calculate the entropy of the split itself, disregarding classes, across all of the daughter nodes: eg info(2,2,2,2,2,2,2) for half-identifier and info(5,4,5) for outlook.
  identifier      = -1/14 * log(1/14) * 14 = 3.807
  half-identifier = -1/7 * log(1/7) * 7 = 2.807
  outlook         = 1.577
Ratios (gain divided by this split entropy):
  identifier      = 0.940 / 3.807 = 0.247
  half-identifier = 0.726 / 2.807 = 0.259
  outlook         = 0.247 / 1.577 = 0.157
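
A sketch of the gain ratio in Python: the denominator is the entropy of the branch sizes, ignoring the classes entirely:

```python
from math import log2

def info(*counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_ratio(gain, branch_sizes):
    """Information gain divided by the 'split info' (entropy of the branch sizes)."""
    return gain / info(*branch_sizes)

print(round(gain_ratio(0.940, [1] * 14), 3))   # identifier:      0.247
print(round(gain_ratio(0.726, [2] * 7), 3))    # half-identifier: 0.259
print(round(gain_ratio(0.247, [5, 4, 5]), 3))  # outlook:         0.157
```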

Slide 19: Gain Ratio
Close to success: it picks half-identifier (only accurate in 4/7 branches) over identifier (accurate in all 14 branches)!
  half-identifier = 0.259
  identifier      = 0.247
  outlook         = 0.157
  humidity        = 0.152
  windy           = 0.049
  temperature     = 0.019
Humidity is now also very close to outlook, whereas before they were separated.

Slide 20: Gain Ratio
We can simply check for identifier-like attributes and ignore them; in fact, they should be removed from the data before the data mining begins. However, the ratio can also over-compensate: it might pick an attribute just because its split entropy is low. Note how close humidity and outlook became... maybe that's not such a good thing?
Possible fix (see the sketch below): first generate the information gain for each attribute, throw away any attribute whose gain is below the average, then compare the rest using the ratio.
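
A sketch of that fix as a selection routine. The per-attribute gains are the ones quoted on the slides; the split-info values for humidity, windy and temperature are not given on the slides and are back-computed here from the quoted gains and ratios, so treat them as assumptions:

```python
def select_attribute(stats):
    """stats maps attribute -> (gain, split_info).
    Keep only attributes with at least average gain, then take the best gain ratio."""
    avg_gain = sum(g for g, _ in stats.values()) / len(stats)
    ratios = {a: g / s for a, (g, s) in stats.items() if g >= avg_gain and s > 0}
    return max(ratios, key=ratios.get)

stats = {
    "outlook":     (0.247, 1.577),
    "humidity":    (0.152, 1.000),
    "windy":       (0.048, 0.985),
    "temperature": (0.029, 1.557),
}
print(select_attribute(stats))   # 'outlook': only outlook and humidity pass the
                                 # average-gain filter, and outlook wins on ratio.
```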

Slide 21: Alternative: Gini
An alternative method to information gain is the Gini index. The total for node D is:
  gini(D) = 1 - sum(p1^2, p2^2, ..., pn^2)
where p1..pn are the frequency ratios of classes 1..n in D. So the Gini index for the entire set is:
  gini(D) = 1 - ((9/14)^2 + (5/14)^2) = 1 - (0.413 + 0.128) = 0.459
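
The same calculation as a small Python sketch, taking raw class counts:

```python
def gini(*counts):
    """Gini index of a node, from raw class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini(9, 5), 3))   # 0.459 for the whole set (9 yes, 5 no)
```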

Slide 22: Alternative: Gini
The Gini value of a split of D into subsets D1 ... Dn is:
  split(D) = N1/N gini(D1) + N2/N gini(D2) + ... + Nn/N gini(Dn)
where Ni is the size of subset Di, and N is the size of D. Eg, outlook splits the set into subsets of size 5, 4 and 5:
  split = 5/14 gini(sunny) + 4/14 gini(overcast) + 5/14 gini(rainy)
  sunny    = 1 - ((2/5)^2 + (3/5)^2) = 1 - 0.52 = 0.48
  overcast = 1 - ((4/4)^2 + (0/4)^2) = 0.0
  rainy    = sunny = 0.48
  split = (5/14 * 0.48) * 2 = 0.343
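
And the split value in Python, a sketch that takes the per-branch class counts (2/3, 4/0 and 3/2 for outlook):

```python
def gini(*counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(branches):
    """Weighted Gini over the branches of a split; branches is a list of class-count tuples."""
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * gini(*b) for b in branches)

print(round(gini_split([(2, 3), (4, 0), (3, 2)]), 3))   # outlook: 0.343
```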

Slide 23: Gini
The attribute that generates the smallest Gini split value is chosen to split the node on. (Working out the other attributes is left as an exercise for you!) Gini is used in CART (Classification and Regression Trees), IBM's IntelligentMiner system, and SPRINT (Scalable PaRallelizable INduction of decision Trees). It comes from the Italian statistician Corrado Gini, who used it to measure income inequality.

Slide 24: Decision Tree Issues
The various problems that a good decision tree builder needs to address:
  Ordering of attribute splits: as seen, we need to build the tree by picking the best attribute to split on first.
  Numeric/missing data: dividing numeric data is more complicated. How?
  Tree structure: a balanced tree with the fewest levels is preferable.
  Stopping criteria: as with rules, we need to stop adding nodes at some point. When?
  Pruning: it may be beneficial to prune the tree once created, or even incrementally.

Slide 25: Further Reading
Introductory statistical text books
Witten, 3.2, 4.3
Dunham, 4.4
Han, 6.3
Berry and Browne, Chapter 4
Berry and Linoff, Chapter 6