Slide 1: COMP527: Data Mining, Classification: Trees
M. Sulaiman Khan (mskhan@liv.ac.uk), Dept. of Computer Science, University of Liverpool, 2009
February 11, 2009

Slide 2: COMP527: Data Mining (course outline)
Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam

Slide 3: Today's Topics
Trees; Tree Learning Algorithm; Attribute Splitting Decisions: Random, 'Purity Count', Entropy (aka ID3), Information Gain Ratio

Slide 4: Trees
Anything can be made better by storing it in a tree structure! (Not really!) Instead of having lists or sets of rules, why not have a tree of rules? Then there is no problem with ordering, or with repeating the same test over and over again in different conjunctive rules. Each node in the tree is an attribute test, and the branches from that node are the different outcomes of the test. Instead of 'separate and conquer', decision trees take the more typical 'divide and conquer' approach. Once the tree is built, new instances are classified by simply stepping through the tests from the root down to a leaf.

Slide 5
Here's our example data again: the 14-instance weather data set, with attributes outlook, temperature, humidity and windy, and a yes/no class (9 yes, 5 no). [Table shown on the slide.] How do we construct a tree from it, instead of rules?

Slide 6: Tree Learning Algorithm
Trivial Tree Learner:
  1. create an empty tree T
  2. select an attribute A
  3. create a branch in T for each value v of A
  4. for each branch, recurse with the instances where A = v
  5. add the resulting subtree as the branch node
The most interesting part of this algorithm is line 2, the attribute selection. Let's start with random selection, then look at how it might be improved.
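A minimal Python sketch of this trivial learner, under some assumed representation choices that are not in the slides: each instance is a pair of an attribute-to-value dict and a class label, the tree is a nested dict mapping an attribute to its branches, and the selection strategy is passed in so the random choice used here can later be swapped for an entropy-based one. classify() steps through the tests as described on the Trees slide.

```python
import random

def build_tree(instances, attributes, select=None):
    """instances: list of (attribute-dict, class-label) pairs."""
    select = select or random.choice                     # the 'select attribute A' step
    classes = [label for _, label in instances]
    if len(set(classes)) == 1 or not attributes:
        return max(set(classes), key=classes.count)      # leaf: the (majority) class
    attr = select(attributes)
    tree = {attr: {}}
    for value in {inst[attr] for inst, _ in instances}:  # one branch per value of attr
        subset = [(inst, label) for inst, label in instances if inst[attr] == value]
        rest = [a for a in attributes if a != attr]
        tree[attr][value] = build_tree(subset, rest, select)  # recurse, attach subtree
    return tree

def classify(tree, instance):
    """Step through the tests until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][instance[attr]]
    return tree
```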

Slide 7
Random method: let's pick 'windy'.
[Diagram: Windy at the root; false branch: 6 yes, 2 no; true branch: 3 yes, 3 no]
We need to split again, looking at only the 8 and 6 instances respectively. For windy = false, we'll randomly select outlook: sunny: no, no, yes | overcast: yes, yes | rainy: yes, yes, yes. As all instances under overcast and rainy are yes, those branches stop; sunny continues.

Slide 8: Attribute Selection
As we may have thousands of attributes and/or values to test, we want to construct small decision trees. Think back to RIPPER's description length: the smallest decision tree will have the smallest description length. So how can we reduce the number of nodes in the tree? We want all paths through the tree to be as short as possible. Pure, single-class nodes end a path, so we want the tests that produce them to appear early in the tree; otherwise the same tests will recur in multiple branches. Think back: the first rule we generated was outlook = overcast, because it was pure.

Slide 9: Attribute Selection: Purity
'Purity' count: select the attribute that has the most 'pure' branches, breaking ties randomly.
[Diagram: Outlook at the root; sunny: 2 yes, 3 no; overcast: 4 yes; rainy: 3 yes, 2 no]
This is still mediocre: most data sets won't have pure nodes for several levels. We need a measure of purity rather than a simple count.
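As a quick illustration of the count itself, here is a small Python sketch (not from the slides); the branch class counts are the ones shown in the diagrams above.

```python
def pure_branches(branch_counts):
    """Number of branches of a test that contain only a single class."""
    return sum(1 for branch in branch_counts if sum(1 for c in branch if c) == 1)

print(pure_branches([[2, 3], [4, 0], [3, 2]]))   # outlook: 1 pure branch (overcast)
print(pure_branches([[6, 2], [3, 3]]))           # windy:   0 pure branches
```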

Slide 10: Attribute Selection: Entropy
For each test:
  Maximal purity: all the values are the same.
  Minimal purity: equal numbers of each value.
We want a scale between maximal and minimal purity, which we can then combine across all of the attribute's tests. One function that calculates this is the entropy function:
  entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
where p1 ... pn are the class proportions: the number of instances of each class expressed as a fraction of the total number of instances at that point in the tree. log is base 2.

Slide 11: Attribute Selection: Entropy
  entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
This calculates one test. For outlook there are three tests:
  sunny: info(2,3) = -2/5 log(2/5) - 3/5 log(3/5) = 0.5288 + 0.4422 = 0.971
  overcast: info(4,0) = -(4/4 * log(4/4)) - (0 * log(0))
Uh oh, log(0) is undefined. But note that we are multiplying it by 0, so whatever it is, that term contributes 0 and the final result is 0.
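A small Python sketch of the info calculation above, working from class counts rather than fractions; the zero-count case is handled by skipping the term, matching the 0 * log(0) = 0 convention.

```python
from math import log2

def entropy(counts):
    """info(c1, c2, ...): entropy of a node from its class counts, in bits."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c)  # skip zero counts

print(round(entropy([2, 3]), 3))   # sunny:    0.971
print(round(entropy([4, 0]), 3))   # overcast: 0.0
```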

Slide 12: Attribute Selection: Entropy
  sunny: info(2,3) = 0.971
  overcast: info(4,0) = 0.0
  rainy: info(3,2) = 0.971
But we have 14 instances to divide down those paths, so the total for outlook is the weighted sum:
  (5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971) = 0.693
To calculate the gain, we work out the entropy for the top node and subtract the entropy for outlook:
  info(9,5) = 0.940
  gain(outlook) = 0.940 - 0.693 = 0.247
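The same computation as a Python sketch; entropy() is repeated from the previous snippet so this runs on its own, and branch_counts lists the class counts down each branch of the test.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c)

def info_gain(parent_counts, branch_counts):
    """Entropy at the node minus the size-weighted entropy of its branches."""
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

# outlook splits the 14 instances into sunny (2,3), overcast (4,0), rainy (3,2)
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247
```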

Slide 13: Attribute Selection: Entropy
Now we calculate the gain for every attribute:
  gain(outlook) = 0.247
  gain(humidity) = 0.152
  gain(windy) = 0.048
  gain(temperature) = 0.029
and select the maximum, which is outlook. This measure is (also!) called information gain; the weighted total is the information required, measured in 'bits'. Equivalently, we could select the attribute that needs the minimum amount of further information, which is the minimum description length issue we saw with RIPPER. Let's do the next level, where outlook = sunny.
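Choosing the split is then just a matter of taking the attribute with the largest gain, e.g. with the values computed above:

```python
gains = {'outlook': 0.247, 'humidity': 0.152, 'windy': 0.048, 'temperature': 0.029}
best = max(gains, key=gains.get)   # -> 'outlook'
```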

Slide 14: Attribute Selection: Entropy
Now we calculate the gains within the outlook = sunny subset (2 yes, 3 no):
  Temperature: hot info(0,2), mild info(1,1), cool info(1,0)
  Humidity: high info(0,3), normal info(2,0)
  Windy: false info(1,2), true info(1,1)
We don't even need to do the math: humidity is the obvious choice, as it predicts all 5 instances correctly. Its weighted information is therefore 0, and its gain is maximal: info(2,3) - 0 = 0.971.

Slide 15: Attribute Selection: Entropy
Now our tree looks like:
[Diagram: Outlook at the root; sunny: Humidity (high: no, normal: yes); overcast: yes; rainy: ? (still to be expanded)]
This algorithm is called ID3, developed by Quinlan.

Slide 16: Entropy: Issues
A nasty side effect of entropy is that it prefers attributes with a large number of branches. For example, if there were an 'identifier' attribute with a unique value for every instance, it would uniquely determine the class but be useless for classification (over-fitting!):
  info(0,1) info(0,1) info(1,0) ...
The attribute doesn't even need to be unique: if we assign the value 1 to the first two instances, 2 to the next two, and so forth, we still get a 'better' split.

Slide 17: Entropy: Issues
The half-identifier 'attribute':
  info(0,2) info(2,0) info(1,1) info(1,1) info(2,0) info(2,0) info(1,1)
  = 0, 0, 1, 1, 0, 0, 1
2/14 of the instances go down each branch, so:
  = 0*2/14 + 0*2/14 + 1*2/14 + 1*2/14 + ...
  = 3 * (2/14 * 1) = 6/14 = 0.429
The gain is 0.940 - 0.429 = 0.511. Remember that the gain for outlook was only 0.247! Urgh. Once more we run into over-fitting.

Slide 18: Gain Ratio
Solution: use a gain ratio. Divide the gain by the 'split information': the entropy of the daughter nodes calculated disregarding the classes, e.g. info(2,2,2,2,2,2,2) for the half-identifier and info(5,4,5) for outlook.
  identifier: -1/14 * log(1/14) * 14 = 3.807
  half-identifier: -1/7 * log(1/7) * 7 = 2.807
  outlook: 1.577
Ratios:
  identifier = 0.940 / 3.807 = 0.247
  half-identifier = 0.511 / 2.807 = 0.182
  outlook = 0.247 / 1.577 = 0.157
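A Python sketch of the gain ratio, with the earlier helpers repeated so it is self-contained; the half-identifier branch counts are the ones from the previous slide.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c)

def info_gain(parent_counts, branch_counts):
    total = sum(parent_counts)
    return entropy(parent_counts) - sum(sum(b) / total * entropy(b) for b in branch_counts)

def gain_ratio(parent_counts, branch_counts):
    """Information gain divided by the split information (entropy of the branch sizes)."""
    split_info = entropy([sum(b) for b in branch_counts])
    return info_gain(parent_counts, branch_counts) / split_info

half_id = [[0, 2], [2, 0], [1, 1], [1, 1], [2, 0], [2, 0], [1, 1]]
print(round(gain_ratio([9, 5], half_id), 3))                   # 0.182
print(round(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # outlook: 0.156 (0.157 on the slide, from rounded intermediate values)
```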

Slide 19: Gain Ratio
Close to success: the ratio pulls the identifier-like attributes back towards the real ones, but it still ranks both the identifier (accurate in all 14 branches) and the half-identifier (accurate in only 4 of its 7 branches) above outlook:
  identifier = 0.247
  half-identifier = 0.182
  outlook = 0.157
  humidity = 0.152
  windy = 0.049
  temperature = 0.019
Humidity is now also very close to outlook, whereas before they were clearly separated.

Slide 20: Gain Ratio
We can simply check for identifier-like attributes and ignore them; in fact, they should be removed from the data before the data mining begins. However, the ratio can also over-compensate: it might favour an attribute simply because its split information is low. Note how close humidity and outlook became... maybe that's not such a good thing? Possible fix: first generate the information gain and throw away any attributes with less than the average; then compare the rest using the ratio.
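A sketch of that possible fix in Python, reusing the gain and ratio values from the previous slides (with the identifier-like attributes already removed from the data):

```python
gain  = {'outlook': 0.247, 'humidity': 0.152, 'windy': 0.048, 'temperature': 0.029}
ratio = {'outlook': 0.157, 'humidity': 0.152, 'windy': 0.049, 'temperature': 0.019}

average_gain = sum(gain.values()) / len(gain)
candidates = [a for a in gain if gain[a] >= average_gain]   # keeps outlook and humidity
best = max(candidates, key=ratio.get)                       # -> 'outlook'
```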

Slide 21: Alternative: Gini
An alternative method to information gain is the Gini index. The value for a node D is:
  gini(D) = 1 - (p1^2 + p2^2 + ... + pn^2)
where p1 ... pn are the relative frequencies of classes 1 ... n in D. So the Gini index for the entire data set is:
  1 - ((9/14)^2 + (5/14)^2) = 1 - (0.413 + 0.128) = 0.459
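A Python sketch of the node-level Gini index from class counts:

```python
def gini(counts):
    """Gini index of a node, computed from its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))   # whole data set: 0.459
```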

Slide 22: Alternative: Gini
The Gini value of a split of D into subsets D1 ... Dn is:
  giniSplit(D) = N1/N * gini(D1) + N2/N * gini(D2) + ... + Nn/N * gini(Dn)
where Ni is the size of subset Di and N is the size of D. For example, outlook splits the data 5, 4, 5:
  split = 5/14 * gini(sunny) + 4/14 * gini(overcast) + 5/14 * gini(rainy)
  gini(sunny) = 1 - ((2/5)^2 + (3/5)^2) = 1 - 0.52 = 0.48
  gini(overcast) = 1 - ((4/4)^2 + (0/4)^2) = 0.0
  gini(rainy) = gini(sunny) = 0.48
  split = (5/14 * 0.48) * 2 = 0.343
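And a sketch of the split value, with gini() repeated from the previous snippet so it runs on its own; the smaller the value, the better the split.

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(branch_counts):
    """Size-weighted sum of the daughter nodes' Gini indices."""
    total = sum(sum(b) for b in branch_counts)
    return sum(sum(b) / total * gini(b) for b in branch_counts)

print(round(gini_split([[2, 3], [4, 0], [3, 2]]), 3))   # outlook: 0.343
```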

Slide 23: Gini
The attribute that generates the smallest Gini split value is chosen to split the node on. (Working out the values for the remaining attributes is left as an exercise for you!) Gini is used in CART (Classification and Regression Trees), IBM's IntelligentMiner system, and SPRINT (Scalable PaRallelizable INduction of decision Trees). The index comes from the Italian statistician Corrado Gini, who used it to measure income inequality.

Slide 24: Decision Tree Issues
The various problems that a good decision tree builder needs to address:
  Ordering of attribute splits: as we have seen, we need to build the tree by picking the best attribute to split on first.
  Numeric/missing data: splitting on numeric data is more complicated. How should it be handled?
  Tree structure: a balanced tree with the fewest levels is preferable.
  Stopping criteria: as with rules, we need to stop adding nodes at some point. When?
  Pruning: it may be beneficial to prune the tree once it has been created, or incrementally as it is built.

Slide 25: Further Reading
Introductory statistical text books
Witten, 3.2, 4.3
Dunham, 4.4
Han, 6.3
Berry and Browne, Chapter 4
Berry and Linoff, Chapter 6