Business Intelligence Technologies – Data Mining


Business Intelligence Technologies – Data Mining Lecture 4 Classification, Decision Tree

Classification
Mapping instances onto a predefined set of classes.

Age   Income   City       Gender   Response
30    50K      New York   M        No
50    125K    Tampa      F        Yes
…     …        …          …        …
28    35K      Orlando    …        ???

Examples: Classify each customer as "Responder" versus "Non-Responder"; classify cellular calls as "legitimate" versus "fraudulent".

Prediction
Broad definition: build a model to estimate any type of value (predictive data mining).
Narrow definition: estimate continuous values.
Examples: Predict how much money a customer will spend; predict the value of a stock.

Age   Income   City       Gender   Dollar Spent
30    50K      New York   M        $150
50    125K    Tampa      F        $400
…     …        …          …        $0
…     …        …          …        $230
28    35K      Orlando    …        ???

Classification: Terminology Inputs = Predictors = Independent Variables Outputs = Responses = Dependent Variables Models = Classifiers Data points: examples, instances With classification, we want to build a (classification) model to predict the outputs given the inputs.

Steps in Classification
Step 1: Build the model on training data (both input and output known).
Step 2: Validate the model on test data.
Step 3: Apply the validated model to new data (input only) to predict the output.
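The three steps can be sketched in code. Below is a minimal illustration with scikit-learn and one of its built-in datasets; the library, dataset, and parameter values are our own choices for the sketch, not part of the lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: build the model on training data (inputs and outputs both known)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: validate the model on held-out test data
print("test accuracy:", model.score(X_test, y_test))

# Step 3: apply the validated model to new data (inputs only)
new_instances = X_test[:5]            # stand-in for genuinely new records
print("predicted classes:", model.predict(new_instances))
```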

Common Classification Techniques: Decision tree, Logistic regression, K-nearest neighbors, Neural network, Naïve Bayes.

Decision Tree --- An Example
Employed? (root)
  Yes: Class = Not Default
  No: Balance? (node)
    < 50K: Class = Not Default
    >= 50K: Age?
      > 45: Class = Not Default
      <= 45: Class = Default (leaf)

Decision Tree Representation
A series of nested tests: each node represents a test on one attribute.
Nominal attributes: each branch could represent one or more values.
Numeric attributes are split into ranges, normally a binary split.
Leaves: a class assignment (e.g., Default / Not Default).
(Diagram: a tree testing Balance, Age, and Employed, with class leaves.)

The Use of a Decision Tree: Classifying New Instances
To determine the class of a new instance, e.g., Mark: age 40, unemployed, balance 88K.
The instance is routed down the tree according to the values of its attributes.
At each node a test is applied to one attribute.
When a leaf is reached, the instance is assigned to that leaf's class.

Goal of Decision Tree Construction
Partition the training instances into purer sub-groups.
Pure: the instances in a sub-group mostly belong to the same class.
(Diagram: the entire population of Default / Not Default instances is split by Balance >= 50K vs. < 50K and then by Age >= 45 vs. < 45.)
How to build a tree = how to split instances into purer sub-groups.

Why do we want to identify pure sub-groups? To classify a new instance, we determine the leaf that the instance belongs to based on its attributes. If the leaf is very pure (e.g., all its instances have defaulted), we can say with greater confidence that the new instance belongs to that class (i.e., the "Default" class). If the leaf is not very pure (e.g., a 50%/50% mixture of the two classes, Default and Not Default), our prediction for the new instance is little better than random guessing.

Decision Tree Construction
A tree is constructed by recursively partitioning the examples. With each partition the examples are split into increasingly purer sub-groups. The key in building a tree is how to split.

Building a Tree - Choosing a Split

ApplicantID   City   Children   Income   Status
1             …      Many       Medium   DEFAULTS
2             …      Many       Low      DEFAULTS
3             …      Few        Medium   PAYS
4             …      Few        High     PAYS

Try a split on the Children attribute: Many -> {DEFAULTS, DEFAULTS}; Few -> {PAYS, PAYS}.
Try a split on the Income attribute: Low -> {DEFAULTS}; Medium -> {DEFAULTS, PAYS}; High -> {PAYS}.
Notice how the split on the Children attribute gives purer partitions. It is therefore chosen as the first split (and in this case the only split, because the two sub-groups are 100% pure).
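As a quick check of the comparison above, here is a small sketch (pandas is our choice of tool for illustration; the column names mirror the table):

```python
import pandas as pd

applicants = pd.DataFrame({
    "Children": ["Many", "Many", "Few", "Few"],
    "Income":   ["Medium", "Low", "Medium", "High"],
    "Status":   ["DEFAULTS", "DEFAULTS", "PAYS", "PAYS"],
})

# Splitting on Children gives pure sub-groups: Many -> all DEFAULTS, Few -> all PAYS
print(applicants.groupby("Children")["Status"].value_counts())

# Splitting on Income mixes the classes in the Medium branch
print(applicants.groupby("Income")["Status"].value_counts())
```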

Recursive Steps in Building a Tree - Example
STEP 1: Consider candidate splits, e.g., Split Option A and Split Option B. Option A is not good, as its sub-nodes are still very heterogeneous. Option B is better, as the purity of its sub-nodes is improving.
STEP 2: Choose Split Option B, as it is the better split.
STEP 3: Try out splits on each of the sub-nodes of Split Option B. Eventually, we arrive at a full tree.
Notice how examples in a parent node are split between sub-nodes, i.e., the training examples are partitioned into smaller and smaller subsets. Also, notice that sub-nodes are purer than their parent nodes.

Recursive Steps in Building a Tree
STEP 1: Try using different attributes to split the training examples into different subsets.
STEP 2: Rank the splits. Choose the best split.
STEP 3: For each node obtained by splitting, repeat from STEP 1, until no more good splits are possible.
Note: Usually it is not possible to create leaves that are completely pure, i.e., contain one class only, as that would result in a very bushy tree that does not generalize well. However, it is possible to create leaves that are purer, i.e., contain predominantly one class, and we can settle for that.

Purity Measures
Many purity measures are available: Gini (population diversity), Entropy (information gain), Information Gain Ratio, Chi-square test.
The most common one (from information theory) is Information Gain.
Informally: how informative is the attribute in distinguishing among instances (e.g., credit applicants) from different classes (Yes/No default)?

Information Gain
Consider the two following splits. Which one is more informative?
Split over whether Balance exceeds 50K (branches: over 50K, 50K or less).
Split over whether the applicant is employed (branches: employed, unemployed).

Information Gain
Impurity/Entropy: measures the level of impurity (chaos) in a group of examples.
Information gain is defined as the decrease in impurity achieved by a split that generates purer sub-groups.

Impurity
(Diagram: three groups, from a very impure group, to a less impure group, to a group with minimum impurity.)
When examples can belong to one of two classes, what is the worst case of impurity?

Calculating Impurity
Impurity = - Σ pi log2(pi), where pi is the proportion of class i.
For example: our initial population is composed of 16 cases of class "Default" and 14 cases of class "Not Default".
Impurity (entire population of examples) = -(16/30) log2(16/30) - (14/30) log2(14/30) = 0.997

2-class Cases
What is the impurity of a group in which all examples belong to the same class?
Impurity = -1 log2(1) - 0 log2(0) = 0 (minimum impurity, the lowest possible value)
What is the impurity of a group with 50% in either class?
Impurity = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 (maximum impurity, the highest possible value)
Note: 0 log2(0) is defined as 0.
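A minimal sketch of this impurity (entropy) calculation in Python; the helper name impurity is our own:

```python
import math

def impurity(counts):
    """Entropy of a group, given the count of examples in each class."""
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c > 0:                     # 0 * log2(0) is defined as 0
            p = c / total
            ent -= p * math.log2(p)
    return ent

print(impurity([16, 14]))   # ~0.997, the 30-instance Default / Not Default population
print(impurity([30, 0]))    # 0.0, a completely pure group
print(impurity([15, 15]))   # 1.0, a 50%/50% mixture of two classes
```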

Calculating Information Gain
Information Gain = Impurity(parent) - Impurity(children)
Entire population: 30 instances, impurity 0.997.
Split on Balance: 17 instances with Balance >= 50K, 13 instances with Balance < 50K.
(Weighted) average impurity of children = (17/30) * 0.787 + (13/30) * 0.391 = 0.615
Information Gain = 0.997 - 0.615 = 0.382

Information Gain of a Split The weighted average of the impurity of the children nodes after a split is compared with the impurity of the parent node to see how much information is gained through the split. A child node with fewer examples in it has a lower weight

Information Gain
Information Gain = Impurity(parent) - Impurity(children)
Entire population A: Impurity(A) = 0.997
Split A on Balance: B (Balance >= 50K), Impurity(B) = 0.787; C (Balance < 50K), Impurity(C) = 0.391; weighted Impurity(B,C) = 0.615; Gain = 0.382.
Split B on Age: D (Age >= 45), Impurity(D) = -0 log2(0) - 1 log2(1) = 0; E (Age < 45), Impurity(E) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985; weighted Impurity(D,E) = 0.406; Gain = 0.381.

Which attribute to split over? At each node examine splits over each of the attributes Select the attribute for which the maximum information gain is obtained For a continuous attribute, also need to consider different ways of splitting (>50 or <=50; >60 or <=60) For a categorical attribute with lots of possible values, sometimes also need to consider how to group these values ( branch 1 corresponds to {A,B,E} and branch 2 corresponds to {C,D,F,G})

Calculating the Information Gain of a Split
For each sub-group produced by the split, calculate the impurity/entropy of that sub-group.
Calculate the weighted impurity of the split by weighting each sub-group's impurity by the proportion of training examples (out of the training examples in the parent node) that fall in that sub-group.
Calculate the impurity of the parent node, and subtract the weighted impurity of the child nodes to obtain the information gain for the split.
Note: If impurity increases with the split, then the tree is getting worse, so we wouldn't want to make the split. So the information gain of a split needs to be positive in order to choose that split.
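The procedure above, sketched as a small helper (the function name is our own); the numbers reproduce the Balance split from the earlier slide:

```python
def information_gain(parent_impurity, children):
    """children: a list of (number_of_examples, impurity) pairs, one per sub-group."""
    n = sum(size for size, _ in children)
    weighted_child_impurity = sum(size / n * imp for size, imp in children)
    return parent_impurity - weighted_child_impurity

# Balance split: parent impurity 0.997; 17 instances with impurity 0.787
# and 13 instances with impurity 0.391 -> gain of about 0.382
print(information_gain(0.997, [(17, 0.787), (13, 0.391)]))
```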

Person   Hair Length   Weight   Age   Class
Homer    0"            250      36    M
Marge    10"           150      34    F
Bart     2"            90       10    M
Lisa     6"            78       8     F
Maggie   4"            20       1     F
Abe      1"            170      70    M
Selma    8"            160      41    F
Otto     …             180      38    M
Krusty   …             200      45    M
Comic    8"            290      38    ?

Let us try splitting on Hair Length <= 5.
Parent: Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Yes (Hair Length <= 5): Entropy(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
No (Hair Length > 5): Entropy(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710
Gain = Entropy of parent - weighted average of the entropies of the children
Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911

Let us try splitting on Weight <= 160.
Parent: Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Yes (Weight <= 160): Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
No (Weight > 160): Entropy(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0
Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900

Let us try splitting on Age <= 40.
Parent: Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Yes (Age <= 40): Entropy(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
No (Age > 40): Entropy(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183
Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
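The three gains above can be reproduced directly from the class counts; a small sketch (the helper names are ours):

```python
import math

def impurity(counts):
    """Entropy of a group from its per-class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, children_counts):
    """Information gain of a split described by per-class counts of each child."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * impurity(child) for child in children_counts)
    return impurity(parent_counts) - weighted

parent = (4, 5)                                    # 4 F, 5 M
print(gain(parent, [(1, 3), (3, 2)]))              # Hair Length <= 5 -> ~0.0911
print(gain(parent, [(4, 1), (0, 4)]))              # Weight <= 160    -> ~0.5900
print(gain(parent, [(3, 3), (1, 2)]))              # Age <= 40        -> ~0.0183
```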

Of the 3 features we had, Weight was the best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified... so we simply continue splitting.
First split: Weight <= 160? For the "yes" branch we find that we can then split on Hair Length <= 2, and then we are done. Aha, we have a tree!
Note: the splitting decision is not only which attribute to choose, but also which value to split on for a continuous attribute (e.g., Hair Length <= 5 or Hair Length <= 2).

Building a Tree - Stopping Criteria
You can stop building the tree when:
The impurity of all nodes is zero. The problem is that this tends to lead to bushy, highly branching trees, often with one example at each node.
No split achieves a significant gain in purity (the information gain is not high enough).
The node size is too small: that is, a node contains fewer than a certain number of examples, or a certain proportion of the training set.
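These stopping criteria correspond to hyperparameters in typical tree implementations. A hedged sketch with scikit-learn (the library and the specific parameter values are our own choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(
    criterion="entropy",           # split on information gain
    min_impurity_decrease=0.01,    # require a minimum gain in purity to split
    min_samples_leaf=5,            # node-size criterion: no leaf smaller than 5 examples
    max_depth=4,                   # additional cap to avoid bushy trees
    random_state=0,
).fit(X, y)
print("number of leaves:", clf.get_n_leaves())
```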

Reading Rules off the Decision Tree
For each leaf in the tree, read the rule from the root to that leaf. You will arrive at a set of rules.
(Tree: Income at the root; Debts tested under Income=Low; Gender tested under Income=High; Children tested under Gender=Male; leaves are Responder / Non-Responder.)
IF Income=Low AND Debts=Low THEN Non-Responder
IF Income=Low AND Debts=High THEN Responder
IF Income=High AND Gender=Male AND Children=Many THEN Responder
IF Income=High AND Gender=Female THEN Non-Responder
IF Income=High AND Gender=Male AND Children=Few THEN Non-Responder
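Rule extraction of this kind is often automated. A sketch using scikit-learn's export_text on a small fitted tree (the iris data is only a stand-in for the responder example above):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each root-to-leaf path printed below corresponds to one IF ... THEN ... rule.
print(export_text(clf, feature_names=list(iris.feature_names)))
```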

Estimate the Accuracy of the Model
The known label of each test sample is compared with the class predicted by the model.
The accuracy rate is the percentage of test set samples that are correctly classified by the model.
If the accuracy on the testing data is acceptable, we can use the model to classify instances whose class labels are not known.
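The accuracy computation itself is just a comparison of known and predicted labels; a tiny sketch (the label values here are made up for illustration):

```python
from sklearn.metrics import accuracy_score

y_known = ["Yes", "No", "No", "Yes", "No"]      # known labels of the test samples
y_model = ["Yes", "No", "Yes", "Yes", "No"]     # labels assigned by the model
print(accuracy_score(y_known, y_model))          # 0.8 -> 80% correctly classified
```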

(Diagram: two trees fitted to the same training data, and evaluated on training and testing data. Tree 1 splits on Glasses and then Hair to separate Male from Female; Tree 2 splits on Hair (Short/Long) alone.)
There are many possible splitting rules that perfectly classify the training data but will not generalize to future datasets.

Overfitting & Underfitting
Overfitting: the model performs poorly on new examples (e.g., testing examples) because it is too highly tuned to the specific training examples (it picks up noise along with the patterns).
Underfitting: the model performs poorly on new examples because it is too simplistic to distinguish between them (i.e., it has not picked up the important patterns from the training examples).
(Chart: error rate versus tree size, with underfitting on the left and overfitting on the right.) Notice how the error rate on the testing data increases for overly large trees.

Pruning
A decision tree is typically more accurate on its training data than on its test data.
Removing branches from a tree can often improve its accuracy on a test set - so-called 'reduced error pruning'.
The intention of pruning is to cut off branches from the tree when this improves performance on test data - this reduces overfitting and makes the tree more general. Small is beautiful.
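The lecture does not name a specific pruning algorithm. As one concrete possibility, scikit-learn exposes cost-complexity pruning, and the pruning level can be chosen on held-out data in the spirit of reduced error pruning (a sketch under that assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning levels: larger ccp_alpha -> more branches removed
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    acc = pruned.score(X_test, y_test)     # judge each pruned tree on held-out data
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc
print("best alpha:", best_alpha, "held-out accuracy:", best_acc)
```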

Decision Tree Classification in a Nutshell A tree structure Internal node denotes a test on an attribute Branch represents an outcome of the test Leaf nodes represent class labels or class distribution Decision tree generation consists of two phases Tree construction At start, all the training examples are at the root Partition examples recursively based on selected attributes Tree pruning Identify and remove branches that reflect noise or outliers To avoid overfitting Use of decision tree: Classifying an unknown sample Test the attribute values of the sample against the decision tree

Strengths & Weaknesses
In practice, one of the most popular methods. Why?
Very comprehensible - the tree structure specifies the entire decision structure.
Easy for decision makers to understand the model's rationale.
Maps nicely to a set of business rules.
Relatively easy to implement.
Very fast to run (to classify examples) with large data sets.
Good at handling missing values: just treat "missing" as a value - it can become a good predictor.
Weaknesses:
Bad at handling continuous data; good at categorical input and output.
Continuous output: high error rate.
Continuous input: the chosen ranges may introduce bias.

Decision Tree Variations - Regression Trees
The leaves of a regression tree predict the average value for the examples that reach that node.
(Example tree, with Age at the root.)
Age < 18: Trust Fund? Yes -> Average Income = $10,000 p.a., Standard Deviation = $1,000; No -> Average Income = $100 p.a., Standard Deviation = $10.
Age 18-65: Professional? Yes -> Average Income = $100,000 p.a., Standard Deviation = $5,000; No -> Average Income = $30,000 p.a., Standard Deviation = $2,000.
Age > 65: Retirement Fund? Yes -> Average Income = $20,000 p.a., Standard Deviation = $800; No -> Average Income = $1,000 p.a., Standard Deviation = $150.
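A minimal sketch of the idea with synthetic data (the age/income numbers below are invented and only loosely echo the example tree): each leaf of the fitted regression tree predicts the mean income of the training examples that reach it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
age = rng.integers(10, 80, size=300).reshape(-1, 1)          # single predictor: age
income = np.where(age.ravel() < 18, 5_000,
         np.where(age.ravel() > 65, 20_000, 60_000)).astype(float)
income += rng.normal(0, 2_000, size=300)                     # noise around each band's mean

reg = DecisionTreeRegressor(max_depth=2).fit(age, income)
print(reg.predict([[15], [40], [70]]))   # roughly the mean income of each age band
```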

Decision Tree Variations - Model Trees
The leaves of a model tree specify a function that can be used to predict the value of examples that reach that node.
(Example tree, with Age at the root.)
Age < 18: Trust Fund? Yes -> Income Per Annum = Number Of Trust Funds * $5,000; No -> Income Per Annum = Age * $50.
Age 18-65: Professional? Yes -> Income Per Annum = $80,000 + (Age * $2,000); No -> Income Per Annum = $20,000 + (Age * $1,000).
Age > 65: Retirement Fund? Yes -> Income Per Annum = FundValue * 10%; No -> Income Per Annum = $100 * Number of Children.

Classification Tree vs. Regression Tree
Classification Tree: Categorical output. The leaves of the tree assign a class label or a probability of being in a class. Aim for nodes such that one class is predominant at each node.
Regression Tree: Numeric output. The leaves of the tree assign an average value (regression trees) or specify a function that can be used to compute a value (model trees) for examples that reach that node. Aim for nodes such that the means between nodes vary as much as possible and the standard deviation or variance within each node is as low as possible.

Different Decision Tree Algorithms
ID3, ID4, ID5, C4.0, C4.5, C5.0, ACLS, and ASSISTANT: use information gain as the splitting criterion.
CART (Classification And Regression Trees): uses the Gini diversity index as the measure of impurity when deciding splits.
CHAID: a statistical approach that uses the Chi-squared test when deciding on the best split.
Hunt's Concept Learning System (CLS) and MINIMAX: minimize the cost of classifying examples correctly or incorrectly.
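In scikit-learn, for instance, the splitting criterion is a single parameter, so the information-gain (entropy) and Gini variants mentioned above differ by one argument (a sketch; the dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for criterion in ("entropy", "gini"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    print(criterion, "depth:", clf.get_depth(), "training accuracy:", clf.score(X, y))
```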

10-fold Cross Validation
Break the data into 10 sets of size n/10. Train on 9 of the sets and test on the remaining one. Repeat 10 times (each set serving once as the test set) and take the mean accuracy.
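A sketch of 10-fold cross-validation with scikit-learn (the library and dataset are our illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())   # mean accuracy over the 10 held-out folds
```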

SAS: % Captured Response

SAS: Lift Value
In SAS, %Response = the percentage of responders among the top n% ranked individuals. It should be relatively high in the top deciles, and a decreasing curve indicates a good model. The lift chart captures the same information on a different scale.

Case Discussion
Fleet: How many input variables are used to build the tree? How many show up in the tree built? Why? How can the tree built be used for segmentation? How can the new campaign results help enhance the tree?
HIV: How does the pruning process work? How does the decision tree pick up the interactions among variables?

Exercise – Decision Tree
(Table: 8 customers, Customer ID 1-8, with attributes Student (Yes/No) and Credit Rating (Fair/Excellent) and the class Buy PDA; most cell values are omitted here.)
Which attribute should we split on first?
Hints: log2(2/3) = -0.585, log2(1/3) = -1.585, log2(1/2) = -1, log2(3/5) = -0.737, log2(2/5) = -1.322, log2(1/4) = -2, log2(3/4) = -0.415