
Chapter 5 – Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel & Bruce

Why Evaluate?
- Multiple methods are available to classify or predict
- For each method, multiple choices are available for settings
- To choose the best model, we need to assess each model's performance

Types of outcome
- Predicted numerical value: the outcome variable is numerical and continuous
- Predicted class membership: the outcome variable is categorical
- Propensity: the probability of class membership when the outcome variable is categorical

Accuracy Measures (Continuous outcome)

Evaluating Predictive Performance
- Not the same as goodness-of-fit measures (e.g., R² and the standard error of estimate s_Y/X)
- Predictive performance is measured on the validation dataset
- Benchmark: using the average, ȳ, as the prediction

Prediction accuracy measures
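The formulas that appeared on this slide are not preserved in the transcript. As a minimal sketch (not the book's XLMiner output), the standard validation-set measures can be computed from the errors e_i = y_i - ŷ_i; the arrays below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical validation-set actual values and model predictions (placeholder data)
y_valid = np.array([24.0, 21.6, 34.7, 33.4, 36.2, 28.7])
y_pred  = np.array([26.1, 22.0, 32.5, 35.0, 33.9, 27.4])

errors = y_valid - y_pred                         # e_i = y_i - yhat_i
mean_error = errors.mean()                        # average error (signed; reveals bias)
mae  = np.abs(errors).mean()                      # mean absolute error
rmse = np.sqrt((errors ** 2).mean())              # root-mean-squared error
mape = np.abs(errors / y_valid).mean() * 100      # mean absolute percentage error
print(f"Mean error={mean_error:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}  MAPE={mape:.2f}%")
```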

XLMiner Example: Multiple Regression using the Boston Housing dataset

XLMiner: residual output for the training and validation datasets (figures)

SAS Enterprise Miner Example: Multiple Regression using Boston Housing dataset

SAS Enterprise Miner Residuals for Validation dataset

SAS Enterprise Miner Boxplot of residuals

Lift Chart (numerical outcome)
- Order cases by predicted y-value from highest to lowest
- X-axis: cases from 1 to n
- Y-axis: cumulative value of Y
- Two lines are plotted: one using the average ȳ as the prediction for every case, and one using the predictions ŷ from the model
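A minimal matplotlib sketch of this chart, under the assumption (as in the book's figures) that the cumulative actual outcome of cases ranked by the model is plotted against the ȳ benchmark; the data below is synthetic and purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_valid = rng.gamma(shape=2.0, scale=10.0, size=100)   # synthetic actual values
y_pred  = y_valid + rng.normal(0.0, 5.0, size=100)     # synthetic model predictions

order = np.argsort(-y_pred)                            # rank cases by predicted value, descending
cum_model = np.cumsum(y_valid[order])                  # cumulative Y using the model's ranking
cum_naive = np.arange(1, 101) * y_valid.mean()         # benchmark: ybar as the prediction

x = np.arange(1, 101)
plt.plot(x, cum_model, label="Cumulative Y (model ranking)")
plt.plot(x, cum_naive, linestyle="--", label="Cumulative Y (average ybar)")
plt.xlabel("# cases")
plt.ylabel("Cumulative Y")
plt.legend()
plt.show()
```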

XLMiner: Lift Chart

Decile-wise Lift Chart
- Order cases by predicted y-value from highest to lowest
- X-axis: % of cases from 10% to 100% (i.e., 1st decile to 10th decile)
- Y-axis: lift for each decile
- For each decile, the ratio of the sum of ŷ to the sum of ȳ is plotted
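A short sketch on synthetic data. It uses a closely related and common convention, the mean actual value in each decile (of cases ranked by ŷ) relative to the global mean; the data and the choice of actual rather than predicted values in the ratio are assumptions, not taken from the slide.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
y_valid = rng.gamma(shape=2.0, scale=10.0, size=200)   # synthetic actual values
y_pred  = y_valid + rng.normal(0.0, 5.0, size=200)     # synthetic model predictions

order = np.argsort(-y_pred)                            # rank by predicted value
deciles = np.array_split(y_valid[order], 10)           # 1st decile = top 10% of predictions
lift = [d.mean() / y_valid.mean() for d in deciles]    # decile mean relative to global mean

plt.bar(range(1, 11), lift)
plt.axhline(1.0, linestyle="--")                       # a ratio of 1 = no better than average
plt.xlabel("Decile")
plt.ylabel("Decile mean / global mean")
plt.show()
```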

XLMiner: Decile-wise Lift Chart

Accuracy Measures (Classification)

Misclassification error
- Error = classifying a record as belonging to one class when it belongs to another class
- Error rate = percent of misclassified records out of the total number of records in the validation data

Benchmark: Naïve Rule
- Naïve rule: classify all records as belonging to the most prevalent class
- Used only as a benchmark against which to evaluate more complex classifiers

Separation of Records
- "High separation of records" means that using the predictor variables attains a low error rate
- "Low separation of records" means that using the predictor variables does not improve much on the naïve rule

High Separation of Records

Low Separation of Records

Classification Confusion Matrix
- 201 1's correctly classified as "1" (true positives)
- 85 1's incorrectly classified as "0" (false negatives)
- 25 0's incorrectly classified as "1" (false positives)
- 2,689 0's correctly classified as "0" (true negatives)

Error Rate
- Overall error rate = (25 + 85)/3000 = 3.67%
- Accuracy = 1 - error rate = (201 + 2,689)/3000 = 96.33%
- With multiple classes, error rate = (sum of misclassified records)/(total records)

Classification Matrix: Meaning of each cell (rows = actual class, columns = predicted class)
- n(1,1) = number of C1 cases classified correctly as C1
- n(1,2) = number of C1 cases classified incorrectly as C2
- n(2,1) = number of C2 cases classified incorrectly as C1
- n(2,2) = number of C2 cases classified correctly as C2

Propensity
- Propensities are estimated probabilities that a case belongs to each of the classes
- They are used in two ways: to generate predicted class membership, and to rank-order cases by their probability of belonging to a particular class of interest

Propensity and Cutoff for Classification
Most data mining algorithms classify via a two-step process. For each case:
1. Compute the probability of belonging to the class of interest
2. Compare it to the cutoff value and classify accordingly
- The default cutoff value is 0.50: if the propensity is >= 0.50, classify as "1"; if < 0.50, classify as "0"
- Different cutoff values can be used
- Typically, the error rate is lowest for a cutoff of 0.50
(A minimal sketch of this two-step process follows.)
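The sketch below applies two cutoffs to hypothetical propensities (these numbers are illustrative, not the slide's cutoff table) and uses scikit-learn's confusion_matrix to tabulate the results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical propensities and actual classes for 14 validation records
propensity = np.array([0.96, 0.91, 0.87, 0.83, 0.79, 0.74, 0.66,
                       0.58, 0.51, 0.47, 0.39, 0.30, 0.21, 0.12])
actual     = np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0])

for cutoff in (0.50, 0.80):
    predicted = (propensity >= cutoff).astype(int)         # step 2: compare to cutoff
    cm = confusion_matrix(actual, predicted, labels=[1, 0])  # rows/cols ordered as class 1, class 0
    tp, fn, fp, tn = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
    error_rate = (fn + fp) / len(actual)
    print(f"cutoff={cutoff}: error rate={error_rate:.3f}")
    print(cm)
```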

Cutoff Table
- If the cutoff is 0.50: thirteen records are classified as "1"
- If the cutoff is 0.80: seven records are classified as "1"

Confusion Matrix for Different Cutoffs

When One Class is More Important
- In many cases it is more important to identify members of one class: tax fraud, credit default, response to a promotional offer, predicting delayed flights
- In such cases, overall accuracy is not a good measure for evaluating the classifier; use sensitivity and specificity instead

Sensitivity = proportion of the important class (C1) members correctly classified: n(1,1) / (n(1,1) + n(1,2))

Specificity = proportion of C2 members correctly classified: n(2,2) / (n(2,1) + n(2,2))

ROC Curve (Receiver Operating Characteristic curve): plots sensitivity against 1 - specificity over all possible cutoff values; the closer the curve is to the top-left corner, the better the classifier
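A minimal scikit-learn/matplotlib sketch, on hypothetical propensities, of how the curve is traced out by varying the cutoff.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical actual classes and propensities (illustrative only)
actual     = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
propensity = [0.9, 0.8, 0.7, 0.65, 0.5, 0.45, 0.4, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(actual, propensity)   # one (fpr, tpr) point per cutoff
auc = roc_auc_score(actual, propensity)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier")
plt.xlabel("1 - Specificity (false positive rate)")
plt.ylabel("Sensitivity (true positive rate)")
plt.legend()
plt.show()
```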

Asymmetric Misclassification Costs

Misclassification Costs May Differ
- The cost of making a misclassification error may be higher for one class than for the other(s)
- Looked at another way, the benefit of making a correct classification may be higher for one class than for the other(s)

Example – Response to Promotional Offer
- Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse)
- The "naïve rule" (classify everyone as "0") has an error rate of only 1% (seems good):

              Predict Class 0   Predict Class 1
  Actual 0         990                0
  Actual 1          10                0

The Confusion Matrix
- Suppose that using DM we can correctly classify eight 1's as 1's, at the cost of misclassifying twenty 0's as 1's and two 1's as 0's:

              Predict Class 0   Predict Class 1
  Actual 0         970               20
  Actual 1           2                8

- Error rate = (20 + 2)/1000 = 2.2% (higher than the naïve rate of 1%)

Introducing Costs & Benefits
Suppose:
- Profit from a "1" is $10
- Cost of sending an offer is $1
Then:
- Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit
- Under the DM predictions, 28 offers are sent:
  - 8 respond, with a profit of $10 each
  - 20 fail to respond, at a cost of $1 each
  - 972 receive nothing (no cost, no profit)
- Net profit = $80 - $20 = $60

Profit Matrix

              Predict Class 0   Predict Class 1
  Actual 0         $0               -$20
  Actual 1         $0                $80

Minimize Opportunity Costs
- As we see, it is best to convert everything to costs, as opposed to a mix of costs and benefits
- E.g., instead of "benefit from sale", refer to "opportunity cost of lost sale"
- This leads to the same decisions, but referring only to costs allows greater applicability

Cost Matrix with opportunity costs
Recall the original confusion matrix (profit from a "1" = $10, cost of sending an offer = $1):

              Predict Class 0   Predict Class 1
  Actual 0         970               20
  Actual 1           2                8

  Costs       Predict Class 0     Predict Class 1
  Actual 0    970 x $0 = $0       20 x $1 = $20
  Actual 1    2 x $10 = $20        8 x $1 = $8

Total opportunity cost = $20 + $20 + $8 = $48

Average misclassification cost
- Average misclassification cost per record = (q1 x n(1,2) + q2 x n(2,1)) / n, where q1 is the cost of misclassifying a C1 case and q2 the cost of misclassifying a C2 case
- The goal is a classifier that minimizes this quantity
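A small worked sketch of this formula using the promotional-offer example's two error cells only (the $48 total on the previous slide also counts the $1 mailing cost of the eight correctly targeted responders).

```python
# Promotional-offer example: 1000 records,
# q1 = $10 per missed responder, q2 = $1 per offer mailed to a non-responder
n = 1000
q1, n_1_2 = 10.0, 2    # cost and count of responders misclassified as "0"
q2, n_2_1 = 1.0, 20    # cost and count of non-responders misclassified as "1"

avg_cost = (q1 * n_1_2 + q2 * n_2_1) / n
print(f"average misclassification cost = ${avg_cost:.2f} per record")   # $0.04
```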

Generalize to Cost Ratio
- Sometimes actual costs and benefits are hard to estimate
- We still need to express everything in terms of costs (i.e., the cost of misclassification per record)
- A good practical substitute for individual costs is the ratio of misclassification costs (e.g., "misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms")

Multiple Classes
- For m classes, the confusion matrix has m rows and m columns
- Theoretically, there are m(m-1) misclassification costs, since a case from any of the m classes could be misclassified into any of the m-1 other classes
- Practically, this is too many to work with
- In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest

Judging Ranking Performance

Lift Chart for Binary Data
- Input: scored validation dataset, i.e., the actual class and the propensity (probability) of belonging to the class of interest (C1)
- Sort records in descending order of propensity to belong to the class of interest
- Compute the cumulative number of C1 members for each row
- The lift chart plots the row number (no. of records) on the x-axis and the cumulative number of C1 members on the y-axis
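A minimal sketch of this construction on a small hypothetical scored validation set (the propensities and classes below are illustrative, not the example table that follows).

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scored validation data: propensity of class "1" and actual class
propensity = np.array([0.99, 0.95, 0.93, 0.90, 0.85, 0.80, 0.74, 0.65,
                       0.60, 0.52, 0.48, 0.40, 0.33, 0.25, 0.18, 0.10])
actual     = np.array([1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0])

order = np.argsort(-propensity)            # 1. sort by propensity, descending
cum_ones = np.cumsum(actual[order])        # 2. cumulative number of C1 members
x = np.arange(1, len(actual) + 1)

plt.plot(x, cum_ones, label="Cumulative 1's (model ranking)")
plt.plot(x, x * actual.mean(), linestyle="--", label="Reference (average ranking)")
plt.xlabel("# records")
plt.ylabel("Cumulative number of class-1 members")
plt.legend()
plt.show()
```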

Lift Chart for Binary Data - Example: table of case number, propensity (predicted probability of belonging to class "1"), actual class, and cumulative count of actual class-1 cases

Lift Chart with Costs and Benefits
- Sort records in descending order of probability of success (success = belonging to the class of interest)
- For each record, compute the cost/benefit associated with its actual outcome
- Compute a column of cumulative cost/benefit
- Plot cumulative cost/benefit on the y-axis against the row number (no. of records) on the x-axis
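The same ranking with a per-record cost/benefit attached; the $10 / -$1 values echo the earlier promotional-offer example, and the records themselves are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

propensity = np.array([0.97, 0.90, 0.82, 0.75, 0.66, 0.58, 0.44, 0.35, 0.22, 0.11])
actual     = np.array([1,    1,    0,    1,    0,    1,    0,    0,    0,    0])

order = np.argsort(-propensity)                       # sort by probability of success
value = np.where(actual[order] == 1, 10.0, -1.0)      # cost/benefit of each actual outcome
cum_value = np.cumsum(value)                          # cumulative cost/benefit

x = np.arange(1, len(actual) + 1)
plt.plot(x, cum_value, label="Cumulative net benefit")
plt.plot([0, len(actual)], [0, cum_value[-1]], linestyle="--", label="Reference line")
plt.xlabel("# records")
plt.ylabel("Cumulative cost/benefit")
plt.legend()
plt.show()
```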

Lift Chart with cost/benefit - Example: table of case number, propensity (predicted probability of belonging to class "1"), actual class, cost/benefit, and cumulative cost/benefit

Lift Curve May Go Negative
- If the total net benefit from all cases is negative, the reference line will have a negative slope
- Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at a maximum

Negative slope to reference curve

Oversampling and Asymmetric Costs

Rare Cases
- Examples of rare but important classes: responder to a mailing, someone who commits fraud, debt defaulter
- We often oversample rare cases to give the model more information to work with
- Typically use 50% "1" and 50% "0" for training
- Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class

Example
The following graphs show the optimal classification under three scenarios:
1. Assuming equal costs of misclassification
2. Assuming that misclassifying "o" is five times the cost of misclassifying "x"
3. An oversampling scheme allowing DM methods to incorporate asymmetric costs

Classification: equal costs

Classification: Unequal costs Suppose that failing to catch “o” is 5 times as costly as failing to catch “x”.

Oversampling for asymmetric costs Oversample “o” to appropriately weight misclassification costs – without or with replacement

Equal number of responders: sample an equal number of responders and non-responders

An Oversampling Procedure
1. Separate the responders (rare class) from the non-responders
2. Randomly assign half the responders to the training sample, plus an equal number of non-responders
3. The remaining responders go to the validation sample
4. Add non-responders to the validation data, to maintain the original ratio of responders to non-responders
5. Randomly take a test set (if needed) from the validation data
(A code sketch of this procedure follows.)
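A rough pandas sketch of these five steps; the helper name `oversampled_split` and the column name `response` are hypothetical, not functions from XLMiner or SAS Enterprise Miner.

```python
import pandas as pd

def oversampled_split(df, target="response"):
    """Sketch of the slide's procedure: 50/50 training sample, validation keeps the original ratio."""
    ones = df[df[target] == 1]                                     # 1. separate responders (rare class)
    zeros = df[df[target] == 0]                                    #    from non-responders

    train_ones = ones.sample(n=len(ones) // 2, random_state=1)     # 2. half the responders to training,
    train_zeros = zeros.sample(n=len(train_ones), random_state=1)  #    plus an equal number of non-responders
    train = pd.concat([train_ones, train_zeros])

    valid_ones = ones.drop(train_ones.index)                       # 3. remaining responders to validation
    ratio = len(zeros) / len(ones)                                 # original non-responder : responder ratio
    n_valid_zeros = min(len(zeros) - len(train_zeros),
                        int(round(ratio * len(valid_ones))))       # 4. keep the original class ratio
    valid_zeros = zeros.drop(train_zeros.index).sample(n=n_valid_zeros, random_state=1)
    valid = pd.concat([valid_ones, valid_zeros])
    return train, valid                                            # 5. a test set could be split off valid
```

`train, valid = oversampled_split(df)` would then feed the usual model-fitting and scoring steps.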

Assessing model performance
1. Score the model with a validation dataset selected without oversampling, or
2. Score the model with an oversampled validation dataset and reweight the results to remove the effects of oversampling
Method 1 is straightforward and easier to implement.

Adjusting confusion matrix for oversampling

                 Whole data   Sample
  Responders         2%         50%
  Nonresponders     98%         50%

Example:
- One responder in the whole data = 50/2 = 25 in the sample
- One nonresponder in the whole data = 50/98 ≈ 0.51 in the sample

Adjusting confusion matrix for oversampling
Suppose that the confusion matrix computed on the oversampled validation dataset is as follows:

  Classification matrix, oversampled data (validation)
              Predicted 0   Predicted 1   Total
  Actual 0        390           110        500
  Actual 1         80           420        500
  Total           470           530       1000

- Misclassification rate = (110 + 80)/1000 = 0.19, or 19%
- Percentage of records predicted as "1" = 530/1000 = 0.53, or 53%

Adjusting confusion matrix for oversampling
Weight for a responder (Actual 1) = 25; weight for a nonresponder (Actual 0) = 50/98 ≈ 0.5102

  Classification matrix, reweighted
              Predicted 0            Predicted 1            Total
  Actual 0    390/0.5102 = 764.4     110/0.5102 = 215.6     980.0
  Actual 1    80/25 = 3.2            420/25 = 16.8           20.0
  Total       767.6                  232.4                 1000.0

- Misclassification rate = (215.6 + 3.2)/1000 = 0.219, or 21.9%
- Percentage of records predicted as "1" = 232.4/1000 = 0.2324, or 23.24%
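The same reweighting expressed as a short NumPy sketch, using the counts and weights from the slides above.

```python
import numpy as np

# Confusion matrix on the oversampled validation data (rows = actual 0/1, cols = predicted 0/1)
cm_oversampled = np.array([[390.0, 110.0],
                           [ 80.0, 420.0]])

w_responder    = 50 / 2      # responders: 2% of whole data -> 50% of sample
w_nonresponder = 50 / 98     # nonresponders: 98% of whole data -> 50% of sample

cm_reweighted = cm_oversampled.copy()
cm_reweighted[0, :] /= w_nonresponder   # scale the actual-0 row back to original proportions
cm_reweighted[1, :] /= w_responder      # scale the actual-1 row back to original proportions

n = cm_reweighted.sum()
error_rate = (cm_reweighted[0, 1] + cm_reweighted[1, 0]) / n
pct_predicted_1 = cm_reweighted[:, 1].sum() / n
print(cm_reweighted.round(1))
print(f"error rate = {error_rate:.3f}, predicted as '1' = {pct_predicted_1:.4f}")
```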

Adjusting Lift curve for Oversampling
- Sort records in descending order of probability of success (success = belonging to the class of interest)
- For each record, compute the cost/benefit associated with its actual outcome
- Divide that value by the oversampling rate (weight) of the record's actual class
- Compute a cumulative column of weighted cost/benefit
- Plot cumulative weighted cost/benefit on the y-axis against the row number (no. of records) on the x-axis

Adjusting Lift curve for Oversampling - Example
Suppose that the cost/benefit values and oversampling weights are given for each actual class. For each case, the example table lists: case no., propensity (predicted probability of belonging to class "1"), actual class, weighted cost/benefit, and cumulative weighted cost/benefit.

Adjusting Lift curve for Oversampling -- Example