1
Chapter 5 – Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel & Bruce
2
Why Evaluate?
Multiple methods are available to classify or predict.
For each method, multiple choices are available for settings.
To choose the best model, we need to assess each model's performance.
3
Types of outcome
Predicted numerical value: the outcome variable is numerical and continuous.
Predicted class membership: the outcome variable is categorical.
Propensity: the probability of class membership when the outcome variable is categorical.
4
Accuracy Measures (Continuous outcome)
5
Evaluating Predictive Performance
Not the same as goodness-of-fit (R², S_Y|X).
Predictive performance is measured on the validation dataset.
The benchmark is the average (ȳ) used as the prediction.
6
Prediction accuracy measures
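As a rough sketch of how such measures are typically computed on the validation data (assuming NumPy; the arrays are made-up illustrations, and the exact set of measures reported by XLMiner or SAS may differ):

```python
import numpy as np

# Illustrative (made-up) validation values and model predictions
y_valid = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred  = np.array([27.2, 23.1, 31.0, 30.5, 33.8])

def accuracy_measures(actual, predicted):
    err = actual - predicted                        # residuals
    return {
        "ME":   err.mean(),                         # mean error (bias)
        "MAE":  np.abs(err).mean(),                 # mean absolute error
        "RMSE": np.sqrt((err ** 2).mean()),         # root mean squared error
        "MAPE": 100 * np.abs(err / actual).mean(),  # mean absolute percentage error
    }

print(accuracy_measures(y_valid, y_pred))
# Benchmark: use an average as the prediction for every record
# (here the validation mean stands in for the training average)
print(accuracy_measures(y_valid, np.full_like(y_valid, y_valid.mean())))
```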
7
Excel Miner Example: Multiple Regression using Boston Housing dataset
8
Excel Miner Residuals for Training dataset Residuals for Validation dataset
9
Excel Miner Residuals for Training dataset Residuals for Validation dataset
10
SAS Enterprise Miner Example: Multiple Regression using Boston Housing dataset
11
SAS Enterprise Miner Residuals for Validation dataset
12
SAS Enterprise Miner Boxplot of residuals
13
Lift Chart
Order the predicted y-values from highest to lowest.
X-axis = cases from 1 to n.
Y-axis = cumulative predicted value of Y.
Two lines are plotted: one using the average ȳ as the predicted value, and one using ŷ found with the prediction model.
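A rough sketch of this construction, assuming NumPy and matplotlib with made-up actual and predicted values (one common variant cumulates the actual values after sorting the cases by the model's predictions):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical validation actuals and predictions
y_actual = np.array([50.0, 21.0, 34.0, 12.0, 44.0, 27.0, 19.0, 36.0])
y_pred   = np.array([48.0, 25.0, 30.0, 15.0, 41.0, 22.0, 20.0, 38.0])

order = np.argsort(-y_pred)                  # sort cases by predicted value, highest first
cases = np.arange(1, len(y_actual) + 1)
cum_model = np.cumsum(y_actual[order])       # cumulative Y using the model's ranking
cum_naive = cases * y_actual.mean()          # reference line: average used as the prediction

plt.plot(cases, cum_model, label="model")
plt.plot(cases, cum_naive, label="average (benchmark)")
plt.xlabel("# cases")
plt.ylabel("cumulative Y")
plt.legend()
plt.show()
```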
14
Excel Miner: Lift Chart
15
Decile-wise Lift Chart
Order the predicted y-values from highest to lowest.
X-axis = % of cases from 10% to 100%, i.e. 1st decile to 10th decile.
Y-axis = cumulative predicted value of Y.
For each decile, the ratio of the sum of ŷ to the sum of ȳ is plotted.
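A minimal sketch of the decile computation, assuming pandas and NumPy with a made-up validation set (column names and data are illustrative; one common construction compares each decile's mean outcome to the overall mean):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"actual": rng.gamma(5.0, 4.0, 100)})      # made-up outcome values
df["pred"] = df["actual"] + rng.normal(0.0, 3.0, 100)        # imperfect predictions

df = df.sort_values("pred", ascending=False).reset_index(drop=True)
df["decile"] = np.repeat(np.arange(1, 11), len(df) // 10)    # 1 = top 10% of predictions
decile_lift = df.groupby("decile")["actual"].mean() / df["actual"].mean()
print(decile_lift)   # values well above 1 in the early deciles indicate a useful ranking
```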
16
XLMINER: Decile-wise lift Chart
17
Accuracy Measures (Classification)
18
Misclassification error Error = classifying a record as belonging to one class when it belongs to another class. Error rate = percent of misclassified records out of the total records in the validation data
19
Benchmark: Naïve Rule
Naïve rule: classify all records as belonging to the most prevalent class.
Used only as a benchmark to evaluate more complex classifiers.
20
Separation of Records
"High separation of records" means that using the predictor variables attains low error.
"Low separation of records" means that using the predictor variables does not improve much on the naïve rule.
21
High Separation of Records
22
Low Separation of Records
23
Classification Confusion Matrix
201 1's correctly classified as "1" (true positives)
85 1's incorrectly classified as "0" (false negatives)
25 0's incorrectly classified as "1" (false positives)
2689 0's correctly classified as "0" (true negatives)
24
Error Rate
Overall error rate = (25 + 85)/3000 = 3.67%.
Accuracy = 1 – error rate = (201 + 2689)/3000 = 96.33%.
If there are multiple classes, the error rate is (sum of misclassified records)/(total records).
25
Classification Matrix: Meaning of Each Cell
Rows = actual class, columns = predicted class.
n1,1 = no. of C1 cases classified correctly as C1
n1,2 = no. of C1 cases classified incorrectly as C2
n2,1 = no. of C2 cases classified incorrectly as C1
n2,2 = no. of C2 cases classified correctly as C2
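A minimal sketch reproducing the earlier example with scikit-learn; the actual/predicted vectors are reconstructed from the counts above, and the rest is plain arithmetic:

```python
from sklearn.metrics import confusion_matrix

# Rebuild the earlier example: 201 true positives, 85 false negatives,
# 25 false positives, 2689 true negatives (3000 validation records in total)
actual    = [1] * 201 + [1] * 85 + [0] * 25 + [0] * 2689
predicted = [1] * 201 + [0] * 85 + [1] * 25 + [0] * 2689

cm = confusion_matrix(actual, predicted, labels=[1, 0])   # rows/cols ordered as class 1, class 0
print(cm)
error_rate = (cm[0, 1] + cm[1, 0]) / cm.sum()             # (85 + 25) / 3000
print(error_rate)                                         # 0.0367, i.e. accuracy of 96.33%
```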
26
Propensity
Propensities are estimated probabilities that a case belongs to each of the classes.
They are used in two ways:
to generate predicted class membership, and
to rank-order cases by probability of belonging to a particular class of interest.
27
Propensity and Cutoff for Classification
Most data mining algorithms classify via a 2-step process. For each case:
1. Compute the probability of belonging to the class of interest
2. Compare it to the cutoff value, and classify accordingly
The default cutoff value is 0.50: if the probability is >= 0.50, classify as "1"; if < 0.50, classify as "0".
Different cutoff values can be used; typically, the error rate is lowest for a cutoff of 0.50.
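A minimal sketch of cutoff-based classification, assuming NumPy; the propensity values are illustrative:

```python
import numpy as np

# Illustrative propensities (estimated probabilities of belonging to class "1")
propensity = np.array([0.996, 0.988, 0.763, 0.506, 0.471, 0.337, 0.218, 0.048])

def classify(p, cutoff=0.5):
    """Classify as "1" when the propensity is at or above the cutoff."""
    return (p >= cutoff).astype(int)

print(classify(propensity))              # default cutoff of 0.50
print(classify(propensity, cutoff=0.8))  # higher cutoff: fewer records classified as "1"
```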
28
Cutoff Table
If the cutoff is 0.50: thirteen records are classified as "1".
If the cutoff is 0.80: seven records are classified as "1".
29
Confusion Matrix for Different Cutoffs
31
When One Class is More Important
In many cases it is more important to identify members of one class: tax fraud, credit default, response to a promotional offer, predicting delayed flights.
In such cases, overall accuracy is not a good measure for evaluating the classifier; sensitivity and specificity are used instead.
32
Sensitivity
33
Specificity
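Using the standard definitions and the earlier confusion matrix (TP = 201, FN = 85, FP = 25, TN = 2689), a quick computation looks like this:

```python
# Sensitivity = share of actual 1's classified correctly
# Specificity = share of actual 0's classified correctly
TP, FN, FP, TN = 201, 85, 25, 2689

sensitivity = TP / (TP + FN)   # about 0.703
specificity = TN / (TN + FP)   # about 0.991
print(sensitivity, specificity)
```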
34
ROC Curve: Receiver Operating Characteristic Curve
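A brief sketch of plotting an ROC curve, assuming scikit-learn and matplotlib; the actual classes and propensities are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Made-up validation propensities and actual classes, for illustration only
actual     = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
propensity = np.array([0.96, 0.91, 0.84, 0.77, 0.71, 0.62, 0.51, 0.47, 0.34, 0.22, 0.15, 0.05])

fpr, tpr, _ = roc_curve(actual, propensity)        # one point per candidate cutoff
plt.plot(fpr, tpr, label=f"model (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("1 - specificity (false positive rate)")
plt.ylabel("sensitivity (true positive rate)")
plt.legend()
plt.show()
```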
35
Asymmetric Misclassification Costs
36
Misclassification Costs May Differ The cost of making a misclassification error may be higher for one class than the other(s) Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)
37
Example – Response to Promotional Offer
Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse).
The "naïve rule" (classify everyone as "0") has an error rate of 1% (seems good).

              Predict Class 0   Predict Class 1
Actual 0            990                0
Actual 1             10                0
38
The Confusion Matrix
Suppose that using DM we can correctly classify eight 1's as 1's, at the cost of misclassifying twenty 0's as 1's and two 1's as 0's.

              Predict Class 0   Predict Class 1
Actual 0            970               20
Actual 1              2                8

Error rate = (2 + 20)/1000 = 2.2% (higher than the naïve rate)
39
Introducing Costs & Benefits
Suppose: profit from a "1" is $10; cost of sending an offer is $1.
Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit.
Under the DM predictions, 28 offers are sent:
8 respond, with profit of $10 each
20 fail to respond, at a cost of $1 each
972 receive nothing (no cost, no profit)
Net profit = $80 - $20 = $60
40
Profit Matrix

              Predict Class 0   Predict Class 1
Actual 0            $0              -$20
Actual 1            $0               $80
41
Minimize Opportunity Costs
As we see, it is best to convert everything to costs, as opposed to a mix of costs and benefits.
E.g., instead of "benefit from sale" refer to "opportunity cost of lost sale".
This leads to the same decisions, but referring only to costs allows greater applicability.
42
Cost Matrix with Opportunity Costs
Recall the original confusion matrix (profit from a "1" = $10, cost of sending an offer = $1):

              Predict Class 0   Predict Class 1
Actual 0            970               20
Actual 1              2                8

Opportunity costs:

              Predict Class 0     Predict Class 1
Actual 0      970 x $0 = $0       20 x $1 = $20
Actual 1      2 x $10 = $20       8 x $1 = $8

Total opportunity cost = 0 + 20 + 20 + 8 = 48
43
Average misclassification cost
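The formula itself does not survive in this outline; a common definition (an assumption here, not necessarily the slide's exact notation) weights each type of error by its per-record cost:

```latex
\text{average misclassification cost} = \frac{q_1\, n_{1,2} + q_2\, n_{2,1}}{n}
```

where q1 is the cost of misclassifying a C1 case as C2, q2 the cost of misclassifying a C2 case as C1, and n the total number of records. Applied to the promotional example, counting only the two error cells gives (2 x $10 + 20 x $1)/1000 = $0.04 per record; dividing the slide's total opportunity cost by the record count gives 48/1000 = $0.048 per record.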
44
Generalize to Cost Ratio
Sometimes actual costs and benefits are hard to estimate.
Need to express everything in terms of costs (i.e., cost of misclassification per record).
A good practical substitute for individual costs is the ratio of misclassification costs (e.g., "misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms").
45
Multiple Classes
Theoretically, there are m(m-1) misclassification costs, since any case from one of the m classes could be misclassified into any one of the m-1 other classes.
Practically, that is too many to work with.
In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest.
For m classes, the confusion matrix has m rows and m columns.
46
Judging Ranking Performance
47
Lift Chart for Binary Data
Input: scored validation dataset, with the actual class and the propensity (probability) of belonging to the class of interest (C1).
Sort records in descending order of propensity to belong to the class of interest.
Compute the cumulative number of C1 members for each row.
The lift chart is the plot with row number (no. of records) on the x-axis and cumulative number of C1 members on the y-axis.
48
Lift Chart for Binary Data - Example

Case No.   Propensity (predicted probability of "1")   Actual class   Cumulative count of "1"
 1         0.9959767                                   1               1
 2         0.9875331                                   1               2
 3         0.9844564                                   1               3
 4         0.9804396                                   1               4
 5         0.9481164                                   1               5
 6         0.8892972                                   1               6
 7         0.8476319                                   1               7
 8         0.7628063                                   0               7
 9         0.7069919                                   1               8
10         0.6807541                                   1               9
11         0.6563437                                   1              10
12         0.6224195                                   0              10
13         0.5055069                                   1              11
14         0.4713405                                   0              11
15         0.3371174                                   0              11
16         0.2179678                                   1              12
17         0.1992404                                   0              12
18         0.1494827                                   0              12
19         0.0479626                                   0              12
20         0.0383414                                   0              12
21         0.0248510                                   0              12
22         0.0218060                                   0              12
23         0.0161299                                   0              12
24         0.0035600                                   0              12
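A short sketch of building this lift chart from the table above, assuming NumPy and matplotlib (the actual-class vector is taken directly from the example):

```python
import numpy as np
import matplotlib.pyplot as plt

# Actual classes from the table above, already sorted by descending propensity
actual = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

cases = np.arange(1, len(actual) + 1)
cumulative = np.cumsum(actual)          # cumulative number of class-1 members
baseline = cases * actual.mean()        # reference line: ordering the records at random

plt.plot(cases, cumulative, label="sorted by propensity")
plt.plot(cases, baseline, label="baseline (average)")
plt.xlabel("# cases")
plt.ylabel("cumulative number of class 1 members")
plt.legend()
plt.show()
```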
49
Lift Chart with Costs and Benefits
Sort records in descending order of probability of success (success = belonging to the class of interest).
For each record, compute the cost/benefit of the actual outcome.
Compute a column of cumulative cost/benefit.
Plot cumulative cost/benefit on the y-axis and row number (no. of records) on the x-axis.
50
Lift Chart with cost/benefit - Example

Case No.   Propensity   Actual class   Cost/Benefit   Cumulative cost/benefit
 1         0.9959767    1               10              10
 2         0.9875331    1               10              20
 3         0.9844564    1               10              30
 4         0.9804396    1               10              40
 5         0.9481164    1               10              50
 6         0.8892972    1               10              60
 7         0.8476319    1               10              70
 8         0.7628063    0               -1              69
 9         0.7069919    1               10              79
10         0.6807541    1               10              89
11         0.6563437    1               10              99
12         0.6224195    0               -1              98
13         0.5055069    1               10             108
14         0.4713405    0               -1             107
15         0.3371174    0               -1             106
16         0.2179678    1               10             116
17         0.1992404    0               -1             115
18         0.1494827    0               -1             114
19         0.0479626    0               -1             113
20         0.0383414    0               -1             112
21         0.0248510    0               -1             111
22         0.0218060    0               -1             110
23         0.0161299    0               -1             109
24         0.0035600    0               -1             108
51
Lift Curve May Go Negative
If the total net benefit from all cases is negative, the reference line will have a negative slope.
Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at a maximum.
52
Negative slope to reference curve
53
Oversampling and Asymmetric Costs
54
Rare Cases
Examples: responder to a mailing, someone who commits fraud, a debt defaulter.
Often we oversample rare cases to give the model more information to work with.
Typically use 50% "1" and 50% "0" for training.
Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class.
55
Example
The following graphs show optimal classification under three scenarios:
1. assuming equal costs of misclassification
2. assuming that misclassifying "o" is five times the cost of misclassifying "x"
3. an oversampling scheme allowing DM methods to incorporate asymmetric costs
56
Classification: equal costs
57
Classification: Unequal costs Suppose that failing to catch “o” is 5 times as costly as failing to catch “x”.
58
Oversampling for asymmetric costs Oversample “o” to appropriately weight misclassification costs – without or with replacement
59
Equal number of responders
Sample an equal number of responders and non-responders.
60
An Oversampling Procedure
1. Separate the responders (rare) from the non-responders.
2. Randomly assign half the responders to the training sample, plus an equal number of non-responders.
3. The remaining responders go to the validation sample.
4. Add non-responders to the validation data to maintain the original ratio of responders to non-responders.
5. Randomly take a test set (if needed) from the validation data.
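A rough sketch of steps 1-4 in pandas; the DataFrame, the "response" column name, and the 2% response rate are illustrative assumptions, not from the book:

```python
import numpy as np
import pandas as pd

def oversample_split(df, outcome="response", seed=1):
    """Train = half the responders plus an equal number of non-responders;
    validation keeps the original responder ratio."""
    responders    = df[df[outcome] == 1].sample(frac=1, random_state=seed)
    nonresponders = df[df[outcome] == 0].sample(frac=1, random_state=seed)

    n_half = len(responders) // 2
    train = pd.concat([responders.iloc[:n_half], nonresponders.iloc[:n_half]])

    # Remaining responders go to validation, topped up with enough
    # non-responders to restore the original responder/non-responder ratio
    valid_resp = responders.iloc[n_half:]
    n_valid_nonresp = int(round(len(valid_resp) * len(nonresponders) / len(responders)))
    valid = pd.concat([valid_resp, nonresponders.iloc[n_half:n_half + n_valid_nonresp]])
    return train, valid

# Tiny illustration: 2% responders in a 1,000-record dataset
df = pd.DataFrame({"x": np.arange(1000), "response": [1] * 20 + [0] * 980})
train, valid = oversample_split(df)
print(train["response"].mean(), valid["response"].mean())   # about 0.50 vs. about 0.02
```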
61
Assessing model performance
1. Score the model with a validation dataset selected without oversampling.
2. Score the model with an oversampled validation dataset and reweight the results to remove the effects of oversampling.
Method 1 is straightforward and easier to implement.
62
Adjusting the confusion matrix for oversampling

                 Whole data   Sample
Responders           2%         50%
Nonresponders       98%         50%

Example: one responder in the whole data = 50/2 = 25 in the sample; one nonresponder in the whole data = 50/98 = 0.5102 in the sample.
63
Adjusting the confusion matrix for oversampling
Suppose that the confusion matrix with the oversampled validation dataset is as follows:

Classification matrix, oversampled data (validation)
              Predicted 0   Predicted 1   Total
Actual 0          390           110        500
Actual 1           80           420        500
Total             470           530       1000

Misclassification rate = (80 + 110)/1000 = 0.19, or 19%
Percentage of records predicted as "1" = 530/1000 = 0.53, or 53%
64
Adjusting the confusion matrix for oversampling
Weight for a responder (Actual 1) = 25; weight for a nonresponder (Actual 0) = 0.5102.

Classification matrix, reweighted
              Predicted 0            Predicted 1           Total
Actual 0      390/0.5102 = 764.4     110/0.5102 = 215.6     980
Actual 1      80/25 = 3.2            420/25 = 16.8           20
Total         767.6                  232.4                 1000

Misclassification rate = (3.2 + 215.6)/1000 = 0.219, or 21.9%
Percentage of records predicted as "1" = 232.4/1000 = 0.2324, or 23.24%
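The same reweighting as a short NumPy computation (the matrix and weights are the ones above):

```python
import numpy as np

# Oversampled validation confusion matrix: rows = actual (0, 1), columns = predicted (0, 1)
cm_oversampled = np.array([[390.0, 110.0],
                           [ 80.0, 420.0]])

# Weight per row: 0.5102 for nonresponders (Actual 0), 25 for responders (Actual 1).
# Dividing each row by its weight converts sampled counts back to the whole-data scale.
weights = np.array([0.5102, 25.0])
cm_reweighted = cm_oversampled / weights[:, None]
print(cm_reweighted)            # approximately [[764.4, 215.6], [3.2, 16.8]]

misclass_rate = (cm_reweighted[0, 1] + cm_reweighted[1, 0]) / cm_reweighted.sum()
print(misclass_rate)            # about 0.219
```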
65
Adjusting Lift curve for Oversampling
Sort records in descending order of probability of success (success = belonging to the class of interest).
For each record, compute the cost/benefit of the actual outcome.
Divide the value by the oversampling rate of the actual class.
Compute a cumulative column of weighted cost/benefit.
Plot the cumulative weighted cost/benefit on the y-axis and row number (no. of records) on the x-axis.
66
Adjusting Lift curve for Oversampling - Example
Suppose that the cost/benefit and oversampling weights are as follows:

              Cost/Benefit   Oversampling weight
Actual 0          -3               0.6
Actual 1          50              20

Case No.   Propensity   Actual class   Weighted cost/benefit   Cumulative weighted cost/benefit
 1         0.9959767    1                2.50                    2.50
 2         0.9875331    1                2.50                    5.00
 3         0.9844564    1                2.50                    7.50
 4         0.9804396    1                2.50                   10.00
 5         0.9481164    1                2.50                   12.50
 6         0.8892972    1                2.50                   15.00
 7         0.8476319    1                2.50                   17.50
 8         0.7628063    0               -5.00                   12.50
 9         0.7069919    1                2.50                   15.00
10         0.6807541    1                2.50                   17.50
11         0.6563437    1                2.50                   20.00
12         0.6224195    0               -5.00                   15.00
13         0.5055069    1                2.50                   17.50
14         0.4713405    0               -5.00                   12.50
15         0.3371174    0               -5.00                    7.50
16         0.2179678    1                2.50                   10.00
17         0.1992404    0               -5.00                    5.00
18         0.1494827    0               -5.00                    0.00
19         0.0479626    0               -5.00                   -5.00
20         0.0383414    0               -5.00                  -10.00
21         0.0248510    0               -5.00                  -15.00
22         0.0218060    0               -5.00                  -20.00
23         0.0161299    0               -5.00                  -25.00
24         0.0035600    0               -5.00                  -30.00
67
Adjusting Lift curve for Oversampling -- Example