Chapter 5 – Evaluating Predictive Performance
Data Mining for Business Analytics
Shmueli, Patel & Bruce
Why Evaluate? Multiple methods are available to classify or predict For each method, multiple choices are available for settings To choose best model, need to assess each model’s performance
Types of outcome:
Predicted numerical value: outcome variable is numerical and continuous
Predicted class membership: outcome variable is categorical
Propensity: probability of class membership when the outcome variable is categorical
Accuracy Measures (Continuous outcome)
Evaluating Predictive Performance: Not the same as goodness-of-fit on the training data (e.g. R², s_Y/X, the standard error of estimate). Predictive performance is measured on the validation dataset. The benchmark is using the average (ȳ) as the prediction.
Prediction accuracy measures
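The individual measures did not carry over from this slide. The definitions below are the standard accuracy measures for a numerical outcome, computed on the validation set with e_i = y_i - ŷ_i; this is a reconstruction of the usual list, not necessarily the slide's exact content.

```latex
% Standard prediction accuracy measures on a validation set of n records,
% with e_i = y_i - \hat{y}_i the prediction error for record i.
\begin{aligned}
\text{MAE / MAD (mean absolute error)}        &= \tfrac{1}{n}\textstyle\sum_{i=1}^{n} |e_i| \\
\text{Average error}                          &= \tfrac{1}{n}\textstyle\sum_{i=1}^{n} e_i \\
\text{MAPE (mean absolute percentage error)}  &= \tfrac{100\%}{n}\textstyle\sum_{i=1}^{n} \left|\tfrac{e_i}{y_i}\right| \\
\text{RMSE (root mean squared error)}         &= \sqrt{\tfrac{1}{n}\textstyle\sum_{i=1}^{n} e_i^{2}} \\
\text{Total SSE}                              &= \textstyle\sum_{i=1}^{n} e_i^{2}
\end{aligned}
```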
XLMiner Example: Multiple Regression using the Boston Housing dataset
XLMiner: Residuals for the Training dataset and the Validation dataset
SAS Enterprise Miner Example: Multiple Regression using Boston Housing dataset
SAS Enterprise Miner Residuals for Validation dataset
SAS Enterprise Miner Boxplot of residuals
Lift Chart: Order cases by predicted y-value from highest to lowest. X-axis = cases from 1 to n. Y-axis = cumulative predicted value of Y. Two lines are plotted: one using ȳ (the average) as the predicted value, and one using ŷ from the prediction model.
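A minimal sketch (in Python, which is not the XLMiner/SAS workflow the slides use) of how the two lift-chart curves described above could be computed; the function name and arguments are illustrative.

```python
import numpy as np

def lift_chart_curves(y_actual, y_pred):
    """Cumulative curves for a lift chart with a numerical outcome: cases are
    ordered by predicted value (descending); one curve accumulates the model
    predictions y-hat, the other accumulates the benchmark prediction y-bar."""
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(-y_pred)                            # highest predicted value first
    model_curve = np.cumsum(y_pred[order])                 # cumulative y-hat
    benchmark_curve = np.arange(1, len(y_pred) + 1) * np.mean(y_actual)  # cumulative y-bar
    return model_curve, benchmark_curve
```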
XLMiner: Lift Chart
Decile-wise Lift Chart: Order cases by predicted y-value from highest to lowest. X-axis = % of cases from 10% to 100%, i.e. 1st decile to 10th decile. Y-axis = decile lift: for each decile, the ratio of the sum of ŷ to the sum of ȳ is plotted.
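A similar sketch for the decile-wise chart, following the slide's description (sum of ŷ in each decile over the corresponding sum of ȳ); names are illustrative.

```python
import numpy as np

def decile_lift(y_actual, y_pred, n_groups=10):
    """Decile-wise lift: sort cases by predicted value (descending), split them
    into ten roughly equal groups, and for each group return the ratio of the
    sum of y-hat to the sum the naive benchmark y-bar would give."""
    y_pred = np.asarray(y_pred, dtype=float)
    chunks = np.array_split(y_pred[np.argsort(-y_pred)], n_groups)
    y_bar = float(np.mean(y_actual))
    return [chunk.sum() / (len(chunk) * y_bar) for chunk in chunks]
```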
XLMiner: Decile-wise Lift Chart
Accuracy Measures (Classification)
Misclassification error Error = classifying a record as belonging to one class when it belongs to another class. Error rate = percent of misclassified records out of the total records in the validation data
Benchmark: Naïve Rule Naïve rule: classify all records as belonging to the most prevalent class Used only as benchmark to evaluate more complex classifiers
Separation of Records “High separation of records” means that using predictor variables attains low error “Low separation of records” means that using predictor variables does not improve much on naïve rule
High Separation of Records
Low Separation of Records
Classification Confusion Matrix: 201 1's correctly classified as "1" (true positives); 85 1's incorrectly classified as "0" (false negatives); 25 0's incorrectly classified as "1" (false positives); 2689 0's correctly classified as "0" (true negatives).
Error Rate: Overall error rate = (25 + 85)/3000 = 3.67%. Accuracy = 1 – error rate = (201 + 2689)/3000 = 96.33%. If there are multiple classes, error rate = (sum of misclassified records)/(total records).
Classification Matrix: Meaning of each cell (rows = actual class, columns = predicted class)
n1,1 = no. of C1 cases correctly classified as C1
n1,2 = no. of C1 cases incorrectly classified as C2
n2,1 = no. of C2 cases incorrectly classified as C1
n2,2 = no. of C2 cases correctly classified as C2
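A small sketch of computing these cells and the resulting error rate; the counts at the end are the 201/85/25/2689 example from the previous slides.

```python
import numpy as np

def confusion_counts(actual, predicted, positive=1):
    """Cell counts of the classification (confusion) matrix for a binary outcome."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    n11 = np.sum((actual == positive) & (predicted == positive))   # 1's classified as "1"
    n12 = np.sum((actual == positive) & (predicted != positive))   # 1's classified as "0"
    n21 = np.sum((actual != positive) & (predicted == positive))   # 0's classified as "1"
    n22 = np.sum((actual != positive) & (predicted != positive))   # 0's classified as "0"
    return n11, n12, n21, n22

n11, n12, n21, n22 = 201, 85, 25, 2689
error_rate = (n12 + n21) / (n11 + n12 + n21 + n22)   # (85 + 25) / 3000 = 3.67%
accuracy = 1 - error_rate                            # 96.33%
```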
Propensity: Propensities are estimated probabilities that a case belongs to each of the classes. They are used in two ways: to generate predicted class membership, and to rank-order cases by their probability of belonging to a particular class of interest.
Propensity and Cutoff for Classification: Most data mining algorithms classify via a 2-step process. For each case: 1. compute the probability of belonging to the class of interest; 2. compare it to a cutoff value and classify accordingly. The default cutoff value is 0.50: if propensity >= 0.50, classify as "1"; if < 0.50, classify as "0". Different cutoff values can be used; typically, the error rate is lowest for cutoff = 0.50.
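A minimal illustration of the cutoff step; the propensity values are made up.

```python
import numpy as np

def classify_by_cutoff(propensity, cutoff=0.5):
    """Compare each record's estimated probability of the class of interest
    ("1") to the cutoff: >= cutoff -> "1", otherwise "0"."""
    return (np.asarray(propensity) >= cutoff).astype(int)

propensity = np.array([0.95, 0.80, 0.72, 0.46, 0.15])   # illustrative propensities
print(classify_by_cutoff(propensity, 0.5))   # [1 1 1 0 0]
print(classify_by_cutoff(propensity, 0.8))   # [1 1 0 0 0]
```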
Cutoff Table If cutoff is 0.50: thirteen records are classified as “1” If cutoff is 0.80: seven records are classified as “1”
Confusion Matrix for Different Cutoffs
When One Class is More Important: In many cases it is more important to identify members of one class, e.g. tax fraud, credit default, response to a promotional offer, predicting delayed flights. In such cases, overall accuracy is not a good measure for evaluating the classifier; sensitivity and specificity are used instead.
Sensitivity = the proportion of C1 (important class) members correctly classified as C1 = n1,1 / (n1,1 + n1,2)
Specificity = the proportion of C2 members correctly classified as C2 = n2,2 / (n2,1 + n2,2)
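A small helper in the n1,1 ... n2,2 notation of the classification-matrix slide; the example numbers are the earlier 201/85/25/2689 counts.

```python
def sensitivity_specificity(n11, n12, n21, n22):
    """n11 = C1 correctly classified, n12 = C1 misclassified as C2,
    n21 = C2 misclassified as C1, n22 = C2 correctly classified,
    where C1 is the important class / class of interest."""
    sensitivity = n11 / (n11 + n12)   # share of C1 members that are caught
    specificity = n22 / (n21 + n22)   # share of C2 members correctly ruled out
    return sensitivity, specificity

print(sensitivity_specificity(201, 85, 25, 2689))   # approx (0.703, 0.991)
```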
ROC Curve: Receiver Operating Characteristic curve; plots sensitivity (y-axis) against 1 – specificity (x-axis) as the cutoff value varies.
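A sketch of plotting an ROC curve with scikit-learn and matplotlib, assuming those libraries are available (the slides themselves use XLMiner and SAS Enterprise Miner).

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc(actual, propensity):
    """ROC curve: sensitivity (true positive rate) on the y-axis against
    1 - specificity (false positive rate) on the x-axis, over all cutoffs."""
    fpr, tpr, _ = roc_curve(actual, propensity)
    plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(actual, propensity):.3f}")
    plt.plot([0, 1], [0, 1], linestyle="--", label="random benchmark")
    plt.xlabel("1 - specificity")
    plt.ylabel("sensitivity")
    plt.legend()
    plt.show()
```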
Asymmetric Misclassification Costs
Misclassification Costs May Differ The cost of making a misclassification error may be higher for one class than the other(s) Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)
Example – Response to Promotional Offer: Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse). The naïve rule (classify everyone as "0") has an error rate of only 1%, which seems good:
             Predict Class 0   Predict Class 1
Actual 0          990                0
Actual 1           10                0
The Confusion Matrix: Suppose that using data mining we can correctly classify eight 1's as 1's, at the cost of misclassifying twenty 0's as 1's and two 1's as 0's:
             Predict Class 0   Predict Class 1
Actual 0          970               20
Actual 1            2                8
Error rate = (2 + 20)/1000 = 2.2% (higher than the naïve rate of 1%).
Introducing Costs & Benefits Suppose: Profit from a “1” is $10 Cost of sending offer is $1 Then: Under naïve rule, all are classified as “0”, so no offers are sent: no cost, no profit Under DM predictions, 28 offers are sent. 8 respond with profit of $10 each 20 fail to respond, cost $1 each 972 receive nothing (no cost, no profit) Net profit = $80 - $20 = $60
Profit Matrix
             Predict Class 0   Predict Class 1
Actual 0          $0               -$20
Actual 1          $0                $80
Minimize Opportunity costs As we see, best to convert everything to costs, as opposed to a mix of costs and benefits E.g., instead of “benefit from sale” refer to “opportunity cost of lost sale” Leads to same decisions, but referring only to costs allows greater applicability
Cost Matrix with opportunity costs: Recall the original confusion matrix (profit from a "1" = $10, cost of sending an offer = $1):
             Predict Class 0   Predict Class 1
Actual 0          970               20
Actual 1            2                8
Opportunity costs:
             Predict Class 0       Predict Class 1
Actual 0     970 x $0 = $0         20 x $1 = $20
Actual 1     2 x $10 = $20         8 x $1 = $8
Total opportunity cost = $20 + $20 + $8 = $48
Average misclassification cost: the total cost from the cost matrix divided by the total number of records; a single per-record figure for comparing classifiers.
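The slide's formula did not carry over; below is the standard definition in the n i,j notation used earlier (q1 = per-record cost of misclassifying a C1 case, q2 = per-record cost of misclassifying a C2 case), with the promotional-offer numbers counted over the two misclassification cells only. Treat it as a reconstruction rather than the slide's exact content.

```latex
\text{average misclassification cost} \;=\; \frac{q_{1}\, n_{1,2} + q_{2}\, n_{2,1}}{n}
% Promotional-offer example, using only the two off-diagonal
% (misclassification) cells of the cost matrix:
\qquad \frac{2 \times \$10 \;+\; 20 \times \$1}{1000} \;=\; \$0.04 \text{ per record}
```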
Generalize to Cost Ratio Sometimes actual costs and benefits are hard to estimate Need to express everything in terms of costs (i.e., cost of misclassification per record) A good practical substitute for individual costs is the ratio of misclassification costs (e.g., “misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms”)
Multiple Classes Theoretically, there are m(m-1) misclassification costs, since any case from one of the m classes could be misclassified into any one of the m-1 other classes Practically too many to work with In decision-making context, though, such complexity rarely arises – one class is usually of primary interest For m classes, confusion matrix has m rows and m columns
Judging Ranking Performance
Lift Chart for Binary Data: Input = scored validation dataset containing the actual class and the propensity (probability) of belonging to the class of interest (C1). Sort records in descending order of propensity to belong to the class of interest. Compute the cumulative number of C1 members at each row. The lift chart plots the row number (no. of records) on the x-axis and the cumulative number of C1 members on the y-axis.
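A sketch of building this lift-chart input with pandas; column names are illustrative.

```python
import numpy as np
import pandas as pd

def cumulative_gains(actual, propensity):
    """Lift chart input for a binary outcome: sort records by propensity for the
    class of interest (descending) and accumulate the number of actual C1 members."""
    df = pd.DataFrame({"actual": actual, "propensity": propensity})
    df = df.sort_values("propensity", ascending=False).reset_index(drop=True)
    df["n_records"] = np.arange(1, len(df) + 1)    # x-axis of the lift chart
    df["cumulative_1s"] = df["actual"].cumsum()    # y-axis of the lift chart
    return df
```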
Lift Chart for Binary Data - Example: table with columns Case No., Propensity (predicted probability of belonging to class "1"), Actual class, and Cumulative actual classes.
Lift Chart with Costs and Benefits: Sort records in descending order of probability of success (success = belonging to the class of interest). For each record, compute the cost/benefit associated with its actual outcome. Compute a column of cumulative cost/benefit. Plot the cumulative cost/benefit on the y-axis against the row number (no. of records) on the x-axis.
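The same idea with a cost/benefit column; the $10 benefit per responder and $1 cost per nonresponder mirror the promotional-offer example, but the function and its defaults are only a sketch.

```python
import numpy as np
import pandas as pd

def cost_benefit_lift(actual, propensity, benefit_1=10.0, cost_0=-1.0):
    """Cumulative cost/benefit curve: records sorted by propensity (descending);
    each record contributes benefit_1 if its actual class is "1" and cost_0 otherwise."""
    df = pd.DataFrame({"actual": actual, "propensity": propensity})
    df = df.sort_values("propensity", ascending=False).reset_index(drop=True)
    df["cost_benefit"] = np.where(df["actual"] == 1, benefit_1, cost_0)
    df["cumulative"] = df["cost_benefit"].cumsum()   # y-axis of the chart
    return df
```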
Lift Chart with cost/benefit - Example: table with columns Case No., Propensity (predicted probability of belonging to class "1"), Actual class, Cost/Benefit, and Cumulative cost/benefit.
Lift Curve May Go Negative If total net benefit from all cases is negative, reference line will have negative slope Nonetheless, goal is still to use cutoff to select the point where net benefit is at a maximum
Negative slope to reference curve
Oversampling and Asymmetric Costs
Rare Cases: e.g. a responder to a mailing, someone who commits fraud, a debt defaulter. Often we oversample rare cases to give the model more information to work with; typically we use 50% "1" and 50% "0" for training. Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class.
Example: The following graphs show optimal classification under three scenarios: (1) assuming equal costs of misclassification; (2) assuming that misclassifying "o" is five times as costly as misclassifying "x"; (3) an oversampling scheme allowing data mining methods to incorporate asymmetric costs.
Classification: equal costs
Classification: Unequal costs Suppose that failing to catch “o” is 5 times as costly as failing to catch “x”.
Oversampling for asymmetric costs: Oversample "o" to appropriately weight the misclassification costs, with or without replacement.
Equal number of responders: Sample an equal number of responders and non-responders.
An Oversampling Procedure: 1. Separate the responders (rare) from the non-responders. 2. Randomly assign half the responders to the training sample, plus an equal number of non-responders. 3. The remaining responders go to the validation sample. 4. Add non-responders to the validation data, to maintain the original ratio of responders to non-responders. 5. Randomly take a test set (if needed) from the validation data.
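A rough pandas sketch of steps 1-4; the outcome column name "response" and the 50/50 split are assumptions based on the procedure above, not code from the text.

```python
import pandas as pd

def oversampled_partition(df, outcome="response", seed=1):
    """Training set: half of the responders plus an equal number of nonresponders.
    Validation set: the remaining responders plus enough nonresponders to restore
    the original responder/nonresponder ratio."""
    ones = df[df[outcome] == 1]
    zeros = df[df[outcome] == 0]
    ratio = len(zeros) / len(ones)                      # original 0-to-1 ratio

    train_ones = ones.sample(frac=0.5, random_state=seed)
    valid_ones = ones.drop(train_ones.index)
    train_zeros = zeros.sample(n=len(train_ones), random_state=seed)
    remaining_zeros = zeros.drop(train_zeros.index)
    n_valid_zeros = min(len(remaining_zeros), int(len(valid_ones) * ratio))
    valid_zeros = remaining_zeros.sample(n=n_valid_zeros, random_state=seed)

    train = pd.concat([train_ones, train_zeros])
    valid = pd.concat([valid_ones, valid_zeros])
    return train, valid
```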
Assessing model performance: 1. Score the model with a validation dataset selected without oversampling. 2. Score the model with an oversampled validation dataset and reweight the results to remove the effects of oversampling. Method 1 is straightforward and easier to implement.
Adjusting the confusion matrix for oversampling:
               Whole data   Sample
Responders         2%         50%
Nonresponders     98%         50%
Example: one responder in the whole data corresponds to 50/2 = 25 responders in the sample; one nonresponder in the whole data corresponds to 50/98 ≈ 0.51 nonresponders in the sample.
Adjusting the confusion matrix for oversampling: Suppose that the confusion matrix with the oversampled validation dataset is as follows:
Classification matrix, Oversampled data (Validation)
               Predicted 0   Predicted 1   Total
Actual 0           390           110        500
Actual 1            80           420        500
Total              470           530       1000
Misclassification rate = (110 + 80)/1000 = 0.19, or 19%
Percentage of records predicted as "1" = 530/1000 = 0.53, or 53%
Adjusting the confusion matrix for oversampling: Reweight each cell by the oversampling weight of its actual class. Weight for a responder (Actual 1) = 25; weight for a nonresponder (Actual 0) = 50/98 ≈ 0.5102.
Classification matrix, Reweighted
               Predicted 0            Predicted 1            Total
Actual 0       390/0.5102 = 764.4     110/0.5102 = 215.6      980
Actual 1       80/25 = 3.2            420/25 = 16.8            20
Total          767.6                  232.4                  1000
Misclassification rate = (215.6 + 3.2)/1000 = 0.2188, or 21.9%
Percentage of records predicted as "1" = 232.4/1000 = 0.2324, or 23.24%
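A sketch of the reweighting calculation using the matrix and weights from this example.

```python
import numpy as np

def reweight_confusion_matrix(cm, weight_1=25.0, weight_0=50/98):
    """Adjust an oversampled validation confusion matrix back toward the original
    class proportions: rows = actual class (0 then 1), columns = predicted class
    (0 then 1). Actual-1 counts are divided by the responder weight (25) and
    Actual-0 counts by the nonresponder weight (50/98, about 0.51)."""
    adjusted = np.asarray(cm, dtype=float).copy()
    adjusted[0, :] /= weight_0     # nonresponders
    adjusted[1, :] /= weight_1     # responders
    return adjusted

cm = np.array([[390, 110],
               [ 80, 420]])
adj = reweight_confusion_matrix(cm)
misclassification_rate = (adj[0, 1] + adj[1, 0]) / adj.sum()   # about 0.219
```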
Adjusting Lift curve for Oversampling Sort records in descending order of probability of success (success = belonging to class of interest) For each record compute cost/benefit with actual outcome Divide the value by the oversampling rate of the actual class Compute a cumulative column of weighted cost/benefit Plot the cumulative weighted cost/benefit on y-axis and row number (no. of records) as the x-axis
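A sketch of the adjusted lift computation; column names and the 25 and 50/98 weights follow the example on the surrounding slides.

```python
import numpy as np
import pandas as pd

def weighted_cost_benefit_lift(actual, propensity, cost_benefit,
                               weight_1=25.0, weight_0=50/98):
    """Lift curve adjusted for oversampling: each record's cost/benefit is divided
    by the oversampling weight of its actual class before accumulating."""
    df = pd.DataFrame({"actual": actual, "propensity": propensity,
                       "cost_benefit": cost_benefit})
    df = df.sort_values("propensity", ascending=False).reset_index(drop=True)
    weights = np.where(df["actual"] == 1, weight_1, weight_0)
    df["weighted"] = df["cost_benefit"] / weights
    df["cumulative_weighted"] = df["weighted"].cumsum()   # y-axis of the adjusted chart
    return df
```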
Adjusting Lift curve for Oversampling -- Example: Suppose that cost/benefit values and oversampling weights are given for Actual 0 and Actual 1; the example table has columns Case No., Propensity (predicted probability of belonging to class "1"), Actual class, Weighted cost/benefit, and Cumulative weighted cost/benefit.
Adjusting Lift curve for Oversampling -- Example