1
Chapter 5 – Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel & Bruce
2
Why Evaluate?
Multiple methods are available to classify or predict.
For each method, multiple choices are available for settings.
To choose the best model, we need to assess each model's performance.
3
Types of outcome
Predicted numerical value: the outcome variable is numerical and continuous.
Predicted class membership: the outcome variable is categorical.
Propensity: the probability of class membership when the outcome variable is categorical.
4
Accuracy Measures (Continuous outcome)
5
Evaluating Predictive Performance
Not the same as goodness-of-fit (R², S_Y|X).
Predictive performance is measured on the validation dataset.
The benchmark is the average (ȳ) used as the prediction.
6
Prediction accuracy measures
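As a rough sketch of how such measures are typically computed on the validation data (assuming NumPy; the arrays are made-up illustrations, and the exact set of measures reported by XLMiner or SAS may differ):

```python
import numpy as np

# Illustrative (made-up) validation values and model predictions
y_valid = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred  = np.array([27.2, 23.1, 31.0, 30.5, 33.8])

def accuracy_measures(actual, predicted):
    err = actual - predicted                        # residuals
    return {
        "ME":   err.mean(),                         # mean error (bias)
        "MAE":  np.abs(err).mean(),                 # mean absolute error
        "RMSE": np.sqrt((err ** 2).mean()),         # root mean squared error
        "MAPE": 100 * np.abs(err / actual).mean(),  # mean absolute percentage error
    }

print(accuracy_measures(y_valid, y_pred))
# Benchmark: use an average as the prediction for every record
# (here the validation mean stands in for the training average)
print(accuracy_measures(y_valid, np.full_like(y_valid, y_valid.mean())))
```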
7
Excel Miner Example: Multiple Regression using Boston Housing dataset
8
Excel Miner Residuals for Training dataset Residuals for Validation dataset
9
Excel Miner Residuals for Training dataset Residuals for Validation dataset
10
SAS Enterprise Miner Example: Multiple Regression using Boston Housing dataset
11
SAS Enterprise Miner Residuals for Validation dataset
12
SAS Enterprise Miner Boxplot of residuals
13
Lift Chart
Order the predicted y-values from highest to lowest.
X-axis = cases from 1 to n.
Y-axis = cumulative predicted value of Y.
Two lines are plotted: one using the average ȳ as the predicted value, and one using ŷ found with the prediction model.
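A rough sketch of this construction, assuming NumPy and matplotlib with made-up actual and predicted values (one common variant cumulates the actual values after sorting the cases by the model's predictions):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical validation actuals and predictions
y_actual = np.array([50.0, 21.0, 34.0, 12.0, 44.0, 27.0, 19.0, 36.0])
y_pred   = np.array([48.0, 25.0, 30.0, 15.0, 41.0, 22.0, 20.0, 38.0])

order = np.argsort(-y_pred)                  # sort cases by predicted value, highest first
cases = np.arange(1, len(y_actual) + 1)
cum_model = np.cumsum(y_actual[order])       # cumulative Y using the model's ranking
cum_naive = cases * y_actual.mean()          # reference line: average used as the prediction

plt.plot(cases, cum_model, label="model")
plt.plot(cases, cum_naive, label="average (benchmark)")
plt.xlabel("# cases")
plt.ylabel("cumulative Y")
plt.legend()
plt.show()
```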
14
Excel Miner: Lift Chart
15
Decile-wise Lift Chart
Order the predicted y-values from highest to lowest.
X-axis = % of cases from 10% to 100%, i.e. 1st decile to 10th decile.
Y-axis = cumulative predicted value of Y.
For each decile, the ratio of the sum of ŷ to the sum of ȳ is plotted.
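A minimal sketch of the decile computation, assuming pandas and NumPy with a made-up validation set (column names and data are illustrative; one common construction compares each decile's mean outcome to the overall mean):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"actual": rng.gamma(5.0, 4.0, 100)})      # made-up outcome values
df["pred"] = df["actual"] + rng.normal(0.0, 3.0, 100)        # imperfect predictions

df = df.sort_values("pred", ascending=False).reset_index(drop=True)
df["decile"] = np.repeat(np.arange(1, 11), len(df) // 10)    # 1 = top 10% of predictions
decile_lift = df.groupby("decile")["actual"].mean() / df["actual"].mean()
print(decile_lift)   # values well above 1 in the early deciles indicate a useful ranking
```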
16
XLMINER: Decile-wise lift Chart
17
Accuracy Measures (Classification)
18
Misclassification error Error = classifying a record as belonging to one class when it belongs to another class. Error rate = percent of misclassified records out of the total records in the validation data
19
Benchmark: Naïve Rule
Naïve rule: classify all records as belonging to the most prevalent class.
Used only as a benchmark to evaluate more complex classifiers.
20
Separation of Records
"High separation of records" means that using the predictor variables attains low error.
"Low separation of records" means that using the predictor variables does not improve much on the naïve rule.
21
High Separation of Records
22
Low Separation of Records
23
Classification Confusion Matrix
201 1's correctly classified as "1" (true positives)
85 1's incorrectly classified as "0" (false negatives)
25 0's incorrectly classified as "1" (false positives)
2689 0's correctly classified as "0" (true negatives)
24
Error Rate
Overall error rate = (25 + 85)/3000 = 3.67%.
Accuracy = 1 – error rate = (201 + 2689)/3000 = 96.33%.
If there are multiple classes, the error rate is (sum of misclassified records)/(total records).
25
Classification Matrix: Meaning of Each Cell
Rows = actual class, columns = predicted class.
n1,1 = no. of C1 cases classified correctly as C1
n1,2 = no. of C1 cases classified incorrectly as C2
n2,1 = no. of C2 cases classified incorrectly as C1
n2,2 = no. of C2 cases classified correctly as C2
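A minimal sketch reproducing the earlier example with scikit-learn; the actual/predicted vectors are reconstructed from the counts above, and the rest is plain arithmetic:

```python
from sklearn.metrics import confusion_matrix

# Rebuild the earlier example: 201 true positives, 85 false negatives,
# 25 false positives, 2689 true negatives (3000 validation records in total)
actual    = [1] * 201 + [1] * 85 + [0] * 25 + [0] * 2689
predicted = [1] * 201 + [0] * 85 + [1] * 25 + [0] * 2689

cm = confusion_matrix(actual, predicted, labels=[1, 0])   # rows/cols ordered as class 1, class 0
print(cm)
error_rate = (cm[0, 1] + cm[1, 0]) / cm.sum()             # (85 + 25) / 3000
print(error_rate)                                         # 0.0367, i.e. accuracy of 96.33%
```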
26
Propensity
Propensities are estimated probabilities that a case belongs to each of the classes.
They are used in two ways:
to generate predicted class membership, and
to rank-order cases by probability of belonging to a particular class of interest.
27
Propensity and Cutoff for Classification
Most data mining algorithms classify via a 2-step process. For each case:
1. Compute the probability of belonging to the class of interest
2. Compare it to the cutoff value, and classify accordingly
The default cutoff value is 0.50: if the probability is >= 0.50, classify as "1"; if < 0.50, classify as "0".
Different cutoff values can be used; typically, the error rate is lowest for a cutoff of 0.50.
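A minimal sketch of cutoff-based classification, assuming NumPy; the propensity values are illustrative:

```python
import numpy as np

# Illustrative propensities (estimated probabilities of belonging to class "1")
propensity = np.array([0.996, 0.988, 0.763, 0.506, 0.471, 0.337, 0.218, 0.048])

def classify(p, cutoff=0.5):
    """Classify as "1" when the propensity is at or above the cutoff."""
    return (p >= cutoff).astype(int)

print(classify(propensity))              # default cutoff of 0.50
print(classify(propensity, cutoff=0.8))  # higher cutoff: fewer records classified as "1"
```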
28
Cutoff Table
If the cutoff is 0.50: thirteen records are classified as "1".
If the cutoff is 0.80: seven records are classified as "1".
29
Confusion Matrix for Different Cutoffs
31
When One Class is More Important
In many cases it is more important to identify members of one class: tax fraud, credit default, response to a promotional offer, predicting delayed flights.
In such cases, overall accuracy is not a good measure for evaluating the classifier; sensitivity and specificity are used instead.
32
Sensitivity
33
Specificity
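Using the standard definitions and the earlier confusion matrix (TP = 201, FN = 85, FP = 25, TN = 2689), a quick computation looks like this:

```python
# Sensitivity = share of actual 1's classified correctly
# Specificity = share of actual 0's classified correctly
TP, FN, FP, TN = 201, 85, 25, 2689

sensitivity = TP / (TP + FN)   # about 0.703
specificity = TN / (TN + FP)   # about 0.991
print(sensitivity, specificity)
```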
34
ROC Curve: Receiver Operating Characteristic Curve
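A brief sketch of plotting an ROC curve, assuming scikit-learn and matplotlib; the actual classes and propensities are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Made-up validation propensities and actual classes, for illustration only
actual     = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
propensity = np.array([0.96, 0.91, 0.84, 0.77, 0.71, 0.62, 0.51, 0.47, 0.34, 0.22, 0.15, 0.05])

fpr, tpr, _ = roc_curve(actual, propensity)        # one point per candidate cutoff
plt.plot(fpr, tpr, label=f"model (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("1 - specificity (false positive rate)")
plt.ylabel("sensitivity (true positive rate)")
plt.legend()
plt.show()
```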
35
Asymmetric Misclassification Costs
36
Misclassification Costs May Differ The cost of making a misclassification error may be higher for one class than the other(s) Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)
37
Example – Response to Promotional Offer
Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse).
The "naïve rule" (classify everyone as "0") has an error rate of 1% (seems good).

              Predict Class 0   Predict Class 1
Actual 0            990                0
Actual 1             10                0
38
The Confusion Matrix
Suppose that using DM we can correctly classify eight 1's as 1's, at the cost of misclassifying twenty 0's as 1's and two 1's as 0's.

              Predict Class 0   Predict Class 1
Actual 0            970               20
Actual 1              2                8

Error rate = (2 + 20)/1000 = 2.2% (higher than the naïve rate)
39
Introducing Costs & Benefits
Suppose: profit from a "1" is $10; cost of sending an offer is $1.
Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit.
Under the DM predictions, 28 offers are sent:
8 respond, with profit of $10 each
20 fail to respond, at a cost of $1 each
972 receive nothing (no cost, no profit)
Net profit = $80 - $20 = $60
40
Profit Matrix

              Predict Class 0   Predict Class 1
Actual 0            $0              -$20
Actual 1            $0               $80
41
Minimize Opportunity Costs
As we see, it is best to convert everything to costs, as opposed to a mix of costs and benefits.
E.g., instead of "benefit from sale" refer to "opportunity cost of lost sale".
This leads to the same decisions, but referring only to costs allows greater applicability.
42
Cost Matrix with Opportunity Costs
Recall the original confusion matrix (profit from a "1" = $10, cost of sending an offer = $1):

              Predict Class 0   Predict Class 1
Actual 0            970               20
Actual 1              2                8

Opportunity costs:

              Predict Class 0     Predict Class 1
Actual 0      970 x $0 = $0       20 x $1 = $20
Actual 1      2 x $10 = $20       8 x $1 = $8

Total opportunity cost = 0 + 20 + 20 + 8 = 48
43
Average misclassification cost
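The formula itself does not survive in this outline; a common definition (an assumption here, not necessarily the slide's exact notation) weights each type of error by its per-record cost:

```latex
\text{average misclassification cost} = \frac{q_1\, n_{1,2} + q_2\, n_{2,1}}{n}
```

where q1 is the cost of misclassifying a C1 case as C2, q2 the cost of misclassifying a C2 case as C1, and n the total number of records. Applied to the promotional example, counting only the two error cells gives (2 x $10 + 20 x $1)/1000 = $0.04 per record; dividing the slide's total opportunity cost by the record count gives 48/1000 = $0.048 per record.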
44
Generalize to Cost Ratio
Sometimes actual costs and benefits are hard to estimate.
Need to express everything in terms of costs (i.e., cost of misclassification per record).
A good practical substitute for individual costs is the ratio of misclassification costs (e.g., "misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms").
45
Multiple Classes
Theoretically, there are m(m-1) misclassification costs, since any case from one of the m classes could be misclassified into any one of the m-1 other classes.
Practically, that is too many to work with.
In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest.
For m classes, the confusion matrix has m rows and m columns.
46
Judging Ranking Performance
47
Lift Chart for Binary Data
Input: scored validation dataset, with the actual class and the propensity (probability) of belonging to the class of interest (C1).
Sort records in descending order of propensity to belong to the class of interest.
Compute the cumulative number of C1 members for each row.
The lift chart is the plot with row number (no. of records) on the x-axis and cumulative number of C1 members on the y-axis.
48
Lift Chart for Binary Data - Example

Case No.   Propensity (predicted probability of "1")   Actual class   Cumulative count of "1"
 1         0.9959767                                   1               1
 2         0.9875331                                   1               2
 3         0.9844564                                   1               3
 4         0.9804396                                   1               4
 5         0.9481164                                   1               5
 6         0.8892972                                   1               6
 7         0.8476319                                   1               7
 8         0.7628063                                   0               7
 9         0.7069919                                   1               8
10         0.6807541                                   1               9
11         0.6563437                                   1              10
12         0.6224195                                   0              10
13         0.5055069                                   1              11
14         0.4713405                                   0              11
15         0.3371174                                   0              11
16         0.2179678                                   1              12
17         0.1992404                                   0              12
18         0.1494827                                   0              12
19         0.0479626                                   0              12
20         0.0383414                                   0              12
21         0.0248510                                   0              12
22         0.0218060                                   0              12
23         0.0161299                                   0              12
24         0.0035600                                   0              12
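A short sketch of building this lift chart from the table above, assuming NumPy and matplotlib (the actual-class vector is taken directly from the example):

```python
import numpy as np
import matplotlib.pyplot as plt

# Actual classes from the table above, already sorted by descending propensity
actual = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

cases = np.arange(1, len(actual) + 1)
cumulative = np.cumsum(actual)          # cumulative number of class-1 members
baseline = cases * actual.mean()        # reference line: ordering the records at random

plt.plot(cases, cumulative, label="sorted by propensity")
plt.plot(cases, baseline, label="baseline (average)")
plt.xlabel("# cases")
plt.ylabel("cumulative number of class 1 members")
plt.legend()
plt.show()
```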
49
Lift Chart with Costs and Benefits
Sort records in descending order of probability of success (success = belonging to the class of interest).
For each record, compute the cost/benefit of the actual outcome.
Compute a column of cumulative cost/benefit.
Plot cumulative cost/benefit on the y-axis and row number (no. of records) on the x-axis.
50
Lift Chart with cost/benefit - Example

Case No.   Propensity   Actual class   Cost/Benefit   Cumulative cost/benefit
 1         0.9959767    1               10              10
 2         0.9875331    1               10              20
 3         0.9844564    1               10              30
 4         0.9804396    1               10              40
 5         0.9481164    1               10              50
 6         0.8892972    1               10              60
 7         0.8476319    1               10              70
 8         0.7628063    0               -1              69
 9         0.7069919    1               10              79
10         0.6807541    1               10              89
11         0.6563437    1               10              99
12         0.6224195    0               -1              98
13         0.5055069    1               10             108
14         0.4713405    0               -1             107
15         0.3371174    0               -1             106
16         0.2179678    1               10             116
17         0.1992404    0               -1             115
18         0.1494827    0               -1             114
19         0.0479626    0               -1             113
20         0.0383414    0               -1             112
21         0.0248510    0               -1             111
22         0.0218060    0               -1             110
23         0.0161299    0               -1             109
24         0.0035600    0               -1             108
51
Lift Curve May Go Negative
If the total net benefit from all cases is negative, the reference line will have a negative slope.
Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at a maximum.
52
Negative slope to reference curve
53
Oversampling and Asymmetric Costs
54
Rare Cases
Examples: responder to a mailing, someone who commits fraud, a debt defaulter.
Often we oversample rare cases to give the model more information to work with.
Typically use 50% "1" and 50% "0" for training.
Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class.
55
Example
The following graphs show optimal classification under three scenarios:
1. assuming equal costs of misclassification
2. assuming that misclassifying "o" is five times the cost of misclassifying "x"
3. an oversampling scheme allowing DM methods to incorporate asymmetric costs
56
Classification: equal costs
57
Classification: Unequal costs Suppose that failing to catch “o” is 5 times as costly as failing to catch “x”.
58
Oversampling for asymmetric costs Oversample “o” to appropriately weight misclassification costs – without or with replacement
59
Equal number of responders
Sample an equal number of responders and non-responders.
60
An Oversampling Procedure
1. Separate the responders (rare) from the non-responders.
2. Randomly assign half the responders to the training sample, plus an equal number of non-responders.
3. The remaining responders go to the validation sample.
4. Add non-responders to the validation data to maintain the original ratio of responders to non-responders.
5. Randomly take a test set (if needed) from the validation data.
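A rough sketch of steps 1-4 in pandas; the DataFrame, the "response" column name, and the 2% response rate are illustrative assumptions, not from the book:

```python
import numpy as np
import pandas as pd

def oversample_split(df, outcome="response", seed=1):
    """Train = half the responders plus an equal number of non-responders;
    validation keeps the original responder ratio."""
    responders    = df[df[outcome] == 1].sample(frac=1, random_state=seed)
    nonresponders = df[df[outcome] == 0].sample(frac=1, random_state=seed)

    n_half = len(responders) // 2
    train = pd.concat([responders.iloc[:n_half], nonresponders.iloc[:n_half]])

    # Remaining responders go to validation, topped up with enough
    # non-responders to restore the original responder/non-responder ratio
    valid_resp = responders.iloc[n_half:]
    n_valid_nonresp = int(round(len(valid_resp) * len(nonresponders) / len(responders)))
    valid = pd.concat([valid_resp, nonresponders.iloc[n_half:n_half + n_valid_nonresp]])
    return train, valid

# Tiny illustration: 2% responders in a 1,000-record dataset
df = pd.DataFrame({"x": np.arange(1000), "response": [1] * 20 + [0] * 980})
train, valid = oversample_split(df)
print(train["response"].mean(), valid["response"].mean())   # about 0.50 vs. about 0.02
```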
61
Assessing model performance
1. Score the model with a validation dataset selected without oversampling.
2. Score the model with an oversampled validation dataset and reweight the results to remove the effects of oversampling.
Method 1 is straightforward and easier to implement.
62
Adjusting the confusion matrix for oversampling

                 Whole data   Sample
Responders           2%         50%
Nonresponders       98%         50%

Example: one responder in the whole data = 50/2 = 25 in the sample; one nonresponder in the whole data = 50/98 = 0.5102 in the sample.
63
Adjusting the confusion matrix for oversampling
Suppose that the confusion matrix with the oversampled validation dataset is as follows:

Classification matrix, oversampled data (validation)
              Predicted 0   Predicted 1   Total
Actual 0          390           110        500
Actual 1           80           420        500
Total             470           530       1000

Misclassification rate = (80 + 110)/1000 = 0.19, or 19%
Percentage of records predicted as "1" = 530/1000 = 0.53, or 53%
64
Adjusting the confusion matrix for oversampling
Weight for a responder (Actual 1) = 25; weight for a nonresponder (Actual 0) = 0.5102.

Classification matrix, reweighted
              Predicted 0            Predicted 1           Total
Actual 0      390/0.5102 = 764.4     110/0.5102 = 215.6     980
Actual 1      80/25 = 3.2            420/25 = 16.8           20
Total         767.6                  232.4                 1000

Misclassification rate = (3.2 + 215.6)/1000 = 0.219, or 21.9%
Percentage of records predicted as "1" = 232.4/1000 = 0.2324, or 23.24%
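The same reweighting as a short NumPy computation (the matrix and weights are the ones above):

```python
import numpy as np

# Oversampled validation confusion matrix: rows = actual (0, 1), columns = predicted (0, 1)
cm_oversampled = np.array([[390.0, 110.0],
                           [ 80.0, 420.0]])

# Weight per row: 0.5102 for nonresponders (Actual 0), 25 for responders (Actual 1).
# Dividing each row by its weight converts sampled counts back to the whole-data scale.
weights = np.array([0.5102, 25.0])
cm_reweighted = cm_oversampled / weights[:, None]
print(cm_reweighted)            # approximately [[764.4, 215.6], [3.2, 16.8]]

misclass_rate = (cm_reweighted[0, 1] + cm_reweighted[1, 0]) / cm_reweighted.sum()
print(misclass_rate)            # about 0.219
```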
65
Adjusting Lift curve for Oversampling
Sort records in descending order of probability of success (success = belonging to the class of interest).
For each record, compute the cost/benefit of the actual outcome.
Divide the value by the oversampling rate of the actual class.
Compute a cumulative column of weighted cost/benefit.
Plot the cumulative weighted cost/benefit on the y-axis and row number (no. of records) on the x-axis.
66
Adjusting Lift curve for Oversampling - Example
Suppose that the cost/benefit and oversampling weights are as follows:

              Cost/Benefit   Oversampling weight
Actual 0          -3               0.6
Actual 1          50              20

Case No.   Propensity   Actual class   Weighted cost/benefit   Cumulative weighted cost/benefit
 1         0.9959767    1                2.50                    2.50
 2         0.9875331    1                2.50                    5.00
 3         0.9844564    1                2.50                    7.50
 4         0.9804396    1                2.50                   10.00
 5         0.9481164    1                2.50                   12.50
 6         0.8892972    1                2.50                   15.00
 7         0.8476319    1                2.50                   17.50
 8         0.7628063    0               -5.00                   12.50
 9         0.7069919    1                2.50                   15.00
10         0.6807541    1                2.50                   17.50
11         0.6563437    1                2.50                   20.00
12         0.6224195    0               -5.00                   15.00
13         0.5055069    1                2.50                   17.50
14         0.4713405    0               -5.00                   12.50
15         0.3371174    0               -5.00                    7.50
16         0.2179678    1                2.50                   10.00
17         0.1992404    0               -5.00                    5.00
18         0.1494827    0               -5.00                    0.00
19         0.0479626    0               -5.00                   -5.00
20         0.0383414    0               -5.00                  -10.00
21         0.0248510    0               -5.00                  -15.00
22         0.0218060    0               -5.00                  -20.00
23         0.0161299    0               -5.00                  -25.00
24         0.0035600    0               -5.00                  -30.00
67
Adjusting Lift curve for Oversampling -- Example