Big Data Analytics: Evaluating Classification Performance. April 2016, R. Bohn. Some overheads from Galit Shmueli and Peter Bruce (2010).

Most accurate ≠ Best! [Figure: predictions from two models compared against the actual value. Which is more accurate?]

Why Evaluate a Classifier?
- Multiple methods are available to classify or predict
- For each method, multiple choices of settings are available
- To choose the best model, we need to assess each model's performance
- When you use it, how good will it be? How "reliable" are the results?

Misclassification Error (for discrete classification problems)
- Error = classifying a record as belonging to one class when it actually belongs to another class
- Error rate = percent of misclassified records out of the total records in the validation data

Examples

Confusion Matrix
- 201 1's correctly classified as "1"
- 85 1's incorrectly classified as "0"
- 25 0's incorrectly classified as "1"
- 2,689 0's correctly classified as "0"
- Note the asymmetry: many more 0's than 1's. Generally 1 = the "interesting" or rare case.

Naïve Rule
- Naïve rule: classify all records as belonging to the most prevalent class
- Often used as a benchmark: we hope to do better than that
- Exception: when the goal is to identify high-value but rare outcomes, we may accept a worse overall error rate than the naïve rule in return for identifying more of the rare class (see "lift", later)

x = inputs; f(x) = prediction; Y = true result

Error Rate: Overall
- Overall error rate = (25 + 85)/3000 = 3.67%
- Accuracy = 1 - error rate = (201 + 2689)/3000 = 96.33%
- Expands to an n x n matrix, where n = number of classes
- With multiple classes, error rate = (sum of misclassified records)/(total records)
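
A minimal Python sketch (not from the original slides) that reproduces the arithmetic above from the four counts in the confusion matrix:

```python
# Confusion-matrix counts from the earlier slide
TP = 201    # actual 1, classified as "1"
FN = 85     # actual 1, classified as "0"
FP = 25     # actual 0, classified as "1"
TN = 2689   # actual 0, classified as "0"

total = TP + FN + FP + TN           # 3000 records
error_rate = (FP + FN) / total      # (25 + 85) / 3000 = 3.67%
accuracy = (TP + TN) / total        # (201 + 2689) / 3000 = 96.33%

print(f"error rate = {error_rate:.2%}, accuracy = {accuracy:.2%}")
```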

Cutoff for Classification
Most DM algorithms classify via a 2-step process. For each record:
- Compute the probability of belonging to class "1"
- Compare it to a cutoff value, and classify accordingly
The default cutoff value is often 0.50: if >= 0.50, classify as "1"; if < 0.50, classify as "0". You may prefer a different cutoff value.
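
The cutoff step can be sketched in a few lines of Python; the probabilities below are illustrative placeholders, not the slide data:

```python
def classify(probs, cutoff=0.5):
    """Turn estimated probabilities of class "1" into 0/1 classifications."""
    return [1 if p >= cutoff else 0 for p in probs]

probs = [0.92, 0.61, 0.48, 0.30, 0.07]   # hypothetical estimated probabilities
print(classify(probs, cutoff=0.5))       # [1, 1, 0, 0, 0]
print(classify(probs, cutoff=0.7))       # [1, 0, 0, 0, 0]  (stricter cutoff flags fewer 1's)
```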

[Figure from Di Dai. For now, pretend it is continuous data.]

Example: 24 cases classified. Estimated probabilities range from 0.004 to 0.996.

Obs. #   Est. prob.      Obs. #   Est. prob.
  1       0.996            13      0.506
  2       0.988            14      0.471
  3       0.984            15      0.337
  4       0.980            16      0.218
  5       0.948            17      0.199
  6       0.889            18      0.149
  7       0.949            19      0.048
  8       0.762            20      0.038
  9       0.707            21      0.025
 10       0.681            22      0.022
 11       0.656            23      0.016
 12       0.622            24      0.004

Set cutoff = 50%: forecast 13 type-1's (IDs 1-13). [Same probability table as above.]

Set cutoff = 50%: forecast 13 type-1's (IDs 1-13). Result: 4 false positives, 1 false negative, 5/24 total errors. [Same probability table as above.]

Set cutoff = 70%: forecast 9 type-1's (IDs 1-9). Result: 1 false positive, 2 false negatives (we were lucky). [Same probability table as above.]

Which is better?
- High cutoff (70%) => more false negatives, fewer false positives
- Lower cutoff (50%) gives the opposite
- Which is better is not a statistics question; it's a business/application question
- The total number of errors is usually NOT the right criterion
- Which kind of error is more serious? Only rarely are the costs the same
- Relevant tools: statistical decision theory, Bayesian logic

When One Class Is More Important
In many cases it is more important to identify members of one class:
- Tax fraud
- Credit default
- Response to a promotional offer
- Detecting electronic network intrusion
- Predicting delayed flights
In such cases, we are willing to tolerate greater overall error in return for better identifying the important class for further attention.

There is always a tradeoff: false positives vs. false negatives. How do we estimate these two types of errors? Answer: with our set-aside data, the validation data set.

x = inputs; f(x) = prediction; Y = true result

Common Performance Measures
- None of these is a hypothesis test (for example, "is p(positive) > 0?")
- Reason: we don't care about the "overall" rate; we care about predictions for individual cases

Alternate Accuracy Measures
If "C1" is the important class:
- Sensitivity = % of actual "C1" records correctly classified
- Specificity = % of actual "C0" records correctly classified
- False positive rate = % of predicted "C1's" that were not actually "C1's"
- False negative rate = % of predicted "C0's" that were not actually "C0's"
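
A small sketch computing these measures from the earlier confusion-matrix counts. It follows the definitions on this slide, where the false positive/negative rates are shares of the predicted classes; the next slides use the FP/(FP + TN) convention instead.

```python
# Counts from the earlier confusion matrix; class "1" (C1) is the important class
TP, FN, FP, TN = 201, 85, 25, 2689

sensitivity = TP / (TP + FN)          # share of actual 1's correctly classified
specificity = TN / (TN + FP)          # share of actual 0's correctly classified

# Rates as defined on this slide (shares of the predictions that are wrong)
false_positive_rate = FP / (FP + TP)  # predicted 1's that were actually 0
false_negative_rate = FN / (FN + TN)  # predicted 0's that were actually 1

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
print(f"FP rate={false_positive_rate:.3f}, FN rate={false_negative_rate:.3f}")
```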

Medical terms: in medical diagnostics, test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate), whereas test specificity is the ability of the test to correctly identify those without the disease (true negative rate).

Significance vs. Power (statistics)
- False positive rate (α) = P(type I error) = 1 - specificity = FP / (FP + TN) = significance level. For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis.
- False negative rate (β) = P(type II error) = 1 - sensitivity = FN / (TP + FN).
- Power = sensitivity = 1 - β: the test's probability of correctly rejecting the null hypothesis; the complement of the false negative rate, β.

Evaluation Measures for Classification: Area under the Precision-Recall Curve (auPRC)

ROC Curve
As we change the cutoff, the false positive rate (and the true positive rate) change; the ROC curve traces out this tradeoff. [Figure: ROC curve; axis label: false positive rate.]
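
A minimal sketch of how an ROC curve is computed, sweeping the cutoff over the estimated probabilities; the data are illustrative placeholders:

```python
def roc_points(probs, actual):
    """For each candidate cutoff, classify and record (false positive rate,
    true positive rate). Assumes both classes are present in `actual`."""
    pos = sum(actual)
    neg = len(actual) - pos
    points = []
    for cutoff in sorted(set(probs), reverse=True):
        pred = [1 if p >= cutoff else 0 for p in probs]
        tp = sum(1 for y, yhat in zip(actual, pred) if y == 1 and yhat == 1)
        fp = sum(1 for y, yhat in zip(actual, pred) if y == 0 and yhat == 1)
        points.append((fp / neg, tp / pos))
    return points

probs  = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]   # hypothetical
actual = [1,   1,   0,   1,    0,   1,   0,   0]
for fpr, tpr in roc_points(probs, actual):
    print(f"FPR = {fpr:.2f}   TPR = {tpr:.2f}")
```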

Lift and Decile Charts: Goal
- Useful for assessing performance in terms of identifying the most important class
- Helps evaluate, e.g.:
  - How many tax records to examine
  - How many loans to grant
  - How many customers to mail an offer to

Lift and Decile Charts – Cont.
- Compare the performance of the DM model to "no model, pick randomly"
- Measures the ability of the DM model to identify the important class, relative to its average prevalence
- The charts give an explicit assessment of results over a large number of cutoffs

Lift and Decile Charts: How to Use
- Compare lift to the "no model" baseline
- In the lift chart: compare the step function to the straight line
- In the decile chart: compare to a ratio of 1

Lift Chart – cumulative performance: after examining, e.g., 10 cases (x-axis), 9 owners (y-axis) have been correctly identified.

Decile Chart: in the "most probable" (top) decile, the model is twice as likely to identify the important class, compared to its average prevalence.

Lift Charts: How to Compute
- Using the model's output, sort records from most likely to least likely members of the important class
- Compute lift: accumulate the correctly classified "important class" records (y-axis) and compare to the number of records examined (x-axis)
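
A minimal sketch of that computation (a cumulative gains curve), with placeholder data rather than the deck's example:

```python
def cumulative_gains(probs, actual):
    """Rank records from most to least likely "1", then accumulate the
    number of actual 1's found as we move down the ranking."""
    ranked = sorted(zip(probs, actual), key=lambda pair: pair[0], reverse=True)
    found, curve = 0, []
    for n, (_, y) in enumerate(ranked, start=1):
        found += y
        curve.append((n, found))   # (records examined, important-class records found)
    return curve

probs  = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.15, 0.05]   # hypothetical
actual = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0]
baseline = sum(actual) / len(actual)   # "no model" response rate
for n, found in cumulative_gains(probs, actual):
    print(f"after {n:2d} records: model found {found}, random would expect {baseline * n:.1f}")
```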

Lift vs. Decile Charts
- Both embody the concept of "moving down" through the records, starting with the most probable
- Decile chart: works in decile chunks of data; the y-axis shows the ratio of the decile mean to the overall mean
- Lift chart: shows continuous cumulative results; the y-axis shows the number of important-class records identified
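
Following the same idea, a hedged sketch of the decile computation (assumes the record count is a multiple of the number of bins and that at least one "1" is present):

```python
def decile_lift(probs, actual, n_bins=10):
    """Ratio of each decile's response rate to the overall response rate,
    with records ranked from most to least probable."""
    ranked = [y for _, y in sorted(zip(probs, actual), reverse=True)]
    overall = sum(actual) / len(actual)
    size = len(ranked) // n_bins       # assumes len(ranked) % n_bins == 0
    return [sum(ranked[d * size:(d + 1) * size]) / size / overall
            for d in range(n_bins)]
```

A top-decile value of 2.0 means the model finds "1"s at twice their average prevalence in that decile, which is how the decile chart above is read.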

Asymmetric Costs

Misclassification Costs May Differ
- The cost of making a misclassification error may be higher for one class than the other(s)
- Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)

Example – Response to a Promotional Offer
- Suppose we send an offer to 1,000 people, with a 1% average response rate ("1" = response, "0" = nonresponse)
- The "naïve rule" (classify everyone as "0") has an error rate of only 1% (seems good)
- With a lower cutoff, suppose we can correctly classify eight 1's as 1's
- It comes at the cost of misclassifying twenty 0's as 1's and two 1's as 0's

New Confusion Matrix: error rate = (2 + 20)/1000 = 2.2% (higher than the naïve rate).

Introducing Costs & Benefits
Suppose:
- Profit from a "1" is $10
- Cost of sending an offer is $1
Then:
- Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit
- Under the DM model's predictions, 28 offers are sent:
  - 8 respond, with a profit of $10 each
  - 20 fail to respond, costing $1 each
  - 972 receive nothing (no cost, no profit)
- Net profit = 8 x $10 - 20 x $1 = $60
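
The arithmetic on this slide, written out as a short Python check (numbers taken from the slide itself):

```python
profit_per_responder = 10    # $ profit from a "1" who responds
cost_per_offer = 1           # $ cost of an offer that draws no response

responders = 8               # offers sent to actual 1's who respond
nonresponders = 20           # offers sent to 0's (wasted)

net_profit = responders * profit_per_responder - nonresponders * cost_per_offer
print(net_profit)            # 8*10 - 20*1 = 60, matching the slide
```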

Profit Matrix

Lift (again)
- Adding costs to the mix, as above, does not change the actual classifications
- Better: use the lift curve and change the cutoff value for "1" to maximize profit

Generalize to a Cost Ratio
- Sometimes actual costs and benefits are hard to estimate
- Need to express everything in terms of costs (i.e., the cost of misclassification per record)
- The goal is to minimize the average cost per record
- A good practical substitute for individual costs is the ratio of misclassification costs (e.g., "misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms")

Minimizing the Cost Ratio
- q1 = cost of misclassifying an actual "1"; q0 = cost of misclassifying an actual "0"
- Minimizing the cost ratio q1/q0 is identical to minimizing the average cost per record
- Software* may provide an option for the user to specify the cost ratio (*currently unavailable in XLMiner)
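
A minimal sketch of the "average cost per record" idea; the 5:1 ratio echoes the fraud example on the previous slide, and the FN/FP counts are the ones from the earlier confusion matrix:

```python
def avg_cost_per_record(fn, fp, n, q1=5.0, q0=1.0):
    """Average misclassification cost per record, where q1 is the cost of
    misclassifying an actual "1" and q0 of misclassifying an actual "0".
    Only the ratio q1/q0 matters for comparing cutoffs or models."""
    return (q1 * fn + q0 * fp) / n

print(avg_cost_per_record(fn=85, fp=25, n=3000))   # 0.15 cost units per record
```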

Note: Opportunity Costs
- As we have seen, it is best to convert everything to costs, as opposed to a mix of costs and benefits
- E.g., instead of "benefit from sale", refer to "opportunity cost of lost sale"
- This leads to the same decisions, but referring only to costs allows greater applicability

Cost Matrix (including opportunity costs). Recall the original confusion matrix (profit from a "1" = $10, cost of sending an offer = $1):

Adding Cost/Benefit to the Lift Curve
- Sort records in descending order of estimated probability of success
- For each case, record the cost/benefit of its actual outcome
- Also record the cumulative cost/benefit
- Plot all records: the x-axis is the index number (1 for the 1st case, n for the nth case); the y-axis is the cumulative cost/benefit
- Draw a reference line from the origin to y_n (y_n = total net benefit)
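
A minimal sketch of that procedure; the probabilities and per-case outcomes are illustrative placeholders (+$9 for a responder net of the offer cost, -$1 for a wasted offer):

```python
def cumulative_net_benefit(probs, benefits):
    """Sort cases by estimated probability of success (descending) and
    accumulate the cost/benefit of each actual outcome."""
    ranked = sorted(zip(probs, benefits), key=lambda pair: pair[0], reverse=True)
    total, curve = 0.0, []
    for i, (_, b) in enumerate(ranked, start=1):
        total += b
        curve.append((i, total))
    return curve

probs    = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]    # hypothetical
benefits = [9,   -1,  9,   -1,  -1,  -1]
curve = cumulative_net_benefit(probs, benefits)
print(curve)   # reference line runs from the origin to (n, y_n) = (6, 14) here
```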

Multiple Classes
- For m classes, the confusion matrix has m rows and m columns
- Theoretically, there are m(m-1) misclassification costs, since any case could be misclassified in m-1 ways
- In practice, often too many to work with
- In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest

Negative slope to the reference curve

Lift Curve May Go Negative
- If the total net benefit from all cases is negative, the reference line will have a negative slope
- Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at a maximum

Oversampling and Asymmetric Costs. Caution: most real DM problems have asymmetric costs!

Rare Cases
- Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class:
  - Responder to a mailing
  - Someone who commits fraud
  - Debt defaulter
- Often we oversample rare cases to give the model more information to work with
- Typically use 50% "1" and 50% "0" for training

Example
The following graphs show optimal classification under three scenarios:
- Assuming equal costs of misclassification
- Assuming that misclassifying "o" is five times the cost of misclassifying "x"
- An oversampling scheme allowing DM methods to incorporate asymmetric costs

Classification: equal costs

Classification: unequal costs (skip)

Oversampling Scheme: oversample "o" to appropriately weight misclassification costs (skip)

An Oversampling Procedure
- Separate the responders (rare) from the non-responders
- Randomly assign half the responders to the training sample, plus an equal number of non-responders
- The remaining responders go to the validation sample
- Add non-responders to the validation data, to maintain the original ratio of responders to non-responders
- Randomly take a test set (if needed) from the validation data
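
A hedged Python sketch of this partitioning scheme; `responders` and `nonresponders` stand for lists of records, and the function name is a placeholder rather than anything from the deck:

```python
import random

def oversampled_split(responders, nonresponders, seed=1):
    """Half of the (rare) responders plus an equal number of non-responders
    form the training set; the rest go to validation. Assumes non-responders
    greatly outnumber responders, so the validation ratio stays close to the
    original one."""
    rng = random.Random(seed)
    resp = responders[:]
    nonresp = nonresponders[:]
    rng.shuffle(resp)
    rng.shuffle(nonresp)

    half = len(resp) // 2
    training = resp[:half] + nonresp[:half]      # roughly 50% "1", 50% "0"
    validation = resp[half:] + nonresp[half:]    # remaining records
    return training, validation
```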

Classification Using Triage
- Take into account a gray area in making classification decisions
- Instead of classifying as C1 or C0, we classify as C1, C0, or "can't say"
- The third category might receive special human review

Evaluating Predictive Performance

Measuring Predictive Error
- Not the same as "goodness of fit"
- We want to know how well the model predicts new data, not how well it fits the data it was trained with
- The key component of most measures is the difference between the actual y and the predicted y (the "error")

Some Measures of Error
- MAE or MAD (mean absolute error/deviation): gives an idea of the magnitude of errors
- Average error: gives an idea of systematic over- or under-prediction
- MAPE: mean absolute percentage error
- RMSE (root mean squared error): square the errors, find their average, take the square root
- Total SSE: total sum of squared errors
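
The measures listed above, in one short Python function (illustrative values only; MAPE assumes no actual value is zero):

```python
def error_measures(actual, predicted):
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae     = sum(abs(e) for e in errors) / n                        # MAE / MAD
    avg_err = sum(errors) / n                                        # systematic bias
    mape    = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    rmse    = (sum(e * e for e in errors) / n) ** 0.5
    sse     = sum(e * e for e in errors)                             # total SSE
    return mae, avg_err, mape, rmse, sse

actual    = [10.0, 12.0, 8.0, 15.0]     # hypothetical
predicted = [11.0, 10.5, 8.5, 14.0]
print(error_measures(actual, predicted))
```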

Lift Chart for Predictive Error: similar to the lift chart for classification, except that the y-axis is the cumulative value of the numeric target variable (e.g., revenue) instead of the cumulative count of "responses".

Lift chart example – spending

Summary
- Evaluation metrics are important for comparing across DM models, for choosing the right configuration of a specific DM model, and for comparing against the baseline
- Major metrics: confusion matrix, error rate, predictive error measures
- Other metrics apply when one class is more important or when costs are asymmetric
- When the important class is rare, use oversampling
- In all cases, compute the metrics from the validation data