Cost-Sensitive Classifier Evaluation
Robert Holte, Computing Science Dept., University of Alberta
Co-author: Chris Drummond, IIT, National Research Council, Ottawa

Classifiers
A classifier assigns an object to one of a predefined set of categories or classes. Examples:
– A metal detector either sounds an alarm or stays quiet when someone walks through.
– A credit card application is either approved or denied.
– A medical test's outcome is either positive or negative.
This talk: only two classes, "positive" and "negative".

Two Types of Error
– False negative ("miss"), FN: the alarm doesn't sound but the person is carrying metal.
– False positive ("false alarm"), FP: the alarm sounds but the person is not carrying metal.

2-class Confusion Matrix

                          Predicted positive    Predicted negative
    True positive (#P)          #TP                 #P - #TP
    True negative (#N)          #FP                 #N - #FP

Reduce the 4 numbers to two rates:
– true positive rate  = TP = (#TP) / (#P)
– false positive rate = FP = (#FP) / (#N)
Rates are independent of the class ratio* (* subject to certain conditions).
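These two rates are easy to compute from the four counts. A minimal Python sketch (not from the original talk; the function name and arguments are illustrative):

    def rates(tp, fn, fp, tn):
        """Return (true positive rate, false positive rate) from confusion-matrix counts."""
        num_pos = tp + fn          # #P: number of positive examples
        num_neg = fp + tn          # #N: number of negative examples
        return tp / num_pos, fp / num_neg

    # Classifier 1 from the next slide: true pos 40 / 60, true neg 30 / 70
    print(rates(40, 60, 30, 70))   # -> (0.4, 0.3)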

Example: 3 classifiers (counts are predicted pos / predicted neg for each true class)

    Classifier 1 (TP = 0.4, FP = 0.3):  true pos 40 / 60,  true neg 30 / 70
    Classifier 2 (TP = 0.7, FP = 0.5):  true pos 70 / 30,  true neg 50 / 50
    Classifier 3 (TP = 0.6, FP = 0.2):  true pos 60 / 40,  true neg 20 / 80

Assumptions
Standard Cost Model:
– correct classification costs 0
– the cost of misclassification depends only on the class, not on the individual example
– over a set of examples, costs are additive
Costs or Class Distributions:
– are not known precisely at evaluation time
– may vary with time
– may depend on where the classifier is deployed
True FP and TP do not vary with time or location, and are accurately estimated.

How to Evaluate Performance?
Scalar Measures:
– Accuracy
– Expected cost
– Area under the ROC curve
Visualization Techniques:
– ROC curves
– Cost curves

What's Wrong with Scalars?
A scalar does not tell the whole story:
– There are fundamentally two numbers of interest (FP and TP); a single number invariably loses some information.
– How are errors distributed across the classes?
– How will each classifier perform in different testing conditions (costs or class ratios other than those measured in the experiment)?
A scalar imposes a linear ordering on classifiers:
– what we want is to identify the conditions under which each is better.

What's Wrong with Scalars?
A table of scalars is just a mass of numbers:
– No immediate impact
– Poor way to present results in a paper
– Equally poor way for an experimenter to analyze results
Some scalars (accuracy, expected cost) require precise knowledge of costs and class distributions:
– Often these are not known precisely and might vary with time or location of deployment.

Why visualize performance?
– Shape of curves is more informative than a single number.
– A curve informs about:
  – all possible misclassification costs*
  – all possible class ratios*
  – under what conditions C1 outperforms C2
– Immediate impact (if done well)
* subject to certain conditions

Example: 3 classifiers (counts are predicted pos / predicted neg for each true class)

    Classifier 1 (TP = 0.4, FP = 0.3):  true pos 40 / 60,  true neg 30 / 70
    Classifier 2 (TP = 0.7, FP = 0.5):  true pos 70 / 30,  true neg 50 / 50
    Classifier 3 (TP = 0.6, FP = 0.2):  true pos 60 / 40,  true neg 20 / 80

ROC plot for the 3 classifiers
[ROC plot showing the three example classifiers as points, together with the ideal classifier, the chance diagonal, and the "always negative" and "always positive" classifiers.]

Dominance

Operating Range
The slope indicates the class distributions and misclassification costs for which the classifier is better than "always negative"; likewise for "always positive".

Convex Hull Slope indicates the class distributions and misclassification costs for which the red classifier is the same as the blue one.

Creating an ROC Curve
– A classifier produces a single ROC point.
– If the classifier has a "sensitivity" parameter, varying it produces a series of ROC points (confusion matrices).
– Alternatively, if the classifier is produced by a learning algorithm, a series of ROC points can be generated by varying the class ratio in the training set.
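As a concrete illustration of the second bullet, here is a hedged Python sketch (not the authors' code) that sweeps a decision threshold over real-valued scores and records one (FP, TP) point per threshold:

    import numpy as np

    def roc_points(scores, labels):
        """labels are 1 (positive) / 0 (negative); scores are real-valued classifier outputs.
        Returns a list of (FP rate, TP rate) points, one confusion matrix per threshold."""
        scores = np.asarray(scores, dtype=float)
        labels = np.asarray(labels, dtype=int)
        num_pos = labels.sum()
        num_neg = len(labels) - num_pos
        points = [(0.0, 0.0)]
        for t in np.unique(scores)[::-1]:          # sweep the threshold from high to low
            pred_pos = scores >= t
            tp = np.sum(pred_pos & (labels == 1))
            fp = np.sum(pred_pos & (labels == 0))
            points.append((fp / num_neg, tp / num_pos))
        points.append((1.0, 1.0))
        return points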

ROC Curve

What’s Wrong with ROC Curves ?

ROC curves for two classifiers
– When to switch from C4.5 to IB1?
– What is the performance difference?
– When to use the default classifiers?
– How to tell if the difference between two ROC curves is statistically significant?

ROC curves from two cross-validation runs
– How to average them?
– How to compute a confidence interval for the average ROC curve?

And we would like to be able to answer all these questions by visual inspection …

Cost Curves

Cost Curves (1)
[Cost curve plot: x axis is the probability of positive, P(+); y axis is error rate. Each of the three example classifiers (1: TP = 0.4, FP = 0.3; 2: TP = 0.7, FP = 0.5; 3: TP = 0.6, FP = 0.2) is a straight line from its FP rate at P(+) = 0 to its FN rate (FN = 1 - TP) at P(+) = 1.]

Cost Curves (2)
[Cost curve plot: error rate vs. probability of positive P(+), with the "always negative" and "always positive" classifiers added and a classifier's operating range marked.]

Lower Envelope
[Cost curve plot: error rate vs. probability of positive P(+), showing the lower envelope, i.e. the pointwise minimum over all the classifier lines.]
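A small sketch of how such a lower envelope can be computed (my own illustration, not the talk's software): each classifier with rates (FP, FN) contributes the line Y = FN x + FP (1-x), and the envelope is the pointwise minimum, here including the two trivial classifiers.

    import numpy as np

    def lower_envelope(classifiers, n_grid=101):
        """classifiers: iterable of (FP, FN) rate pairs.
        Returns the x grid (P(+)) and the pointwise-minimum error rate."""
        x = np.linspace(0.0, 1.0, n_grid)
        # add the trivial classifiers: always-negative (FP=0, FN=1) and always-positive (FP=1, FN=0)
        all_lines = list(classifiers) + [(0.0, 1.0), (1.0, 0.0)]
        cost_lines = np.array([fn * x + fp * (1.0 - x) for fp, fn in all_lines])
        return x, cost_lines.min(axis=0)

    # the three example classifiers as (FP, FN) = (FP, 1 - TP)
    x, envelope = lower_envelope([(0.3, 0.6), (0.5, 0.3), (0.2, 0.4)])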

Cost Curves
[Cost curve plot: error rate vs. probability of positive P(+), with the "always negative" and "always positive" classifiers shown.]

Taking Costs Into Account
Y = FN X + FP (1 - X)
So far X = p(+), making Y = error rate.
To take costs into account, Y becomes the expected cost normalized to [0, 1], and

    X = p(+) C(-|+) / [ p(+) C(-|+) + (1 - p(+)) C(+|-) ]
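A hedged Python sketch of these two quantities (my reading of the formula above; c_fn stands for C(-|+), the cost of a false negative, and c_fp for C(+|-)):

    def pc_positive(p_pos, c_fn, c_fp):
        """Probability-cost value: the x coordinate once costs are taken into account."""
        return (p_pos * c_fn) / (p_pos * c_fn + (1.0 - p_pos) * c_fp)

    def normalized_expected_cost(fp, fn, p_pos, c_fn, c_fp):
        """Y = FN * X + FP * (1 - X), with X the probability-cost value; lies in [0, 1]."""
        x = pc_positive(p_pos, c_fn, c_fp)
        return fn * x + fp * (1.0 - x)

    # Classifier 3 (FP = 0.2, FN = 0.4) when positives are rare (10%) but 10x more costly to miss:
    print(normalized_expected_cost(0.2, 0.4, p_pos=0.1, c_fn=10.0, c_fp=1.0))   # ~0.31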

Comparing Cost Curves

Averaging ROC Curves

Averaging Cost Curves

Cost Curve Avg. in ROC Space

Confidence Intervals
Resample the confusion matrix many times and take the 95% envelope.

    Original     (TP = 0.78, FP = 0.40):  true pos 78 / 22,  true neg 40 / 60
    Resample #1  (TP = 0.75, FP = 0.45):  true pos 75 / 25,  true neg 45 / 55
    Resample #2  (TP = 0.83, FP = 0.38):  true pos 83 / 17,  true neg 38 / 62

(counts are predicted pos / predicted neg for each true class)
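One way to implement this resampling (a sketch under my own assumptions, not necessarily the authors' exact procedure): bootstrap the predictions within each true class and recompute the rates each time.

    import numpy as np

    def bootstrap_rates(tp, fn, fp, tn, n_boot=1000, seed=0):
        """Resample the confusion matrix and return bootstrap arrays of (TP rate, FP rate)."""
        rng = np.random.default_rng(seed)
        pos_preds = np.array([1] * tp + [0] * fn)    # predictions on the true positives
        neg_preds = np.array([1] * fp + [0] * tn)    # predictions on the true negatives
        tp_rates = np.empty(n_boot)
        fp_rates = np.empty(n_boot)
        for i in range(n_boot):
            tp_rates[i] = rng.choice(pos_preds, size=pos_preds.size, replace=True).mean()
            fp_rates[i] = rng.choice(neg_preds, size=neg_preds.size, replace=True).mean()
        return tp_rates, fp_rates

    # original matrix above: true pos 78 / 22, true neg 40 / 60
    tp_rates, fp_rates = bootstrap_rates(78, 22, 40, 60)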

Confidence Interval Example

Paired Resampling to Test Statistical Significance
For the 100 test examples in the negative class, cross-tabulate the two classifiers' predictions:

                                 Predicted by Classifier 2
    Predicted by Classifier 1      pos    neg
    pos                             30     10
    neg                              0     60

FP rate for classifier 1: (30 + 10) / 100 = 0.40
FP rate for classifier 2: (30 + 0) / 100 = 0.30
FP2 - FP1 = -0.10
Resample this matrix many times to get (FP2 - FP1) values. Do the same for the matrix based on the positive test examples. Plot and take the 95% envelope as before.
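A hedged sketch of the paired resampling for the negative class (my own illustration, assuming the 2x2 agreement matrix above, with rows indexed by classifier 1's prediction and columns by classifier 2's, 0 = pos and 1 = neg):

    import numpy as np

    def paired_fp_differences(counts, n_boot=1000, seed=0):
        """counts[i][j] = number of negative test examples predicted class i by classifier 1
        and class j by classifier 2 (0 = pos, 1 = neg). Returns bootstrap FP2 - FP1 values."""
        rng = np.random.default_rng(seed)
        flat = np.array([counts[0][0], counts[0][1], counts[1][0], counts[1][1]], dtype=float)
        n = int(flat.sum())
        probs = flat / n
        diffs = np.empty(n_boot)
        for i in range(n_boot):
            c_pp, c_pn, c_np, c_nn = rng.multinomial(n, probs)
            fp1 = (c_pp + c_pn) / n     # classifier 1 predicted "positive"
            fp2 = (c_pp + c_np) / n     # classifier 2 predicted "positive"
            diffs[i] = fp2 - fp1
        return diffs

    diffs = paired_fp_differences([[30, 10], [0, 60]])   # matrix from the slide above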

Paired Resampling to Test Statistical Significance
[Plot of the resampled (FP2 - FP1, FN2 - FN1) differences for classifier 1 vs. classifier 2.]

Correlation between Classifiers
High correlation:

                                 Predicted by Classifier 2
    Predicted by Classifier 1      pos    neg
    pos                             30     10
    neg                              0     60

Low correlation (same FP1 and FP2 as above):

                                 Predicted by Classifier 2
    Predicted by Classifier 1      pos    neg
    pos                              0     40
    neg                             30     30

Low correlation = Low significance
[Plot of the resampled (FP2 - FP1, FN2 - FN1) differences for the low-correlation case.]

Limited Range of Significance

Better Data Analysis

ROC, C4.5 Splitting Criteria

Cost Curve, C4.5 Splitting Criteria

ROC, Selection procedure Suppose this classifier was produced by a training set with a class ratio of 10:1, and was used whenever the deployment situation had a 10:1 class ratio.

Cost Curves, Selection Procedure

ROC, Many Points

Cost Curves, Many Lines

Conclusions
– Scalar performance measures should not be used if costs and class distributions are not exactly known or might vary with time or location.
– Cost curves enable easy visualization of:
  – average performance (expected cost)
  – operating range
  – confidence intervals on performance
  – difference in performance and its significance

Fin
Cost curve software is available. Contact:
Thanks to the Alberta Ingenuity Centre for Machine Learning.