ROC Curves and Operating Points

ROC Curves and Operating Points
Geoff Hulten

Which Model should you use?

          False Positive Rate   False Negative Rate
Model 1   41%                   3%
Model 2   5%                    25%

Mistakes have different costs:
- Disease screening: want a LOW FN rate
- Spam filtering: want a LOW FP rate

Conservative vs. aggressive settings: the same application might need multiple tradeoffs.
These are actually the same model, just run at different thresholds.

Classifications and Probability Estimates

Logistic regression produces a score between 0 and 1 (a probability estimate).
A threshold turns that score into a classification.
What happens if you vary the threshold?
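As a concrete illustration, here is a minimal sketch of turning probability estimates into classifications at different thresholds. The scores are the ones from the example on the next slide; the classify helper is made up for this sketch, not part of the original slides.

# Minimal sketch: turning probability estimates into classifications.
scores = [0.25, 0.45, 0.55, 0.67, 0.82, 0.95]

def classify(scores, threshold):
    # Predict 1 when the estimated probability meets or exceeds the threshold.
    return [1 if s >= threshold else 0 for s in scores]

print(classify(scores, 0.5))   # four examples classified as 1
print(classify(scores, 0.9))   # only the highest-scoring example classified as 1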

Example of Changing Thresholds

Six examples, each with a model score, a prediction, and a true label Y; the scores are .25, .45, .55, .67, .82, .95.

Threshold = .6: False Positive Rate 33%, False Negative Rate 0%
Threshold = .7: False Positive Rate 0%, False Negative Rate 33%

Raising the threshold trades false positives for false negatives.
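The rates above can be computed directly from scores and labels. Here is a minimal sketch; the label assignment is assumed for illustration, since the slide's table is not fully recoverable.

# Minimal sketch of computing FP and FN rates at a given threshold.
scores = [0.25, 0.45, 0.55, 0.67, 0.82, 0.95]
labels = [0, 0, 1, 0, 1, 1]   # assumed true labels Y, for illustration only

def fp_fn_rates(scores, labels, threshold):
    predictions = [1 if s >= threshold else 0 for s in scores]
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    false_positives = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    false_negatives = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    return false_positives / negatives, false_negatives / positives

print(fp_fn_rates(scores, labels, 0.7))   # (0.0, 0.333...) with these assumed labels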

ROC Curve (Receiver Operating Characteristic)

Sweep the threshold from 0 to 1 and plot the percent of 0s classified as 1 (false positive rate) against the percent of 1s classified as 0 (false negative rate):
- Threshold 0: everything is classified as 1.
- Threshold 1: everything is classified as 0.

A perfect score is the point where 0% of 1s are called 0 and 0% of 0s are called 1; a model's quality can be read as its curve's distance from perfect.
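A minimal sketch of tracing this curve by sweeping the threshold, again using the assumed example data from above rather than anything taken from the slide's figure.

# Minimal sketch: trace the ROC tradeoff by sweeping the threshold.
scores = [0.25, 0.45, 0.55, 0.67, 0.82, 0.95]
labels = [0, 0, 1, 0, 1, 1]   # assumed labels, as in the previous sketch

def fp_fn_rates(threshold):
    predictions = [1 if s >= threshold else 0 for s in scores]
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    return fp / labels.count(0), fn / labels.count(1)

# Each point is (percent of 0s classified as 1, percent of 1s classified as 0).
curve = [fp_fn_rates(i / 100.0) for i in range(101)]
print(curve[0])    # threshold 0: everything classified as 1 -> (1.0, 0.0)
print(curve[-1])   # threshold 1: everything classified as 0 -> (0.0, 1.0)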

Operating Points

Three common ways to pick a threshold:

1) Target FP Rate: interpolate between the nearest measurements on the curve (e.g., points measured at thresholds .03, .04, and .05); to achieve 30% FPR, use a threshold of ~0.045.
2) Target FN Rate: same approach, but targeting the false negative rate instead.
3) Explicit cost, e.g., an FP costs 10 and an FN costs 1:
   Threshold 0.80: 5 FPs + 60 FNs -> cost 110
   Threshold 0.83: 4 FPs + 65 FNs -> cost 105
   Threshold 0.87: 3 FPs + 90 FNs -> cost 120
   Pick the threshold with the lowest total cost (here 0.83), as sketched below.
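A minimal sketch of the explicit-cost option, using the counts and the cost weights (FP = 10, FN = 1) stated above.

# Minimal sketch: pick the operating point with the lowest total cost.
# (threshold, false positives, false negatives) measurements from the slide.
measurements = [(0.80, 5, 60), (0.83, 4, 65), (0.87, 3, 90)]
FP_COST, FN_COST = 10, 1

def total_cost(fp, fn):
    return FP_COST * fp + FN_COST * fn

best = min(measurements, key=lambda m: total_cost(m[1], m[2]))
print(best[0], total_cost(best[1], best[2]))   # 0.83 with cost 105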

Pattern for using operating points

Slight changes lead to drift:
- Today: threshold .9 -> 60% FNR
- Tomorrow: threshold .9 -> 62% FNR
So update thresholds more often than the model (?).

# Train the model and tune parameters on training and validation data.

# Evaluate the model on extra holdout data, reserved for threshold setting.
(xThreshold, yThreshold) = ReservedData()

# Find the threshold that achieves the operating point on this extra holdout data.
potentialThresholds = {}
for t in [i / 100.0 for i in range(1, 101)]:   # thresholds from 1% to 100%
    potentialThresholds[t] = FindFPRate(model.Predict(xThreshold, t), yThreshold)
bestThreshold = FindClosestThreshold(targetFPRate, potentialThresholds)   # or interpolate; targetFPRate is the application's target

# Evaluate on validation data with the selected threshold to estimate generalization performance.
performanceAtOperatingPoint = FindFNRate(model.Predict(xValidate, bestThreshold), yValidate)

# Make sure nothing went crazy: the FP rate on validation data should be close to
# the FP rate measured on the threshold-setting holdout data.
validationFPRate = FindFPRate(model.Predict(xValidate, bestThreshold), yValidate)
if abs(validationFPRate - potentialThresholds[bestThreshold]) > tolerance:   # tolerance set by the application
    pass   # Problem? Investigate before trusting this threshold.
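The pattern above leans on a few helpers the slide does not define. Here is one possible sketch of them, assuming binary labels in {0, 1} and predictions as 0/1 lists; these implementations are assumptions, not taken from the course material.

# Possible implementations of the helpers used above.

def FindFPRate(predictions, labels):
    # Fraction of true 0s that were classified as 1.
    negatives = [p for p, y in zip(predictions, labels) if y == 0]
    return sum(negatives) / len(negatives)

def FindFNRate(predictions, labels):
    # Fraction of true 1s that were classified as 0.
    positives = [p for p, y in zip(predictions, labels) if y == 1]
    return (len(positives) - sum(positives)) / len(positives)

def FindClosestThreshold(targetFPRate, potentialThresholds):
    # Threshold whose measured FP rate is closest to the target.
    return min(potentialThresholds, key=lambda t: abs(potentialThresholds[t] - targetFPRate))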

Comparing Models with ROC Curves

In the figure, Model 1 (AUC ~0.97) is better than Model 2 (AUC ~0.895) at every FPR or FNR target: at any fixed FNR it has the lower FPR, and at any fixed FPR the lower FNR.

Area Under Curve (AUC):
- Integrate the area under the curve.
- A perfect score is 1; higher scores allow generally better tradeoffs.
- An AUC of 0.5 indicates the model is essentially guessing randomly.
- An AUC below 0.5 indicates you're doing something wrong...
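As a rough sketch of how AUC is computed in practice, scikit-learn's roc_auc_score integrates the standard ROC curve from labels and scores; the data here is the assumed example from earlier, not a real evaluation.

# Minimal sketch: computing AUC from labels and scores with scikit-learn.
from sklearn.metrics import roc_auc_score

labels = [0, 0, 1, 0, 1, 1]   # assumed example data
scores = [0.25, 0.45, 0.55, 0.67, 0.82, 0.95]

print(roc_auc_score(labels, scores))   # about 0.89 here; 1.0 is perfect, 0.5 is random guessing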

More ROC Comparisons

ROC curves can cross (e.g., Model 1 with AUC ~0.97 vs. Model 2 with AUC ~0.935):
- Model 1 is better at low FNR.
- Model 2 is better at low FPR.
- Zoom in on the region of interest.

Model 2 has the worse AUC, but it may be better at what you need...

Precision Recall Curves (PR Curves)

Sweep the threshold from 1 down to 0:
- Threshold near 1: everything classified as 1 really is a 1 (high precision), but many 1s are classified as 0 (low recall).
- As the threshold continues lower, the incremental classifications stay accurate: 95%+ of the newly classified 1s are actually 1s, so precision stays high.
- Sliding the threshold lower still classifies a bunch of 0s as 1s; the incremental classifications become less accurate, since most of the new 1s are actually 0s.
- Near perfect recall: all 1s are classified as 1s, but precision is around 25%.
- Threshold set to 0: everything is classified as 1.

This structure is indicative of a sparse, strong feature.
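A minimal sketch of computing a precision/recall curve from scores, again using the assumed example data rather than anything from the slide's figure.

# Minimal sketch: precision and recall across thresholds with scikit-learn.
from sklearn.metrics import precision_recall_curve

labels = [0, 0, 1, 0, 1, 1]   # assumed example data
scores = [0.25, 0.45, 0.55, 0.67, 0.82, 0.95]

precision, recall, thresholds = precision_recall_curve(labels, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")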

ROC Curves & Operating Points Summary

- Vary the threshold used to turn a score into a classification to vary the tradeoff between types of mistakes.
- ROC curves allow:
  - Visualization of all the tradeoffs a model can make.
  - Comparison of models across types of tradeoffs.
  - AUC, an aggregate measure of quality across tradeoffs.
- Operating points are the specific tradeoff your application needs:
  - Choose the threshold that achieves the target operating point using holdout data.
  - Reset the threshold as things change (more data, different modeling, new features, etc.) to avoid drift.