CSI 5388: ROC Analysis (based on "ROC Graphs: Notes and Practical Considerations for Data Mining Researchers" by Tom Fawcett, unpublished manuscript, January 2003)


1 CSI 5388: ROC Analysis (based on "ROC Graphs: Notes and Practical Considerations for Data Mining Researchers" by Tom Fawcett, unpublished manuscript, January 2003)

2 Evaluating Classification Systems
There are two issues in the evaluation of classification systems:
- What evaluation measure should we use?
- How can we ensure that the estimate we obtain is reliable?
I will briefly touch upon the second question, but the purpose of this lecture is to discuss the first. In particular, I will introduce concepts in ROC Analysis.

3 How can we ensure that the estimates we obtain are reliable?
There are different ways to address this issue:
- If the data set is very large, it is sometimes sufficient to use the hold-out method (though it is suggested to repeat it several times to confirm the results).
- The most common method is 10-fold cross-validation (sketched below). It often gives a reliable enough estimate.
- For more reliability (but also more computation), the leave-one-out method is sometimes more appropriate.
To assess the reliability of our results, we can report the variance of the results (to notice overlaps) and perform paired t-tests (or other such statistical tests).
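As a minimal illustration (not part of the original slides), here is a sketch of 10-fold cross-validation with scikit-learn; the dataset and classifier are stand-ins for whatever you actually use:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the data set actually at hand.
X, y = make_classification(n_samples=500, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation: ten accuracy estimates, one per held-out fold.
scores = cross_val_score(clf, X, y, cv=10)
print("mean accuracy: %.3f  variance: %.5f" % (scores.mean(), scores.var()))

Reporting the variance alongside the mean, as on the slide, lets us notice when two classifiers' score distributions overlap.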

4 Common Evaluation Measures 1. Confusion Matrix

                                    True Class
                           Positive                Negative
Hypothesized    Yes        True Positives (TP)     False Positives (FP)
Class           No         False Negatives (FN)    True Negatives (TN)
Column Totals:             P                       N

5 Common Evaluation Measures 2. Accuracy, Precision, Recall, etc.
- FP Rate = FP/N (False Alarm Rate)
- Precision = TP/(TP+FP)
- Accuracy = (TP+TN)/(P+N)
- TP Rate = TP/P = Recall = Hit Rate = Sensitivity
- F-Score = 2 * Precision * Recall / (Precision + Recall), the harmonic mean of Precision and Recall (a number of weighted variants are also in use; see the worked example below)
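To make the formulas concrete, here is a small worked computation; the counts are invented for illustration:

# Hypothetical counts read off a confusion matrix.
TP, FP, FN, TN = 63, 28, 37, 72
P, N = TP + FN, FP + TN                 # column totals: actual positives / negatives

tp_rate   = TP / P                      # recall / hit rate / sensitivity: 0.63
fp_rate   = FP / N                      # false alarm rate: 0.28
precision = TP / (TP + FP)              # ~0.692
accuracy  = (TP + TN) / (P + N)         # 0.675
f_score   = 2 * precision * tp_rate / (precision + tp_rate)   # harmonic mean

print(tp_rate, fp_rate, precision, accuracy, f_score)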

6 Common Evaluation Measures 3. Problem with these Measures
- They describe the state of affairs at a fixed point in a larger evaluation space.
- We could get a better grasp of the performance of our learning system if we could judge its behaviour at more than a single point in that space.
- For that, we should consider the ROC space.

7 What does it mean to consider a larger evaluation space? (1)
- Often, classifiers (e.g., Decision Trees, Rule Learning systems) only issue decisions: true or false.
- ⇒ There is no evaluation space to speak of: we can only judge the value of the classifier's decision.
- WRONG!!! Inside these classifiers, there is a continuous measure that gets pitted against a threshold in order for a decision to be made.
- If we could get to that inner process, we could estimate the behaviour of our system in a larger space.

8 What does it mean to consider a larger evaluation space? (2)
- But why do we care about the inner process?
- Well, the classifier's decision relies on two separate processes: 1) the modeling of the data distribution; 2) the decision based on that modeling. Let's take the case where (1) is done very reliably, but (2) is badly done. In that case, we would end up with a bad classifier even though the most difficult part of the job (1) was well done.
- It is useful to separate (1) from (2) so that (1), the most critical part of the process, can be estimated reliably. As well, if necessary, (2) can easily be modified and improved.

9 A Concrete Look at the Issue: The Neural Network Case (1)
- In a Multilayer Perceptron (MLP), we expect the output unit to issue 1 if the example is positive and 0 otherwise.
- However, in practice, this is not what happens. The MLP issues a number between 0 and 1, which the user interprets as 0 or 1.
- Usually, this is done by setting a threshold at 0.5, so that everything above 0.5 is positive and everything below 0.5 is negative.
- However, this may be a bad threshold. Perhaps we would be better off considering a 0.75 threshold or a 0.25 one.

10 A Concrete Look at the Issue: The Neural Network Case (2)
- Note that a 0.75 threshold would amount to decreasing the number of false positives at the expense of false negatives. Conversely, a threshold of 0.25 would amount to decreasing the number of false negatives, this time at the expense of false positives.
- ROC spaces allow us to explore such thresholds on a continuous basis.
- They provide us with two advantages: 1) they can tell us where the best spot for our threshold is (given where our priority lies in terms of sensitivity to one type of error over the other), and 2) they allow us to see graphically the behaviour of our system over the whole range of possible tradeoffs (illustrated in the sketch below).
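A minimal illustration of the tradeoff, using made-up scores and labels (not data from the slides): raising the threshold trades false positives for false negatives.

import numpy as np

# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   1,   0,   1,    0,    1,   0,   0,   0  ])

for t in (0.25, 0.5, 0.75):
    pred = scores >= t                    # threshold the continuous output
    fp = np.sum(pred & (labels == 0))     # negatives called positive
    fn = np.sum(~pred & (labels == 1))    # positives called negative
    print(f"threshold={t}: FP={fp}, FN={fn}")

On this toy data the three thresholds give (FP=3, FN=0), (FP=1, FN=1) and (FP=0, FN=3): the same model, three different error profiles.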

11 A Concrete Look at the Issue: Decision Trees
- Unlike an MLP, a Decision Tree only returns a class label. However, we can ask how this label was computed internally.
- It was computed by considering the proportion of instances of each class at the leaf node the example fell into. The decision simply corresponds to the most prevalent class.
- Rule learners use similar statistics: rule confidences and the confidence of a rule matching an instance.
- There do, however, exist systems whose process cannot be translated into a score. For these systems, a score can be generated from an aggregation process. ⇒ But is that what we really want to do?
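For instance, scikit-learn exposes exactly these leaf proportions: DecisionTreeClassifier.predict_proba returns the class fractions at the leaf an example falls into. A sketch on toy data:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# predict() gives the hard label (most prevalent class at the leaf);
# predict_proba() gives the class proportions at that leaf, i.e. a score.
print(tree.predict(X[:3]))
print(tree.predict_proba(X[:3]))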

12 ROC Analysis
- Now that we see how we can get a score rather than a decision from various classifiers, let's look at ROC Analysis per se.
- Definition: ROC graphs are two-dimensional graphs in which the TP Rate is plotted on the Y axis and the FP Rate is plotted on the X axis. A ROC graph depicts relative tradeoffs between benefits (true positives) and costs (false positives).

13 Points in a ROC Graph (1)
[Figure: ROC space with FP Rate on the x-axis and TP Rate on the y-axis, both running from 0 to 1, showing five labeled classifiers A, B, C, D, E]
Interesting points:
- (0,0): a classifier that never issues a positive classification ⇒ no false positive errors, but no true positives either.
- (1,1): a classifier that never issues a negative classification ⇒ no false negative errors, but no true negatives either.
- (0,1), point D: perfect classification.

14 Points in a ROC Graph (2)
[Figure: the same ROC space, with points A through E]
Interesting points:
- Informally, one point is better than another if it is to the northwest of the other.
- Classifiers appearing on the left-hand side can be thought of as more conservative. Those on the right-hand side are more liberal in their classification of positive examples.
- The diagonal y = x corresponds to random guessing.

15 ROC Curves (1)
- If a classifier issues a discrete outcome, then it corresponds to a single point in ROC space. If it issues a ranking or a score, then, as discussed previously, it needs a threshold to issue a discrete classification.
- In the continuous case, the threshold can be placed at various points. In fact, it can be swept from -∞ to +∞, producing different points in ROC space, which can be connected to trace a curve.
- This can be done as in the algorithm described on the next slide.

16 ROC Curves (2)
L = set of test instances; f(i) = continuous outcome of the classifier for instance i; min and max = smallest and largest values returned by f; increment = smallest difference between any two f values; P and N = total numbers of positive and negative instances.

for t = min to max by increment do
    FP ← 0
    TP ← 0
    for i ∈ L do
        if f(i) ≥ t then
            if i is a positive example then
                TP ← TP + 1
            else
                FP ← FP + 1
    Add point (FP/N, TP/P) to the ROC curve
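A direct (and, as slide 21 notes, inefficient) Python transcription of this pseudocode, assuming scores and 0/1 labels are given as arrays:

import numpy as np

def roc_points(scores, labels):
    """Brute-force ROC points, mirroring the slide's pseudocode."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)                  # 1 = positive, 0 = negative
    P, N = np.sum(labels == 1), np.sum(labels == 0)
    distinct = np.unique(scores)                 # sorted distinct score values
    increment = np.min(np.diff(distinct)) if len(distinct) > 1 else 1.0
    points = []
    for t in np.arange(scores.min(), scores.max() + increment, increment):
        pred = scores >= t                       # classify positive iff f(i) >= t
        points.append((np.sum(pred & (labels == 0)) / N,
                       np.sum(pred & (labels == 1)) / P))
    return points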

17 An example of two curves in ROC space
[Figure: two ROC curves plotted with FP Rate on the x-axis and TP Rate on the y-axis]

18 ROC Curves: A Few Remarks (1)
- ROC curves allow us to make observations about the classifier. See the example in class (also Fig. 3 in Fawcett's paper): the classifier performs better in the conservative region of the space. Its best accuracy corresponds to a threshold of 0.54 rather than 0.5.
- ROC curves are useful in helping find the best threshold, which should not necessarily be set at 0.5. See the example on the next slide.

19 ROC Curves: A Few Remarks (2)
[Table: 10 test instances listing instance number, true class (p p p p p p n n n n), predicted class at threshold 0.5 (y y y y y y y y n n), and each instance's score; the score values did not survive transcription.]
If the threshold is set at 0.5, the classifier will make two errors. Yet, if it is set at 0.7, it will make none. This can clearly be seen on a ROC graph.

20 ROC Curves: Useful Property
- ROC graphs are insensitive to changes in class distribution: if the proportion of positive to negative instances changes in a test set, the ROC curve will not change.
- That is because the TP Rate is calculated using only statistics about the positive class, while the FP Rate is calculated using only statistics about the negative class. The two are never mixed (demonstrated in the sketch below).
- This is important in domains where the distribution of the data changes from, say, month to month or place to place (e.g., fraud detection).
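A quick demonstration of this property on made-up data: duplicating every negative instance changes the class ratio from 1:1 to 1:2, yet leaves both rates untouched.

import numpy as np

def rates(scores, labels, t=0.5):
    """TP Rate and FP Rate at threshold t; labels: 1 = positive, 0 = negative."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred = scores >= t
    tpr = np.sum(pred & (labels == 1)) / np.sum(labels == 1)
    fpr = np.sum(pred & (labels == 0)) / np.sum(labels == 0)
    return float(tpr), float(fpr)

scores = np.array([0.9, 0.7, 0.6, 0.4, 0.8, 0.3, 0.55, 0.2])
labels = np.array([1,   1,   1,   1,   0,   0,   0,    0  ])

# Duplicate the negatives: the class distribution changes, the rates do not.
neg = labels == 0
scores2 = np.concatenate([scores, scores[neg]])
labels2 = np.concatenate([labels, labels[neg]])

assert rates(scores, labels) == rates(scores2, labels2)
print(rates(scores, labels))   # identical rates despite the changed class ratio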

21 Modification to the Algorithm for Creating ROC Curves
- The algorithm on slide 16 is inefficient because it slides to the next point by a constant, fixed increment. We can make the algorithm more efficient by sorting the instances on the classifier's outcomes and processing them dynamically (sketched below).
- As well, we can compute averages in various segments of the curve, or remove all concavities in a ROC curve (please see Section 5 of Fawcett's paper for a discussion of these issues).
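A sketch of the more efficient approach described in Fawcett's paper: sort instances by decreasing score and emit one ROC point per distinct score value, so the whole curve comes out of a single O(n log n) pass.

import numpy as np

def roc_curve_points(scores, labels):
    """One pass over instances sorted by decreasing score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)                # 1 = positive, 0 = negative
    P, N = np.sum(labels == 1), np.sum(labels == 0)
    order = np.argsort(-scores)                # decreasing score
    points, TP, FP, prev = [], 0, 0, None
    for i in order:
        if scores[i] != prev:                  # new threshold => new ROC point
            points.append((FP / N, TP / P))    # first point emitted is (0, 0)
            prev = scores[i]
        if labels[i] == 1:
            TP += 1
        else:
            FP += 1
    points.append((FP / N, TP / P))            # final point is (1, 1)
    return points

Handling tied scores inside one point (rather than one point per instance) keeps the curve honest when many instances share a score.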

22 Area under a ROC Curve (AUC)
- The AUC is a good way to get a score for the general performance of a classifier and to compare it to that of another classifier.
- Two statistical properties of the AUC are worth noting:
  - The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance (checked numerically below).
  - The AUC is closely related to the Gini index (used in CART): Gini + 1 = 2 * AUC.
- Note that for specific questions about the performance of a classifier, the AUC is not sufficient and the ROC curve itself should be looked at; but generally, the AUC is a reliable measure.
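The ranking interpretation can be checked directly: count the fraction of (positive, negative) pairs in which the positive instance scores higher, with ties counting one half. A small sketch on invented data:

import numpy as np

def auc_by_ranking(scores, labels):
    """P(score of a random positive > score of a random negative); ties count 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = 0.0
    for p in pos:
        wins += np.sum(p > neg) + 0.5 * np.sum(p == neg)
    return float(wins / (len(pos) * len(neg)))

scores = np.array([0.9, 0.7, 0.6, 0.4, 0.8, 0.3, 0.55, 0.2])
labels = np.array([1,   1,   1,   1,   0,   0,   0,    0  ])

auc = auc_by_ranking(scores, labels)
print("AUC =", auc, " Gini =", 2 * auc - 1)   # Gini + 1 = 2 * AUC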

23 Averaging ROC Curves
- A single ROC curve is not sufficient to draw conclusions about a classifier, since it corresponds to only a single trial and ignores the second question we asked on slide 2.
- To avoid this problem, we need to average several ROC curves. There are two averaging techniques:
  - Vertical averaging
  - Threshold averaging
- (These will be discussed in class and are covered in Section 7 of Fawcett's paper; a sketch of vertical averaging follows.)
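A minimal sketch of vertical averaging, assuming each curve is given as a pair of arrays (fpr, tpr) sorted by FP rate: fix a grid of FP rates and average the interpolated TP rates across curves.

import numpy as np

def vertical_average(curves, grid=None):
    """curves: list of (fpr, tpr) array pairs, each sorted by fpr."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 11)   # fixed FP-rate sample points
    # Interpolate each curve's TP rate at the grid points, then average.
    tprs = [np.interp(grid, fpr, tpr) for fpr, tpr in curves]
    return grid, np.mean(tprs, axis=0)

# Two hypothetical curves, e.g. from two cross-validation folds.
c1 = (np.array([0.0, 0.2, 0.5, 1.0]), np.array([0.0, 0.6, 0.8, 1.0]))
c2 = (np.array([0.0, 0.1, 0.4, 1.0]), np.array([0.0, 0.4, 0.9, 1.0]))

grid, mean_tpr = vertical_average([c1, c2])
print(np.round(mean_tpr, 3))

Threshold averaging instead fixes a set of score thresholds and averages the (FP rate, TP rate) points the curves produce at each threshold.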

24 Additional Topics (Section 8 of Fawcett's paper)
- The ROC convex hull
- Decision problems with more than two classes
- Combining classifiers
- Alternatives to ROC graphs:
  - DET curves
  - Cost curves
  - The LC index