Non-Traditional Metrics: Evaluation measures from the medical diagnostic community; constructing new evaluation measures that combine metric and statistical information.

Presentation transcript:

Non-Traditional Metrics: Evaluation measures from the medical diagnostic community; constructing new evaluation measures that combine metric and statistical information.

Part I: Borrowing new performance evaluation measures from the medical diagnostic community (Marina Sokolova, Nathalie Japkowicz and Stan Szpakowicz)

3 The need to borrow new performance measures: an example. It has come to our attention that the performance measures commonly used in Machine Learning are not very good at assessing performance on problems in which the two classes are equally important. Accuracy considers both classes, but it does not distinguish between them. Other measures, such as Precision/Recall, F-Score and ROC Analysis, focus on only one class, without concern for performance on the other class.

4 Learning Problems in which the classes are equally important. Examples of recent Machine Learning domains that require equal focus on both classes and a distinction between false positive and false negative rates are opinion/sentiment identification and the classification of negotiations. An example of a traditional problem with the same requirements is medical diagnostic testing. What measures have researchers in the Medical Diagnostic Test community used that we can borrow?

5 Performance Measures in use in the Medical Diagnostic Community. Common performance measures in use in the Medical Diagnostic Community are: Sensitivity/Specificity (also in use in Machine Learning), Likelihood Ratios, Youden's Index and Discriminant Power [Biggerstaff, 2000; Blakeley & Oddone, 1995].

6 Sensitivity/Specificity. The sensitivity of a diagnostic test is P[+|D], i.e., the probability of obtaining a positive test result in the diseased population. The specificity of a diagnostic test is P[-|Ď], i.e., the probability of obtaining a negative test result in the disease-free population. Sensitivity and specificity are of limited use on their own, however, since what one is really interested in, both in the medical testing community and in Machine Learning, is P[D|+] (PVP: the Predictive Value of a Positive) and P[Ď|-] (PVN: the Predictive Value of a Negative). We can apply Bayes' Theorem to derive the PVP and PVN.
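
To make these definitions concrete, here is a minimal sketch in Python that computes sensitivity and specificity from the four cells of a binary confusion matrix; the function name and the example counts are our own, purely illustrative choices.

    def sensitivity_specificity(tp, fn, fp, tn):
        """Sensitivity = P[+|D] = TP / (TP + FN); specificity = P[-|D-free] = TN / (TN + FP)."""
        sensitivity = tp / (tp + fn)   # positive test rate within the diseased population
        specificity = tn / (tn + fp)   # negative test rate within the disease-free population
        return sensitivity, specificity

    # Example: 40 true positives, 10 false negatives, 20 false positives, 930 true negatives
    print(sensitivity_specificity(40, 10, 20, 930))   # -> (0.8, ~0.979)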

7 Deriving the PVPs and PVNs. The problem with deriving the PVP and PVN of a test is that we need to know P[D], the pre-test probability of the disease, which cannot be obtained directly. As usual, however, we can set ourselves in the context of a comparison of two tests (with P[D] being the same in both cases). Doing so, and using Bayes' Theorem, P[D|+] = (P[+|D] P[D]) / (P[+|D] P[D] + P[+|Ď] P[Ď]), we get the following relationships (see Biggerstaff, 2000): P[D|+_Y] > P[D|+_X] ↔ ρ+_Y > ρ+_X, and P[Ď|-_Y] > P[Ď|-_X] ↔ ρ-_Y < ρ-_X, where X and Y are two diagnostic tests, +_X and -_X stand for test X confirming the presence and the absence of the disease, respectively (and similarly for +_Y and -_Y), and ρ+ and ρ- are the likelihood ratios defined on the next slide.
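
As an illustration of the Bayes' Theorem step, the sketch below computes the PVP and PVN from a given sensitivity, specificity and an assumed prevalence P[D]; the numbers are hypothetical and only meant to show how strongly the predictive values depend on P[D].

    def pvp_pvn(sensitivity, specificity, prevalence):
        # PVP = P[D|+] = sens*p / (sens*p + (1 - spec)*(1 - p))
        # PVN = P[D-free|-] = spec*(1 - p) / (spec*(1 - p) + (1 - sens)*p)
        p = prevalence
        pvp = sensitivity * p / (sensitivity * p + (1 - specificity) * (1 - p))
        pvn = specificity * (1 - p) / (specificity * (1 - p) + (1 - sensitivity) * p)
        return pvp, pvn

    # With 80% sensitivity, 98% specificity and a 2% prevalence, PVP is only about 0.45:
    print(pvp_pvn(0.80, 0.98, 0.02))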

8 Likelihood Ratios. ρ+ and ρ- are actually easy to derive. The likelihood ratio of a positive test is ρ+ = P[+|D] / P[+|Ď], i.e., the ratio of the true positive rate to the false positive rate. The likelihood ratio of a negative test is ρ- = P[-|D] / P[-|Ď], i.e., the ratio of the false negative rate to the true negative rate. Note: we want to maximize ρ+ and minimize ρ-. This means that, even though we cannot calculate the PVP and PVN directly, we can get the information we need to compare two tests through the likelihood ratios.
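
A minimal sketch of the two likelihood ratios, reusing the same hypothetical confusion-matrix counts as in the earlier snippet:

    def likelihood_ratios(tp, fn, fp, tn):
        # rho+ = P[+|D] / P[+|D-free] = TPR / FPR  (to be maximized)
        # rho- = P[-|D] / P[-|D-free] = FNR / TNR  (to be minimized)
        tpr = tp / (tp + fn)
        fpr = fp / (fp + tn)
        fnr = fn / (tp + fn)
        tnr = tn / (fp + tn)
        return tpr / fpr, fnr / tnr

    print(likelihood_ratios(40, 10, 20, 930))   # -> (rho+ ~ 38.0, rho- ~ 0.20)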

9 Youden's Index and Discriminant Power. Youden's Index measures the avoidance of failure of an algorithm, while Discriminant Power evaluates how well an algorithm distinguishes between positive and negative examples. Youden's Index: γ = sensitivity – (1 – specificity) = P[+|D] – (1 – P[-|Ď]). Discriminant Power: DP = (√3/π)(log X + log Y), where X = sensitivity/(1 – sensitivity) and Y = specificity/(1 – specificity).
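
Both measures translate directly into code. The sketch below assumes sensitivity and specificity lie strictly between 0 and 1 (the odds X and Y are undefined at the boundaries); the DP-below-3 rule of thumb mentioned on the next slide is included as a comment.

    import math

    def youden_index(sensitivity, specificity):
        # gamma = sensitivity - (1 - specificity)
        return sensitivity - (1 - specificity)

    def discriminant_power(sensitivity, specificity):
        # DP = (sqrt(3)/pi) * (log X + log Y), with X = sens/(1-sens), Y = spec/(1-spec);
        # a DP below 3 is considered insignificant
        x = sensitivity / (1 - sensitivity)
        y = specificity / (1 - specificity)
        return (math.sqrt(3) / math.pi) * (math.log(x) + math.log(y))

    print(youden_index(0.8, 0.979), discriminant_power(0.8, 0.979))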

10 Comparison of the various measures on the outcome of e-negotiation. (Table comparing SVM and Naïve Bayes on Accuracy, F-Score, Sensitivity, Specificity, AUC, Youden's Index, Positive Likelihood Ratio, Negative Likelihood Ratio and Discriminant Power; the numeric entries did not survive transcription.) Note: a DP below 3 is insignificant.

11 What does this all mean? Traditional ML measures:

    Classifier | Overall effectiveness (Accuracy) and predictive power (Precision) | Effectiveness on a class, a posteriori (Sensitivity/Specificity)
    SVM        | Superior                                                           | Superior on positive examples
    NB         | Inferior                                                           | Superior on negative examples

12 What does this all mean? New measures that are more appropriate for problems where both classes are equally important:

    Classifier | Avoidance of failure (Youden) | Effectiveness on a class, a priori (Likelihood Ratios) | Discrimination of classes (Discriminant Power)
    SVM        | Inferior                      | Superior on negative examples                          | Limited
    NB         | Superior                      | Superior on positive examples                          | Limited

13 Part I: Discussion. The variety of results obtained with the different measures suggests two conclusions: 1. It is very important for practitioners of Machine Learning to understand their domain deeply, to understand what it is, exactly, that they want to evaluate, and to reach their goal using appropriate measures (existing or new ones). 2. Since some of the results are very close to each other, it is important to establish reliable confidence tests to find out whether or not these results are significant.

14 Part II: Constructing new evaluation measures (William Elamzeh, Nathalie Japkowicz and Stan Matwin)

15 Motivation for our new evaluation method. ROC Analysis alone and its associated AUC measure do not assess the performance of classifiers adequately, since they omit any information regarding the confidence of these estimates. Though identifying the significant portion of a ROC curve is an important step towards a more useful assessment, this analysis remains biased in favour of the large class in the case of severe imbalances. We would like to combine the information provided by the ROC curve with information regarding how balanced the classifier is with regard to the misclassification of positive and negative examples.

16 ROC's bias in the case of severe class imbalances. The ROC curve for the positive class plots the true positive rate a/(a+b) against the false positive rate c/(c+d), with a, b, c and d taken from the confusion matrix below. When the number of positive examples is significantly lower than the number of negative examples (a+b << c+d), then as we change the class probability threshold, a/(a+b) climbs faster than c/(c+d), so ROC gives the majority class (-) an unfair advantage. Ideally, a classifier should classify both classes proportionally.

    Confusion matrix:
              Pred+   Pred-   Total
    Class+    a       b       a+b
    Class-    c       d       c+d
    Total     a+c     b+d     n
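
A small numeric illustration of the bias, using made-up counts of our own: with 20 positives and 980 negatives, each batch of 10 extra examples from either class that gets predicted positive moves the ROC point far upward but barely to the right.

    pos, neg = 20, 980                        # a+b and c+d: a severe imbalance
    for extra_tp, extra_fp in [(0, 0), (10, 10), (20, 20)]:
        tpr = extra_tp / pos                  # a / (a+b)
        fpr = extra_fp / neg                  # c / (c+d)
        print(f"TPR={tpr:.3f}  FPR={fpr:.3f}")
    # TPR jumps by 0.5 per step while FPR grows by only ~0.01, so the curve hugs
    # the top-left corner even when the minority class is handled poorly.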

17 Correcting for ROC's bias in the case of severe class imbalances. Though we keep ROC as a performance evaluation measure, since rate information is useful, for confidence estimation we propose to favour classifiers that make a similar number of errors in both classes. More specifically, as in the Tango test, we favour classifiers that have a lower difference in classification errors between the two classes, (b-c)/n (with b, c and n as in the confusion matrix on the previous slide). This quantity (b-c)/n is interesting not just for confidence estimation, but also as an evaluation measure in its own right.
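
The quantity itself is trivial to compute; a sketch using the confusion-matrix cell names from the slide (the example numbers are hypothetical):

    def normalized_error_difference(b, c, n):
        # (b - c) / n: false negatives minus false positives, normalized by the
        # total number of examples; values near 0 mean the errors are spread
        # evenly across the two classes
        return (b - c) / n

    print(normalized_error_difference(b=30, c=5, n=1000))   # -> 0.025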

18 Proposed Evaluation Method for Severely Imbalanced Data Sets. Our method consists of five steps: (1) generate a ROC curve R for a classifier K applied to data D; (2) apply Tango's confidence test in order to identify the confident segments of R; (3) compute CAUC, the area under the confident ROC segment; (4) compute AveD, the average normalized difference (b-c)/n over all points in the confident ROC segment; (5) plot CAUC against AveD. An effective classifier shows low AveD and high CAUC. A code sketch of these steps follows below.
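
A sketch of the pipeline in Python, under explicit assumptions: it uses scikit-learn's roc_curve and auc, and it replaces Tango's test proper with a simple normal-approximation confidence criterion on (b-c)/n to flag the "confident" threshold points. The function name evaluate, the variable names and that simplified criterion are ours, not the authors' implementation.

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    def evaluate(y_true, scores, z=1.96):
        """Return (CAUC, AveD) for one classifier on one data set (simplified sketch)."""
        fpr, tpr, _ = roc_curve(y_true, scores)                # step 1: ROC curve
        n_pos, n_neg = np.sum(y_true == 1), np.sum(y_true == 0)
        n = n_pos + n_neg
        b = (1 - tpr) * n_pos                                  # false negatives at each threshold
        c = fpr * n_neg                                        # false positives at each threshold
        d = (b - c) / n                                        # normalized error difference
        # step 2 (stand-in for Tango's test): keep threshold points whose
        # normal-approximation confidence interval for (b - c)/n contains 0
        se = np.sqrt((b + c) / n - d ** 2) / np.sqrt(n)
        confident = np.abs(d) <= z * se
        if confident.sum() < 2:
            return 0.0, float("nan")                           # no confident segment found
        cauc = auc(fpr[confident], tpr[confident])             # step 3: area under confident segment
        ave_d = float(np.mean(np.abs(d[confident])))           # step 4: average |b - c| / n
        return cauc, ave_d                                     # step 5: plot CAUC against AveD

    # Toy usage on a synthetic, severely imbalanced sample (about 5% positives):
    rng = np.random.default_rng(0)
    y = (rng.random(1000) < 0.05).astype(int)
    s = rng.random(1000) + 0.3 * y
    print(evaluate(y, s))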

19 Experiments and Expected Results. We considered 6 imbalanced domains from UCI. The most imbalanced one contained only 1.4% of its examples in the small class, while the least imbalanced one had as many as 26%. We ran 4 classifiers: Decision Stumps, Decision Trees, Random Forests and Naïve Bayes. We expected the following results: weak performance from the Decision Stumps, stronger performance from the Decision Trees, and even stronger performance from the Random Forests (these three belong to the same family of learners); we expected Naïve Bayes to perform reasonably well, but had no idea of how it would compare to the tree family of learners.

20 Results using our new method: our expectations are met. Decision Stumps perform the worst, followed by Decision Trees and then Random Forests (in most cases). Surprise 1: Decision Trees outperform Random Forests on the two most balanced data sets. Surprise 2: Naïve Bayes consistently outperforms Random Forests. Note: in the CAUC-versus-AveD plots, classifiers in the top left corner outperform those in the bottom right corner.

21 AUC Results. Our more informed results contradict the AUC results, which claim that: Decision Stumps are sometimes as good as or superior to Decision Trees (!), and Random Forests outperform all other systems in all but one case.

22 Part II: Discussion. In order to better understand the performance of classifiers on various domains, it can be useful to consider several aspects of the evaluation simultaneously. In order to do so, it might be useful to create specific measures adapted to the purpose of the evaluation. In our case, our evaluation measure allowed us to study the tradeoff between the classification-error difference and the area under the confident segment of the ROC curve, thus producing more reliable results.