On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach


On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach — by Steven L. Salzberg, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA. Presenter: Jiyeon Kim (April 14th, 2014)

Introduction How does a researcher choose which classification algorithm to use for a new problem? Comparing the effectiveness of different algorithms on public databases: opportunity or danger? Are the many comparisons that rely on widely shared datasets statistically valid?

Contents
1 Definitions
2 Comparing Algorithms
3 Statistical Validity
4 Conclusions
> Candidate Questions

1 Definitions
Paired T-Test
Hypothesis Testing
Significance Level (α)
P-value

1/1 Paired T-Test A test to determine whether two paired sets of measurements differ from each other in a significant way. Assumption: the paired differences are independent and identically normally distributed.
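A minimal sketch of how such a paired t-test might be run, assuming SciPy is available; the per-fold accuracy values are made-up illustration numbers, not results from the paper.

```python
# Minimal paired t-test sketch (illustrative accuracy values; assumes SciPy).
from scipy import stats

# Accuracies of two classifiers measured on the same 5 cross-validation folds
acc_a = [0.81, 0.79, 0.84, 0.80, 0.82]
acc_b = [0.78, 0.77, 0.80, 0.79, 0.80]

# Paired (related-samples) t-test on the fold-wise differences
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```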

1/2 Hypothesis Testing Null Hypothesis (H0) vs. Alternative Hypothesis (H1). Reject the null hypothesis (H0) if the p-value is less than the significance level. e.g., for the paired t-test: H0: there is no difference between the two populations; H1: there is a statistically significant difference.

1/3 Significance Level, α The fixed probability of wrongly rejecting the null hypothesis H0 when it is in fact true (= P(Type I error)), i.e., the fraction of the time the experimenter is willing to make this error. Usually the significance level is chosen to be 0.05 (equivalently, 5%).

1/4 P-Value The probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. Reject the null hypothesis (H0) when the p-value is less than the chosen significance level, often 0.05.

2 Comparing Algorithms Much of the empirical validation in classification research has serious experimental deficiencies. Be careful when concluding that a new method is significantly better on well-studied datasets.

3 Statistical Validity < Multiplicity Effect > e.g. Assume you run 154 experiments (two-tailed, paired t-tests) at significance level 0.05. Each of the 154 tests has a 5% chance of appearing significant purely by chance, so the expected number of spuriously significant results is 154 * 0.05 = 7.7. Even if no real differences exist, you should expect several "significant" findings!
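A rough simulation of this effect (an illustration added here, not part of the original slides), assuming NumPy and SciPy: both "algorithms" draw scores from the same distribution, so every significant result it reports is spurious.

```python
# Rough simulation of the multiplicity effect (assumes NumPy and SciPy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, alpha = 154, 0.05
false_positives = 0

for _ in range(n_tests):
    # Two "algorithms" whose scores come from the *same* distribution:
    # any significant difference is an accident of chance.
    a = rng.normal(0.80, 0.02, size=10)
    b = rng.normal(0.80, 0.02, size=10)
    if stats.ttest_rel(a, b).pvalue < alpha:
        false_positives += 1

print(f"Spurious 'significant' results: {false_positives} "
      f"(expected about {n_tests * alpha:.1f})")
```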

3 Statistical Validity < Bonferroni Adjustment > Let α* be the significance level (error rate) of each individual experiment. Then (1 - α*) is the chance of reaching the right conclusion in that one experiment. If we conduct n independent experiments, the chance of getting them all right is (1 - α*)ⁿ. So the chance that we make at least one mistake is α = 1 - (1 - α*)ⁿ.

3 Statistical Validity < Bonferroni Adjustment > e.g. (This is not correct usage!!) Assume again that you do 154 experiments (two-tailed, paired t-tests) with significance level 0.05. The significance level for each experiment is α* = 0.05, so the per-experiment chance of a correct conclusion is (1 - α*) = (1 - 0.05) = 0.95. The chance of getting all of them right is (1 - 0.05)^154, so the effective significance level for the whole set of experiments is 1 - (1 - α*)ⁿ = 1 - (1 - 0.05)^154 ≈ 0.9996. Now you have a 99.96% chance of at least one spurious result!!

3 Statistical Validity < Bonferroni Adjustment > ⇒ "Then, what should we do?" e.g. (This is the correct usage!!) Require α = 1 - (1 - α*)^154 ≤ 0.05 in order to obtain results significant at the 0.05 level across all 154 results. This gives α* ≤ 0.0003, which is far more stringent than the original significance level of 0.05!
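The two numbers above can be checked with a few lines of standard-library Python; this is only a sanity-check sketch of the formulas on these slides.

```python
# Family-wise error rate for n independent tests, and the per-test level
# needed to keep it below a target (standard-library sketch).
def family_wise_error(alpha_star, n):
    """Chance of at least one spurious 'significant' result in n tests."""
    return 1 - (1 - alpha_star) ** n

def per_test_level(alpha, n):
    """Largest per-test level keeping the family-wise error rate <= alpha."""
    return 1 - (1 - alpha) ** (1 / n)

print(family_wise_error(0.05, 154))   # ~0.9996: almost certain to see a false positive
print(per_test_level(0.05, 154))      # ~0.00033: the stricter per-test threshold
```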

3 Statistical Validity < Bonferroni Adjustment > / CAVEAT / This argument is very rough, because it assumes that all the experiments are independent of one another!

3/1 Statistical Validity / Alternative Statistical Tests
* Recommended tests:
Simple Binomial Test
ANOVA (Analysis of Variance)
(with Bonferroni Adjustment)

3/1 Statistical Validity / Alternative Statistical Tests To compare two algorithms (A & B), a comparison must consider four numbers:
① The number of examples that A got right and B got wrong ⇒ evidence that A > B
② The number of examples that B got right and A got wrong ⇒ evidence that B > A
③ The number that both algorithms got right
④ The number that both algorithms got wrong

3/1 Statistical Validity / Alternative Statistical Tests Of these four numbers, only ① (A right, B wrong) and ② (B right, A wrong) — the examples on which the algorithms disagree — tell us which algorithm is better ⇒ a simple but much improved way: the Binomial Test! (see the sketch below)
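A sketch of such a binomial (sign) test on the disagreement counts, assuming SciPy 1.7+ for binomtest; the counts themselves are hypothetical.

```python
# Sign/binomial test on the examples where the two algorithms disagree
# (hypothetical counts; assumes SciPy >= 1.7 for binomtest).
from scipy.stats import binomtest

a_right_b_wrong = 30   # (1) A correct, B wrong
b_right_a_wrong = 18   # (2) B correct, A wrong  (ties (3) and (4) are thrown out)

n_disagreements = a_right_b_wrong + b_right_a_wrong
result = binomtest(a_right_b_wrong, n_disagreements, p=0.5, alternative="two-sided")
print(f"p = {result.pvalue:.4f}")   # small p => the algorithms genuinely differ
```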

3/2 Statistical Validity / Community Experiments Even when using strict significance criteria and the appropriate significance tests, some results will still be mere "accidents of chance". The most helpful way to deal with this phenomenon is duplication — replicating the experiments!

3/3 Statistical Validity / Repeated Tuning Algorithms are often tuned repeatedly on the same datasets. Whenever tuning takes place, every adjustment should be considered a separate experiment. e.g., if 10 "tuning" experiments were attempted, the significance level for each should be 0.005 instead of 0.05.

3/3 Statistical Validity / Repeated Tuning < Recommended Approach > To establish the new algorithm's comparative merits:
1. Choose another algorithm, the one most similar to the new one, to include in the comparison
2. Choose a benchmark data set that illustrates the strengths of the new algorithm
3. Divide the data set into k subsets for cross-validation
4. Run the cross-validation (detailed on the next slide)
5. To compare the algorithms, use the appropriate statistical test
(an end-to-end code sketch follows below)
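An end-to-end sketch of this recipe using scikit-learn; the dataset, the two classifiers, k = 10, and the choice of a binomial test on pooled disagreements are all illustrative assumptions, not prescriptions from the paper.

```python
# Sketch of the recommended comparison loop (illustrative choices throughout:
# the dataset, the two classifiers, and k = 10 are assumptions, not the paper's mandate).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from scipy.stats import binomtest

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

a_only, b_only = 0, 0   # examples only A got right / only B got right
for train_idx, test_idx in cv.split(X, y):
    pred_a = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx]).predict(X[test_idx])
    pred_b = KNeighborsClassifier().fit(X[train_idx], y[train_idx]).predict(X[test_idx])
    a_only += int(((pred_a == y[test_idx]) & (pred_b != y[test_idx])).sum())
    b_only += int(((pred_b == y[test_idx]) & (pred_a != y[test_idx])).sum())

# Binomial test on the disagreements only (ties are thrown out)
print(binomtest(a_only, a_only + b_only, p=0.5).pvalue)
```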

3/3 Statistical Validity / Repeated Tuning < Cross-Validation >
(A) For each of the k subsets Dᵢ of the data set D, create a training set T = D − Dᵢ
(B) Divide each training set into two smaller subsets, T1 and T2; T1 is used for training, and T2 for tuning
(C) Once the parameters are optimized, re-run training on the larger set T
(D) Finally, measure accuracy on the held-out subset Dᵢ
(E) Overall accuracy is averaged across all k partitions; these k values also give an estimate of the variance of the algorithms
(a code sketch of these steps follows below)
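A sketch of steps (A)-(E), with the inner T1/T2 tuning split handled here by GridSearchCV's own inner cross-validation rather than a single split; the estimator and parameter grid are hypothetical stand-ins for "the new algorithm".

```python
# Cross-validation with a separate inner tuning step, per steps (A)-(E)
# (hypothetical estimator and parameter grid; assumes scikit-learn and NumPy).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
param_grid = {"max_depth": [2, 4, 8, None]}

fold_accuracies = []
for train_idx, test_idx in outer.split(X, y):                  # (A) T = D - D_i
    # (B)+(C): tune on inner splits of T, then refit on all of T
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid, cv=3, refit=True)
    search.fit(X[train_idx], y[train_idx])
    # (D): measure accuracy on the held-out subset D_i
    fold_accuracies.append(search.score(X[test_idx], y[test_idx]))

# (E): average across the k partitions; the spread estimates the variance
print(np.mean(fold_accuracies), np.std(fold_accuracies))
```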

4 Conclusions No single technique is likely to work best on all databases. Empirical comparisons are important for validating algorithms, but these studies must be designed very carefully! Comparative work should be done in a statistically acceptable framework. The points above are meant to help experimental researchers steer clear of problems when designing a comparative study.

> Exam Questions 1
Q) Why should we apply the Bonferroni Adjustment when comparing classifiers?

> Exam Questions 1
A) With multiple tests, the multiplicity effect occurs if we use the same significance level for each individual test as for the whole family of tests. So we need a more stringent level for each experiment, obtained via the Bonferroni Adjustment.

> Exam Questions 2
Assume that you will do 10 experiments comparing two classification algorithms. Using the Bonferroni Adjustment, determine the criterion for α* (the significance level for each experiment) needed to obtain results that are truly significant at the 0.01 level across the 10 tests.

> Exam Questions 2
α = 1 - (1 - α*)^10 ≤ 0.01
(1 - α*)^10 ≥ 0.99
1 - α* ≥ 0.99^(1/10) ≈ 0.9990
∴ α* ≤ 0.0010

> Exam Questions 3
Q) Explain the difference between the paired t-test and the simple binomial test when comparing two algorithms.

> Exam Questions 3
A) Paired t-test: tests whether the mean difference between the two algorithms' paired results is zero, i.e., whether a difference between the algorithms exists.
Binomial test: compares the number of times algorithm A beats algorithm B versus the number of times B beats A, throwing out the ties.

Thank You.