On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Author: Steven L. Salzberg. Presented by Prakash Tilwani, MACS 598, April 25th, 2001

Agenda Introduction Classification basics Definitions Statistical validity Bonferroni Adjustment Statistical Accidents Repeated Tuning A Recommended Approach Conclusion

Introduction Do comparative studies follow proper methodology? Are public databases relied upon too heavily? Are the reported comparison results really correct, or just statistical accidents?

Definitions T-test F-test P-value Null hypothesis

T-test The t-test assesses whether the means of two groups are statistically different from each other. Its statistic is the ratio of the difference between the group means to the variability within the groups.
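
As an illustration (not part of the original slides), here is a minimal sketch of a paired t-test on hypothetical per-fold accuracies of two classifiers, using SciPy; the accuracy values are made up.

```python
# Minimal sketch: paired t-test on hypothetical per-fold accuracies.
from scipy import stats

acc_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
acc_b = [0.78, 0.80, 0.81, 0.77, 0.82, 0.76, 0.80, 0.79, 0.83, 0.77]

# t = mean of the paired differences / standard error of those differences
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```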

F-test The F-test determines whether the variances of two samples are significantly different: its statistic is the ratio of the two sample variances. It is the basis for Analysis of Variance (ANOVA).
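
A small sketch with illustrative data (not from the slides) showing the variance ratio and the one-way ANOVA F statistic with SciPy:

```python
# Illustrative data only: variance-ratio F and one-way ANOVA F.
import numpy as np
from scipy import stats

x = np.array([0.81, 0.79, 0.84, 0.80, 0.83])
y = np.array([0.70, 0.88, 0.75, 0.91, 0.72])

f_ratio = np.var(x, ddof=1) / np.var(y, ddof=1)   # ratio of sample variances
print(f"variance-ratio F = {f_ratio:.3f}")

# ANOVA compares between-group variance to within-group variance.
f_stat, p_value = stats.f_oneway(x, y)
print(f"ANOVA F = {f_stat:.3f}, p = {p_value:.3f}")
```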

p-value The p-value is the probability of concluding (incorrectly) that there is a difference between samples when no true difference exists. It depends on the statistical test being performed. p = 0.05 means there is a 5% chance of being wrong if you conclude that the populations are different.

Null Hypothesis The assumption that there is no difference between two or more populations; any observed difference between the samples is due to chance or sampling error.

Statistical Validity Tests Statistics offers many tests designed to measure the significance of any difference. Adapting them to classifier comparison must be done carefully.

Bonferroni Adjustment – an example A study compared classification algorithms on 154 datasets. Differences were reported as significant whenever a t-test produced a p-value < 0.05, i.e., the null hypothesis was rejected at the (not very stringent) 0.05 level.

Example (cont.) This is not a correct use of the significance test. There were 154 experiments, and therefore 154 chances for a result to appear significant. At the 0.05 level, the expected number of spuriously "significant" results is 154 × 0.05 ≈ 7.7.

Example (cont.) Let the significance level for each test be α. The chance of reaching the right conclusion in one experiment is 1 − α. Assuming the experiments are independent of one another, the chance of getting all n experiments correct is (1 − α)^n, so the chance of at least one incorrect conclusion is 1 − (1 − α)^n.

Example (cont.) Substituting α = 0.05 and n = 154, the chance of at least one incorrect conclusion is 1 − (1 − 0.05)^154 ≈ 0.9996. To obtain results significant at the 0.05 level with 154 tests, we need 1 − (1 − α)^154 < 0.05, i.e., α < 0.0003.
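
The arithmetic above can be checked with a few lines of Python (a sketch, not from the slides):

```python
# Family-wise error over n independent tests at per-test level alpha.
alpha, n = 0.05, 154

p_at_least_one_error = 1 - (1 - alpha) ** n
print(f"1 - (1 - 0.05)^154 = {p_at_least_one_error:.4f}")      # ~0.9996

# Per-test level needed to keep the family-wise error below 0.05.
alpha_needed = 1 - (1 - 0.05) ** (1 / n)
print(f"exact per-test alpha = {alpha_needed:.5f}")            # ~0.00033
print(f"Bonferroni approximation 0.05/154 = {0.05 / n:.5f}")   # ~0.00032
```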

Example - conclusion These are rough calculations, but they provide insight into the problem. Using the wrong p-value threshold leads to incorrect conclusions. Moreover, the t-test is the wrong test overall here, because the training and test sets are not independent.

Simple Recommended Statistical Test When a common test set is used to compare two algorithms (A and B), the comparison must consider 4 numbers: the examples where A is correct and B is wrong (A > B), where B is correct and A is wrong (A < B), where both are correct (A = B), and where both are wrong (~A = ~B).

Simple Recommended Statistical Test (cont.) If only two algorithms are compared: throw out the ties and compare the A > B count against the A < B count (a sign/binomial test on the discordant examples; a sketch follows). If more than two algorithms are compared: use Analysis of Variance (ANOVA), and apply a Bonferroni adjustment for the multiple tests.
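
As a sketch with hypothetical counts (not from the paper), the two-algorithm comparison reduces to a binomial test on the discordant examples, using scipy.stats.binomtest (available in recent SciPy versions):

```python
# Hypothetical counts from a common test set, after throwing out ties:
# n_a = examples A gets right and B gets wrong, n_b = the reverse.
from scipy.stats import binomtest

n_a, n_b = 30, 18

# Under the null hypothesis the discordant examples split 50/50.
result = binomtest(n_a, n=n_a + n_b, p=0.5)
print(f"sign-test p-value = {result.pvalue:.3f}")
```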

Statistical Accidents Suppose 100 people are independently studying the effect of algorithms A and B, and there is no real difference between them. On average, about 5 of them will obtain results statistically significant at p <= 0.05. These results are due to nothing but chance.
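
A small simulation (my own illustration, not from the paper) makes the point: when the two algorithms are in fact identical, roughly 5% of independent comparisons still come out "significant" at the 0.05 level.

```python
# Simulate 100 labs comparing two "algorithms" that are actually identical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
false_positives = 0
for _ in range(100):
    a = rng.normal(0.80, 0.02, size=10)   # accuracies from the same distribution
    b = rng.normal(0.80, 0.02, size=10)
    if stats.ttest_rel(a, b).pvalue <= 0.05:
        false_positives += 1

print(f"'significant' results out of 100: {false_positives}")  # about 5
```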

Repeated Tuning Algorithms are "tuned" repeatedly on the same datasets. Every tuning attempt should be counted as a separate experiment. For example, if 10 tuning experiments were attempted, then the per-test significance level should be 0.005 instead of 0.05.

Repeated Tuning (cont.) The tuning experiments are not independent (they use the same data), so even the Bonferroni adjustment is not very accurate. A greater problem occurs when using an algorithm that has been used before: you may not know how it was originally tuned (one disadvantage of using public databases).

Repeated Tuning – Recommended approach Break the dataset into k disjoint subsets of approximately equal size. Perform k experiments: in each one, hold out one subset, train the system on the remaining data, and test the trained system on the held-out subset.

Repeated Tuning – Recommended approach (cont.) At the end of the k-fold experiment, every sample has been used in a test set exactly once. Advantage: the test sets are independent of one another. Disadvantage: the training sets are clearly not independent. A minimal sketch of the fold construction follows.
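
A minimal sketch of the k-fold split described above (indices only; the fold count and sample count are placeholders):

```python
# Split sample indices into k disjoint, roughly equal folds.
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

folds = k_fold_indices(n_samples=100, k=10)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Train on train_idx, test on test_idx; each sample is tested exactly once.
    print(f"fold {i}: train={len(train_idx)}, test={len(test_idx)}")
```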

A Recommended Approach Choose the other algorithms to include in the comparison; try to include those most similar to the new algorithm. Choose the datasets. Divide each data set into k subsets for cross validation; typically k = 10. For a small data set, choose a larger k, since this leaves more examples in the training set.

A Recommended Approach (cont.) Run a cross-validation: for each of the k subsets of the data set D, create a training set T = D minus that subset. Divide T into T1 (training) and T2 (tuning) subsets. Once tuning is done, rerun training on all of T. Finally, measure accuracy on the held-out subset. Overall accuracy is averaged across all k partitions. A sketch of this protocol appears below.
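
A sketch of that protocol in Python with scikit-learn; the dataset, classifier, and parameter grid are illustrative choices, not prescribed by the paper.

```python
# Illustrative k-fold protocol with an inner train/tune split.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
accuracies = []

for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    X_T, y_T = X[train_idx], y[train_idx]          # T = D minus the held-out subset
    # Divide T into T1 (training) and T2 (tuning).
    X_T1, X_T2, y_T1, y_T2 = train_test_split(X_T, y_T, test_size=0.25, random_state=0)

    # Tune a single parameter using T2 only.
    best_depth, best_score = None, -1.0
    for depth in (1, 2, 3, 5, 8):
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_T1, y_T1)
        score = clf.score(X_T2, y_T2)
        if score > best_score:
            best_depth, best_score = depth, score

    # Once tuning is done, rerun training on all of T and test on the held-out subset.
    model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_T, y_T)
    accuracies.append(model.score(X[test_idx], y[test_idx]))

print(f"mean accuracy over {len(accuracies)} folds: {np.mean(accuracies):.3f}")
```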

A Recommended Approach (cont.) Finally, compare the algorithms. When multiple data sets (and hence multiple tests) are involved, the Bonferroni adjustment should be applied, as sketched below.
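
A simple sketch of the Bonferroni correction over several datasets; the p-values below are made up for illustration.

```python
# Bonferroni: multiply each p-value by the number of tests (capped at 1).
p_values = [0.010, 0.030, 0.004, 0.200, 0.045]
n_tests = len(p_values)

adjusted = [min(p * n_tests, 1.0) for p in p_values]
significant = [p_adj < 0.05 for p_adj in adjusted]
for p, p_adj, sig in zip(p_values, adjusted, significant):
    print(f"p = {p:.3f} -> adjusted = {p_adj:.3f}, significant: {sig}")
```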

Conclusion We do not mean to discourage empirical comparisons, but to provide suggestions for avoiding pitfalls. Statistical tools should be used carefully. Every detail of the experiment should be reported.