1 CSI5388 Data Sets: Running Proper Comparative Studies with Large Data Repositories [Based on Salzberg, S.L., 1997 “On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach”]

2 Advantages of Large Data Repositories
The researcher can easily experiment with real-world data sets (rather than artificial data).
New algorithms can be tested in real-world settings.
Since many researchers use the same data sets, the comparison between new and old classifiers is easy.
Problems arising in real-world settings can be identified and focused on.

3 Disadvantages of Large Data Repositories
The Multiplicity Effect: when running large numbers of experiments, more stringent requirements are needed to establish statistical significance than when only a few experiments are considered.
Community Experiments Problem: if many researchers run the same experiments, some of them will, by chance, obtain statistically significant results, and these are the people who publish (even though their results may have been obtained only by chance!).
Repeated Tuning Problem: in order to be valid, all tuning should be done before the test set is known. This is seldom the case.
The Problem of Generalizing Results: it is not necessarily correct to generalize from the UCI Repository to any other data sets.

4 The Multiplicity Effect: An Example
14 different algorithms are compared to a default classifier on 11 data sets.
Differences are reported as significant if a two-tailed, paired t-test produces a p-value smaller than 0.05.
This is not stringent enough: by running 14*11 = 154 experiments, one has 154 chances to reach significance, so the expected number of significant results obtained by chance alone is 154*0.05 = 7.7.
This is not desirable. In such a setting, the acceptable p-value per experiment should be much smaller than 0.05 in order to obtain a true significance level of 0.05.
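
A purely illustrative rendering of that arithmetic in Python (the numbers are the ones quoted above):

```python
n_algorithms, n_datasets, alpha = 14, 11, 0.05
n_experiments = n_algorithms * n_datasets        # 154 chances to reach significance
expected_by_chance = n_experiments * alpha       # 7.7 "significant" results expected by chance alone
print(n_experiments, expected_by_chance)
```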

5 The Multiplicity Effect: More Formally
Let the acceptable significance level for each of our experiments be α*.
Then the chance of drawing the right conclusion for one experiment is 1 - α*.
If we conduct n independent experiments, the chance of getting them all right is (1 - α*)^n.
Suppose that there is no real difference among the algorithms being tested; then the chance that we will make at least one mistake is α = 1 - (1 - α*)^n.
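
As an illustrative worked example (mine, not the slides'), the formula above can be wrapped in a tiny helper; α* = 0.05 and n = 154 match the running example.

```python
def family_wise_error(per_test_alpha, n_experiments):
    """Chance of at least one false positive among n independent experiments,
    assuming no real differences exist: alpha = 1 - (1 - alpha*)^n."""
    return 1.0 - (1.0 - per_test_alpha) ** n_experiments

print(family_wise_error(0.05, 1))    # 0.05    -> a single experiment
print(family_wise_error(0.05, 154))  # ~0.9996 -> the 14 * 11 = 154 experiments of the example
```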

6 The Multiplicity Effect: The Example Continued
We assume that there is no real difference among the 14 algorithms being compared.
If our acceptable significance level were set to α* = 0.05, then the odds of making at least one mistake in our 154 experiments are 1 - (1 - 0.05)^154 ≈ 0.9996.
That is, we are 99.96% certain that at least one of our conclusions will incorrectly reach significance at the 0.05 level.
If we wanted to reach a true significance level of 0.05, we would need to set 1 - (1 - α*)^154 ≤ 0.05, i.e., α* ≤ 0.0003.
This is called the Bonferroni adjustment.
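
A short hedged sketch of the adjustment itself: the usual Bonferroni threshold is α/n, while solving 1 - (1 - α*)^n ≤ 0.05 exactly (the Šidák form) gives almost the same value for the 154 experiments above.

```python
target_alpha = 0.05
n = 154  # 14 algorithms * 11 data sets

bonferroni_alpha = target_alpha / n                    # ~0.000325, simple Bonferroni threshold
exact_alpha = 1.0 - (1.0 - target_alpha) ** (1.0 / n)  # ~0.000333, solves 1 - (1 - a*)^n = 0.05

print(f"Bonferroni per-test alpha: {bonferroni_alpha:.6f}")
print(f"Exact (Sidak) per-test alpha: {exact_alpha:.6f}")
```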

7 Experiment Independence
The Bonferroni adjustment is valid as long as the experiments are independent.
However:
If different algorithms are compared on the same test data, then the tests are not independent.
If the training and testing data are drawn from the same data set, then the experiments are not independent.
In these cases, it is even more likely that the statistical tests will find significance where none exists, even if the Bonferroni adjustment is used.
In conclusion, the t-test, often used by researchers, is the wrong test to use in this particular experimental setting.

8 Alternative Statistical Test to Deal with the Multiplicity Effect I
The first approach suggested as an alternative way to deal with the multiplicity effect is the following.
When comparing only two algorithms, A and B:
Count the number of examples that A got right and B got wrong (A>B), and the number of examples that B got right and A got wrong (B>A).
Compare the proportion of cases where A>B with the proportion where B>A, throwing out the ties.
Use a binomial test (or the McNemar test, which is nearly identical and easier to compute) for the comparison, with the Bonferroni adjustment for multiple tests. (See Salzberg, 1997.)
However, the binomial test does not handle quantitative differences between algorithms or more than two algorithms, and it does not consider the frequency of agreements between the two algorithms.
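
The per-example comparison described above can be sketched as follows; this is an illustration under my own assumptions (hypothetical prediction arrays, SciPy's exact binomial test standing in for the binomial/McNemar test named in the slide), not code from Salzberg.

```python
import numpy as np
from scipy.stats import binomtest

def sign_test(pred_a, pred_b, y_true, alpha=0.05, n_comparisons=1):
    """Exact binomial (sign) test on the examples where A and B disagree.

    Ties (both right or both wrong) are thrown out, as described above.
    n_comparisons applies a Bonferroni adjustment when several such tests are run.
    """
    pred_a, pred_b, y_true = map(np.asarray, (pred_a, pred_b, y_true))
    a_better = int(np.sum((pred_a == y_true) & (pred_b != y_true)))  # A>B
    b_better = int(np.sum((pred_b == y_true) & (pred_a != y_true)))  # B>A
    n_disagreements = a_better + b_better
    if n_disagreements == 0:
        return None  # the two classifiers never disagree
    result = binomtest(a_better, n=n_disagreements, p=0.5, alternative='two-sided')
    adjusted_alpha = alpha / n_comparisons  # Bonferroni adjustment
    return a_better, b_better, result.pvalue, result.pvalue < adjusted_alpha

# Hypothetical usage with made-up labels and predictions:
y      = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])
pred_a = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 1])
pred_b = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
print(sign_test(pred_a, pred_b, y))
```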

9 Alternative Statistical Tests to Deal with the Multiplicity Effect II
The other approaches suggested as alternatives to deal with the multiplicity effect are the following.
Use random, distinct samples of the data to test each algorithm, and use an analysis of variance (ANOVA) to compare the results.
Or use the following randomization-testing approach:
For each trial, the data set is copied and the class labels are replaced with random class labels.
An algorithm is used to find the most accurate classifier it can, using the same methodology that is used with the original data.
Any estimate of accuracy greater than random on the copied data reflects the bias in the methodology, and this reference distribution can then be used to adjust the estimates on the real data.
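
A minimal sketch of the randomization-testing idea, assuming scikit-learn; the choice of decision trees, 100 trials, and cross-validated accuracy are my illustrative assumptions, not details given in the slides.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def randomization_reference(X, y, n_trials=100, cv=10, seed=0):
    """Reference distribution of accuracies on label-shuffled copies of the data.

    Any accuracy above chance on the shuffled copies reflects bias in the
    methodology rather than real signal, and can be used to adjust the
    estimates obtained on the real labels.
    """
    rng = np.random.default_rng(seed)
    reference = []
    for _ in range(n_trials):
        y_shuffled = rng.permutation(y)               # replace labels with random class labels
        clf = DecisionTreeClassifier(random_state=0)  # same methodology as on the original data
        reference.append(cross_val_score(clf, X, y_shuffled, cv=cv).mean())
    return np.array(reference)
```

The accuracy obtained on the real labels can then be compared against this reference distribution (e.g., by reporting the fraction of shuffled runs that match or exceed it) instead of against the nominal chance level.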

10 Community Experiments
The multiplicity effect is not the only problem plaguing the current experimental process.
There is another problem, which can be referred to as the community experiments effect, and it occurs even if all the statistical tests are conducted properly.
Suppose that 100 different people are trying to compare the accuracy of algorithms A and B, which, we assume, have the same mean accuracy on a very large population of data sets.
If these 100 people are studying these algorithms and looking for a significance level of 0.05, then we can expect 5 of them to find, by chance, a significant difference between A and B.
One of these 5 people may publish their results, while the others will move on to different experiments.
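
To make that expectation concrete, here is a hedged simulation sketch (mine, not the slides'): 100 hypothetical groups each run a paired t-test over 20 data-set accuracies drawn from the same distribution for A and B, and on the order of 5 groups reach p < 0.05 purely by chance.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
n_groups, n_datasets = 100, 20          # hypothetical community and sample sizes

false_positives = 0
for _ in range(n_groups):
    # Both "algorithms" draw accuracies from the same distribution:
    # there is no real difference between A and B.
    acc_a = rng.normal(loc=0.80, scale=0.05, size=n_datasets)
    acc_b = rng.normal(loc=0.80, scale=0.05, size=n_datasets)
    _, p = ttest_rel(acc_a, acc_b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_groups} groups found a 'significant' difference")
# The expected value is about 100 * 0.05 = 5, even though A and B are identical.
```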

11 How to Deal with Community Experiments
The way to guard against the community experiments effect is to replicate the results.
Proper replication requires drawing a new random sample from the population and repeating the study.
Unfortunately, since benchmark databases are static and small, it is not possible to draw new random samples of the same data sets.
Using a different partitioning of the data into training and test sets does not help with this problem either.
Should we rely on artificial data sets?

12 Repeated Tuning
Most experiments need tuning. In many cases, the algorithms themselves need tuning, and in most cases, various data representations need to be tried.
If the results of all this tuning are tested on the same data set as the one used to report the final results, then each adjustment should be counted as a separate experiment. E.g., if 10 different combinations of parameters are tried, then a per-experiment significance level of 0.005 would be needed in order to truly reach a level of 0.05.
The solution to this problem is to do all parameter tuning, algorithmic tweaking and so on before seeing the test set. Once a result has been produced from the test set, it is not possible to go back to it.
This is a problem because it makes exploratory research impossible if one wants to report statistically significant results; conversely, it makes reporting statistically significant results impossible if one wants to do exploratory research.
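
A minimal sketch of that discipline, assuming scikit-learn; the data set, estimator, parameter grid, and split sizes are hypothetical choices of mine, and the point is simply that the test set is touched exactly once, after all tuning decisions are frozen.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Set the test set aside before any tuning takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# All parameter tuning happens by cross-validation on the training data only.
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=10)
search.fit(X_train, y_train)

# The test set is used exactly once, to report the final result.
final_accuracy = search.score(X_test, y_test)
print(f"Chosen parameters: {search.best_params_}")
print(f"Held-out test accuracy: {final_accuracy:.3f}")
```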

13 Generalizing Results
It is often believed that if some effect concerning a learning algorithm is shown to hold on a random subset of the UCI data sets, then this effect should also hold on other data sets.
This is not necessarily the case: as shown by Holte (and others), the UCI repository represents only a very limited sample of problems, many of which are easy for a classifier. In other words, the repository is not an unbiased sample of classification problems.
A second problem with too much reliance on community data sets such as the UCI Repository is that, consciously or not, researchers start developing algorithms tailored to those data sets. [E.g., they may develop algorithms for, say, missing data because the repository contains such problems, even if in reality this turns out not to be a very prevalent problem.]

14 A Recommended Approach

15 Two Extra Points regarding Valid Testing
Running many cross-validations on the same data set and reporting each cross-validation as a single trial does not produce valid statistics, because the trials in such a design are highly interdependent.
If one wishes to extend the recommended procedure to several data sets rather than a single one, one should use the Bonferroni adjustment to adjust the significance levels accordingly.
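
A small hedged sketch of that last point: with a hypothetical count of 11 data sets and made-up p-values, the per-data-set threshold is tightened by the Bonferroni adjustment before any result is declared significant.

```python
# Bonferroni adjustment when the same comparison is run on several data sets.
family_alpha = 0.05
n_datasets = 11                       # hypothetical number of data sets
per_test_alpha = family_alpha / n_datasets

# Hypothetical p-values, one per data set, from the chosen test (e.g., McNemar).
p_values = [0.004, 0.030, 0.210, 0.001, 0.600, 0.048, 0.015, 0.330, 0.002, 0.700, 0.090]

significant = [p < per_test_alpha for p in p_values]
print(f"Per-data-set threshold: {per_test_alpha:.4f}")
print(f"Comparisons significant after adjustment: {sum(significant)} of {n_datasets}")
```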