On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach Author: Steven L. Salzberg Presented by: Zheng Liu.


Introduction Data mining researchers often use classifiers to identify important classes of objects within a data repository. Classification is particularly useful when a database contains examples that can be used as the basis for future decision making; e.g., for assessing credit risks, for medical diagnosis, or for scientific data analysis. Researchers have a range of different types of classification algorithms at their disposal, including nearest neighbor methods, decision tree induction, error back propagation, reinforcement learning, and rule learning.

Introduction Problem: How does one choose which algorithm to use for a new problem? This paper discusses some of the pitfalls that confront anyone trying to answer this question, and demonstrates how misleading results can easily follow from a lack of attention to methodology.

Definitions T-test: assesses whether the means of two groups are statistically different from each other. P-value: the probability of concluding (incorrectly) that there is a difference between samples when no true difference exists; it depends on the statistical test being performed. P = 0.05 means there is a 5% chance of being wrong if you conclude the populations are different. Null hypothesis: the assumption that there is no difference between two (or more) populations.
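A minimal sketch of these terms in practice, using SciPy's two-sample t-test. The accuracy values below are made-up illustrative numbers, not data from the paper.

```python
import numpy as np
from scipy import stats

# Hypothetical cross-validation accuracies of two classifiers on the same task.
acc_a = np.array([0.81, 0.79, 0.84, 0.80, 0.82, 0.78, 0.83, 0.80, 0.81, 0.79])
acc_b = np.array([0.78, 0.77, 0.80, 0.79, 0.78, 0.76, 0.79, 0.77, 0.78, 0.77])

# Null hypothesis: the two populations of accuracies have the same mean.
t_stat, p_value = stats.ttest_ind(acc_a, acc_b)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# If p < 0.05 we would (under this naive procedure) reject the null hypothesis,
# accepting a 5% chance of being wrong when no true difference exists.
```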

Comparing algorithms Classification research, which is a component of data mining as well as a subfield of machine learning, has always needed very specific, focused studies that compare algorithms carefully. According to [1], a high percentage of new algorithms (29%) were not evaluated on any real problem at all, and very few (only 8%) were compared to more than one alternative on real data. [1] Prechelt, L. A quantitative study of experimental evaluations of neural network learning algorithms: Current research practice. Neural Networks, 9, 1996.

Pitfalls Classification research and data mining rely too heavily on stored repositories of data. – It is difficult to produce major new results using well-studied and widely shared data. Comparative studies look easy to conduct but require considerable skill. – A comparative study may propose an entirely new method, but most often it proposes changes to one or more known algorithms and uses comparisons to show where and how the changes improve performance. Although the goal is worthwhile, the approach is sometimes not valid.

Problem 1: Sharing a small repository of datasets Suppose 100 people are studying the effect of algorithms A and B. Even if there is no real difference, about 5 of them can be expected to get results statistically significant at p <= 0.05 (assuming independent experiments). Those results are due to nothing but chance.
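A small simulation of this scenario (a sketch, not from the paper): 100 independent groups each run a t-test comparing A and B when, in truth, the two algorithms perform identically; we count the spurious "significant" results. The accuracy distribution used is an arbitrary assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_groups, n_runs = 100, 10
false_positives = 0

for _ in range(n_groups):
    # Both algorithms draw accuracies from the SAME distribution (no real difference).
    acc_a = rng.normal(loc=0.80, scale=0.02, size=n_runs)
    acc_b = rng.normal(loc=0.80, scale=0.02, size=n_runs)
    _, p = stats.ttest_ind(acc_a, acc_b)
    if p <= 0.05:
        false_positives += 1

print(f"{false_positives} of {n_groups} groups found a 'significant' difference by chance")
# The expected count is about 100 * 0.05 = 5.
```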

Problem 2: Statistics were not designed for computational experiments Example: a comparison of classifier algorithms across 154 datasets. Differences were reported as significant whenever a t-test produced a p-value < 0.05, i.e., the null hypothesis was rejected at the 0.05 level (not a very stringent criterion for this many tests). – With 154 separate tests, the effective significance level is roughly 154 * 0.05 = 7.7, so several spurious "significant" differences are expected by chance alone.

Problem 2: Statistics were not designed for computational experiments Let the significance level for each test be α. The chance of reaching a correct conclusion in one experiment is 1 - α. Assuming the experiments are independent of one another, the chance of getting all n experiments correct is (1 - α)^n, so the chance of at least one incorrect conclusion is 1 - (1 - α)^n. Substituting α = 0.05 and n = 154, the chance of making at least one incorrect conclusion is 1 - (0.95)^154 ≈ 0.9996. To obtain results significant at the 0.05 level with 154 tests, we need 1 - (1 - α)^154 < 0.05, i.e., α < 0.0003. This adjustment is well known as the Bonferroni adjustment.
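A quick sketch of the arithmetic behind this adjustment; the numbers are just the ones used above.

```python
n_tests = 154
alpha = 0.05

# Probability of at least one false positive across all tests (family-wise error),
# assuming the tests are independent and every null hypothesis is true.
fwer = 1 - (1 - alpha) ** n_tests
print(f"Chance of at least one incorrect conclusion: {fwer:.4f}")   # ~0.9996

# Per-test significance level needed so the family-wise error stays below 0.05.
exact = 1 - (1 - 0.05) ** (1 / n_tests)   # solves 1 - (1 - a)^154 = 0.05
bonferroni = 0.05 / n_tests               # the usual (slightly conservative) Bonferroni cut-off
print(f"Exact per-test alpha: {exact:.5f}, Bonferroni alpha: {bonferroni:.5f}")
```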

Problem 3: Experiments are not independent The t-test assumes that the test sets for each algorithm are independent. When two algorithms are compared on the same dataset, the test sets are obviously not independent, since they will share some of the same examples (assuming the training and test sets are created by random partitioning, which is the standard practice). The whole framework of using alpha levels and p-values has been questioned when more than two hypotheses are under consideration [2]. [2] Raftery, A. Bayesian model selection in social research (with discussion by Andrew Gelman, Donald B. Rubin, and Robert M. Hauser). In Peter Marsden, editor, Sociological Methodology 1995, pages 111–196. Blackwells, Oxford, UK, 1995.

Problem 4: Only considering overall accuracy on a test set When a common test set is used to compare two algorithms A and B, the comparison must consider four numbers: the examples where A is right and B is wrong (A > B), where A is wrong and B is right (A < B), where both are right, and where both are wrong. If only two algorithms are compared: – Throw out the ties. – Compare the counts of A > B vs. A < B. If more than two algorithms are compared: – Use Analysis of Variance (ANOVA). – A Bonferroni adjustment for the multiple tests should be applied.
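One way to carry out the "throw out ties, compare A > B vs. A < B" idea is a sign (binomial) test on the disagreements. The sketch below is an assumption about how this could be done, not code from the paper; the prediction arrays are hypothetical, and scipy.stats.binomtest requires SciPy 1.7 or later.

```python
import numpy as np
from scipy import stats

# Hypothetical true labels and predictions of two classifiers on a shared test set.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1])
pred_a = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1])
pred_b = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1])

a_right = pred_a == y_true
b_right = pred_b == y_true

n_a_only = int(np.sum(a_right & ~b_right))   # A correct, B wrong (A > B)
n_b_only = int(np.sum(~a_right & b_right))   # A wrong, B correct (A < B)
# Ties (both right or both wrong) are discarded.

# Under the null hypothesis the disagreements split 50/50 between the two cases.
result = stats.binomtest(n_a_only, n_a_only + n_b_only, p=0.5)
print(f"A-only wins: {n_a_only}, B-only wins: {n_b_only}, p = {result.pvalue:.3f}")
```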

Problem 5: Repeated tuning Many researchers tune their algorithms repeatedly in order to make them perform optimally on some dataset. Tuning includes not only algorithm parameters but also the representation of the data, which may vary from one study to the next even when the same basic dataset is used. Whenever tuning takes place, every adjustment should really be considered a separate experiment. A greater problem occurs when one uses an algorithm that has been used before: the algorithm may already have been tuned on the public databases.

Problem 5: Repeated tuning The recommended procedure is to reserve a portion of the training set as a tuning set, and to repeatedly test the algorithm and adjust its parameters on the tuning set. When the parameters appear to be at their optimal settings, accuracy can finally be measured on the test data.
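A minimal sketch of this tuning-set procedure using scikit-learn. The dataset, the k-nearest-neighbor classifier, and the parameter grid are placeholders chosen for illustration, not choices made in the paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# The test set is put aside and not touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# A portion of the training data is reserved as a tuning set.
X_fit, X_tune, y_fit, y_tune = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_k, best_tune_acc = None, -1.0
for k in (1, 3, 5, 7, 9):          # each setting tried is really a separate experiment
    model = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
    acc = model.score(X_tune, y_tune)
    if acc > best_tune_acc:
        best_k, best_tune_acc = k, acc

# Only after tuning is finished is accuracy measured once on the test data,
# retraining on the full training set with the chosen parameter.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(f"best k = {best_k}, test accuracy = {final.score(X_test, y_test):.3f}")
```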

Problem 6: Generalizing results A common methodological approach in recent comparative studies is to pick several datasets from the UCI repository (or another data source) and perform a series of experiments, measuring classification accuracy, learning rates, and perhaps other criteria. Based on such experiments alone, it is not valid to make general statements about other datasets. The only way this might be valid would be if the UCI repository were known to represent a larger population of classification problems.

A recommended approach 1. Choose other algorithms to include in the comparison; try to include those most similar to the new algorithm. 2. Choose datasets. 3. Divide each dataset into k subsets for cross-validation. – Typically k = 10. – For a small dataset, choose a larger k, since this leaves more examples in the training set.

A recommended approach 4. Run the cross-validation (see the sketch below): – For each of the k subsets of the dataset D, create a training set T = D − k (everything except subset k). – Divide T into T1 (training) and T2 (tuning) subsets. – Once tuning is done, re-run training on all of T. – Finally, measure accuracy on subset k. – Overall accuracy is averaged across all k partitions. 5. Compare the algorithms; when multiple datasets are used, a Bonferroni adjustment should be applied.
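A sketch of steps 3-4 under stated assumptions (scikit-learn, a k-nearest-neighbor classifier, and a placeholder dataset; none of these are prescribed by the paper): 10-fold cross-validation where, within each fold, a tuning subset T2 is split off from the training portion T, a parameter is chosen on T2, the model is retrained on all of T, and accuracy is measured on the held-out fold.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
fold_accuracies = []

for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    X_T, y_T = X[train_idx], y[train_idx]    # training set T = D minus subset k
    X_k, y_k = X[test_idx], y[test_idx]      # held-out subset k

    # Divide T into T1 (training) and T2 (tuning).
    X_T1, X_T2, y_T1, y_T2 = train_test_split(X_T, y_T, test_size=0.25, random_state=0)
    best_k = max((1, 3, 5, 7, 9),
                 key=lambda n: KNeighborsClassifier(n_neighbors=n).fit(X_T1, y_T1).score(X_T2, y_T2))

    # Once tuning is done, retrain on all of T, then measure accuracy on subset k.
    model = KNeighborsClassifier(n_neighbors=best_k).fit(X_T, y_T)
    fold_accuracies.append(model.score(X_k, y_k))

print(f"overall accuracy = {np.mean(fold_accuracies):.3f} ± {np.std(fold_accuracies):.3f}")
```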

Conclusion No single classification algorithm is the best for all problems. Comparative studies typically include at least one new algorithm and several known methods; these studies must be very careful about their methods and their claims.

References [1] Prechelt, L. A quantitative study of experimental evaluations of neural network learning algorithms: Current research practice. Neural Networks, 9, 1996. [2] Raftery, A. Bayesian model selection in social research (with discussion by Andrew Gelman, Donald B. Rubin, and Robert M. Hauser). In Peter Marsden, editor, Sociological Methodology 1995, pages 111–196. Blackwells, Oxford, UK, 1995.