1 CSI5388 Practical Recommendations

2 Context for our Recommendations I
This discussion will take place in the context of the following three questions:
- I have created a new classifier for a specific problem. How does it compare to other existing classifiers on this particular problem?
- I have designed a new classifier. How does it compare to existing classifiers on benchmark data?
- How do various classifiers fare on benchmark data or on a single new problem?

3 Context for our Recommendations II
These three questions can be translated into four different situations:
- Situation 1: Comparison of a new classifier to generic ones for a specific problem
- Situation 2: Comparison of a new classifier to generic ones on generic problems
- Situation 3: Comparison of generic classifiers on generic domains
- Situation 4: Comparison of generic classifiers on a specific problem

4 Selecting learning algorithms I
The general strategy is to select classifiers that are likely to succeed on the task at hand.
- Situation 1: Select generic classifiers with a good chance of success at the particular task.
  - E.g., for a high-dimensionality problem, use an SVM as a generic classifier.
  - E.g., for a class-imbalanced problem, pair a generic classifier with a resampling method such as SMOTE, etc.
- Situation 2: Differs from Situation 1 in that no specific problem is targeted, so choose generic classifiers that are generally accurate and stable across domains.
  - E.g., Random Forests, SVMs, Bagging.

5 Selecting learning algorithms II
- Situation 3: Differs from Situations 1 and 2. This time, we are interested in finding the strengths and weaknesses of various algorithms on different problems, so select various well-known and widely used algorithms, not necessarily the best algorithms overall.
  - E.g., Decision Trees, Neural Networks, Naïve Bayes, Nearest Neighbours, SVMs, etc.
- Situation 4: Reduces either to Situation 1, where what matters is the search for an optimal classifier, or to Situation 3, where the purpose is of a more general and scientific nature.
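The recommendations in the two slides above can be summarised as a small lookup table. This is only an illustrative sketch: the situation numbers follow these slides, and the algorithm lists simply restate the slides' examples rather than prescribing a definitive set.

```python
# Sketch: mapping each comparison situation to candidate algorithm families.
# The lists restate the examples from the slides and are not exhaustive.

CANDIDATES = {
    1: ["SVM", "SMOTE + base classifier"],            # matched to the task's traits
    2: ["Random Forest", "SVM", "Bagging"],           # accurate and stable overall
    3: ["Decision Tree", "Neural Network", "Naive Bayes",
        "Nearest Neighbours", "SVM"],                 # well-known, widely used
}

def select_candidates(situation):
    """Return candidate algorithms for a given comparison situation.

    Situation 4 reduces to Situation 1 (optimising for one problem)
    or Situation 3 (a general scientific comparison); this sketch
    defaults it to the Situation 1 list.
    """
    if situation == 4:
        situation = 1
    return CANDIDATES[situation]
```

In practice the table entries would be refined per study; the point is only that the choice of algorithms should be driven by which of the four situations applies.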

6 Selecting Data Sets I
The selection of data sets differs between Situations 1 and 4 on the one hand and Situations 2 and 3 on the other.
- Situations 1 and 4: We distinguish between two cases:
  - Case 1: There is just one data set of interest. Just use this data set.
  - Case 2: We are considering a class of data sets (e.g., data sets for text categorization). In this case, we should proceed as in Situations 2 and 3, since data sets in the same class can have different characteristics (e.g., noise, class imbalance, etc.). The only difference is that the domains in this class will be more closely related than those in a wider study of the kind considered in Situations 2 and 3.

7 Selecting Data Sets II
- Situations 2 and 3: The first thing we need to do is determine the exact purpose of the study.
  - Case 1: To test a specific characteristic of a new algorithm or of various algorithms (e.g., their resilience to noise). Select domains presenting that characteristic.
  - Case 2: To test the general performance of a new algorithm or of various algorithms on a variety of domains with different characteristics. Select varied domains, but be careful about how you report the results: there may be a lot of variance from classifier to classifier and from one type of domain to another. It is best to cluster the kinds of domains on which classifiers excel or do poorly and report the results on a cluster-by-cluster basis.
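The cluster-by-cluster reporting suggested in Case 2 can be sketched in a few lines. The domain characteristics and accuracy figures below are invented purely for illustration.

```python
# Sketch of per-cluster reporting: group accuracy scores by a domain
# characteristic (here an assumed "noisy"/"imbalanced"/"clean" label)
# and report a mean per group instead of one global average.
from collections import defaultdict
from statistics import mean

# Hypothetical results: (domain characteristic, accuracy of one classifier)
results = [
    ("noisy", 0.71), ("noisy", 0.69),
    ("imbalanced", 0.62), ("imbalanced", 0.66),
    ("clean", 0.90), ("clean", 0.88),
]

def report_by_cluster(results):
    """Group accuracies by domain characteristic and average per cluster."""
    clusters = defaultdict(list)
    for characteristic, acc in results:
        clusters[characteristic].append(acc)
    return {c: round(mean(accs), 3) for c, accs in clusters.items()}
```

Reporting per cluster makes it visible that a classifier which looks mediocre on a global average may in fact excel on clean domains while failing on imbalanced ones.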

8 Selecting Data Sets III
Situations 2 and 3 (cont'd): Three questions remain:
- Question 1: How many data sets are necessary or desirable?
- Question 2: Where can we get these data sets?
- Question 3: How do we select data sets from those available?

9 Selecting Data Sets IV
- Situations 2 & 3: How many data sets? The number of domains necessary depends on the variance in the performance of the classifiers. As a rule of thumb, 3 to 5 domains within the same category of domains are desirable to begin with. Note: as domains get added, the multiplicity effect raised by [Salzberg, 1997] and [Jensen, 2001] should be considered.
- Situations 2 & 3: Where can we get these data sets?
  - The UCI Repository for machine learning or other repositories (but the collections may not be representative of reality).
  - Directly from the Web (but gathering and cleaning a data collection is extremely time-consuming).
  - Artificial data sets (easy to build, unlimited in size, but often too far removed from reality).
  - Real-world-inspired artificial data, i.e., real-world data sets artificially augmented (easy to build, closer to reality).
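The last option, real-world-inspired artificial data, can be sketched as follows. The feature values, noise level, and jittering scheme here are assumptions for illustration only, not a prescribed augmentation method.

```python
# Sketch: building "real-world inspired" artificial data by jittering the
# numeric features of an existing (here, made-up) data set. Each row is
# (feature_list, label); labels are kept unchanged.
import random

def augment(rows, noise=0.05, copies=3, seed=42):
    """Return the original rows plus `copies` jittered copies of each row."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    out = list(rows)
    for _ in range(copies):
        for features, label in rows:
            jittered = [x + rng.gauss(0, noise) for x in features]
            out.append((jittered, label))
    return out

# Two hypothetical examples with two numeric features each.
original = [([5.1, 3.5], "a"), ([6.2, 2.9], "b")]
augmented = augment(original)
```

The appeal of this approach, as the slide notes, is that the augmented set stays close to the real distribution while being as large as needed.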

10 Selecting Data Sets V
Situations 2 & 3: How do we select data sets from those available?
- Select all those that are available and that meet the constraints of the algorithms under study. For example, the UCI repository contains many data sets, but only a subset of these are multi-class, only a subset has nominal attributes only, only a subset has no missing values, and so on.
- To increase the number of domains available to researchers or practitioners of Data Mining, some amendments can be made to the data sets so that as many of them as possible conform to the requirements of the classifiers.
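Filtering a repository down to the data sets that meet an algorithm's constraints can be sketched like this. The catalogue entries and property names are hypothetical stand-ins for real repository metadata.

```python
# Sketch: selecting, from a catalogue of data sets, only those that meet an
# algorithm's constraints (multi-class, nominal-only attributes, no missing
# values). The three entries below are invented metadata, not real UCI stats.

datasets = [
    {"name": "iris",     "n_classes": 3, "nominal_only": False, "missing": False},
    {"name": "vote",     "n_classes": 2, "nominal_only": True,  "missing": True},
    {"name": "mushroom", "n_classes": 2, "nominal_only": True,  "missing": False},
]

def usable(ds, need_multiclass=False, need_nominal_only=False, allow_missing=True):
    """Return True if the data set satisfies the stated constraints."""
    if need_multiclass and ds["n_classes"] <= 2:
        return False
    if need_nominal_only and not ds["nominal_only"]:
        return False
    if not allow_missing and ds["missing"]:
        return False
    return True

# Data sets usable by a classifier requiring nominal-only, complete data:
nominal_no_missing = [d["name"] for d in datasets
                      if usable(d, need_nominal_only=True, allow_missing=False)]
```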

11 Selecting performance measures
- Situations 2 and 3: Caruana and Niculescu-Mizil (2004) suggest that the root mean squared error (RMSE) is the best general-purpose measure, since it is the one best correlated with the eight other measures they use. Researchers are, however, encouraged to use a variety of metrics in order to discover the specific strengths and shortcomings of each classifier on each domain.
- Situations 1 and 4: We distinguish between the following cases:
  - Balanced versus imbalanced domains: ROC.
  - Certainty of the decision matters: B & K.
  - All the classes matter: RMSE.
  - The problem is binary but one class matters more than the other: Precision, Recall, F-measure, Sensitivity, Specificity, Likelihood Ratios.
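A few of the measures above can be computed from scratch for a binary problem (positive class = 1). The prediction vectors below are invented for illustration.

```python
# Sketch: precision, recall, F-measure, and RMSE computed from scratch.
import math

def prf(y_true, y_pred):
    """Precision, recall, and F-measure for hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def rmse(y_true, y_prob):
    """Root mean squared error between 0/1 labels and predicted probabilities."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true))

# Hypothetical labels and hard predictions for six test examples.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
```

Note that `prf` scores hard predictions while `rmse` scores probability estimates, which is why RMSE can capture the "certainty of the decision" aspect that accuracy-style measures miss.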

12 Selecting an error estimation method and statistical test I
- If the data set is large enough (every testing set contains at least 30 examples) and the statistic of interest has an associated parametric test: cross-validation can be tried (but see the next slide).
- If the data set is particularly small, i.e., if some of the testing sets contain fewer than 30 examples: Bootstrapping or Randomization.
- If the statistic of interest has no statistical test associated with it: Bootstrapping or Randomization.
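A minimal bootstrap sketch for the small-data case, assuming all we have is the per-example correct/incorrect outcome on a small test set: resampling the outcomes with replacement gives a rough picture of the spread of the accuracy estimate. The outcome vector is invented for illustration.

```python
# Sketch: percentile bootstrap for accuracy on a small test set.
import random
from statistics import mean

def bootstrap_accuracy(correct, n_boot=1000, seed=0):
    """Return an approximate 95% percentile interval for accuracy.

    correct: list of 0/1 outcomes (1 = classified correctly).
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    n = len(correct)
    accs = []
    for _ in range(n_boot):
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        accs.append(mean(resample))
    accs.sort()
    return accs[int(0.025 * n_boot)], accs[int(0.975 * n_boot)]

# Hypothetical outcomes: 70% accuracy on only 20 test examples.
correct = [1] * 14 + [0] * 6
low, high = bootstrap_accuracy(correct)
```

The width of the interval makes the small-sample uncertainty explicit, which a single accuracy number on 20 examples would hide.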

13 Selecting an error estimation method and statistical test II
Question: How can one tell whether cross-validation is appropriate for one's purposes? Two ways:
- Visual: plot the distribution and check its shape.
- Apply a hypothesis test designed to see whether the distribution is normal (e.g., chi-squared goodness of fit, Kolmogorov-Smirnov goodness of fit, etc.).
Since no practical distribution will be exactly normal, we must also look into the robustness of the various statistical methods considered. The t-test is quite robust to violations of the normality assumption. If the distribution is far from normal, non-parametric tests must be used.
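The second check can be sketched as a rough Kolmogorov-Smirnov statistic against a normal distribution with plug-in mean and standard deviation. Note the caveat baked into this sketch: estimating the parameters from the same sample makes the test approximate, and exact critical values would require a Lilliefors-style correction.

```python
# Sketch: one-sample KS statistic against a fitted normal distribution.
import math
from statistics import mean, stdev

def normal_cdf(x, mu, sigma):
    """CDF of the normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def ks_statistic(sample):
    """Max distance between the empirical CDF and the fitted normal CDF."""
    xs = sorted(sample)
    mu, sigma = mean(xs), stdev(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mu, sigma)
        d = max(d, abs((i + 1) / n - cdf), abs(cdf - i / n))
    return d

# Rule of thumb: for large n, reject normality at the 5% level roughly
# when d > 1.36 / sqrt(n) (before any Lilliefors correction).
```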

14 Selecting an error estimation method and statistical test III
- The robustness of a procedure is important, since it ensures that the reported significance level is close to the true one.
- However, robustness does not answer the question of whether efficient use is made of the data so that a false null hypothesis can be rejected. Power should therefore also be considered.
- The power of a test depends on some intrinsic properties of that test, but also on the shape and size of the population to which it is applied.
- Example: parametric tests based on the normality assumption are generally as powerful as or more powerful than non-parametric tests based on ranks when the distribution has lighter tails than the normal distribution.

15 Selecting an error estimation method and statistical test IV But: Parametric tests based on the normality assumption are less powerful than non- parametric ones in the case where the tails of the distribution are heavier than those of the normal distribution (An important kind of data presenting such distributions are data containing outliers). But: Parametric tests based on the normality assumption are less powerful than non- parametric ones in the case where the tails of the distribution are heavier than those of the normal distribution (An important kind of data presenting such distributions are data containing outliers). Note that the relative power of parametric and non-parametric tests does not change as a function of sample size, even if a test is asymptotically distribution free (i.e., if it becomes more and more robust as the sample size increases). Note that the relative power of parametric and non-parametric tests does not change as a function of sample size, even if a test is asymptotically distribution free (i.e., if it becomes more and more robust as the sample size increases).