1 CSI5388 Practical Recommendations

2 Context for our Recommendations I
This discussion will take place in the context of the following three questions:
- I have created a new classifier for a specific problem. How does it compare to other existing classifiers on this particular problem?
- I have designed a new classifier. How does it compare to existing classifiers on benchmark data?
- How do various classifiers fare on benchmark data or on a single new problem?

3 Context for our Recommendations II
These three questions can be translated into four different situations:
- Situation 1: Comparison of a new classifier to generic ones on a specific problem
- Situation 2: Comparison of a new classifier to generic ones on generic problems
- Situation 3: Comparison of generic classifiers on generic domains
- Situation 4: Comparison of generic classifiers on a specific problem

4 Selecting learning algorithms I
The general strategy is to select classifiers that are more likely to succeed on the task at hand.
- Situation 1: Select generic classifiers with a good chance of success at the particular task.
  - E.g., for a high-dimensionality problem, use an SVM as the generic classifier.
  - E.g., for a class-imbalanced problem, combine a generic classifier with SMOTE oversampling.
- Situation 2: Differs from Situation 1 in that no specific problem is targeted. So, choose generic classifiers that are generally accurate and stable across domains.
  - E.g., Random Forests, SVMs, Bagging.
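For the class-imbalance case, SMOTE (Chawla et al.) creates synthetic minority examples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a minimal NumPy sketch of that idea, not the reference implementation (in practice one would use a library such as imbalanced-learn); the function name and toy data are illustrative.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE-style oversampling: interpolate between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                       # exclude the sample itself
        neighbours = np.argsort(d)[:k]      # indices of k nearest neighbours
        j = rng.choice(neighbours)
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Toy minority class: 5 points in 2-D
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_oversample(X_min, n_new=10, rng=0)
print(X_new.shape)  # (10, 2)
```

Every synthetic point lies on a segment between two real minority points, so the new examples stay inside the region the minority class already occupies.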

5 Selecting learning algorithms II
- Situation 3: Differs from Situations 1 and 2. This time, we are interested in finding the strengths and weaknesses of various algorithms on different problems. So, select various well-known and well-used algorithms, not necessarily the best algorithms overall.
  - E.g., Decision Trees, Neural Networks, Naïve Bayes, Nearest Neighbours, SVMs, etc.
- Situation 4: Reduces either to Situation 1, where what matters is the search for an optimal classifier, or to Situation 3, where the purpose is of a more general and scientific nature.
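A Situation-3-style comparison of several well-known algorithms on one domain might be harnessed as below; this is a sketch using scikit-learn defaults (the dataset and cross-validation settings are placeholders, and real studies would tune hyperparameters and use the statistical tests discussed later).

```python
# Score several standard classifiers on one dataset with cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "nearest neighbours": KNeighborsClassifier(),
    "SVM": SVC(),
}
# Mean accuracy over 5 stratified folds, per classifier
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in classifiers.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```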

6 Selecting Data Sets I
The selection of data sets differs between Situations 1 and 4 and Situations 2 and 3.
Situations 1 and 4: We distinguish between two cases:
- Case 1: There is just one data set of interest. Just use this data set.
- Case 2: We are considering a class of data sets (e.g., data sets for text categorization). In this case, we should look to Situations 2 and 3, since data sets in the same class can have different characteristics (e.g., noise, class imbalance, etc.). The only difference is that the domains in this class will be more closely related than those in a wider study of the kind considered in Situations 2 and 3.

7 Selecting Data Sets II
Situations 2 and 3: The first thing to do is determine the exact purpose of the study.
- Case 1: To test a specific characteristic of a new algorithm or of various algorithms (e.g., their resilience to noise): select domains presenting that characteristic.
- Case 2: To test the general performance of a new algorithm or of various algorithms on a variety of domains with different characteristics: select varied domains, but watch the way in which you report the results. There may be a lot of variance from classifier to classifier and from one type of domain to another. It is best to cluster the kinds of domains on which classifiers excel or do poorly and report the results on a cluster-by-cluster basis.

8 Selecting Data Sets III
Situations 2 and 3 (cont'd): Three questions remain:
- Question 1: How many data sets are necessary / desirable?
- Question 2: Where can we get these data sets?
- Question 3: How do we select data sets from those available?

9 Selecting Data Sets IV
Situations 2 & 3: How many data sets?
- The number of domains necessary depends on the variance in the performance of the classifiers. As a rule of thumb, 3 to 5 domains within the same category of domains are desirable to begin with. Note: as domains get added, the multiplicity effect raised by [Salzberg, 1997] and [Jensen, 2001] should be considered.
Situations 2 & 3: Where can we get these data sets?
- The UCI Machine Learning Repository or other repositories (but the collections may not be representative of reality).
- Directly from the Web (but gathering and cleaning a data collection is extremely time consuming).
- Artificial data sets (easy to build and unlimited in size, but too far removed from reality).
- Real-world-inspired artificial data, i.e., real-world data sets artificially augmented (easy to build, closer to reality).

10 Selecting Data Sets V
Situations 2 & 3: How do we select data sets from those available?
- Select all those that are available and meet the constraints of the algorithms under study. For example, the UCI repository contains many data sets, but only a subset of these are multi-class, only a subset has nominal attributes only, only a subset has no missing attributes, and so on.
- To increase the number of domains available to researchers or practitioners of Data Mining, some amendments can be made to the data sets so that as many as possible conform to the requirements of the classifiers.

11 Selecting performance measures
Situations 2 and 3: [Caruana and Niculescu-Mizil, 2004] suggest that Root Mean Squared Error (RMSE) is the best general-purpose measure, since it is the one best correlated with the other eight measures they use. Researchers are, however, encouraged to use a variety of different metrics in order to discover the strengths and shortcomings of each classifier on each domain more specifically.
Situations 1 and 4: We distinguish between the following cases:
- Balanced versus imbalanced domains: ROC
- Certainty of the decision matters: B & K
- All the classes matter: RMSE
- The problem is binary but one class matters more than the other: Precision, Recall, F-measure, Sensitivity, Specificity, Likelihood Ratios.

12 Selecting an error estimation method and statistical test I
- If the data set is large enough (every testing set contains at least 30 examples) and the statistic of interest has an associated parametric test: cross-validation can be tried (but see the next slide).
- If the data set is particularly small, i.e., some of the testing sets contain fewer than about 30 examples: Bootstrapping or Randomization.
- If the statistic of interest has no statistical test associated with it: Bootstrapping or Randomization.
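For the small-data case, one common bootstrap scheme trains on a sample drawn with replacement and tests on the out-of-bag examples. The sketch below assumes a scikit-learn classifier and uses a stand-in dataset; the number of bootstrap iterations (100) is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)
errors = []
for _ in range(100):
    boot = rng.integers(n, size=n)            # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(n), boot)    # out-of-bag examples form the test set
    clf = GaussianNB().fit(X[boot], y[boot])
    errors.append(1.0 - clf.score(X[oob], y[oob]))
err = float(np.mean(errors))
print(f"bootstrap error estimate: {err:.3f}")
```

Because each model is tested only on examples it never saw, averaging over many resamples gives a usable error estimate even when the data set is too small for 30-example test folds.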

13 Selecting an error estimation method and statistical test II
Question: How can one tell whether cross-validation is appropriate for his/her purposes? Two ways:
- Visual: plot the distribution and check its shape visually.
- Apply a hypothesis test designed to check whether the distribution is normal (e.g., the chi-squared goodness-of-fit test, the Kolmogorov-Smirnov goodness-of-fit test, etc.).
Since no practical distribution will be exactly normal, we must also look into the robustness of the various statistical methods considered. The t-test is quite robust against violations of the normality assumption. If the distribution is far from normal, non-parametric tests must be used.
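Such a normality check can be run in a few lines with scipy; the scores below are simulated stand-ins for per-fold accuracies. The Shapiro-Wilk test is used here for convenience, though the Kolmogorov-Smirnov test the slide mentions (`scipy.stats.kstest` on standardized scores against `'norm'`) would serve the same purpose.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Stand-in for 30 cross-validation fold accuracies
scores = rng.normal(loc=0.85, scale=0.02, size=30)

# Goodness-of-fit test of normality; a small p-value rejects normality
stat, p = stats.shapiro(scores)
print(f"Shapiro-Wilk p-value: {p:.3f}")
normal_enough = p > 0.05   # fail to reject normality at the 5% level
print("treat as normal:", normal_enough)
```

If `normal_enough` is false (or the plotted histogram looks clearly skewed or heavy-tailed), the non-parametric alternatives of the next slides become the safer choice.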

14 Selecting an error estimation method and statistical test III
- The robustness of a procedure is important since it ensures that the reported significance level is close to the true one.
- However, robustness does not answer the question of whether efficient use is made of the data, so that a false null hypothesis can be rejected. Power should also be considered.
- The power of a test depends on some intrinsic properties of that test, but also on the shape and size of the population to which it is applied.
- Example: parametric tests based on the normality assumption are generally as powerful as, or more powerful than, non-parametric tests based on ranks when the distribution has lighter tails than the normal distribution.

15 Selecting an error estimation method and statistical test IV
- But: parametric tests based on the normality assumption are less powerful than non-parametric ones when the tails of the distribution are heavier than those of the normal distribution (an important kind of data presenting such distributions is data containing outliers).
- Note that the relative power of parametric and non-parametric tests does not change as a function of sample size, even if a test is asymptotically distribution free (i.e., if it becomes more and more robust as the sample size increases).
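The outlier point can be illustrated with scipy: the hypothetical per-domain accuracy differences below are small and consistently positive except for one outlier, which inflates the t-test's variance estimate while barely disturbing the rank-based Wilcoxon signed-rank test. The numbers are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy differences (classifier A minus B) over 10 domains;
# the last value is an outlier with a heavy influence on the sample variance
diffs = np.array([0.02, 0.03, 0.01, 0.02, 0.04,
                  0.03, 0.02, 0.01, 0.03, -0.25])

t_p = stats.ttest_1samp(diffs, 0.0).pvalue   # parametric: assumes normality
w_p = stats.wilcoxon(diffs).pvalue           # non-parametric: uses ranks only
print(f"t-test p = {t_p:.3f}, Wilcoxon p = {w_p:.3f}")
```

The Wilcoxon p-value comes out markedly smaller than the t-test's here: ranks cap the outlier's influence, so the rank-based test retains more power on this heavy-tailed sample, exactly as the slide claims.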

