Evaluation and Its Methods
How do we know that a method is good, or better than another?
Measuring data mining algorithms - what are your intuitive ideas?
Why should we evaluate?
- Comparison: the goodness of the mining schemes
  - The difference between f (the true function) and f' (the learned one)
- Avoiding over-fitting the data - what is over-fitting?
- Challenge: how do we measure something we don't really know?
- Criteria for comparison: objective, repeatable, fair
What to compare?
Measures vary by task.
- For classification, we may compare along accuracy, compactness, comprehensibility, time, ...
- For clustering, ...
- For association rules, ...
How to obtain evaluation results
- "Just trust me" - does that work?
- Training data only (resubstitution)
- Training-testing split (e.g., 2/3 and 1/3)
- Training-validation-testing split
- Cross-validation
- Leave-one-out
- Bootstrap: random sampling with replacement; on average about 63.2% of the data appears in each bootstrap sample
- One important step: shuffling the data
- For each instance in a dataset of n instances, the probability of being picked in one draw is 1/n, and of not being picked is 1 - 1/n. After n draws, the probability that an instance is never picked is (1 - 1/n)^n ≈ e^(-1) ≈ 0.368, so about 63.2% of the instances appear in the sample.
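A minimal sketch (plain Python; the function name bootstrap_coverage is just an illustrative choice) that empirically checks the 1 - 1/e ≈ 63.2% coverage claim:

```python
import random

def bootstrap_coverage(n=10000, trials=20):
    """Estimate the fraction of distinct instances that appear
    in a bootstrap sample of size n drawn with replacement."""
    fractions = []
    for _ in range(trials):
        sample = [random.randrange(n) for _ in range(n)]  # sample with replacement
        fractions.append(len(set(sample)) / n)            # fraction of unique instances
    return sum(fractions) / trials

print(bootstrap_coverage())  # typically close to 1 - 1/e ≈ 0.632
```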
Basic concepts
- True positive (TP), false positive (FP)
- True negative (TN), false negative (FN)
- One definition of accuracy: (TP+TN)/Sum, where Sum = TP+FP+TN+FN
- Error rate is 1 - accuracy
- Various other definitions:
  - Precision P = TP/(TP+FP)
  - Recall R = TP/(TP+FN)
  - F measure = 2*P*R/(P+R)
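A short sketch computing these measures directly from the four confusion-matrix counts (the example counts are made up):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the basic measures from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "error_rate": 1 - accuracy,
            "precision": precision, "recall": recall, "f_measure": f_measure}

print(classification_metrics(tp=40, fp=10, tn=45, fn=5))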
Examples of curves
If a prediction can be associated with a probability, we can do the following:
- Organize the predicted data: rank the predictions by predicted probability in descending order
- Lift charts: number of respondents vs. sample size (%)
- ROC (Receiver Operating Characteristic) curves: use the ranked data and plot true positive rate (Y) vs. false positive rate (X)
Figure source: http://gim.unmc.edu/dxtests/roc2.htm
Additional examples can be found in the Witten & Frank book.
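A sketch of computing an ROC curve from ranked probabilistic predictions, assuming scikit-learn is available; the labels and scores below are made-up data:

```python
from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]                       # actual class labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]  # predicted probabilities

# roc_curve sweeps a threshold over the ranked predictions
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc(fpr, tpr))
```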
Paper presentation work
- When is a good due date?
- Submit a hard copy of the required materials (please refer to the instructor's course website).
- After the submission, you can still modify your slides, so include a URL in the hard copy where we can get the latest version.
- We will arrange the presentation order after selection. It depends on the time available - we may have to select which papers are presented based on the quality of the preparation.
Some issues
- Sizes of data: large vs. small
  - How much data is needed - roughly vs. theoretically?
  - "Happy curves": viewing the effect of increasing the amount of data (see the sketch below)
- Subjectivity: why is it needed? There can be many legitimate solutions (the '8 gold coins' problem)
- Applications: categorization/prediction, microarrays
- Adding the cost: different errors can incur very different costs
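A minimal sketch of a "happy curve" (accuracy as a function of training-set size), assuming scikit-learn and using a toy dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(), X, y, cv=5,
    train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0])

# Accuracy usually improves (then flattens) as more training data is used
for n, scores in zip(sizes, test_scores):
    print(f"train size {int(n)}: mean CV accuracy {scores.mean():.3f}")
```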
Evaluating numeric prediction
- Mean-squared error
- Root mean-squared error
- Mean absolute error
- Correlation coefficient
- and many others (please refer to your favorite textbook)
Source: Witten and Frank's book
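A small sketch computing these numeric-prediction measures for paired actual/predicted values (the sample values are made up):

```python
import math

def regression_metrics(actual, predicted):
    """Basic numeric-prediction measures for paired actual/predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse  = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae  = sum(abs(e) for e in errors) / n
    # Pearson correlation coefficient between actual and predicted values
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    corr = cov / math.sqrt(sum((a - ma) ** 2 for a in actual) *
                           sum((p - mp) ** 2 for p in predicted))
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "correlation": corr}

print(regression_metrics([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 3.0, 8.0]))
```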
Cross-validation (CV - revisited)
- n-fold CV (a common n is 10)
  - Mean = (Σ Ai)/n, where Ai is the accuracy of the ith run in a total of n runs
  - Variance s^2 = Σ(Ai - Mean)^2 / (n - 1)
- K times n-fold CV and its procedure:
  - Loop K times: shuffle the data and divide it into n folds
    - Loop n times: (n-1) folds are used for training, 1 fold for testing
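A sketch of the K times n-fold CV procedure above in plain Python; `train_and_test` is a hypothetical placeholder for a learner that trains on one set and returns accuracy on the other:

```python
import random

def k_times_nfold_cv(data, train_and_test, K=10, n=10, seed=0):
    """Run n-fold CV K times and return the mean and sample variance of accuracy."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(K):                              # loop K times
        shuffled = data[:]
        rng.shuffle(shuffled)                       # shuffle the data
        folds = [shuffled[i::n] for i in range(n)]  # divide into n folds
        for i in range(n):                          # loop n times
            test = folds[i]                         # 1 fold for testing
            train = [x for j, f in enumerate(folds) if j != i for x in f]
            accuracies.append(train_and_test(train, test))
    mean = sum(accuracies) / len(accuracies)
    var = sum((a - mean) ** 2 for a in accuracies) / (len(accuracies) - 1)
    return mean, var
```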
Comparing different DM algorithms
- Which algorithm is better between any two? An example from Assignment 3, using different data sets
- Need to consider both mean and variance differences
- The intuitive ideas are (1) to see whether one is consistently better than the other, or (2) whether they are not significantly (in a statistical sense) different from each other
- Hypothesis tests and Type I (α) and Type II (β) errors: http://davidmlane.com/hyperstat/A18652.html
  - Test statistics (t, F, and chi-square)
  - One-tailed or two-tailed test, depending on what your null hypothesis is (e.g., a difference in means)
  - Type I error: rejecting a true null hypothesis
  - Type II error: failing to reject a false null hypothesis
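A sketch of a paired t-test on the per-fold accuracies of two algorithms evaluated on the same CV folds, assuming SciPy is available; the accuracy lists are made-up numbers:

```python
from scipy.stats import ttest_rel

acc_algorithm_A = [0.81, 0.84, 0.79, 0.85, 0.82, 0.80, 0.83, 0.86, 0.81, 0.84]
acc_algorithm_B = [0.78, 0.82, 0.77, 0.84, 0.80, 0.79, 0.80, 0.83, 0.79, 0.81]

t_stat, p_value = ttest_rel(acc_algorithm_A, acc_algorithm_B)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# With a preset significance level α = 0.05:
# reject H0 (the two means are equal) if p_value < 0.05.
```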
Comparing two sets of results
- Null hypothesis H0: the two means are equal; the alternative to H0 is that they are not equal
- Paired or unpaired tests
- Fixed-level testing
  - The significance level α (5%, 10%) is preset
  - α is also the Type I error rate (rejecting a true H0); confidence level: 1 - α
  - Critical region (CR): if the test statistic falls in it, something extreme has occurred (one- or two-tailed; in a two-tailed test each tail has area α/2)
  - If the test statistic falls in the CR, reject H0; otherwise no difference is discernible
- http://www.sportsci.org/resource/stats/pvalues.html
- http://www.graphpad.com/www/book/Interpret.htm
- http://home.ubalt.edu/ntsbarsh/Business-stat/otherapplets/pvalues.htm
P value
- The observed significance value; a statistics program can calculate the p-value
- It is the smallest fixed level at which the null hypothesis can be rejected
- The result is statistically significant if the p-value is less than the preset value of α; the results would be surprising if H0 were true, hence reject H0
- Using p-values, with α = 0.05:
  - p-value = 1 > α: do not reject H0
  - p-value = 0.02 < α: reject H0
- More details: http://www.tufts.edu/~gdallal/pval.htm
- An example of calculating a p-value: http://en.wikipedia.org/wiki/P-value
Costs of predictions
- Different errors may incur different costs
  - Errors on positive and negative predictions can have dramatically different costs
  - Including the cost consideration in learning can significantly influence the outcome
- Cost-sensitive learning examples
  - Disease diagnosis: which type of error (false positive or false negative) is more important?
  - Intrusion detection
  - Car alarm, house alarm
- How to introduce costs into the measurement for a 2-class, 3-class, or k-class problem (see the sketch below)
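One common way to introduce costs is a cost matrix paired with the confusion matrix; here is a minimal sketch for the k-class case (the counts and costs below are made up):

```python
# cost[i][j] is the cost of predicting class j when the true class is i;
# confusion[i][j] counts how often that happened on the test data.
def total_cost(confusion, cost):
    return sum(confusion[i][j] * cost[i][j]
               for i in range(len(cost))
               for j in range(len(cost)))

# 2-class example: false negatives (true=1, predicted=0) are 10x costlier
confusion = [[45, 10],   # true class 0: TN=45, FP=10
             [ 5, 40]]   # true class 1: FN=5,  TP=40
cost      = [[ 0,  1],   # predicting 1 for a true 0 costs 1
             [10,  0]]   # predicting 0 for a true 1 costs 10
print(total_cost(confusion, cost))  # 10*1 + 5*10 = 60
```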
Project proposal and onward
- Let's look at the course website
- What do you want to achieve?
- What ideas do you have? Among them, which are feasible?
- What if there is no idea? Where can you find project ideas?
- What are the difficulties? Interesting or big problems? Data available now or later?
Now, what about the project?
- Act on it NOW
- What are the common challenges in the group projects, so that they can be attacked separately? Let's list some of them here ...
- What if I fail? We award credit for failures too, if they are dissected carefully and the experience can benefit others.
Summary
- There are many ways to measure; which one should you use?
  - The key is to follow the accepted standards of the venue where you want your results published/accepted
- Reproducibility of the empirical results is utterly important
  - Use benchmark datasets if possible
- Subjectivity in evaluation: what is fair?
  - Don't forget the '8 gold coins' problem, so try to explain your results objectively