Evaluation of Results (classifiers, and beyond) Biplav Srivastava Sources: [Witten&Frank00] Witten, I.H. and Frank, E. Data Mining - Practical Machine Learning Tools and Techniques with JAVA Implementations, Morgan Kaufmann, [Lim et al99] Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms, Machine Learning. Forthcoming. (Appendix containing complete tables of error rates, ranks, and training times; download the data sets in C4.5 format)
Evaluation of Methods Ideally : find the best one Practically: find comparable classes Classification accuracy –Effect of noise on accuracy Comprehensibility of result –Compactness –Complexity Training time Scalability with increase in sample size
Types of classifiers Decision trees Neural Network Statistical –Regression Pi = w0 + w1*x1 + w2*x wm*xm Use regression on data set: Min Sum(Ci - (w0 + w1*x wm*xm ))^2 to get the weights. –Classification Perform regression for each class; output = 1 for Ci, 0 otherwise, to get a linear expression Given test data, evaluate each linear expression and choose the class corresponding to the largest
Assumptions in classification Data set is a representative sample Prior probability is proportional to the frequency in training sample sizes Changing categories to numerical values If attribute X takes k values: C 1, C 2,..., C k, have (k-1) dimensional vector d 1, d 2,..., d k-1 such that d i =1 if X= C i and d i =0, otherwise. For X= C k, the vector contains all zeros.
Error Rate of Classifier If data set is large –use result from test data –test data is usually 1/3 of total data If data size is small –K-fold cross validation (usually 10) Divide data set into roughly equal K sets. Use K-1 for training, the remaining for testing Repeat K times and average the error rate
Error Rate of Classifier (cont) –Fold choice can affect result STRATIFICATION: Class ratio in each set is same as in whole data set. Use K K-fold cross validation to overcome random variation in fold choice. –Leave-one-out: N-fold cross validation –Bootstrap: Training Set = Select N data items at random with substitution. Prob = (1 - 1/N)^N = 1/e = Test Set = Data Set - Training Set e = * e(test) * e(training)
Evaluation in [ Lim et al 99] 22 DT, 9 Statistical, 2 NN classifiers Time Measurement –Use SPEC marks to rate platforms and scale results based on these marks. Error rates –Calculation For test data > 1000, use its error rate Else, use 10-fold cross validation –Does not use multiple 10-folds.
Evaluation in [ Lim et al 99] –Acceptable performance If p is minimum error rate of all classifiers, those within 1 standard error on p are accepted Std error of p = Sqrt(p*(1-p)/N) Statistical significance of error rates –Null hypothesis: All algorithms with same mean error rate ? Test differences of mean error rates. (Hypothesis REJECTED). – Tukey method: Difference between mean error rates significant at 10% if differ by
Evaluation in [ Lim et al 99] Rank analysis (no normality assumption) –Data Set(i): Sort according to ascending error rate and assign rank (ties given avg rank) Error rate: Rank: –Get Mean Rank across all Data Set Statistical significance of rank –Null Hypothesis on difference of mean ranks (Friedman test) is REJECTED. –Difference in mean ranks greater than 8.7 is significant at 10% level.
Evaluation in [ Lim et al 99] Equivalent classifiers from best (POL) (Training time <= 10 min) –15 found with mean error rate –18 found with mean rank Training time shown in order slower than fastest classifier (10^(x-1) to 10^x) –Decision tree learners (C4.5, FACT - statistical tests for splitter), regression methods train fast –Spline-based statisticals and NNs slower
Evaluation in [ Lim et al 99] Size of trees (in decision trees) –Use 10-fold cross-validation –Noise typically reduces tree size Scalability with data set size –If data set small, use bootstrap re-sampling N items are drawn with substitution Class attribute is randomly changed with prob = 0.1 to a value from the valid set selected uniformly –Else, use given data size
Evaluation in [ Lim et al 99] –log(training time) increases linearly with log(N) –Decision trees usually scale C4.5 with rules does not scale, but trees do QUEST, FACT (multi-attribute) scale –Regression methods usually scale –NN methods were not tested
[ Lim et al 99] Summary Use method that fits the requirement –error rate, training time,... Decision tree with univariate splits (single attribute) for data interpretation –C4.5 trees are big, C4.5 rules don’t scale Simple regression methods are good: fast, easy implementation, scalable
Precision/ Recall Cost based analysis What more ? False (-)ve True (+)ve True (-)veFalse (+)ve PREDICTED ACTUAL
Demonstration from WEKA