CSCI 347, Data Mining. Evaluation: Cross Validation, Holdout, Leave-One-Out Cross Validation, and Bootstrapping (Sections 5.3 & 5.4)
Training & Testing Dilemma: we want a large training dataset and a large testing dataset, but we often don't have enough good data.
Training & Testing: Resubstitution error rate – the error rate obtained by testing on the training data. This error rate is highly optimistic and not a good indicator of the performance on an independent test dataset.
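A minimal sketch of the resubstitution error, assuming scikit-learn, its bundled breast cancer dataset, and a decision tree as a stand-in learner (none of these choices come from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Train and then test on the SAME data: the resubstitution error.
model = DecisionTreeClassifier(random_state=0).fit(X, y)
resub_error = 1 - model.score(X, y)

# An unpruned tree can essentially memorize the training data, so this error
# is close to 0 and is far too optimistic as an estimate of the error on an
# independent test dataset.
print(f"resubstitution error: {resub_error:.3f}")
```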
Evaluation in Weka
Overfitting - Negotiations
Overfitting - Diabetes: 1R with the default minimum bucket size of 6 builds a rule on plas with many alternating tested_negative / tested_positive intervals (cut points omitted here): 587/768 instances correct on the training data; 71.5% correct when evaluated.
Overfitting - Diabetes: 1R with a minimum bucket size of 20 builds a much simpler rule on plas (tested_negative below a single cut point, tested_positive at or above it): 573/768 instances correct on the training data; 72.9% correct when evaluated.
Overfitting - Diabetes: 1R with a minimum bucket size of 50, again a single cut point on plas: 576/768 instances correct on the training data; 74.2% correct when evaluated.
Overfitting - Diabetes: 1R with a minimum bucket size of 200 switches to preg (tested_negative below 6.5, tested_positive at or above 6.5): 521/768 instances correct on the training data; 66.7% correct when evaluated.
Holdout: the holdout procedure reserves some of the data for testing. Recommendation – when there is enough data, hold out 1/3 of the data for testing and use the remaining 2/3 for training.
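A sketch of the 1/3 holdout, again assuming scikit-learn and a decision tree as an illustrative learner:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 1/3 of the data for testing; train on the remaining 2/3.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"holdout error rate: {1 - model.score(X_test, y_test):.3f}")
```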
Stratified Holdout: a stratified holdout checks that each class is represented in approximately the same proportions in the testing dataset as in the overall dataset.
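A sketch of a stratified holdout split; with scikit-learn's train_test_split the stratify argument does the proportional sampling (the dataset choice is only illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y makes the class proportions in the held-out test set
# match the proportions in the overall dataset.
_, _, _, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

print(f"positive rate overall: {y.mean():.3f}")
print(f"positive rate in test: {y_test.mean():.3f}")
```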
Evaluation Techniques When There Isn't Enough Data. Techniques: Cross Validation, Stratified Cross Validation, Leave-One-Out Cross Validation, and Bootstrapping.
Repeated Holdout Method: use multiple iterations; in each iteration a certain proportion of the dataset is randomly selected for training (possibly with stratification). The error rates from the different iterations are averaged to yield an overall error rate.
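A sketch of the repeated holdout method under the same illustrative assumptions (scikit-learn, decision tree learner); each iteration uses a different random, stratified split and the error rates are averaged:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
errors = []

# Repeat the holdout procedure with a different random selection each time.
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    errors.append(1 - model.score(X_te, y_te))

# Average the per-iteration error rates into an overall estimate.
print(f"repeated-holdout error estimate: {np.mean(errors):.3f}")
```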
Possible Problem: this is still not optimal. Because the instances held out for testing are selected at random in each iteration, the testing sets from different iterations may overlap.
Cross-Validation: decide on a fixed number of "folds", or partitions, of the dataset. For each of the n folds, train on (n-1)/n of the dataset and test on the remaining 1/n to estimate the error. Typical stages: split the data into n subsets of equal size; use each subset in turn for testing and the remaining subsets for training; average the results.
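A sketch of n-fold cross-validation (here n = 10) using scikit-learn's KFold; the learner and dataset are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
errors = []

# Each of the 10 folds is used once for testing; the other 9 are used for training.
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))

# Average the per-fold error rates.
print(f"10-fold cross-validation error: {np.mean(errors):.3f}")
```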
Stratified Cross-Validation: in stratified n-fold cross-validation, each fold is constructed so that the classes are represented in approximately the same proportions as in the full dataset.
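The stratified variant only changes how the folds are built; in scikit-learn this is StratifiedKFold, which needs the class labels to construct the folds. The check below simply prints the class proportion of each test fold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
print(f"positive rate in the full dataset: {y.mean():.2f}")

# Each test fold keeps roughly the same class proportions as the full dataset.
for i, (_, test_idx) in enumerate(StratifiedKFold(n_splits=10).split(X, y)):
    print(f"fold {i}: positive rate {y[test_idx].mean():.2f}")
```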
Recommendation When There Is Insufficient Data: 10-fold cross-validation with stratification has become the standard. The book states: extensive experiments have shown that this is the best choice for getting an accurate estimate; there is some theoretical evidence that it is the best choice; still, controversy rages on in the machine learning community.
Leave-One-Out Cross-Validation: the number of folds equals the number of training instances. Pros: makes the best use of the data, since the greatest possible amount of data is used for training in each fold; involves no random sampling. Cons: computationally expensive (the number of models to build grows directly with the number of instances); the test sets cannot be stratified, since each contains only a single instance.
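A sketch of leave-one-out cross-validation with scikit-learn's LeaveOneOut; the iris dataset is used here only because it is small, since one model is built per instance:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
errors = []

# One fold per instance: train on n-1 instances, test on the single remaining one.
for train_idx, test_idx in LeaveOneOut().split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))

print(f"leave-one-out error over {len(errors)} folds: {np.mean(errors):.3f}")
```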
Bootstrap Methods: the bootstrap uses sampling with replacement to form the training set. Sample a dataset of n instances n times with replacement to form a new dataset of n instances; use this as the training set; use the instances from the original dataset that do not occur in the new training set for testing.
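A small sketch (plain NumPy, with a made-up n) of forming one bootstrap training set and keeping the never-sampled instances for testing:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                             # illustrative dataset size

train_idx = rng.integers(0, n, size=n)             # sample n instances with replacement
test_idx = np.setdiff1d(np.arange(n), train_idx)   # instances never sampled form the test set

print("training indices:", sorted(train_idx))      # duplicates are expected
print("testing indices: ", list(test_idx))
```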
0.632 Bootstrap: What is the likelihood of a particular instance not being chosen for the training set on a single draw? (1 - 1/n). Repeating the draw n times, the likelihood of never being chosen is (1 - 1/n)^n:
(1 - 1/2)^2 = 0.25
(1 - 1/3)^3 ≈ 0.296
(1 - 1/4)^4 ≈ 0.316
(1 - 1/5)^5 ≈ 0.328
(1 - 1/6)^6 ≈ 0.335
(1 - 1/7)^7 ≈ 0.340
(1 - 1/8)^8 ≈ 0.344
(1 - 1/9)^9 ≈ 0.346
(1 - 1/10)^10 ≈ 0.349
...
(1 - 1/500)^500 ≈ 0.368
(1 - 1/n)^n converges to 1/e ≈ 0.368
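A quick numerical check in plain Python that (1 - 1/n)^n approaches 1/e ≈ 0.368:

```python
import math

for n in (2, 10, 100, 500, 10_000):
    p_never_chosen = (1 - 1/n) ** n        # probability an instance is never sampled
    print(f"n = {n:>6}: (1 - 1/n)^n = {p_never_chosen:.4f}")

print(f"limit 1/e = {1 / math.e:.4f}")     # about 0.3679
```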
0.632 Bootstrap: so for largish n (e.g. n = 500), an instance has a likelihood of about 0.368 of not being chosen for the training set, and therefore a 1 - 0.368 = 0.632 chance of being selected; this gives the 0.632 bootstrap method its name.
0.632 Bootstrap: for bootstrapping, the error estimate on the test data will be quite pessimistic, since training used only ~63% of the distinct instances. Therefore, combine it with the resubstitution error, weighted as follows: error = 0.632 × error_on_test_instances + 0.368 × error_on_training_instances. Repeat the process several times with different replacement samples and average the results. Bootstrapping is probably the best way of estimating performance for small datasets.
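Putting the pieces together, a sketch of the 0.632 bootstrap estimate under the same illustrative assumptions (scikit-learn, decision tree learner, bundled dataset), repeated over several replacement samples and averaged:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)
estimates = []

for _ in range(50):                                   # different replacement samples
    train_idx = rng.integers(0, n, size=n)            # bootstrap training set
    test_idx = np.setdiff1d(np.arange(n), train_idx)  # left-out instances for testing

    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    err_test = 1 - model.score(X[test_idx], y[test_idx])
    err_train = 1 - model.score(X[train_idx], y[train_idx])

    # Weighted combination of test error and resubstitution error.
    estimates.append(0.632 * err_test + 0.368 * err_train)

print(f"0.632 bootstrap error estimate: {np.mean(estimates):.3f}")
```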