CSCI 347, Data Mining Evaluation: Cross Validation, Holdout, Leave-One-Out Cross Validation and Bootstrapping, Sections 5.3 & 5.4, pages 152-156.


Training & Testing Dilemma
- Want a large training dataset
- Want a large testing dataset
- Often don't have enough good data

Training & Testing
Resubstitution error rate – the error rate resulting from testing on the training data
- This error rate will be highly optimistic
- Not a good indicator of the performance on an independent test dataset
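A minimal sketch (not from the slides) of how the two error rates differ in practice; the decision tree and scikit-learn's breast-cancer dataset are stand-ins chosen purely for illustration:
# Resubstitution error vs. error on an independent test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=1)
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
resub_error = 1 - tree.score(X_train, y_train)      # error measured on the training data
test_error = 1 - tree.score(X_test, y_test)         # error measured on held-out data
print(f"resubstitution error: {resub_error:.3f}")   # near 0 -- highly optimistic
print(f"independent test error: {test_error:.3f}")  # a more honest estimate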

Evaluation in Weka

Overfitting - Negotiations

Overfitting - Diabetes
1R with the default bucket value of 6: the rule on plas alternates between tested_negative and tested_positive across ten intervals (587/768 instances correct on the training data); 71.5% correct on test data

Overfitting - Diabetes
1R with bucket value of 20: the rule on plas uses a single threshold, predicting tested_negative below it and tested_positive at or above it (573/768 instances correct on the training data); 72.9% correct on test data

Overfitting - Diabetes
1R with bucket value of 50: again a single threshold on plas, tested_negative below it and tested_positive at or above it (576/768 instances correct on the training data); 74.2% correct on test data

Overfitting - Diabetes
1R with bucket value of 200: the rule switches to preg, predicting tested_negative for preg < 6.5 and tested_positive for preg >= 6.5 (521/768 instances correct on the training data); 66.7% correct on test data

Holdout
Holdout procedure – hold out some of the data for testing
Recommendation – when there is enough data, hold out 1/3 of the data for testing (use the remaining 2/3 for training)
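A small NumPy sketch of this 2/3 train / 1/3 test split; holdout_split and the toy X, y below are illustrative, not part of the slides:
# Shuffle the instances, keep 1/3 for testing and the rest for training.
import numpy as np
def holdout_split(X, y, test_fraction=1/3, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                    # random order of instance indices
    n_test = int(round(len(y) * test_fraction))      # size of the test set
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
X = np.arange(30).reshape(15, 2)                     # toy feature matrix
y = np.array([0, 1] * 7 + [0])                       # toy labels
X_train, y_train, X_test, y_test = holdout_split(X, y)
print(len(y_train), len(y_test))                     # 10 training, 5 test instances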

Stratified Holdout
Stratified holdout – ensure that each class is represented in approximately the same proportion in the testing dataset as in the overall dataset
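One way to obtain a stratified holdout, assuming scikit-learn is available; the 2:1 class ratio below is invented purely to show the proportions being preserved:
# train_test_split's stratify argument keeps class proportions roughly equal
# in both the training and the testing part.
import numpy as np
from sklearn.model_selection import train_test_split
y = np.array([0] * 100 + [1] * 50)        # hypothetical labels with a 2:1 class ratio
X = np.arange(len(y)).reshape(-1, 1)      # dummy features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, stratify=y, random_state=1)
print(np.bincount(y_tr) / len(y_tr))      # ~[0.67, 0.33], same ratio as the full dataset
print(np.bincount(y_te) / len(y_te))      # ~[0.67, 0.33]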

Evaluation Techniques When There Is Not Enough Data
Techniques:
- Cross-validation
- Stratified cross-validation
- Leave-one-out cross-validation
- Bootstrapping

Repeated Holdout Method
Repeated holdout method – use multiple iterations; in each iteration a certain proportion of the dataset is randomly selected for training (possibly with stratification) and the rest is used for testing. The error rates from the different iterations are averaged to yield an overall error rate
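A sketch of the repeated holdout method; the decision tree, the breast-cancer data, and the choice of 10 iterations are illustrative assumptions:
# Average the test error over several random stratified holdout splits.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
errors = []
for seed in range(10):                                   # 10 independent holdout runs
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    errors.append(1 - clf.score(X_te, y_te))             # error rate for this iteration
print(f"repeated-holdout error estimate: {np.mean(errors):.3f}")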

Possible Problem
This is still not optimal: because the test instances are randomly selected in each iteration, the different test sets may overlap.

Cross-Validation
Cross-validation – decide on a fixed number of "folds", or partitions, of the dataset. For each of the n folds, train on (n-1)/n of the dataset and test on the remaining 1/n to estimate the error
Typical stages:
- Split the data into n subsets of equal size
- Use each subset in turn for testing, and the remaining subsets for training
- Average the results
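A sketch of those stages in code; cross_validation_error is an illustrative helper (not a library function), and the decision tree and dataset are placeholder choices:
# Split into n folds, test on each fold in turn, train on the rest, average the errors.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
def cross_validation_error(X, y, n_folds=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)          # n subsets of (nearly) equal size
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]                       # this fold is held out for testing
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        errors.append(1 - clf.score(X[test_idx], y[test_idx]))
    return np.mean(errors)                        # average over the n folds
X, y = load_breast_cancer(return_X_y=True)
print(f"10-fold cross-validation error: {cross_validation_error(X, y):.3f}")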

Stratified Cross-Validation
Stratified n-fold cross-validation – each fold is constructed so that the classes are represented in approximately the same proportions as in the full dataset
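A minimal stratified cross-validation sketch using scikit-learn's StratifiedKFold; the classifier and dataset are again placeholders:
# StratifiedKFold assigns instances to folds so that every fold preserves
# the overall class proportions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=skf)
print(f"stratified 10-fold CV error: {1 - scores.mean():.3f}")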

Recommendation When There Is Insufficient Data
Stratified 10-fold cross-validation has become the standard. The book states:
- Extensive experiments have shown that this is the best choice for obtaining an accurate estimate
- There is some theoretical evidence that it is the best choice
Controversy still rages in the machine learning community

Leave-One-Out Cross-Validation
Leave-one-out cross-validation – the number of folds equals the number of training instances
Pros:
- Makes the best use of the data, since the greatest possible amount is used for training in each fold
- Involves no random sampling
Cons:
- Computationally expensive (the cost grows directly with the number of instances, since one model is trained per instance)
- The folds cannot be stratified (each test set contains only a single instance)
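A leave-one-out sketch with scikit-learn's LeaveOneOut splitter; note that it fits one model per instance, which is exactly the cost listed under the cons:
# Each instance is the test set once; no random sampling and no stratification.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    errors.append(clf.predict(X[test_idx])[0] != y[test_idx][0])   # 0/1 error on one instance
print(f"leave-one-out error: {np.mean(errors):.3f}")               # n model fits in total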

Bootstrap Methods
Bootstrap – uses sampling with replacement to form the training set
- Sample a dataset of n instances n times with replacement to form a new dataset of n instances
- Use this new dataset as the training set
- Use the instances from the original dataset that do not occur in the new training set for testing
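A small sketch of forming one bootstrap training set and its test set; bootstrap_split is an illustrative helper, not a standard function:
# Sample n indices with replacement for training; test on the instances never drawn.
import numpy as np
def bootstrap_split(n, seed=0):
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, n, size=n)               # n draws with replacement
    test_idx = np.setdiff1d(np.arange(n), train_idx)     # instances that were never drawn
    return train_idx, test_idx
train_idx, test_idx = bootstrap_split(n=1000)
print(len(np.unique(train_idx)) / 1000)   # ~0.632 of the instances appear in training
print(len(test_idx) / 1000)               # ~0.368 of the instances are left for testing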

0.632 Bootstrap
Likelihood of a particular instance not being chosen on one draw for the training set? (1 - 1/n)
Repeat the draw n times – likelihood of never being chosen? (1 - 1/n)^n
- (1 - 1/2)^2 = 0.250
- (1 - 1/3)^3 ≈ 0.296
- (1 - 1/4)^4 ≈ 0.316
- (1 - 1/5)^5 ≈ 0.328
- (1 - 1/6)^6 ≈ 0.335
- (1 - 1/7)^7 ≈ 0.340
- (1 - 1/8)^8 ≈ 0.344
- (1 - 1/9)^9 ≈ 0.346
- (1 - 1/10)^10 ≈ 0.349
- ...
- (1 - 1/500)^500 ≈ 0.368
(1 - 1/n)^n converges to 1/e ≈ 0.368 as n grows
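A few lines that reproduce the table above and its limit:
# (1 - 1/n)**n approaches 1/e as n grows.
import math
for n in [2, 3, 4, 5, 6, 7, 8, 9, 10, 500]:
    print(n, round((1 - 1/n) ** n, 4))
print("limit 1/e =", round(math.exp(-1), 4))   # 0.3679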

0.632 Bootstrap
So for largish n (e.g., n = 500), an instance has a likelihood of about 0.368 of not being chosen for the training set. Each instance therefore has a 1 - 0.368 = 0.632 chance of being selected – hence the name "0.632 bootstrap" method

0.632 Bootstrap
For bootstrapping, the error estimate on the test data will be very pessimistic, since training used only ~63% of the instances. Therefore, combine it with the resubstitution error in a weighted average:
Error = 0.632 × error_test_instances + 0.368 × error_training_instances
Repeat the process several times with different replacement samples and average the results
Bootstrapping is probably the best way of estimating performance for small datasets
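A sketch of the full 0.632 bootstrap estimate under the same illustrative assumptions as earlier (decision tree, breast-cancer data, 20 repetitions):
# Combine the pessimistic out-of-bag test error with the optimistic
# resubstitution error, repeat with different bootstrap samples, and average.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
n = len(y)
estimates = []
for seed in range(20):                                       # 20 bootstrap repetitions
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, n, size=n)                   # sample with replacement
    test_idx = np.setdiff1d(np.arange(n), train_idx)         # out-of-bag instances
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    err_test = 1 - clf.score(X[test_idx], y[test_idx])       # pessimistic component
    err_train = 1 - clf.score(X[train_idx], y[train_idx])    # optimistic component
    estimates.append(0.632 * err_test + 0.368 * err_train)   # weighted combination
print(f"0.632 bootstrap error estimate: {np.mean(estimates):.3f}")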