Data Mining Practical Machine Learning Tools and Techniques By I. H. Witten, E. Frank and M. A. Hall Chapter 5: Credibility: Evaluating What’s Been Learned.


Data Mining Practical Machine Learning Tools and Techniques By I. H. Witten, E. Frank and M. A. Hall Chapter 5: Credibility: Evaluating What’s Been Learned Rodney Nielsen Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

Rodney Nielsen, Human Intelligence & Language Technologies Lab Credibility: Evaluating What’s Been Learned ● Issues: training, testing, tuning ● Predicting performance ● Holdout, cross-validation, bootstrap ● Comparing schemes: the t-test ● Predicting probabilities: loss functions ● Cost-sensitive measures ● Evaluating numeric prediction ● The Minimum Description Length principle

Rodney Nielsen, Human Intelligence & Language Technologies Lab Holdout Estimation ● What to do if the amount of data is limited? ● The holdout method reserves a certain amount for testing and uses the remainder for training  Often: one third for testing, the rest for training ● Supposed Problem: the samples might not be representative  Example: class might be missing in the test data ● Supposed “advanced” version uses stratification  Ensures that each class is represented with approximately equal proportions in both subsets  What might be a negative consequence of this?
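
For concreteness, here is a minimal sketch of a stratified one-third holdout split using scikit-learn; the iris data and decision tree are stand-ins for illustration, not taken from the slides.

```python
# Stratified holdout: reserve one third for testing, train on the rest.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions roughly equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Holdout error rate: %.3f" % (1 - clf.score(X_test, y_test)))
```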

Rodney Nielsen, Human Intelligence & Language Technologies Lab Repeated Holdout Method ● Holdout estimate can be made more reliable by repeating the process with different subsamples  In each iteration, a certain proportion is randomly selected for training (possibly with stratification)  The error rates on the different iterations are averaged to yield an overall error rate ● This is called the repeated holdout method ● Or random resampling ● Still not optimum: the different test sets overlap  Can we prevent overlapping?
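
A sketch of the repeated holdout method under the same illustrative setup: average the error rate over several random stratified splits.

```python
# Repeated (stratified) holdout: average error over ten random splits.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
errors = []
for seed in range(10):                       # ten different subsamples
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    errors.append(1 - model.score(X_te, y_te))

print("Repeated holdout error: %.3f +/- %.3f"
      % (np.mean(errors), np.std(errors)))
```

The test sets produced this way can still overlap, which is exactly the problem cross-validation addresses next.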

Rodney Nielsen, Human Intelligence & Language Technologies Lab Cross-Validation ● Cross-validation avoids overlapping test sets  First step: split data into k subsets of equal size  Second step: use each subset in turn for testing, the remainder for training ● Called k-fold cross-validation ● Often the subsets are stratified before the cross-validation is performed ● This is the default in Weka, but it isn’t necessarily a good idea ● The error estimates are averaged to yield an overall error estimate
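
A minimal sketch of stratified k-fold cross-validation with scikit-learn; the dataset and classifier are again only placeholders.

```python
# Stratified 10-fold cross-validation; the k fold estimates are averaged.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("10-fold CV error estimate: %.3f" % (1 - acc.mean()))
```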

Rodney Nielsen, Human Intelligence & Language Technologies Lab More on Cross-Validation ● Standard method for evaluation in Weka: stratified ten-fold cross-validation ● Why ten? ● The authors claim:  Extensive experiments have shown this is the best choice for an accurate estimate  There is also some theoretical evidence for this ● Stratification reduces the estimate’s variance ● But is this appropriate, or does it result in overconfident estimates? ● Repeated stratified cross-validation  E.g., repeat 10-fold CV ten times and average the results (reduces s²)  Again, this leads to a false overconfidence  For a different view, read: Dietterich (1998), Approximate Statistical Tests for Comparing Supervised Learning Algorithms.
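
A sketch of repeated stratified cross-validation (10 × 10-fold); as the slide cautions, the reduced variance can be misleading if the repetitions are treated as independent.

```python
# Repeat stratified 10-fold CV ten times and average the 100 fold estimates.
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=rcv)
print("Mean error over 10 x 10-fold CV: %.3f" % (1 - acc.mean()))
```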

Rodney Nielsen, Human Intelligence & Language Technologies Lab Leave-One-Out Cross-Validation ● Leave-One-Out, a particular form of cross-validation:  Set number of folds to number of training instances  I.e., for n training instances, build classifier n times ● Makes best use of the data ● Involves no random subsampling ● Can be very computationally expensive  But there are some exceptions (e.g., nearest-neighbor classifiers)
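
A sketch of leave-one-out cross-validation; a 1-nearest-neighbor classifier is used here because it is one of the cases where LOO remains cheap.

```python
# Leave-one-out CV: n folds for n training instances, no random subsampling.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
acc = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y,
                      cv=LeaveOneOut())
print("LOO error estimate: %.3f" % (1 - acc.mean()))
```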

Rodney Nielsen, Human Intelligence & Language Technologies Lab The Bootstrap ● CV uses sampling without replacement ● The same instance, once selected, can not be selected again for a particular training/test set ● The bootstrap uses sampling with replacement to form the training set ● Sample a dataset of n instances n times with replacement to form a new dataset of n instances ● Use this data as the training set ● Use the instances from the original dataset that don’t occur in the new training set for testing

Rodney Nielsen, Human Intelligence & Language Technologies Lab The Bootstrap ● Original data set (10 total examples): x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10) ● Bootstrap training set (10 examples drawn with replacement, so duplicates such as x(4), x(9) and x(10) appear): x(10) x(4) x(9) x(1) x(8) x(4) x(9) x(10) x(2) …

Rodney Nielsen, Human Intelligence & Language Technologies Lab The Bootstrap ● Original data set: x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10) ● Bootstrap training set: x(10) x(4) x(9) x(1) x(8) x(4) x(9) x(10) x(2) … ● Test set (the instances that were never selected): x(3) x(5) x(6) x(7)
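
The procedure illustrated above can be sketched in a few lines of NumPy; the dataset and classifier are placeholders for illustration.

```python
# Bootstrap: sample n indices with replacement for training,
# use the never-selected (out-of-bag) instances for testing.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = len(X)
rng = np.random.default_rng(42)

boot = rng.integers(0, n, size=n)          # training indices, with replacement
oob = np.setdiff1d(np.arange(n), boot)     # out-of-bag indices -> test set

clf = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
print("Distinct instances in training set: %.1f%%"
      % (100 * len(np.unique(boot)) / n))
print("Out-of-bag error: %.3f" % (1 - clf.score(X[oob], y[oob])))
```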

Rodney Nielsen, Human Intelligence & Language Technologies Lab The Bootstrap ● Also called the 0.632 bootstrap  A particular instance has a probability of (n–1)/n of not being picked on any one draw  Thus its probability of ending up in the test data is: $(1 - \frac{1}{n})^n \approx e^{-1} \approx 0.368$  This means the training data will contain approximately 63.2% of the distinct instances
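
A quick numerical check of the figure above: the probability that a given instance is never selected, $(1 - 1/n)^n$, approaches $e^{-1} \approx 0.368$ as n grows, leaving roughly 63.2% of the distinct instances in the training set.

```python
# (1 - 1/n)^n tends to e^-1 ~ 0.368, i.e. ~63.2% of instances get picked.
import math

for n in (10, 100, 10_000):
    print(n, round((1 - 1 / n) ** n, 4))
print("e^-1 =", round(math.exp(-1), 4))
```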

Rodney Nielsen, Human Intelligence & Language Technologies Lab Credibility: Evaluating What’s Been Learned ● Issues: training, testing, tuning ● Predicting performance ● Holdout, cross-validation, bootstrap ● Comparing schemes: the t-test ● Predicting probabilities: loss functions ● Cost-sensitive measures ● Evaluating numeric prediction ● The Minimum Description Length principle

Rodney Nielsen, Human Intelligence & Language Technologies Lab Comparing Data Mining Schemes ● Frequent question: which of two learning schemes performs better? ● Note: this is domain dependent! ● Obvious way: compare 10-fold CV estimates ● How effective would this be? ● We need to show convincingly that a particular method works better

Rodney Nielsen, Human Intelligence & Language Technologies Lab Comparing DM Schemes II ● Want to show that scheme A is better than scheme B in a particular domain  For a given amount of training data  On average, across all possible training sets ● Let's assume we have an infinite amount of data from the domain:  Sample infinitely many datasets of the specified size  Obtain the CV estimate on each dataset for each scheme  Check whether the mean accuracy for scheme A is better than the mean accuracy for scheme B

Rodney Nielsen, Human Intelligence & Language Technologies Lab Paired t-test ● In practice we have limited data and a limited number of estimates for computing the mean ● Student’s t-test tells whether the means of two samples are significantly different ● In our case the samples are cross-validation estimates for different datasets from the domain ● Use a paired t-test because the individual samples are paired  The same CV folds are applied to both schemes William Gosset Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England. Obtained a post as a chemist in the Guinness brewery in Dublin in 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".

Rodney Nielsen, Human Intelligence & Language Technologies Lab Distribution of the Means ● x_1, x_2, …, x_k and y_1, y_2, …, y_k are the 2k samples for the k different datasets ● m_x and m_y are the means ● With enough samples, the mean of a set of independent samples is normally distributed ● Estimated variances of the means are σ_x²/k and σ_y²/k

Rodney Nielsen, Human Intelligence & Language Technologies Lab Student’s Distribution ● With small samples (k < 100) the mean follows Student’s distribution with k–1 degrees of freedom ● Confidence limits, assuming we have 10 estimates (9 degrees of freedom), compared with the normal distribution:
Pr[X ≥ z]    z (9 degrees of freedom)    z (normal distribution)
5%           1.83                        1.65
1%           2.82                        2.33
0.5%         3.25                        2.58
0.1%         4.30                        3.09
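
These critical values can be looked up programmatically with SciPy's inverse cumulative distribution functions; a minimal sketch:

```python
# One-sided critical values for Student's t (9 df) and the normal distribution.
from scipy import stats

for p in (0.05, 0.01, 0.005, 0.001):               # Pr[X >= z]
    print("Pr[X >= z] = %.1f%%:  t(9 df) = %.2f   normal = %.2f"
          % (100 * p, stats.t.ppf(1 - p, df=9), stats.norm.ppf(1 - p)))
```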

Rodney Nielsen, Human Intelligence & Language Technologies Lab Distribution of the Differences ● Let m_d = m_x – m_y ● The difference of the means (m_d) also has a Student’s distribution with k–1 degrees of freedom ● Let σ_d² be the variance of the difference ● The standardized version of m_d is called the t-statistic: $t = \dfrac{m_d}{\sqrt{\sigma_d^2 / k}}$ ● We use t to perform the t-test

Rodney Nielsen, Human Intelligence & Language Technologies Lab Performing the Test ● Fix a significance level α  If a difference is significant at the α% level, there is a (100–α)% chance that the true means really differ ● Divide the significance level by two because the test is two-tailed  I.e., either result could be greater than the other ● Look up the value for z that corresponds to α/2 ● If t ≤ –z or t ≥ z, then the difference is significant  I.e., the null hypothesis (that the difference is zero) can be rejected – there is less than an α% probability that the difference is due to chance / random factors
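
A sketch of the whole procedure on made-up paired CV accuracies for two schemes (the numbers are purely illustrative):

```python
# Paired t-test on k paired cross-validation estimates.
import numpy as np
from scipy import stats

acc_a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85, 0.80, 0.82])
acc_b = np.array([0.78, 0.77, 0.80, 0.79, 0.80, 0.79, 0.76, 0.82, 0.78, 0.80])

d = acc_a - acc_b
k = len(d)
t = d.mean() / np.sqrt(d.var(ddof=1) / k)      # t = m_d / sqrt(sigma_d^2 / k)
z = stats.t.ppf(1 - 0.05 / 2, df=k - 1)        # two-tailed, 5% significance

print("t = %.3f, critical value z = %.3f, significant: %s"
      % (t, z, abs(t) >= z))
print(stats.ttest_rel(acc_a, acc_b))           # SciPy gives the same statistic
```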

Rodney Nielsen, Human Intelligence & Language Technologies Lab Unpaired Observations ● If the CV estimates are from different datasets, they are no longer paired (or maybe we have k estimates for one scheme, and j estimates for the other scheme) ● Then we have to use an unpaired t-test with min(k, j) – 1 degrees of freedom ● The estimate of the variance of the difference of the means becomes: $\dfrac{\sigma_x^2}{k} + \dfrac{\sigma_y^2}{j}$
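
A sketch of the unpaired case with different numbers of estimates per scheme (illustrative numbers); note that SciPy's Welch test uses a related but slightly different degrees-of-freedom rule.

```python
# Unpaired t-test: variance of the difference is sigma_x^2/k + sigma_y^2/j.
import numpy as np
from scipy import stats

acc_a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85])   # k = 8
acc_b = np.array([0.78, 0.77, 0.80, 0.79, 0.80, 0.79])               # j = 6

var_diff = acc_a.var(ddof=1) / len(acc_a) + acc_b.var(ddof=1) / len(acc_b)
t = (acc_a.mean() - acc_b.mean()) / np.sqrt(var_diff)
print("t =", round(t, 3))

print(stats.ttest_ind(acc_a, acc_b, equal_var=False))   # Welch's unpaired test
```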

Rodney Nielsen, Human Intelligence & Language Technologies Lab Dependent Estimates ● We assumed that we have enough data to create several datasets of the desired size ● Need to re-use data if that's not the case  E.g., running cross-validations with different randomizations on the same data ● Samples become dependent → insignificant differences can become significant ● A heuristic test is the corrected resampled t-test:  Assume we use the repeated hold-out method, with n_1 instances for training and n_2 for testing  The new test statistic is: $t = \dfrac{m_d}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right)\sigma_d^2}}$
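
A sketch of the corrected statistic as a small helper function; the per-run accuracy differences and the split sizes n1, n2 are made-up values for illustration.

```python
# Corrected resampled t-test: t = m_d / sqrt((1/k + n2/n1) * sigma_d^2).
import numpy as np
from scipy import stats

def corrected_resampled_t(diffs, n1, n2):
    """t statistic for k repeated holdout runs on the same data."""
    diffs = np.asarray(diffs, dtype=float)
    k = len(diffs)
    return diffs.mean() / np.sqrt((1 / k + n2 / n1) * diffs.var(ddof=1))

diffs = [0.03, 0.02, 0.04, 0.01, 0.03, 0.02, 0.03, 0.04, 0.02, 0.03]
t = corrected_resampled_t(diffs, n1=600, n2=300)
p = 2 * stats.t.sf(abs(t), df=len(diffs) - 1)      # two-tailed p-value
print("t = %.3f, p = %.4f" % (t, p))
```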

Rodney Nielsen, Human Intelligence & Language Technologies Lab Credibility: Evaluating What’s Been Learned ● Issues: training, testing, tuning ● Predicting performance ● Holdout, cross-validation, bootstrap ● Comparing schemes: the t-test ● Predicting probabilities: loss functions ● Cost-sensitive measures ● Evaluating numeric prediction ● The Minimum Description Length principle