Evaluation.

Evaluation

Evaluation
- How predictive is the model we learned?
- Error on the training data is not a good indicator of performance on future data.
- Simple solution that can be used if lots of (labeled) data is available: split the data into a training set and a test set.
- However, (labeled) data is usually limited, so more sophisticated techniques need to be used.

Training and testing
- Natural performance measure for classification problems: error rate.
- Success: the instance's class is predicted correctly. Error: the instance's class is predicted incorrectly.
- Error rate: proportion of errors made over the whole set of instances.
- Resubstitution error: error rate obtained from the training data. Resubstitution error is (hopelessly) optimistic!
- Test set: independent instances that have played no part in the formation of the classifier.
- Assumption: both training data and test data are representative samples of the underlying problem.

Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier.
- Generally, the larger the training set, the better the classifier; the larger the test set, the more accurate the error estimate.
- Holdout procedure: method of splitting the original data into a training set and a test set.
- Dilemma: ideally both the training set and the test set should be large!

Predicting performance
- Assume the estimated error rate is 25%. How close is this to the true error rate? It depends on the amount of test data.
- Prediction is just like tossing a (biased!) coin: "heads" is a success, "tails" is an error.
- In statistics, a succession of independent events like this is called a Bernoulli process.
- Statistical theory provides us with confidence intervals for the true underlying proportion.

Confidence intervals
- We can say: p lies within a certain specified interval with a certain specified confidence.
- Example: S = 750 successes in N = 1000 trials. Estimated success rate: 75%. How close is this to the true success rate p? Answer: with 80% confidence, p ∈ [73.2%, 76.7%].
- Another example: S = 75 and N = 100. With 80% confidence, p ∈ [69.1%, 80.1%], i.e. the probability that p ∈ [69.1%, 80.1%] is 0.8.
- The bigger N is, the more confident we are, i.e. the surrounding interval is narrower. Above, for N = 100 we were less confident than for N = 1000.
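As a sanity check (not part of the original slides), here is a minimal Python sketch that reproduces these intervals, assuming the figures come from the score (Wilson) form of the normal approximation; the helper name wilson_interval is introduced here for illustration.

```python
from math import sqrt
from scipy.stats import norm

def wilson_interval(successes, trials, confidence=0.80):
    """Score (Wilson) confidence interval for a Bernoulli proportion."""
    f = successes / trials                      # observed success rate
    z = norm.ppf(1 - (1 - confidence) / 2)      # two-sided z value (~1.28 for 80%)
    center = f + z * z / (2 * trials)
    half = z * sqrt(f * (1 - f) / trials + z * z / (4 * trials * trials))
    denom = 1 + z * z / trials
    return (center - half) / denom, (center + half) / denom

print(wilson_interval(750, 1000))   # ~ (0.732, 0.767)
print(wilson_interval(75, 100))     # ~ (0.691, 0.801)
```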

Mean and Variance
- Let Y be the random variable with possible values 1 for success and 0 for error. Let the probability of success be p; then the probability of error is q = 1 − p.
- What's the mean? 1·p + 0·q = p.
- What's the variance? (1 − p)²·p + (0 − p)²·q = q²·p + p²·q = pq(q + p) = pq.
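A quick empirical check of these two formulas (not from the slides); the choice of p = 0.7 and the sample size are arbitrary.

```python
import numpy as np

p = 0.7
rng = np.random.default_rng(0)
y = rng.binomial(1, p, size=1_000_000)   # many Bernoulli(p) draws

print(y.mean())   # ~ 0.70 = p
print(y.var())    # ~ 0.21 = p * (1 - p) = pq
```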

Estimating p
- We don't know p; our goal is to estimate it. For this we make N trials, i.e. tests. The more trials we do, the more confident we are.
- Let S be the random variable denoting the number of successes, i.e. S is the sum of N sampled values of Y.
- We approximate p with the success rate in N trials, i.e. S/N.
- By the Central Limit Theorem, when N is big, the probability distribution of the random variable f = S/N is approximated by a normal distribution with mean p and variance pq/N.

Estimating p
- A c% confidence interval [−z ≤ X ≤ z] for a random variable X with mean 0 is given by: Pr[−z ≤ X ≤ z] = c.
- With a symmetric distribution: Pr[−z ≤ X ≤ z] = 1 − 2 · Pr[X ≥ z].
- Confidence limits are tabulated for the normal distribution with mean 0 and variance 1; for example, Pr[−1.65 ≤ X ≤ 1.65] = 90%.
- To use this we have to transform our random variable f = S/N so that it has mean 0 and unit variance.

Estimating p
- Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%.
- To use this we transform our random variable S/N to have mean 0 and unit variance: Pr[−1.65 ≤ (S/N − p) / σ_{S/N} ≤ 1.65] = 90%, where σ_{S/N} = sqrt(pq/N).
- Now we solve two equations: (S/N − p) / σ_{S/N} = 1.65 and (S/N − p) / σ_{S/N} = −1.65.

Estimating p
- Let N = 100 and S = 70.
- σ_{S/N} is sqrt(pq/N), and we approximate it by sqrt(p'(1 − p')/N), where p' is the estimate of p, i.e. 0.7. So σ_{S/N} is approximated by sqrt(0.7 · 0.3 / 100) ≈ 0.046.
- The two equations become: (0.7 − p) / 0.046 = 1.65, giving p = 0.7 − 1.65 · 0.046 = 0.624, and (0.7 − p) / 0.046 = −1.65, giving p = 0.7 + 1.65 · 0.046 = 0.776.
- Thus we say: with 90% confidence, the success rate p of the classifier satisfies 0.624 ≤ p ≤ 0.776.
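A minimal sketch of this calculation using the simple normal approximation above; the helper name approx_ci is introduced here for illustration.

```python
from math import sqrt

def approx_ci(successes, trials, z):
    """Approximate CI for p, using sigma ~= sqrt(p'(1 - p') / N)."""
    p_hat = successes / trials
    sigma = sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - z * sigma, p_hat + z * sigma

# 90% confidence corresponds to z = 1.65
print(approx_ci(70, 100, 1.65))   # ~ (0.624, 0.776)
```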

Exercise
- Suppose I want to be 95% confident in my estimate. Looking at a detailed table we find: Pr[−2 ≤ X ≤ 2] ≈ 95%.
- Normalizing S/N, we need to solve: (S/N − p) / σ_{S/N} = 2 and (S/N − p) / σ_{S/N} = −2.
- We approximate σ_{S/N} with sqrt(p'(1 − p')/N), where p' is the estimate of p from the trials, i.e. S/N.
- So we need to solve: (p' − p) / sqrt(p'(1 − p')/N) = ±2, giving p = p' ∓ 2 · sqrt(p'(1 − p')/N).

Exercise
- Suppose N = 1000 trials and S = 590 successes, so p' = S/N = 590/1000 = 0.59.
- Then sqrt(p'(1 − p')/N) = sqrt(0.59 · 0.41 / 1000) ≈ 0.0156, so with 95% confidence 0.59 − 2 · 0.0156 ≤ p ≤ 0.59 + 2 · 0.0156, i.e. 0.559 ≤ p ≤ 0.621.
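The same kind of sketch applied to the exercise, again assuming the simple normal approximation:

```python
from math import sqrt

p_hat = 590 / 1000                            # observed success rate
sigma = sqrt(p_hat * (1 - p_hat) / 1000)      # ~ 0.0156
print(p_hat - 2 * sigma, p_hat + 2 * sigma)   # ~ (0.559, 0.621)
```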

Holdout estimation
- What to do if the amount of data is limited?
- The holdout method reserves a certain amount for testing and uses the remainder for training. Usually: one third for testing, the rest for training.
- Problem: the samples might not be representative. Example: a class might be missing in the test data.
- Advanced version uses stratification: ensures that each class is represented with approximately equal proportions in both subsets.
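One possible way to do a stratified holdout split, sketched with scikit-learn (an assumption, not something the slides prescribe); X and y stand in for any feature matrix and label vector, here the iris data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out one third for testing; stratify so class proportions match in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

print(len(X_train), len(X_test))   # 100 training instances, 50 test instances
```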

Repeated holdout method
- The holdout estimate can be made more reliable by repeating the process with different subsamples.
- In each iteration, a certain proportion is randomly selected for training (possibly with stratification).
- The error rates from the different iterations are averaged to yield an overall error rate.
- This is called the repeated holdout method.
- Still not optimal: the different test sets overlap. Can we prevent overlapping?

Cross-validation
- Cross-validation avoids overlapping test sets.
- First step: split the data into k subsets of equal size.
- Second step: use each subset in turn for testing and the remainder for training.
- This is called k-fold cross-validation.
- Often the subsets are stratified before the cross-validation is performed.
- The error estimates are averaged to yield an overall error estimate.
- Standard method for evaluation: stratified 10-fold cross-validation.
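A sketch of stratified 10-fold cross-validation, again assuming scikit-learn; the decision tree classifier is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified 10-fold cross-validation: each fold preserves the class proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(1 - scores.mean())   # error estimate averaged over the 10 folds
```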

Leave-One-Out cross-validation
- Leave-One-Out is a particular form of cross-validation: set the number of folds to the number of training instances.
- I.e., for n training instances, build the classifier n times.
- Makes best use of the data and involves no random subsampling.
- But it is computationally expensive.

Leave-One-Out-CV and stratification
- Disadvantage of Leave-One-Out-CV: stratification is not possible. It guarantees a non-stratified sample because there is only one instance in the test set!
- Extreme example: a completely random dataset split equally into two classes.
- The best inducer predicts the majority class, giving 50% accuracy on fresh data, but the Leave-One-Out-CV estimate is 100% error!
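A small simulation of this extreme example, assuming a majority-class predictor and NumPy; the dataset size of 100 is arbitrary. Removing the held-out instance always makes its class the minority in the training data, so the majority-class predictor is always wrong.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
y = rng.permutation(np.repeat([0, 1], n // 2))   # random dataset, two equally sized classes

errors = 0
for i in range(n):
    train = np.delete(y, i)                      # leave instance i out
    majority = np.bincount(train).argmax()       # majority-class "inducer"
    errors += int(majority != y[i])              # held-out class is now the minority

print(errors / n)   # 1.0 -> Leave-One-Out-CV estimates 100% error
```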

The bootstrap
- Cross-validation uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set.
- The bootstrap uses sampling with replacement to form the training set.
- Sample a dataset of n instances n times with replacement to form a new dataset of n instances, and use this as the training set.
- Use the instances from the original dataset that don't occur in the new training set for testing.
- Also called the 0.632 bootstrap (why?)

The 0.632 bootstrap
- Also called the 0.632 bootstrap.
- A particular instance has a probability of 1 − 1/n of not being picked in a single draw.
- Thus its probability of ending up in the test data is: (1 − 1/n)^n ≈ e^(−1) ≈ 0.368.
- This means the training data will contain approximately 63.2% of the distinct instances.
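A quick numerical check of the 0.632 figure, assuming NumPy; the dataset size is arbitrary.

```python
import numpy as np

n = 10_000
rng = np.random.default_rng(0)

# Bootstrap sample: draw n indices with replacement from an n-instance dataset.
sample = rng.integers(0, n, size=n)
distinct_fraction = len(np.unique(sample)) / n

print(distinct_fraction)        # ~ 0.632 of the instances appear in the training set
print(1 - (1 - 1 / n) ** n)     # ~ 0.632, i.e. 1 - e^(-1)
```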

Estimating error with the bootstrap
- The error estimate on the test data will be very pessimistic, since the classifier is trained on just ~63% of the instances.
- Therefore, combine it with the resubstitution error: err = 0.632 · err_test + 0.368 · err_train.
- The resubstitution error gets less weight than the error on the test data.
- Repeat the process several times with different replacement samples and average the results.
- Probably the best way of estimating performance for very small datasets.
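A sketch of the repeated 0.632 bootstrap estimate under stated assumptions (scikit-learn decision tree as the learner, 50 bootstrap rounds, iris data); it is an illustration of the weighting above, not a canonical implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = len(y)
rng = np.random.default_rng(0)
estimates = []

for _ in range(50):                                  # repeat with different replacement samples
    boot = rng.integers(0, n, size=n)                # training indices, drawn with replacement
    test = np.setdiff1d(np.arange(n), boot)          # instances not in the bootstrap sample
    clf = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    err_test = np.mean(clf.predict(X[test]) != y[test])
    err_train = np.mean(clf.predict(X[boot]) != y[boot])   # resubstitution error
    estimates.append(0.632 * err_test + 0.368 * err_train)

print(np.mean(estimates))   # averaged 0.632 bootstrap error estimate
```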

Comparing data mining schemes
- Frequent question: which of two learning schemes performs better? Note: this is domain dependent!
- Obvious way: compare the 10-fold cross-validation estimates.
- Problem: variance in the estimate. The variance can be reduced using repeated cross-validation.
- However, we still don't know whether the results are reliable (we need a statistical test for that).

Counting the cost
- In practice, different types of classification errors often incur different costs.
- Examples: cancer detection (a classifier that always says "not cancer" can be correct 99% of the time), loan decisions, fault diagnosis, promotional mailing.

Counting the cost
The confusion matrix:

                    Predicted: Yes     Predicted: No
  Actual: Yes       True positive      False negative
  Actual: No        False positive     True negative
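A minimal sketch that counts the four cells by hand; the label vectors below are made up for illustration.

```python
actual    = [1, 1, 0, 1, 0, 0, 1, 0]   # made-up ground truth (1 = yes, 0 = no)
predicted = [1, 0, 0, 1, 1, 0, 1, 0]   # made-up classifier output

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))

print(tp, fn, fp, tn)   # the four cells of the confusion matrix: 3 1 1 3
```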

Paired t-test
- Student's t-test tells us whether the means of two samples are significantly different.
- In our case the samples are cross-validation estimates for different datasets from the domain.
- We use a paired t-test because the individual samples are paired: the same cross-validation split is applied to both schemes.

Distribution of the means
- x1, x2, …, xk and y1, y2, …, yk are the 2k samples for the k different datasets; mx and my are their means.
- With enough samples, the mean of a set of independent samples is normally distributed.
- The estimated variances of the means are sx²/k and sy²/k.
- If μx and μy are the true means, then (mx − μx) / sqrt(sx²/k) and (my − μy) / sqrt(sy²/k) are approximately normally distributed with mean 0 and variance 1.

Distribution of the differences
- Let md = mx − my. The difference of the means, md, has a Student's distribution with k − 1 degrees of freedom.
- Let sd² be the estimated variance of the differences.
- The standardized version of md is called the t-statistic: t = md / sqrt(sd²/k).

Performing the test
- Fix a significance level α. If a difference is significant at the α% level, there is a (100 − α)% chance that the true means differ.
- Divide the significance level by two because the test is two-tailed.
- Look up the value z that corresponds to α/2.
- If t ≤ −z or t ≥ z, then the difference is significant, i.e. the null hypothesis (that the difference is zero) can be rejected.

Example
We have compared two classifiers through cross-validation on 10 different datasets. The error rates are:

  Dataset   Classifier A   Classifier B   Difference
  1         10.6           10.2           0.4
  2          9.8            9.4           0.4
  3         12.3           11.8           0.5
  4          9.7            9.1           0.6
  5          8.8            8.3           0.5
  6         10.6           10.2           0.4
  7          9.8            9.4           0.4
  8         12.3           11.8           0.5
  9          9.7            9.1           0.6
  10         8.8            8.3           0.5

Example
- md = 0.48 and sd = 0.0789, so t = md / sqrt(sd²/k) = 0.48 / (0.0789 / sqrt(10)) ≈ 19.24.
- The critical value of t for a two-tailed statistical test with α = 10% and 9 degrees of freedom is 1.83.
- 19.24 is way bigger than 1.83, so classifier B is significantly better than A.
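A sketch of this computation in Python, assuming NumPy and SciPy are available; the data comes from the table above.

```python
import numpy as np
from scipy.stats import t as t_dist, ttest_rel

a = np.array([10.6, 9.8, 12.3, 9.7, 8.8, 10.6, 9.8, 12.3, 9.7, 8.8])  # Classifier A
b = np.array([10.2, 9.4, 11.8, 9.1, 8.3, 10.2, 9.4, 11.8, 9.1, 8.3])  # Classifier B

d = a - b
k = len(d)
m_d = d.mean()                       # 0.48
s_d = d.std(ddof=1)                  # ~ 0.0789 (sample standard deviation)
t_stat = m_d / (s_d / np.sqrt(k))    # ~ 19.24

crit = t_dist.ppf(1 - 0.10 / 2, df=k - 1)   # ~ 1.83 for alpha = 10%, 9 degrees of freedom
print(t_stat, crit)

# The same test via SciPy's built-in paired t-test:
print(ttest_rel(a, b))
```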