Error estimation Data Mining II Year 2009-10 Lluís Belanche Alfredo Vellido.

Introduction
Resampling methods:
- The holdout
- Cross-validation
  - Random subsampling
  - k-fold cross-validation
  - Leave-one-out
- The bootstrap
Error evaluation:
- Accuracy and all that
- Error estimation

Bias and variance estimates with the bootstrap

Example: estimating bias & variance
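A minimal sketch of the idea behind these two slides, with made-up data and the sample median standing in for the statistic of interest: the bootstrap recomputes the statistic on B resamples of the observed data and uses the mean shift and spread of those recomputations as estimates of its bias and variance.

import numpy as np

# Bootstrap bias and variance of a statistic (illustrative data and statistic).
rng = np.random.RandomState(0)
x = rng.normal(loc=5.0, scale=2.0, size=50)   # observed sample
theta_hat = np.median(x)                      # statistic computed on the full sample

B = 1000
boot = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                 for _ in range(B)])          # statistic on each bootstrap resample

bias_estimate = boot.mean() - theta_hat       # bootstrap estimate of the bias
variance_estimate = boot.var(ddof=1)          # bootstrap estimate of the variance
print(theta_hat, bias_estimate, variance_estimate)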

Three-way data splits (1)

Three-way data splits (2)
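A minimal sketch of a three-way split, assuming scikit-learn is available (the dataset and the 60/20/20 proportions are illustrative): the training set is used to fit models, the validation set to tune and select among them, and the test set only once, for the final error estimate.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# First hold out the test set, then carve a validation set out of the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Train on X_train, tune/select on X_val, report the final error on X_test only once.
print(len(X_train), len(X_val), len(X_test))   # 300, 100, 100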

Summary (data sample of size n)
- Resubstitution: optimistically biased estimate, especially when the ratio of n to the dimension is small
- Holdout (if iterated, we get random subsampling): pessimistically biased estimate; different partitions yield different estimates
- K-fold CV (K ≪ n): higher bias than LOOCV, lower than holdout; lower variance than LOOCV
- LOOCV (n-fold CV): unbiased, but large variance
- Bootstrap: lower variance than LOOCV; useful for very small n, at the cost of a higher computational burden
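The following sketch puts these estimators side by side, assuming scikit-learn, a synthetic dataset, and a small decision tree as a stand-in classifier; the exact numbers are not meaningful, only the comparison of the schemes.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# Resubstitution: train and test on the same data (optimistically biased).
resub = 1 - clf.fit(X, y).score(X, y)

# Holdout: a single 70/30 split (pessimistically biased, partition-dependent).
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
holdout = 1 - clf.fit(Xtr, ytr).score(Xte, yte)

# 10-fold CV and LOOCV: average the error over the folds.
cv10 = 1 - cross_val_score(clf, X, y, cv=10).mean()
loo = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# Bootstrap: train on a resample, test on the out-of-bag points, average.
rng = np.random.RandomState(0)
boot = []
for _ in range(100):
    idx = rng.choice(len(X), size=len(X), replace=True)
    oob = np.setdiff1d(np.arange(len(X)), idx)
    boot.append(1 - clf.fit(X[idx], y[idx]).score(X[oob], y[oob]))

print(resub, holdout, cv10, loo, np.mean(boot))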

Error Evaluation
Given:
- A hypothesis h(x): X → C in hypothesis space H, mapping features x to one of a set of classes C
- A data sample S of size n
Questions:
- What is the error of h on unseen data?
- If we have two competing hypotheses, which one will be better on unseen data?
- How do we compare two learning algorithms in the face of limited data?
- How certain are we about the answers to these questions?

Apparent & True Error
We can define two errors:
1) error(h|S), the apparent error, measured on the sample S:
   error(h|S) = (1/n) Σ_{x ∈ S} I[h(x) ≠ f(x)]
2) error(h|P), the true error on data sampled from the distribution P(x):
   error(h|P) = E_{x ~ P}[ I[h(x) ≠ f(x)] ] = P(h(x) ≠ f(x))
where f(x) is the true hypothesis and I[·] is the indicator function.

A note on True Error
The true error need not be zero, not even if we knew the probabilities P(x). Causes:
- Lack of relevant features
- Intrinsic randomness of the process
A consequence is that we should not attempt to fit hypotheses with zero apparent error, i.e. error(h|S) = 0. Quite the contrary: we should favor hypotheses such that error(h|S) ≈ error(h|P).
- If error(h|S) >> error(h|P), then h is underfitting the sample S
- If error(h|S) << error(h|P), then h is overfitting the sample S

How to estimate the True Error (te)?
- Estimate te by tê, the error measured on a test set TE ⊂ S not used to build h
- Note that tê is a random variable, so we attach a confidence interval (CI) to it
- Let TE⁻ be the subset of TE wrongly predicted by h, and let n = |S|, t = |TE|
- |TE⁻| follows a binomial distribution with t trials and success probability te
- The maximum-likelihood estimate of te is tê = |TE⁻| / t
- This estimator is unbiased: E[tê] = te, with Var[tê] = te(1 − te)/t

Confidence Intervals for te
With N% confidence, te = error(h|P) is contained in the interval
tê − s ≤ te ≤ tê + s, where s = z_N √( tê(1 − tê)/t ).
In words, te is within z_N standard errors of the estimate. This holds because, when tê(1 − tê)·t > 5 or t > 30, it is safe to approximate the binomial by a Gaussian, for which z-values of the standard Normal(0, 1) distribution can be used.
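A short sketch of this interval, assuming SciPy is available; the function name and its defaults are illustrative rather than taken from the slides.

from math import sqrt
from scipy.stats import norm

def error_confidence_interval(n_errors, t, confidence=0.95):
    """Normal-approximation CI for the true error, given n_errors mistakes on a test set of size t."""
    te_hat = n_errors / t                       # ML estimate of the true error
    z = norm.ppf(1 - (1 - confidence) / 2)      # two-sided z-value, e.g. 1.96 for 95%
    s = z * sqrt(te_hat * (1 - te_hat) / t)     # z standard errors
    return te_hat - s, te_hat + s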

Example 1
- n = |S| = 1,000; t = |TE| = 250 (25% of S)
- Suppose |TE⁻| = 50 (our h hits 80% of TE)
- Then tê = 0.2. For a CI at the 95% level, z_0.95 = 1.96 and te is in [0.15, 0.25]
- Exercise: recompute the CI at the 99% level, using z_0.99 = 2.576
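A quick numeric check of Example 1 and its exercise, assuming SciPy; the z-values are drawn from the standard normal rather than typed in.

from math import sqrt
from scipy.stats import norm

# Example 1: 50 errors on a test set of t = 250 cases.
te_hat, t = 50 / 250, 250
for conf in (0.95, 0.99):
    z = norm.ppf(1 - (1 - conf) / 2)            # 1.96 and about 2.576
    s = z * sqrt(te_hat * (1 - te_hat) / t)
    print(conf, round(te_hat - s, 3), round(te_hat + s, 3))
# 95%: roughly [0.15, 0.25]; 99%: a wider interval, as the exercise asks for.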

Example 2: comparing two hypotheses
- Assume we need to compare two hypotheses, h1 and h2, on the same data
- We have t = |TE| = 100, on which h1 makes 10 errors and h2 makes 13
- The CIs at the 95% level (α = 0.05) are [0.04, 0.16] for h1 and [0.06, 0.20] for h2
- Since the intervals overlap, we cannot conclude that h1 is better than h2
- Note: the above is often written as 10% ± 6% (h1) and 13% ± 7% (h2)

Size does matter after all…
- How large would TE need to be (say T) to affirm that h1 is better than h2?
- Assume both h1 and h2 keep the same error rates
- Force the upper limit (UL) of the CI for h1 to fall below the lower limit (LL) of the CI for h2:
  - UL of the CI for h1 is 0.10 + 1.967 √(0.10 · 0.90 / T)
  - LL of the CI for h2 is 0.13 − 1.967 √(0.13 · 0.87 / T)
- It turns out that T > 1,742 (the old size was 100!)
- The probability that this conclusion fails is at most (1 − α)/2
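A rough numeric check of the required test-set size, assuming both error rates stay fixed at 10% and 13%; the z-value 1.967 is the one used on this slide.

from math import sqrt

# Smallest T for which the CIs of h1 (10% error) and h2 (13% error) separate.
z, e1, e2 = 1.967, 0.10, 0.13
T = 100
while e1 + z * sqrt(e1 * (1 - e1) / T) >= e2 - z * sqrt(e2 * (1 - e2) / T):
    T += 1
print(T)   # on the order of 1,700+, versus the original test set of 100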

Paired t-test
- Chunk the data set S into disjoint subsets s_1, ..., s_k with |s_i| > 30
- Train classifiers h1 and h2 on every S \ s_i
- On each subset s_i compute the errors and define the difference d_i = error(h1 | s_i) − error(h2 | s_i)
- Now compute the mean difference d̄ = (1/k) Σ_i d_i and its standard error s_d̄ = √( Σ_i (d_i − d̄)² / (k(k − 1)) )
- With N% confidence, the difference in error between h1 and h2 is d̄ ± t_{N,k−1} · s_d̄, where t_{N,k−1} is the critical value of the Student t distribution with k − 1 degrees of freedom
- Since error(h1 | s_i) and error(h2 | s_i) are both approximately Normal, their difference is approximately Normal
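A minimal sketch of this test, assuming SciPy; the per-subset error lists are placeholders, not numbers from the slides.

import numpy as np
from scipy.stats import t as student_t

err_h1 = np.array([0.12, 0.10, 0.15, 0.11, 0.13])    # error(h1 | s_i)
err_h2 = np.array([0.14, 0.13, 0.16, 0.15, 0.14])    # error(h2 | s_i)

d = err_h1 - err_h2                                  # per-subset differences d_i
k = len(d)
d_bar = d.mean()                                     # mean difference
s_dbar = np.sqrt(((d - d_bar) ** 2).sum() / (k * (k - 1)))   # its standard error
t_crit = student_t.ppf(0.975, df=k - 1)              # two-sided 95% critical value
print(d_bar - t_crit * s_dbar, d_bar + t_crit * s_dbar)      # CI for the true difference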

Exercise: the real case…
- A team of doctors has its own classifier and a data sample of size 500
  - They split it into a TR of size 300 and a TE of size 200
  - They get an error of 22% on TE
  - They ask us for further advice…
- We design a second classifier
  - It has an error of 15% on the same TE

Exercise: the real case… Answer the following questions:
1. Will you affirm that yours is better than theirs?
2. How large would TE need to be to (very reasonably) affirm that yours is better than theirs?
3. What do you deduce from the above?
4. Suppose we move to 10-fold CV on the entire data set.
   a. Give a new estimate of the error of your classifier
   b. Perform a statistical test to check whether there is any real difference
The doctors' classifier errors: 0.22, 0.22, 0.29, 0.19, 0.23, 0.22, 0.20, 0.25, 0.19, 0.19
Your classifier's errors: 0.15, 0.17, 0.21, 0.14, 0.13, 0.15, 0.14, 0.19, 0.11, 0.11
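For question 4b, the ten fold errors quoted above can be fed to a ready-made paired test; a sketch assuming SciPy, with scipy.stats.ttest_rel standing in for the hand computation of the previous slide.

from scipy.stats import ttest_rel

# Paired t-test on the 10-fold CV errors from the exercise.
doctors = [0.22, 0.22, 0.29, 0.19, 0.23, 0.22, 0.20, 0.25, 0.19, 0.19]
ours    = [0.15, 0.17, 0.21, 0.14, 0.13, 0.15, 0.14, 0.19, 0.11, 0.11]

t_stat, p_value = ttest_rel(doctors, ours)
print(t_stat, p_value)    # a small p-value supports a real difference in error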

What is Accuracy?
Accuracy = (no. of correct predictions) / (no. of predictions) = (TP + TN) / (TP + TN + FP + FN)

Example (refers to a comparison of four classifiers, A to D, shown on the original slide)
- Clearly, B, C, and D are all better than A
- Is B better than C and D? Is C better than B and D? Is D better than B and C?
- Accuracy may not tell the whole story

What is Sensitivity (aka Recall)?
Sensitivity = (no. of correct positive predictions) / (no. of positives) = TP / (TP + FN)   (wrt positives)
Sometimes, sensitivity wrt the negatives is termed specificity.

What is Specificity (aka Precision)?
Precision = (no. of correct positive predictions) / (no. of positive predictions) = TP / (TP + FP)   (wrt positives)

Precision-Recall Trade-off
- A predicts better than B if A has both better recall and better precision than B
- There is a trade-off between recall and precision
- In some applications, once you reach a satisfactory precision, you optimize for recall
- In some applications, once you reach a satisfactory recall, you optimize for precision
(Plot: precision vs. recall)

Comparing prediction performance
- Accuracy is the obvious measure, but it conveys the right intuition only when the positive and negative populations are roughly equal in size
- Recall and precision together form a better measure, but what do you do when A has better recall than B and B has better precision than A?

F-measure
The harmonic mean of recall and precision:
F = 2 · recall · precision / (recall + precision)   (wrt positives)
Its value does not always accord with intuition.
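The four measures above, computed from raw confusion-matrix counts; the TP/TN/FP/FN values are made up for illustration.

# Accuracy, recall, precision and F-measure from confusion-matrix counts.
TP, TN, FP, FN = 40, 45, 5, 10    # illustrative counts

accuracy  = (TP + TN) / (TP + TN + FP + FN)
recall    = TP / (TP + FN)                     # sensitivity, wrt positives
precision = TP / (TP + FP)
f_measure = 2 * recall * precision / (recall + precision)
print(accuracy, recall, precision, f_measure)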

Abstract model of a classifier
- Given a test observation x, compute the prediction h(x)
- Predict x as negative if h(x) < t, and as positive if h(x) > t
- t is the decision threshold of the classifier
- Changing t affects the recall and precision, and hence the accuracy, of the classifier

ROC Curves
- By changing t, we get a range of sensitivities and specificities of a classifier
- This leads to the ROC curve, which plots sensitivity, i.e. P(TP), against 1 − specificity, i.e. P(FP)
- A predicts better than B if A has better sensitivity than B at most specificities
- Then, the larger the area under the ROC curve, the better
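A minimal sketch of an ROC curve built by sweeping the threshold t over the scores h(x); the scores and labels are made up, and scikit-learn's roc_curve/auc would yield the same points.

import numpy as np

# ROC curve by threshold sweep: one (1 - specificity, sensitivity) point per threshold.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])   # h(x), illustrative
labels = np.array([1,   1,   0,   1,   0,    1,   0,   0])     # true classes

points = []
for t in np.concatenate(([np.inf], np.unique(scores))):
    pred = scores >= t                                  # positive if h(x) >= t
    tp = np.sum(pred & (labels == 1)); fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1)); tn = np.sum(~pred & (labels == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))     # (1 - specificity, sensitivity)

points.sort()
auc = np.trapz([tpr for _, tpr in points], [fpr for fpr, _ in points])  # area under the curve
print(points)
print(auc)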