Error Estimation
Data Mining II
Lluís Belanche, Alfredo Vellido
Introduction
Resampling methods:
  The holdout
  Cross-validation: random subsampling, k-fold cross-validation, leave-one-out
  The bootstrap
Error evaluation:
  Accuracy and all that
  Error estimation
Bias and variance estimates with the bootstrap
Example: estimating bias & variance
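The worked example from the slide is not reproduced here, but the following minimal Python sketch (not from the original deck) shows the generic bootstrap recipe: resample the data with replacement, recompute the statistic on each bootstrap sample, and read bias and variance off the resampling distribution. The choice of statistic (the plug-in variance of a Gaussian sample) and the sample size are made up for illustration.

```python
# Bootstrap estimates of the bias and variance of a statistic.
# Illustrative sketch: the statistic is the (biased) plug-in variance.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=50)       # one observed sample, n = 50

def statistic(sample):
    # Plug-in variance (divides by n, hence biased downward).
    return np.mean((sample - sample.mean()) ** 2)

theta_hat = statistic(x)                          # estimate on the original sample

B = 2000                                          # number of bootstrap resamples
boot = np.array([statistic(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

bias_boot = boot.mean() - theta_hat               # bootstrap estimate of the bias
var_boot = boot.var(ddof=1)                       # bootstrap estimate of the variance

print(f"estimate           : {theta_hat:.3f}")
print(f"bootstrap bias     : {bias_boot:.3f}")
print(f"bootstrap variance : {var_boot:.3f}")
print(f"bias-corrected     : {theta_hat - bias_boot:.3f}")
```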
Three-way data splits (1)
Three-way data splits (2)
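As a concrete illustration of a three-way split, here is a small sketch that is not part of the slides: a synthetic dataset from scikit-learn's make_classification, a k-NN model family, model selection on the validation set, and a final error estimate on the untouched test set. The dataset, model family and hyper-parameter grid are all made up for illustration; one could also refit on train plus validation before the final test evaluation.

```python
# Three-way data split: train / validation / test.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# 60% train, 20% validation, 20% test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Model selection uses the validation set only.
best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7, 9):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

# Final error estimate on the test set, touched only once.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print(f"chosen k = {best_k}, test error = {1 - final.score(X_te, y_te):.3f}")
```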
Summary (data sample of size n):
  Resubstitution: optimistically-biased estimate, especially when the ratio of n to dimension is small
  Holdout (if iterated we get random subsampling): pessimistically-biased estimate; different partitions yield different estimates
  K-fold CV (K ≪ n): higher bias than LOOCV, lower than holdout; lower variance than LOOCV
  LOOCV (n-fold CV): unbiased, but large variance
  Bootstrap: lower variance than LOOCV; useful for very small n; computational burden
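For reference, a brief sketch contrasting three of the estimators summarized above (holdout, 10-fold CV, LOOCV) for the same classifier; the synthetic dataset and the logistic-regression model are made up for illustration.

```python
# Holdout, 10-fold CV and LOOCV error estimates for the same classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=200, n_features=8, random_state=1)
clf = LogisticRegression(max_iter=1000)

# Holdout (2/3 train, 1/3 test).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=1)
holdout_err = 1 - clf.fit(X_tr, y_tr).score(X_te, y_te)

# 10-fold cross-validation.
cv10_err = 1 - cross_val_score(clf, X, y,
                               cv=KFold(10, shuffle=True, random_state=1)).mean()

# Leave-one-out (n-fold) cross-validation.
loo_err = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

print(f"holdout: {holdout_err:.3f}  10-fold CV: {cv10_err:.3f}  LOOCV: {loo_err:.3f}")
```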
Error evaluation
Given:
  a hypothesis h: X → C, in hypothesis space H, mapping features x to one of a number of classes
  a data sample S of size n
Questions:
  What is the error of h on unseen data?
  If we have two competing hypotheses, which one will be better on unseen data?
  How do we compare two learning algorithms in the face of limited data?
  How certain are we about the answers to these questions?
Apparent & true error
We can define two errors:
1) error(h|S) is the apparent error, measured on the sample S:
     error(h|S) = (1/n) · Σ_{x ∈ S} I[ h(x) ≠ f(x) ]
2) error(h|P) is the true error on data sampled from the distribution P(x):
     error(h|P) = P_{x ~ P(x)}( h(x) ≠ f(x) )
where f(x) is the true hypothesis and I[·] is the indicator function.
A note on the true error
The true error need not be zero, not even if we knew the probabilities P(x). Causes:
  lack of relevant features
  intrinsic randomness of the process
A consequence is that we should not attempt to fit hypotheses with zero apparent error, i.e. error(h|S) = 0. Quite the contrary: we should favor hypotheses such that error(h|S) ≈ error(h|P).
If error(h|S) >> error(h|P), then h is underfitting the sample S.
If error(h|S) << error(h|P), then h is overfitting the sample S.
How to estimate the true error (te)?
Estimate te on a test set TE ⊆ S; note that the estimate is a random variable, so we will also want a confidence interval (CI).
Let n = |S| and t = |TE|, and let TE⁻ be the subset of TE wrongly predicted by h.
|TE⁻| follows a binomial distribution B(t, te).
The ML estimate of te is t̂e = |TE⁻| / t.
This estimator is unbiased: E[t̂e] = te, and Var[t̂e] = te(1 − te)/t.
Confidence intervals for te
With N% confidence, te = error(h|P) is contained in the interval
  t̂e − s ≤ te ≤ t̂e + s,  where s = z_N · √( t̂e(1 − t̂e)/t )
In words, te is within z_N standard errors of the estimate. This is because, for t·t̂e(1 − t̂e) > 5 (or t > 30), it is safe to approximate the binomial by a Gaussian, Normal(0,1) after standardizing, for which we can compute z-values.
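A small helper implementing this normal-approximation interval; it is a sketch, with the function name and defaults chosen here rather than taken from the slides. The numbers of Example 1 below are used as the usage case.

```python
# Normal-approximation confidence interval for the true error te,
# given e misclassifications on a test set of t examples.
from math import sqrt
from statistics import NormalDist

def error_ci(e, t, confidence=0.95):
    te_hat = e / t                                        # ML estimate of te
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # two-sided z value
    s = z * sqrt(te_hat * (1 - te_hat) / t)               # z standard errors
    return te_hat, (te_hat - s, te_hat + s)

# Example 1: t = 250 test examples, 50 errors -> te_hat = 0.2, CI ~ [0.15, 0.25]
print(error_ci(50, 250, confidence=0.95))
```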
Example 1
n = |S| = 1,000; t = |TE| = 250 (25% of S). Suppose |TE⁻| = 50 (our h hits 80% of TE).
Then t̂e = 0.2. For a CI at the 95% level, z_0.95 = 1.96 and te is in [0.15, 0.25].
Exercise: recompute the CI at the 99% level, using z_0.99 = 2.576.
Example 2: comparing two hypotheses
Assume we need to compare two hypotheses h1 and h2 on the same data.
We have t = |TE| = 100, on which h1 makes 10 errors and h2 makes 13.
The CIs at the 95% level (α = 0.05) are:
  [0.04, 0.16] for h1
  [0.06, 0.20] for h2
We cannot conclude that h1 is better than h2.
Note: the above is often written 10% ± 6% (h1) and 13% ± 7% (h2).
Size does matter after all …
How large would TE need to be (say T) to affirm that h1 is better than h2?
Assume both h1 and h2 keep the same accuracy.
Force the upper limit (UL) of the CI for h1 to fall below the lower limit (LL) of the CI for h2:
  UL of the CI for h1 is 0.10 + z_0.95 · √(0.10 · 0.90 / T)
  LL of the CI for h2 is 0.13 − z_0.95 · √(0.13 · 0.87 / T)
It turns out that T > 1,742 (the old size was 100!).
The probability that this conclusion fails is at most α = 0.05, since each interval misses in the relevant direction with probability at most α/2.
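A rough numerical check of this kind of size calculation (a sketch, not part of the slides): it searches for the smallest T at which the two 95% intervals stop overlapping. The exact threshold is sensitive to the z value and the rounding used, so it lands in the neighbourhood of the figure quoted above rather than exactly on it.

```python
# Smallest test-set size T at which the 95% CI of h1 (10% error) lies
# entirely below the 95% CI of h2 (13% error), assuming both error rates hold.
from math import sqrt

def separated(T, p1=0.10, p2=0.13, z=1.96):
    ul1 = p1 + z * sqrt(p1 * (1 - p1) / T)    # upper limit for h1
    ll2 = p2 - z * sqrt(p2 * (1 - p2) / T)    # lower limit for h2
    return ul1 < ll2

T = 100
while not separated(T):
    T += 1
print(T)    # on the order of 1,700-1,750, depending on the z value used
```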
Paired t-test
Chunk the data set S up into subsets s_1, ..., s_k with |s_i| > 30.
Design classifiers h1, h2 on every S \ s_i.
On each subset s_i compute the errors and define:
  δ_i = error(h1|s_i) − error(h2|s_i)
Now compute:
  δ̄ = (1/k) · Σ_{i=1..k} δ_i    and    s_δ̄ = √( (1 / (k(k−1))) · Σ_{i=1..k} (δ_i − δ̄)² )
With N% confidence, the difference in error between h1 and h2 is:
  δ̄ ± t_{N,k−1} · s_δ̄
where t_{N,k−1} is the t-statistic related to the Student-t distribution with k−1 degrees of freedom.
Since error(h1|s_i) and error(h2|s_i) are both approximately Normal, their difference is approximately Normal.
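A sketch of the paired test in Python; the per-subset error vectors below are made up for illustration (they are not the exercise data that follows), and scipy's ttest_rel performs the same computation as the hand-rolled confidence interval.

```python
# Paired t-test on per-chunk errors of two classifiers.
import numpy as np
from scipy import stats

# Illustrative per-subset error rates (k = 8 chunks), not real data.
err_h1 = np.array([0.21, 0.18, 0.24, 0.20, 0.22, 0.19, 0.23, 0.21])
err_h2 = np.array([0.17, 0.16, 0.22, 0.15, 0.18, 0.17, 0.19, 0.16])

d = err_h1 - err_h2
k = len(d)
d_bar = d.mean()
s_dbar = np.sqrt(np.sum((d - d_bar) ** 2) / (k * (k - 1)))   # std. error of d_bar

t_crit = stats.t.ppf(0.975, df=k - 1)                        # 95% two-sided
print(f"95% CI for the error difference: "
      f"[{d_bar - t_crit * s_dbar:.3f}, {d_bar + t_crit * s_dbar:.3f}]")

# Equivalent built-in test:
t_stat, p_value = stats.ttest_rel(err_h1, err_h2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```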
Exercise: the real case …
A team of doctors has their own classifier and sample data of size 500.
They split it into a TR of size 300 and a TE of size 200.
They get an error of 22% on TE.
They ask us for further advice …
We design a second classifier; it has an error of 15% on the same TE.
Answer the following questions:
1. Will you affirm that yours is better than theirs?
2. How large would TE need to be to (very reasonably) affirm that yours is better than theirs?
3. What do you deduce from the above?
4. Suppose we move to 10-fold CV on the entire data set.
   a. Give a new estimate of the error of your classifier.
   b. Perform a statistical test to check whether there is any real difference.
The doctors' classifier errors: 0.22, 0.22, 0.29, 0.19, 0.23, 0.22, 0.20, 0.25, 0.19, 0.19
Your classifier's errors: 0.15, 0.17, 0.21, 0.14, 0.13, 0.15, 0.14, 0.19, 0.11, 0.11
What is accuracy?
Accuracy = (no. of correct predictions) / (no. of predictions) = (TP + TN) / (TP + TN + FP + FN)
Example
Clearly, B, C and D are all better than A.
Is B better than C and D? Is C better than B and D? Is D better than B and C?
Accuracy may not tell the whole story.
What is sensitivity (aka recall)?
Sensitivity = (no. of correct positive predictions) / (no. of positives) = TP / (TP + FN)   (wrt positives)
Sometimes the sensitivity wrt negatives is termed specificity.
What is precision?
Precision = (no. of correct positive predictions) / (no. of positive predictions) = TP / (TP + FP)   (wrt positives)
Note that precision is not the same as specificity = TN / (TN + FP).
Precision-recall trade-off
A predicts better than B if A has better recall and precision than B.
There is a trade-off between recall and precision:
  in some applications, once you reach a satisfactory precision, you optimize for recall;
  in some applications, once you reach a satisfactory recall, you optimize for precision.
(Figure: precision plotted against recall.)
Comparing prediction performance
Accuracy is the obvious measure, but it conveys the right intuition only when the positive and negative populations are roughly equal in size.
Recall and precision together form a better measure. But what do you do when A has better recall than B and B has better precision than A?
F-measure
The harmonic mean of recall and precision (wrt positives):
  F = (2 · recall · precision) / (recall + precision)
Does not accord with intuition.
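The measures discussed above, computed directly from confusion-matrix counts; a minimal sketch with made-up counts.

```python
# Accuracy, sensitivity (recall), specificity, precision and F-measure
# from the four confusion-matrix counts.
def metrics(TP, TN, FP, FN):
    accuracy    = (TP + TN) / (TP + TN + FP + FN)
    sensitivity = TP / (TP + FN)          # recall (wrt positives)
    specificity = TN / (TN + FP)          # sensitivity wrt negatives
    precision   = TP / (TP + FP)
    f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f_measure

# Made-up counts for illustration.
print(metrics(TP=40, TN=45, FP=5, FN=10))
```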
Abstract model of a classifier
Given a test observation x, compute the prediction h(x):
  predict x as negative if h(x) < t;
  predict x as positive otherwise.
t is the decision threshold of the classifier; changing t affects the recall and precision, and hence the accuracy, of the classifier.
ROC curves
By changing t, we get a range of sensitivities and specificities of a classifier.
This leads to the ROC curve, which plots sensitivity, P(TP), against 1 − specificity, P(FP).
A predicts better than B if A has better sensitivities than B at most specificities.
Then the larger the area under the ROC curve, the better.
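Finally, a sketch of the threshold sweep described above, with made-up labels and scores; sklearn.metrics.roc_curve does the same bookkeeping, but the explicit loop makes the construction clear.

```python
# ROC curve by sweeping the decision threshold t over classifier scores h(x).
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                     # made-up true labels
scores = y * 0.8 + rng.normal(0.0, 0.6, size=200)    # positives tend to score higher

P, N = (y == 1).sum(), (y == 0).sum()
points = [(0.0, 0.0)]                                # (FPR, TPR) pairs, origin first
for t in np.sort(np.unique(scores))[::-1]:           # sweep t from high to low
    pred_pos = scores >= t
    tpr = (pred_pos & (y == 1)).sum() / P            # sensitivity = P(TP)
    fpr = (pred_pos & (y == 0)).sum() / N            # 1 - specificity = P(FP)
    points.append((fpr, tpr))

# Area under the ROC curve by the trapezoidal rule.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(f"AUC = {auc:.3f}")
```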