Evaluation (practice)

2 Predicting performance
- Assume the estimated error rate is 25%. How close is this to the true error rate?
- It depends on the amount of test data.
- Prediction is just like tossing a (biased!) coin: “head” is a “success”, “tail” is an “error”.
- In statistics, a succession of independent events like this is called a Bernoulli process.
- Statistical theory provides us with confidence intervals for the true underlying proportion.

3 Confidence intervals
- We can say: p lies within a certain specified interval with a certain specified confidence.
- Example: S = 750 successes in N = 1000 trials.
  - Estimated success rate: 75%.
  - How close is this to the true success rate p?
  - Answer: with 80% confidence, $p \in [73.2\%, 76.7\%]$.
- Another example: S = 75 and N = 100.
  - Estimated success rate: 75%.
  - With 80% confidence, $p \in [69.1\%, 80.1\%]$, i.e. the probability that $p \in [69.1\%, 80.1\%]$ is 0.8.
- The larger N is, the narrower the interval at the same confidence level. Above, the interval for N = 100 is much wider than the one for N = 1000.

4 Mean and Variance
- Let Y be the random variable with possible values 1 for success and 0 for error.
- Let the probability of success be p. Then the probability of error is $q = 1 - p$.
- What's the mean? $1 \cdot p + 0 \cdot q = p$
- What's the variance? $(1-p)^2 p + (0-p)^2 q = q^2 p + p^2 q = pq(p+q) = pq$
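The mean and variance above can be checked numerically. This is a minimal sketch; the function name `bernoulli_mean_and_variance` and the value p = 0.75 are illustrative choices, not part of the slides.

```python
def bernoulli_mean_and_variance(p):
    """Return (mean, variance) of Y, where Y = 1 with probability p and 0 otherwise."""
    q = 1.0 - p
    # Mean: weighted sum of the two possible values.
    mean = 1 * p + 0 * q
    # Variance: weighted sum of squared deviations from the mean.
    variance = (1 - mean) ** 2 * p + (0 - mean) ** 2 * q
    return mean, variance


# Illustrative value only: p = 0.75, so q = 0.25.
mean, variance = bernoulli_mean_and_variance(0.75)
print(mean, variance, 0.75 * 0.25)  # the last two numbers should agree (variance = pq)
```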

5 Estimating p
- We don't know p. Our goal is to estimate it.
- For this we make N trials, i.e. tests. The more trials we run, the more confident we can be in the estimate.
- Let S be the random variable denoting the number of successes, i.e. S is the sum of N independent samples of Y.
- We approximate p with the success rate in N trials, i.e. S/N.
- By the Central Limit Theorem, when N is large, the distribution of the random variable $f = S/N$ is approximately normal with
  - mean $p$ and
  - variance $pq/N$.

6 Estimating p
- The c% confidence interval $[-z \le X \le z]$ for a random variable X with mean 0 is given by: $\Pr[-z \le X \le z] = c$
- With a symmetric distribution: $\Pr[-z \le X \le z] = 1 - 2 \times \Pr[X \ge z]$
- Confidence limits for the normal distribution with mean 0 and variance 1 give, for example, $\Pr[X \ge 1.65] = 5\%$.
- Thus: $\Pr[-1.65 \le X \le 1.65] = 90\%$
- To use this we have to reduce our random variable $f = S/N$ to have mean 0 and unit variance.
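Instead of a printed table, the value of z for a given confidence level can be computed. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the helper name `z_for_confidence` is hypothetical.

```python
from statistics import NormalDist


def z_for_confidence(c):
    """Return z such that Pr[-z <= X <= z] = c for a standard normal X."""
    # By symmetry, each tail holds (1 - c) / 2 of the probability mass.
    tail = (1.0 - c) / 2.0
    # inv_cdf gives the point below which a fraction (1 - tail) of the mass lies.
    return NormalDist().inv_cdf(1.0 - tail)


print(round(z_for_confidence(0.90), 2))  # the slides round this value to 1.65
print(round(z_for_confidence(0.95), 2))  # the slides approximate this value by 2
```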

7 Estimating p
- Thus: $\Pr[-1.65 \le X \le 1.65] = 90\%$
- To use this we have to reduce our random variable S/N to have mean 0 and unit variance: $\Pr\left[-1.65 \le \frac{S/N - p}{\sigma_{S/N}} \le 1.65\right] = 90\%$
- Now we solve two equations:
  - $\frac{S/N - p}{\sigma_{S/N}} = 1.65$
  - $\frac{S/N - p}{\sigma_{S/N}} = -1.65$

8 Estimating p
- Let N = 100 and S = 70.
- $\sigma_{S/N}$ is $\sqrt{pq/N}$, and we approximate it by $\sqrt{p'(1-p')/N}$, where $p'$ is the estimate of p, i.e. 0.7.
- So $\sigma_{S/N}$ is approximated by $\sqrt{0.7 \times 0.3 / 100} \approx 0.046$.
- The two equations become:
  - $(0.7 - p)/0.046 = 1.65$, so $p = 0.7 - 1.65 \times 0.046 \approx 0.624$
  - $(0.7 - p)/0.046 = -1.65$, so $p = 0.7 + 1.65 \times 0.046 \approx 0.776$
- Thus we say: with 90% confidence, the success rate p of the classifier satisfies $0.624 \le p \le 0.776$.
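The calculation on this slide can be reproduced in a few lines. This is a sketch of the simple normal-approximation interval used here (with $\sigma_{S/N}$ estimated from $p'$); the function name `normal_approx_interval` is an illustrative choice.

```python
import math


def normal_approx_interval(successes, trials, z):
    """Return (low, high) bounds for p using the normal approximation of S/N."""
    p_hat = successes / trials                          # p', the observed success rate
    sigma = math.sqrt(p_hat * (1.0 - p_hat) / trials)   # estimated std. deviation of S/N
    # Solving (p' - p) / sigma = +z and = -z for p gives the two bounds.
    return p_hat - z * sigma, p_hat + z * sigma


low, high = normal_approx_interval(70, 100, 1.65)
print(round(low, 3), round(high, 3))  # 0.624 0.776
```

The approximation is only reasonable when N is large, as the Central Limit Theorem argument requires.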

9 Exercise
- Suppose I want to be 95% confident in my estimate.
- Looking at a detailed table we find: $\Pr[-2 \le X \le 2] \approx 95\%$
- Normalizing S/N, we need to solve:
  - $\frac{S/N - p}{\sigma_f} = 2$
  - $\frac{S/N - p}{\sigma_f} = -2$
- We approximate $\sigma_f$ with $\sqrt{p'(1-p')/N}$, where $p'$ is the estimate of p obtained through trials, i.e. S/N.
- So we need to solve: $\frac{p' - p}{\sqrt{p'(1-p')/N}} = \pm 2$
- So, $p = p' \mp 2\sqrt{p'(1-p')/N}$

10 Exercise
- Suppose N = 1000 trials and S = 590 successes.
- $p' = S/N = 590/1000 = 0.59$
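One way the exercise can be completed, assuming the 95% level with $z \approx 2$ from the previous slide and the same approximation of $\sigma_f$ by $\sqrt{p'(1-p')/N}$ (values rounded to three decimals):

```latex
\sigma_f \approx \sqrt{\frac{p'(1-p')}{N}}
         = \sqrt{\frac{0.59 \times 0.41}{1000}}
         \approx 0.0156

p \approx p' \pm 2\,\sigma_f
  = 0.59 \pm 2 \times 0.0156
\quad\Longrightarrow\quad
0.559 \le p \le 0.621
```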

11 Cross-validation
- k-fold cross-validation:
  - First step: split the data into k subsets of equal size.
  - Second step: use each subset in turn for testing and the remainder for training.
- The error estimates are averaged to yield an overall error estimate.
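The two steps above can be sketched in plain Python. Everything named here is hypothetical: `k_fold_error` takes caller-supplied `train` and `predict` functions, and the majority-class learner exists only to make the example runnable. The sketch assigns every k-th instance to the same fold and does no shuffling or stratification, which a real experiment would normally add.

```python
from collections import Counter


def k_fold_error(instances, labels, k, train, predict):
    """Average the test error over k folds (assumes len(instances) >= k)."""
    n = len(instances)
    fold_errors = []
    for fold in range(k):
        # Step 1: fold number `fold` holds every k-th instance, so fold sizes differ by at most 1.
        test_idx = list(range(fold, n, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        # Step 2: train on the remainder, test on the held-out fold.
        model = train([instances[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        wrong = sum(predict(model, instances[i]) != labels[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    # The overall estimate is the average of the per-fold error rates.
    return sum(fold_errors) / k


def train_majority(xs, ys):
    """Toy learner: remember the most frequent training label."""
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    """Toy predictor: always answer with the remembered label."""
    return model


# Made-up data: ten instances, eight labelled "a" and two labelled "b".
data = list(range(10))
labels = ["a"] * 8 + ["b"] * 2
print(k_fold_error(data, labels, 5, train_majority, predict_majority))
```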

12 Comparing data mining schemes
- Frequent question: which of two learning schemes performs better?
- Obvious way: compare, for example, their 10-fold cross-validation estimates.
- Problem: variance in the estimate.
- We don't know whether the results are reliable, so we need a statistical test.

13 Paired t-test
- Student's t-test tells whether the means of two samples are significantly different.
- In our case the samples are cross-validation estimates for different datasets from the domain.
- Use a paired t-test because the individual samples are paired: the same cross-validation is applied twice, once per scheme.

14 Distribution of the means
- $x_1, x_2, \ldots, x_k$ and $y_1, y_2, \ldots, y_k$ are the 2k samples for the k different datasets.
- $m_x$ and $m_y$ are the means.
- With enough samples, the mean of a set of independent samples is normally distributed.
- Estimated variances of the means are $s_x^2/k$ and $s_y^2/k$.
- If $\mu_x$ and $\mu_y$ are the true means, then the following are approximately normally distributed with mean 0 and variance 1: $\frac{m_x - \mu_x}{\sqrt{s_x^2/k}}$ and $\frac{m_y - \mu_y}{\sqrt{s_y^2/k}}$

15 Student’s distribution
- With small samples (k < 30) the mean follows Student’s distribution with k–1 degrees of freedom.
- It has a similar shape to the normal distribution, but is wider.
- Confidence limits (mean 0 and variance 1):

16 Distribution of the differences
- Let $m_d = m_x - m_y$.
- The difference of the means ($m_d$) also has a Student’s distribution with k–1 degrees of freedom.
- Let $s_d^2$ be the estimated variance of the difference.
- The standardized version of $m_d$ is called the t-statistic: $t = \frac{m_d}{\sqrt{s_d^2/k}}$

17 Performing the test
- Fix a significance level $\alpha$.
- If a difference is significant at the $\alpha$% level, a difference this large would arise by chance less than $\alpha$% of the time if the true means were equal.
- Divide the significance level by two because the test is two-tailed.
- Look up the value for z that corresponds to $\alpha/2$.
- If $t \le -z$ or $t \ge z$, then the difference is significant, i.e. the null hypothesis (that the difference is zero) can be rejected.
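The t-statistic and the decision rule can be sketched as follows, assuming the usual form $t = m_d / \sqrt{s_d^2/k}$ with $s_d^2$ the sample variance of the paired differences. The function names and the error rates in the usage example are made up for illustration; the critical value 2.132 is the standard two-tailed 10% limit for 4 degrees of freedom and has to be looked up, not computed, in this sketch.

```python
import math


def paired_t_statistic(xs, ys):
    """Return the t-statistic for paired samples xs[i], ys[i]."""
    k = len(xs)
    diffs = [x - y for x, y in zip(xs, ys)]
    m_d = sum(diffs) / k                                  # mean difference
    s_d2 = sum((d - m_d) ** 2 for d in diffs) / (k - 1)   # sample variance of the differences
    return m_d / math.sqrt(s_d2 / k)


def is_significant(t, critical_value):
    """Two-tailed test: reject the null hypothesis if t falls in either tail."""
    return t <= -critical_value or t >= critical_value


# Hypothetical error rates of two classifiers on five datasets.
errors_a = [0.30, 0.25, 0.28, 0.32, 0.27]
errors_b = [0.28, 0.24, 0.25, 0.30, 0.26]
t = paired_t_statistic(errors_a, errors_b)
print(t, is_significant(t, 2.132))
```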

18 Example
- We have compared two classifiers through cross-validation on 10 different datasets.
- The error rates are:
  Dataset | Classifier A | Classifier B | Difference

19 Example
- $m_d = 0.48$
- $s_d =$ …
- The critical value of t for a two-tailed statistical test with $\alpha = 10\%$ and 9 degrees of freedom is 1.83.
- The computed t-statistic is far larger than 1.83, so the difference is significant: classifier B is significantly better than A.

20 Dependent estimates
- We assumed that we have enough data to create several datasets of the desired size.
- If that's not the case, we need to reuse data, e.g. by running cross-validations with different randomizations on the same data.
- The samples then become dependent, so insignificant differences can appear significant.
- A heuristic test is the corrected resampled t-test:
  - Assume we use the repeated holdout method, with $n_1$ instances for training and $n_2$ for testing.
  - The new test statistic is: $t = \frac{m_d}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right) s_d^2}}$
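A minimal sketch of the corrected statistic, assuming the commonly stated form of the corrected resampled t-test, $t = m_d / \sqrt{(1/k + n_2/n_1)\, s_d^2}$, where k is the number of holdout repetitions. The function name and the sample numbers are hypothetical.

```python
import math


def corrected_resampled_t(diffs, n_train, n_test):
    """Return the corrected resampled t-statistic for k repeated-holdout differences."""
    k = len(diffs)
    m_d = sum(diffs) / k
    s_d2 = sum((d - m_d) ** 2 for d in diffs) / (k - 1)   # sample variance of the differences
    # The extra n_test / n_train term inflates the variance to account for
    # the dependence between runs that reuse the same data.
    return m_d / math.sqrt((1.0 / k + n_test / n_train) * s_d2)


# Made-up differences from five holdout runs with a 90/10 train/test split.
print(corrected_resampled_t([2, 3, 1, 3, 3], n_train=90, n_test=10))
```

Because the correction only enlarges the denominator, the corrected statistic is always smaller in magnitude than the standard paired t-statistic on the same differences, which makes the test more conservative.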