Statistical Comparison of Two Learning Algorithms
Presented by: Payam Refaeilzadeh
Overview
How can we tell if one algorithm learns better than another?
– Design an experiment to measure the accuracy of the two algorithms.
– Run multiple trials.
– Compare the samples, not just their means: run a statistically sound test on the two samples.
– Is any observed difference significant? Is it due to a true difference between the algorithms or to natural variation in the measurements?
Statistical Hypothesis Testing
Statistical Hypothesis: A statement about the parameters of one or more populations
Hypothesis Testing: A procedure for deciding whether to reject or fail to reject the null hypothesis
– Identify the parameter of interest
– State a null hypothesis, H0
– Specify an alternative hypothesis, H1
– Choose a significance level α
– State an appropriate test statistic
Statistical Hypothesis Testing (cont.)
Null Hypothesis (H0): A statement presumed to be true until statistical evidence shows otherwise
– Usually specifies an exact value for a parameter
– Example: H0: µ = 30 kg
Alternative Hypothesis (H1): Accepted if the null hypothesis is rejected
Test Statistic: A particular statistic calculated from the measurements of a random sample / experiment
– A test statistic is assumed to follow a particular distribution (normal, t, chi-square, etc.)
– That distribution can be used to test the significance of the calculated test statistic.
Error in Hypothesis Testing
Type I error: H0 is rejected although it is in fact true
– P(Type I error) = α, the significance level
Type II error: we fail to reject H0 although it is in fact false
– P(Type II error) = β
– power = 1 − β = probability of correctly rejecting a false H0
– power = ability to distinguish between the two populations
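The two error rates can be estimated by simulation. The sketch below reuses the earlier example H0: µ = 30 with a one-sample t statistic; the standard deviation (5), the sample size (10), the alternative mean (35) and the function names are illustrative assumptions, not values from the slides. The critical value 2.262 is the tabulated t value for α/2 = 0.025 and 9 degrees of freedom.

```python
import math
import random
import statistics


def one_sample_t(sample, mu0):
    """t statistic for H0: mu = mu0, computed from a single sample."""
    n = len(sample)
    return (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))


def rejection_rate(true_mu, mu0=30.0, sigma=5.0, n=10, t_crit=2.262,
                   trials=5000, seed=0):
    """Fraction of simulated experiments in which H0: mu = mu0 is rejected.

    sigma, n, trials and seed are assumed values chosen for illustration.
    t_crit = 2.262 is the two-sided critical value for alpha = 0.05, n - 1 = 9.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        if abs(one_sample_t(sample, mu0)) > t_crit:
            rejections += 1
    return rejections / trials


# H0 is true here, so every rejection is a Type I error (rate should be near alpha).
type_i_rate = rejection_rate(true_mu=30.0)

# H0 is false here, so the rejection rate estimates the power, 1 - beta.
power = rejection_rate(true_mu=35.0)

print(f"estimated Type I error rate: {type_i_rate:.3f}")
print(f"estimated power against mu = 35: {power:.3f}")
```

When the normality assumption holds, as it does in this simulation, the estimated Type I rate should land close to the chosen α; the power depends on how far the true mean is from the hypothesized one.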
Paired t-Test
Collect data in pairs:
– Example: Given a training set D_Train and a test set D_Test, train both learning algorithms on D_Train and then test their accuracies on D_Test.
Suppose n paired measurements have been made
Assume
– The measurements are independent
– The measurements for each algorithm follow a normal distribution
The test statistic T0 will follow a t-distribution with n-1 degrees of freedom
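Paired t-Test (cont.)

Trial #    Algorithm 1 Accuracy (X_1)    Algorithm 2 Accuracy (X_2)
1          X_{11}                        X_{21}
2          X_{12}                        X_{22}
…          …                             …
n          X_{1n}                        X_{2n}

Assume: X_1 follows N(µ_1, σ_1) and X_2 follows N(µ_2, σ_2)
Let: µ_D = µ_1 - µ_2 and D_i = X_{1i} - X_{2i}, i = 1, 2, ..., n
Null Hypothesis: H0: µ_D = Δ_0
Test Statistic:
$$T_0 = \frac{\bar{D} - \Delta_0}{S_D / \sqrt{n}}$$
where \bar{D} is the sample mean and S_D the sample standard deviation of the differences D_i; t_0 denotes the computed value of T_0.
Rejection Criteria:
– H1: µ_D ≠ Δ_0: reject H0 if |t_0| > t_{α/2, n-1}
– H1: µ_D > Δ_0: reject H0 if t_0 > t_{α, n-1}
– H1: µ_D < Δ_0: reject H0 if t_0 < -t_{α, n-1}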
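A minimal sketch of the paired test in code, assuming the standard paired statistic t_0 = (D̄ − Δ_0) / (S_D / √n) with n − 1 degrees of freedom. The function name and the accuracy values are hypothetical and made up for illustration; the critical value 2.262 is the tabulated t_{0.025, 9}.

```python
import math
import statistics


def paired_t_statistic(x1, x2, delta0=0.0):
    """Return (t0, degrees_of_freedom) for paired samples x1 and x2."""
    if len(x1) != len(x2):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x1, x2)]    # D_i = X_1i - X_2i
    n = len(diffs)
    d_bar = statistics.mean(diffs)             # sample mean of the differences
    s_d = statistics.stdev(diffs)              # sample standard deviation (n - 1)
    t0 = (d_bar - delta0) / (s_d / math.sqrt(n))
    return t0, n - 1


# Illustrative accuracies only; these are not results from any real experiment.
acc_algorithm_1 = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
acc_algorithm_2 = [0.78, 0.80, 0.80, 0.77, 0.81, 0.76, 0.79, 0.82, 0.79, 0.78]

t0, dof = paired_t_statistic(acc_algorithm_1, acc_algorithm_2)

# Two-sided test, H1: mu_D != 0, at alpha = 0.05 with n - 1 = 9 degrees of freedom.
T_CRIT = 2.262
print(f"t0 = {t0:.3f}, degrees of freedom = {dof}")
print("reject H0" if abs(t0) > T_CRIT else "fail to reject H0")
```

For a one-sided alternative, the comparison would use t_{α, n-1} and the sign of t_0 instead of its absolute value.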
Cross Validated t-test
Paired t-Test on the 10 paired accuracies obtained from 10-fold cross validation
Advantages
– Large training set size
– Most powerful (Dietterich, 1998)
Disadvantages
– Accuracy results are not independent (the training sets overlap)
– Somewhat elevated probability of Type I error (Dietterich, 1998)
…
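The sketch below shows one way the 10 paired accuracies might be collected: both learners are trained and tested on exactly the same fold splits, so each fold yields one pair. The two toy learners, the synthetic one-dimensional dataset and all function names are assumptions made for illustration. The resulting differences would then go into a paired t-test with 9 degrees of freedom.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle the item indices and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cv_paired_accuracies(data, k, learner_a, learner_b, seed=0):
    """Run both learners on the same k folds; return k (acc_a, acc_b) pairs."""
    folds = k_fold_indices(len(data), k, seed)
    pairs = []
    for i, test_idx in enumerate(folds):
        test = [data[j] for j in test_idx]
        # Any two training sets share k - 2 of the k folds, which is why the
        # fold accuracies are not independent.
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        pairs.append((learner_a(train, test), learner_b(train, test)))
    return pairs


def majority_learner(train, test):
    """Toy learner: always predict the most common training label."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(y == guess for _, y in test) / len(test)


def threshold_learner(train, test):
    """Toy learner: cut at the midpoint between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    mean0, mean1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    cut = (mean0 + mean1) / 2

    def predict(x):
        return 1 if (x > cut) == (mean1 > mean0) else 0

    return sum(predict(x) == y for x, y in test) / len(test)


# Synthetic data (an assumption): class 0 centred at 0.0, class 1 centred at 1.0.
rng = random.Random(1)
data = [(rng.gauss(label, 1.0), label) for label in (0, 1) for _ in range(100)]

pairs = cv_paired_accuracies(data, 10, majority_learner, threshold_learner)
differences = [a - b for a, b in pairs]
print(differences)
```

Using identical folds for both learners is what makes the measurements paired; the overlap between training sets is the source of the dependence listed under the disadvantages.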
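5x2 Cross Validated t-test
Run 2-fold cross validation 5 times
– Use the accuracy difference from the first fold of the first replication as the estimate of the mean difference (the numerator of the statistic)
– Use the results from all folds of all five replications to estimate the variance
– Compare the statistic against a t-distribution with 5 degrees of freedom
Advantage
– Lowest Type I error (Dietterich, 1998)
Disadvantage
– Not as powerful as the 10-fold cross validated t-test (Dietterich, 1998)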
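A minimal sketch of the 5x2cv statistic as described by Dietterich (1998): for replication i, let p_i1 and p_i2 be the accuracy differences on the two folds, p̄_i their mean, and s_i² = (p_i1 − p̄_i)² + (p_i2 − p̄_i)². The statistic is t = p_11 / sqrt((1/5) Σ s_i²), compared against a t-distribution with 5 degrees of freedom. The function name and the input values are hypothetical.

```python
import math


def five_by_two_cv_t(diffs):
    """5x2cv t statistic.

    diffs: five (p_i1, p_i2) tuples, one per replication of 2-fold cross
    validation, where each entry is accuracy(A) - accuracy(B) on that fold.
    """
    if len(diffs) != 5 or any(len(pair) != 2 for pair in diffs):
        raise ValueError("expected 5 replications with 2 fold differences each")
    variances = []
    for p1, p2 in diffs:
        p_bar = (p1 + p2) / 2
        variances.append((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2)   # s_i^2
    pooled = sum(variances) / 5
    if pooled == 0:
        raise ValueError("variance estimate is zero; statistic is undefined")
    # Numerator: the difference from the first fold of the first replication.
    return diffs[0][0] / math.sqrt(pooled)


# Illustrative differences only; these are not results from any real experiment.
example_diffs = [(0.03, 0.01), (0.02, 0.04), (0.00, 0.03), (0.05, 0.02), (0.01, 0.02)]
t = five_by_two_cv_t(example_diffs)

# Two-sided test at alpha = 0.05 with 5 degrees of freedom: t_{0.025, 5} = 2.571.
T_CRIT = 2.571
print(f"t = {t:.3f}")
print("reject H0" if abs(t) > T_CRIT else "fail to reject H0")
```

Only one of the ten fold differences appears in the numerator, while all ten contribute to the variance estimate.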
Re-sampled t-test
– Randomly divide the data into train / test sets (usually 2/3 train, 1/3 test)
– Run multiple trials (usually 30)
– Perform a paired t-test between the trial accuracies
This test has a very high probability of Type I error and should never be used.
Calibrated Tests
Bouckaert, ICML 2003:
– It is very difficult to estimate the true degrees of freedom because the independence assumptions are violated
– Instead of correcting for the mean difference, calibrate on the degrees of freedom
– Recommendation: use 10 times repeated 10-fold cross validation with 10 degrees of freedom
References
R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. ICML’03: PP
T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
D. C. Montgomery et al. Engineering Statistics. 2nd Edition. Wiley Press. 2001.