Empirical Research Methods in Computer Science Lecture 4 November 2, 2005 Noah Smith
Today Review bootstrap estimate of se (from homework). Review sign and permutation tests for paired samples. Lots of examples of hypothesis tests.
Recall... There is a true value of the statistic, but we don’t know it. We can compute the sample statistic. We know sample means are normally distributed (as n gets big): x̄ ~ N(μ, σ²/n).
But we don’t know anything about the distribution of other sample statistics (medians, correlations, etc.)!
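The claim about sample means can be checked by simulation. A minimal Python sketch (the seed, the exponential distribution, and the sample sizes are my own illustrative choices): draw many samples from a skewed distribution and look at how the sample mean behaves.

```python
import random
import statistics

random.seed(0)

# Draw many samples from a skewed (exponential) distribution and look
# at the spread of the sample mean: by the CLT it is approximately
# normal around the true mean (1.0 for Exponential(1)), with standard
# deviation sigma/sqrt(n).
def sample_mean(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(400) for _ in range(2000)]
print(round(statistics.fmean(means), 2))   # close to the true mean, 1.0
print(round(statistics.stdev(means), 2))   # close to sigma/sqrt(n) = 1/20 = 0.05
```

The same simulation with the sample median in place of the mean has no such textbook formula for its spread, which is the gap the bootstrap fills.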
Bootstrap world [Diagram: real world vs. bootstrap world. Real world: unknown distribution F → observed random sample X → statistic of interest. Bootstrap world: empirical distribution → bootstrap random sample X* → bootstrap replication. The bootstrap replications give statistics about the estimate (e.g., standard error).]
Bootstrap estimate of se Run B bootstrap replicates, and compute the statistic each time: θ*[1], θ*[2], θ*[3], ..., θ*[B]. mean(θ*) = (1/B) Σ_b θ*[b] (mean of θ* across replications); se_boot = sqrt( (1/(B–1)) Σ_b (θ*[b] – mean(θ*))² ) (sample standard deviation of θ* across replications).
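The procedure above can be sketched in Python (the function name, the choice of statistic, and the data are illustrative):

```python
import random
import statistics

random.seed(0)

def bootstrap_se(data, stat, B=2000):
    """Bootstrap estimate of the standard error of `stat`: resample the
    data with replacement B times, recompute the statistic on each
    replicate, and take the sample standard deviation of the B values."""
    n = len(data)
    reps = [stat([random.choice(data) for _ in range(n)]) for _ in range(B)]
    return statistics.stdev(reps)

# e.g. standard error of the sample median, which has no simple
# closed form for general distributions
data = [random.gauss(0, 1) for _ in range(100)]
print(bootstrap_se(data, statistics.median))
```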
Paired-Sample Design pairs (x_i, y_i), x ~ distribution F, y ~ distribution G. How do F and G differ?
Sign Test H0: F and G have the same median: median(F) – median(G) = 0; Pr(x > y) = 0.5. sign(x – y) ~ binomial distribution; compute bin(N+, 0.5).
Sign Test nonparametric (no assumptions about the data) closed form (no random sampling)
Example: gzip speed build gzip with –O2 or with –O0 on about 650 files out of 1000, gzip-O2 was faster binomial distribution, p = 0.5, n = 1000 p < 3 x
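The binomial computation behind this sign test can be sketched as follows (the function name is my own; the 650-of-1000 count is from the slide):

```python
from math import comb

def sign_test_p(n_plus, n):
    """Two-sided sign test: under H0, sign(x - y) is a fair coin, so
    N+ ~ Binomial(n, 0.5). The p-value is the probability of a count
    at least as far from n/2 as the one observed."""
    k = max(n_plus, n - n_plus)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# gzip example from the slides: -O2 faster on about 650 of 1000 files
print(sign_test_p(650, 1000))   # vanishingly small
```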
Permutation Test H0: F = G. Suppose the difference in sample means is d. How likely is this difference (or a greater one) under H0? For i = 1 to P: randomly permute each pair (x_i, y_i); compute the difference in sample means. p ≈ fraction of the P permuted differences at least as extreme as d.
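A minimal Python sketch of this paired permutation test, where permuting within a pair amounts to flipping the sign of that pair's difference; the example data are made up:

```python
import random
import statistics

random.seed(0)

def paired_permutation_test(xs, ys, P=10000):
    """Paired permutation test of H0: F = G. Under H0 each pair
    (x_i, y_i) is exchangeable, so randomly swap within pairs
    (i.e., flip the sign of each difference) and count how often
    the permuted mean difference is as extreme as the observed one."""
    diffs = [x - y for x, y in zip(xs, ys)]
    observed = statistics.fmean(diffs)
    hits = 0
    for _ in range(P):
        permuted = [d if random.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(permuted)) >= abs(observed):
            hits += 1
    return hits / P

# hypothetical paired runtimes: ys systematically ~0.5 slower than xs
xs = [random.gauss(10, 1) for _ in range(50)]
ys = [x + random.gauss(0.5, 0.3) for x in xs]
print(paired_permutation_test(xs, ys))   # small p: reject H0
```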
Permutation Test nonparametric (no assumptions about the data) randomized test
Example: gzip speed 1000 permutations: the difference of sample means under H0 is centered on 0; the observed difference is very extreme; p ≈ 0.
Comparing speed is tricky! It is very difficult to control for everything that could affect runtime. Solution 1: do the best you can. Solution 2: many runs, and then do ANOVA tests (or their nonparametric equivalents). “Is there more variance between conditions than within conditions?”
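The quoted question (more variance between conditions than within?) is exactly what the one-way ANOVA F statistic measures. A hand-rolled sketch, with made-up runtime data for three hypothetical conditions:

```python
import statistics

def one_way_F(groups):
    """One-way ANOVA F statistic: the ratio of between-condition
    variance to within-condition variance. A large F suggests the
    conditions genuinely differ."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = statistics.fmean(x for g in groups for x in g)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

# made-up runtimes (seconds) for three conditions
fast = [1.0, 1.1, 0.9, 1.0]
slow = [2.0, 2.1, 1.9, 2.2]
same = [1.0, 1.0, 1.1, 0.9]
print(one_way_F([fast, slow, same]))   # large F: conditions differ
```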
Sampling method 1
for r = 1 to 10
  for each file f
    for each program p
      time p on f
Result (gzip first) student 2’s program faster than gzip!
Result (student first) student 2’s program is slower than gzip!
Sampling method 1
for r = 1 to 10
  for each file f
    for each program p
      time p on f
Order effects Well-known in psychology. What the subject does at time t will affect what she does at time t+1.
Sampling method 2
for r = 1 to 10
  for each program p
    for each file f
      time p on f
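The two sampling methods can be sketched as a timing harness; `time_run` and the program list are hypothetical stand-ins for the real experiment.

```python
import time

def time_run(program, f):
    """Hypothetical stand-in: time one run of `program` on input `f`."""
    start = time.perf_counter()
    program(f)
    return time.perf_counter() - start

def method_1(programs, files, runs=10):
    """Interleave programs within each file: every timing of one
    program immediately follows a run of the other, so order effects
    (e.g., cache warm-up) can systematically favor whichever runs second."""
    return [(p.__name__, f, time_run(p, f))
            for _ in range(runs) for f in files for p in programs]

def method_2(programs, files, runs=10):
    """Block by program: each program processes every file in sequence,
    so one program's runs do not immediately precede the other's."""
    return [(p.__name__, f, time_run(p, f))
            for _ in range(runs) for p in programs for f in files]

# hypothetical stand-ins for the two programs under test
def gzip_O2(f): pass
def student(f): pass

results = method_1([gzip_O2, student], ["file1", "file2"], runs=2)
print(len(results))   # 2 runs x 2 files x 2 programs = 8
```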
Result gzip wins
Sign and Permutation Tests [Diagram, built up over four slides: the space of all distribution pairs (F, G), organized by median(F) and median(G). One region marks where the sign test rejects H0, another where the permutation test rejects H0, and the final slide overlays the two regions.]
There are other tests! We have chosen two that are nonparametric easy to implement Others include: Wilcoxon Signed Rank Test Kruskal-Wallis (nonparametric “ANOVA”)
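For reference, the Wilcoxon signed-rank statistic can be sketched with its normal approximation. This simplified version ignores tied |differences|; a real analysis would use a library implementation such as scipy.stats.wilcoxon.

```python
import math

def wilcoxon_signed_rank_z(xs, ys):
    """Wilcoxon signed-rank test, normal approximation, no handling of
    tied |differences|: rank the nonzero differences by absolute value,
    sum the ranks of the positive ones (W+), and standardize W+ by its
    mean n(n+1)/4 and variance n(n+1)(2n+1)/24 under H0."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    ranked = sorted(diffs, key=abs)
    w_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    return (w_plus - mean) / math.sqrt(var)

# made-up paired data where every x exceeds its y by 1
xs = list(range(1, 21))
ys = [x - 1 for x in xs]
print(wilcoxon_signed_rank_z(xs, ys))   # large positive z: reject H0
```

Unlike the sign test, the signed-rank test uses the magnitudes of the differences, not just their signs, so it is usually more powerful when the differences are roughly symmetric.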
Pre-increment? Conventional wisdom: “Better to use ++x than to use x++.” Really, with a modern compiler?
Two (toy) programs
for(i = 0; i < (1 << 30); ++i) j = ++k;
for(i = 0; i < (1 << 30); i++) j = k++;
Ran each 200 times (interleaved); the difference in mean runtimes was significant, p well below .05.
What?
leal -8(%ebp), %eax
incl (%eax)
movl -8(%ebp), %eax
vs.
leal -8(%ebp), %edx
incl (%edx)
(%edx is not used anywhere else)
Conclusion Compile with –O and the assembly code is identical!
Why was this a dumb experiment?
Pre-increment, take 2 Take gzip source code. Replace all post-increments with pre-increments, in places where semantics won’t change. Run on 1000 files, 10 times each. Compare average runtime by file.
Sign test p = 8.5 x 10^-8
Permutation test
Conclusion Preincrementing is faster!... but what about –O? sign test: p = [?]; permutation test: p = [?]. Preincrement matters without an optimizing compiler.
Joke.
Your programs... 8 students had a working program both weeks. 6 people changed their code. 1 person changed nothing. 1 person changed to –O3. 3 people’s programs were lossy in week 1. Everyone’s was lossy in week 2!
Your programs! Was there an improvement on compression between the two versions? H 0 : No. Find sampling distribution of difference in means, using permutations.
Student 1 (lossless week 1)
Compression < 1?
Student 2: worse compression
Compression < 1?
Student 3
Student 4 (lossless week 1)
Student 5 (lossless week 1)
Student 6
Student 7
Student 8
Homework Assignment 2 — 6 experiments:
1. Does your program compress text or images better?
2. What about variance of compression?
3. What about gzip’s compression?
4. Variance of gzip’s compression?
5. Was there a change in the compression of your program from week 1 to week 2?
6. In the runtime?
Remainder of the course
11/9: EDA
11/16: Regression and learning
11/23: Happy Thanksgiving!
11/30: Statistical debugging
12/7: Review, Q&A
Saturday 12/17, 2-5pm: Exam