Empirical Research Methods in Computer Science Lecture 4 November 2, 2005 Noah Smith.


1 Empirical Research Methods in Computer Science Lecture 4 November 2, 2005 Noah Smith

2 Today Review bootstrap estimate of se (from homework). Review sign and permutation tests for paired samples. Lots of examples of hypothesis tests.

3 Recall... There is a true value of the statistic, but we don't know it. We can compute the sample statistic. We know sample means are normally distributed (as n gets big): x̄ ≈ N(μ, σ²/n).

4 But we don’t know anything about the distribution of other sample statistics (medians, correlations, etc.)!

5 Bootstrap world
Real world: unknown distribution F → observed random sample X → statistic of interest.
Bootstrap world: empirical distribution → bootstrap random sample X* → bootstrap replication → statistics about the estimate (e.g., standard error).

6 Bootstrap estimate of se Run B bootstrap replicates, and compute the statistic each time: θ*[1], θ*[2], θ*[3], ..., θ*[B]. Then:
θ̄* = (1/B) Σ_b θ*[b] (mean of θ* across replications)
se ≈ √( Σ_b (θ*[b] − θ̄*)² / (B − 1) ) (sample standard deviation of θ* across replications)
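
The recipe above can be sketched in Python (illustrative, not from the slides; the choice of B, the sample data, and the median as the statistic of interest are all arbitrary):

```python
import random
import statistics

def bootstrap_se(sample, statistic, B=2000, seed=0):
    """Bootstrap estimate of the standard error of `statistic` (slide 6)."""
    rng = random.Random(seed)
    n = len(sample)
    # theta*[1..B]: the statistic recomputed on B resamples drawn with replacement
    replications = [statistic([sample[rng.randrange(n)] for _ in range(n)])
                    for _ in range(B)]
    mean_rep = sum(replications) / B
    # sample standard deviation of theta* across the B replications
    return (sum((t - mean_rep) ** 2 for t in replications) / (B - 1)) ** 0.5

rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(100)]       # invented sample
se_median = bootstrap_se(data, statistics.median)  # se of the sample median
```

This is exactly the point of slide 4: we have no closed-form sampling distribution for the median, but the bootstrap gives us a standard error anyway.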

7 Paired-Sample Design pairs (xᵢ, yᵢ), where x ~ distribution F and y ~ distribution G. How do F and G differ?

8 Sign Test H₀: F and G have the same median: median(F) − median(G) = 0, so Pr(x > y) = 0.5. Then sign(x − y) follows a binomial distribution: compute the tail probability of bin(n, 0.5) at N₊, the number of positive signs.

9 Sign Test nonparametric (no assumptions about the data); closed form (no random sampling)

10 Example: gzip speed Build gzip with -O2 or with -O0. On about 650 files out of 1000, gzip -O2 was faster. Binomial distribution, p = 0.5, n = 1000: p < 3 × 10⁻²⁴
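
The p-value on this slide is a binomial tail. A sketch of the computation (the 650-of-1000 figure is the slide's; the exact tail below will differ somewhat from the slide's p since "about 650" is rounded):

```python
from math import comb

def sign_test_p(n_plus, n):
    """One-sided exact binomial tail Pr(X >= n_plus) for X ~ bin(n, 0.5)."""
    return sum(comb(n, k) for k in range(n_plus, n + 1)) / 2 ** n

p = sign_test_p(650, 1000)   # gzip -O2 faster on ~650 of 1000 files
```

Python's exact integer arithmetic makes this tail sum trivial; for large n a normal approximation to the binomial would also do.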

11 Permutation Test H₀: F = G. Suppose the difference in sample means is d. How likely is this difference (or a greater one) under H₀? For i = 1 to P: randomly permute within each pair (xᵢ, yᵢ), then compute the difference in sample means. The p-value is the fraction of permutations whose difference is at least as extreme as d.
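
The loop above can be sketched as follows (illustrative Python; the paired data are invented to show a clear effect, and P = 10000 is an arbitrary choice):

```python
import random

def paired_permutation_test(xs, ys, P=10000, seed=0):
    """Randomized paired permutation test on the difference in sample means."""
    rng = random.Random(seed)
    n = len(xs)
    observed = abs(sum(xs) / n - sum(ys) / n)
    extreme = 0
    for _ in range(P):
        total = 0.0
        for x, y in zip(xs, ys):
            if rng.random() < 0.5:     # randomly permute within the pair
                x, y = y, x
            total += x - y
        if abs(total / n) >= observed:
            extreme += 1
    return extreme / P                  # fraction at least as extreme as observed

# invented paired data: each x exceeds its paired y by about 1
xs = [2.1, 1.9, 2.2, 2.0, 1.8, 2.3, 2.1, 1.7, 2.0, 1.9]
ys = [1.0, 0.9, 1.1, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.2]
p = paired_permutation_test(xs, ys)
```

Under H₀: F = G, the labels within a pair are exchangeable, which is why a coin flip per pair gives the null sampling distribution.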

12 Permutation Test nonparametric (no assumptions about the data); randomized test

13 Example: gzip speed 1000 permutations: the difference of sample means under H₀ is centered on 0. The observed difference, −1579, is very extreme; p ≈ 0.

14 Comparing speed is tricky! It is very difficult to control for everything that could affect runtime. Solution 1: do the best you can. Solution 2: many runs, and then do ANOVA tests (or their nonparametric equivalents). “Is there more variance between conditions than within conditions?”
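
To make "more variance between conditions than within conditions" concrete, here is a minimal one-way ANOVA F statistic (a sketch, not the slides' code; the runtimes are invented):

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-condition variance over
    within-condition variance. `groups` is one list of runtimes per condition."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# two conditions with clearly different means and small within-condition noise
f = anova_f([[1.0, 1.1, 0.9], [2.0, 2.1, 1.9]])
```

A large F says the conditions differ by more than run-to-run noise can explain; the nonparametric analogue mentioned later is Kruskal-Wallis.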

15 Sampling method 1
for r = 1 to 10:
  for each file f:
    for each program p:
      time p on f

16 Result (gzip first) student 2’s program faster than gzip!

17 Result (student first) student 2’s program is slower than gzip!

18 Sampling method 1
for r = 1 to 10:
  for each file f:
    for each program p:
      time p on f

19 Order effects Well-known in psychology. What the subject does at time t will affect what she does at time t+1.

20 Sampling method 2
for r = 1 to 10:
  for each program p:
    for each file f:
      time p on f

21 Result gzip wins

22 Sign and Permutation Tests [Venn diagram: the space of all distribution pairs (F, G); the region median(F) ≠ median(G) sits inside the region F ≠ G.]

23 Sign and Permutation Tests [Diagram, continued: the sign test can reject H₀ in the region median(F) ≠ median(G).]

24 Sign and Permutation Tests [Diagram, continued: the permutation test can reject H₀ in the larger region F ≠ G.]

25 Sign and Permutation Tests [Diagram, both regions shaded: the permutation test can reject anywhere F ≠ G; the sign test only where the medians differ.]

26 There are other tests! We have chosen two that are nonparametric and easy to implement. Others include the Wilcoxon Signed Rank Test and Kruskal-Wallis (a nonparametric "ANOVA").
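
A sketch of the Wilcoxon signed-rank test mentioned above, using the large-sample normal approximation (illustrative only; the data are invented, and in practice a library routine such as scipy.stats.wilcoxon would be the usual choice):

```python
from math import erfc, sqrt

def wilcoxon_signed_rank_p(xs, ys):
    """Two-sided p-value for paired samples, normal approximation.
    Drops zero differences; averages ranks of tied |differences|."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    # rank the absolute differences (1-based), averaging ranks within tie groups
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return erfc(abs(z) / sqrt(2))      # two-sided tail of the normal

xs = list(range(1, 21))
ys = [x + 0.5 for x in xs]            # y is consistently larger than its pair
p = wilcoxon_signed_rank_p(xs, ys)
```

Unlike the sign test, this uses the magnitudes of the differences (via their ranks), so it typically has more power while remaining nonparametric.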

27 Pre-increment? Conventional wisdom: “Better to use ++x than to use x++.” Really, with a modern compiler?

28 Two (toy) programs for(i = 0; i < (1 << 30); ++i) j = ++k; versus for(i = 0; i < (1 << 30); i++) j = k++; Ran each 200 times (interleaved). Mean runtimes were 2.835 and 2.735; the difference was significant well below 0.05.

29 What?
leal -8(%ebp), %eax
incl (%eax)
movl -8(%ebp), %eax

leal -8(%ebp), %edx
incl (%edx)

%edx is not used anywhere else

30 Conclusion Compile with -O and the assembly code is identical!

31 Why was this a dumb experiment?

32 Pre-increment, take 2 Take gzip source code. Replace all post-increments with pre-increments, in places where semantics won’t change. Run on 1000 files, 10 times each. Compare average runtime by file.

33 Sign test p = 8.5 × 10⁻⁸

34 Permutation test

35 Conclusion Pre-incrementing is faster!... but what about -O? sign test: p = 0.197; permutation test: p = 0.672. Pre-increment matters only without an optimizing compiler.

36 Joke.

37 Your programs... 8 students had a working program both weeks. 6 people changed their code; 1 person changed nothing; 1 person changed to -O3. 3 people's programs were lossy in week 1. Everyone's was lossy in week 2!

38 Your programs! Was there an improvement in compression between the two versions? H₀: No. Find the sampling distribution of the difference in means, using permutations.

39 Student 1 (lossless week 1)

40 Compression < 1?

41 Student 2: worse compression

42 Compression < 1?

43 Student 3

44 Student 4 (lossless week 1)

45 Student 5 (lossless week 1)

46 Student 6

47 Student 7

48 Student 8

49 Homework Assignment 2 6 experiments: 1. Does your program compress text or images better? 2. What about variance of compression? 3. What about gzip’s compression? 4. Variance of gzip’s compression? 5. Was there a change in the compression of your program from week 1 to week 2? 6. In the runtime?

50 Remainder of the course 11/9: EDA 11/16: Regression and learning 11/23: Happy Thanksgiving! 11/30: Statistical debugging 12/7: Review, Q&A Saturday 12/17, 2-5pm: Exam

