Statistical Significance Test
Why Statistical Significance Tests?
Suppose we have developed an EC algorithm A, and we want to compare it with another EC algorithm B. Both algorithms are stochastic, so how can we be sure that A is better than B? Assume we run A and B once and get the results x and y, respectively. If x < y (minimisation), is that because A is better than B, or just because of randomness?
Why Statistical Significance Tests?
- Treat a stochastic algorithm as a random number generator: its output follows some distribution.
- The random output depends on the algorithm and the random seed.
- Collect samples: run the algorithms many times independently (using different random seeds).
- Carry out statistical significance tests based on the collected samples.
Statistical Significance Test
Parametric / non-parametric tests assume / do not assume that the random variables follow a normal distribution. Paired vs unpaired:

                  Unpaired             Paired
  Parametric      t-test / z-test      Paired t-test
  Non-parametric  Wilcoxon rank sum    Wilcoxon signed rank
One-sample z-test
The z-test is used when n ≥ 30. Test the population mean μ using:
- the sample mean x̄
- the standard deviation σ
- the number of samples n
Test statistic: z = (x̄ − μ₀) / (σ / √n). Reject when z < −2 or z > 2.
One-sample z-test
(Null) hypothesis: H₀: μ = μ₀.
Reject the hypothesis if the samples do not support it statistically (z < −2 or z > 2 under a significance level of 0.05; note: the exact critical value is 1.96 at the 0.05 significance level, we use 2 as a rough value).
P-value: p = 2·P(Z ≥ |z|) for two-tailed, p = P(Z ≤ z) for lower-tailed, and p = P(Z ≥ z) for upper-tailed tests. Reject the hypothesis if the p-value < significance level.
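A minimal sketch in R (base R has no built-in z-test; the helper name z_test and the simulated sample are ours), assuming the standard deviation sigma is known:

z_test <- function(x, mu0, sigma) {
  # z = (sample mean - mu0) / (sigma / sqrt(n))
  z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))
  p <- 2 * pnorm(-abs(z))   # two-tailed p-value
  list(z = z, p.value = p)
}
set.seed(42)
z_test(rnorm(50, mean = 0.3), mu0 = 0, sigma = 1)   # n = 50 >= 30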
One-sample t-test
It is used when n < 30. Assume the population follows a normal distribution. Almost the same as the one-sample z-test, except that the standardised sample mean t = (x̄ − μ₀) / (s / √n) does not follow a normal distribution but a t-distribution, whose shape depends on the degrees of freedom, n − 1.
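A quick sketch with base R's t.test (the simulated sample is ours):

set.seed(1)
x <- rnorm(20, mean = 0.5)   # small sample, n < 30
t.test(x, mu = 0)            # one-sample t-test of H0: mu = 0, with n - 1 df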
Two-sample t-test
(Null) hypothesis: H₀: μ₁ = μ₂. Reject the hypothesis if the samples do not support it statistically.
- Unpaired: compare the two independent samples directly.
- Paired: calculate the differences dᵢ = xᵢ − yᵢ, then use a one-sample t-test with null hypothesis H₀: μ_d = 0.
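A sketch showing that the paired t-test is exactly the one-sample t-test on the differences (the simulated data are ours):

set.seed(2)
x <- rnorm(15)
y <- x + rnorm(15, mean = 0.3, sd = 0.2)   # y shifted slightly above x
t.test(x, y, paired = TRUE)$p.value
t.test(x - y, mu = 0)$p.value              # identical p-value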
Unpaired vs Paired
y1 = x·x − x + N(0, 0.1)   (1)
y2 = …   (2)
Step 1: generate 30 random x values for y1 from the normal distribution N(0, 1)
Step 2: obtain 30 y1 values using the 30 x values and Eq. (1)
Step 3: generate 30 random x values for y2 from the normal distribution N(0, 1)
Step 4: obtain 30 y2 values using the 30 x values and Eq. (2)
Unpaired vs Paired
[Plot of the y1 and y2 samples in red and green.] Which one is smaller, red or green? Unpaired test p-value = 0.56: no significant difference is detected.
Unpaired vs Paired
y1 = x·x − x + N(0, 0.1)   (1)
y2 = …   (2)
Step 1: generate 30 random x values for both y1 and y2 from the normal distribution N(0, 1)
Step 2: obtain 30 y1 and 30 y2 values using the same 30 x values and Eqs. (1) and (2)
Unpaired vs Paired
[Plot of the y1 and y2 samples in red and green.] Which one is smaller, red or green? Paired test p-value = 0.00: the difference is significant.
Unpaired vs Paired
If we can eliminate the effect of all the other factors, then paired tests can give us stronger conclusions. Example: for the compared algorithms, use the same random seed to generate the same initial population; then at least the results will not be affected by the initial population. A sketch of the whole demonstration follows below.
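A minimal R sketch of the demonstration above; since the slide does not show Eq. (2), we assume a hypothetical y2 = x·x − x + 0.2 + N(0, 0.1):

set.seed(3)
# Unpaired: y1 and y2 are computed from different x values (Steps 1-4 above)
x1 <- rnorm(30); y1 <- x1^2 - x1 + rnorm(30, sd = 0.1)
x2 <- rnorm(30); y2 <- x2^2 - x2 + 0.2 + rnorm(30, sd = 0.1)  # hypothetical Eq. (2)
t.test(y1, y2, paired = FALSE)$p.value  # large: the shift is hidden by x-variation
# Paired: y1 and y2 share the same x values
x <- rnorm(30)
y1 <- x^2 - x + rnorm(30, sd = 0.1)
y2 <- x^2 - x + 0.2 + rnorm(30, sd = 0.1)
t.test(y1, y2, paired = TRUE)$p.value   # small: pairing removes the x-variation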
Wilcoxon Rank Sum Test
- Does not require the distribution to be normal (non-parametric) or a large sample.
- Unpaired.
- (Null) hypothesis: H₀: m₁ = m₂, comparing the two medians.
- U-statistic for each variable: the number of wins out of all pairwise contests (count 0.5 for each tie).
- Check the table for the p-value; reject the hypothesis if p-value < significance level. See the sketch below.
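A sketch of the U-statistic as pairwise contests (the helper name u_stat is ours); R's wilcox.test reports the same quantity as W:

u_stat <- function(x, y) {
  # wins of x over y across all pairwise contests; ties count 0.5
  sum(outer(x, y, function(a, b) (a > b) + 0.5 * (a == b)))
}
set.seed(4)
x <- rnorm(10); y <- rnorm(10, mean = 1)
u_stat(x, y)
wilcox.test(x, y)$statistic   # same value, reported as W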
Wilcoxon Signed Rank Test
Non-parametric, paired. Steps:
1. Calculate the sign and absolute value of each difference dᵢ = xᵢ − yᵢ
2. Exclude the pairs with dᵢ = 0
3. Sort the pairs in increasing order of |dᵢ|
4. Get the ranks Rᵢ from the sorted pairs
5. Calculate the statistic W = |Σᵢ sgn(dᵢ) · Rᵢ|
6. Reject the hypothesis if W exceeds the critical value (check the table)
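A sketch of steps 1-5 (the helper name signed_rank_W is ours; note that R's wilcox.test(x, y, paired=TRUE) instead reports V, the sum of the positive ranks only):

signed_rank_W <- function(x, y) {
  d <- x - y
  d <- d[d != 0]              # step 2: drop zero differences
  r <- rank(abs(d))           # steps 3-4: rank the absolute differences
  abs(sum(sign(d) * r))       # step 5: |sum of signed ranks|
}
set.seed(5)
x <- rnorm(12); y <- x + rnorm(12, mean = 0.5)
signed_rank_W(x, y)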
Wilcoxon Signed Rank Test
[Worked example.] Conclusion: fail to reject the null hypothesis.
Using Statistical Significance Tests
R:
  t.test(y1, y2, paired=TRUE/FALSE)
  wilcox.test(y1, y2, paired=TRUE/FALSE)
Matlab:
  [h,p] = ttest(x,y)
  [p,h] = ranksum(x,y)
  [p,h] = signrank(x,y)
Java:
  Apache Commons Math Library
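For example, in R (the simulated samples are ours):

set.seed(6)
y1 <- rnorm(30); y2 <- rnorm(30, mean = 0.5)
t.test(y1, y2, paired = FALSE)        # parametric, unpaired
wilcox.test(y1, y2, paired = FALSE)   # non-parametric, unpaired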
Compare p population means
(Null) hypothesis: H₀: μ₁ = μ₂ = … = μₚ
Method: One-Way ANOVA (Analysis of Variance). Compare the performance of p algorithms.
Model assumptions:
- Data from the i-th algorithm are assumed to come from a normal distribution
- Common variance
- Data are independent
ANOVA
Test statistic: F = MSR / MSE, where MSR = SSR/(p−1) is the between-method mean square and MSE = SSE/(n−p) is the within-method (error) mean square.
F ~ F(p−1, n−p) under the null hypothesis. Demonstrate in R.
R code
test.data <- data.frame(
  Time = c(5, 4, 3, 4, 2, 3, 5, 3, 6,
           4, 6, 0, 7, 7, 1, 7, 7, 0,
           4, 6, 2, 6, 6, 1, 6, 4, 0),
  Method = factor(rep(c("A", "B", "C"), 9)))
test.data
tapply(test.data$Time, test.data$Method, mean)   # per-method means
m1 <- lm(Time ~ Method, data = test.data)
summary(m1)
anova(m1)
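As a side note, the model assumptions from the previous slide can be checked on test.data with standard tests (a sketch; both functions are in base R's stats package):

bartlett.test(Time ~ Method, data = test.data)                # common variance
shapiro.test(residuals(lm(Time ~ Method, data = test.data)))  # normality of residuals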
R code - continued
> tapply(test.data$Time, test.data$Method, mean)
       A        B        C
5.333333 5.000000 1.777778
> anova(m1)
Analysis of Variance Table

Response: Time
          Df Sum Sq Mean Sq F value    Pr(>F)
Method     2 69.407  34.704  11.975 0.0002473 ***
Residuals 24 69.556   2.898
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Pr(>F) column is the p-value.
ANOVA Results
P-value: if p-value < α (e.g., 0.05), reject the null hypothesis; otherwise, do not reject it.
What does it mean when the null hypothesis is rejected? Not all algorithms have the same mean, that is, at least one mean is different.
Multiple Comparisons
Tukey test: suppose there are p = 3 algorithms. Test:
H₀: μ₁ = μ₂ versus H₁: μ₁ ≠ μ₂
H₀: μ₁ = μ₃ versus H₁: μ₁ ≠ μ₃
H₀: μ₂ = μ₃ versus H₁: μ₂ ≠ μ₃
R code
m1.anova <- aov(Time ~ Method, data = test.data)
Mult.test <- TukeyHSD(m1.anova, conf.level = 0.95)
Mult.test
library(gplots)
attach(test.data)
jpeg("graph.jpeg")
plotmeans(Time ~ Method, xlab = "Method", ylab = "Time",
          main = "Mean Plot\nwith 95% CI")
dev.off()   # close the jpeg device so the plot file is written
R code - continued
> Mult.test
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Time ~ Method, data = test.data)

$Method
         diff      lwr      upr  p adj
B-A -0.333333 -2.33763  1.67096 0.9089
C-A -3.555556 -5.55985 -1.55127 0.0004
C-B -3.222222 -5.22651 -1.21793 0.0013
The "p adj" column gives the adjusted p-value for each pairwise comparison, e.g., row B-A compares methods B and A, and row C-B compares methods C and B.
R code - continued
> library(gplots)
> attach(test.data)
> jpeg("graph.jpeg")
> plotmeans(Time ~ Method, xlab="Method", ylab="Time",
+           main="Mean Plot\nwith 95% CI")
> dev.off()