Published by Gregory Perkins. Modified over 9 years ago.
1
Approximate Randomization Tests
February 5th, 2013
2
Classic t-test
3
Why AR testing?
- Classic tests often assume a given distribution (Student's t, normal, ...) for the variable.
- This is approximately OK for recall, but not for precision or F-score.
- The set of hypotheses that can be tested with non-parametric tests is limited.
4
Illustration
- 30,000 runs, 1000 instances, 500 of class A
- True positives (TP): 400 (stdev: 80)
- False positives (FP): 60 (stdev: 15)
- Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are bounded below by 0 and above by the number of instances.
5
Definitions
- Recall = truly predicted A / A in reference = truly predicted A / constant. If the number of truly predicted A is normal, recall is normal.
- Precision = truly predicted A / A in system. "A in system" is TP + FP, so precision is a non-linear function of TP and FP. Precision is not normal.
- F-score: a non-linear combination of recall and precision. Not normal.
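The setup of the illustration can be replayed with a small simulation. This is only a sketch under stated assumptions: TP and FP are drawn independently, no clipping to the valid range is applied (mirroring the normality assumption), draws for which a score would be undefined are skipped, and skewness is used as a rough indicator of non-normality. The function names `skewness` and `simulate` are illustrative.

```python
import random
import statistics


def skewness(values):
    """Mean of cubed z-scores; close to 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values, mu=mean)
    return statistics.fmean([((v - mean) / stdev) ** 3 for v in values])


def simulate(runs=30_000, class_a=500, seed=0):
    """Draw TP and FP from normal distributions and derive the three scores."""
    rng = random.Random(seed)
    recall, precision, f_score = [], [], []
    for _ in range(runs):
        tp = rng.gauss(400, 80)  # true positives: mean 400, stdev 80
        fp = rng.gauss(60, 15)   # false positives: mean 60, stdev 15
        if tp <= 0 or tp + fp <= 0:
            continue  # assumption: skip draws where the scores are undefined
        r = tp / class_a         # recall divides by a constant
        p = tp / (tp + fp)       # precision divides by a random quantity
        recall.append(r)
        precision.append(p)
        f_score.append(2 * p * r / (p + r))
    return recall, precision, f_score


recall, precision, f_score = simulate()
for name, scores in [("recall", recall), ("precision", precision), ("F-score", f_score)]:
    print(f"{name:9s} skewness = {skewness(scores):+.2f}")
```

Under these assumptions one would expect the skewness of recall to stay close to zero, while precision and F-score come out clearly left-skewed.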
11
Approximate randomization test
- No assumption on the distribution
- Can handle complicated statistics
- Only assumption: independence between shuffled elements
References:
- Computer Intensive Methods for Testing Hypotheses, Noreen, 1989.
- More accurate tests for the statistical significance of result differences, Yeh, 2000.
12
Basic idea
Exact randomization test

          Glass 1  Glass 2  Glass 3  Glass 4
Contents  Polish   Premium  Russian  Budget
Expert    Polish   Premium  Budget   Russian
13
Exact probability
- H0: expert is independent of contents
- P(ncorrect ≥ 2) = 7/24 ≈ 0.29
- Thus, do not reject H0, because the probability is larger than alpha = 0.05.
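The value 7/24 can be checked by enumerating all 4! = 24 orderings of the expert's labels and counting those with at least as many correct glasses as the expert actually got. The sketch below uses illustrative names (`n_correct`, `exact_p_value`) that do not come from the original script.

```python
from fractions import Fraction
from itertools import permutations

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]


def n_correct(guess, truth):
    """Number of glasses whose guessed label matches the true contents."""
    return sum(g == t for g, t in zip(guess, truth))


def exact_p_value(guess, truth):
    """P(n_correct >= observed) over all orderings of the guessed labels."""
    observed = n_correct(guess, truth)
    orderings = list(permutations(guess))
    n_ge = sum(n_correct(o, truth) >= observed for o in orderings)
    return Fraction(n_ge, len(orderings))


print(exact_p_value(expert, contents))  # 7/24
```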
14
Approximate probability
- The number of permutations is n!, so it grows quickly with n.
- If there are too many permutations to compute, approximate: P = (nge + 1) / (NS + 1)
  - nge: number of times the pseudostatistic ≥ the actual statistic
  - NS: number of shuffles
  - +1: correction for validity
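A minimal sketch of the approximation, reusing the four-glass example so the estimate can be compared with the exact 7/24. The function name, the default of 10,000 shuffles, and the fixed seed are assumptions made for illustration only.

```python
import random

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]


def n_correct(guess, truth):
    """Number of positions where the guessed label matches the truth."""
    return sum(g == t for g, t in zip(guess, truth))


def approximate_p_value(guess, truth, statistic=n_correct,
                        n_shuffles=10_000, seed=0):
    """Estimate P = (nge + 1) / (NS + 1) from NS random shuffles of `guess`."""
    rng = random.Random(seed)
    actual = statistic(guess, truth)
    shuffled = list(guess)
    nge = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # one random permutation of the labels
        if statistic(shuffled, truth) >= actual:
            nge += 1
    return (nge + 1) / (n_shuffles + 1)


print(approximate_p_value(expert, contents))
```

With enough shuffles the estimate should land near the exact probability; for a problem this small the exact enumeration remains the better choice.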
15
DIFFERENT SETUPS
16
Translation to instances
- Each glass is an instance
- Contents and expert are two labeling systems
- Contents has an accuracy of 100%, expert has an accuracy of 50%
- The statistic is precision, F-score, recall, ... instead of accuracy
17
Stratified shuffling
- For labeled instances, it makes no sense to move the class label of one instance to another instance.
- Only shuffle the labels within each instance, i.e. between the two systems.
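Per-instance shuffling might be sketched as follows. Several choices here are assumptions rather than a description of the original script: each instance swaps its two system labels with probability 0.5, the test is two-sided (absolute difference), and precision is taken as 0 when a label is never predicted. All names are illustrative.

```python
import random


def precision(predicted, reference, label="A"):
    """Precision for `label`; taken as 0.0 when the label is never predicted."""
    hits = sum(p == label and r == label for p, r in zip(predicted, reference))
    total = sum(p == label for p in predicted)
    return hits / total if total else 0.0


def stratified_art(sys1, sys2, reference, statistic=precision,
                   n_shuffles=10_000, seed=0):
    """Two-sided approximate randomization test with per-instance swaps."""
    rng = random.Random(seed)
    actual = abs(statistic(sys1, reference) - statistic(sys2, reference))
    nge = 0
    for _ in range(n_shuffles):
        out1, out2 = [], []
        for a, b in zip(sys1, sys2):
            if rng.random() < 0.5:  # swap the two labels of this instance
                a, b = b, a
            out1.append(a)
            out2.append(b)
        pseudo = abs(statistic(out1, reference) - statistic(out2, reference))
        if pseudo >= actual:
            nge += 1
    return (nge + 1) / (n_shuffles + 1)


reference = ["A", "B"] * 20
perfect = list(reference)
inverted = ["B", "A"] * 20
print(stratified_art(perfect, inverted, reference))
```

Because labels never leave their instance, the reference alignment stays intact in every shuffle; only the assignment of an output to a system changes.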
18
MBT
- Assumption of independence between instances
- Shuffle per sentence rather than per token

        System 1  System 2
This    DT        NNS
is      VBZ       VB
nice    JJ        RB
...
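For tagger output, the unit of shuffling becomes the sentence, since tokens within a sentence are not independent. The sketch below swaps complete tag sequences between the two systems; the function name is illustrative, the 0.5 swap probability is an assumption, and the second sentence in the example is hypothetical data.

```python
import random


def shuffle_by_sentence(sys1_sents, sys2_sents, rng):
    """Swap the complete tag sequence of each sentence with probability 0.5."""
    out1, out2 = [], []
    for tags1, tags2 in zip(sys1_sents, sys2_sents):
        if rng.random() < 0.5:  # the whole sentence moves, never single tokens
            tags1, tags2 = tags2, tags1
        out1.append(tags1)
        out2.append(tags2)
    return out1, out2


# First sentence: "This is nice" from the slide; second sentence is hypothetical.
sys1 = [["DT", "VBZ", "JJ"], ["NN", "VBD"]]
sys2 = [["NNS", "VB", "RB"], ["NN", "VBN"]]

out1, out2 = shuffle_by_sentence(sys1, sys2, random.Random(0))
print(out1, out2)
```

The statistic (for example token accuracy) would then be recomputed on each shuffled pair, as in the instance-based case.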
19
Term extraction
- Shuffling extracted terms between the outputs of two term extraction systems

Reference | System 1 | System 2
happy sad good lively happy angry
20
Script
- http://www.clips.ua.ac.be/~vincent/software.html#art
- http://www.clips.ua.ac.be/scripts/art
Options:
- Exact and approximate randomization tests
- Instance based, also for MBT
- Term extraction based
- Stratified shuffling
- Two-sided / one-sided (check code!)
21
Remarks on usage
- It makes no sense to shuffle if the exact randomization can be computed.
- The value of p depends on NS: the larger NS, the lower p can be.
- Validity check
  - Sign test
  - Re-test: to alleviate bad randomization
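The dependence of p on NS follows directly from P = (nge + 1) / (NS + 1): even when no shuffle reaches the actual statistic (nge = 0), p cannot drop below 1 / (NS + 1). A short sketch, with an illustrative function name:

```python
def smallest_p(n_shuffles):
    """Lower bound of P = (nge + 1) / (NS + 1), reached when nge = 0."""
    return 1 / (n_shuffles + 1)


for ns in (19, 99, 999, 9999):
    print(f"NS = {ns:5d}  smallest possible p = {smallest_p(ns)}")
```

So NS has to be chosen large enough for the significance level of interest to be reachable at all.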
22
Sign test
- Can be compared with P for accuracy
- H0: correctness is independent of system, i.e. P(green) = 0.5
- Binomial test
[Table: System 1 | System 2]
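A sign test of this kind can be sketched as an exact binomial test with success probability 0.5. The sketch assumes, as in the standard sign test, that instances on which both systems are right or both are wrong are discarded, and that the test is two-sided; the function and parameter names are illustrative.

```python
from math import comb


def sign_test(only_sys1_correct, only_sys2_correct):
    """Two-sided exact binomial test with success probability 0.5."""
    n = only_sys1_correct + only_sys2_correct
    if n == 0:
        return 1.0  # no discordant instances: nothing to distinguish
    k = min(only_sys1_correct, only_sys2_correct)
    # Probability of k or fewer "wins" for the weaker side under H0.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: system 1 alone is correct 9 times, system 2 alone once.
print(sign_test(9, 1))
```

If the p-value of the approximate randomization test on accuracy differs strongly from this value, that is a hint that the shuffling went wrong.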
23
Interpretation (1)

Reference  System 1  System 2
A          A         B
B          A         B
C          A         B

How much do these two systems differ based on precision for the A label?
- Maximally
- Intermediate
- Minimally
24
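Interpretation (2)

Each label cell gives the System 1 label followed by the System 2 label for that reference instance. Δ is the precision of System 1 minus the precision of System 2; precision is taken as 0 when a system predicts no A.

Labels          Precision A
A   B   C       System 1  System 2  Δ
AB  AB  AB      1/3       0         1/3
BA  AB  AB      0         1         -1
AB  AB  BA      1/2       0         1/2
BA  BA  AB      0         1/2       -1/2
BA  AB  BA      0         1/2       -1/2
AB  BA  BA      1         0         1
BA  BA  BA      0         1/3       -1/3
AB  BA  AB      1/2       0         1/2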
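With only three instances, all 2^3 = 8 per-instance swaps can be enumerated exactly. The sketch below does this for the three-instance example (reference A, B, C; System 1 always A; System 2 always B), assuming precision is 0 when a system predicts no A and comparing absolute differences (two-sided). Names are illustrative.

```python
from fractions import Fraction
from itertools import product

reference = ["A", "B", "C"]
system1 = ["A", "A", "A"]
system2 = ["B", "B", "B"]


def precision_a(predicted, reference):
    """Precision for label A as an exact fraction; 0 if A is never predicted."""
    hits = sum(p == "A" and r == "A" for p, r in zip(predicted, reference))
    total = sum(p == "A" for p in predicted)
    return Fraction(hits, total) if total else Fraction(0)


def all_deltas(sys1, sys2, reference):
    """Precision difference for every combination of per-instance swaps."""
    deltas = []
    for swaps in product([False, True], repeat=len(reference)):
        out1 = [b if s else a for a, b, s in zip(sys1, sys2, swaps)]
        out2 = [a if s else b for a, b, s in zip(sys1, sys2, swaps)]
        deltas.append(precision_a(out1, reference) - precision_a(out2, reference))
    return deltas


deltas = all_deltas(system1, system2, reference)
actual = deltas[0]  # the first combination applies no swaps
n_ge = sum(abs(d) >= abs(actual) for d in deltas)
print(sorted(deltas), Fraction(n_ge, len(deltas)))
```

Under these assumptions every shuffle produces a difference at least as large in magnitude as the observed 1/3, so the observed difference is as unremarkable as it can be.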
25
Conclusion
- Approximate randomization testing can be used for many applications.
- The basic idea is to determine how (im)probable the actual difference between two systems is when all possible permutations of the outputs are evaluated.
- The difference can be computed in many ways, as long as the shuffled elements are independent.