Statistics for variationists - or - what a linguist needs to know about statistics Sean Wallis Survey of English Usage University College London


1 Statistics for variationists - or - what a linguist needs to know about statistics Sean Wallis Survey of English Usage University College London s.wallis@ucl.ac.uk

2 Outline What is the point of statistics? –Variationist corpus linguistics –How inferential statistics works Introducing z tests –Two types (single-sample and two-sample) –How these tests are related to χ² ‘Effect size’ and comparing results of experiments Methodological implications for corpus linguistics

3 What is the point of statistics? Analyse data you already have –corpus linguistics Design new experiments –collect new data, add annotation –experimental linguistics ‘in the lab’ Try new methods –pose the right question We are going to focus on z and χ² tests

4 What is the point of statistics? Analyse data you already have –corpus linguistics [observational science] Design new experiments –collect new data, add annotation –experimental linguistics 'in the lab' [experimental science] Try new methods –pose the right question [philosophy of science] We are going to focus on z and χ² tests [a little maths]

5 What is ‘inferential statistics’? Suppose we carry out an experiment –We toss a coin 10 times and get 5 heads –How confident are we in the results? Suppose we repeat the experiment Will we get the same result again? Inferential statistics is a method of inferring the behaviour of future ‘ghost’ experiments from one experiment –We infer from the sample to the population Let us consider one type of experiment –Linguistic alternation experiments

6 Alternation experiments A variationist corpus paradigm Imagine a speaker forming a sentence as a series of decisions/choices. They can –add: choose to extend a phrase or clause, or stop –select: choose between constructions Choices will be constrained –grammatically –semantically

7 Alternation experiments A variationist corpus paradigm Imagine a speaker forming a sentence as a series of decisions/choices. They can –add: choose to extend a phrase or clause, or stop –select: choose between constructions Choices will be constrained –grammatically –semantically Research question: –within these constraints, what factors influence the particular choice?

8 Alternation experiments Laboratory experiment (cued) –pose the choice to subjects –observe the one they make –manipulate different potential influences Observational experiment (uncued) –observe the choices speakers make when they make them (e.g. in a corpus) –extract data for different potential influences sociolinguistic: subdivide data by genre, etc lexical/grammatical: subdivide data by elements in surrounding context –BUT the alternate choice is counterfactual

9 Statistical assumptions • A random sample taken from the population –Not always easy to achieve: multiple cases from the same text and speakers, etc; may be limited historical data available –Be careful with data concentrated in a few texts • The sample is tiny compared to the population –This is easy to satisfy in linguistics! • Observations are free to vary (alternate) • Repeated sampling tends to form a Binomial distribution around the expected mean –This requires slightly more explanation...

10 The Binomial distribution Repeated sampling tends to form a Binomial distribution around the expected mean [Chart: frequency F of samples by number of heads x, N = 1 experiment] We toss a coin 10 times, and get 5 heads

11 The Binomial distribution Repeated sampling tends to form a Binomial distribution around the expected mean [Chart: N = 4 experiments] Due to chance, some samples will have a higher or lower score

12 The Binomial distribution Repeated sampling tends to form a Binomial distribution around the expected mean [Chart: N = 8 experiments] Due to chance, some samples will have a higher or lower score

13 The Binomial distribution Repeated sampling tends to form a Binomial distribution around the expected mean [Chart: N = 12 experiments] Due to chance, some samples will have a higher or lower score

14 The Binomial distribution Repeated sampling tends to form a Binomial distribution around the expected mean [Chart: N = 16 experiments] Due to chance, some samples will have a higher or lower score

15 The Binomial distribution Repeated sampling tends to form a Binomial distribution around the expected mean [Chart: N = 20 experiments] Due to chance, some samples will have a higher or lower score

16 The Binomial distribution Repeated sampling tends to form a Binomial distribution around the expected mean [Chart: N = 24 experiments] Due to chance, some samples will have a higher or lower score
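The Binomial distribution behind these histograms can be computed exactly. A short sketch (not from the talk) for the 10-toss coin experiment:

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(exactly x successes in n Bernoulli trials with success probability p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Distribution of heads in 10 fair coin tosses:
dist = [binomial_pmf(x, 10, 0.5) for x in range(11)]
# The expected mean, x = 5 heads, is the single most likely outcome,
# but it occurs in only about a quarter of repeated experiments.
```

This is the limiting shape the simulated histograms approach as N grows.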

17 Binomial → Normal The Binomial (discrete) distribution is close to the Normal (continuous) distribution [Chart: Binomial histogram overlaid with Normal curve]

18 The central limit theorem Any Normal distribution can be defined by only two variables and the Normal function z –population mean P –standard deviation s = √( P(1 – P) / n ) –With more data in the experiment, s will be smaller –Divide x by 10 for probability scale [Chart: Normal curve F about P, width z·s, probability scale p = 0.1 to 0.7]

19 The central limit theorem Any Normal distribution can be defined by only two variables and the Normal function z –population mean P –standard deviation s = √( P(1 – P) / n ) –95% of the curve is within ~2 standard deviations of the expected mean –the correct figure is 1.95996! = the critical value of z for an error level of 0.05 [Chart: central 95% of the Normal curve, with 2.5% in each tail]

20 The central limit theorem Any Normal distribution can be defined by only two variables and the Normal function z –population mean P –standard deviation s = √( P(1 – P) / n ) –z α/2 = the critical value of z for an error level α of 0.05 [Chart: central 95% of the Normal curve, with 2.5% in each tail]
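The standard deviation formula and critical value on this slide give the familiar Normal interval directly. A minimal sketch (not from the talk) for the fair-coin case:

```python
from math import sqrt

def normal_interval(P, n, z=1.95996):
    """Normal interval about the population mean P: P ± z·s,
    where s = sqrt(P(1 - P)/n) is the standard deviation of the
    sampling distribution and z is the critical value for alpha = 0.05."""
    s = sqrt(P * (1 - P) / n)
    return P - z * s, P + z * s

# 10 tosses of a fair coin (P = 0.5, n = 10): about 95% of sample
# proportions should fall inside this interval.
lo, hi = normal_interval(0.5, 10)
```

With more data (larger n), s shrinks and the interval tightens, as the slide notes.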

21 The single-sample z test... Is an observation p more than z standard deviations from the expected (population) mean P? If yes, p is significantly different from P [Chart: Normal curve about P with 2.5% tail shaded; observation p]

22 ...gives us a "confidence interval" P ± z·s is the confidence interval for P –We want to plot the interval about p [Chart: Normal interval about P; observation p]

23 ...gives us a "confidence interval" P ± z·s is the confidence interval for P –We want to plot the interval about p [Chart: interval (w–, w+) about observation p]

24 ...gives us a "confidence interval" The interval about p is called the Wilson score interval This interval is asymmetric It reflects the Normal interval about P: if P is at the upper limit of p, p is at the lower limit of P (Wallis, to appear, a) [Chart: Wilson interval (w–, w+) about observation p]

25 ...gives us a "confidence interval" The interval about p is called the Wilson score interval To calculate w– and w+ we use this formula: w–, w+ = ( p + z²/2n ∓ z·√( p(1 – p)/n + z²/4n² ) ) / ( 1 + z²/n ) (Wilson, 1927)
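The Wilson (1927) score interval cited here is straightforward to compute. A minimal sketch (not from the talk):

```python
from math import sqrt

def wilson_interval(p, n, z=1.95996):
    """Wilson (1927) score interval (w-, w+) for an observed proportion p out of n."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

w_minus, w_plus = wilson_interval(0.5, 10)
# Unlike the Normal interval, Wilson bounds always stay inside [0, 1],
# even for the skewed cases p = 0 or p = 1:
zero_lo, zero_hi = wilson_interval(0.0, 10)
```

Note the asymmetry: except at p = 0.5, the interval is not centred on p, which is exactly the property the slide describes.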

26 Plotting confidence intervals Plotting modal shall/will over time (DCPSE) Small amounts of data / year Highly skewed p in some cases – p = 0 or 1 (circled) Confidence intervals identify the degree of certainty in our results (Wallis, to appear, a)

27 Plotting confidence intervals Probability of adding successive attributive adjective phrases (AJPs) to an NP in ICE-GB –x = number of AJPs As NPs get longer, adding AJPs is more difficult The first two falls are significant, the last is not [Chart: p against x = 0 to 4, vertical scale 0.00 to 0.25]

28 2 × 1 goodness of fit χ² test Same as single-sample z test for P (z² = χ²) –Does the value of a affect p(b)? IV: A = {a, ¬a} DV: B = {b, ¬b} [Chart: Normal curve about P = p(b); observation p(b | a)]

29 2 × 1 goodness of fit χ² test Same as single-sample z test for P (z² = χ²) Or Wilson test for p (by inversion) IV: A = {a, ¬a} DV: B = {b, ¬b} [Chart: Wilson interval (w–, w+) about p(b | a), compared with P = p(b)]

30 The single-sample z test Compares an observation with a given value –Compare p(b | a) with p(b) –A "goodness of fit" test –Identical to a standard 2 × 1 χ² test Note that p(b) is given –All of the variation is assumed to be in the estimate of p(b | a) –Could also compare p(b | ¬a) with p(b) [Chart: p(b | a) and p(b | ¬a) plotted against the baseline p(b)]
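The identity z² = χ² claimed for this test is easy to verify numerically. A sketch with hypothetical counts (not from the talk): 33 cases of b out of 60 under condition a, against a given baseline p(b) = 0.45.

```python
from math import sqrt

def single_sample_z(p, P, n):
    """Single-sample z score: distance of observation p from the given P,
    in units of the standard deviation sqrt(P(1 - P)/n)."""
    return (p - P) / sqrt(P * (1 - P) / n)

def gof_chi2(observed, n, P):
    """2x1 goodness of fit chi-square: observed successes out of n,
    against expected cell counts nP and n(1 - P)."""
    cells = [observed, n - observed]
    expected = [n * P, n * (1 - P)]
    return sum((o - e) ** 2 / e for o, e in zip(cells, expected))

# Hypothetical data: p(b | a) = 33/60 against baseline p(b) = 0.45.
z = single_sample_z(33 / 60, 0.45, 60)
chi2 = gof_chi2(33, 60, 0.45)
# z squared equals the 2x1 chi-square statistic, as the slides state.
```

Both tests assume all the variation is in the estimate p(b | a), with p(b) treated as given.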

31 z test for 2 independent proportions Method: combine observed values –take the difference (subtract) |p1 – p2| –calculate an 'averaged' confidence interval (Wallis, to appear, b) [Chart: distributions O1 and O2 about p1 = p(b | a) and p2 = p(b | ¬a)]

32 z test for 2 independent proportions New confidence interval D = |O1 – O2| –standard deviation s′ = √( p̂(1 – p̂)(1/n1 + 1/n2) ) –p̂ = p(b) –compare z·s′ with x = |p1 – p2| (Wallis, to appear, b) [Chart: difference distribution D about mean x = 0, with critical width z·s′]

33 z test for 2 independent proportions Identical to a standard 2 × 2 χ² test –So you can use the usual method!
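The claimed equivalence with the 2 × 2 χ² test can also be checked directly. A sketch with hypothetical counts (not from the talk): 33/60 cases of b under condition a versus 18/50 under ¬a.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """z test for two independent proportions, using the pooled
    probability p_hat and s' = sqrt(p_hat(1 - p_hat)(1/n1 + 1/n2))."""
    p1, p2 = x1 / n1, x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)   # pooled estimate of p(b)
    s = sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    return (p1 - p2) / s

def chi2_2x2(a, b, c, d):
    """Standard 2x2 chi-square on the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: 33 of 60 under a, 18 of 50 under ¬a.
z = two_proportion_z(33, 60, 18, 50)
chi2 = chi2_2x2(33, 27, 18, 32)
# The two tests are identical: z squared equals the 2x2 chi-square.
```

The difference from the goodness of fit case is that here both proportions are free to vary.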

34 z test for 2 independent proportions Identical to a standard 2 × 2 χ² test –So you can use the usual method! BUT: 2 × 1 and 2 × 2 tests have different purposes –2 × 1 goodness of fit compares single value a with superset A; assumes only a varies –2 × 2 test compares two values a, ¬a within a set A; both values may vary IV: A = {a, ¬a} [Diagram: a and ¬a within A; goodness of fit χ² vs 2 × 2 χ²]

35 z test for 2 independent proportions Identical to a standard 2 × 2 χ² test –So you can use the usual method! BUT: 2 × 1 and 2 × 2 tests have different purposes –2 × 1 goodness of fit compares single value a with superset A; assumes only a varies –2 × 2 test compares two values a, ¬a within a set A; both values may vary Q: Do we need χ²? IV: A = {a, ¬a} [Diagram: a and ¬a within A; goodness of fit χ² vs 2 × 2 χ²]

36 Larger χ² tests χ² is popular because it can be applied to contingency tables with many values –r × 1 goodness of fit χ² tests (r ≥ 2) –r × c χ² tests for homogeneity (r, c ≥ 2) z tests have 1 degree of freedom –strength: significance is due to only one source –strength: easy to plot values and confidence intervals –weakness: multiple values may be unavoidable With larger χ² tests, evaluate and simplify: –Examine χ² contributions for each row or column –Focus on alternation – try to test for a speaker choice

37 How big is the effect? These tests do not measure the strength of the interaction between two variables –They test whether the strength of an interaction is greater than would be expected by chance With lots of data, a tiny change would be significant Don’t use χ², p or z values to compare two different experiments –A result significant at p<0.01 is not ‘better’ than one significant at p<0.05 There are a number of ways of measuring ‘association strength’ or ‘effect size’

38 How big is the effect? Percentage swing –swing d = p(a | ¬b) – p(a | b) –% swing d % = d/p(a | b) –frequently used (“X increased by 50%”) may have confidence intervals on change can be misleading (“+50%” then “-50%” is not zero) –one change, not sequence –over one value, not multiple values

39 How big is the effect? Percentage swing –swing d = p(a | ¬b) – p(a | b) –% swing d% = d / p(a | b) –frequently used ("X increased by 50%") –may have confidence intervals on change –can be misleading ("+50%" then "–50%" is not zero) –one change, not a sequence –over one value, not multiple values Cramér's φ –φ = √( χ² / N ) (2 × 2), N = grand total –φc = √( χ² / ((k – 1)N) ) (r × c), k = min(r, c) –measures degree of association of one variable with another (across all values)
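Both effect size measures from these slides are one-liners. A sketch with hypothetical figures (not from the talk): a 2 × 2 result of χ² = 3.96 on N = 110 cases, and a proportion falling from 0.55 to 0.36.

```python
from math import sqrt

def cramers_phi(chi2, n, k=2):
    """Cramér's phi: sqrt(chi2 / ((k - 1) n)), k = min(rows, cols).
    For a 2x2 table (k = 2) this reduces to sqrt(chi2 / n)."""
    return sqrt(chi2 / ((k - 1) * n))

def percentage_swing(p_before, p_after):
    """Absolute swing d and percentage swing d% = d / p_before.
    (On the slide, p_before = p(a | b) and p_after = p(a | ¬b).)"""
    d = p_after - p_before
    return d, d / p_before

# Hypothetical 2x2 result: chi-square = 3.96 from N = 110 cases.
phi = cramers_phi(3.96, 110)
d, d_pct = percentage_swing(0.55, 0.36)
```

Unlike χ², p or z values, these are comparable across experiments of different sizes, which is the point the slide is making.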

40 Comparing experimental results Suppose we have two similar experiments –How do we test if one result is significantly stronger than another?

41 Comparing experimental results Suppose we have two similar experiments –How do we test if one result is significantly stronger than another? Test swings –z test for two samples from different populations –Use s′ = √( s1² + s2² ) –Test |d1(a) – d2(a)| > z·s′ (Wallis 2011) [Chart: swings d1(a) and d2(a), scale –0.7 to 0]

42 Comparing experimental results Suppose we have two similar experiments –How do we test if one result is significantly stronger than another? Test swings –z test for two samples from different populations –Use s′ = √( s1² + s2² ) –Test |d1(a) – d2(a)| > z·s′ Same method can be used to compare other z or χ² tests (Wallis 2011) [Chart: swings d1(a) and d2(a), scale –0.7 to 0]
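The test for comparing two swings described here is a direct application of the two-sample formula. A sketch with hypothetical swings and standard deviations (not from the talk):

```python
from math import sqrt

def compare_swings(d1, s1, d2, s2, z=1.95996):
    """Test whether two swings differ significantly:
    |d1 - d2| > z * s', where s' = sqrt(s1^2 + s2^2)."""
    s_combined = sqrt(s1 ** 2 + s2 ** 2)
    return abs(d1 - d2) > z * s_combined

# Hypothetical swings from two experiments, with their standard deviations:
significant = compare_swings(-0.50, 0.05, -0.30, 0.06)
```

This answers "is experiment 1's effect stronger than experiment 2's?", which p values from the two experiments cannot.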

43 Modern improvements on z and χ² 'Continuity correction' for small n –Yates' χ² test –errs on the side of caution –can also be applied to the Wilson interval Newcombe (1998) improves on the 2 × 2 χ² test –combines two Wilson score intervals –performs better than χ², log-likelihood (etc.) for low-frequency events or small samples However, for corpus linguists, there remains one outstanding problem...
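Newcombe's combination of two Wilson score intervals can be sketched as follows. This is my reading of Newcombe's (1998) difference interval, with hypothetical low-frequency counts where a plain χ² would be unreliable:

```python
from math import sqrt

def wilson(p, n, z=1.95996):
    """Wilson score interval for proportion p out of n."""
    centre, denom = p + z * z / (2 * n), 1 + z * z / n
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def newcombe_diff(p1, n1, p2, n2, z=1.95996):
    """Newcombe (1998) interval for the difference p1 - p2,
    combining the inner and outer limits of two Wilson intervals."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p1 - p2
    return (d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2),
            d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2))

# Hypothetical low-frequency data: 3/20 versus 1/25.
lo, hi = newcombe_diff(3 / 20, 20, 1 / 25, 25)
# The interval contains zero, so this difference is not significant.
```

Because the bounds inherit the Wilson interval's behaviour, the method stays sensible even when one observed proportion is at or near 0 or 1.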

44 Experimental design Each observation should be free to vary –i.e. p can be any value from 0 to 1 [Chart: p(b | words), p(b | VPs), p(b | tensed VPs) for items b1, b2]

45 Experimental design Each observation should be free to vary –i.e. p can be any value from 0 to 1 However, many people use these methods incorrectly –e.g. citation 'per million words' – what does this actually mean? [Chart: p(b | words), p(b | VPs), p(b | tensed VPs) for items b1, b2]

46 Experimental design Each observation should be free to vary –i.e. p can be any value from 0 to 1 However, many people use these methods incorrectly –e.g. citation 'per million words' – what does this actually mean? Baseline should be choice –Experimentalists can design choice into the experiment –Corpus linguists have to infer, counterfactually, when speakers had the opportunity to choose [Chart: p(b | words), p(b | VPs), p(b | tensed VPs) for items b1, b2]

47 A methodological progression Aim: –investigate change when speakers have a choice Four levels of experimental refinement: –pmw (baseline: words)

48 A methodological progression Aim: –investigate change when speakers have a choice Four levels of experimental refinement: –pmw (baseline: words) → select a plausible baseline (tensed VPs)

49 A methodological progression Aim: –investigate change when speakers have a choice Four levels of experimental refinement: –pmw (baseline: words) → select a plausible baseline (tensed VPs) → grammatically restrict data or enumerate cases ({will, shall})

50 A methodological progression Aim: –investigate change when speakers have a choice Four levels of experimental refinement: –pmw (baseline: words) → select a plausible baseline (tensed VPs) → grammatically restrict data or enumerate cases ({will, shall}) → check each case individually for plausibility of alternation ("Ye shall be saved")

51 Conclusions The basic idea of these methods is –Predict future results if the experiment were repeated –'Significant' = effect > 0 (e.g. 19 times out of 20) Based on the Binomial distribution –Approximated by the Normal distribution – many uses Plotting confidence intervals –Use goodness of fit or single-sample z tests to compare an observation with an expected baseline –Use 2 × 2 tests or two independent-sample z tests to compare two observed samples –When using larger r × c tests, simplify as far as possible to identify the source of variation! Take care with small samples / low frequencies –Use Wilson and Newcombe's methods instead!

52 Conclusions Two methods for measuring the 'size' of an experimental effect –absolute or percentage swing –Cramér's φ –You can compare two experiments These methods all presume that –observed p is free to vary (the speaker is free to choose) If this is not the case then –the statistical model is undermined –confidence intervals are too conservative –multiple changes are combined into one – e.g. VPs increase while modals decrease –so a significant change may not mean what you think!

53 References Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890 Wallis, S.A. 2011. Comparing χ² tests for separability. London: Survey of English Usage, UCL Wallis, S.A. to appear, a. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics Wallis, S.A. to appear, b. z-squared: The origin and use of χ². Journal of Quantitative Linguistics Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212 NOTE: My statistics papers, more explanation, spreadsheets etc. are published on corp.ling.stats blog: http://corplingstats.wordpress.com

