Simple Statistics for Corpus Linguistics Sean Wallis Survey of English Usage University College London

Outline
Numbers…
A simple research question
– do women speak or write more than men in ICE-GB?
– p = proportion = probability
Another research question
– what happens to speakers’ use of modal shall vs. will over time?
– the idea of inferential statistics
– plotting confidence intervals
Concluding remarks

Numbers...
We are used to concepts like these being expressed as numbers:
– length (distance, height)
– area
– volume
– temperature
– wealth (income, assets)
We are going to discuss another concept:
– probability (proportion, percentage)
– a simple idea, at the heart of statistics

Probability
Based on another, even simpler, idea:
– probability p = x / n, where
– frequency x (often, f) is the number of times something actually happens: the number of hits in a search (here, cases of will)
– baseline n is the number of times something could happen: the number of hits in a more general search, or in several alternative patterns (‘alternate forms’) (here, the total: will + shall)
– e.g. the probability that the speaker says will instead of shall
Probability can range from 0 to 1
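As a minimal sketch of p = x / n, the Python below computes the proportion of will out of will + shall. The counts are invented for illustration and are not taken from any corpus, and the function name `proportion` is a hypothetical helper.

```python
def proportion(x, n):
    """Return p = x / n, where x is the frequency and n is the baseline."""
    if n <= 0:
        raise ValueError("baseline n must be positive")
    if not 0 <= x <= n:
        raise ValueError("frequency x must lie between 0 and n")
    return x / n

# Hypothetical counts (not real corpus data)
will_hits = 30                      # x: cases of will
shall_hits = 10                     # the alternative form
baseline = will_hits + shall_hits   # n: total will + shall

p_will = proportion(will_hits, baseline)
print(f"p(will | {{will, shall}}) = {p_will:.2f}")
```

Because x can never exceed n when the baseline includes every case counted in x, the result always falls between 0 and 1.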

What can a corpus tell us?
A corpus is a source of knowledge about language:
– corpus
– introspection/observation/elicitation
– controlled laboratory experiment
– computer simulation
How do these differ in what they might tell us?
A corpus is a sample of language, varying by:
– source (e.g. speech vs. writing, age...)
– levels of annotation (e.g. parsing)
– size (number of words)
– sampling method (random sample?)
How does this affect the types of knowledge we might obtain?

What can a parsed corpus tell us?
Three kinds of evidence may be found in a parsed corpus:
1. Frequency evidence of a particular known rule, structure or linguistic event (How often?)
2. Factual evidence of new rules, etc. (How novel?)
3. Interaction evidence of relationships between rules, structures and events (Does X affect Y?)
Lexical searches may also be made more precise using the grammatical analysis

A simple research question
Let us consider the following question:
– Do women speak or write more words than men in the ICE-GB corpus?
What do you think? How might we find out?

Let’s get some data
Open ICE-GB with ICECUP
– Text Fragment query for words: “*+ ” counts every word, excluding pauses and punctuation
– Variable query: TEXT CATEGORY = spoken, written
– Variable query: SPEAKER GENDER = f, m,
Combine these 3 queries

ICE-GB: gender / written-spoken
Proportion of words in each category spoken/written by women and men
– The authors of some texts are unspecified
– Some written material may be jointly authored
– female/male ratio varies slightly
[Table: TOTAL, spoken, written against female, male, p; cell values not preserved in this transcript]
p (female) = words spoken by women / total words (excluding )

p = Probability = Proportion
We asked ourselves the following question:
– Do women speak or write more words than men in the ICE-GB corpus?
– To answer this we looked at the proportion of words in ICE-GB that are produced by women (out of all words where the gender is known)
The proportion of words produced by women can also be thought of as a probability:
– What is the probability that, if we were to pick any random word in ICE-GB (and the gender was known), it would be uttered by a woman?

Another research question
Let us consider the following question:
– What happens to modal shall vs. will over time in British English?
– Does shall increase or decrease?
What do you think? How might we find out?

Let’s get some data
Open DCPSE with ICECUP
– FTF query for first person declarative shall: repeat for will
– Corpus Map: DATE
Do the first set of queries and then drop into Corpus Map

Modal shall vs. will over time
Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE)
[Figure: p(shall | {shall, will}) plotted over time, on a scale from shall = 0% to shall = 100% (Aarts et al. 2013)]
Is shall going up or down?

Is shall going up or down?
Whenever we look at change, we must ask ourselves two things:
1. What is the change relative to?
– Is our observation higher or lower than we might expect?
– In this case we ask: does shall decrease relative to shall + will?
2. How confident are we in our results?
– Is the change big enough to be reproducible?

The idea of a confidence interval
All observations are imprecise
– Randomness is a fact of life
– Our abilities are finite: to measure accurately, or to reliably classify into types
We need to express caution in citing numbers
Example (from Levin 2013):
– 77.27% of uses of think in 1920s data have a literal (‘cogitate’) meaning
  Really? Not 77.28, or 77.26?
– 77% of uses of think in 1920s data have a literal (‘cogitate’) meaning
  Sounds defensible. But how confident can we be in this number?
– 77% (66-86%*) of uses of think in 1920s data have a literal (‘cogitate’) meaning
  Finally we have a credible range of values. It needs a footnote* to explain how it was calculated.

The ‘sample’ and the ‘population’
We said that the corpus was a sample
Previously, we asked about the proportions of male/female words in the corpus (ICE-GB)
– We asked questions about the sample
– The answers were statements of fact
Now we are asking about “British English”
– We want to draw an inference from the sample (in this case, DCPSE) to the population (similarly-sampled BrE utterances)
– This inference is a best guess
– This process is called inferential statistics

Basic inferential statistics
Suppose we carry out an experiment
– We toss a coin 10 times and get 5 heads
– How confident are we in the results?
Suppose we repeat the experiment
– Will we get the same result again?
Let’s try…
– You should have one coin
– Toss it 10 times
– Write down how many heads you get
– Do you all get the same results?
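If no coin is to hand, the classroom exercise can be approximated in Python. This is only a sketch: the seed, the figure of 20 participants and the helper name `toss_coin` are arbitrary assumptions, not part of the original exercise.

```python
import random

def toss_coin(n_tosses, rng):
    """Return the number of heads in n_tosses tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n_tosses))

rng = random.Random(1)   # fixed seed (an arbitrary choice) so reruns match
results = [toss_coin(10, rng) for _ in range(20)]   # 20 people, 10 tosses each

print(results)
print("distinct scores observed:", sorted(set(results)))
```

Each entry in `results` is one person's number of heads. Running it should show that repeating the same experiment does not reliably give the same score.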

The Binomial distribution
Repeated sampling tends to form a Binomial distribution around the expected mean X
– We toss a coin 10 times, and get 5 heads (N = 1)
– Due to chance, some samples will have a higher or lower score
[Figure: frequency F of each score x, building up as the number of samples N grows through 1, 4, 8, 12, 16, 20 and 26]
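The shape that repeated sampling tends towards can be computed directly. The sketch below evaluates the standard Binomial probability of each score for 10 tosses of a fair coin and draws a crude text histogram; the function name `binomial_pmf` is our own.

```python
from math import comb

def binomial_pmf(x, n, P):
    """Probability of exactly x successes in n trials, each with chance P."""
    return comb(n, x) * P**x * (1 - P)**(n - x)

n, P = 10, 0.5   # 10 tosses of a fair coin
pmf = [binomial_pmf(x, n, P) for x in range(n + 1)]

for x, prob in enumerate(pmf):
    bar = "#" * round(prob * 100)   # crude text histogram
    print(f"x = {x:2d}  {prob:.4f}  {bar}")
```

The peak is at x = 5, the expected mean, but neighbouring scores such as 4 and 6 are also quite likely.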

The Binomial distribution
It is helpful to express x as the probability of choosing a head, p, with expected mean P
– p = x / n
– n = max. number of possible heads (10)
Probabilities are in the range 0 to 1 = percentages (0 to 100%)
[Figure: frequency F plotted against p, centred on P]

The Binomial distribution
Take-home point:
– A single observation, say x hits (or p as a proportion of n possible hits) in the corpus, is not guaranteed to be correct ‘in the world’!
Estimating the confidence you have in your results is essential
– We want to make predictions about future runs of the same experiment
[Figure: frequency F plotted against p, with the expected mean P and an observation p marked]

Binomial → Normal
The Binomial (discrete) distribution is close to the Normal (continuous) distribution
[Figure: frequency F plotted against x]

The central limit theorem
Any Normal distribution can be defined by only two variables and the Normal function z
– population mean P
– standard deviation S = √(P(1 – P) / n)
– With more data in the experiment, S will be smaller
– 95% of the curve is within ~2 standard deviations of the expected mean
– the correct figure is 1.95996 = the critical value of z for an error level of 0.05
[Figure: Normal curve of frequency F against p, centred on P, with the distance z.S and a 2.5% tail marked]
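Both quantities on this slide can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist` to find the two-tailed critical value of z, and shows S shrinking as n grows; the helper name `std_dev` and the sample sizes are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def std_dev(P, n):
    """Standard deviation S = sqrt(P(1 - P) / n) of an observed proportion."""
    return sqrt(P * (1 - P) / n)

# Two-tailed critical value of z for an error level of 0.05 (2.5% per tail)
z = NormalDist().inv_cdf(1 - 0.05 / 2)
print(f"z = {z:.5f}")

P = 0.5   # expected proportion of heads for a fair coin
for n in (10, 100, 1000):
    S = std_dev(P, n)
    print(f"n = {n:4d}  S = {S:.4f}  z.S = {z * S:.4f}")
```

Multiplying n by 100 divides S by 10, so halving the width of the interval takes about four times as much data.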

The single-sample z test...
Is an observation p > z standard deviations from the expected (population) mean P?
– If yes, p is significantly different from P
[Figure: Normal curve of frequency F about P, with the distance z.S, a 2.5% tail and the observation p marked]
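A minimal sketch of this test in Python, applied to the coin example (P = 0.5, n = 10). The function name `z_test` and its return shape are our own choices. With so few tosses the Normal approximation is rough, so treat the output as illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def z_test(p, P, n, alpha=0.05):
    """Single-sample z test: is p more than z standard deviations from P?"""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    S = sqrt(P * (1 - P) / n)                      # standard deviation about P
    z_score = (p - P) / S
    return z_score, abs(z_score) > z_crit

# Coin example: expected P = 0.5, n = 10 tosses
for heads in (5, 7, 9):
    z_score, significant = z_test(heads / 10, 0.5, 10)
    print(f"{heads} heads: z = {z_score:+.2f}, significant = {significant}")
```

Under these assumptions 7 heads out of 10 is within the expected range for a fair coin, whereas 9 heads out of 10 is not.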

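...gives us a “confidence interval”
P ± z.S is the confidence interval for P
– We want to plot the interval about p
[Figure: Normal curve of frequency F about P, with 95% of the area inside the interval and 2.5% in a tail; the observation p is shown with its own bounds w– and w+]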

...gives us a “confidence interval”
The interval about p is called the Wilson score interval
This interval reflects the Normal interval about P:
– If P is at the upper limit of p, p is at the lower limit of P (Wallis, 2013)
[Figure: Normal curve of frequency F about P with a 2.5% tail; the observation p is shown with its bounds w– and w+]
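The slides do not give the formula, so the sketch below assumes the standard form of the Wilson score interval. It computes (w–, w+) for the coin example, then checks the relationship stated above: if P is placed at w+, the lower end of the Normal interval about P falls back on p. The function name `wilson_interval` is our own.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) about an observed proportion p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread

p, n = 0.5, 10   # 5 heads in 10 tosses
w_minus, w_plus = wilson_interval(p, n)
print(f"p = {p}, Wilson interval = ({w_minus:.4f}, {w_plus:.4f})")

# Check the inversion: put P at the upper bound w+, and the lower end of
# the Normal interval about P should land back on the observation p.
z = NormalDist().inv_cdf(0.975)
P = w_plus
lower_of_P = P - z * sqrt(P * (1 - P) / n)
print(f"P = {P:.4f}, P - z.S = {lower_of_P:.4f}")
```

Unlike p ± z.S computed from p itself, the bounds returned here always stay within the range 0 to 1.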

Modal shall vs. will over time
Simple test:
– Compare p for all LLC texts in DCPSE ( ) with all ICE-GB texts (early 1990s)
– We get the following data
– We may plot the probability of shall being selected, with Wilson intervals
[Figure: p(shall | {shall, will}) for LLC and ICE-GB, with Wilson intervals; data values not preserved in this transcript]
The data may be input in a 2 x 2 chi-square test, or you can check Wilson intervals
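The data table did not survive in this transcript, so the sketch below runs a 2 x 2 chi-square test on invented counts purely to show the mechanics. The counts, the function name `chi_square_2x2` and the omission of a continuity correction are all assumptions.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]] (no continuity correction)."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = a + b + c + d
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts (NOT the DCPSE figures): rows = subcorpus, columns = shall, will
earlier = (60, 40)
later = (40, 60)

chi2 = chi_square_2x2(*earlier, *later)
CRITICAL_005_DF1 = 3.841   # chi-square critical value, 1 degree of freedom, error level 0.05
print(f"chi-square = {chi2:.3f}, significant = {chi2 > CRITICAL_005_DF1}")
```

A result above the critical value means the difference between the two proportions of shall is significant at the 0.05 error level.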

Modal shall vs. will over time
Plotting modal shall/will over time (DCPSE)
– Small amounts of data / year
– Confidence intervals identify the degree of certainty in our results
– Highly skewed p in some cases: p = 0 or 1 (circled)
– We can now estimate an approximate downwards curve (Aarts et al. 2013)
[Figure: p(shall | {shall, will}) plotted by year with confidence intervals]
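Small yearly samples with p = 0 or 1 are where the choice of interval matters most. The sketch below compares a naive Normal interval computed from p (often called a Wald interval, a name not used in the slides) with the standard Wilson score interval. The per-year counts are invented and the function names are our own.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)   # two-tailed critical value, error level 0.05

def normal_interval(p, n):
    """Naive Normal ('Wald') interval p +/- z.S, with S computed from p."""
    spread = Z * sqrt(p * (1 - p) / n)
    return p - spread, p + spread

def wilson_interval(p, n):
    """Wilson score interval about an observed proportion p."""
    denom = 1 + Z * Z / n
    centre = (p + Z * Z / (2 * n)) / denom
    spread = Z * sqrt(p * (1 - p) / n + Z * Z / (4 * n * n)) / denom
    return centre - spread, centre + spread

# Hypothetical per-year counts (shall hits, shall + will total); not DCPSE data
years = {"year A": (0, 5), "year B": (3, 8), "year C": (6, 6)}

for label, (x, n) in years.items():
    p = x / n
    n_lo, n_hi = normal_interval(p, n)
    w_lo, w_hi = wilson_interval(p, n)
    print(f"{label}: p = {p:.2f}  Normal ({n_lo:.2f}, {n_hi:.2f})  Wilson ({w_lo:.2f}, {w_hi:.2f})")
```

At p = 0 or 1 the naive interval collapses to zero width, which would wrongly suggest certainty from a handful of cases. The Wilson interval stays wide on the side where the true value could still lie.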

Recap
Whenever we look at change, we must ask ourselves two things:
1. What is the change relative to?
– Is our observation higher or lower than we might expect?
– In this case we ask: does shall decrease relative to shall + will?
2. How confident are we in our results?
– Is the change big enough to be reproducible?

Conclusions
An observation is not the actual value
– Repeating the experiment might get different results
The basic idea of these methods is
– Predict range of future results if experiment was repeated
– ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
Based on the Binomial distribution
– Approximated by Normal distribution – many uses
Plotting confidence intervals
Use goodness of fit or single-sample z tests to compare an observation with an expected baseline
Use 2 × 2 tests or two independent sample z tests to compare two observed samples

References
Aarts, B., J. Close, G. Leech and S.A. Wallis (eds). 2013. The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.
– Aarts, B., Close, J., and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2.
– Levin, M. 2013. The progressive in modern American English. Chapter 8.
Wallis, S.A. 2013. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20:3.
Wilson, E.B. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22.
NOTE: Statistics papers, more explanation, spreadsheets etc. are published on the corp.ling.stats blog.