Survey of English Usage University College London

Summer School in English Corpus Linguistics
Simple Statistics for Corpus Linguistics
Sean Wallis, Survey of English Usage, University College London

Outline
- Numbers…
- A simple research question: do women speak or write more than men in ICE-GB?
  - p = proportion = probability
- Another research question: what happens to speakers’ use of modal shall vs. will over time?
  - the idea of inferential statistics
  - plotting confidence intervals
- Concluding remarks

Numbers...
We are used to concepts like these being expressed as numbers:
- length (distance, height)
- area
- volume
- temperature
- wealth (income, assets)
We are going to discuss another concept:
- probability (proportion, percentage)
- a simple idea, at the heart of statistics

Probability
Based on another, even simpler, idea:
probability p = x / n
where
- frequency x (often written f) is the number of times something actually happens: the number of hits in a search
- baseline n is the number of times something could happen: the number of hits in a more general search, or in several alternative patterns (‘alternate forms’)
Probability can range from 0 to 1.
e.g. the probability that the speaker says will instead of shall:
- x = cases of will
- n = total: will + shall
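The definition above can be tried out with a few lines of Python. This is a minimal sketch: the function name `proportion` and the counts are invented for illustration and are not corpus results.

```python
def proportion(x: int, n: int) -> float:
    """Return p = x / n: the share of the n opportunities in which the event occurred."""
    if n <= 0:
        raise ValueError("baseline n must be positive")
    if not 0 <= x <= n:
        raise ValueError("frequency x must lie between 0 and n")
    return x / n


# Hypothetical counts, invented for illustration only (not taken from any corpus).
will_hits = 60
shall_hits = 40

# The baseline is the set of alternate forms: every case where will OR shall was chosen.
p_will = proportion(will_hits, will_hits + shall_hits)
print(f"p(will | {{will, shall}}) = {p_will:.2f}")
```

The checks in the function simply enforce the point made above: x can never exceed n, so p always falls between 0 and 1.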

What can a corpus tell us?
A corpus is a source of knowledge about language:
- corpus
- introspection/observation/elicitation
- controlled laboratory experiment
- computer simulation
How do these differ in what they might tell us?
A corpus is a sample of language, varying by:
- source (e.g. speech vs. writing, age...)
- levels of annotation (e.g. parsing)
- size (number of words)
- sampling method (random sample?)
How does this affect the types of knowledge we might obtain?

What can a parsed corpus tell us?
Three kinds of evidence may be found in a parsed corpus:
- Frequency evidence of a particular known rule, structure or linguistic event (How often?)
- Coverage evidence of new rules, etc. (How novel?)
- Interaction evidence of relationships between rules, structures and events (Does X affect Y?)
Lexical searches may also be made more precise using the grammatical analysis.

A simple research question
Let us consider the following question:
Do women speak or write more words than men in the ICE-GB corpus?
What do you think? How might we find out?

Let’s get some data
Open ICE-GB with ICECUP and combine these 3 queries:
- Text Fragment query for words: “*+<{~PUNC,~PAUSE}>” (counts every word, excluding pauses and punctuation)
- Variable query: TEXT CATEGORY = spoken, written
- Variable query: SPEAKER GENDER = f, m, <unknown>

ICE-GB: gender / written-spoken
Proportion of words in each category spoken/written by women and men
- The authors of some texts are unspecified
- Some written material may be jointly authored
- female/male ratio varies slightly
p (female) = words spoken by women / total words (excluding <unknown>)
[Figure: bar chart of the female and male proportions p (0 to 1) for written, spoken and TOTAL]

p = Probability = Proportion
We asked ourselves the following question:
Do women speak or write more words than men in the ICE-GB corpus?
To answer this we looked at the proportion of words in ICE-GB that are produced by women (out of all words where the gender is known).
The proportion of words produced by women can also be thought of as a probability:
What is the probability that, if we were to pick any random word in ICE-GB (and the gender was known), it would be uttered by a woman?

Another research question
Let us consider the following question:
What happens to modal shall vs. will over time in British English? Does shall increase or decrease?
What do you think? How might we find out?

Let’s get some data
Open DCPSE with ICECUP
- FTF query for first person declarative shall; repeat for will
- Corpus Map: DATE
Do the first set of queries and then drop into Corpus Map.

Modal shall vs. will over time
Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE)
[Figure: p(shall | {shall, will}) on the vertical axis, from 0.0 (shall = 0%) to 1.0 (shall = 100%), against year, 1955 to 1995 (Aarts et al., 2013)]
Is shall going up or down?

Is shall going up or down?
Whenever we look at change, we must ask ourselves two things:
- What is the change relative to? Is our observation higher or lower than we might expect? In this case we ask: does shall decrease relative to shall + will?
- How confident are we in our results? Is the change big enough to be reproducible?

The ‘sample’ and the ‘population’
We said that the corpus was a sample.
Previously, we asked about the proportions of male/female words in the corpus (ICE-GB):
- We asked questions about the sample
- The answers were statements of fact
Now we are asking about “British English”:
- We want to draw an inference from the sample (in this case, DCPSE) to the population (similarly-sampled BrE utterances)
- This inference is a best guess
- This process is called inferential statistics

Basic inferential statistics
Suppose we carry out an experiment:
- We toss a coin 10 times and get 5 heads
- How confident are we in the results?
Suppose we repeat the experiment:
- Will we get the same result again?
Let’s try…
- You should have one coin
- Toss it 10 times
- Write down how many heads you get
- Do you all get the same results?
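If no coins are to hand, the classroom exercise can be imitated in Python. The sketch below assumes a fair coin and a class of 20 people; the seed and the helper name `count_heads` are arbitrary choices made for illustration.

```python
import random


def count_heads(tosses: int, rng: random.Random) -> int:
    """Toss a fair coin `tosses` times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(tosses))


rng = random.Random(1)  # fixed seed (arbitrary) so that a rerun gives the same list
results = [count_heads(10, rng) for _ in range(20)]  # 20 people, 10 tosses each

print("heads per person:", results)
print("distinct outcomes:", sorted(set(results)))
```

Changing the seed changes the individual results, which is the point of the exercise: repeating the same experiment does not, in general, give the same number of heads.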

The Binomial distribution
Repeated sampling tends to form a Binomial distribution around the expected mean X.
- We toss a coin 10 times, and get 5 heads (N = 1)
- Due to chance, some samples will have a higher or lower score (N = 4)
[Figure: frequency F of each score x (1 to 9), with expected mean X = 5, after N repetitions]

It is helpful to express x as the probability of choosing a head, p, with expected mean P:
p = x / n
- n = max. number of possible heads (10)
- Probabilities are in the range 0 to 1, equivalent to percentages (0 to 100%)
[Figure: frequency F plotted against p (0.1 to 0.9), with expected mean P]

Take-home point:
- A single observation, say x hits (or p as a proportion of n possible hits) in the corpus, is not guaranteed to be correct ‘in the world’!
- Estimating the confidence you have in your results is essential
- We want to make predictions about future runs of the same experiment

Binomial → Normal
The Binomial (discrete) distribution is close to the Normal (continuous) distribution.
[Figure: frequency F plotted against p (0.1 to 0.9)]
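The Binomial distribution for the coin example can also be computed exactly rather than by repeated tossing. The following sketch uses the standard Binomial formula with n = 10 and P = 0.5; the function name `binomial_pmf` and the text histogram are illustrative choices.

```python
from math import comb


def binomial_pmf(x: int, n: int, P: float) -> float:
    """Probability of exactly x successes in n trials, each with success probability P."""
    return comb(n, x) * P**x * (1 - P) ** (n - x)


n, P = 10, 0.5  # ten tosses of a fair coin
for x in range(n + 1):
    prob = binomial_pmf(x, n, P)
    bar = "#" * round(prob * 100)  # crude text histogram, one '#' per percentage point
    print(f"x = {x:2d}  p = {x / n:.1f}  probability = {prob:.4f}  {bar}")
```

The printed bars peak at x = 5 (p = 0.5) and fall away symmetrically, which is the bell-like shape that the Normal distribution approximates.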

The central limit theorem
Any Normal distribution can be defined by only two variables and the Normal function z:
- population mean P
- standard deviation S = √(P(1 – P) / n)
With more data in the experiment, S will be smaller.
95% of the curve is within ~2 standard deviations of the expected mean.
- the correct figure is 1.95996, the critical value of z for an error level of 0.05
[Figure: Normal curve over p (0.1 to 0.7) with P ± z . S marked; 95% of the area lies inside the interval and 2.5% in each tail]
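Both quantities on this slide are easy to compute. The sketch below uses Python's standard-library `statistics.NormalDist` to obtain the two-tailed critical value of z, and shows S shrinking as n grows; the function names and the sample sizes are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def binomial_sd(P: float, n: int) -> float:
    """Standard deviation S = sqrt(P(1 - P) / n) of an observed proportion."""
    return sqrt(P * (1 - P) / n)


def z_critical(alpha: float = 0.05) -> float:
    """Two-tailed critical value of z: alpha / 2 of the curve lies in each tail."""
    return NormalDist().inv_cdf(1 - alpha / 2)


z = z_critical(0.05)
print(f"critical z for an error level of 0.05: {z:.5f}")

P = 0.5
for n in (10, 100, 1000):  # illustrative sample sizes
    S = binomial_sd(P, n)
    print(f"n = {n:4d}  S = {S:.4f}  P ± z.S = ({P - z * S:.4f}, {P + z * S:.4f})")
```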

The single-sample z test...
Is an observation p more than z standard deviations from the expected (population) mean P?
If yes, p is significantly different from P.
[Figure: Normal curve over p (0.1 to 0.7), centred on P with P ± z . S marked, 2.5% in each tail, and the observation p]
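As a rough sketch, the test can be written as a short function. The name `single_sample_z` and the coin figures are illustrative; with n as small as 10 the Normal approximation is coarse, so the example shows the mechanics rather than a recommended analysis.

```python
from math import sqrt
from statistics import NormalDist


def single_sample_z(p: float, P: float, n: int, alpha: float = 0.05) -> tuple[float, bool]:
    """Return (z score, significant?) for an observed p against an expected P."""
    S = sqrt(P * (1 - P) / n)                    # standard deviation at the expected mean P
    z_score = (p - P) / S                        # distance of p from P, in standard deviations
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_score, abs(z_score) > z_crit


# Coin example: expected P = 0.5, n = 10 tosses.
for heads in (5, 7, 9):
    z_score, significant = single_sample_z(heads / 10, 0.5, 10)
    print(f"{heads} heads: z = {z_score:+.2f}, significantly different from P? {significant}")
```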

...gives us a “confidence interval”
P ± z . S is the confidence interval for P.
We want to plot the interval about p.
The interval about p is called the Wilson score interval (w–, w+).
This interval reflects the Normal interval about P:
- If P is at the upper limit of p, p is at the lower limit of P (Wallis, 2013)
[Figure: Normal curve over p (0.1 to 0.7) centred on P, with the observation p and its interval bounds w– and w+; 2.5% in each tail]
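The Wilson score interval has a closed-form formula, sketched below in its standard form without continuity correction. The function name `wilson_interval` is an illustrative choice, and the final lines check the property stated above: if P sits at the upper bound w+, the observation p sits at the lower limit of the Normal interval about P.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(p: float, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wilson score interval (w-, w+) for an observed proportion p out of n cases."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


p, n = 0.5, 10  # coin example: 5 heads in 10 tosses
w_minus, w_plus = wilson_interval(p, n)
print(f"p = {p}, n = {n}: w- = {w_minus:.4f}, w+ = {w_plus:.4f}")

# Check the mirror property: put P at w+ and find the lower limit of P's Normal interval.
z = NormalDist().inv_cdf(0.975)
lower_limit_of_P = w_plus - z * sqrt(w_plus * (1 - w_plus) / n)
print(f"lower limit of the interval about P = w+: {lower_limit_of_P:.4f}")
```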

Simple test:
Compare p for all LLC texts in DCPSE ( ) with all ICE-GB texts (early 1990s).
We get the following data
- May be input in a 2 x 2 chi-square test
- or you can check Wilson intervals
We may plot the probability of shall being selected, with Wilson intervals.
[Figure: p(shall | {shall, will}), 0.0 to 1.0, for LLC and ICE-GB, with Wilson intervals]
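A 2 x 2 chi-square test can be sketched in a few lines. The counts below are invented placeholders, not the DCPSE figures, and the function computes the plain Pearson statistic without a continuity correction. For a 2 x 2 table (one degree of freedom) the critical value of chi-square equals the square of the critical value of z.

```python
from statistics import NormalDist


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column total must be positive")
    return n * (a * d - b * c) ** 2 / margins


# Hypothetical counts, invented for illustration only.
#            shall  will
earlier = (60, 40)   # stands in for the LLC subcorpus
later = (40, 60)     # stands in for the ICE-GB subcorpus

chi2 = chi_square_2x2(*earlier, *later)
critical = NormalDist().inv_cdf(0.975) ** 2  # critical value for 1 degree of freedom at 0.05

print(f"chi-square = {chi2:.3f}, critical value = {critical:.3f}")
print("significant difference" if chi2 > critical else "no significant difference")
```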

Plotting modal shall/will over time (DCPSE)
- Small amounts of data / year
- Confidence intervals identify the degree of certainty in our results
- Highly skewed p in some cases: p = 0 or 1 (circled)
- We can now estimate an approximate downwards curve (Aarts et al., 2013)
[Figure: p(shall | {shall, will}), 0.0 to 1.0, against year, 1955 to 1995, with confidence intervals]
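The skewed cases are where the choice of interval matters most. As a sketch, the code below compares the simple Normal ("Wald") interval p ± z . S computed at p with the Wilson score interval, using invented counts for three unnamed periods. At p = 0 or 1 the Wald interval collapses to zero width, whereas the Wilson interval still reports the uncertainty that comes from a small n.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # critical value for a 95% interval


def wald_interval(p: float, n: int, z: float = Z) -> tuple[float, float]:
    """Naive Normal interval centred on p; zero width when p is 0 or 1."""
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(p: float, n: int, z: float = Z) -> tuple[float, float]:
    """Wilson score interval, clamped to [0, 1] to absorb rounding error."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical (label, shall, shall + will) counts, invented for illustration only.
samples = [("period A", 9, 9), ("period B", 3, 12), ("period C", 0, 5)]

for label, x, n in samples:
    p = x / n
    wald = wald_interval(p, n)
    wilson = wilson_interval(p, n)
    print(f"{label}: p = {p:.2f}  Wald = ({wald[0]:.3f}, {wald[1]:.3f})  "
          f"Wilson = ({wilson[0]:.3f}, {wilson[1]:.3f})")
```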

Recap
Whenever we look at change, we must ask ourselves two things:
- What is the change relative to? Is our observation higher or lower than we might expect? In this case we ask: does shall decrease relative to shall + will?
- How confident are we in our results? Is the change big enough to be reproducible?

Conclusions
An observation is not the actual value
- Repeating the experiment might get different results
The basic idea of these methods is
- Predict range of future results if experiment was repeated
- ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
Based on the Binomial distribution
- Approximated by Normal distribution – many uses
Plotting confidence intervals
Use goodness of fit or single-sample z tests to compare an observation with an expected baseline
Use 2 × 2 χ² tests or two independent sample z tests to compare two observed samples

Research questions and baselines
Suppose you are told that cycling is getting safer.
- Do you believe them?
- Would you start cycling?
Facts:
- fatalities have increased
- are there more cyclists now? Yes, there are more cyclists now
- likelihood of death per km traveled has fallen
What is the most meaningful statistic?
- p (accident | population)
- p (accident | cyclist)
- p (accident | journey)
- p (accident | km)
See e.g. how-dangerous-is-cycling
[Figure: comparison of motorcyclists, pedestrians, cyclists, car passengers/drivers and other vehicles]
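The effect of the baseline can be shown with a small calculation. The figures below are invented purely for illustration and are not real road-safety statistics; they are chosen so that the raw count rises while the rate per km falls, which is the pattern described above.

```python
def rate(events: int, exposure: float) -> float:
    """Events per unit of exposure, i.e. p(event | baseline)."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return events / exposure


# Invented figures for two hypothetical years (NOT real statistics).
year_1 = {"fatalities": 100, "km_billions": 4.0}
year_2 = {"fatalities": 110, "km_billions": 5.5}

for label, year in (("year 1", year_1), ("year 2", year_2)):
    per_km = rate(year["fatalities"], year["km_billions"])
    print(f"{label}: {year['fatalities']} fatalities, {per_km:.1f} per billion km")
```

With these numbers the count goes up by 10% while the rate per billion km goes down, so "is cycling getting safer?" has a different answer depending on which baseline is chosen.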

Thinking about confidence intervals
In 2006 the UK medical journal The Lancet published a study of ‘excess deaths’ as a result of the US/UK invasion of Iraq in 2003 (Burnham et al. 2006):
- 654,965 (392,979, 942,636) at a 95% confidence level
Previous estimates were much lower
- below 100,000
The study was criticised for its wide confidence interval
- is this fair?
[Figure: the estimate p and its confidence interval on an axis marked 100 to 700]

References
Aarts, B., Close, J., and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2 in Aarts, B., Close, J., Leech, G., and Wallis, S.A. (eds.) The Verb Phrase in English. Cambridge University Press.
Burnham, G., Lafta, R., Doocy, S., and Roberts, L. 2006. Mortality after the 2003 invasion of Iraq: a cross-sectional cluster sample survey. Lancet 368: 1421–1428.
Wallis, S.A. 2013. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20:3.
Wilson, E.B. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22.
NOTE: Statistics papers, more explanation, spreadsheets etc. are published on corp.ling.stats blog:
