English Corpus Linguistics Introducing the Diachronic Corpus of Present-Day Spoken English (DCPSE) Sean Wallis UCL.

Barber (1964): changes in English grammar
a. A tendency to regularize irregular morphology (e.g. dreamt → dreamed);
b. A revival of the “mandative” subjunctive, probably inspired by formal US usage (we demand that she take part in the meeting);
c. Elimination of shall as a future marker in the first person;
d. Development of new, auxiliary-like uses of certain lexical verbs (e.g. get, want – cf., e.g., The way you look, you wanna / want to see a doctor soon);
e. Extension of the progressive to new constructions, e.g. modal, present perfect and past perfect passive progressive (the road would not be being built / has not been being built / had not been being built before the general elections);
f. Increase in the number and types of multi-word verbs (phrasal verbs, have/take/give a ride, etc.);
g. Placement of frequency adverbs before auxiliary verbs (even if no emphasis is intended – I never have said so);
h. Do-support for have (have you any money? and no, I haven’t any money → do you have / have you got any money? and no, I don’t have any money / haven’t got any money)…

The Diachronic Corpus of Present-day Spoken English (DCPSE)
– Orthographically transcribed spoken BrE
– Fully parsed
    every ‘sentence’ has a tree diagram
    searchable with ICECUP and FTFs (Fuzzy Tree Fragments)
– 400,000+ words each from
    London-Lund Corpus (LLC, aka the ‘Survey Corpus’)
    ICE-GB
– Balanced by text category
– Not evenly distributed by year
    LLC: samples from …
    ICE-GB: …

Tree diagrams A tree diagram for the sentence We’re getting there.

Barber on shall and will
“[T]he distinctions formerly made between shall and will are being lost, and will is coming increasingly to be used instead of shall. One reason for this is that in speech we very often say neither [will] nor [shall], but just [’ll]: I’ll see you to-morrow, we’ll meet you at the station, John’ll get it for you. We cannot use this weak form in all positions (not at the end of a phrase, for example), but we use it very often; and, whatever its historical origin may have been (probably from will), we now use it indiscriminately as a weak form for either shall or will; and very often the speaker could not tell you which he had intended. There is thus often a doubt in a speaker’s mind whether will or shall is the appropriate form; and, in this doubt, it is will that is spreading at the expense of shall, presumably because will is used more frequently than shall anyway, and so is likely to be the winner in a levelling process. So people nowadays commonly say or write I will be there, we will all die one day, and so on, when they intend to express simple futurity and not volition.” (Barber 1964: 134)

Denison on shall and will
“During the latter part of our period [1776-present day]... in the first person shall has increasingly been replaced by will even where there is no element of volition in the meaning.” (Denison 1998: 167)

The use of shall and will in written British and American English from the 1960s and 1990s
Figures are frequencies normalised per million words. Log likelihood (LL) is calculated against the number of words.

BrE      LOB      FLOB     LL     diff %
will     2,798    2,…      …      …
shall    …        …        …      …

AmE      Brown    Frown    LL     diff %
will     2,702    2,…      …      …
shall    …        …        …      …

From: Mair and Leech (2006: 327)

Mair and Leech’s data
Simply counts tagged lexical tokens
– Will = auxiliary verb, includes ’ll
– Shall = auxiliary verb
– Includes negative forms
Does not distinguish by grammatical position or context
– Does not ask whether the choice is available, e.g. limit to first person use
– Does not consider subclasses separately
    Negative cases: will not/won’t vs. shall not/shan’t?
    Do interrogative cases behave differently?
Is written data only
Can we do better than this?

An FTF for first person declarative shall
This FTF is limited to first person cases
– The FTF requires that the NP is realised by the pronoun I or we.
Interrogative cases have a different structure
We can subtract negative (shall not) cases to exclude them.

Shall vs. will
Does the proportion of cases of shall out of {shall, will} change over time?
χ² for first person subject; shall vs. will
d% = percentage difference (30% fall in shall between LLC and ICE-GB)
φ = an estimate of the size of the overall effect (a bit like d%)
χ² = 2x2 chi-square test: is this change statistically significant?
χ²(shall) = 2x1 goodness of fit test: does shall behave differently to average?

          shall    will    Total    χ²(shall)    χ²(will)
LLC       …        …       …        …            …
ICE-GB    …        …       …
TOTAL     …        …       …

Summary: d% = …% ± 20.84%; φ = 0.17; χ² = 8.09
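
A minimal Python sketch of the tests named on this slide may help. The counts are hypothetical, chosen only for illustration (the slide's own cell values are not reproduced here), and the function names are assumptions, not part of ICECUP. φ is computed as sqrt(χ²/N), one common definition for a 2x2 table, and the 2x1 goodness of fit test compares one column against the distribution of the row totals.

```python
import math

def chi_square(table):
    """Pearson chi-square for an r x c contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def phi(table):
    """Effect size for a 2x2 table: sqrt(chi-square / N)."""
    n = sum(sum(row) for row in table)
    return math.sqrt(chi_square(table) / n)

def goodness_of_fit(column, row_totals):
    """2x1 goodness of fit: does one column differ from the overall row distribution?"""
    n = sum(row_totals)
    column_n = sum(column)
    chi2 = 0.0
    for observed, row_total in zip(column, row_totals):
        expected = column_n * row_total / n
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# HYPOTHETICAL counts, rows = corpus (LLC, ICE-GB), columns = (shall, will).
table = [[60, 40],
         [30, 70]]
CRITICAL_95_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, 0.05 level

chi2_2x2 = chi_square(table)
effect = phi(table)
chi2_shall = goodness_of_fit([row[0] for row in table], [sum(row) for row in table])

print(f"2x2 chi-square = {chi2_2x2:.2f}, significant: {chi2_2x2 > CRITICAL_95_DF1}")
print(f"phi = {effect:.2f}")
print(f"goodness of fit for shall = {chi2_shall:.2f}")
```

Both tests here have one degree of freedom, so they share the same critical value.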

Shall vs. will/’ll
Does the proportion of cases of shall out of {shall, will, ’ll} change over time?
χ² for first person subject; shall vs. will vs. ’ll
χ²(shall) = 2x1 goodness of fit test: does shall behave differently to average?

          shall    will    ’ll    Total    χ²(shall)    χ²(will)    χ²(’ll)
LLC       …        …       …      …        …            …           …
ICE-GB    …        …       …      …
TOTAL     …        …       …      …

Focusing on choice
We focused on the choice of shall vs. will
– Mair and Leech simply said that total cases of shall fell
– But this might have happened for other reasons
    For example there may have been more opportunities to use shall in the LLC data
Examining choice is a more precise way of conducting experiments than counting frequencies
– It allows us to consider what variables (time, genre, other choices) affect the probability of shall being chosen
Probability is a simple fraction from 0 to 1.
– p(shall) = F(shall) / (F(shall) + F(will) + …)
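
The fraction above can be sketched in a few lines of Python. The counts and the function name `choice_probability` are hypothetical, for illustration only. The example also shows how the same number of shall cases yields a different probability once ’ll is added to the baseline.

```python
from typing import Mapping

def choice_probability(freqs: Mapping[str, int], target: str) -> float:
    """p(target) = F(target) / sum of F over every alternative in the baseline."""
    total = sum(freqs.values())
    if total == 0:
        raise ValueError("baseline contains no cases")
    return freqs[target] / total

# HYPOTHETICAL frequencies of first person cases in one subcorpus.
counts = {"shall": 40, "will": 60, "'ll": 300}

# Baseline {shall, will}
p_two_way = choice_probability({k: counts[k] for k in ("shall", "will")}, "shall")
# Baseline {shall, will, 'll}
p_three_way = choice_probability(counts, "shall")

print(f"p(shall | shall, will)      = {p_two_way:.2f}")
print(f"p(shall | shall, will, 'll) = {p_three_way:.2f}")
```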

Probability of shall vs. will over time

Probability of shall vs. will/’ll over time

Confidence intervals
Probability p(shall):
    0 = no cases are of type shall
    1 = all cases are of type shall
Our sample is a tiny subset of possible sentences from the same period
– So we cannot say a particular observation is certain
– Instead we try to estimate our confidence in an observation using error bars or confidence intervals
    The more data we have supporting an observation p, the smaller the confidence interval around it
    We set a confidence level, typically of 95%
– we are 95% sure that the true value is within the interval
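
The slide does not say which interval formula is used. The sketch below assumes the Wilson score interval, a standard choice for proportions, and uses hypothetical sample sizes to show that more data gives a narrower interval at the same 95% level.

```python
import math

Z_95 = 1.959964  # two-tailed critical value of the standard Normal distribution, 95% level

def wilson_interval(p, n, z=Z_95):
    """Wilson score interval (lower, upper) for an observed proportion p out of n cases."""
    if n <= 0:
        raise ValueError("n must be positive")
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width

# HYPOTHETICAL observations: the same proportion supported by different amounts of data.
small_sample = wilson_interval(0.5, 28)
large_sample = wilson_interval(0.5, 280)

print(f"n = 28:  {small_sample[0]:.3f} to {small_sample[1]:.3f}")
print(f"n = 280: {large_sample[0]:.3f} to {large_sample[1]:.3f}")
```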

Modal meaning
Remember Barber and Denison. Not all cases of shall or will mean the same thing
– Root (futurity):
    I’ve got some at home so I shall take it home. [DI-A18 #30]
    I will answer you in a minute. [DI-B30 #293]
– Epistemic (volition):
    So I shall have roughly from the twenty-ninth of June to the eighth of July on which I can spend the whole of that time on those two papers. [DL-B01 #62]
    It’s certainly my long term hope that I will have some kind of companion... [DI-B53 #0257]
We should examine these choices separately
– Unfortunately this means classifying cases manually

Modal meaning: statistics
Root shall / will is stable: results are not significant
Epistemic shall / will falls (d% = -30% ± 27%)
– The fall in shall is not explained by the sharp fall in Epistemic modals overall, from 100 (72+28) to 28 (14+14)
– This is evidence that the shift in use in C20 is concentrated within Epistemic meanings, from shall to will.
– Barber and Denison: earlier shift was in Root (future) meaning.
[Table: frequencies of shall and will in LLC and ICE-GB by modal meaning (Root, Epistemic, Unclear, Total), with χ² significance tests]
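
The Epistemic counts quoted on this slide are enough to re-derive the percentage difference. The sketch below assumes each sum lists shall first and will second (72 shall and 28 will in LLC, 14 and 14 in ICE-GB), a reading that reproduces the quoted fall of about 30%. It also runs a 2x2 chi-square test with the standard shortcut formula. It does not attempt to reproduce the ± 27% interval, whose method the slide does not state. Variable names are illustrative.

```python
# Epistemic cases, read from the slide as (shall, will) per corpus: an ASSUMPTION.
llc_shall, llc_will = 72, 28
ice_shall, ice_will = 14, 14

# Proportion of shall out of {shall, will} in each corpus.
p_llc = llc_shall / (llc_shall + llc_will)
p_ice = ice_shall / (ice_shall + ice_will)

# Percentage difference relative to the earlier corpus.
d_percent = (p_ice - p_llc) / p_llc * 100

# 2x2 chi-square via the shortcut N(ad - bc)^2 / (product of the four marginal totals).
a, b, c, d = llc_shall, llc_will, ice_shall, ice_will
n = a + b + c + d
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

CRITICAL_95_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, 0.05 level

print(f"p(shall): LLC = {p_llc:.2f}, ICE-GB = {p_ice:.2f}")
print(f"d% = {d_percent:.1f}%")
print(f"chi-square = {chi2:.2f}, significant: {chi2 > CRITICAL_95_DF1}")
```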

Modal meaning: statistics
Shall is losing its particular Epistemic meaning as a result
– In the LLC data two thirds (67%) of shall uses were Epistemic.
– This fell to 37% (just over one third) in ICE-GB.
[Table: frequencies of shall and will in LLC and ICE-GB by modal meaning (Root, Epistemic, Unclear, Total), with χ² significance tests]

Conclusions
DCPSE is
– orthographically transcribed spoken English
    mostly spontaneous
– fully parsed and checked by linguists, uses phrase structure grammar based on Quirk et al.
– searchable with ICECUP and FTFs
Even lexical studies benefit from parsing
– allows us to focus on when a choice occurs
You can use DCPSE to carry out many different experiments on real English
– we looked at change over (recent) time
– we might also look at how decisions interact

Conclusions
Designing a Corpus Linguistic experiment means thinking carefully about your hypothesis and then attempting to test it against the corpus
– We examined the shift from shall to will
– We limited it to first person, declarative, positive cases
– Changing baselines (including ’ll) may lead to different conclusions
Many corpus studies only consider word baselines (or pmw)
But it is often better to consider proportions of types of clause or phrase, or list specific alternative choices
– Alternation (choice) studies aim to hold meaning constant so the speaker/writer is free to choose between both cases
    We focused further by subdividing data by modal meaning

Suggested further reading
On shall vs. will and the progressive:
– Aarts, B., Close, J. and Wallis, S.A. (forthcoming) Choices over time: methodological issues in investigating current change. In: B. Aarts et al. The Changing Verb Phrase. Cambridge: CUP.
– Barber, C. (1964) Linguistic Change in Present-Day English. Edinburgh and London: Oliver and Boyd.
– Denison, D. (1998) Syntax. In: S. Romaine (ed.) The Cambridge History of the English Language, Vol. IV. Cambridge: Cambridge University Press.
– Mair, C. and Leech, G. (2006) Current changes in English syntax. In: B. Aarts and A. McMahon (eds.) The Handbook of English Linguistics. Malden MA: Blackwell Publishers.
On statistical tests, confidence intervals and other methods:
– Wallis, S.A. (2010) z-squared: the origin and use of χ². Survey of English Usage, UCL.