Overview of Corpus Linguistics Ling 240 10/8/14
Outline Definition History Current status
What is corpus linguistics? Linguistics: the scientific study of language using… Corpus: a large and principled collection of natural texts
History of corpus linguistics As early as 1897, Wilhelm Kaeding had 5,000 people compile a corpus of 11 million German words (and calculate their frequency and the distribution of letters). In the early 1900s, Otto Jespersen, a Danish professor, filled shoeboxes with thousands of paper slips containing interesting English sentences. In 1959, Randolph Quirk started the Survey of English Usage, a collection of spoken and written language, which he used to create a comprehensive English grammar.
History of corpus linguistics 1961: Brown Corpus 1M words 500 samples of 2,000 words Various genres; printed, edited, American English 1961: Lancaster-Oslo/Bergen (LOB) Corpus British version of Brown Corpus 1991: Frown and FLOB Corpora 1988: International Corpus of English (ICE) World English varieties 20 completed so far
History of corpus linguistics 1991: British National Corpus (BNC) 100M words Wide range of written (90%) and spoken (10%) texts 2008: BYU Corpora Corpus of Contemporary American English (COCA) TIME corpus Corpus of Historical American English (COHA) GloWbE Corpus International Corpus of Learner English (ICLE) MICASE & MICUSP
Status of corpus linguistics Is corpus linguistics a branch of linguistics or a method for doing linguistics? Evidence for branch: Journals such as Corpora and the International Journal of Corpus Linguistics Some researchers claim corpus linguistics as their area of emphasis Evidence for method: Most linguistic phenomena can be measured using CL CL has the potential to inform virtually any theory
Characteristics of corpus-based analyses It is empirical, analyzing the actual patterns of use in natural texts; It utilizes a large and principled collection of natural texts, known as a “corpus” as the basis for analysis; It makes extensive use of computers for analysis, using both automatic and interactive techniques; It depends on both quantitative and qualitative analytical techniques
Uses of corpora Changes over time Changes in register Changes in situation Changes in individual
Time
Different Genitives Of genitive The leg of the table 's genitive The table's leg NN genitive The table leg
's genitive vs. of-genitive vs. NN sequence
NN sequence across time in three registers
Situation
Phrasal Compression Uncompressed: The dog that was hungry was looking for something to eat. Drugs that require a prescription should be monitored. Compressed: The hungry dog was looking for something to eat. Prescription drugs should be monitored.
Phrasal compression across levels in an EAP reading series
Phrasal compression across levels in another EAP reading series
Individual
‘Abstract Exposition versus Concrete Action’
Corpus Design and Representativeness Ling 240
Designing Representative Corpora Many people believe that the design of a corpus doesn’t matter as long as it is large enough. Researchers typically focus on target domain representativeness and ignore linguistic representativeness Target domain (medical texts, newspapers, academic, general English, spoken) Very few corpora are actually evaluated in terms of their representativeness (target domain or linguistic)
Steps—representing the target domain Describe the target domain Design the corpus to represent target domain Complete the sampling Simple random Randomly choose sections of the data for the corpus Stratified Determine what genres are included and randomly sample from those data Cluster Divide data into naturally occurring groups and sample from them
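The three sampling schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text records, the `genre` and `source` fields, and the function names are all hypothetical, and the cluster version follows one common reading of cluster sampling (pick whole groups at random and keep everything in them).

```python
import random
from collections import defaultdict

def simple_random_sample(texts, n, seed=0):
    """Simple random: draw n texts from the whole collection."""
    rng = random.Random(seed)
    return rng.sample(texts, n)

def stratified_sample(texts, stratum_of, n_per_stratum, seed=0):
    """Stratified: group texts by genre (or another stratum), then
    draw the same number at random from each group.
    Raises ValueError if a stratum has fewer than n_per_stratum texts."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for text in texts:
        strata[stratum_of(text)].append(text)
    sample = []
    for name in sorted(strata):
        sample.extend(rng.sample(strata[name], n_per_stratum))
    return sample

def cluster_sample(texts, cluster_of, n_clusters, seed=0):
    """Cluster: group texts into naturally occurring clusters, pick
    some clusters at random, and keep every text in the chosen ones."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for text in texts:
        clusters[cluster_of(text)].append(text)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [text for name in chosen for text in clusters[name]]

# Hypothetical toy collection: 3 genres x 4 sources = 12 texts.
texts = [
    {"id": f"{genre}-{source}", "genre": genre, "source": source}
    for genre in ["news", "fiction", "academic"]
    for source in ["A", "B", "C", "D"]
]

random_part = simple_random_sample(texts, 5)
stratified_part = stratified_sample(texts, lambda t: t["genre"], 2)
cluster_part = cluster_sample(texts, lambda t: t["source"], 2)
```

A fixed `seed` makes the draw repeatable, which helps when a corpus design needs to be documented or reproduced.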
Norming practice
(raw count/total words) * 1000

            Text A   Text B
# Nouns     50       100
# Words     200      500

Text A: (50 nouns / 200 words) * 1000 = 250 nouns per thousand words
Text B: (100 nouns / 500 words) * 1000 = 200 nouns per thousand words
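The norming formula can be written as a small Python function. This is a minimal sketch; the function name and the `per` parameter are hypothetical, and the figures are the ones worked out for Text A and Text B.

```python
def normed_rate(raw_count, total_words, per=1000):
    """Return the rate of occurrence per `per` words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    # Multiply before dividing to keep the arithmetic exact for round numbers.
    return raw_count * per / total_words

text_a = normed_rate(50, 200)    # 50 nouns in 200 words
text_b = normed_rate(100, 500)   # 100 nouns in 500 words
print(text_a, text_b)            # 250.0 200.0
```

Passing `per=1_000_000` gives a per-million rate, which is the usual basis for comparing corpora of very different sizes.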
Norming practice BNC has 100 million words COCA has 450 million words # Per M snuck 11 767 sneaked 132 830
Corpus Annotation Ling 240
Annotation Corpora can be annotated for a wide range of external and internal variables. External variables Speaker L1 background Gender Extralinguistic information (e.g., laughter, nodding, etc.)
External annotation—example <Exam ID: 3B> <Arrangement ID: 54945> <Center ID: 14> <Candidate ID: 42285> <Test Date: 12/6/2013> <Age: 19> <Gender: F> <L1: Arabic> <Reason for test: B> <Original MELAB: 2> <Original Transformed: 3> <Second MELAB: > <Second Transformed: > <End header> E: Alright, welcome to the MELAB speaking exam, my name is <deleted>. And uh what is your name? T: Uh my name is uh <deleted>. E: Now I'll just uh read the MELAB ID number that we have for you. Uh you don't need to know it or anything. The number is <deleted>. Alright now that's out of the way. Uh why don't you uh start by telling me a little bit about why you're taking the MELAB today. T: Uh actually I came to USA to complete my education here, so uh if I want to go to university I need to get score and to to be good in speak English and I take uh a lab exam so I can enter to the university. E: Okay uh so uh what are you interested in studying at the university? T: Actually I think about uh medical science.
Part of speech tagging Rule-based Probability-based 95%+ accuracy rate Some features very easy (e.g., the) Some features more difficult (e.g., that) Pronoun (He doesn’t like that.) Determiner (He doesn’t like that dog.) Complementizer (They thought that they could do it.) Relativizer (The thought that I entertained.)
POS tagging accuracy Accuracy Example Precision – What percent of the cases labeled as X are actually X? Recall – What percent of all of the true cases of X were labeled as X? Example He saw that dog that I saw. If both ‘that’s are tagged as determiners: Calculate the precision and recall for determiners Calculate the precision and recall for relativizers
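One way to check the precision and recall exercise is a short Python sketch. The label names `DET` and `REL` and the function name are hypothetical; the gold tags assume the first "that" is a determiner and the second a relativizer, and the predicted tags follow the scenario in which both are tagged as determiners.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall for one label.
    Returns None where the denominator is zero (undefined)."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    n_labeled = sum(1 for p in predicted if p == label)   # cases labeled as X
    n_actual = sum(1 for g in gold if g == label)         # true cases of X
    precision = true_pos / n_labeled if n_labeled else None
    recall = true_pos / n_actual if n_actual else None
    return precision, recall

# "He saw that dog that I saw." (only the two instances of "that")
gold = ["DET", "REL"]
predicted = ["DET", "DET"]

det_scores = precision_recall(gold, predicted, "DET")
rel_scores = precision_recall(gold, predicted, "REL")
print(det_scores)  # (0.5, 1.0)
print(rel_scores)  # (None, 0.0)
```

Relativizer precision is undefined here because nothing was labeled as a relativizer, so the sketch returns `None` rather than a number.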
POS tagging—two examples
CLAWS Tagger:
Everything_PN1 I_PPIS1 've_VH0 read_VVN says_VV0 they_PPHS2 were_VBDR warned_VVN to_TO leave_VVI immediately_RR
Biber Tagger:
Everything ^pn++++=Everything
I ^pp1a+pp1+++=I've
've ^vb+hv+aux++0=EXTRAWORD
read ^vprf+++xvbnx+=read
says ^vb+vpub+++=say's
they ^pp3a+pp3+++=they
were ^vbd+bed+aux++=were
warned ^vpsv++agls+xvbnx+=warned
to ^to+vcmp+++=to
leave ^vbi++++=leave
immediately ^rb+tm+++=immediately
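Output in the `word_TAG` style shown for CLAWS is easy to read into word and tag pairs. The sketch below assumes every token contains the separator and splits on the last underscore; the function name is hypothetical, and the Biber format would need its own parser.

```python
from collections import Counter

def parse_word_tag(tagged, sep="_"):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs.
    Splits on the last separator so words containing '_' survive.
    Raises ValueError if a token has no separator."""
    pairs = []
    for token in tagged.split():
        word, tag = token.rsplit(sep, 1)
        pairs.append((word, tag))
    return pairs

claws = ("Everything_PN1 I_PPIS1 've_VH0 read_VVN says_VV0 they_PPHS2 "
         "were_VBDR warned_VVN to_TO leave_VVI immediately_RR")

pairs = parse_word_tag(claws)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs[0])           # ('Everything', 'PN1')
print(tag_counts["VVN"])  # 2
```

Once the tags are in this form, counting a feature is just a matter of counting the tags of interest.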
Lemmatization Lemma The citation or dictionary entry Run is the lemma It includes the words run, running, runs, ran We often want the frequency of the lemma not of a particular word like running
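Counting by lemma rather than by word form can be sketched as follows. The lookup table is a hand-made, hypothetical stand-in for a real lemmatizer and covers only the forms of run; anything not in the table is counted under its own lowercased form.

```python
from collections import Counter

# Hypothetical lemma table: word form -> lemma.
LEMMA_OF = {"run": "run", "runs": "run", "running": "run", "ran": "run"}

def lemma_frequencies(tokens, lemma_of=LEMMA_OF):
    """Count tokens by lemma, falling back to the lowercased form."""
    return Counter(lemma_of.get(tok.lower(), tok.lower()) for tok in tokens)

tokens = "She ran yesterday and runs today because running helps her run faster".split()
freqs = lemma_frequencies(tokens)
print(freqs["run"])  # 4
```

The four different forms are counted together under the lemma run, which is usually the frequency of interest.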
Answer these questions about COCA What external annotation does it contain? What internal annotation does it contain?
Answer these questions about COCA What external annotation does it contain? Text source Date of publication What internal annotation does it contain? Lemmatization Part of speech Genre
Example: s-genitive versus of-genitive ‘the bird’s owner’ vs. ‘the owner of the bird’ Finding 1: “by 1991, the s-genitive had overtaken the of-genitive in frequency” (Leech et al., 2009) Finding 2: the of-genitive is almost 10 times more frequent than the s-genitive in present-day English (Longman Grammar) Q: Are these findings contradictory?
Corpus study design—variationist Two approaches to corpus linguistics: Variationist and Text-Linguistic (Biber, 2012) Variationist: “has the goal of comparing linguistic variants: whether one or the other variant is preferred,” and identifying factors that predict which variant is used (Biber, 2012). Statistics: Binomial/logistic regression; Linear discriminant analysis Interpretation: When a choice can be made, variant X is preferred over variant Y, and factors A, B, and C play a role.
Variationist Analysis (Type A) Unit of analysis is linguistic feature Most studies do not take register into account (e.g. collocational studies) Comparison of the proportion of use in a particular register E.g., Szmrecsanyi & Hinrichs (2008): preference for s-genitive over of-genitive in speech; BUT: s-genitives are more frequent overall in writing.
Corpus study design—text-linguistic Text-Linguistic: “has the goal of providing a linguistic description of texts, by describing the density of grammatical features in texts” (Biber, 2012) Statistics: T-test, ANOVA, Multiple regression, Factor analysis Interpretation: Feature X is more frequent in context A than context B; or Feature X is more frequent than feature Y
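A small numeric sketch may help show how the two designs can report different things about the same feature. All counts below are invented for illustration and come from no real corpus. The assumption is that a variationist count includes only contexts where either variant could be used, while a text-linguistic count includes every occurrence.

```python
def variant_proportion(count_x, count_y):
    """Variationist measure: share of variant X where X and Y compete."""
    total = count_x + count_y
    if total == 0:
        raise ValueError("no interchangeable tokens to compare")
    return count_x / total

def rate_per_1000(count, total_words):
    """Text-linguistic measure: occurrences per 1,000 words."""
    return count * 1000 / total_words

# Hypothetical counts for a 20,000-word sample.
corpus_words = 20_000
s_interchangeable, of_interchangeable = 60, 40   # contexts allowing a choice
s_all, of_all = 80, 400                          # every occurrence

s_share = variant_proportion(s_interchangeable, of_interchangeable)
s_rate = rate_per_1000(s_all, corpus_words)
of_rate = rate_per_1000(of_all, corpus_words)
print(s_share)          # 0.6
print(s_rate, of_rate)  # 4.0 20.0
```

In this made-up case the s-genitive is preferred where a choice exists, yet the of-genitive is far more frequent overall, so the two measures answer different questions.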
Text-linguistic (Type B) Comparison of actual frequency of use in a particular register Unit of analysis is text Normed rates of occurrence by text Much more common for register studies
Text-linguistic (Type C) Also compares frequencies of use in a particular register Unit of analysis is subcorpus Normed rates of occurrence for features across subcorpora Cannot use inferential statistics (need to look at individual text to get a mean score for the register)
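The difference between Type B and Type C can be seen in a short sketch. The feature counts and text lengths are hypothetical. With the text as the unit of analysis, each text yields its own normed rate, so a mean and standard deviation are available. With the subcorpus as the unit, there is only one pooled rate.

```python
from statistics import mean, stdev

def rate_per_1000(count, total_words):
    return count * 1000 / total_words

# Hypothetical (feature count, word count) pairs for four texts in one register.
texts = [(10, 1000), (30, 1000), (5, 500), (40, 2000)]

# Type B: one normed rate per text, so dispersion can be measured.
per_text_rates = [rate_per_1000(count, words) for count, words in texts]
type_b_mean = mean(per_text_rates)
type_b_sd = stdev(per_text_rates)

# Type C: pool all counts and all words, giving a single rate and no variance.
pooled_count = sum(count for count, _ in texts)
pooled_words = sum(words for _, words in texts)
type_c_rate = rate_per_1000(pooled_count, pooled_words)

print(per_text_rates)  # [10.0, 30.0, 10.0, 20.0]
print(type_b_mean)     # 17.5
```

The pooled rate weights long texts more heavily, so it need not equal the mean of the per-text rates, and without per-text scores there is nothing for a t-test or ANOVA to work with.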
Quantitative analysis Coding/tagging features Counts in text vs. subcorpus Norming (raw count/total words * 1000) Use appropriate statistical tests if applicable
Kinds of Corpora Spoken language General corpora (mainly written) Bitext (two languages side-by-side) Specialized Children’s speech L2 learner speech Historical
General Corpora Mainly written British National Corpus (BNC) 100 million words 10% spoken 25% fiction 75% non-fiction
General Corpora Corpus of Contemporary American English (COCA) 450 million words (more added every year) Divided into registers Spoken Fiction Academic
General Corpora International Corpus of English 1 million words from each country 60% spoken, 40% written
Historical Corpora Helsinki Corpus English texts from 770-1700 Corpus of Historical American English (COHA) 1810-present
Introduction to COHA Corpus of Historical American English End up verbing Try and verb versus try to verb Adjectives and nouns used in 2000s not before Collocates of Muslim, liberal, Mormon ?
Raw Corpora Not easily searchable Not tagged Project Gutenberg Pre 1928 books (copyright expired) Online newspapers Time Magazine The internet General Conference
Where can you get corpora? Online BNC, COCA, COHA Distributors (membership required) ELRA (based in Europe) Linguistic Data Consortium US based BYU has a membership Catalog Top 10 corpora