1
Overview of Corpus Linguistics
10/8/14 Overview of Corpus Linguistics Ling 240
2
Outline
Definition
History
Current status
3
What is corpus linguistics?
Linguistics: the scientific study of language using…
Corpus: a large and principled collection of natural texts
4
History of corpus linguistics
As early as 1897, Wilhelm Kaeding had 5,000 people compile a corpus of 11 million German words (and calculate word frequencies and the distribution of letters).
In the early 1900s, Otto Jespersen, a Danish professor, filled shoeboxes with thousands of paper slips containing interesting English sentences.
In 1959, Randolph Quirk started the Survey of English Usage, a collection of spoken and written language that he used to create a comprehensive English grammar.
5
History of corpus linguistics
1961: Brown Corpus
  1M words
  500 samples of 2,000 words
  Various genres; printed, edited, American English
1961: Lancaster-Oslo/Bergen (LOB) Corpus
  British version of Brown Corpus
1991: Frown and FLOB Corpora
1988: International Corpus of English (ICE)
  World English varieties
  20 completed so far
6
History of corpus linguistics
1991: British National Corpus (BNC)
  100M words
  Wide range of written (90%) and spoken (10%) texts
2008: BYU Corpora
  Corpus of Contemporary American English (COCA)
  TIME corpus
  Corpus of Historical American English (COHA)
  GloWbE Corpus
International Corpus of Learner English (ICLE)
MICASE & MICUSP
7
Status of corpus linguistics
Is corpus linguistics a branch of linguistics or a method for doing linguistics?
Evidence for branch:
  Journals such as Corpora and the International Journal of Corpus Linguistics
  Some researchers claim corpus linguistics as their area of emphasis
Evidence for method:
  Most linguistic phenomena can be measured using CL
  CL has the potential to inform virtually any theory
8
Characteristics of corpus-based analyses
It is empirical, analyzing the actual patterns of use in natural texts;
It utilizes a large and principled collection of natural texts, known as a “corpus,” as the basis for analysis;
It makes extensive use of computers for analysis, using both automatic and interactive techniques;
It depends on both quantitative and qualitative analytical techniques.
9
Uses of corpora
Changes over time
Changes in register
Changes in situation
Changes in individual
10
Time
11
Different Genitives
Of-genitive: The leg of the table
's genitive: The table's leg
NN genitive: The table leg
12
's genitive vs. of-genitive vs. NN sequence
13
NN sequence across time in three registers
14
Situation
15
Phrasal Compression
Uncompressed
  The dog that was hungry was looking for something to eat.
  Drugs that require a prescription should be monitored.
Compressed
  The hungry dog was looking for something to eat.
  Prescription drugs should be monitored.
16
Phrasal compression across levels in an EAP reading series
17
Phrasal compression across levels in another EAP reading series
18
Individual
19
‘Abstract Exposition versus Concrete Action’
20
Corpus Design and Representativeness
Ling 240
21
Designing Representative Corpora
Many people believe that the design of a corpus doesn’t matter as long as it is large enough.
Researchers typically focus on target domain representativeness and ignore linguistic representativeness.
  Target domain (medical texts, newspapers, academic, general English, spoken)
Very few corpora are actually evaluated in terms of their representativeness (target domain or linguistic).
22
Steps—representing the target domain
Describe the target domain
Design the corpus to represent target domain
Complete the sampling
  Simple random: Randomly choose sections of the data for the corpus
  Stratified: Determine what genres are included and randomly sample from those data
  Cluster: Divide data into naturally occurring groups and sample from them
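The three sampling schemes named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `texts` list, its `genre` and `source` fields, and the function names are all hypothetical, and the cluster version shown here picks whole clusters at random (one common reading of "sample from them").

```python
import random

def simple_random_sample(texts, k, seed=0):
    """Simple random: draw k texts from the whole pool."""
    rng = random.Random(seed)
    return rng.sample(texts, k)

def stratified_sample(texts, k_per_genre, seed=0):
    """Stratified: group by genre, then draw k texts from every genre."""
    rng = random.Random(seed)
    by_genre = {}
    for text in texts:
        by_genre.setdefault(text["genre"], []).append(text)
    sample = []
    for genre in sorted(by_genre):
        sample.extend(rng.sample(by_genre[genre], k_per_genre))
    return sample

def cluster_sample(texts, n_clusters, seed=0):
    """Cluster: group by naturally occurring source, pick whole groups."""
    rng = random.Random(seed)
    by_source = {}
    for text in texts:
        by_source.setdefault(text["source"], []).append(text)
    chosen = rng.sample(sorted(by_source), n_clusters)
    return [text for source in chosen for text in by_source[source]]

# Hypothetical pool: 12 texts, 3 genres, 2 sources per genre.
texts = [
    {"id": i, "genre": genre, "source": f"{genre}-{i % 2}"}
    for i, genre in enumerate(["news", "fiction", "academic"] * 4)
]

print(len(simple_random_sample(texts, 4)))
print([t["genre"] for t in stratified_sample(texts, 2)])
print(sorted({t["source"] for t in cluster_sample(texts, 2)}))
```

Stratified sampling guarantees that every genre is present in the corpus, which simple random sampling does not.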
23
Norming practice
(raw count/total words) * 1000
           Text A   Text B
# Nouns    50       100
# Words    200      500
Text A: (50 nouns / 200 words) * 1000 = 250 nouns per thousand words
Text B: (100 nouns / 500 words) * 1000 = 200 nouns per thousand words
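The norming formula can be checked with a short Python sketch. The function name `normed_rate` is an assumption made for illustration; the figures are the ones used in the worked calculation for Text A and Text B.

```python
def normed_rate(raw_count, total_words, per=1000):
    """Return occurrences per `per` words: (raw count / total words) * per."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return raw_count / total_words * per

# Figures from the worked calculation on the slide.
text_a = normed_rate(50, 200)   # 50 nouns in 200 words
text_b = normed_rate(100, 500)  # 100 nouns in 500 words

print(f"Text A: {text_a:.0f} nouns per thousand words")
print(f"Text B: {text_b:.0f} nouns per thousand words")
```

Text B has more nouns in raw terms, but Text A has the higher normed rate, which is why raw counts from texts of different lengths should not be compared directly.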
24
Norming practice BNC has 100 million words COCA has 450 million words
# Per M snuck 11 767 sneaked 132 830
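A possible way to work this exercise in Python is sketched below. It rests on an assumption about the flattened table: that the two numbers in each row are raw counts in the BNC and in COCA respectively, and that "Per M" (per million words) is the column to be filled in. The dictionary names are hypothetical.

```python
# Corpus sizes as stated on the slide.
CORPUS_SIZES = {"BNC": 100_000_000, "COCA": 450_000_000}

# ASSUMPTION: first number = raw count in BNC, second = raw count in COCA.
RAW_COUNTS = {
    "snuck": {"BNC": 11, "COCA": 767},
    "sneaked": {"BNC": 132, "COCA": 830},
}

def per_million(raw_count, corpus_size):
    """Normed rate per one million words."""
    return raw_count / corpus_size * 1_000_000

rates = {
    form: {corpus: per_million(count, CORPUS_SIZES[corpus])
           for corpus, count in counts.items()}
    for form, counts in RAW_COUNTS.items()
}

for form, by_corpus in rates.items():
    for corpus, rate in by_corpus.items():
        print(f"{form:8s} {corpus:5s} {rate:.2f} per million words")
```

Norming to a common base makes the two corpora comparable even though COCA is 4.5 times the size of the BNC.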
25
Corpus Annotation Ling 240
26
Annotation
Corpora can be annotated for a wide range of external and internal variables.
External variables
  Speaker L1 background
  Gender
  Extralinguistic information (e.g., laughter, nodding)
27
External annotation—example
<Exam ID: 3B>
<Arrangement ID: 54945>
<Center ID: 14>
<Candidate ID: 42285>
<Test Date: 12/6/2013>
<Age: 19>
<Gender: F>
<L1: Arabic>
<Reason for test: B>
<Original MELAB: 2>
<Original Transformed: 3>
<Second MELAB: >
<Second Transformed: >
<End header>
E: Alright, welcome to the MELAB speaking exam, my name is <deleted>. And uh what is your name?
T: Uh my name is uh <deleted>.
E: Now I'll just uh read the MELAB ID number that we have for you. Uh you don't need to know it or anything. The number is <deleted>. Alright now that's out of the way. Uh why don't you uh start by telling me a little bit about why you're taking the MELAB today.
T: Uh actually I came to USA to complete my education here, so uh if I want to go to university I need to get score and to to be good in speak English and I take uh a lab exam so I can enter to the university.
E: Okay uh so uh what are you interested in studying at the university?
T: Actually I think about uh medical science.
28
Part of speech tagging
Rule-based
Probability-based
95%+ accuracy rate
Some features very easy (e.g., the)
Some features more difficult (e.g., that)
  Pronoun (He doesn’t like that.)
  Determiner (He doesn’t like that dog.)
  Complementizer (They thought that they could do it.)
  Relativizer (The thought that I entertained.)
29
POS tagging accuracy
Accuracy
  Precision – What percent of the cases labeled as X are actually X?
  Recall – What percent of all of the true cases of X were labeled as X?
Example
  He saw that dog that I saw.
  If both ‘that’s are tagged as determiners:
    Calculate the precision and recall for determiners
    Calculate the precision and recall for relativizers
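The example can be worked through with a small Python sketch. The function name and the label strings `DET` and `REL` are assumptions for illustration. The gold tags follow the slide's own categories: the first 'that' ("that dog") is a determiner and the second ("that I saw") is a relativizer.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall for one label; None when the denominator is 0."""
    true_pos = sum(1 for g, p in zip(gold, predicted)
                   if g == label and p == label)
    labeled_as = sum(1 for p in predicted if p == label)  # cases labeled X
    actual = sum(1 for g in gold if g == label)           # true cases of X
    precision = true_pos / labeled_as if labeled_as else None
    recall = true_pos / actual if actual else None
    return precision, recall

# The two instances of 'that' in "He saw that dog that I saw."
gold = ["DET", "REL"]
predicted = ["DET", "DET"]  # the tagger calls both determiners

det = precision_recall(gold, predicted, "DET")
rel = precision_recall(gold, predicted, "REL")
print("determiner  precision, recall:", det)  # (0.5, 1.0)
print("relativizer precision, recall:", rel)  # (None, 0.0)
```

Relativizer precision is undefined here because nothing was labeled as a relativizer, so there are no labeled cases to evaluate.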
30
POS tagging—two examples
CLAWS Tagger
Everything_PN1 I_PPIS1 've_VH0 read_VVN says_VV0 they_PPHS2 were_VBDR warned_VVN to_TO leave_VVI immediately_RR
Biber Tagger
Everything ^pn++++=Everything
I ^pp1a+pp1+++=I've
've ^vb+hv+aux++0=EXTRAWORD
read ^vprf+++xvbnx+=read
says ^vb+vpub+++=say's
they ^pp3a+pp3+++=they
were ^vbd+bed+aux++=were
warned ^vpsv++agls+xvbnx+=warned
to ^to+vcmp+++=to
leave ^vbi++++=leave
immediately ^rb+tm+++=immediately
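The CLAWS output above attaches each tag to its word with an underscore. A minimal Python sketch for reading that `word_TAG` layout follows; the function name `parse_tagged` is hypothetical, and the sketch assumes tokens are separated by whitespace and that the tag follows the last underscore.

```python
from collections import Counter

def parse_tagged(line):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST underscore in case the word itself contains one.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

claws = ("Everything_PN1 I_PPIS1 've_VH0 read_VVN says_VV0 they_PPHS2 "
         "were_VBDR warned_VVN to_TO leave_VVI immediately_RR")

pairs = parse_tagged(claws)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[:3])
print(tag_counts.most_common(1))
```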
31
Lemmatization
Lemma
  The citation or dictionary entry
  Run is the lemma
  It includes the words run, running, runs, ran
We often want the frequency of the lemma, not of a particular word like running
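The difference between word-form frequency and lemma frequency can be illustrated with a toy Python sketch. The lookup table `LEMMA_OF` is hand-built for this example only (a real lemmatizer would be needed for actual corpus work), and the sample sentence is invented.

```python
from collections import Counter

# Hand-built lookup for the slide's example lemma (illustrative only).
LEMMA_OF = {"run": "run", "running": "run", "runs": "run", "ran": "run"}

def lemma_frequencies(tokens, lemma_of):
    """Count lemmas; forms missing from the lookup count as themselves."""
    return Counter(lemma_of.get(tok.lower(), tok.lower()) for tok in tokens)

# Invented sample text, already split into tokens.
tokens = "She runs daily and ran yesterday ; running helps her run faster".split()

form_freqs = Counter(tok.lower() for tok in tokens)
lemma_freqs = lemma_frequencies(tokens, LEMMA_OF)

print("word form 'running':", form_freqs["running"])
print("lemma 'run':", lemma_freqs["run"])
```

Here the form running occurs once, while the lemma run occurs four times (runs, ran, running, run).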
32
Answer these questions about COCA
What external annotation does it contain?
What internal annotation does it contain?
33
Answer these questions about COCA
What external annotation does it contain?
  Text source
  Date of publication
What internal annotation does it contain?
  Lemmatization
  Part of speech
  Genre
34
Example: ‘s-’ versus ‘of-genitive’
‘the bird’s owner’ vs. ‘the owner of the bird’
Finding 1: “by 1991, the s-genitive had overtaken the of-genitive in frequency” (Leech et al., 2009)
Finding 2: of-genitive is almost 10 times more frequent than the s-genitive in present-day English (Longman Grammar)
Q: Are these findings contradictory?
35
Corpus study design—variationist
Two approaches to corpus linguistics: Variationist and Text-Linguistic (Biber, 2012)
Variationist: “has the goal of comparing linguistic variants: whether one or the other variant is preferred,” and identifying factors that predict which variant is used (Biber, 2012).
Statistics: Binomial/logistic regression; Linear discriminant analysis
Interpretation: When a choice can be made, variant X is preferred over variant Y, and factors A, B, and C play a role.
36
Variationist Analysis (Type A)
Unit of analysis is linguistic feature
Most studies do not take register into account (e.g., collocational studies)
Comparison of the proportion of use in a particular register
E.g., Szmrecsanyi & Hinrichs (2008): preference of s-genitive over of-genitive in speech; BUT: s-genitives overall more frequent in writing.
37
Corpus study design—text-linguistic
Text-Linguistic: “has the goal of providing a linguistic description of texts, by describing the density of grammatical features in texts” (Biber, 2012)
Statistics: T-test, ANOVA, Multiple regression, Factor analysis
Interpretation: Feature X is more frequent in context A than context B; or Feature X is more frequent than feature Y
38
Text-linguistic (Type B)
Comparison of actual frequency of use in a particular register
Unit of analysis is text
Normed rates of occurrence by text
Much more common for register studies
39
Text-linguistic (Type C)
Also compares frequencies of use in a particular register
Unit of analysis is subcorpus
Normed rates of occurrence for features across subcorpora
Cannot use inferential statistics (need to look at individual text to get a mean score for the register)
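The contrast between taking the text as the unit of analysis (Type B) and taking the subcorpus as the unit (Type C) can be sketched in Python. The three (raw count, total words) pairs below are invented for illustration and do not come from any real register.

```python
from statistics import mean, stdev

# Hypothetical register: (raw feature count, total words) for each text.
register_texts = [(50, 200), (100, 500), (30, 300)]

# Type B: norm each text separately, then summarize across texts.
per_text_rates = [count / words * 1000 for count, words in register_texts]
mean_rate = mean(per_text_rates)
sd_rate = stdev(per_text_rates)  # dispersion is available for statistics

# Type C: pool all texts into one subcorpus and norm once.
total_count = sum(count for count, _ in register_texts)
total_words = sum(words for _, words in register_texts)
pooled_rate = total_count / total_words * 1000

print("Type B per-text rates:", per_text_rates)
print(f"Type B mean = {mean_rate:.1f}, sd = {sd_rate:.1f}")
print(f"Type C pooled rate = {pooled_rate:.1f}")
```

The pooled figure is a single number with no variance, which is why a Type C design cannot feed inferential tests such as a t-test or ANOVA. The per-text mean and the pooled rate also differ whenever texts vary in length.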
40
Quantitative analysis
Coding/tagging features
Counts in text vs. subcorpus
Norming (raw count/total words * 1000)
Use appropriate statistical tests if applicable
41
Kinds of Corpora
Spoken language
General corpora (mainly written)
Bitext (two languages side-by-side)
Specialized
Children’s speech
L2 learner speech
Historical
42
General Corpora
Mainly written
British National Corpus (BNC)
  100 million words
  10% spoken
  25% fiction
  75% non-fiction
43
General Corpora
Corpus of Contemporary American English (COCA)
  450 million words (more added every year)
  Divided into registers
    Spoken
    Fiction
    Academic
44
General Corpora
International Corpus of English
  1 million words from each country
  60% spoken, 40% written
45
Historical Corpora
Helsinki Corpus
  English texts from 770-1700
Corpus of Historical American English (COHA)
  1810-present
46
Introduction to COHA
Corpus of Historical American English
End up verbing
Try and verb versus try to verb
Adjectives and nouns used in 2000s not before
Collocates of Muslim, liberal, Mormon ?
47
Raw Corpora
Not easily searchable
Not tagged
Project Gutenberg
  Pre-1928 books (copyright expired)
Online newspapers
Time Magazine
The internet
General Conference
48
Where can you get corpora?
Online
  BNC, COCA, COHA
Distributors (membership required)
  ELRA (based in Europe)
  Linguistic Data Consortium
    US based
    BYU has a membership
    Catalog
    Top 10 corpora