COGS Bilge Say1
Introduction to Corpora and Corpus Linguistics
COGS 523 - Lecture 2: Corpus Design Issues I
Related Readings
Readings (Course Pack):
- Tognini-Bonelli (2001) Corpus Issues. Ch 3
- McEnery et al. (2006) Units A7-A9, B1 (all appear to be one article in the course pack)
- Meyer (2002) Planning the Construction of a Corpus. Ch 2
Optional:
- Penn Treebank and Czech National Corpus articles from the Course Pack
- McEnery and Wilson (2001) Chs 2 and 3
Also available in the Sampson and McCarthy (2005) anthology:
- Biber (1993) Representativeness in Corpus Design. Literary and Linguistic Computing 8(4)
- Atkins, Clear and Ostler (1992) Corpus Design Criteria. Literary and Linguistic Computing 7(1)
What is a Corpus?
- Text / Speech / Video
- Annotation +
- Written / Spoken Language
- Turkish terms: Derlem (alt. Bütünce)
- Digital media
- Design Criteria
Stages of Corpus Building I (aka Corpus Compilation)
- Specifications and Design
- Develop Infrastructure and Find Funding!
- Sampling, Representativeness, Balance, Copyright Issues
- Piloting
- Planning Manpower
- Preparation of an Annotation Manual
- Acquisition or Development of Software for Annotation
- Technical Equipment Acquisition
- Design and Development of Corpus Query Tools
- Design of Change Management Processes
Stages of Corpus Building II
- Data Capture and Preprocessing
- Transcription, Tokenization, Error Correction
- Annotation (Markup)
- User Documentation
- All of these are accompanied by cyclic quality control processes and beta releases for user feedback
Representativeness and Balance
- Balance: the weightings between different sections of a corpus, according to its design purpose
- Representativeness: the findings from an idealized representative corpus should be generalizable to the whole language or a specified part of it
- What is the relationship between balance and representativeness?
- Is ideal representativeness possible?
Ways to Approach Sampling
- Elitist: based on literary and academic merit
- Popularity
- Typicality
- Availability
- Random (for example, sampling from a national library's holdings)
More about Sampling
- Choose a sampling frame: identify a specific population to make generalizations about
- For the spoken part of the BNC: the United Kingdom was divided into 12 regions of 30 sampling points, selected based on their demographic profile
- Gender balance: may be hard to get in some genres
- Who is native? ICE-US: had lived in the USA and spoken American English since years of age
- Education levels, age, dialect variation
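Drawing a fixed number of sampling points from each region of a sampling frame can be sketched as a stratified random sample. This is a minimal illustration only, not the procedure actually used for the BNC: the function name `stratified_sample`, the region labels, and the frame of 50 candidate points per region are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(frame, key, per_stratum, seed=0):
    """Draw up to `per_stratum` random items from each stratum of `frame`."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for item in frame:
        strata[item[key]].append(item)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = min(per_stratum, len(members))  # never ask for more than exist
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical sampling frame: 12 regions, 50 candidate points in each.
frame = [
    {"region": f"region_{r:02d}", "point": p}
    for r in range(1, 13)
    for p in range(50)
]
chosen = stratified_sample(frame, key="region", per_stratum=3)
print(len(chosen))  # 12 regions x 3 points each
```

The same function could stratify on gender, age band, or education level by changing `key`, provided the frame records that attribute for every candidate.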
Spoken Data Sampling
- Elicited: MapTask corpus
- Natural: self-recording
- Origins (immigrant status/nativeness, age, gender, geographic district, dialect)
- Dialogues vs monologues
Something in Between
- Netspeak: blogs, chatrooms, SMSs...
- Pre-prepared speeches...
Minimal Criteria for a Balanced General Corpus, Suggested by Sinclair (1991)
- Fiction vs nonfiction
- Book, journal vs newspaper
- Formal vs informal
- Control of age, gender, and origin of authors
Idealized vs Opportunistic Representativeness
- Measuring exposure (perception)
- Measuring production
- Purely frequency-based estimate: 90% conversation, 3% letters or notes, 7% press reportage, fiction, lectures, etc.
- Distinguishing genre, register, text type
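To see what the purely frequency-based estimate would mean in practice, the sketch below splits a word budget by those percentages. The 1,000,000-word target and the function name `allocate_words` are assumptions made for illustration; only the 90/3/7 split comes from the slide.

```python
def allocate_words(total_words, percentages):
    """Split a word budget across categories given integer percentages."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return {name: total_words * pct // 100 for name, pct in percentages.items()}

# Percentages from the frequency-based estimate; the total is hypothetical.
estimate = {
    "conversation": 90,
    "letters or notes": 3,
    "press reportage, fiction, lectures etc.": 7,
}
budget = allocate_words(1_000_000, estimate)
for name, words in budget.items():
    print(f"{name}: {words:,} words")
```

A corpus built this way would be dominated by conversation, which is one reason design criteria usually weight sections by purpose rather than by raw frequency of production.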
Size
- How many tokens are enough to discover the patterns of collocation, polysemy, morphology, syntax, discourse, etc.?
- millions words suggested by Sinclair in 1991 for a general, small, useful corpus
- 100 million words: CNC, BNC
- 100 million words core, several hundred more as periphery: ANC
Types vs Tokens
- Hapax legomena (Greek for "said only once")
- Almost half of the word types occur only once in the corpus
- 1 million word corpus: 100 word types occur more than 1000 times
- 100 million word corpus: 8000 word types can be expected to occur more than 1000 times, covering 95% of tokens. The remaining 5% is spread over about half a million word types.
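The type/token distinction and the notion of a hapax legomenon can be made concrete with a short count. This is a minimal sketch: the regular-expression tokenizer is a crude assumption (lowercased runs of letters, with an optional apostrophe part), and the function name `type_token_stats` is hypothetical.

```python
import re
from collections import Counter

def type_token_stats(text):
    """Return the token count, type count, and hapax legomena of a text."""
    # Crude tokenizer (an assumption): runs of letters, optional apostrophe part.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())
    counts = Counter(tokens)
    # Hapax legomena are the types that occur exactly once.
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return {"tokens": len(tokens), "types": len(counts), "hapaxes": hapaxes}

sample = "the cat sat on the mat and the dog sat on the log"
stats = type_token_stats(sample)
print(stats["tokens"], stats["types"], stats["hapaxes"])
# 13 8 ['and', 'cat', 'dog', 'log', 'mat']
```

Even in this toy sentence, five of the eight types occur only once, which mirrors on a small scale the observation that a large share of word types in a corpus are hapaxes.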
General Guidelines
- Prosody: words of spontaneous speech
- 1 million: verb form morphology, some syntactic processes, high-frequency vocabulary
- Cross-linguistic and scientific studies are rare!
- Always collect ~10% more than your aim. Despite best efforts at quality control, you may have to discard some data.
Individual Sample Size
- 2000 words (first generation corpora)
- Varied vs fixed: BNC varies, as much as
- Fixed size: what if something is too small or too big?
- Newspapers: "constructed week" concept
- words (Oostdijk, 88)
- words from texts from each genre (based on Biber's 1990 study of 10 linguistic features from 55 pairs of samples from LOB and LLC)
- May be an issue for copyright!
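The "constructed week" idea for newspapers is commonly understood as assembling one Monday, one Tuesday, and so on, each drawn from a different point in the sampling period, so that no single news week dominates. A minimal sketch under that reading follows; the function name `constructed_week` and the 2001 date range are hypothetical.

```python
import random
from datetime import date, timedelta

def constructed_week(start, end, seed=0):
    """Pick one random date per weekday (Mon..Sun) from start to end inclusive."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_weekday = {d: [] for d in range(7)}  # 0 = Monday ... 6 = Sunday
    day = start
    while day <= end:
        by_weekday[day.weekday()].append(day)
        day += timedelta(days=1)
    if any(not dates for dates in by_weekday.values()):
        raise ValueError("period must contain every day of the week")
    return [rng.choice(by_weekday[d]) for d in range(7)]

# Hypothetical sampling period: the calendar year 2001.
week = constructed_week(date(2001, 1, 1), date(2001, 12, 31))
for d in week:
    print(d.strftime("%A %Y-%m-%d"))
```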
(Meyer, 2002)
The composition of the British National Corpus (part of Table 2.1 in Meyer (2002))

Speech Type                Number of Texts    Number of Words    % of Spoken Corpus
Demographically Sampled                153          4,211,216                   41%
Educational                            144          1,265,318                   12%
Business                               136          1,321,844                   13%
Institutional                          241          1,345,694                   13%
Leisure                                187          1,459,419                   14%
Unclassified                            54            761,973                    7%
Total                                  915         10,365,464                  100%
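The percentage column of the spoken table can be recomputed from the word counts, which is a quick way to check the balance of any corpus design. The sketch below uses the figures from the table; the variable names are illustrative only.

```python
# Word counts per speech type, taken from the spoken part of the table.
spoken_words = {
    "Demographically Sampled": 4_211_216,
    "Educational": 1_265_318,
    "Business": 1_321_844,
    "Institutional": 1_345_694,
    "Leisure": 1_459_419,
    "Unclassified": 761_973,
}

total = sum(spoken_words.values())
# Share of each section, rounded to a whole percent as in the table.
shares = {name: round(100 * n / total) for name, n in spoken_words.items()}

print(f"Total: {total:,} words")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The recomputed total and shares agree with the table: 10,365,464 words, with 41% demographically sampled speech.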
The composition of the British National Corpus (part of Table 2.1 in Meyer (2002))

Writing Type        Number of Texts    Number of Words    % of Written Corpus
Imaginative                     625         19,664,309                    22%
Natural Science                 144          3,752,659                     4%
Applied Science                 364          7,369,290                     8%
Social Science                  510         13,290,441                    15%
World Affairs                   453         16,507,399                    18%
Commerce                        284          7,118,321                     8%
Arts                            259          7,523,846                     8%
Belief & Thought
Leisure                         374          9,990,080                    11%
Unclassified                     50          1,740,527                     2%
Total                          3209         89,740,554                    99%
Composition of the ICE (part of Table 2.2 in Meyer (2002))

Speech Type      Number of Texts    Number of Words    % of Spoken Corpus
Dialogues                    180            360,000                   59%
  Private                    100            200,000                   33%
  Public                      80            160,000                   26%
Monologues                   120            240,000                   40%
  Unscripted                  70            140,000                   23%
  Scripted                    50            100,000                   17%
Total                        300            600,000                   99%

Private: direct conversations, distance conversations
Public: class lessons, broadcast discussions, broadcast interviews, parliamentary debates, legal cross-examinations, business transactions
Unscripted: spontaneous commentaries, speeches, demonstrations, legal presentations
Scripted: broadcast news, broadcast talks, speeches (not broadcast)
Copyright Issues
Publishers
- Scientific and commercial aims conflict
- Check who holds the copyright
- Have written, signed agreements
- The status of some sources may be disputable: still have written and signed agreements
Individuals
- Obtain their informed consent and guarantee that they will not be identified
Collecting and Computerizing Samples
Written Text
- Scanning (introduces OCR errors)
- Electronic documents (different formats, different character sets)
- Uploading documents (see the ANC web site)
Spoken Text
- Inform participants of your aim and that there is no linguistically "correct" Turkish etc.
- Record longer than needed (2000 word sample minutes needed, collect 30 mins) so that you can cut off unnatural parts at the beginning
- Record in natural environments
- Invest in good equipment and good software
- Even so, 4 out of 10 samples may be unusable (Meyer, 2002)
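If roughly 4 out of 10 recordings turn out to be unusable, the number of recordings to plan follows from simple arithmetic. The sketch below is a planning aid under that assumption only; the function name `recordings_to_plan` and the target of 300 usable samples are hypothetical, and real loss rates will vary by project.

```python
import math
from fractions import Fraction

def recordings_to_plan(usable_needed, unusable_rate=Fraction(4, 10)):
    """Recordings to collect so that `usable_needed` are expected to survive."""
    if not 0 <= unusable_rate < 1:
        raise ValueError("unusable_rate must be in [0, 1)")
    # Exact fractions avoid floating-point error before rounding up.
    return math.ceil(usable_needed / (1 - unusable_rate))

# Hypothetical target: 300 usable spoken samples.
print(recordings_to_plan(300))  # 500
```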
Recording Information About Samples
- File headings: annotation schemes like TEI account for that
- Bibliographical info, ethnographic info, recording info, annotation info, etc.
- Directory structures and file names
- Usable: for the builders, for the users?
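The kinds of information listed above can be kept in one structured record per sample, from which a consistent file name is derived. The sketch below is not a TEI header: the class `SampleHeader`, its field names, the example values, and the naming pattern are all hypothetical and would need to be adapted to a project's own scheme.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SampleHeader:
    """Hypothetical per-sample metadata record (not a TEI header)."""
    sample_id: str
    medium: str           # "written" or "spoken"
    genre: str
    source: str           # bibliographical info
    speaker_info: str     # ethnographic info
    recording_info: str
    annotation_info: str

    def file_name(self):
        # A predictable pattern helps both corpus builders and users.
        return f"{self.medium}_{self.genre}_{self.sample_id}.txt"

header = SampleHeader(
    sample_id="S001",
    medium="spoken",
    genre="dialogue",
    source="self-recorded conversation",
    speaker_info="two adult speakers",
    recording_info="30 minutes, home environment",
    annotation_info="orthographic transcription only",
)
print(header.file_name())
print(json.dumps(asdict(header), indent=2))
```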
Lecture 3: Corpus Design II (Annotation)
Readings: Meyer (2002) Ch 4; Sampson and McCarthy (2005) Ch 39; Garside (1997) Chs 4, 5, 16
Inform me and Ayisigi (in writing) of your chosen corpus tool for the software review by 17 March. Check with Ayisigi beforehand that the tool suits the task criteria.