11.06.2016 COGS 523 - Bilge Say
Introduction to Corpora and Corpus Linguistics
COGS 523 - Lecture 2: Corpus Design Issues I

1 Introduction to Corpora and Corpus Linguistics
COGS 523 - Lecture 2: Corpus Design Issues I

2 Related Readings
Readings (Course Pack):
Tognini-Bonelli (2001) Corpus Issues. Ch 3.
McEnery et al. (2006) Units A7-A9, B1 (all appear to be one article in the course pack).
Meyer (2002) Planning the Construction of a Corpus. Ch 2.
Optional: Penn Treebank and Czech National Corpus articles from the Course Pack.
McEnery and Wilson (2001) Chs 2 and 3.
Also available in the Sampson and McCarthy (2005) anthology:
Biber (1993) Representativeness in Corpus Design. Literary and Linguistic Computing 8(4).
Atkins, Clear and Ostler (1992) Corpus Design Criteria. Literary and Linguistic Computing 7(1).

3 What is a Corpus?
Text/Speech/Video
Annotation + Written/Spoken Language
Turkish terms: Derlem (alt. Bütünce)
Digital media
Design Criteria

4 Stages of Corpus Building I (aka Corpus Compilation)
Specifications and Design
Develop Infrastructure and Find Funding!!!
Sampling, Representativeness, Balance, Copyright Issues
Piloting
Planning
Manpower
Preparation of an Annotation Manual
Acquisition or Development of Software for Annotation
Technical Equipment Acquisition
Design and Development of Corpus Query Tools
Design of Change Management Processes

5 Stages of Corpus Building II
Data Capture and Preprocessing
Transcription, Tokenization, Error Correction
Annotation (Markup)
User Documentation
All of these are accompanied by cyclic quality control processes and beta releases for user feedback.

6 Representativeness and Balance
Balance: the weightings between different sections of a corpus, according to its design purpose.
Representativeness: the findings from an idealized representative corpus should be generalizable to the whole language or a specified part of it.
What is the relationship between balance and representativeness?
Is ideal representativeness possible?

7 Ways to Approach Sampling
Elitist: based on literary and academic merit
Popularity
Typicalness
Availability
Random (for example, sampling from a national library's holdings)

8 More about Sampling
Choose a sampling frame: identify a specific population to make generalizations about.
For the BNC spoken part: the United Kingdom was divided into 12 regions, with 30 sampling points selected based on their demographic profile.
Gender balance: may be hard to achieve in some genres.
Who is native? ICE-US: had lived in the USA and spoken American English since 10-12 years of age.
Education levels, age, dialect variation
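
One way to picture drawing from a sampling frame is a stratified random draw: group the frame by a demographic attribute and pick the same number of units from each group. The sketch below is illustrative only; the frame, the region names, and the function `stratified_sample` are invented for this example and do not reproduce the BNC procedure.

```python
import random
from collections import defaultdict


def stratified_sample(frame, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items at random from each stratum of a frame."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    strata = defaultdict(list)
    for item in frame:
        strata[item[key]].append(item)
    sample = []
    for name in sorted(strata):  # sorted for a stable order of strata
        members = strata[name]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical frame: 12 speakers spread over three invented regions.
regions = ["north", "south", "west"]
frame = [{"id": i, "region": regions[i % 3]} for i in range(12)]
chosen = stratified_sample(frame, key="region", per_stratum=2)
```

The same function could stratify by age band, gender, or education level by changing `key`, provided the frame records that attribute.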

9 Spoken Data Sampling
Elicited: MapTask corpus
Natural: self-recording
Origins (immigrant status/nativeness, age, gender, geographic district, dialect)
Dialogues vs monologues

10 Something in Between
Netspeak: blogs, chatrooms, SMSs...
Pre-prepared speeches...

11 Minimal Criteria for a Balanced General Corpus, Suggested by Sinclair (1991)
Fiction vs nonfiction
Book, journal vs newspaper
Formal vs informal
Control of age, gender, and origin of authors

12 Idealized vs Opportunistic Representativeness
Measuring exposure (perception)
Measuring production
Purely frequency-based estimate: 90% conversation, 3% letters or notes, 7% press reportage, fiction, lectures etc.
Distinguishing genre, register, text type
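
To see what a purely frequency-based design would imply, the percentages above can be turned into word targets for a corpus of a given size. This is a minimal sketch: the 1,000,000-word total and the function name `allocate` are assumptions made for illustration, not part of the lecture.

```python
def allocate(total_words, shares_percent):
    """Split a word budget across categories given integer percentage shares."""
    if sum(shares_percent.values()) != 100:
        raise ValueError("shares must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return {name: total_words * pct // 100 for name, pct in shares_percent.items()}


# Shares taken from the frequency-based estimate on the slide.
shares = {
    "conversation": 90,
    "letters or notes": 3,
    "press reportage, fiction, lectures etc.": 7,
}
targets = allocate(1_000_000, shares)  # hypothetical corpus size
```

With these assumptions, conversation alone would take 900,000 of the 1,000,000 words, which shows why purely frequency-based weighting leaves little room for written genres.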

14 Size
How many tokens are enough to discover the patterns of collocation, polysemy, morphology, syntax, discourse etc.?
10-20 million words: suggested by Sinclair in 1991 for a small but useful general corpus
100 million words: CNC, BNC
100 million words core, several hundred million more as periphery: ANC

15 Types vs Tokens
Hapax legomena (Greek for "said only once"): almost half of the word types occur only once in the corpus.
1 million word corpus: 100 word types occur more than 1000 times.
100 million word corpus: 8000 word types can be expected to occur more than 1000 times, accounting for 95% of tokens. The remaining 5% of tokens: half a million word types.
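
The distinction between types, tokens, and hapax legomena can be made concrete with a small count. The sketch below assumes a very crude tokenizer (lowercasing and splitting on runs of word characters); a real corpus would need proper tokenization, and the function name `type_token_stats` is invented for this example.

```python
import re
from collections import Counter


def type_token_stats(text):
    """Count tokens, distinct types, and types that occur exactly once."""
    tokens = re.findall(r"\w+", text.lower())  # crude tokenizer, an assumption
    counts = Counter(tokens)
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return {"tokens": len(tokens), "types": len(counts), "hapaxes": hapaxes}


stats = type_token_stats("The cat saw the dog and the dog saw a bird")
# 11 tokens, 7 types; the hapaxes are 'a', 'and', 'bird', 'cat'
```

Even in this toy sentence, four of the seven types occur only once, which echoes the observation that a large share of types in any corpus are hapax legomena.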

16 General Guidelines
Prosody: 100,000 words of spontaneous speech
1 million words: verb form morphology, some syntactic processes, high-frequency vocabulary
Cross-linguistic and scientific studies are rare!
Always collect ~10% more than your aim. Despite best efforts at quality control, you may have to discard some data.

17 Individual Sample Size
2000 words (first-generation corpora)
Varied vs fixed: BNC samples vary, up to as much as 40,000 words.
Fixed size: what if something is too small or too big?
Newspapers: "constructed week" concept
20,000 words (Oostdijk, 1988)
2000-5000 words from 20-80 texts from each genre (based on Biber's 1990 study of 10 linguistic features from 55 pairs of samples from LOB and LLC)
May be an issue for copyright!
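
The "constructed week" idea for newspapers is usually understood as picking one random Monday, one random Tuesday, and so on from the sampling period, so that every day of the week is represented without taking a single calendar week. A minimal sketch under that reading follows; the date range and the function name `constructed_week` are assumptions for illustration.

```python
import random
from datetime import date, timedelta


def constructed_week(start, end, seed=0):
    """Pick one random date per weekday (Monday..Sunday) between start and end."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_weekday = {d: [] for d in range(7)}  # 0 = Monday ... 6 = Sunday
    day = start
    while day <= end:
        by_weekday[day.weekday()].append(day)
        day += timedelta(days=1)
    # Skip a weekday only if the period is too short to contain it.
    return [rng.choice(by_weekday[d]) for d in range(7) if by_weekday[d]]


# Hypothetical sampling period: one calendar year.
week = constructed_week(date(2016, 1, 1), date(2016, 12, 31))
```

Each returned date would then identify the newspaper issue to sample from.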

18 (Meyer, 2002)

20 The composition of the British National Corpus (part of Table 2.1 in Meyer (2002))
Speech Type | Number of Texts | Number of Words | % of Spoken Corpus
Demographically sampled | 153 | 4,211,216 | 41%
Educational | 144 | 1,265,318 | 12%
Business | 136 | 1,321,844 | 13%
Institutional | 241 | 1,345,694 | 13%
Leisure | 187 | 1,459,419 | 14%
Unclassified | 54 | 761,973 | 7%
Total | 915 | 10,365,464 | 100%
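
As a quick check on how the percentage column relates to the word counts, the shares of the spoken part can be recomputed from the figures above. The word counts are copied from the table; the variable names are only for this sketch.

```python
# Word counts for the spoken part, as listed in the table above.
spoken_words = {
    "Demographically sampled": 4_211_216,
    "Educational": 1_265_318,
    "Business": 1_321_844,
    "Institutional": 1_345_694,
    "Leisure": 1_459_419,
    "Unclassified": 761_973,
}

total_words = sum(spoken_words.values())
# Percentage of the spoken corpus per speech type, rounded to whole percent.
shares = {name: round(100 * n / total_words) for name, n in spoken_words.items()}
```

The recomputed total is 10,365,464 words, and the rounded shares match the table's percentage column.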

21 The composition of the British National Corpus (part of Table 2.1 in Meyer (2002))
Writing Type | Number of Texts | Number of Words | % of Written Corpus
Imaginative | 625 | 19,664,309 | 22%
Natural Science | 144 | 3,752,659 | 4%
Applied Science | 364 | 7,369,290 | 8%
Social Science | 510 | 13,290,441 | 15%
World Affairs | 453 | 16,507,399 | 18%
Commerce | 284 | 7,118,321 | 8%
Arts | 259 | 7,523,846 | 8%
Belief & Thought | 146 | 3,053,672 | 3%
Leisure | 374 | 9,990,080 | 11%
Unclassified | 50 | 1,740,527 | 2%
Total | 3,209 | 89,740,554 | 99%

22 Composition of the ICE (part of Table 2.2 in Meyer (2002))
Speech Type | Number of Texts | Number of Words | % of Spoken Corpus
Dialogues | 180 | 360,000 | 59%
  Private (direct conversations, distance conversations) | 100 | 200,000 | 33%
  Public (class lessons, broadcast discussions, broadcast interviews, parliamentary debates, legal cross-examinations, business transactions) | 80 | 160,000 | 26%
Monologues | 120 | 240,000 | 40%
  Unscripted (spontaneous commentaries, speeches, demonstrations, legal presentations) | 70 | 140,000 | 23%
  Scripted (broadcast news, broadcast talks, speeches (not broadcast)) | 50 | 100,000 | 17%
Total | 300 | 600,000 | 99%

23 Copyright Issues
Publishers
  Scientific and commercial aims conflict.
  Check who holds the copyright.
  Have written, signed agreements.
  The status of some sources might be disputable: still have written and signed agreements.
Individuals
  Obtain their informed consent and guarantee that they will not be identified.

24 Collecting and Computerizing Samples
Written Text
  Scanning (introduces OCR errors)
  Electronic documents (different formats, different character sets)
  Uploading documents (see the ANC web site)
Spoken Text
  Inform participants of your aim and that there is no linguistically "correct" Turkish etc.
  Record longer than needed (a 2000-word sample needs 10-20 minutes; collect 30 minutes) so that you can cut off unnatural parts at the beginning.
  Record in natural environments.
  Invest in good equipment and good software.
  Even so, 4 out of 10 samples may be unusable (Meyer, 2002).
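
The "4 out of 10 may be unusable" figure has a practical consequence for planning: to end up with a given number of usable recordings, noticeably more must be collected. A rough sketch of that arithmetic follows, treating the 4-in-10 loss as a fixed rate, which is a simplifying assumption; the function name `recordings_to_collect` is invented for this example.

```python
def recordings_to_collect(usable_needed, unusable_per_ten=4):
    """Estimate how many recordings to make so that enough remain usable."""
    usable_per_ten = 10 - unusable_per_ten
    if usable_per_ten <= 0:
        raise ValueError("at least one recording in ten must be usable")
    # Ceiling division with integers: round up to a whole recording.
    return -(-usable_needed * 10 // usable_per_ten)


needed_for_100 = recordings_to_collect(100)  # 167 recordings
needed_for_300 = recordings_to_collect(300)  # 500 recordings
```

Under this assumption, a spoken component of 300 texts would call for about 500 recording sessions.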

25 Recording Information About Samples
File headers: annotation schemes like TEI account for these.
Bibliographical info, ethnographic info, recording info, annotation info etc.
Directory structures and file names
Usable: for the builders, or for the users?
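
The kinds of information listed above can be pictured as one record per sample, from which a systematic file name is derived. The sketch below is not a TEI header; every field name, the naming pattern, and the example values are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass, asdict


@dataclass
class SampleRecord:
    """Hypothetical per-sample metadata record (not an actual TEI header)."""
    sample_id: str
    mode: str              # "written" or "spoken"
    genre: str
    bibliographic: str     # source of the text
    ethnographic: str      # information about the speaker or author
    recording: str         # how and where the sample was captured
    annotation: str        # which annotation layers exist

    def file_name(self):
        # Invented naming pattern: mode, genre, and id make files easy to sort.
        return f"{self.mode}_{self.genre}_{self.sample_id}.txt"


# Invented example values, for illustration only.
record = SampleRecord(
    sample_id="0001",
    mode="spoken",
    genre="dialogue",
    bibliographic="private conversation",
    ethnographic="female, 30-40, university graduate",
    recording="home, handheld recorder, 30 minutes",
    annotation="orthographic transcription only",
)
header = asdict(record)  # plain dictionary, ready to be written out
```

A naming pattern that suits the builders (for example, by collection batch) may differ from one that suits users (by genre or mode), which is the tension the last bullet points to.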

27 Lecture 3: Corpus Design II (Annotation)
Readings: Meyer (2002) Ch 4; Sampson and McCarthy (2005) Ch 39; Garside (1997) Chs 4, 5, 16
Inform me and Ayisigi (in writing) of your chosen corpus tool for the software review by 17 March. Precheck with Ayisigi that the tool suits the task criteria.

