1
Corpus design II
See G Kennedy, Introduction to Corpus Linguistics, Ch. 2
CF Meyer, English Corpus Linguistics, Ch. 3
2
Issues in corpus design
General purpose vs specialized
Dynamic (monitor) vs static
Representativeness and balance
Size
Collection, permission
Text capture and markup
Storage and access
Organizations
3
Collecting samples of speech
Aim to collect natural samples
Cannot tape record surreptitiously
–Early corpora were done in this way, with permission sought afterwards
–Nowadays regarded as unethical, perhaps even illegal
“Observer’s paradox”: presence of recorder affects behaviour
Can be overcome (somewhat) by recording lots of material and sampling from the middle
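One way to picture "sampling from the middle" is the minimal sketch below. It assumes a recording has already been split into a list of utterances; the function name `middle_sample` and the `keep_fraction` parameter are illustrative only, and the slides do not say how much material should be kept.

```python
def middle_sample(utterances, keep_fraction=0.5):
    """Return the central portion of a recording's utterances.

    The opening stretch (speakers still aware of the recorder) and the
    closing stretch are discarded. keep_fraction is a hypothetical
    parameter: the share of the recording to keep.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    n = len(utterances)
    if n == 0:
        return []
    keep = max(1, round(n * keep_fraction))
    start = (n - keep) // 2  # equal amounts trimmed from both ends
    return utterances[start:start + keep]


# Invented data: ten numbered utterances standing in for a transcript.
recording = [f"utt{i}" for i in range(10)]
sample = middle_sample(recording, keep_fraction=0.4)
print(sample)
```

Trimming both ends equally is a design choice made here for simplicity; a project might equally discard only the opening minutes.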
4
Collecting written samples
Much easier to obtain, but beware important issue of permission
–Copyrighted material cannot be freely stored and distributed
–“Fair use” law allows use of up to 2,000 words for private research
–Corpus samples are often >2,000 words, and often distributed widely, sometimes for profit (or at least at a price to cover/recoup costs)
–Copyright laws may differ between countries
5
Permission
Can be quite onerous obtaining copyright permission
–Time consuming to wait for a reply to a request: do you go ahead and include it (ie start work on annotation and mark-up), or wait?
–Big risk, eg English-Norwegian Parallel Corpus contains copyrighted material and can only be used by U Oslo researchers, on site!
6
Text capture
Easiest if text is already machine-readable, though there may still be some issues with mark-up
–eg machine-readable text (MRT) obtained from publishers may have print formatting information embedded in it
–Text captured from an online source may have HTML mark-up
If text exists in printed form, scanning is a possibility
–OCR quality is generally very good, but text must still be carefully checked
–Issue of how to deal with printing effects such as hyphenation, headers and footers, footnotes
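The clean-up problems listed for captured text (HTML mark-up, line-end hyphenation, running headers and footers) can be sketched roughly as follows. This is an illustration using only the Python standard library, not a method from the slides; the function names are made up, and each step rests on a simplifying assumption stated in its comments.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(source):
    # Assumption: only the tags need removing. The contents of
    # <script> or <style> elements would be kept as text here.
    parser = _TextOnly()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)


def rejoin_hyphenated(text):
    # Assumption: every hyphen at a line end is a printing effect.
    # A real compound split at the hyphen (time-\nconsuming) would be
    # merged wrongly, so the output still needs checking by hand.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def drop_running_lines(pages, min_pages=2):
    # Treat a line that recurs verbatim on min_pages or more pages
    # as a running header or footer and remove it.
    counts = Counter()
    for page in pages:
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines() if counts[ln.strip()] < min_pages]
        cleaned.append("\n".join(kept))
    return cleaned


print(strip_html("<p>Corpus <b>design</b></p>"))
print(rejoin_hyphenated("an impor-\ntant issue"))
print(drop_running_lines(["Corpus Design\nfirst page", "Corpus Design\nsecond page"]))
```

None of these steps removes the need for the careful checking the slide calls for; they only reduce the amount of manual work.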
7
Text capture: re-keying
If OCR is not suitable/available
–eg hand-written texts, or medium is not flat
Re-keying is the only option
Highly expensive, time-consuming and error-prone
With manuscripts, there may be an issue of “keyboarder correction”
–Example of Learner English corpus of handwritten essays: important not to correct “errors”
–PhD student collected handwritten essays by (Arabic) learners of English for error analysis: first task was to “type them in”
8
Handwritten text
Are these capital Ts?
Is this crossed out?
Is this a v or a t?
Is this depend or depond? etc.
What does this say?
Compared to these?
9
Mark-up
Issues like this can be overcome by mark-up
Annotate the text to show explicitly where there is anything special
–Doubtful text
–Incorrect text (mark-up can show what was probably meant)
–Extraneous material
This is also an important issue in computer storage of ancient manuscripts
More detail later
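As a rough illustration of how such mark-up lets one text serve two purposes, the sketch below encodes a doubtful word, an incorrect word with its probable correction, and a deletion, then flattens the text either as written or as corrected. The element names `unclear`, `choice`, `sic`, `corr` and `del` are taken from the TEI Guidelines; the sample sentence, the omission of the TEI namespace, the `[?...?]` display notation and the `render` function are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Invented learner sentence; a real TEI file would declare the TEI namespace.
SAMPLE = (
    "<p>It <choice><sic>depond</sic><corr>depend</corr></choice> on "
    "<del>the</del> what <unclear>they</unclear> say.</p>"
)


def _flatten(elem, corrected):
    out = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            # Pick the corrected or the original reading.
            reading = child.find("corr" if corrected else "sic")
            if reading is not None:
                out.append(_flatten(reading, corrected))
        elif child.tag == "del":
            # Crossed-out text is kept only in the as-written view.
            if not corrected:
                out.append(_flatten(child, corrected))
        elif child.tag == "unclear":
            # Doubtful text stays visible but is flagged in both views.
            out.append("[?" + _flatten(child, corrected) + "?]")
        else:
            out.append(_flatten(child, corrected))
        out.append(child.tail or "")
    return "".join(out)


def render(elem, corrected):
    """Plain-text view of the element with whitespace collapsed."""
    return " ".join(_flatten(elem, corrected).split())


root = ET.fromstring(SAMPLE)
as_written = render(root, corrected=False)
normalised = render(root, corrected=True)
print(as_written)
print(normalised)
```

Because the original reading is never overwritten, an error analysis can still count "depond" while a search tool works on the corrected form.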
10
Speech corpora
“Corpus” usually means transcribed speech data
Many issues surrounding transcription of speech
Some of them similar to issues with handwriting
Others particular to speech
11
Transcribing speech
Not just a matter of typing in what was said, though this is of course a major element
–And may not be straightforward
–How much “correction” to do in transcription
–eg of hesitations, false starts, and other speech phenomena
Speech corpora usually encode information about paralinguistic and non-linguistic features
–Speed of delivery, pauses
–Loudness (whispering, shouting, singing)
–Coughs and other non-speech sounds which may be meaningful (grunt, tutting, hesitation noises)
–Even outside noises if relevant (eg passing siren, music, animals), as they might “contribute” to the discussion
12
Transcribing speech
Some conventions have emerged, eg …
Vocalized pauses: use phonetic symbols or conventional spelling – or uh, ah, erm, uhuh (!)
How to transcribe contractions like gotta, gonna, sorta, …
–Notice how some are completely conventional, eg can’t, won’t
How (and whether) to transcribe partially uttered words and repetitions
How to represent unintelligible speech
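Once a project has fixed its conventions, they can be checked or exploited mechanically. The sketch below tags each token of a transcribed line by category. The conventions it assumes are invented for illustration and are not a published standard: vocalized pauses use the conventional spellings listed above, a partially uttered word ends in a hyphen, and unintelligible speech is written as the placeholder token `<unclear>`.

```python
# Hypothetical project conventions, not taken from any real corpus.
FILLED_PAUSES = {"uh", "ah", "erm", "uhuh"}
INFORMAL_CONTRACTIONS = {"gotta", "gonna", "sorta"}
UNINTELLIGIBLE = "<unclear>"


def classify(token):
    """Return a coarse category label for one transcript token."""
    word = token.lower().strip(".,?!")
    if word == UNINTELLIGIBLE:
        return "unintelligible"
    if word in FILLED_PAUSES:
        return "pause"
    if word in INFORMAL_CONTRACTIONS:
        return "contraction"
    if len(word) > 1 and word.endswith("-"):
        return "partial"
    return "word"


def tag_line(line):
    """Split on whitespace and pair each token with its category."""
    return [(tok, classify(tok)) for tok in line.split()]


tagged = tag_line("erm I'm gonna wh- what <unclear> said")
for token, label in tagged:
    print(f"{token}\t{label}")
```

Conventional contractions such as can't and won't fall through to the plain "word" category here, mirroring the point that they need no special treatment.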
13
Storage
Where will the data be kept, and who will have access?
–If corpus is for public distribution, will it be by license, or freely available?
–If by license, distribute online (with password) or on CD?
Nowadays, fortunately, size is not such an issue though
–Big corpora have to be distributed on multiple CDs
–Downloading from a website can take hours
Note that it is not only the corpus data that must be distributed:
–Many corpora have associated software packages to facilitate exploration
–For speech corpora, original recordings may be available
14
Access
Efficient access to corpus data comes hand-in-hand with corpus structure
No good having a structured corpus if that structure can’t be used to delimit searches
Best if corpus is cross-indexed on all searchable criteria, ie all details that are encoded in headers
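Cross-indexing on header details can be pictured as an inverted index from (field, value) pairs to text identifiers, so that a search is delimited by intersecting sets rather than by scanning every text. The sketch below is a minimal version under that assumption; the header fields, values and text ids are invented for the example.

```python
from collections import defaultdict

# Invented header records: text id -> details encoded in the header.
HEADERS = {
    "t1": {"mode": "spoken", "genre": "conversation", "year": 1995},
    "t2": {"mode": "written", "genre": "news", "year": 1995},
    "t3": {"mode": "spoken", "genre": "lecture", "year": 2001},
}


def build_index(headers):
    """Map every (field, value) pair to the set of texts carrying it."""
    index = defaultdict(set)
    for text_id, fields in headers.items():
        for field, value in fields.items():
            index[(field, value)].add(text_id)
    return index


def search(index, **criteria):
    """Return ids of texts matching all criteria (empty set if none given)."""
    matches = [index.get((field, value), set()) for field, value in criteria.items()]
    return set.intersection(*matches) if matches else set()


index = build_index(HEADERS)
print(search(index, mode="spoken"))
print(search(index, mode="spoken", year=1995))
```

Any detail left out of the header cannot be indexed this way, which is why the header design and the search facilities need to be planned together.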
15
Organizations
Several organizations, often based in universities, have their own corpus material, and are also very active in issues surrounding Corpus Linguistics
“corpora” mailing list http://nora.hd.uib.no/corpora/
ELRA European Language Resources Association http://www.elra.info/
LDC Linguistic Data Consortium http://www.ldc.upenn.edu/
TEI Text Encoding Initiative http://www.tei-c.org/
16
ELRA
aims to make available the language resources for language engineering and to evaluate language engineering technologies
active in identification, distribution, collection, validation, standardisation, improvement
promotes the production of language resources
supports the infrastructure to perform evaluation campaigns
–Mainly through ELDA (Evaluation and Language Resources Distribution Agency) http://www.elda.org/
http://www.elra.info/
17
LDC
Based at U Penn
supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards
http://www.ldc.upenn.edu/
18
TEI
collectively develops and maintains a standard for the representation of texts in digital form
chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics
http://www.tei-c.org/index.xml