LELA 30922 English Corpus Linguistics Harold Somers Professor of Language Engineering Office: Lamb 1.15
Syllabus
Assessment A practical project in which students will use the BNC (or other approved corpus material) to investigate some question of English language usage. Suggestion: base your project (more or less closely) on some existing study. The project write-up will include relevant background material, plus the results and discussion of a corpus-based analysis. In other words: summarize (and criticize) the chosen study, then do your own version, and compare the results.
Reading matter Main recommendations: Kennedy, G.D. (1998) An introduction to corpus linguistics. London: Longman. McEnery, T. & A. Wilson (2001, 2nd ed) Corpus linguistics. Edinburgh: Edinburgh University Press. Meyer, C. (2002) English corpus linguistics: An introduction. Cambridge: Cambridge University Press. Lots of other books, focussing on particular aspects. Do not ignore journals (e.g. International Journal of Corpus Linguistics) and specialist conferences, especially when considering the practical assignment. See http://tinyurl.com/32abhb for a list of resources available at UoM.
What is a corpus? Corpus (pl. corpora) = ‘body’ Collection of written text or transcribed speech Usually but not necessarily purposefully collected Usually but not necessarily structured Usually but not necessarily annotated (Usually stored on and accessible via computer) Corpus ~ text archive
Computers and corpus linguistics Historically, manual analysis of large bodies of text (esp. in literary and biblical studies): error-prone, time-consuming, not verifiable. Computers have introduced: reliability, accuracy and replicability; increased speed and capacity, which mean you can do more on a grander scale; new tools, which mean you can do things you might not have thought of doing.
What is corpus linguistics? Not a branch of linguistics, like socio~, psycho~, … Not a theory of linguistics A set of tools and methods (and a philosophy) to support linguistic investigation across all branches of the subject
Evidence in linguistics Real attested usage as linguistic evidence Contrasts with introspective approach previously typical Relates to the competence~performance (langue~parole) distinction Corpus linguists often more interested in trends than rules (probabilities rather than certainties) Famous stories of corpus evidence contradicting widely-held assumptions about language use.
Activities in corpus linguistics Design and compilation of corpora Development of tools for corpus analysis Descriptive linguists using corpora to analyze lexical and grammatical behaviour of language, eg for lexicography Exploiting corpora in applied linguistics – language teaching, translation.
History of Corpus Linguistics www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/history.html Textual study has always included an element of counting and cataloguing, despite impracticalities: notably concordances of Shakespeare, the Bible, etc. The arrival of computers in the 1950s of course changed everything.
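To make the idea of a concordance concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is only an illustration, not a tool used on this course. The function name `kwic`, the `width` parameter, the simple regular-expression tokenizer and the sample sentence are all assumptions made for this example.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens either side of each hit."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes only.
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample text, chosen only for illustration.
sample = "To be, or not to be, that is the question."
for line in kwic(sample, "be", width=2):
    print(line)
```

Compiling such a listing by hand for every word of a large text is the kind of work that took early concordance makers years. Real corpus tools add sorting, alignment and much better tokenization.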
Brown corpus First modern computer-readable corpus W.N. Francis and H. Kucera, Brown University, Providence, RI one million words of American English texts printed in 1961 sampled from 15 different text categories used as model for other corpora, including …
LOB corpus Compiled by researchers in Lancaster, Oslo and Bergen. One million words of British English texts printed in 1961, sampled from the same 15 text categories as the Brown corpus. All texts ≤ 2,000 words long. Kolhapur corpus of Indian English compiled in 1978 to the same specification.
Chomsky’s criticisms Chomsky’s ideas drove linguists away from empiricism (data) towards rationalism (introspection) Chomsky switched focus onto abstract models of language competence He was especially scathing about corpus-based approaches Based on mistaken view that corpus linguists confused finiteness of data with finiteness of language See McEnery & Wilson, chapter 1
The London-Lund Corpus of Spoken English (LLC) First corpus of transcribed spoken language Part of Survey of Spoken English at Lund University under the direction of J. Svartvik 500,000 words of spoken British English recorded from 1953 to 1987 different categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration
COBUILD A 1m-word corpus is too small for many applications. 1980: Collins instigated the collection of a 20m-word corpus to support lexicographers writing the new Collins COBUILD dictionary (COBUILD = Collins Birmingham University International Language Database; John Sinclair). Now expanded to the Bank of English corpus, 320m words and growing. www.collins.co.uk/Corpus/CorpusSearch.aspx www.collins.co.uk/books.aspx?group=153
BNC (1995) http://www.natcorp.ox.ac.uk/ 100m word collection of written and spoken text from 1975-93 (already dated in some respects!) Carefully designed and balanced Corpus is closed (finite, synchronic) All text tagged to high quality Lots of tools available for exploration
etc. Many other corpus projects now underway, sometimes modelled on the BNC or other well-known corpora. Various national projects. Specialized corpora: historical texts, learner English, international English, translated English, spoken dialogues for certain domains. When widely used, they become a kind of benchmark, eg the Wall Street Journal corpus (treebank). This can have pros and cons.