Corpus Linguistics and Corpora
Corpus
  Corpus (plural: corpora): a collection of linguistic data, either compiled as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language - for example, to determine how the usage of a particular sound, word, or syntactic construction varies.
Corpus Linguistics
  Corpus linguistics deals with the principles and practice of using corpora in language study. A computer corpus is a large body of machine-readable texts.
  (cf. Crystal, David. An Encyclopedic Dictionary of Language and Languages. Oxford, 85)
Corpus
  CORPUS (13c: from Latin corpus body. The plural is usually corpora) (1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse…
  (cf. McArthur, Tom 1992 "Corpus", The Oxford Companion to the English Language. Oxford)
Chomsky 1957
  "Any natural corpus will be skewed. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [of language based on the corpus] would be no more than a mere list."
  Syntactic Structures. The Hague, 159
Fillmore 1992
  "I have two main observations to make. The first is that I don't think there can be any corpora, however large, that contain information about all of the areas of English lexicon and grammar that I want to explore; all that I have seen are inadequate. The second observation is that every corpus that I've had a chance to examine, however small, has taught me facts that I couldn't imagine finding out about in any other way."
  In "Corpus linguistics" or "Computer-aided armchair linguistics", in: Svartvik, Jan (ed.) Directions in Corpus Linguistics. Berlin/New York, 35.
Types of corpus
  Monolingual corpora - in which the texts are all in the same language.
  Parallel and/or aligned corpora - in which originals and translations are aligned so that both texts are synchronized to appear on the screen together and it is easy to see how the translator has translated the original.
  Comparable corpora - in which a selection of original texts has been made in two or more languages dealing with the same subject or genre.
  Concurrent corpora - a term used to describe texts taken from newspapers on the same subject on approximately the same dates.
  Specialized corpora - texts on specialized subjects. The principal use for these corpora is the extraction of terminology and complementary explanatory material - definitions, explanations, semantic relations, etc.
  'Do-it-yourself' corpora - a term coined by those of us using small specialized corpora for the purpose of teaching translation or language.
  Disposable corpora - the same as 'do-it-yourself' corpora, but taking into account that such corpora need to be disposed of after use so that their users do not get into trouble with copyright restrictions.
How do you search a corpus?
  Concordancing
  Sentence level – see BNC
  COMPARA – parallel concordance
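A concordance lists every occurrence of a search word together with a little of the text on either side (keyword in context, KWIC). The following is a minimal sketch of the idea in Python, not the method used by the BNC or COMPARA interfaces; the function name `kwic`, the `width` parameter and the sample text are assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Take up to `width` characters on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, used only to show the output layout.
sample = ("Every corpus that I have examined has taught me something. "
          "A corpus, however small, is useful.")

for line in kwic(sample, "corpus"):
    print(line)
```

Aligning the keyword in a fixed column is what makes recurring patterns to its left and right easy to spot by eye.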
The Survey of English Usage
  60s - Randolph Quirk et al. launched the Survey of English Usage (SEU)
    "with the aim of collecting a large and stylistically varied corpus as the basis for a systematic description of spoken and written English"
  Related corpora
    Brown, Lancaster-Oslo/Bergen (LOB) and London-Lund Corpus of Spoken English
  See ICAME - International Computer Archive of Modern and Medieval English at the Norwegian Computing Centre for the Humanities
  Today at University of London
  ICE - the International Corpus of English
  Download the sampler of this corpus fully tagged and analysed from usage/ice-gb/sampler/form.htm
Quality versus quantity
  A small but fully analyzed and tagged corpus - e.g. early corpora and ICE (1 million words)
  British National Corpus – 100 million words
  Other corpora
    Bank of English - millions of words
  The Internet
Corpora, lexicography & terminology
  Lexicography BEFORE corpora
    Emphasis on etymology
    Complex definitions
    Usage based on intuitions of lexicographers
  Terminology BEFORE corpora
    Standardization > one word = one concept, rigid definitions
    Paper dictionaries/glossaries
  Lexicography & terminology AFTER corpora
    Emphasis on modern usage in context
    Simple definitions
    Usage based on evidence in texts
    Emphasis on establishing REAL rather than IDEAL usage
COBUILD project
  Begun in 1969
  Collins, the well-known dictionary publisher, and the University of Birmingham – led by John Sinclair
  A pioneering project
  Objective > to collect texts for a corpus of contemporary texts from which to extract information on modern English usage
  Work proceeded during the 70s and 80s - see Sinclair (Ed.) 1987
COBUILD > Bank of English
  Present site for COBUILD > Bank of English: uk/docs/about.htm
British National Corpus (BNC) - original
  Oxford University Computing Service
  This is completely free – but you only get up to 50 results
Brigham Young University (BYU)
  Note:
    Corpus of American English
    BNC
    TIME corpus
    Corpus de Português
    Corpus de Español
  PLEASE NOTE: You will need to create a username and password to use this – but it costs nothing
BNC – CQP version
  Lancaster University: gnup/
  PLEASE NOTE: You will need to create a username and password to use this – but it costs nothing
Other large monolingual corpora
  Portuguese > CETEMPUBLICO
  Spanish > Real Academia
  German > Mannheimer corpus
Using corpora to study syntax
  For example:
    whether certain nouns occur more often in the singular than the plural
    how pronouns are used in different languages
    which verbs favour certain forms of tense, aspect or mood
    how adjectives combine with nouns
    where adjuncts occur in sentences
    etc.
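The first question above (singular versus plural) comes down to counting word forms. A rough sketch, assuming the forms of interest are listed by hand rather than found by a lemmatizer or tagger; `form_counts` and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

def form_counts(text, forms):
    """Count how often each of the listed word forms occurs in `text`."""
    # Lower-case the text and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tok for tok in tokens if tok in forms)
    # Report every requested form, including those that never occur.
    return {form: counts[form] for form in forms}

# Invented sample text; a real study would read a corpus file instead.
sample = ("The data were collected in one corpus. Two corpora were compared, "
          "and each corpus was tagged.")

print(form_counts(sample, ["corpus", "corpora"]))
```

On a tagged corpus the same comparison would normally be made on part-of-speech tags (singular noun versus plural noun) rather than on spelled forms.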
Monolingual corpora
  General language corpora useful for studying:
    Words in context
    Problems of COLLOCATION
    Relative usage of synonyms
    Syntactic structures
    Sentence structure
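Collocation can be explored by counting which words turn up near a chosen node word. The sketch below uses raw co-occurrence counts in a fixed window, which is only a starting point; real studies usually add an association measure such as mutual information. The function name `collocates`, the `span` parameter and the sample sentences are assumptions for illustration.

```python
import re
from collections import Counter

def collocates(text, node, span=2):
    """Count words occurring within `span` tokens on either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Simplification: the window ignores sentence boundaries.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text, not drawn from any real corpus.
sample = ("She made a strong argument. He drinks strong tea. "
          "They prefer strong tea to weak tea.")

for word, freq in collocates(sample, "strong").most_common(3):
    print(word, freq)
```

Running the same count for a near-synonym of the node word and comparing the two lists is one simple way to look at the relative usage of synonyms.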
Parallel Corpora - multilingual
  European Commission - Multilingual
  EUROPARL - Multilingual
  ELDA
Parallel Corpora
  COMPARA EN/PT
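In an aligned parallel corpus, a search on the original returns each hit together with its translation. A minimal sketch of that idea, assuming the corpus is already aligned sentence by sentence; the sentence pairs are hand-made examples and are not taken from COMPARA, and `parallel_search` is a hypothetical helper.

```python
import re

# Hypothetical hand-made EN/PT sentence pairs (not taken from COMPARA).
aligned = [
    ("The house is old.", "A casa é velha."),
    ("She bought a house.", "Ela comprou uma casa."),
    ("The dog sleeps.", "O cão dorme."),
]

def parallel_search(pairs, word):
    """Return the aligned pairs whose source side contains `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]

for src, tgt in parallel_search(aligned, "house"):
    print(f"{src}  |  {tgt}")
```

Showing source and target side by side is what lets the reader see how the translator rendered each occurrence.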
Corpógrafo - LINGUATECA
  An on-line suite of tools we have developed for:
    Construction of corpora
    Semi-automatic extraction of terminology
    Construction of terminology databases
    Terminology & corpora research
    Research into information retrieval and knowledge engineering
CORPÓGRAFO
  FREE!
  On-line!
  For individual research
Bibliography
  ICAME site
  BIBER, Douglas, CONRAD, Susan & REPPEN, R. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
  BIBER, Douglas, JOHANSSON, Stig, LEECH, Geoffrey, CONRAD, Susan & FINEGAN, Edward. Longman Grammar of Spoken and Written English. Harlow: Pearson Education Ltd.
  HOEY, Michael. Patterns of Lexis in Text. Oxford: Oxford University Press.
  MCENERY, Tony & WILSON, Andrew. Corpus Linguistics. 2nd Edition. Edinburgh: Edinburgh University Press.
  OAKES, Michael P. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
  SINCLAIR, John (ed.) Looking Up - An account of the COBUILD project in lexical computing. Collins COBUILD. London and Glasgow: Collins ELT.
  STUBBS, Michael. Text and Corpus Analysis: Computer-assisted Studies of Language and Culture. Oxford: Blackwell Publications Ltd.