1
Corpus Linguistics and Corpora
2
Corpus
Corpus (plural: corpora): a collection of linguistic data, compiled either as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language, for example to determine how the usage of a particular sound, word, or syntactic construction varies.
3
Corpus Linguistics
Corpus linguistics deals with the principles and practice of using corpora in language study. A computer corpus is a large body of machine-readable texts.
(cf. Crystal, David. 1992. An Encyclopedic Dictionary of Language and Languages. Oxford, 85)
4
Corpus
CORPUS (13c: from Latin corpus body. The plural is usually corpora) (1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse...
(cf. McArthur, Tom. 1992. "Corpus", The Oxford Companion to the English Language. Oxford, 265-266)
5
Chomsky 1957
"Any natural corpus will be skewed. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [of language based on the corpus] would be no more than a mere list."
Syntactic Structures. The Hague, 159
6
Fillmore 1992
"I have two main observations to make.
The first is that I don't think there can be any corpora, however large, that contain information about all of the areas of English lexicon and grammar that I want to explore; all that I have seen are inadequate.
7
Fillmore 1992
The second observation is that every corpus that I've had a chance to examine, however small, has taught me facts that I couldn't imagine finding out about in any other way."
In: '"Corpus linguistics" or "Computer-aided armchair linguistics"', in: Svartvik, Jan (ed.) Directions in Corpus Linguistics. Berlin/New York, 35.
8
Types of corpus
Monolingual corpora - in which the texts are all in the same language
Parallel and/or aligned corpora - in which originals and translations are aligned so that both texts are synchronized to appear on the screen together, making it easy to see how the translator has translated the original
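To make the idea of an aligned corpus concrete, here is a minimal sketch in Python. It assumes the alignment is already done and stored as a list of (original, translation) pairs; the three EN/PT sentence pairs and the function name parallel_search are invented for illustration and do not come from COMPARA or any real corpus.

```python
import re

# Invented EN/PT sentence pairs, aligned one-to-one (illustration only).
ALIGNED_PAIRS = [
    ("The house is old.", "A casa é velha."),
    ("She bought a book.", "Ela comprou um livro."),
    ("The book is on the table.", "O livro está em cima da mesa."),
]


def parallel_search(pairs, word):
    """Return the aligned pairs whose source sentence contains `word` as a whole token."""
    target = word.lower()
    return [
        (src, tgt)
        for src, tgt in pairs
        if target in re.findall(r"\w+", src.lower())
    ]


# Show each matching original next to its translation.
hits = parallel_search(ALIGNED_PAIRS, "book")
for src, tgt in hits:
    print(f"{src}  |  {tgt}")
```

Real parallel corpora also have to handle one-to-many alignments and search in either direction; this sketch only covers the simplest one-to-one case.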
9
Types of corpus
Comparable corpora - in which a selection of original texts has been made in two or more languages dealing with the same subject or genre
Concurrent corpora - a term used to describe texts taken from newspapers on the same subject on approximately the same dates
10
Types of corpus
Specialized corpora - texts on specialized subjects. The principal use for these corpora is the extraction of terminology and complementary explanatory material: definitions, explanations, semantic relations, etc.
11
Types of corpus
'Do-it-yourself' corpora - a term coined by those of us using small specialized corpora for the purpose of teaching translation or language
Disposable corpora - the same as 'do-it-yourself' corpora, but taking into account that such corpora need to be disposed of after use so that their users do not get into trouble with copyright restrictions
12
How do you search a corpus?
Concordancing
Sentence level – see BNC http://www.natcorp.ox.ac.uk
COMPARA – parallel concordance http://www.linguateca.pt/COMPARA
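As a rough illustration of what a concordancer does, the sketch below prints every occurrence of a search word with a fixed window of context on each side (a keyword-in-context, or KWIC, display). It is a minimal Python example and not how the BNC or COMPARA interfaces are implemented; the function name kwic, the width parameter, and the sample text (adapted from the definition slide) are assumptions made for the example.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


sample = (
    "A corpus is a collection of linguistic data. "
    "The main purpose of a corpus is to verify a hypothesis about language."
)
concordance = kwic(sample, "corpus")
for line in concordance:
    print(line)
```

Aligning the keyword in a single column is what makes recurring patterns to its left and right easy to spot by eye.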
13
The Survey of English Usage
1960s - Randolph Quirk et al. launched the Survey of English Usage (SEU) "with the aim of collecting a large and stylistically varied corpus as the basis for a systematic description of spoken and written English"
14
The Survey of English Usage
Brown, Lancaster-Oslo/Bergen (LOB) and London-Lund Corpus of Spoken English
See ICAME - International Computer Archive of Modern and Medieval English at the Norwegian Computing Centre for the Humanities at http://gandalf.aksis.uib.no/icame.html
15
The Survey of English Usage
Today at University of London at http://www.ucl.ac.uk/english-usage/
ICE - the International Corpus of English
Download the sampler of this corpus, fully tagged and analysed, from http://www.ucl.ac.uk/english-usage/ice-gb/sampler/form.htm
16
Quality versus quantity
A small but fully analyzed and tagged corpus - e.g. early corpora and ICE (1 million words)
British National Corpus – 100 million words
Other corpora
  Bank of English - 450 million words
The Internet
17
Corpora, lexicography & terminology
Lexicography BEFORE corpora
  Emphasis on etymology
  Complex definitions
  Usage based on intuitions of lexicographers
Terminology BEFORE corpora
  Standardization > one word = one concept, rigid definitions
  Paper dictionaries/glossaries
18
Corpora, lexicography & terminology
Lexicography & terminology AFTER corpora
  Emphasis on modern usage in context
  Simple definitions
  Usage based on evidence in texts
  Emphasis on establishing REAL rather than IDEAL usage
19
COBUILD project
Begun in 1969
Collins, the well-known dictionary publisher, and the University of Birmingham – led by John Sinclair
A pioneering project
Objective > to collect texts for a corpus of contemporary texts from which to extract information on modern English usage
Work proceeded during the 70s and 80s - see Sinclair (Ed.) 1987
20
COBUILD > Bank of English
Present site for COBUILD > Bank of English http://www.titania.bham.ac.uk/docs/about.htm
21
British National Corpus (BNC) - original
Oxford University Computing Service at http://www.natcorp.ox.ac.uk/
This is completely free, but you only get up to 50 results
22
Brigham Young University (BYU)
http://corpus.byu.edu/
Note:
  Corpus of American English
  BNC
  TIME corpus
  Corpus de Português
  Corpus de Español
23
Brigham Young University (BYU)
PLEASE NOTE: You will need to create a username and password to use this – but it costs nothing
24
BNC – CQP version
Lancaster University http://bncweb.lancs.ac.uk/bncwebSignup/
PLEASE NOTE: You will need to create a username and password to use this – but it costs nothing
25
Other large monolingual corpora
Portuguese > CETEMPUBLICO http://www.linguateca.pt/cetempublico/
Spanish > Real Academia
German > Mannheimer corpus
26
Using corpora to study syntax
For example:
  whether certain nouns occur more often in the singular than plural
  how pronouns are used in different languages
  which verbs favour certain forms of tense, aspect or mood
  how adjectives combine with nouns
  where adjuncts occur in sentences
  etc.
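The first question above (singular versus plural frequency) can be sketched as a simple token count. This is a minimal Python example under strong assumptions: the two forms are supplied by hand, the text is plain and untagged, and the sample sentence is invented. A tagged corpus would let you query by lemma and part of speech instead.

```python
import re
from collections import Counter


def form_counts(text, singular, plural):
    """Count how often the given singular and plural forms occur as whole tokens."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return tokens[singular], tokens[plural]


# Invented sample text (illustration only).
sample = (
    "One word can matter. Words change. "
    "Some words fade, and a word returns. Words, words, words."
)
singular_n, plural_n = form_counts(sample, "word", "words")
print(f"word: {singular_n}, words: {plural_n}")
```

On a real corpus the raw counts would normally be compared across many nouns, or normalized per million words, before drawing any conclusion.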
27
Monolingual corpora
General language corpora useful for studying:
  Words in context
  Problems of COLLOCATION
  Relative usage of synonyms
  Syntactic structures
  Sentence structure
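Collocation and the relative usage of synonyms can both be approached by counting which words occur next to each other. The sketch below counts adjacent word pairs (bigrams) in Python; the sample text and the function name bigram_counts are invented for illustration. It ignores sentence boundaries and uses raw frequency only, whereas real collocation studies usually apply an association measure over a much larger corpus.

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count adjacent word pairs in `text` (lowercased, punctuation ignored)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Pair each token with the one that follows it.
    return Counter(zip(tokens, tokens[1:]))


# Invented sample text (illustration only).
sample = "Strong tea and strong coffee. He made strong tea, not powerful tea."
counts = bigram_counts(sample)
print("strong tea:", counts[("strong", "tea")])
print("powerful tea:", counts[("powerful", "tea")])
```

Comparing the counts for two near-synonyms before the same noun is one simple way to see which combination speakers actually prefer.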
28
Parallel Corpora - multilingual
European Commission - Multilingual http://ec.europa.eu/
EUROPARL - Multilingual http://www.statmt.org/europarl/
ELDA http://www.elda.org/sommaire.php
29
Parallel Corpora
COMPARA EN/PT http://www.linguateca.pt/compara
30
Corpógrafo - LINGUATECA
An on-line suite of tools we have developed for:
  Construction of corpora
  Semi-automatic extraction of terminology
  Construction of terminology databases
  Terminology & corpora research
  Research into information retrieval and knowledge engineering
31
CORPÓGRAFO
http://www.linguateca.pt/corpografo
FREE!
On-line!
For individual research
32
Bibliography
ICAME site at http://helmer.aksis.uib.no/icame.html
BIBER, Douglas, CONRAD, Susan & REPPEN, Randi. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
BIBER, Douglas, JOHANSSON, Stig, LEECH, Geoffrey, CONRAD, Susan & FINEGAN, Edward. 1999. Longman Grammar of Spoken and Written English. Harlow: Pearson Education Ltd.
33
Bibliography
HOEY, Michael. 1991. Patterns of Lexis in Text. Oxford: Oxford University Press. ISBN 0 19 437142 5.
MCENERY, Tony & WILSON, Andrew. 2001. Corpus Linguistics. 2nd Edition. Edinburgh: Edinburgh University Press.
OAKES, Michael P. 1998. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press. ISBN 0 7486 0817 6.
SINCLAIR, John (ed.). 1987. Looking Up - An account of the COBUILD project in lexical computing. Collins COBUILD. London and Glasgow: Collins ELT.
STUBBS, Michael. 1996. Text and Corpus Analysis: Computer-assisted Studies of Language and Culture. Oxford: Blackwell Publications Ltd. ISBN 0-631-19512-2 (pbk).