Corpus Linguistics and Corpora

Corpus
Corpus, plural corpora: a collection of linguistic data, compiled either as written texts or as a transcription of recorded speech.
The main purpose of a corpus is to verify a hypothesis about language, for example to determine how the usage of a particular sound, word, or syntactic construction varies.

Corpus Linguistics
Corpus linguistics deals with the principles and practice of using corpora in language study. A computer corpus is a large body of machine-readable texts.
(cf. Crystal, David. An Encyclopedic Dictionary of Language and Languages. Oxford, 85)

Corpus
CORPUS (13c: from Latin corpus, body. The plural is usually corpora) (1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse...
(cf. McArthur, Tom 1992. "Corpus", The Oxford Companion to the English Language. Oxford)

Chomsky 1957
"Any natural corpus will be skewed. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [of language based on the corpus] would be no more than a mere list."
Syntactic Structures. The Hague, 159

Fillmore 1992
"I have two main observations to make. The first is that I don't think there can be any corpora, however large, that contain information about all of the areas of English lexicon and grammar that I want to explore; all that I have seen are inadequate.

Fillmore 1992
The second observation is that every corpus that I've had a chance to examine, however small, has taught me facts that I couldn't imagine finding out about in any other way."
In: "Corpus linguistics" or "Computer-aided armchair linguistics", in: Svartvik, Jan (ed.) Directions in Corpus Linguistics. Berlin/New York, 35.

Types of corpus
Monolingual corpora - in which the texts are all in the same language.
Parallel and/or aligned corpora - in which originals and translations are aligned so that both texts appear on the screen together and it is easy to see how the translator has translated the original.

Types of corpus
Comparable corpora - in which a selection of original texts has been made in two or more languages dealing with the same subject or genre.
Concurrent corpora - a term used to describe texts taken from newspapers on the same subject on approximately the same dates.

Types of corpus
Specialized corpora - texts on specialized subjects. The principal use for these corpora is the extraction of terminology and complementary explanatory material: definitions, explanations, semantic relations, etc.

Types of corpus
'Do-it-yourself' corpora - a term coined by those of us using small specialized corpora for the purpose of teaching translation or language.
Disposable corpora - the same as 'do-it-yourself' corpora, but taking into account that such corpora need to be disposed of after use so that their users do not get into trouble with copyright restrictions.

How do you search a corpus?
Concordancing
Sentence level - see BNC
COMPARA - parallel concordance
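To make the idea of concordancing concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not the search engine behind the BNC or COMPARA; the function name `kwic`, the `width` parameter, and the sample text are assumptions made up for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    lines = []
    # \b keeps "corpus" from matching inside longer words; case is ignored.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Hypothetical sample text, not taken from any real corpus.
sample = ("A corpus is a collection of linguistic data. "
          "Every corpus I have examined has taught me something.")

for line in kwic(sample, "corpus"):
    print(line)
```

Each output line shows one occurrence of the search word with a fixed amount of text on either side, which is the layout a concordancer typically presents.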

The Survey of English Usage
1960s - Randolph Quirk et al. launched the Survey of English Usage (SEU) "with the aim of collecting a large and stylistically varied corpus as the basis for a systematic description of spoken and written English"

The Survey of English Usage
Brown, Lancaster-Oslo/Bergen (LOB) and London-Lund Corpus of Spoken English
See ICAME - International Computer Archive of Modern and Medieval English at the Norwegian Computing Centre for the Humanities

The Survey of English Usage
Today at the University of London
ICE - the International Corpus of English
Download the sampler of this corpus, fully tagged and analysed, from usage/ice-gb/sampler/form.htm

Quality versus quantity
Small but fully analyzed and tagged - e.g. early corpora and ICE (1 million words)
British National Corpus - 100 million words
Other corpora
  Bank of English million
  The Internet

Corpora, lexicography & terminology
Lexicography BEFORE corpora
  Emphasis on etymology
  Complex definitions
  Usage based on intuitions of lexicographers
Terminology BEFORE corpora
  Standardization > one word = one concept, rigid definitions
  Paper dictionaries/glossaries

Corpora, lexicography & terminology
Lexicography & terminology AFTER corpora
  Emphasis on modern usage in context
  Simple definitions
  Usage based on evidence in texts
  Emphasis on establishing REAL rather than IDEAL usage

COBUILD project
Begun in 1969
Collins, the well-known dictionary publisher, and the University of Birmingham - led by John Sinclair
A pioneering project
Objective > to collect texts for a corpus of contemporary texts from which to extract information on modern English usage
Work proceeded during the 70s and 80s - see Sinclair (ed.) 1987

COBUILD > Bank of English
Present site for COBUILD > Bank of English: uk/docs/about.htm

British National Corpus (BNC) - original
Oxford University Computing Service
This is completely free, but you only get up to 50 results

Brigham Young University (BYU)
Corpus of American English
BNC
TIME corpus
Corpus de Português
Corpus de Español

Brigham Young University (BYU)
PLEASE NOTE: You will need to create a username and password to use this, but it costs nothing

BNC - CQP version
Lancaster University gnup/
PLEASE NOTE: You will need to create a username and password to use this, but it costs nothing

Other large monolingual corpora
Portuguese > CETEMPUBLICO
Spanish > Real Academia
German > Mannheimer corpus

Using corpora to study syntax
For example:
  whether certain nouns occur more often in the singular than the plural
  how pronouns are used in different languages
  which verbs favour certain forms of tense, aspect or mood
  how adjectives combine with nouns
  where adjuncts occur in sentences
  etc.
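The first question in this list, whether a noun is more frequent in the singular or the plural, can be approximated by a simple frequency count. The sketch below is an illustration under stated assumptions: the function name `count_forms` and the sample sentences are hypothetical, and the text is treated as plain untagged words. A real study would normally work on a part-of-speech tagged corpus so that, for instance, noun and verb uses of the same form are kept apart.

```python
import re
from collections import Counter

def count_forms(text, singular, plural):
    """Count occurrences of a noun's singular and plural forms in raw text."""
    # Lowercase and split into alphabetic tokens; no tagging or lemmatisation.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return counts[singular], counts[plural]

# Hypothetical sample text, not taken from any real corpus.
sample = ("The corpus is small. Two corpora were compared. "
          "Each corpus was tagged, and the corpus was searched.")

sing, plur = count_forms(sample, "corpus", "corpora")
print(f"corpus: {sing}, corpora: {plur}")
```

The two forms are passed in explicitly so that irregular plurals such as corpus/corpora need no special handling.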

Monolingual corpora
General language corpora useful for studying:
  Words in context
  Problems of COLLOCATION
  Relative usage of synonyms
  Syntactic structures
  Sentence structure
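Collocation and the relative usage of synonyms can both be explored by counting which words appear next to a chosen node word. The following is a rough sketch, not a statistical collocation measure: the function name `collocates`, the `window` parameter, and the sample sentences are assumptions for illustration, and raw co-occurrence counts stand in for the association scores a real tool would compute.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count words occurring within `window` tokens of each hit for `node`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            # Neighbours to the left and right of the node, excluding the node.
            # Note: this simple window ignores sentence boundaries.
            neighbours = tokens[lo:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Hypothetical sample text, not taken from any real corpus.
sample = ("She drank strong tea. He prefers strong coffee. "
          "They served strong tea with powerful arguments.")

top = collocates(sample, "strong", window=1)
print(top.most_common(3))
```

Running the same count for a near-synonym such as "powerful" and comparing the two lists is one simple way to see how synonyms differ in the company they keep.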

Parallel Corpora - multilingual
European Commission - multilingual
EUROPARL - multilingual
ELDA

Parallel Corpora
COMPARA EN/PT
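A parallel concordance over an aligned English/Portuguese corpus can be pictured as a search on one side that returns both halves of every matching pair. The sketch below does not reproduce how COMPARA works: the sentence pairs, the list name `aligned`, and the function name `parallel_concordance` are invented for illustration, and sentence-level alignment is assumed to have been done already.

```python
import re

# Hypothetical aligned sentence pairs (English original, Portuguese translation).
aligned = [
    ("The house is old.", "A casa é velha."),
    ("She bought a house.", "Ela comprou uma casa."),
    ("The book is new.", "O livro é novo."),
]

def parallel_concordance(pairs, keyword):
    """Return every (source, translation) pair whose source contains keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]

for src, tgt in parallel_concordance(aligned, "house"):
    print(f"EN: {src}")
    print(f"PT: {tgt}")
    print()
```

Because each hit comes back with its translation, it is easy to see how the same source word has been rendered in different contexts.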

Corpógrafo - LINGUATECA
An on-line suite of tools we have developed for:
  Construction of corpora
  Semi-automatic extraction of terminology
  Construction of terminology databases
  Terminology & corpora research
  Research into information retrieval and knowledge engineering

CORPÓGRAFO
FREE!
On-line!
For individual research

Bibliography
ICAME site
BIBER, Douglas, CONRAD, Susan & REPPEN, R. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
BIBER, Douglas, JOHANSSON, Stig, LEECH, Geoffrey, CONRAD, Susan & FINEGAN, Edward. Longman Grammar of Spoken and Written English. Harlow: Pearson Education Ltd.

Bibliography
HOEY, Michael. Patterns of Lexis in Text. Oxford: Oxford University Press.
MCENERY, Tony & WILSON, Andrew. Corpus Linguistics. 2nd Edition. Edinburgh: Edinburgh University Press.
OAKES, Michael P. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
SINCLAIR, John (ed.). Looking Up - An account of the COBUILD project in lexical computing. Collins COBUILD. London and Glasgow: Collins ELT.
STUBBS, Michael. Text and Corpus Analysis: Computer-assisted Studies of Language and Culture. Oxford: Blackwell Publications Ltd.