Corpus linguistics: an introduction. ENG 447.



Key points
- Basic notions
- Historical development: two competing approaches
- Types of corpus
- Exploiting a corpus
- Resources

Basic notions
Corpus:
- A collection of naturally occurring language text, chosen to characterise a state or variety of language (Sinclair)
- A collection of linguistic data, either written text or a transcription of recorded data, which can be used as a starting point of linguistic description or as a means of verifying hypotheses about a language (Dictionary of Linguistics and Phonetics)

What is a corpus?
- A large body of evidence, typically composed of attested language use (McEnery)
- Usually a corpus is in machine-readable format and is ideally viewable and analysable through a single software package
- The word corpus comes from the Latin for "body"; the plural is corpora

“If it happens once, you don't know anything. If it happens twice, it suggests further investigation. If it happens three or more times, then you have something to write about!”

History
- We have to split the history into two periods: before Chomsky and after Chomsky
- Before Chomsky, methods similar to those of corpus linguistics were used (empiricism)

Early corpus linguistics
Before Chomsky:
- Computers were not available, so it was difficult to analyse large collections of text
- Studies of child language using diaries kept by parents
- Spelling conventions in a German corpus of 11 million words
- Foreign language pedagogy

Early corpus linguistics (II)
- All the work of early corpus linguistics was underpinned by two fundamental, yet flawed, assumptions:
  - The sentences of a natural language are finite.
  - The sentences of a natural language can be collected and enumerated.
- Most linguists saw the corpus as the only source of linguistic evidence in the formation of linguistic theories

Chomsky
- Between 1957 and 1965 Chomsky changed the direction of linguistics from empiricism towards rationalism
- “Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list” (Chomsky, 1962)
- Introspection started to be used instead

Problems with introspection
- Naturally occurring data is observable and verifiable by everyone.
- Introspective data is artificial.
- Human beings have only the vaguest notion of the frequency of a construct or a word.

The revival of corpus linguistics
- Research in corpus linguistics continued in small centres
- The hardware still imposed some restrictions; real development only started in the 1980s
- Fields like computational linguistics were not interested in using corpora

Fillmore’s description of the two approaches
The corpus linguist: "He has all the primary facts that he needs, in the form of a corpus of approximately one zillion running words, and he sees his job as that of deriving secondary facts from his primary facts. At the moment, he is busy determining the relative frequencies of the eleven parts of speech as the first word of a sentence versus the second word of a sentence."

The "armchair" (introspective) linguist: "He sits in a deep soft armchair, with his eyes closed and his hands clasped behind his head. Once in a while he opens his eyes, sits up abruptly shouting, ‘Wow, what a neat fact!’, grabs his pencil, and writes something down… having come still no closer to knowing what language is really like."

Goals of corpus linguistics
Chomskyan linguistics vs corpus linguistics:
- ‘Langue’ (competence) vs ‘Parole’ (performance)
- Ideal speaker/hearer vs complexity/variation
- Language = innate mental faculty vs language = social phenomenon
- Intuitive evidence vs empirical evidence
- Universals vs differences
- Grammar vs meaning

Types of corpora
- Written vs spoken
- General vs specialised (e.g. ESP, learner corpora)
- Monolingual vs multilingual (e.g. parallel, comparable)
- Synchronic vs diachronic; monitor
- Annotated vs unannotated

Written corpora

Specialised corpora

Other examples of available corpora

Ways to exploit a corpus
- Word (token) and type frequency lists
- N-grams
- Concordances
- Collocations/colligations
- Specially designed programs (especially when the corpus is annotated)

Frequency lists
- Lists that show the words which appear in a corpus and how often each occurs
- They provide a survey of the corpus
- A frequency list becomes more meaningful when compared with other lists
- They remove a word from its context
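A frequency list of this kind can be sketched in a few lines of Python. This is only an illustration: the `tokenize` helper is a deliberately naive assumption (lowercase, letters and apostrophes only), and the sample sentence is made up for the example rather than taken from a real corpus.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (naive: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first; ties are broken alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Made-up sample text, standing in for a corpus.
sample = "On the stroke of midnight the gong sounded. On the stroke of noon the crowd left."
tokens = tokenize(sample)
freq = frequency_list(tokens)

# Tokens are running words; types are distinct word forms.
print(len(tokens), "tokens,", len(freq), "types")
for word, count in freq[:5]:
    print(f"{count:>3}  {word}")
```

Note how the list shows that "the" is frequent but says nothing about where it occurs, which is the "removed from context" limitation mentioned above.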

N-grams
- Groups of N words which appear in sequence in the text
- They are presented using frequency lists
- A good way to identify expressions that recur in, or are specific to, a corpus
- They provide limited context for the words
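As a rough sketch of the idea, n-grams can be extracted with a sliding window over the token list and then counted like any other frequency list. The token sequence below is invented for illustration, and whitespace splitting is an assumption that a real study would replace with a proper tokenizer.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in text order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up token sequence, split on whitespace for simplicity.
tokens = "on the stroke of midnight and on the stroke of noon".split()

# Count the 3-grams and show the most frequent ones.
trigram_counts = Counter(ngrams(tokens, 3))
for gram, count in trigram_counts.most_common(3):
    print(count, " ".join(gram))
```

Recurring sequences such as "on the stroke" surface immediately, while each n-gram still carries only n words of context.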

Concordances
- Show words in the context in which they appear
- Usually obtained using special programs which allow the lists of concordances to be manipulated
- KWIC (Key Word In Context) is the most common format
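The KWIC format can be approximated with a short Python function. This is a minimal sketch, not how any particular concordancer works: the function name `kwic`, the `width` parameter (context size in tokens) and the sample sentence are all assumptions made for the example.

```python
def kwic(tokens, keyword, width=4):
    """Return one line per occurrence of keyword, with up to `width` tokens of
    context on each side and the keyword aligned in the same column."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    # Right-justify the left context so every keyword starts at the same column.
    pad = max((len(left) for left, _, _ in rows), default=0)
    return [f"{left.rjust(pad)}  {kw}  {right}" for left, kw, right in rows]

# Made-up sentence, split on whitespace for simplicity.
text = "Promptly on the stroke of six the gong sounded and on the stroke of seven it stopped"
lines = kwic(text.split(), "stroke", width=3)
for line in lines:
    print(line)
```

Sorting the rows by the word to the right or left of the keyword, which dedicated concordancers offer, would be a natural extension of this sketch.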

Example of concordance output (from MonoConc)

Langue - Parole
     famous boots. On the stroke of full time the Stoke
          the lead on the stroke of half-time with a goal
  Smith sin-binned on the stroke of half-time, added a
clinched their win on the stroke of lunch after resuming
chase by declaring on the stroke of lunch. With a lead
  expectant crowd, on the stroke of midday. The bird
  hour began not upon the stroke of midnight but upon the
 of midnight but upon the stroke of noon. There was,
booked in advance. On the stroke of seven, a gong summons
          Promptly on the stroke of six o'clock, the chooks
    from Edinburgh on the stroke of the Millennium.
Parole: syntagmatic. Langue: paradigmatic.

Collocations
- Collocation = the occurrence of two or more words within a short space of each other in a text
- The collocates are extracted using a window to the left and right of a specified word
- Can be used to further analyse the context of a word
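The window-based extraction described above might look like the following in Python. It is a sketch under stated assumptions: the function name `collocates`, the `window` parameter and the sample tokens are invented for illustration, and the function returns raw co-occurrence counts only, without the association measures (such as mutual information) that collocation studies typically add on top.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens to the left or right of
    each occurrence of the node word."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Made-up token sequence, split on whitespace for simplicity.
tokens = "on the stroke of midnight and on the stroke of noon".split()
result = collocates(tokens, "stroke", window=2)
for word, count in result.most_common(4):
    print(count, word)
```

A wider window captures looser associations at the cost of more noise, so the window size is a choice the analyst has to justify.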

What can we do with a corpus? Two broad approaches
- Corpus-based approaches: hypotheses are checked against a corpus
- Corpus-driven approaches: hypotheses are drawn from the corpus

- Posit a hypothesis: -le is a separate morpheme for the concept of future.
- Concordance: find all occurrences of “le” in the word forms of the corpus.
- Test the hypothesis
- Test a new hypothesis

Fields where corpora are used
- Lexicography (to design dictionaries)
- Language studies (relations between languages, differences between genres, evolution of the language)
- Computational linguistics (training and testing methods)
- Language teaching (learner corpora)
- Cultural studies, psycholinguistics

Web as a corpus
- The Web can be a very useful source of texts
- The Web is very helpful for languages other than English
- Quite often there is no control over the language being investigated, so filtering (if possible) is necessary

Existing corpora
- Brown Corpus / LOB Corpus
- Bank of English
- Wall Street Journal, Penn Treebank, BNC, ANC, ICE, WBE, Reuters Corpus
- Canadian Hansard: English-French parallel corpus
- York-Helsinki Parsed Corpus of Old English Poetry
- TIGER corpus (German)
- CORIS/CODIS (contemporary written Italian)
- MULTEXT: 1984 and The Republic in many languages

References
- Karin Aijmer and Bengt Altenberg (1991) English Corpus Linguistics, Longman
- Douglas Biber, Susan Conrad and Randi Reppen (1998) Corpus Linguistics, Cambridge University Press
- Graeme D. Kennedy (1998) An Introduction to Corpus Linguistics, Longman
- Tony McEnery and Andrew Wilson (1996) Corpus Linguistics, Edinburgh University Press

Online resources
- Corpus Linguistics Online
- Corpus Linguistics and English Education and Teaching
- ConCapp rd.rar
- BNC
- CLEC
- CHILDES