Corpora in lexical studies

Corpora in lexical studies Corpus Linguistics Richard Xiao lancsxiaoz@googlemail.com

Aims of this session
Lecture
  Corpus-based lexicography
  Collocation and colligation
Lab session
  Collocation using WST
  Collocation using AntConc
  Collocation and colligation in Xaira
  Using the BNCweb to study collocation

Corpus revolution in lexicographic and lexical studies
  Lexicographic and lexical studies are the greatest beneficiaries of corpora
  Corpora have “revolutionised” dictionary making and reference publishing
  It is now nearly unheard of for new dictionaries, or new editions of old dictionaries, published from the 1990s onwards not to claim to be based on corpus data

Why use corpora in dictionary making?
  Machine-readable corpora allow dictionary makers to extract all authentic, typical examples of the usage of a lexical item from a large body of text in a few seconds
  Corpora allow dictionary makers to select entries based on frequency information
  Corpora can readily provide frequency information and collocation information for readers
  Textual (e.g. register, genre and domain) and sociolinguistic (e.g. user gender and age) information encoded in corpora allows lexicographers to give a more accurate description of the usage of a lexical item

Why use corpora in dictionary making? (continued)
  Corpus annotations such as part-of-speech tagging and word sense disambiguation also enable a more sensible grouping of polysemous words and homographs
  A “monitor corpus” allows lexicographers to track subtle changes in the meaning and usage of a lexical item so as to keep their dictionaries up to date
  Corpus evidence can complement or refute the intuitions of individual lexicographers, which are not always reliable because intuitions can be biased

Five emphases
Changes brought about by corpora to dictionaries and other reference books: five “emphases” (Hunston 2002)
  an emphasis on frequency
  an emphasis on collocation and phraseology
  an emphasis on variation
  an emphasis on lexis in grammar
  an emphasis on authenticity

Top 1000 written / spoken words Authentic examples
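A frequency band such as "top 1000 words" rests on a ranked word list. The sketch below shows one simple way to build such a list; the function name `frequency_list`, the crude regular-expression tokeniser and the sample sentence are all illustrative assumptions, not the method used by any particular dictionary.

```python
import re
from collections import Counter

def frequency_list(text, top=5):
    """Return the `top` most frequent word forms in `text` with their counts."""
    # Crude tokenisation (an assumption): lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)

sample = "The cat sat on the mat. The dog sat too."
ranked = frequency_list(sample, top=2)
print(ranked)
```

A real frequency list would be built from a large, balanced corpus and would usually count lemmas rather than raw word forms.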

Corpus-based learner dictionaries
First ‘fully corpus-based’ dictionary
  Collins Cobuild English Dictionary (1987)
Some corpus-based learner dictionaries
  Longman Dictionary of Contemporary English (3rd edition)
  Oxford Advanced Learner’s Dictionary (OALD, 5th edition)
  Cambridge International Dictionary of English (1st edition)

Frequency dictionaries

Collocation
Collocation is among the linguistic concepts which have benefited most from advances in corpus linguistics
What is collocation?
  strong tea, powerful car (Halliday 1976)
  “collocations of a given word are statements of the habitual or customary places of that word…the company that words keep” (Firth 1968:181-2)
  “One of the meanings of night is its collocability with dark” (Firth 1957:196)
  “a frequent co-occurrence of two lexical items in the language” (Greenbaum 1974:82)
  expel a school child vs. cashier an army officer
  “I propose to bring forward as a technical term, meaning by collocation, and apply the test of collocability” (Firth 1957: 194)

Meaning by collocation
  “There is frequently so high a degree of interdependence between lexemes which tend to occur in texts in collocation with one another that their potentiality for collocation is reasonably described as being part of their meaning” (Lyons 1977: 613)
  A complete description of the meaning of a word would have to include the other word or words that collocate with it
  “You shall know a word by the company it keeps!” (Firth 1968:179)
  Collocation is part of the word meaning

Two types of collocation
Coherence collocation vs. neighbourhood (horizontal) collocation (Scott 1998)
Coherence collocation
  Collocates associated with a word (e.g. letter: stamp, post office)
Neighbourhood collocation
  Words which do actually co-occur with the word (e.g. letter: my, this, a, etc.)

Coherence collocation
  “A cover term for the cohesion that results from the co-occurrence of lexical items that are in some way or other typically associated with one another, because they tend to occur in similar environments.” (Halliday & Hasan 1976:287)
  candle – flame – flicker
  hair – comb – curl – wave
  sky – sunshine – cloud – rain
  Difficult to measure using a statistical formula

Neighbourhood collocation
Collocation in corpus linguistics
Structure of collocation: the collocation window
  “We may use the term node to refer to an item whose collocations we are studying, and we may then define a span as the number of lexical items on each side of a node that we consider relevant to that node. Items in the environment set by the span we will call collocates.” (Sinclair 1966:415)
Casual vs. significant collocation
  Significant collocation: collocation that occurs more frequently than would be expected (in a statistical sense) on the basis of the frequencies of the individual items
n.b. Neighbourhood (horizontal) collocations can include some coherence collocations
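The node, span and collocate terminology can be made concrete with a short sketch. It assumes a corpus that is already tokenised into a flat list of words; the function name `collocates` and the sample text are illustrative and not taken from any of the tools used in the lab session.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens on each side of every occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` tokens before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` tokens after the node
        counts.update(left + right)
    return counts

text = "on the stroke of midnight the clock struck and on the stroke of noon it struck again"
tokens = text.split()
window_counts = collocates(tokens, "stroke", span=2)
print(window_counts.most_common())
```

These raw counts only identify casual collocates. Deciding which of them are significant requires comparing the counts with what chance would predict, which is what the association measures discussed later do.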

Intuition vs. collocation
  Greenbaum (1974): “people disagree on collocations” in introspection-based elicitation experiments
  Although “collocation can be observed informally” on the basis of intuitions, “it is more reliable to measure it statistically, and for this a corpus is essential” (Hunston 2002: 68)
  Intuition is often a poor guide to collocation “because each of us has only a partial knowledge of the language, we have prejudices and preferences, our memory is weak, our imagination is powerful (so we can conceive of possible contexts for the most implausible utterances), and we tend to notice unusual words or structures but often overlook ordinary ones” (Krishnamurthy 2000: 32-33)
  Collocation can be measured on the basis of co-occurrence statistics (MI, z, t, LL etc.); more discussion to follow

Collocation is syntagmatic
Langue (language system): paradigmatic
Parole (utterance): syntagmatic
Concordance lines for “the stroke of”:
     famous boots. On the stroke of full time the Stoke
          the lead on the stroke of half-time with a goal
  Smith sin-binned on the stroke of half-time, added a
clinched their win on the stroke of lunch after resuming
chase by declaring on the stroke of lunch. <p> With a lead
  expectant crowd, on the stroke of midday. The bird
  hour began not upon the stroke of midnight but upon the
 of midnight but upon the stroke of noon. There was,
booked in advance. On the stroke of seven, a gong summons
          Promptly on the stroke of six 'clock, the chooks
    from Edinburgh on the stroke of the Millennium.

Collocation vs. colligation
Collocation
  Relationship between a lexical item and other lexical items
  Relationship between words at the lexical level
  E.g. very collocates with good
Colligation
  Relationship between a lexical item and a grammatical category
  Relationship between words at the grammatical level
  E.g. very colligates with ADJ
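The contrast can be illustrated by counting, for the same node, both the words and the part-of-speech tags that follow it. The sketch assumes a corpus that is already POS-tagged; the tiny hand-tagged sample, its tag labels and the function name `right_neighbours` are invented for illustration.

```python
from collections import Counter

# Hand-tagged toy data (hypothetical): (word, part-of-speech) pairs.
tagged = [
    ("a", "DET"), ("very", "ADV"), ("good", "ADJ"), ("idea", "NOUN"),
    ("very", "ADV"), ("good", "ADJ"), ("indeed", "ADV"),
    ("she", "PRON"), ("sang", "VERB"), ("very", "ADV"), ("sweetly", "ADV"),
    ("a", "DET"), ("very", "ADV"), ("sweet", "ADJ"), ("smell", "NOUN"),
]

def right_neighbours(tagged_tokens, node):
    """Count the words (collocation) and the tags (colligation) immediately after `node`."""
    words, tags = Counter(), Counter()
    for (word, _), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if word == node:
            words[next_word] += 1   # lexical level
            tags[next_tag] += 1     # grammatical level
    return words, tags

words_after, tags_after = right_neighbours(tagged, "very")
print(words_after.most_common())
print(tags_after.most_common())
```

In this toy sample "very" is followed by "good" twice (a collocation) and by an adjective three times out of four (a colligation).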

WST Collocate settings Concord tab

WST collocates Strength of relationship is displayed as 0.000 if it hasn't yet been computed

Strength of collocation relationship A wordlist is required

Highlight and double click…

…to see the selected collocate

Collocates in AntConc

Collocation in Xaira

Colligation in Xaira

Exploring collocation with BNCweb http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php

Search for “sweet”

Concordances of “sweet” KWIC view

KWIC view
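A KWIC (key word in context) display puts every hit of the search word in the middle of a line with a fixed amount of context on either side. The following is a minimal sketch of the idea, not the BNCweb implementation; the function name `kwic`, the context width and the sample sentence are assumptions.

```python
def kwic(tokens, node, width=3):
    """Return one formatted line per occurrence of `node`, with `width` tokens of context."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != node.lower():
            continue
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        # Right-align the left context so that the node words line up vertically.
        lines.append(f"{left:>25}  [{token}]  {right}")
    return lines

sample_tokens = "the sweet smell of success and a sweet little tune".split()
kwic_lines = kwic(sample_tokens, "sweet")
for line in kwic_lines:
    print(line)
```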

Dropdown menu: collocations

Collocation setting

Collocation database (default settings)

Adjusting settings

Noun collocates of “sweet” Click on a word to see its collocation info

Collocation info of “sweet” + “smell” Click on a number to see concordances of collocates at that position

Concordances of “smell” at R2

Collocation statistics

Rank by frequency
Frequent words crowd into the top of the collocate list: are they genuine collocates?
n.b. “Sweet Maxwell” is a personal name.

Rank by the t test Also focusing on frequent words?
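One widely used formulation of the t-score compares the observed co-occurrence count O with the count E expected by chance: t = (O - E) / sqrt(O). The sketch below assumes E = f(node) x f(collocate) x window / corpus size; the exact formula in BNCweb or WordSmith may differ, and all counts are invented.

```python
import math

def expected(f_node, f_coll, corpus_size, window=1):
    """Co-occurrences expected by chance if the two words were independent."""
    return f_node * f_coll * window / corpus_size

def t_score(observed, f_node, f_coll, corpus_size, window=1):
    """t-score: (O - E) / sqrt(O)."""
    e = expected(f_node, f_coll, corpus_size, window)
    return (observed - e) / math.sqrt(observed)

# Invented counts for a 1,000,000-token corpus.
frequent_pair = t_score(100, 1000, 2000, 1_000_000)  # collocate that is common overall
rare_pair = t_score(4, 1000, 4, 1_000_000)           # collocate seen only 4 times, always with the node
print(round(frequent_pair, 2), round(rare_pair, 2))
```

Because the denominator is the square root of O, the score cannot grow large unless the pair co-occurs often, which is one reason the t test tends to push frequent words up the list.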

Rank by MI
Infrequent words at the top of the list: how useful are they (especially to English learners)?
n.b. “Sweet Afton” is a phrase from the lyrics expressing the beauty of the River Afton; “sweet nothings” means romantic and loving talk between sweethearts; “sweet marjoram” is the name of a herb used in cooking.
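A common formulation of the mutual information score is MI = log2(O / E), where O is the observed co-occurrence count and E the count expected by chance. The sketch assumes E = f(node) x f(collocate) x window / corpus size; tools may compute E differently, and the counts are invented.

```python
import math

def expected(f_node, f_coll, corpus_size, window=1):
    """Co-occurrences expected by chance if the two words were independent."""
    return f_node * f_coll * window / corpus_size

def mutual_information(observed, f_node, f_coll, corpus_size, window=1):
    """MI score: log2(O / E)."""
    return math.log2(observed / expected(f_node, f_coll, corpus_size, window))

# Invented counts for a 1,000,000-token corpus.
mi_frequent = mutual_information(100, 1000, 2000, 1_000_000)  # common collocate
mi_rare = mutual_information(4, 1000, 4, 1_000_000)           # collocate seen only 4 times
print(round(mi_frequent, 2), round(mi_rare, 2))
```

The rare pair scores higher even though it occurs only four times: E is tiny for a rare collocate, so the ratio O / E becomes very large. This is the behaviour that puts items like "afton" and "marjoram" at the top of an MI-ranked list.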

Rank by the z score Like MI, the z score also over-estimates infrequent items (e.g. nothings, afton, marjoram)
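A simplified formulation of the z score is z = (O - E) / sqrt(E), with O the observed and E the expected co-occurrence count. The sketch assumes E = f(node) x f(collocate) x window / corpus size and ignores the small variance correction some definitions include; the counts are invented.

```python
import math

def expected(f_node, f_coll, corpus_size, window=1):
    """Co-occurrences expected by chance if the two words were independent."""
    return f_node * f_coll * window / corpus_size

def z_score(observed, f_node, f_coll, corpus_size, window=1):
    """Simplified z score: (O - E) / sqrt(E)."""
    e = expected(f_node, f_coll, corpus_size, window)
    return (observed - e) / math.sqrt(e)

# Invented counts for a 1,000,000-token corpus.
z_frequent = z_score(100, 2000, 2000, 1_000_000)  # common collocate
z_rare = z_score(4, 1000, 4, 1_000_000)           # collocate seen only 4 times
print(round(z_frequent, 2), round(z_rare, 2))
```

Dividing by the square root of E inflates the score when E is very small, so a collocate seen only four times can outrank one seen a hundred times.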

Log-likelihood test
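The log-likelihood statistic (G2) is usually computed from a 2 x 2 contingency table of observed counts and the counts expected under independence. The sketch below is one common formulation; it treats each corpus position as a single trial, which is a simplifying assumption, and the function name and counts are invented.

```python
import math

def log_likelihood(cooc, f_node, f_coll, corpus_size):
    """G2 = 2 * sum(O_ij * ln(O_ij / E_ij)) over the four cells of the 2 x 2 table."""
    observed = [
        [cooc, f_node - cooc],                                  # node with / without collocate
        [f_coll - cooc, corpus_size - f_node - f_coll + cooc],  # other words with / without collocate
    ]
    row_totals = [f_node, corpus_size - f_node]
    col_totals = [f_coll, corpus_size - f_coll]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / corpus_size
            if o > 0:  # a cell with zero observations contributes nothing
                g2 += o * math.log(o / e)
    return 2 * g2

# Invented counts for a 100-token toy corpus.
ll_independent = log_likelihood(2, 10, 20, 100)   # observed equals expected
ll_associated = log_likelihood(10, 10, 20, 100)   # the node always occurs with the collocate
print(round(ll_independent, 2), round(ll_associated, 2))
```

When the observed table matches the expected one exactly the statistic is zero, and it grows as the two words co-occur more often than chance predicts.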

Rank by MI3
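MI3 is a variant of the MI score that cubes the observed count, MI3 = log2(O^3 / E), giving more weight to pairs that co-occur often. The sketch assumes E = f(node) x f(collocate) x window / corpus size; tools may compute E differently, and the counts are invented.

```python
import math

def expected(f_node, f_coll, corpus_size, window=1):
    """Co-occurrences expected by chance if the two words were independent."""
    return f_node * f_coll * window / corpus_size

def mi3(observed, f_node, f_coll, corpus_size, window=1):
    """MI3 score: log2(O^3 / E)."""
    return math.log2(observed ** 3 / expected(f_node, f_coll, corpus_size, window))

# Invented counts for a 1,000,000-token corpus.
mi3_frequent = mi3(100, 1000, 2000, 1_000_000)  # common collocate
mi3_rare = mi3(4, 1000, 4, 1_000_000)           # collocate seen only 4 times
print(round(mi3_frequent, 2), round(mi3_rare, 2))
```

With these counts the frequent pair outranks the rare one, the opposite of what plain MI gives for the same figures.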

Rank by Dice coefficient
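The Dice coefficient relates the co-occurrence count to the overall frequencies of the two words: Dice = 2 x O / (f(node) + f(collocate)). The sketch uses this basic form; some tools report a rescaled or logarithmic version, and the counts are invented.

```python
def dice(cooc, f_node, f_coll):
    """Dice coefficient: 2 * O / (f_node + f_coll), between 0 and 1."""
    return 2 * cooc / (f_node + f_coll)

# Invented counts.
dice_partial = dice(10, 20, 30)     # the words co-occur in some of their uses
dice_exclusive = dice(20, 20, 20)   # the words only ever occur together
print(dice_partial, dice_exclusive)
```

A value of 1 means the two words never occur without each other, and values near 0 mean they rarely meet relative to how often each occurs.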