Why We Need Corpora and the Sketch Engine
Adam Kilgarriff
Lexical Computing Ltd, UK; Universities of Leeds and Sussex


Madrid, April 2010. Kilgarriff: Why corpora and how
• Corpora show us the facts of the language

Exercise
• planet
• Think about the word
• What could you say about it if you were writing a dictionary entry?
• Write down three (or more) things

The Sketch Engine: demo

Dictionaries
• How to decide what to say about the word?
  – What the native speaker knows (introspection)
  – What other dictionaries say
  – Corpus

Four ages of corpus lexicography

Age 1: Pre-computer
Oxford English Dictionary: 20 million index cards

Age 2: KWIC Concordances
• From 1980
• Computerised
• Overhauled lexicography
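A KWIC (keyword in context) concordance lines up every occurrence of a word with its left and right context. The following is a minimal sketch of the idea, not the Sketch Engine's implementation; the function name `kwic`, the `width` parameter and the sample text are assumptions made for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return one aligned line per occurrence of `keyword` in `text`.

    Hypothetical helper: `width` is the number of context characters
    shown on each side of the node word.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the node word forms a column.
        lines.append(f"{left:>{width}} {match.group(0)} {right:<{width}}")
    return lines


if __name__ == "__main__":
    sample = "We must save the planet. Every planet orbits a star."
    for line in kwic(sample, "planet", width=20):
        print(line)
```

Because every line has the node word in the same column, a lexicographer can scan down the page and compare contexts quickly.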

Age 2: limitations
As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no

Age 3: Collocation statistics
• Problem: too much data - how to summarise?
• Solution: list of words occurring in neighbourhood of headword, with frequencies
• Sorted by salience
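The collocation-statistics approach can be sketched as: collect the words within a window around the headword, count them, and sort by a salience score. The slide does not say which statistic is meant; the sketch below assumes logDice, and the function name, window size and toy token list are illustrative only.

```python
import math
from collections import Counter


def collocates(tokens, node, window=3, min_hits=1):
    """List (word, co-occurrence count, salience) for words near `node`.

    Assumption: salience is logDice = 14 + log2(2 * f_xy / (f_x + f_y)).
    """
    freq = Counter(tokens)
    # Collect each neighbouring position once, so overlapping windows
    # do not count the same token twice.
    positions = set()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        positions.update(j for j in range(lo, hi) if j != i)
    pair = Counter(tokens[j] for j in positions)

    rows = []
    for word, f_xy in pair.items():
        if f_xy < min_hits:
            continue
        log_dice = 14 + math.log2(2 * f_xy / (freq[node] + freq[word]))
        rows.append((word, f_xy, log_dice))
    # Highest salience first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    toy = "save money now save lives now save money and the the the money".split()
    for word, hits, score in collocates(toy, "save", window=1):
        print(f"{word:10} {hits:3} {score:6.2f}")
```

Sorting by a salience score rather than raw frequency keeps very common words from dominating the list simply because they are common everywhere.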

Collocation listing
For collocates of save (>5 hits), to right of node word:
word        word
forests     life
$1.2        dollars
lives       costs
enormous    thousands
annually    face
jobs        estimated
money       your

Age-3 collocation statistics: limitations
Lists contain
• junk
• unsorted for type: mixes together adverbs, subjects, objects, prepositions
What we really want:
• noise-free lists
• one list for each grammatical relation

Age 4: The word sketch
• Large well-balanced corpus
• Parse to find subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
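The "one list for each grammatical relation" step can be illustrated as follows. This sketch assumes the corpus has already been parsed into (relation, headword, collocate) triples; that input format, the function name `word_sketch` and the sample triples are hypothetical. Raw frequency stands in for the salience statistic to keep the example short.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group a headword's collocates into one sorted list per relation.

    `triples` is an iterable of (relation, head, collocate) tuples,
    an assumed output format for some upstream parser.
    """
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # Sort each relation's list separately, most frequent first.
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


if __name__ == "__main__":
    parsed = [
        ("object", "save", "money"),
        ("object", "save", "money"),
        ("object", "save", "lives"),
        ("subject", "save", "government"),
        ("object", "spend", "money"),
    ]
    for relation, items in word_sketch(parsed, "save").items():
        print(relation, items)
```

Keeping the relations apart is what separates a word sketch from a flat collocate list: objects, subjects and modifiers no longer compete in a single ranking.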

Macmillan English Dictionary For Advanced Learners
Ed: Rundell, 2002, 2007

Demo part 2

Fruit task
• Choose fruit
• Concordance
  – Lemma, noun, lower case
• Frequency: node forms
• Write down
  – Plural freq (pl)
  – Singular freq (sing)
• Compute proportion: pl/(pl+sing)
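The last step of the fruit task is a single division. As a small worked sketch (the function name and the counts are invented for illustration, not taken from any corpus):

```python
def plural_proportion(plural_freq, singular_freq):
    """Return pl / (pl + sing) for one lemma's node-form frequencies."""
    total = plural_freq + singular_freq
    if total == 0:
        raise ValueError("the lemma has no occurrences")
    return plural_freq / total


if __name__ == "__main__":
    # Made-up counts for a hypothetical fruit lemma.
    pl, sing = 30, 70
    print(f"plural proportion: {plural_proportion(pl, sing):.2f}")
```

A value near 1 means the word is mostly used in the plural, and a value near 0 means it is mostly used in the singular.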

What is a corpus?
• A collection of texts (as used for linguistic study)
• Which texts?
• How many?

Which texts?
• Written
• Spoken

Written
• Books
  – Fiction
  – Non-fiction
  – Textbooks
• Newspapers
• Letters, unpublished
• Web pages
• Academic journals
• Student essays
• …

Spoken
Must be transcribed, for text corpora
• Conversation
  – Who? Region, class, age-group, situation…
• Lectures
• TV and Radio
• Film transcripts
• Meetings, seminars
• …

Which texts?
• Different purposes, different text types
• Making dictionaries:
  – Cover the whole language
  – Some of everything

How much?
• Most words are rare
• Zipf’s Law
• To get enough data for most words, we need very big corpora

Zipf’s Law
Word (POS)       r     f     r × f
the (det)
to (prep)
as (adv)
playing (vb)
paint (vb)
amateur (adj)    10,
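The columns of the table (rank r, frequency f, and their product r × f) can be computed from any tokenised text. Under Zipf's Law frequency falls roughly in inverse proportion to rank, so the product stays of a similar order down the list. This is a minimal sketch; the function name and the toy input are assumptions, and a real check needs a large corpus.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (word, rank, frequency, rank * frequency) rows, by rank."""
    counts = Counter(tokens)
    return [
        (word, rank, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


if __name__ == "__main__":
    toy = "a a a a b b c".split()
    print(f"{'word':8}{'r':>4}{'f':>4}{'r x f':>8}")
    for word, rank, freq, product in rank_frequency(toy):
        print(f"{word:8}{rank:>4}{freq:>4}{product:>8}")
```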

Zipf’s Law
• the: 6%
• 100 most frequent: 45%
• 7500 most frequent: 90%
• all others: rare
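Coverage figures like these come from asking what share of all tokens is accounted for by the n most frequent words. A hedged sketch of that calculation follows; the function name and the toy sentence are illustrative, and the percentages on the slide are not reproduced by it.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Share of all tokens covered by the `top_n` most frequent words."""
    if not tokens:
        raise ValueError("empty token list")
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)


if __name__ == "__main__":
    toy = "the cat and the dog and the bird".split()
    for n in (1, 2, 5):
        print(f"top {n}: {coverage(toy, n):.1%} of tokens")
```

On a large corpus the curve rises steeply at first and then flattens, which is why the long tail of rare words needs very big corpora.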

Zipf’s Law

Leading English Corpora: Size
[Chart: size of corpora (in words) by decade, 1960s to 2000s: Brown/LOB, COBUILD, BNC, OEC]

Good news
• The web

Thank you