Using Corpora in Linguistics and Lexicography
Adam Kilgarriff
Lexical Computing Ltd; Universities of Leeds and Sussex, UK

IDS Mannheim 2010, Kilgarriff

Outline
• Precision and recall
• History of corpus lexicography
• The Sketch Engine
  – Demo
• Web corpora
• Corpus and dictionary

Find me all the fat cats
• a request for information

High recall
• Lots of responses
• Maybe not all good

High precision
• Fewer hits
• Higher confidence

Information-seeking

            Recall   Precision
Computers   good     bad
People      bad      good
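The precision/recall contrast above can be made concrete with a small sketch. The function name and the toy "fat cats" data are illustrative assumptions, not material from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision = share of retrieved items that are relevant
    recall    = share of relevant items that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four relevant items ("fat cats") exist in the collection.
relevant = {"cat1", "cat2", "cat3", "cat4"}

# A broad search: lots of responses, maybe not all good (high recall).
broad = ["cat1", "cat2", "cat3", "cat4", "dog1", "dog2", "dog3", "dog4"]

# A narrow search: fewer hits, higher confidence (high precision).
narrow = ["cat1", "cat2"]

print(precision_recall(broad, relevant))
print(precision_recall(narrow, relevant))
```

The broad search finds everything but half of its responses are noise; the narrow search returns only good hits but misses half of what exists.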

Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations

Four ages of corpus lexicography

Age 1: Pre-computer
• Oxford English Dictionary: 5 million index cards

Age 2: KWIC concordances
• From 1980
• Computerised
• Overhauled lexicography

Age 2: limitations
• As corpora get bigger: too much data
  – 50 lines for a word: read all
  – 500 lines: could read all, but it takes a long time
  – 5000 lines: no
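For readers unfamiliar with KWIC (key word in context) concordances, here is a minimal sketch of how one might be produced from a tokenised text. The function name, the context width and the sample sentence are assumptions made for illustration; real concordancers work over indexed corpora.

```python
def kwic(tokens, node, width=4):
    """Return one concordance line per occurrence of `node`.

    Each line shows up to `width` tokens of left and right context,
    with the left context right-aligned so the node words line up.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines


# Hypothetical sample text.
text = "we must save the forests and save money every year".split()
for line in kwic(text, "save"):
    print(line)
```

With 50 such lines a lexicographer can read everything; with 5000 the output is too long to read, which is the limitation noted above.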

Age 3: Collocation statistics
• Problem: too much data - how to summarise?
• Solution: list of words occurring in neighbourhood of headword, with frequencies
• Sorted by salience

Age-3 collocation statistics: limitations
Lists
• contain junk
• are unsorted for type: they mix together adverbs, subjects, objects, prepositions
What we really want:
• noise-free lists
• one list for each grammatical relation

Collocation listing
For collocates of save (>5 hits), window 1-5 words to right of nodeword:

word
forests     life
$1.2        dollars
lives       costs
enormous    thousands
annually    face
jobs        estimated
money       your
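A window-based collocate listing of the kind shown above can be sketched as follows. The function name and the toy token list are assumptions. The sketch ranks by raw frequency only; it does not implement a salience statistic.

```python
from collections import Counter


def right_window_collocates(tokens, node, lo=1, hi=5, threshold=5):
    """Count words occurring `lo`..`hi` positions to the right of `node`.

    Returns (word, frequency) pairs with frequency > threshold,
    most frequent first.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[i + lo:i + hi + 1])
    return [(w, f) for w, f in counts.most_common() if f > threshold]


# Hypothetical mini-corpus; a low threshold is used because it is tiny.
tokens = "save lives now . save lives and save money".split()
print(right_window_collocates(tokens, "save", threshold=1))
```

Even on this toy input the list mixes objects (lives, money) with the node word itself, which illustrates the "junk" and "unsorted for type" limitations.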

Age 4: The word sketch
• Large well-balanced corpus
• Parse to find subjects, objects, heads, modifiers etc.
• One list for each grammatical relation
• Statistics to sort each list, as before
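The idea of one sorted list per grammatical relation can be sketched like this, assuming a parser has already produced (relation, headword, collocate) triples. The slides do not name the sorting statistic; logDice, 14 + log2(2·f_xy / (f_x + f_y)), is used here purely as an illustrative assumption, and the function name and triples are invented.

```python
import math
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group collocates of `headword` by grammatical relation.

    `triples` is an iterable of (relation, head, collocate) tuples.
    Each relation's list holds (collocate, frequency, score) tuples,
    sorted by score, highest first.
    """
    triples = list(triples)
    pair = Counter(triples)                                  # f_xy
    head_freq = Counter((rel, h) for rel, h, _ in triples)   # f_x
    coll_freq = Counter((rel, c) for rel, _, c in triples)   # f_y

    sketch = defaultdict(list)
    for (rel, head, coll), f_xy in pair.items():
        if head != headword:
            continue
        # logDice (assumed statistic, see lead-in)
        score = 14 + math.log2(
            2 * f_xy / (head_freq[(rel, head)] + coll_freq[(rel, coll)])
        )
        sketch[rel].append((coll, f_xy, round(score, 2)))
    for rel in sketch:
        sketch[rel].sort(key=lambda item: item[2], reverse=True)
    return dict(sketch)


# Hypothetical parser output.
triples = [
    ("object", "save", "money"),
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("object", "spend", "money"),
    ("subject", "save", "government"),
]
for relation, collocates in word_sketch(triples, "save").items():
    print(relation, collocates)
```

Unlike the window-based listing, subjects and objects never end up in the same list, because grouping is by parsed relation rather than by proximity.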

Macmillan English Dictionary for Advanced Learners
• Ed: Rundell, 2002

Working practice
• Lexicographers mainly used sketches, not concordances
  – Missed less, more consistent
  – Faster

Euralex 2002
• Can I have them for my language, please?

The Sketch Engine
• Input:
  – any corpus, any language
    • Lemmatised, part-of-speech tagged
  – specification of grammatical relations
• Word sketches integrated with
• Corpus query system
  – Supports complex searching, sorting etc.
• Credit: Pavel Rychly, Masaryk Univ

Customers
• Dictionary publishers
  – Oxford University Press
  – Cambridge University Press
  – Collins
  – Macmillan
  – FrameNet Project (Berkeley, US)
  – National dictionary projects in
    • Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
• Universities
  – Teaching and research
  – Languages, linguistics, language technology
  – UK, Germany, US, Greece, Taiwan, Japan, China, Slovenia, …
• Other
  – Language teaching, textbook writing
  – Information management, web search companies
  – Automatic translation

Web corpora
• Replaceable or replacable?

The web is
• Very very large
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access

Web corpus types
• Large, general corpora
• Small, specialised corpora
  – Specially for translators

Basic steps
• Gather pages
  – CSE hits
  – Select and gather whole sites
  – General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
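The middle of the pipeline (filter, de-duplicate, linguistic processing) can be sketched as below, assuming the pages have already been gathered as plain-text strings. All function names, the minimum-length heuristic and the exact-match de-duplication are simplifying assumptions; the slides do not say which filters or de-duplication method were used, and real pipelines typically also handle near-duplicates and run a lemmatiser and tagger.

```python
import hashlib
import re


def filter_pages(pages, min_words=5):
    """Keep pages with at least `min_words` words (assumed heuristic)."""
    return [p for p in pages if len(p.split()) >= min_words]


def deduplicate(pages):
    """Drop exact duplicates after normalising case and whitespace."""
    seen, unique = set(), []
    for page in pages:
        normalised = re.sub(r"\s+", " ", page.strip().lower())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique


def tokenise(page):
    """Stand-in for linguistic processing: lowercase word tokens only."""
    return re.findall(r"\w+", page.lower())


def build_corpus(pages):
    """Filter, de-duplicate, then tokenise; returns a list of token lists."""
    return [tokenise(p) for p in deduplicate(filter_pages(pages))]


# Hypothetical gathered pages.
pages = [
    "The cat sat on the mat today.",
    "the  cat sat on the mat today.",   # duplicate apart from case/spacing
    "Click here",                       # too short: navigation residue
    "Corpora are collections of real language use.",
]
corpus = build_corpus(pages)
print(len(corpus), corpus[0])
```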

WaC family corpora
• 100m – 2b word corpora
• 2-month project each
• All major world languages available in Sketch Engine
  – Currently 30 languages
  – Growing monthly
• Pioneers: Marco Baroni, Serge Sharoff
• Corpus Factory
• Seeds:
  – mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl

Corpora

Arabic      174    Hindi        31     Russian      188
Chinese     456    Indonesian   102    Slovak       536
Czech       800    Irish        34     Slovene      738
Dutch       128    Italian      1910   Spanish      117
English     5508   Japanese     409    Swedish      114
French      126    Norwegian    95     Telugu       5
German      1627   Persian      6      Thai         108
Greek       149    Portuguese   66     Vietnamese   174
Estonian    11     Romanian     53     Welsh        63
Korean      77     Polish       156    Malay        230

How good are they?
• How to assess?
  – Hard question, open research topic
• Good coverage
  – Newspapers: news, politics bias
  – Web corpora: also cover personal, kitchen vocab
• Web corpus / BNC / journalism corpus
  – First two are close

Evaluating word sketches
• 11 years
• Feedback
  – Good but anecdotal
• Formal evaluation
• Method also lets us evaluate corpora

Goal
• Collocations dictionary
  – Model: Oxford Collocations Dictionary
  – Publication-quality
• Ask a lexicographer
  – For 42 headwords
    • For 20 best collocates per headword
  – “Should we include this collocation in a published dictionary?”

Sample of headwords
• Nouns, verbs, adjectives; random
• High (Top 3000)
  – N: space solution opinion mass corporation leader
  – V: serve incorporate mix desire
  – Adj: high detailed open academic
• Mid ( )
  – N: cattle repayment fundraising elder biologist sanitation
  – V: grieve classify ascertain implant
  – Adj: adjacent eldest prolific ill
• Low (10, ,000)
  – N: predicament adulterer bake bombshell candy shellfish
  – V: slap outgrow plow traipse
  – Adj: neoclassical votive adulterous expandable

Precision and recall
• We test precision
• Recall is harder
  – How do we find all the collocations that the system should have found?
• Current work
  – 200 collocates per headword
  – Selected from
    • All the corpora we have
    • Various parameter settings
  – Plus just-in-time evaluation for 'new' collocates

Four languages, three families
• Dutch
  – ANW, 102m-word lexicographic corpus
• English
  – UKWaC, 1.5b web corpus
• Japanese
  – JpWaC, 400m web corpus
• Slovene
  – FidaPlus, 620m lexicographic corpus

User evaluation
• Evaluate whole system
  – Will it help with my task?
    • E.g. preparing a collocations dictionary
• Contrast: developer evaluation
  – Can I make the system better?
    • Evaluate each module separately
    • Current work

Components
• Corpus
• NLP tools
  – Segmenter, lemmatiser, POS-tagger
• Sketch grammar
• Statistics

Practicalities
• Interface
  – Good, Good-but
    • Merge to good
  – Maybe, Maybe-specialised, Bad
    • Merge to bad
• For each language
  – Two/three linguists/lexicographers
  – If they disagree
    • Don't use for computing performance
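The scoring procedure described above (merge five labels into two, discard items on which judges disagree, then compute precision) might be sketched as follows. The function names and the example judgements are invented for illustration and are not the evaluation data.

```python
GOOD_LABELS = {"good", "good-but"}
BAD_LABELS = {"maybe", "maybe-specialised", "bad"}


def merge(label):
    """Collapse the five interface labels into 'good' or 'bad'."""
    label = label.lower()
    if label in GOOD_LABELS:
        return "good"
    if label in BAD_LABELS:
        return "bad"
    raise ValueError(f"unknown label: {label}")


def precision(judgements):
    """Share of 'good' collocations among those the judges agree on.

    `judgements` maps each collocation to a list of labels, one per judge.
    Items with disagreement after merging are left out of the calculation.
    Returns None if the judges agree on nothing.
    """
    agreed = []
    for labels in judgements.values():
        merged = {merge(label) for label in labels}
        if len(merged) == 1:
            agreed.append(merged.pop())
    if not agreed:
        return None
    return agreed.count("good") / len(agreed)


# Hypothetical judgements from two judges.
judgements = {
    "take advantage": ["Good", "Good-but"],    # agree: good
    "gain advantage": ["Good", "Good"],        # agree: good
    "see advantage": ["Maybe", "Bad"],         # agree: bad
    "offer advantage": ["Good", "Maybe"],      # disagree: excluded
}
print(precision(judgements))
```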

Results

Dutch      66%
English    71%
Japanese   87%
Slovene    71%

Two thirds of a collocations dictionary can be gathered automatically

Small specialised corpora
• Terminologists
• Translators needing target-language domain-specific vocab
• Specialist dictionaries
  – Don’t exist
  – Expensive/inaccessible
  – Out of date
• Instant small web corpora
  – BootCaT: Baroni and Bernardini 2004
  – WebBootCaT demo

Cyborgs
• A creature that is partly human and partly machine
  – Macmillan English Dictionary

Cyborgs and the Information Society
The dictionary-making agent is part human (for precision), part computer (for recall).

Treat your computer with respect. You and it can do great things together.

Thank you

Corpus and dictionary
• Established model:
  – Lexicographers use corpora, users use dictionaries
• But
  – Users like collocations, examples
  – But are not corpus linguists
• Explore the space between corpus and dictionary

Collocationality
• Which words are most ‘collocational’?
• Dictionary publishers
  – Where to put ‘collocation boxes’
• Language learners

Calculation of entropy for advantage (object relation)

Verb      Freq   MLE Prob   x log   = entropy
Take
Gain
Offer
See
Enjoy
…
Clarify
…
Total
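The calculation laid out in the table can be sketched as follows: each verb's maximum-likelihood probability is its frequency divided by the total, and the entropy is the sum of -p·log(p) over all verbs. The function name is an assumption, base-2 logarithms are assumed because the slide does not state the base, and the frequencies are invented because the table's figures are not available.

```python
import math


def collocate_entropy(freqs):
    """Entropy (in bits) of a collocate frequency distribution.

    `freqs` maps each collocate (e.g. a verb taking the noun as object)
    to its frequency. Probabilities are MLE estimates: freq / total.
    """
    total = sum(freqs.values())
    entropy = 0.0
    for f in freqs.values():
        if f > 0:                      # a zero count contributes nothing
            p = f / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical frequencies of verbs taking a noun as object.
concentrated = {"take": 2, "gain": 1, "offer": 1}
uniform = {"take": 1, "gain": 1, "offer": 1, "see": 1}

print(collocate_entropy(concentrated))
print(collocate_entropy(uniform))
```

A noun whose occurrences are concentrated on a few verbs gets a lower entropy than one spread evenly over many, which is presumably why entropy serves here as a measure of collocationality.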

place (17881), attention (8476), door (8426), care (4884), step (4277), advantage (3730), rise (3334), attempt (2825), impression (2596), notice (2462), chapter (2318), mistake (2205), breath (2140), hold (1949), birth (1016), living (953), indication (812), tribute (720), debut (714), button (661), eyebrow (649), anniversary (637), mention (615), glimpse (531), suicide (486), toll (472), refuge (470), spokesman (453), sigh (436), birthday (429), wicket (412), appendix (410), pardon (399), precaution (396), temptation (374), goodbye (372), fuss (366), resemblance (350), goodness (288), precedence (285), havoc (270), tennis (266), comeback (260), farewell (228), prominence (228), go-ahead (202), sip (198),

What is there on the web?
• Web1T
  – Present from Google
  – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    • 1,000,000,000,000
• Compare with BNC
  – Take top 50,000 items of each
  – 105 Web1T words not in BNC top 50k
  – 50 words with highest Web1T:BNC ratio
  – 50 words with lowest ratio
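The comparison procedure above can be sketched with two word-frequency dictionaries. The function name and the toy counts are assumptions, and so is the normalisation: the ratio is computed on relative frequencies (count divided by the total in each list), which the slide does not specify.

```python
def compare_wordlists(web, bnc, top_n=50000, k=50):
    """Compare the top `top_n` words of two frequency lists.

    Returns (web_only, highest, lowest):
      web_only - words in the web top list but not the BNC top list
      highest  - k shared words with the highest web:BNC ratio
      lowest   - k shared words with the lowest web:BNC ratio
    """
    web_top = sorted(web, key=web.get, reverse=True)[:top_n]
    bnc_top = set(sorted(bnc, key=bnc.get, reverse=True)[:top_n])
    web_total, bnc_total = sum(web.values()), sum(bnc.values())

    web_only = [w for w in web_top if w not in bnc_top]
    shared = [w for w in web_top if w in bnc_top]
    # Ratio of relative frequencies (assumed normalisation, see lead-in).
    ratio = {w: (web[w] / web_total) / (bnc[w] / bnc_total) for w in shared}
    ranked = sorted(shared, key=ratio.get, reverse=True)
    return web_only, ranked[:k], ranked[-k:]


# Hypothetical counts, far smaller than the real lists.
web = {"the": 100, "browser": 50, "url": 40, "she": 5}
bnc = {"the": 100, "she": 50, "council": 30, "browser": 1}

web_only, highest, lowest = compare_wordlists(web, bnc, top_n=4, k=1)
print(web_only, highest, lowest)
```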

Web-high (155 terms)
• 61 web and computing
  – config browser spyware url www forum
• 38 porn
• 22 US English
• 18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
• 4 legal
  – trademarks pursuant accordance herein

Web-low
• Exclude British English, transcription/tokenisation anomalies
  – herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Observations
• Pronouns and past tense verbs
  – Fiction
    • Masc vs fem
• Yesterday
  – Probably daily newspapers
• Constancy of ratios:
  – He/him/himself
  – She/her/herself