1 Evaluating word sketches Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.

Presentation transcript:

1 Evaluating word sketches Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex

Geneva, April 2010 Adam Kilgarriff 2 Overview Research programme Examples: Case study Word sketching Evaluating word sketches

Geneva, April 2010 Adam Kilgarriff 3 What is language?

Geneva, April 2010 Adam Kilgarriff 4 What is language? In our heads

Geneva, April 2010 Adam Kilgarriff 5 What is language? In our heads In texts and sound signals

Geneva, April 2010 Adam Kilgarriff 6 What is language? In our heads In texts and sound signals Both

Geneva, April 2010 Adam Kilgarriff 7 Methodology Study language in our heads Competence Chomsky “rationalist” (Descartes, Leibniz)

Geneva, April 2010 Adam Kilgarriff 8 Methodology Study language in our heads Competence Chomsky “rationalist” (Descartes, Leibniz) Odd method for objective science Practical problems: coverage, arbitrariness

Geneva, April 2010 Adam Kilgarriff 9 Methodology Study text “empiricist” (Locke, Hume) Physics: forces, matter Chemistry: chemicals, bonds Language: text, speech signals

Geneva, April 2010 Adam Kilgarriff 10 It goes against the grain What is important about a sentence? its meaning Corpus methodology: Throw away individual sentence meaning Find patterns

Geneva, April 2010 Adam Kilgarriff 11 Twenty years of rapid ascent Computer power Corpora bigger and bigger data sets Language technology tools lemmatizers, POS-taggers, parsers Machine learning, pattern-finding

Geneva, April 2010 Adam Kilgarriff 12 A virtuous circle Pattern finding Linguistic processing Corpus Lexicon Part-of-speech tagging Parsing Lemmatizing More data → gets richer each time round

Geneva, April 2010 Adam Kilgarriff 13 Case study: corpus lexicography - four ages

Geneva, April 2010 Adam Kilgarriff 14 Age 1: Pre-computer Oxford English Dictionary: 20 million index cards

Geneva, April 2010 Adam Kilgarriff 15 Age 2: KWIC Concordances From 1980 Computerised

Geneva, April 2010 Adam Kilgarriff 16 Age 2: KWIC Concordance
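
A KWIC (keyword in context) concordance lines up every occurrence of a word with a few words of context on either side. The following is a minimal sketch of that idea, not the tool shown on the slide; the function name `kwic`, the whitespace tokenisation and the sample sentence are all assumptions made for illustration.

```python
def kwic(tokens, node, width=4):
    """Return one aligned line per occurrence of `node` in `tokens`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            # Up to `width` tokens of context on each side of the node word.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the node words line up.
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines


# Hypothetical sample text, split on whitespace.
text = "we must save money now so that we can save lives later"
concordance = kwic(text.split(), "save")
for line in concordance:
    print(line)
```

With a real corpus the list would run to hundreds or thousands of lines, which is the limitation the next slides describe.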

Geneva, April 2010 Adam Kilgarriff 17 Age 2: KWIC Concordances From 1980 Computerised COBUILD project was innovator the coloured-pens method

Geneva, April 2010 Adam Kilgarriff 18 The coloured pens method

Geneva, April 2010 Adam Kilgarriff 19 Age 2: limitations As corpora get bigger: too much data. 50 lines for a word: read all. 500 lines: could read all, takes a long time. 5000 lines: no

Geneva, April 2010 Adam Kilgarriff 20 Age 3: Collocation statistics Problem: too much data - how to summarise? Solution: list of words occurring in neighbourhood of headword, with frequencies Sorted by salience

Geneva, April 2010 Adam Kilgarriff 21 Collocation listing For collocates of save (>5 hits), window 1-5 words to right of nodeword. Words: your, money, estimated, jobs, face, annually, thousands, enormous, costs, lives, dollars, $1.2, life, forests
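
A listing like the one above can be produced by counting the words that fall in a window to the right of the node word and then ranking them by a salience score. The slides do not say which statistic was used, so the sketch below assumes pointwise mutual information; the function name `right_collocates`, its parameters and the toy token list are hypothetical.

```python
import math
from collections import Counter


def right_collocates(tokens, node, window=5, min_hits=1):
    """Count words 1..window tokens to the right of `node` and rank them.

    Salience here is pointwise mutual information (an assumption), not
    normalised for window size.
    """
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for collocate in tokens[i + 1:i + 1 + window]:
                pair_freq[collocate] += 1
    n = len(tokens)
    rows = []
    for collocate, f_xy in pair_freq.items():
        if f_xy < min_hits:  # the slide keeps only collocates with >5 hits
            continue
        pmi = math.log2(f_xy * n / (word_freq[node] * word_freq[collocate]))
        rows.append((collocate, f_xy, round(pmi, 2)))
    # Most salient collocates first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy corpus; a real run would use millions of tokens and min_hits=6.
tokens = "save money and save lives and spend money".split()
ranked = right_collocates(tokens, "save", window=1)
print(ranked)
```

Mutual information tends to favour rare words, which is one reason other salience statistics are often preferred in practice.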

Geneva, April 2010 Adam Kilgarriff 22 Age 4: The word sketch A corpus-derived one-page summary of a word’s grammatical and collocational behaviour

Geneva, April 2010 Adam Kilgarriff 23 Age 4: The word sketch Large corpus Parse to find subjects, objects, heads, modifiers etc One list for each grammatical relation Statistics to sort each list, as before
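
The recipe on this slide can be sketched as follows, assuming the parsing step has already produced (relation, head, collocate) triples. The slide does not name the statistic; logDice is used here as an assumption, and the function name `word_sketch` and the toy triples are hypothetical.

```python
import math
from collections import Counter, defaultdict


def word_sketch(triples, headword, top=3):
    """Build one sorted collocate list per grammatical relation."""
    triple_freq = Counter(triples)
    head_freq = Counter((rel, head) for rel, head, _ in triples)
    coll_freq = Counter((rel, coll) for rel, _, coll in triples)
    sketch = defaultdict(list)
    for (rel, head, coll), f_xy in triple_freq.items():
        if head != headword:
            continue
        # logDice = 14 + log2(2 * f_xy / (f_x + f_y)), assumed statistic.
        dice = 2 * f_xy / (head_freq[(rel, head)] + coll_freq[(rel, coll)])
        sketch[rel].append((coll, f_xy, round(14 + math.log2(dice), 2)))
    return {
        rel: sorted(rows, key=lambda row: row[2], reverse=True)[:top]
        for rel, rows in sketch.items()
    }


# Toy triples standing in for parser output over a large corpus.
triples = [
    ("object", "save", "money"),
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("object", "spend", "money"),
    ("subject", "save", "government"),
]
sketch = word_sketch(triples, "save")
for relation, rows in sketch.items():
    print(relation, rows)
```

Keeping one list per relation is what separates a word sketch from the flat collocate list of Age 3: objects of a verb are no longer mixed with its subjects or modifiers.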

Geneva, April 2010 Adam Kilgarriff 24 Macmillan English Dictionary For Advanced Learners Ed: Rundell, 2002

Geneva, April 2010 Adam Kilgarriff 25 Euralex 2002

Geneva, April 2010 Adam Kilgarriff 26 Euralex 2002 Can I have them for my language please

Geneva, April 2010 Adam Kilgarriff 27 The Sketch Engine Input: any corpus, any language Lemmatised, part-of-speech tagged specification of grammatical relations Word sketches integrated with corpus query system Developer: Pavel Rychly, Brno

Geneva, April 2010 Adam Kilgarriff 28 Users: Dictionary publishers: Oxford UP, Collins, Chambers, Macmillan, Cambridge UP. Universities: teaching, research. Framenet. Language teaching. Self-registration for free trial account

Geneva, April 2010 Adam Kilgarriff 29 Lexical Computing Ltd Since 2003. Directors: Adam Kilgarriff (UK), Pavel Rychly (Cz), Diana McCarthy (UK, since Oct 2009). Main activities: Sketch Engine service, corpus development. Research-led

Geneva, April 2010 Adam Kilgarriff 30 (demo)

Geneva, April 2010 Adam Kilgarriff 31 Evaluating word sketches 10 years Feedback Good but anecdotal Formal evaluation

Geneva, April 2010 Adam Kilgarriff 32 Goal Collocations dictionary. Model: Oxford Collocations Dictionary. Publication-quality. Ask a lexicographer, for 42 headwords, for the 20 best collocates per headword: “should we include this collocation in a published dictionary?”

Geneva, April 2010 Adam Kilgarriff 33 Sample of headwords Nouns, verbs, adjectives, random
  High (Top 3000)
    N: space, solution, opinion, mass, corporation, leader
    V: serve, incorporate, mix, desire
    Adj: high, detailed, open, academic
  Mid ( )
    N: cattle, repayment, fundraising, elder, biologist, sanitation
    V: grieve, classify, ascertain, implant
    Adj: adjacent, eldest, prolific, ill
  Low (10, ,000)
    N: predicament, adulterer, bake, bombshell, candy, shellfish
    V: slap, outgrow, plow, traipse
    Adj: neoclassical, votive, adulterous, expandable

Geneva, April 2010 Adam Kilgarriff 34 Precision and recall We test precision. Recall is harder: how do we find all the collocations that the system should have found? Current work: 200 collocates per headword, selected from all the corpora we have and various parameter settings, plus just-in-time evaluation for 'new' collocates
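
One way to read the recall plan on this slide is as pooling: candidates from every corpus and parameter setting are merged, a lexicographer judges the pool, and each run is then scored against the judged-good set. The sketch below illustrates that reading only; the function names, run names and judgements are hypothetical.

```python
def pool(runs):
    """Union of the candidate collocates proposed by every run."""
    pooled = set()
    for candidates in runs.values():
        pooled.update(candidates)
    return pooled


def recall(candidates, gold):
    """Share of the judged-good collocates that one run found."""
    gold = set(gold)
    return len(gold & set(candidates)) / len(gold) if gold else 0.0


# Hypothetical candidate lists for one headword from two runs.
runs = {
    "corpus_A": ["money", "life", "time"],
    "corpus_B": ["money", "energy", "the"],
}
pooled = pool(runs)      # this pool would go to a lexicographer for judging
gold = pooled - {"the"}  # hypothetical judgement: everything but "the" is good
scores = {name: recall(cands, gold) for name, cands in runs.items()}
print(scores)
```

Recall measured this way is relative to the pool: a good collocation that no run proposed is never judged, which is why the slide adds just-in-time evaluation for 'new' collocates.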

Geneva, April 2010 Adam Kilgarriff 35 Four languages, three families Dutch: ANW, 102m-word lexicographic corpus. English: UKWaC, 1.5b web corpus. Japanese: JpWaC, 400m web corpus. Slovene: FidaPlus, 620m lexicographic corpus

Geneva, April 2010 Adam Kilgarriff 36 User evaluation Evaluate whole system Will it help with my task Eg preparing a collocations dictionary Contrast: developer evaluation Can I make the system better? Evaluate each module separately Current work

Geneva, April 2010 Adam Kilgarriff 37 Components Corpus NLP tools Segmenter, lemmatiser, POS-tagger Sketch grammar Statistics

Geneva, April 2010 Adam Kilgarriff 38 Practicalities Interface: Good, Good-but (merge to good); Maybe, Maybe-specialised, Bad (merge to bad). For each language: two/three linguists/lexicographers. If they disagree: don't use for computing performance
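
The scoring procedure on this slide can be sketched as follows: merge the five interface labels into good and bad, drop any collocation on which the judges still disagree, and report the share of the remaining items judged good. The label spellings, the function name `precision` and the sample judgements are assumptions made for illustration.

```python
# Five interface labels merged into two classes, as on the slide.
MERGE = {
    "good": "good",
    "good-but": "good",
    "maybe": "bad",
    "maybe-specialised": "bad",
    "bad": "bad",
}


def precision(judgements):
    """judgements maps each collocation to the labels given by each judge.

    Returns (precision, number of items used).
    """
    used = 0
    good = 0
    for labels in judgements.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) > 1:  # judges disagree after merging: leave out
            continue
        used += 1
        if merged == {"good"}:
            good += 1
    return (good / used if used else 0.0), used


# Hypothetical judgements from two judges.
judgements = {
    "save money": ["good", "good-but"],
    "save life": ["good", "good"],
    "save the": ["bad", "maybe"],
    "save face": ["good", "bad"],
}
score, items_used = precision(judgements)
print(round(score, 2), items_used)
```

Leaving disagreements out keeps the figure from depending on which judge is trusted, at the cost of scoring fewer items.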

Geneva, April 2010 Adam Kilgarriff 39 Results Dutch 66%, English 71%, Japanese 87%, Slovene 71%

Geneva, April 2010 Adam Kilgarriff 40 Corpus evaluation Collocation-finding: typical corpus task. Recall. Hold all else constant: statistic, NLP tools, grammar. Best results: best corpus (for collocation-finding). Pomikalek: de-duplication

Geneva, April 2010 Adam Kilgarriff 41 Other topics
  Dante: a new lexical database for English
  Corpus building (mostly from the web)
    Instant corpora with WebBootCaT
    Bigger and better (English): BiWeC and New Model Corpus
    Corpus Factory (many languages)
  Corpus comparison, similarity, evaluation
  Statistics: collocations, keyword lists
  Word frequency lists
  Word senses and lexicography
  SADD: semi-automatic dictionary drafting

Geneva, April 2010 Adam Kilgarriff 42 Thank you

Geneva, April 2010 Adam Kilgarriff 43 Words and word senses automatic thesauruses words

Geneva, April 2010 Adam Kilgarriff 44 Words and word senses automatic thesauruses words manual thesauruses simple hierarchy is appealing homonyms

Geneva, April 2010 Adam Kilgarriff 45 Words and word senses automatic thesauruses words manual thesauruses simple hierarchy is appealing homonyms “aha! objects must be word senses”

Geneva, April 2010 Adam Kilgarriff 46 Problems Theoretical Practical

Geneva, April 2010 Adam Kilgarriff 47 Theoretical

Geneva, April 2010 Adam Kilgarriff 50 Wittgenstein Don’t ask for the meaning, ask for the use

Geneva, April 2010 Adam Kilgarriff 51 Practical

Geneva, April 2010 Adam Kilgarriff 52 Problems Practical A thesaurus is a tool. If the tool organises word senses, you must do WSD before you can use it. WSD: state of the art, optimal conditions: 80%

Geneva, April 2010 Adam Kilgarriff 53 Problems “To use this tool, first replace one fifth of your input with junk”

Geneva, April 2010 Adam Kilgarriff 54 Avoid word senses

Geneva, April 2010 Adam Kilgarriff 55 Avoid word senses This word has three meanings/senses

Geneva, April 2010 Adam Kilgarriff 56 Avoid word senses This word has three meanings/senses This word has three kinds of use well founded empirical we can build on it