What computers can and cannot do for lexicography, or: Us precision, them recall. Adam Kilgarriff, Lexicography Masterclass Ltd and University of Brighton


1 What computers can and cannot do for lexicography
or: Us precision, them recall
Adam Kilgarriff
Lexicography Masterclass Ltd and University of Brighton, UK

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall

2 Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs

3 Find me all the fat cats
• a request for information

4 High recall
• Lots of responses
• Maybe not all good

5 High precision
• Fewer hits
• Higher confidence

6 Us precision, them recall

            Recall   Precision
Computers   good     bad
People      bad      good
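The two measures contrasted on these slides can be made concrete with a small sketch. The function name and the cat names below are invented for illustration and are not from the talk. Precision is the share of returned items that are good, and recall is the share of good items that were returned.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: how many of the responses are good.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: how many of the good items were found.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: the four "fat cats" we actually want.
fat_cats = {"tom", "felix", "garfield", "bagpuss"}

# A broad search: lots of responses, not all good.
broad = ["tom", "felix", "garfield", "bagpuss", "rex", "fido", "nemo", "dumbo"]
# A narrow search: fewer hits, higher confidence.
narrow = ["tom", "garfield"]

print(precision_recall(broad, fat_cats))   # high recall, lower precision
print(precision_recall(narrow, fat_cats))  # high precision, lower recall
```

The broad search mirrors what the slides say computers are good at, and the narrow one what people are good at.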

7 Us precision, them recall
• True in many areas
  – web searching, Google
  – finding an image to illustrate a talk
• Nowhere more so than lexicography

8 Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• antonyms
• meanings
• translations

9 Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs

10 Four ages of corpus lexicography

11 Age 1: Pre-computer
Oxford English Dictionary: 5 million index cards

12 Age 2: KWIC Concordances
• From 1980
• Computerised
• COBUILD project was innovator
• asian-kwic.html
• the coloured-pens method
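A KWIC (keyword in context) concordance lists every occurrence of a word with a few words of context on either side, aligned on the keyword. The following is a minimal sketch only, assuming a corpus that is already split into tokens. The function name `kwic` and the sample sentence are invented for illustration.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical one-sentence "corpus".
tokens = "the fat cat sat on the mat while another cat slept".split()

for left, kw, right in kwic(tokens, "cat", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20}  {kw}  {right}")
```

With one line per hit, the lexicographer reads down the keyword column, which is workable for tens of lines and not for thousands.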

13 Age 2: limitations
As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no

14 Age 3: Collocation statistics
• Problem: too much data. How to summarise?
• Solution: list of words occurring in neighbourhood of headword, with frequencies
• Sorted by salience

15 Collocation listing
For right collocates of save (>5 hits)

word       fr(x+y)   fr(y)      word        fr(x+y)   fr(y)
forests    6         170        life
$                               dollars     8         1668
lives      37        1697       costs       7         1719
enormous   6         301        thousands   6         1481
annually   7         447        face        9         2590
jobs       20        2001       estimated   6         2387
money      64        6776       your        7         3141

16 Collocation statistics
• Which words?
  – next word
  – last word
  – window, +1 to +5; window, -5 to -1
• How sorted?
  – most common collocates (but for most nouns that is just "the")
  – most salient collocates (how to measure salience?)
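The window options on this slide can be sketched as a single counting function. This is an illustration only, assuming a tokenised corpus. The name `collocates` and the sample sentence are invented. Offsets are relative to the headword, so (1, 1) is "next word", (-1, -1) is "last word", and (1, 5) or (-5, -1) are the two windows.

```python
from collections import Counter


def collocates(tokens, headword, start, end):
    """Count words at offsets start..end (inclusive) from each headword hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        for offset in range(start, end + 1):
            j = i + offset
            # Skip the headword itself and positions outside the text.
            if offset != 0 and 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts


# Hypothetical mini-corpus.
tokens = "we save money and they save the forests and save money".split()

print(collocates(tokens, "save", 1, 1).most_common())    # next word
print(collocates(tokens, "save", -1, -1).most_common())  # last word
print(collocates(tokens, "save", 1, 2).most_common())    # window +1 to +2
```

Sorting by raw count, as `most_common` does here, is the "most common collocates" option. On a real corpus it tends to put function words first, which is why the slide goes on to ask how to measure salience.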

17 Mutual Information
• Church and Hanks 1989
• How much more often does a word pair occur than one might expect by chance?
• "Chance" of x and y occurring together: p(x) * p(y)
• Probabilities approximated by frequencies: p(x) ≈ f(x)/N
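Put together, the bullets above give MI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with each probability approximated by a frequency divided by the corpus size N. A minimal sketch follows. The function name and the counts are invented for illustration and are not the figures from the talk.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) from raw corpus counts.

    p(x, y) ~ f_xy / n, p(x) ~ f_x / n, p(y) ~ f_y / n, so the ratio
    p(x, y) / (p(x) * p(y)) simplifies to f_xy * n / (f_x * f_y).
    """
    return math.log2(f_xy * n / (f_x * f_y))


# Hypothetical counts in a corpus of 10,000 words.
print(mutual_information(f_xy=8, f_x=100, f_y=100, n=10_000))  # 3.0: pair is 8x more common than chance
print(mutual_information(f_xy=1, f_x=100, f_y=100, n=10_000))  # 0.0: exactly what chance predicts
```

A score of 0 means the pair occurs as often as chance predicts. Each additional point doubles the ratio of observed to expected frequency.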

18 Mutual Information

X       fr eat   fr X   fr eat X   MI*    rank
it                                 / 1M   3
meat                               / 1M   2
sushi                              / 1M   1

* numbers are log-proportional to MI

19 Problem
• Is mathematical salience the same as lexicographic salience?
• No! Higher-frequency items are lexicographically more salient
• Solution: multiply MI by raw frequency
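The effect of multiplying MI by raw frequency can be seen in a small sketch. All counts below are invented for illustration, since the figures on the original slides did not survive. Only the general pattern is meant to carry over: a rare collocate tops the plain MI ranking, while a more frequent one tops the MI x frequency ranking.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise MI (base 2) from raw counts, as on the previous slide."""
    return math.log2(f_xy * n / (f_x * f_y))


def rank_collocates(candidates, f_head, n):
    """Return collocates ranked by MI and by MI multiplied by pair frequency."""
    mi = {w: mutual_information(f_xy, f_head, f_y, n)
          for w, (f_xy, f_y) in candidates.items()}
    mi_x_fr = {w: mi[w] * candidates[w][0] for w in candidates}
    by_mi = sorted(candidates, key=mi.get, reverse=True)
    by_mi_x_fr = sorted(candidates, key=mi_x_fr.get, reverse=True)
    return by_mi, by_mi_x_fr


# Hypothetical counts: collocate -> (frequency with "eat", overall frequency).
candidates = {"it": (110, 100_000), "meat": (32, 500), "sushi": (2, 4)}

by_mi, by_mi_x_fr = rank_collocates(candidates, f_head=1_000, n=1_000_000)
print(by_mi)        # rare "sushi" comes first on MI alone
print(by_mi_x_fr)   # "meat" comes first once frequency is factored in
```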

20 Mutual Information

X       fr eat   fr X   fr eat X   MI    rank   MI x fr   new rank
it                                 /M    3      400/M
meat                               /M    2      /M        1
sushi                              /M    1      640/M     2

21 Collocation listing
For right collocates of save (>5 hits)

word       fr(x+y)   fr(y)      word        fr(x+y)   fr(y)
forests    6         170        life
$                               dollars     8         1668
lives      37        1697       costs       7         1719
enormous   6         301        thousands   6         1481
annually   7         447        face        9         2590
jobs       20        2001       estimated   6         2387
money      64        6776       your        7         3141

22 Age-3 collocation statistics: limitations
Lists contain
• junk
• unsorted for type: MI lists mix adverbs, subjects, objects, prepositions
What we really want:
• noise-free lists
• one list for each grammatical relation

23 Age 4: The word sketch
• Large well-balanced corpus
• Parse to find subjects, objects, heads, modifiers etc.
• One list for each grammatical relation
• Statistics to sort each list, as before
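The idea of one list per grammatical relation can be sketched as a grouping step over parsed triples. This is an illustration only. The triple layout (headword, relation, collocate), the relation names and the data are assumptions, and raw frequency stands in for the salience statistic to keep the example short.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group a headword's collocates into one sorted list per relation."""
    lists = defaultdict(Counter)
    for head, relation, collocate in triples:
        if head == headword:
            lists[relation][collocate] += 1
    # Sort each list by frequency; a real sketch would sort by salience.
    return {rel: counts.most_common() for rel, counts in lists.items()}


# Hypothetical triples, as a parser might produce them.
triples = [
    ("coffee", "modifier", "black"),
    ("coffee", "modifier", "black"),
    ("coffee", "modifier", "strong"),
    ("coffee", "object_of", "drink"),
    ("tea", "object_of", "drink"),
]

for relation, collocates in word_sketch(triples, "coffee").items():
    print(relation, collocates)
```

Keeping the relations apart is what removes the mixing of adverbs, subjects, objects and prepositions that a single collocate list suffers from.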

24 Can we do it?
• high-accuracy parsing is hard
• lots of NLP work, many parsing frameworks exist
• if any parser can handle a large corpus, it's probably good enough: sorting and statistics make us error-tolerant

25 Can we do it?
• high-accuracy parsing is hard
• lots of NLP work, many parsing frameworks exist
• if any parser can handle a large corpus, it's probably good enough: sorting and statistics make us error-tolerant
• Poor man's parsing:
  – object (of active verb) = last noun in any sequence of nouns, adjectives, determiners, numbers and adverbs following the verb
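The object rule above is simple enough to sketch directly over part-of-speech tagged text. This is a rough illustration under stated assumptions: the simplified tag names ("V", "N", "ADJ", "DET", "NUM", "ADV") are invented for the example and are not the BNC tagset, and the sketch does not check that the verb is active.

```python
# Tags allowed inside the sequence that follows the verb (hypothetical tagset).
SEQUENCE_TAGS = {"N", "ADJ", "DET", "NUM", "ADV"}


def find_objects(tagged):
    """Return (verb, "object", noun) triples using the last-noun heuristic."""
    triples = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "V":
            continue
        last_noun = None
        # Walk the unbroken run of nouns, adjectives, determiners,
        # numbers and adverbs that follows the verb.
        for next_word, next_tag in tagged[i + 1:]:
            if next_tag not in SEQUENCE_TAGS:
                break
            if next_tag == "N":
                last_noun = next_word
        if last_noun is not None:
            triples.append((word, "object", last_noun))
    return triples


# Hypothetical tagged sentence.
sentence = [
    ("she", "PRON"), ("bought", "V"), ("two", "NUM"), ("very", "ADV"),
    ("large", "ADJ"), ("coffee", "N"), ("cups", "N"), ("in", "PREP"),
    ("the", "DET"), ("market", "N"),
]

print(find_objects(sentence))  # [('bought', 'object', 'cups')]
```

The heuristic will make mistakes on individual sentences. The argument on the slide is that sorting and statistics over a large corpus make the overall lists tolerant of such errors.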

27 The word sketch
• British National Corpus (BNC)
  – 100 M words, already POS-tagged
• lemmatized using John Carroll's lemmatizer
• poor man's parsing
• database of 70 million triples
• coffee_n.html

28 Macmillan Dictionary of English for Advanced Learners, 2002
Editor: Rundell. Work done:
• 6000 word sketches
  – most common nouns, verbs, adjectives of English
• HTML files with hyperlinked corpus examples
• lexicographers used them extensively
  – main use of corpus
• positive feedback

29 The WASPbench
• with David Tugwell, UK EPSRC, grant M54971
A lexicographer's workbench
• runtime creation of word sketches
• integration with Word Sense Disambiguation technology
• output is a "disambiguating dictionary": an analysis of a word's meaning into senses, plus a computer program for disambiguating contextualised instances of the word
• First release now available.

30 The Sketch Engine
• Input:
  – any corpus, any language (lemmatised, part-of-speech tagged)
  – specification of grammatical relations
• Word sketches integrated with a corpus query system
  – supports complex searching, sorting etc.
• First release early 2004

31 Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs

32 Natural Language Processing
• The academic discipline which provides the tools
  – Also known as Computational Linguistics, Human Language Technology (HLT), Language Engineering
• Good at evaluation of its tools
• Good news for lexicography:
  – identify the best tools, apply them to our corpora

33 An Anglophone Apology
• Technology, tools, resources most often available for English
• This talk centres on English
• Other languages often present new problems
  – Finding word delimiters for Chinese is hard
  – Finding bunsetsu for Japanese is hard
• Fewer resources available, less work done
• Recommendation:
  – find the local experts for your language

34 Recap: Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• antonyms
• meanings
• translations

35 Recap: Lexicography: finding facts about words
• collocations: sketches
• grammatical patterns: sketches
• idioms
• synonyms
• antonyms
• meanings
• translations

36 Idioms
• Extreme case of collocation/multi-word expressions
• Sequence of workshops on collocations, MWE
• Technical terms (of great interest to technologists, technical): TERMIGHT

37 Antonyms
• Essential semantic relation

38 Antonyms
• Essential semantic relation
but
• Justeson and Katz 1995: distributional evidence for typical antonym pairs
  – rich men and poor men
  – the big ones and the small ones
  – black and white issues
• Perhaps antonyms are 'really' distributional

39 Thesauruses
• Also near-synonyms
  – are there any true synonyms?
• Distributional: which words share the same distributions
  – if the corpus contains wine and beer in the same context: 1 pt similarity between wine and beer
  – gather all points; find nearest neighbours
• Sparck Jones, Lin, Grefenstette
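The point-counting idea on this slide can be sketched as follows. This is a deliberately naive illustration, assuming the corpus has already been reduced to (word, relation, other word) triples. The triple layout, relation names and data are invented, and each shared context scores exactly one point. Published measures such as Lin's weight the contexts rather than counting them equally.

```python
from collections import defaultdict


def build_contexts(triples):
    """Map each word to the set of (relation, other word) contexts it occurs in."""
    contexts = defaultdict(set)
    for word, relation, other in triples:
        contexts[word].add((relation, other))
    return contexts


def nearest_neighbours(contexts, word, k=5):
    """Rank other words by the number of contexts shared with `word`."""
    scores = []
    for other, ctx in contexts.items():
        if other == word:
            continue
        points = len(contexts[word] & ctx)  # one point per shared context
        if points:
            scores.append((other, points))
    # Highest score first; ties broken alphabetically.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Hypothetical triples.
triples = [
    ("wine", "object_of", "drink"), ("beer", "object_of", "drink"),
    ("wine", "object_of", "pour"), ("beer", "object_of", "pour"),
    ("wine", "modifier", "red"), ("beer", "modifier", "cold"),
    ("tea", "object_of", "drink"), ("car", "object_of", "drive"),
]

print(nearest_neighbours(build_contexts(triples), "wine"))
# [('beer', 2), ('tea', 1)]
```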

40 Nearest neighbours
• In WASPbench
• Will be generated in Sketch Engine
NOUNS
zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey

41
exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy
pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup

42 VERBS
measure: determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust
boil: simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften

43 ADJECTIVES
hypnotic: haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky
pink: purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp

44 Translation
• Parallel corpora
  – Texts and their translations
or
• Comparable corpora
  – Matched for source and target (genre and subject matter), not translations
• Which L1 words occur in equivalent L1 settings to L2 words in L2 settings?
  – They are candidate translation pairs
• Very hard problem
• Lots of high-quality research

45 Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs

46 Cyborgs
• Robots: will they take over?
• Rod Brooks's answer:
  – Wrong question: greatest advances are in what the human+computer ensemble can do

47 Cyborgs
• A creature that is partly human and partly machine
  – Macmillan English Dictionary

52 Cyborgs and the Information Society
The dictionary-making agent is part human (for precision), part computer (for recall).

53 Treat your computer with respect. You and it can do great things together.

54 Lexicographers of the future?