1 What computers can and cannot do for lexicography, or: Us precision, them recall
Adam Kilgarriff, Lexicography Masterclass Ltd and University of Brighton, UK

2 Outline [27-29 Aug 2003, Adam Kilgarriff: Us precision them recall]
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs

3 Find me all the fat cats
• a request for information

4 High recall
• Lots of responses
• Maybe not all good

5 High precision
• Fewer hits
• Higher confidence

6 Us precision, them recall

            Recall   Precision
Computers   good     bad
People      bad      good
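
The two measures can be made concrete with a small sketch. It uses the standard definitions (precision = correct responses / all responses; recall = correct responses / all correct items that exist). The data and the names `fat_cats`, `high_recall_answer` and `high_precision_answer` are invented for illustration and are not from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of responses against the true answers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection containing four genuine fat cats.
fat_cats = {"cat1", "cat2", "cat3", "cat4"}

# "Computer-style" answer: finds everything, but half of it is noise.
high_recall_answer = ["cat1", "cat2", "cat3", "cat4",
                      "dog1", "dog2", "thin_cat1", "thin_cat2"]

# "Person-style" answer: few hits, all of them right.
high_precision_answer = ["cat1", "cat2"]

print(precision_recall(high_recall_answer, fat_cats))     # (0.5, 1.0)
print(precision_recall(high_precision_answer, fat_cats))  # (1.0, 0.5)
```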

7 Us precision, them recall
• True in many areas
  – web searching, Google
  – finding an image to illustrate a talk
• Nowhere more so than lexicography

8 Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• antonyms
• meanings
• translations

9 Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs

10 Four ages of corpus lexicography

11 Age 1: Pre-computer
• Oxford English Dictionary: 5 million index cards

12 Age 2: KWIC concordances
• From 1980
• Computerised
• COBUILD project was the innovator
• asian-kwic.html
• the coloured-pens method
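
A KWIC (key word in context) concordance lists every occurrence of a word with its left and right context, aligned on the keyword. The sketch below is a minimal illustration and does not describe the COBUILD software. The function names and the sample sentence are invented, and tokenisation is plain whitespace splitting.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for each hit of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def format_kwic(lines):
    """Right-align the left contexts so the keyword forms a single column."""
    if not lines:
        return ""
    pad = max(len(left) for left, _, _ in lines)
    return "\n".join(f"{left:>{pad}}  {kw}  {right}" for left, kw, right in lines)

# Invented sample text.
tokens = "we must save money and save lives".split()
hits = kwic(tokens, "save", width=2)
print(format_kwic(hits))
```

With a real corpus the list of lines grows with the frequency of the word, which is what makes reading every line impractical for common words.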

13 Age 2: limitations
As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no

14 Age 3: Collocation statistics
• Problem: too much data - how to summarise?
• Solution: list of words occurring in neighbourhood of headword, with frequencies
• Sorted by salience

15 Collocation listing
For right collocates of save (>5 hits)

word      fr(x+y)  fr(y)    word       fr(x+y)  fr(y)
forests   6        170      life       36       4875
$1.2      6        180      dollars    8        1668
lives     37       1697     costs      7        1719
enormous  6        301      thousands  6        1481
annually  7        447      face       9        2590
jobs      20       2001     estimated  6        2387
money     64       6776     your       7        3141

16 Collocation statistics
• Which words?
  – next word
  – last word
  – window, +1 to +5; window, -5 to -1
• How sorted?
  – most common collocates (but for most nouns it's "the")
  – most salient collocates (how to measure salience?)
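
The window options above can be sketched as a simple counter. This is an illustrative assumption about how such counts might be gathered, not the tool used in the talk. `window_collocates` and the sample sentence are invented. A right window is offsets +1 to +5, and a left window is offsets -5 to -1.

```python
from collections import Counter

def window_collocates(tokens, headword, lo=1, hi=5):
    """Count words at offsets lo..hi (inclusive) from each occurrence of headword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        for offset in range(lo, hi + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts

# Invented sample text.
tokens = "save the forests and save the money".split()

right = window_collocates(tokens, "save", lo=1, hi=2)   # next two words
left = window_collocates(tokens, "save", lo=-2, hi=-1)  # previous two words

print(right.most_common(1))  # [('the', 2)]
```

Sorting by raw count puts "the" first even in this tiny example, which is the problem the salience question on the slide points at.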

17 Mutual Information
• Church and Hanks 1989
• How much more often does a word pair occur than one might expect by chance?
• "Chance" of x and y occurring together: p(x) * p(y)
• Probabilities approximated by frequencies: p(x) ≈ f(x)/N

18 Mutual Information

X      fr(eat)  fr(X)    fr(eat X)  MI*    rank
it     1000     400,000  400        1/1M   3
meat   1000     6000     120        20/1M  2
sushi  1000     100      8          80/1M  1

* numbers are log-proportional to MI
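
A minimal sketch of the calculation, using the frequencies from the table above. Mutual information is taken here as log2 of observed over expected co-occurrence, with probabilities approximated by f/N. The corpus size `N` is not given on the slide, so the value below is an assumption. It shifts every score by the same constant and does not affect the ranking.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """log2( p(x,y) / (p(x) * p(y)) ), with each p approximated by frequency / n."""
    return math.log2((f_xy * n) / (f_x * f_y))

N = 100_000_000  # assumed corpus size; not stated in the talk

# word: (fr(eat X), fr(eat), fr(X)) as in the table
rows = {
    "it": (400, 1000, 400_000),
    "meat": (120, 1000, 6000),
    "sushi": (8, 1000, 100),
}

scores = {w: mutual_information(f_xy, f_x, f_y, N)
          for w, (f_xy, f_x, f_y) in rows.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(ranking)  # ['sushi', 'meat', 'it']
```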

19 Problem
• mathematical salience = lexicographic salience?
• no! higher-frequency items are lexicographically more salient
Solution
• multiply MI by raw frequency

20 Mutual Information

X      fr(eat)  fr(X)    fr(eat X)  MI     rank  MI x fr  new rank
it     1000     400,000  400        1/M    3     400/M    3
meat   1000     6000     120        20/M   2     2400/M   1
sushi  1000     100      8          80/M   1     640/M    2
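
The re-ranking can be checked with exact arithmetic. The figures in the table are consistent with multiplying the un-logged ratio fr(eat X) / (fr(eat) * fr(X)) by fr(eat X); that reading is an inference from the numbers, and the function names below are invented for illustration.

```python
from fractions import Fraction

# word: (fr(eat X), fr(eat), fr(X)) as in the table
rows = {
    "it": (400, 1000, 400_000),
    "meat": (120, 1000, 6000),
    "sushi": (8, 1000, 100),
}

def association_ratio(f_xy, f_x, f_y):
    """Observed pair frequency over the product of the single-word frequencies."""
    return Fraction(f_xy, f_x * f_y)

def salience(f_xy, f_x, f_y):
    """Association ratio weighted by raw pair frequency."""
    return association_ratio(f_xy, f_x, f_y) * f_xy

old_rank = sorted(rows, key=lambda w: association_ratio(*rows[w]), reverse=True)
new_rank = sorted(rows, key=lambda w: salience(*rows[w]), reverse=True)

print(old_rank)  # ['sushi', 'meat', 'it']
print(new_rank)  # ['meat', 'sushi', 'it']
```

Weighting by frequency moves "meat" above "sushi": the rarer pair is statistically more surprising, but the commoner one is the more useful dictionary fact.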

21 Collocation listing
For right collocates of save (>5 hits)

word      fr(x+y)  fr(y)    word       fr(x+y)  fr(y)
forests   6        170      life       36       4875
$1.2      6        180      dollars    8        1668
lives     37       1697     costs      7        1719
enormous  6        301      thousands  6        1481
annually  7        447      face       9        2590
jobs      20       2001     estimated  6        2387
money     64       6776     your       7        3141

22 Age-3 collocation statistics: limitations
Lists
• contain junk
• are unsorted for type (MI lists mix adverbs, subjects, objects, prepositions)
What we really want:
• noise-free lists
• one list for each grammatical relation

23 Age 4: The word sketch
• Large well-balanced corpus
• Parse to find subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before

24 Can we do it?
• high-accuracy parsing is hard
• lots of NLP work, many parsing frameworks exist
• if any parser can handle a large corpus, it's probably good enough: sorting, statistics make us error-tolerant

25 Can we do it?
• high-accuracy parsing is hard
• lots of NLP work, many parsing frameworks exist
• if any parser can handle a large corpus, it's probably good enough: sorting, statistics make us error-tolerant
• Poor man's parsing:
  – object (of active verb) = last noun in any sequence of nouns, adjectives, determiners, numbers and adverbs following the verb
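
The object rule can be sketched over part-of-speech-tagged text. This is an illustration of the rule as stated on the slide, not the implementation used for the word sketches. The simplified tag names (`VERB`, `NOUN`, `ADJ`, `DET`, `NUM`, `ADV`), the function `find_objects` and the sample sentence are assumptions, and the sketch does not check that the verb is active.

```python
# Tags allowed inside the sequence that follows the verb (assumed tag set).
OBJECT_PHRASE_TAGS = {"NOUN", "ADJ", "DET", "NUM", "ADV"}

def find_objects(tagged):
    """Return (verb, 'object', noun) triples from a list of (word, tag) pairs."""
    triples = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "VERB":
            continue
        last_noun = None
        j = i + 1
        # Walk the unbroken run of allowed tags after the verb,
        # remembering the last noun seen.
        while j < len(tagged) and tagged[j][1] in OBJECT_PHRASE_TAGS:
            if tagged[j][1] == "NOUN":
                last_noun = tagged[j][0]
            j += 1
        if last_noun is not None:
            triples.append((word, "object", last_noun))
    return triples

# Invented sample sentence.
sentence = [("She", "PRON"), ("drank", "VERB"), ("two", "NUM"), ("very", "ADV"),
            ("strong", "ADJ"), ("coffee", "NOUN"), ("cups", "NOUN"),
            ("quickly", "ADV"), (".", "PUNCT")]

print(find_objects(sentence))  # [('drank', 'object', 'cups')]
```

Individual sentences will often be analysed wrongly by a rule this crude. The slide's argument is that sorting and statistics over a large corpus make the result tolerant of such errors.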

27 The word sketch
• British National Corpus (BNC)
  – 100 M words, already POS-tagged
• lemmatized using John Carroll's lemmatizer
• poor man's parsing
• database of 70 million triples
• coffee_n.html
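
A rough sketch of how a word sketch might be assembled from such triples: one list per grammatical relation, each sorted by a score. The triple layout (headword, relation, collocate), the relation names and the data are assumptions for illustration. Raw frequency stands in here for the salience statistic, which the talk says is applied "as before".

```python
from collections import Counter, defaultdict

def word_sketch(triples, headword, top=3):
    """Group a headword's collocates by grammatical relation, most frequent first."""
    by_relation = defaultdict(Counter)
    for head, relation, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: [w for w, _ in counts.most_common(top)]
            for rel, counts in by_relation.items()}

# Invented triples; a real database would hold millions.
triples = (
    [("coffee", "object_of", "drink")] * 3
    + [("coffee", "object_of", "pour")]
    + [("coffee", "modifier", "black")] * 2
    + [("coffee", "modifier", "instant")]
    + [("tea", "object_of", "drink")]
)

sketch = word_sketch(triples, "coffee")
print(sketch)
# {'object_of': ['drink', 'pour'], 'modifier': ['black', 'instant']}
```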

28 Macmillan Dictionary of English for Advanced Learners, 2002
Editor: Rundell. Work done 1999.
• 6000 word sketches
  – most common nouns, verbs, adjectives of English
• HTML files with hyperlinked corpus examples
• lexicographers used them extensively
  – main use of corpus
• positive feedback

29 The WASPbench
with David Tugwell, UK EPSRC, grant M54971
A lexicographer's workbench
• runtime creation of word sketches
• integration with Word Sense Disambiguation technology
• output is a "disambiguating dictionary": analysis of a word's meaning into senses, plus a computer program for disambiguating contextualised instances of the word
• First release now available. http://wasps.itri.brighton.ac.uk/

30 The Sketch Engine
• Input:
  – any corpus, any language
• Lemmatised, part-of-speech tagged
  – specification of grammatical relations
• Word sketches integrated with
• Corpus query system
  – Supports complex searching, sorting etc
• First release early 2004

31 Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs

32 Natural Language Processing
• The academic discipline which provides the tools
  – Also known as Computational Linguistics, Human Language Technology (HLT), Language Engineering
• Good at evaluation of its tools
• Good news for lexicography:
  – identify the best tools, apply them to our corpora

33 An Anglophone Apology
• Technology, tools, resources most often available for English
• This talk centres on English
• Other languages often present new problems
  – Finding word delimiters for Chinese is hard
  – Finding bunsetsu for Japanese is hard
• Fewer resources available, less work done
• Recommendation:
  – find the local experts for your language

34 Recap: Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• antonyms
• meanings
• translations

35 Recap: Lexicography: finding facts about words
• collocations - sketches
• grammatical patterns - sketches
• idioms
• synonyms
• antonyms
• meanings
• translations

36 Idioms
• Extreme case of collocation/multi-word expressions
• Sequence of workshops on collocations, MWE
• Technical terms (of great interest to technologists, technical): TERMIGHT

37 Antonyms
• Essential semantic relation

38 Antonyms
• Essential semantic relation
but
• Justeson and Katz 1995: distributional evidence for typical antonym pairs
  – rich men and poor men
  – the big ones and the small ones
  – black and white issues
• Perhaps antonyms are 'really' distributional

39 Thesauruses
• Also near-synonyms
  – are there any true synonyms?
• Distributional: which words share same distributions
  – if corpus contains,
  – 1 pt similarity between wine and beer
  – gather all points; find nearest neighbours
• Sparck Jones, Lin, Grefenstette
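
The point-counting idea can be sketched as follows: two words earn one point for every context they share, and a word's nearest neighbours are the words with which it has the most points. The slide's own example contexts did not survive in the transcript, so the triples below are invented. The triple layout (word, relation, other word) is likewise an assumption, and the systems by the authors cited use more refined weighting than a plain count.

```python
from collections import defaultdict
from itertools import combinations

def similarity_points(triples):
    """One point per (relation, other word) context shared by a pair of words."""
    words_by_context = defaultdict(set)
    for word, relation, other in triples:
        words_by_context[(relation, other)].add(word)
    points = defaultdict(int)
    for words in words_by_context.values():
        for a, b in combinations(sorted(words), 2):
            points[(a, b)] += 1
    return points

def nearest_neighbours(points, word):
    """Words sharing contexts with `word`, highest score first (ties alphabetical)."""
    scored = []
    for (a, b), score in points.items():
        if a == word:
            scored.append((score, b))
        elif b == word:
            scored.append((score, a))
    return [w for _, w in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Invented triples.
triples = [
    ("wine", "object_of", "drink"), ("beer", "object_of", "drink"),
    ("tea", "object_of", "drink"),
    ("wine", "object_of", "pour"), ("beer", "object_of", "pour"),
    ("wine", "modifier", "red"),
]

points = similarity_points(triples)
print(nearest_neighbours(points, "wine"))  # ['beer', 'tea']
```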

40 Nearest neighbours
• In WASPbench
• Will be generated in Sketch Engine
NOUNS
zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey

41
exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy
pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup

42 VERBS
measure: determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust
boil: simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften

43 ADJECTIVES
hypnotic: haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky
pink: purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp

44 Translation
• Parallel corpora
  – Texts and their translations
or
• Comparable corpora
  – Matched for source and target (genre and subject matter), not translations
• Which L1 words occur in equivalent L1 settings to L2 words in L2 settings?
  – They are candidate translation pairs
• Very hard problem
• Lots of high quality research

45 Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs

46 Cyborgs
• Robots: will they take over?
• Rod Brooks's answer:
  – Wrong question: greatest advances are in what the human+computer ensemble can do

47 Cyborgs
• A creature that is partly human and partly machine
  – Macmillan English Dictionary

52 Cyborgs and the Information Society
The dictionary-making agent is part human (for precision), part computer (for recall).

53 Treat your computer with respect. You and it can do great things together.

54 Lexicographers of the future?

