Published by Dana Pentecost. Modified over 10 years ago.
1
What computers can and cannot do for lexicography, or: Us precision, them recall
Adam Kilgarriff
Lexicography Masterclass Ltd and University of Brighton, UK
2
27-29 Aug 2003 Adam Kilgarriff: Us precision them recall
Outline
– Precision and recall
– History of corpus lexicography
– Natural Language Processing
– Cyborgs
3
Find me all the fat cats: a request for information
4
High recall
– Lots of responses
– Maybe not all good
5
High precision
– Fewer hits
– Higher confidence
6
Us precision, them recall
            Recall   Precision
Computers   good     bad
People      bad      good
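The two measures can be made concrete with a small sketch. The item names and the two result lists below are invented for illustration; only the definitions (precision = good hits / all hits, recall = good hits / all good items) are standard.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of hits against the true answers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: four fat cats exist in the collection.
fat_cats = {"cat1", "cat2", "cat3", "cat4"}

# A broad, computer-style search: finds every fat cat, plus junk.
broad = ["cat1", "cat2", "cat3", "cat4", "dog1", "dog2", "thin_cat1", "thin_cat2"]

# A careful, human-style search: few hits, all of them good.
narrow = ["cat1", "cat2"]

broad_scores = precision_recall(broad, fat_cats)
narrow_scores = precision_recall(narrow, fat_cats)
print("broad:  precision=%.2f recall=%.2f" % broad_scores)
print("narrow: precision=%.2f recall=%.2f" % narrow_scores)
```

In this toy case the broad search trades precision for recall and the narrow search does the opposite, which is the contrast the table draws.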
7
Us precision, them recall
– True in many areas
  – web searching, Google
  – finding an image to illustrate a talk
– Nowhere more so than lexicography
8
Lexicography: finding facts about words
– collocations
– grammatical patterns
– idioms
– synonyms
– antonyms
– meanings
– translations
9
Outline
– Precision and recall
– History of corpus lexicography
– Natural Language Processing
– Cyborgs
10
Four ages of corpus lexicography
11
Age 1: Pre-computer
– Oxford English Dictionary: 5 million index cards
12
Age 2: KWIC concordances
– From 1980
– Computerised
– COBUILD project was the innovator
– asian-kwic.html
– the coloured-pens method
13
Age 2: limitations
– as corpora get bigger: too much data
  – 50 lines for a word: read all
  – 500 lines: could read all, but it takes a long time
  – 5000 lines: no
14
Age 3: Collocation statistics
– Problem: too much data; how to summarise?
– Solution: list of words occurring in the neighbourhood of the headword, with frequencies
– Sorted by salience
15
Collocation listing
For right collocates of save (>5 hits)
word       fr(x+y)   fr(y)     word        fr(x+y)   fr(y)
forests    6         170       life        36        4875
$1.2       6         180       dollars     8         1668
lives      37        1697      costs       7         1719
enormous   6         301       thousands   6         1481
annually   7         447       face        9         2590
jobs       20        2001      estimated   6         2387
money      64        6776      your        7         3141
16
Collocation statistics
– Which words?
  – next word
  – last word
  – window, +1 to +5; window, -5 to -1
– How sorted?
  – most common collocates: but for most nouns it's "the"
  – most salient collocates: but how to measure salience?
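The "which words?" choices above (next word, last word, or a window of offsets) can be sketched as one counting function. This is a minimal illustration on an invented sentence; `window_collocates` and its parameters are hypothetical names, not part of any tool mentioned in the talk.

```python
from collections import Counter

def window_collocates(tokens, headword, lo, hi):
    """Count words found at offsets lo..hi (inclusive) from each headword hit.

    Offsets are relative positions: +1 is the next word, -1 the last word.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        for off in range(lo, hi + 1):
            j = i + off
            if off != 0 and 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts

# Invented toy "corpus".
tokens = "we must save the forests and save money to save lives".split()

next_word = window_collocates(tokens, "save", 1, 1)    # next word only
right = window_collocates(tokens, "save", 1, 5)        # window +1 to +5
left = window_collocates(tokens, "save", -5, -1)       # window -5 to -1

print(next_word.most_common())
print(right.most_common())
print(left.most_common())
```

Sorting by raw count, as `most_common()` does here, is the "most common collocates" option; a salience measure would replace that final sort.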
17
Mutual Information
– Church and Hanks 1989
– How much more often does a word pair occur than one might expect by chance?
– "Chance" of x and y occurring together: p(x) * p(y)
– Probabilities approximated by frequencies: p(x) ≈ f(x)/N
18
Mutual Information
X       fr eat   fr X      fr eat X   MI*     rank
it      1000     400,000   400        1/1M    3
meat    1000     6000      120        20/1M   2
sushi   1000     100       8          80/1M   1
* numbers are log-proportional to MI
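The figures in the table can be turned into MI scores with a few lines of code. This sketch uses the usual formulation MI = log2(p(x,y) / (p(x) p(y))) with probabilities approximated by f/N; the corpus size `N` is an assumption made for illustration (the slide does not give one), and it affects the scores but not the ranking.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """log2 of observed pair probability over the chance expectation p(x)*p(y)."""
    return math.log2((f_xy * n) / (f_x * f_y))

N = 100_000_000  # assumed corpus size, not stated on the slide

# (fr eat X, fr eat, fr X) taken from the table above.
rows = {
    "it": (400, 1000, 400_000),
    "meat": (120, 1000, 6000),
    "sushi": (8, 1000, 100),
}

scores = {word: mutual_information(f_xy, f_eat, f_x, N)
          for word, (f_xy, f_eat, f_x) in rows.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for word in ranking:
    print(f"{word:6s} MI = {scores[word]:.2f}")
```

The ordering matches the rank column: the rare but strongly associated "sushi" comes first and the very frequent "it" last.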
19
Problem
– mathematical salience = lexicographic salience? no!
– higher-frequency items are lexicographically more salient
Solution
– multiply MI by raw frequency
20
Mutual Information
X       fr eat   fr X      fr eat X   MI     rank   MI x fr   new rank
it      1000     400,000   400        1/M    3      400/M     3
meat    1000     6000      120        20/M   2      2400/M    1
sushi   1000     100       8          80/M   1      640/M     2
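The re-ranking in the table can be reproduced as follows. This sketch follows the slide's own arithmetic, which multiplies the unlogged ratio f(x,y) / (f(x) f(y)) by the pair frequency; applying the same weighting to a logged MI score would be a different choice and is not shown here. Function names are hypothetical.

```python
import math

def association_ratio(f_xy, f_x, f_y):
    """Unlogged ratio shown in the MI column (e.g. 20/M for 'meat')."""
    return f_xy / (f_x * f_y)

def weighted_salience(f_xy, f_x, f_y):
    """The 'MI x fr' column: the ratio multiplied by the raw pair frequency."""
    return f_xy * association_ratio(f_xy, f_x, f_y)

# (fr eat X, fr eat, fr X) taken from the table above.
rows = {
    "it": (400, 1000, 400_000),
    "meat": (120, 1000, 6000),
    "sushi": (8, 1000, 100),
}

old_rank = sorted(rows, key=lambda w: association_ratio(*rows[w]), reverse=True)
new_rank = sorted(rows, key=lambda w: weighted_salience(*rows[w]), reverse=True)

print("by MI:      ", old_rank)
print("by MI x fr: ", new_rank)
```

Weighting by frequency promotes "meat", the collocate a lexicographer would most want to see, above the rarer "sushi".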
21
Collocation listing
For right collocates of save (>5 hits)
word       fr(x+y)   fr(y)     word        fr(x+y)   fr(y)
forests    6         170       life        36        4875
$1.2       6         180       dollars     8         1668
lives      37        1697      costs       7         1719
enormous   6         301       thousands   6         1481
annually   7         447       face        9         2590
jobs       20        2001      estimated   6         2387
money      64        6776      your        7         3141
22
Age-3 collocation statistics: limitations
– Lists contain junk
– unsorted for type
  – MI lists mix adverbs, subjects, objects, prepositions
– What we really want:
  – noise-free lists
  – one list for each grammatical relation
23
Age 4: The word sketch
– Large well-balanced corpus
– Parse to find subjects, objects, heads, modifiers etc.
– One list for each grammatical relation
– Statistics to sort each list, as before
24
Can we do it?
– high-accuracy parsing is hard
– lots of NLP work, many parsing frameworks exist
– if any parser can handle a large corpus, it's probably good enough
  – sorting and statistics make us error-tolerant
25
Can we do it?
– high-accuracy parsing is hard
– lots of NLP work, many parsing frameworks exist
– if any parser can handle a large corpus, it's probably good enough
  – sorting and statistics make us error-tolerant
– Poor man’s parsing:
  – object (of active verb) = last noun in any sequence of nouns, adjectives, determiners, numbers and adverbs following the verb
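The object rule above is simple enough to sketch directly over part-of-speech-tagged text. The tag names (`V`, `N`, `ADJ`, `DET`, `NUM`, `ADV`) and the example sentences are assumptions for illustration, not the BNC tagset, and this sketch does not check that the verb is active.

```python
# Tags allowed inside the sequence that follows the verb (assumed tag names).
SEQUENCE_TAGS = {"N", "ADJ", "DET", "NUM", "ADV"}

def find_objects(tagged):
    """Return (verb, object) pairs from a list of (word, tag) tuples.

    The object is the last noun in the unbroken run of nouns, adjectives,
    determiners, numbers and adverbs that directly follows the verb.
    """
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "V":
            continue
        last_noun = None
        j = i + 1
        while j < len(tagged) and tagged[j][1] in SEQUENCE_TAGS:
            if tagged[j][1] == "N":
                last_noun = tagged[j][0]
            j += 1
        if last_noun is not None:
            pairs.append((word, last_noun))
    return pairs

sentence1 = [("they", "PRON"), ("save", "V"), ("the", "DET"),
             ("tropical", "ADJ"), ("rain", "N"), ("forests", "N")]
sentence2 = [("she", "PRON"), ("drank", "V"), ("two", "NUM"), ("cups", "N"),
             ("of", "PREP"), ("strong", "ADJ"), ("coffee", "N")]

print(find_objects(sentence1))
print(find_objects(sentence2))
```

The second sentence shows the kind of error the rule makes: the run stops at the preposition, so the object is "cups" rather than "coffee". Errors like this are what the later sorting and statistics are expected to absorb.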
27
The word sketch
– British National Corpus (BNC)
  – 100 M words, already POS-tagged
– lemmatized using John Carroll's lemmatizer
– poor man’s parsing
– database of 70 million triples
– coffee_n.html
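A word sketch can be pictured as a query over the triples database: take all triples for one headword, split them by grammatical relation, and sort each list by a salience score. The sketch below is an illustration only. The triples are invented, the `(relation, headword, collocate)` layout is an assumption, and the score (pair frequency squared over the two marginal frequencies within the relation) is a simple stand-in in the spirit of "MI x frequency", not the statistic actually used for the BNC sketches.

```python
from collections import Counter, defaultdict

def word_sketch(triples, headword):
    """Group a headword's collocates by relation, most salient first."""
    triple_freq = Counter(triples)
    head_rel = Counter((rel, head) for rel, head, _ in triples)
    rel_coll = Counter((rel, coll) for rel, _, coll in triples)

    sketch = defaultdict(list)
    for (rel, head, coll), f in triple_freq.items():
        if head != headword:
            continue
        # Assumed stand-in score: frequency-weighted association ratio.
        score = f * f / (head_rel[(rel, head)] * rel_coll[(rel, coll)])
        sketch[rel].append((coll, score))
    for rel in sketch:
        sketch[rel].sort(key=lambda item: (-item[1], item[0]))
    return dict(sketch)

# Invented triples: (relation, headword, collocate).
triples = (
    [("object_of", "coffee", "drink")] * 3
    + [("object_of", "coffee", "make")] * 2
    + [("object_of", "tea", "drink")]
    + [("object_of", "tea", "make")] * 2
    + [("modifier", "coffee", "black")] * 2
    + [("modifier", "coffee", "strong")]
    + [("modifier", "tea", "strong")]
)

sketch = word_sketch(triples, "coffee")
for relation, collocates in sketch.items():
    print(relation, [word for word, _ in collocates])
```

Each relation gets its own list, so verbs taking "coffee" as object are no longer mixed with its modifiers.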
28
Macmillan Dictionary of English for Advanced Learners, 2002
– Editor: Rundell. Work done 1999.
– 6000 word sketches
  – most common nouns, verbs, adjectives of English
– HTML files with hyperlinked corpus examples
– lexicographers used them extensively
  – main use of corpus
– positive feedback
29
The WASPbench
with David Tugwell, UK EPSRC, grant M54971
– A lexicographer's workbench
– runtime creation of word sketches
– integration with Word Sense Disambiguation technology
– output is a "disambiguating dictionary": analysis of a word's meaning into senses, plus a computer program for disambiguating contextualised instances of the word
– First release now available: http://wasps.itri.brighton.ac.uk/
30
The Sketch Engine
– Input:
  – any corpus, any language (lemmatised, part-of-speech tagged)
  – specification of grammatical relations
– Word sketches integrated with corpus query system
  – supports complex searching, sorting etc.
– First release early 2004
31
Outline
– Precision and recall
– History of corpus lexicography
– Natural Language Processing
– Cyborgs
32
Natural Language Processing
– The academic discipline which provides the tools
  – Also known as Computational Linguistics, Human Language Technology (HLT), Language Engineering
– Good at evaluation of its tools
– Good news for lexicography:
  – identify the best tools, apply them to our corpora
33
An Anglophone Apology
– Technology, tools, resources most often available for English
– This talk centres on English
– Other languages often present new problems
  – Finding word delimiters for Chinese is hard
  – Finding bunsetsu for Japanese is hard
– Fewer resources available, less work done
– Recommendation:
  – find the local experts for your language
34
Recap: Lexicography: finding facts about words
– collocations
– grammatical patterns
– idioms
– synonyms
– antonyms
– meanings
– translations
35
Recap: Lexicography: finding facts about words
– collocations: sketches
– grammatical patterns: sketches
– idioms
– synonyms
– antonyms
– meanings
– translations
36
Idioms
– Extreme case of collocation/multi-word expressions
– Sequence of workshops on collocations, MWE
– Technical terms (of great interest to technologists, technical): TERMIGHT
37
Antonyms
– Essential semantic relation
38
Antonyms
– Essential semantic relation
– but Justeson and Katz 1995: distributional evidence for typical antonym pairs
  – rich men and poor men
  – the big ones and the small ones
  – black and white issues
– Perhaps antonyms are ‘really’ distributional
39
Thesauruses
– Also near-synonyms
  – are there any true synonyms?
– Distributional: which words share the same distributions
  – if corpus contains,
  – 1 pt similarity between wine and beer
  – gather all points; find nearest neighbours
– Sparck Jones, Lin, Grefenstette
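The point-counting idea can be sketched as follows: every context shared by two words earns that pair one point, and a word's nearest neighbours are the words it shares the most points with. The triples below, including the "drink" and "pour" contexts for wine and beer, are invented for illustration, and the `(relation, other_word, word)` layout is an assumption; published methods such as Lin's weight the shared contexts rather than simply counting them.

```python
from collections import defaultdict
from itertools import combinations

def similarity_points(triples):
    """One point per shared (relation, other_word) context for each word pair."""
    contexts = defaultdict(set)
    for relation, other, word in triples:
        contexts[(relation, other)].add(word)
    points = defaultdict(int)
    for words in contexts.values():
        for a, b in combinations(sorted(words), 2):
            points[(a, b)] += 1
    return points

def nearest_neighbours(points, word):
    """Words sharing contexts with `word`, highest point total first."""
    scores = {}
    for (a, b), p in points.items():
        if a == word:
            scores[b] = p
        elif b == word:
            scores[a] = p
    return sorted(scores, key=lambda w: (-scores[w], w))

# Invented triples: (relation, other_word, word).
triples = [
    ("object_of", "drink", "wine"),
    ("object_of", "drink", "beer"),
    ("object_of", "pour", "wine"),
    ("object_of", "pour", "beer"),
    ("object_of", "pour", "tea"),
    ("modifier", "red", "wine"),
    ("modifier", "red", "car"),
]

points = similarity_points(triples)
print(points[("beer", "wine")])
print(nearest_neighbours(points, "wine"))
```

With a real corpus the same gathering step runs over millions of triples, and the neighbour lists start to look like thesaurus entries.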
40
Nearest neighbours
– In WASPbench
– Will be generated in Sketch Engine
NOUNS
zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey
41
exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy
pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup
42
VERBS
measure: determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust
boil: simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften
43
ADJECTIVES
hypnotic: haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky
pink: purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp
44
Translation
– Parallel corpora
  – Texts and their translations
– or Comparable corpora
  – Matched for source and target (genre and subject matter), not translations
– Which L1 words occur in equivalent L1 settings to L2 words in L2 settings?
  – They are candidate translation pairs
– Very hard problem
– Lots of high-quality research
45
Outline
– Precision and recall
– History of corpus lexicography
– Natural Language Processing
– Cyborgs
46
Cyborgs
– Robots: will they take over?
– Rod Brooks’s answer:
  – Wrong question: greatest advances are in what the human+computer ensemble can do
47
Cyborgs
– A creature that is partly human and partly machine
  – Macmillan English Dictionary
52
Cyborgs and the Information Society
– The dictionary-making agent is part human (for precision), part computer (for recall).
53
Treat your computer with respect.
You and it can do great things together.
54
Lexicographers of the future?