Published by Juliet Shelton. Modified over 9 years ago.
1
Corpora, Dictionaries, and points in between in the age of the web
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds & Sussex, UK
2
October 2009, Kilgarriff: FLTRP
Outline
Precision and recall
History of corpus lexicography
Sketch Engine
–demo
Automatic Collocations Dictionary
–demo
Electronic dictionaries
3
Find me all the fat cats
a request for information
4
High recall
Lots of responses
Maybe not all good
5
High precision
Fewer hits
Higher confidence
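The two notions can be made concrete with a small calculation. The sketch below assumes the standard definitions (precision = correct hits / all hits returned; recall = correct hits / all correct items that exist). The item names and both searches are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection containing four genuine "fat cats".
relevant = {"fat_cat_1", "fat_cat_2", "fat_cat_3", "fat_cat_4"}

# A broad search: finds every fat cat, plus a lot of noise.
broad = ["fat_cat_1", "fat_cat_2", "fat_cat_3", "fat_cat_4",
         "thin_cat_1", "thin_cat_2", "fat_dog_1", "fat_dog_2"]

# A careful search: few hits, all of them good.
careful = ["fat_cat_1", "fat_cat_2"]

print(precision_recall(broad, relevant))    # high recall, lower precision
print(precision_recall(careful, relevant))  # high precision, lower recall
```

The broad search plays the role of the computer in this talk, the careful search the role of the human.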
6
Us precision, them recall
            Recall   Precision
Computers   good     bad
People      bad      good
7
Us precision, them recall
True in many areas
–web searching, Google
–finding an image to illustrate a talk
Nowhere more so than lexicography
8
Lexicography: finding facts about words
collocations
grammatical patterns
idioms
synonyms
antonyms
meanings
translations
9
Outline
Precision and recall
History of corpus lexicography
Natural Language Processing
Cyborgs
10
Four ages of corpus lexicography
11
Age 1: Pre-computer
Oxford English Dictionary: 5 million index cards
12
Age 2: KWIC Concordances
From 1980
Computerised
COBUILD project was innovator
asian-kwic.html
the coloured-pens method
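For readers who have not seen one, a KWIC (keyword in context) concordance lists every occurrence of a word with a few words of context on either side, aligned on the keyword. The following is a minimal sketch of the idea, not the COBUILD software; the function names and the sample sentence are made up for illustration.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def format_kwic(lines):
    """Right-align the left context so the keywords form one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {kw}  {right}" for left, kw, right in lines]


# Invented sample text, tokenised naively on whitespace.
text = "we must save money now because to save lives we need to save time"
tokens = text.split()

for line in format_kwic(kwic(tokens, "save", width=2)):
    print(line)
```

A lexicographer then reads down the keyword column, which is where the coloured pens come in.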
13
Age 2: limitations
As corpora get bigger: too much data
–50 lines for a word: read all
–500 lines: could read all, but it takes a long time
–5000 lines: no
14
Age 3: Collocation statistics
Problem: too much data - how to summarise?
Solution: list of words occurring in neighbourhood of headword, with frequencies
Sorted by salience
15
Collocation listing
For right collocates of save (>5 hits)
word        fr(x+y)   fr(y)     word        fr(x+y)   fr(y)
forests           6     170     life             36    4875
$1.2              6     180     dollars           8    1668
lives            37    1697     costs             7    1719
enormous          6     301     thousands         6    1481
annually          7     447     face              9    2590
jobs             20    2001     estimated         6    2387
money            64    6776     your              7    3141
16
Collocation statistics
Which words?
–next word
–last word
–window, +1 to +5; window, -5 to -1
How sorted?
–most common collocates
–but for most nouns it's "the"
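Both questions on this slide can be tried out on a toy corpus. The sketch below counts collocates in a window to the right or left of a headword, then compares sorting by raw frequency with sorting by a salience score. The score used here is mutual information in the form log2(fr(x+y) * N / (fr(x) * fr(y))), which is an assumption about what "salience" means; the corpus is invented, and the estimate is rough because it ignores the window size.

```python
import math
from collections import Counter


def collocates(tokens, headword, lo=1, hi=5):
    """Count words at offsets lo..hi from each occurrence of headword.

    lo=1, hi=5 is the right-hand window (+1 to +5);
    lo=-5, hi=-1 is the left-hand window (-5 to -1).
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        for offset in range(lo, hi + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts


def mi_scores(tokens, headword, pair_counts, min_hits=1):
    """Mutual information per collocate y: log2(fr(x+y) * N / (fr(x) * fr(y)))."""
    n = len(tokens)
    freq = Counter(tokens)
    return {
        y: math.log2(f_xy * n / (freq[headword] * freq[y]))
        for y, f_xy in pair_counts.items()
        if f_xy >= min_hits
    }


# Invented toy corpus.
tokens = ("save the money and save the forests and "
          "the cat sat on the mat near the door").split()

right = collocates(tokens, "save", lo=1, hi=2)
scores = mi_scores(tokens, "save", right)

print(right.most_common())                          # raw frequency: "the" comes first
print(sorted(scores, key=scores.get, reverse=True))  # by MI: "the" comes last
```

Sorting by raw frequency puts "the" at the top, which is the problem the slide points to; the salience score pushes it to the bottom.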
17
Collocation listing
For right collocates of save (>5 hits)
word        fr(x+y)   fr(y)     word        fr(x+y)   fr(y)
forests           6     170     life             36    4875
$1.2              6     180     dollars           8    1668
lives            37    1697     costs             7    1719
enormous          6     301     thousands         6    1481
annually          7     447     face              9    2590
jobs             20    2001     estimated         6    2387
money            64    6776     your              7    3141
18
Age-3 collocation statistics: limitations
Lists contain junk
Unsorted for type
–MI lists mix adverbs, subjects, objects, prepositions
What we really want:
–noise-free lists
–one list for each grammatical relation
19
Age 4: The word sketch
Automatic one-page summary of a word's grammatical and collocational behaviour
20
The Sketch Engine
Input:
–any corpus, any language
 Lemmatised, part-of-speech tagged
–specification of grammatical relations
Word sketches integrated with Corpus query system
–Supports complex searching, sorting etc
First release early 2004
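To illustrate how the two inputs fit together, here is a toy sketch: a lemmatised, part-of-speech-tagged corpus plus a small specification of grammatical relations yields one collocate list per relation. This is not the Sketch Engine's implementation or its grammar formalism. The `RELATIONS` table, the tag names, and the rule that a relation holds only between adjacent tokens are all simplifying assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical relation spec: name -> (tag of first token, tag of second token,
# index of the token that is the headword). Real grammars are far richer.
RELATIONS = {
    "object":   ("VERB", "NOUN", 0),   # save + money
    "subject":  ("NOUN", "VERB", 1),   # government + save
    "modifier": ("ADV",  "VERB", 1),   # really + save
}


def word_sketch(tagged, headword, relations=RELATIONS):
    """Group the headword's collocates into one Counter per relation."""
    sketch = defaultdict(Counter)
    for (lem1, tag1), (lem2, tag2) in zip(tagged, tagged[1:]):
        for name, (t1, t2, head_pos) in relations.items():
            if (tag1, tag2) != (t1, t2):
                continue
            head, other = (lem1, lem2) if head_pos == 0 else (lem2, lem1)
            if head == headword:
                sketch[name][other] += 1
    return sketch


# Invented corpus of (lemma, tag) pairs.
tagged = [
    ("government", "NOUN"), ("save", "VERB"), ("money", "NOUN"),
    ("charity", "NOUN"), ("save", "VERB"), ("life", "NOUN"),
    ("we", "PRON"), ("really", "ADV"), ("save", "VERB"), ("money", "NOUN"),
]

sketch = word_sketch(tagged, "save")
for relation, counts in sketch.items():
    print(relation, counts.most_common())
```

Each relation gets its own list, so objects, subjects and modifiers of "save" no longer appear mixed together as they do in a plain window-based list.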
21
Recap: Lexicography: finding facts about words
collocations
grammatical patterns
idioms
synonyms
meanings
translations
22
Thesaurus
Also near-synonyms
–are there any true synonyms?
Distributional: which words share same distributions
–if corpus contains,
–1 pt similarity between wine and beer
–gather all points; find nearest neighbours
Sparck Jones, Lin, Grefenstette
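The point-counting idea can be sketched in a few lines. The example assumes that a "context" is a (grammatical relation, collocate) pair, and that two words earn one point for each context they share. The triples are invented, since the slide's own corpus example did not survive. The measures in the cited work are more refined than this plain count.

```python
from collections import defaultdict


def shared_context_points(triples):
    """Give each pair of words one point for every context they share."""
    contexts = defaultdict(set)
    for word, relation, other in triples:
        contexts[word].add((relation, other))
    words = sorted(contexts)
    points = {}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            points[(w1, w2)] = len(contexts[w1] & contexts[w2])
    return points


def nearest_neighbours(points, word, k=3):
    """Return up to k words sharing the most contexts with `word`."""
    scored = []
    for (w1, w2), score in points.items():
        if word in (w1, w2) and score > 0:
            scored.append((score, w2 if w1 == word else w1))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [w for _, w in scored[:k]]


# Invented (word, relation, collocate) triples.
triples = [
    ("wine", "object_of", "drink"), ("beer", "object_of", "drink"),
    ("wine", "object_of", "pour"),  ("beer", "object_of", "pour"),
    ("tea",  "object_of", "drink"), ("car",  "object_of", "drive"),
]

points = shared_context_points(triples)
print(nearest_neighbours(points, "wine"))
```

Here "wine" and "beer" share two contexts, "wine" and "tea" share one, and "car" shares none, so "beer" comes out as the nearest neighbour of "wine".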
23
Electronic dictionaries
Conference on them last week
Rundell quotation
On
–PC
–Handheld
–Cellphone
–Web
24
On PCs
CD-ROMs as added extra
–No income model
–Large extra publishing cost
–No extra income
25
Handhelds
Students like them, teachers don't
–Subversive!
–Fast to use: used even for conversation
Many dictionaries on one device
–Users usually do not know which
–For publishers
 Complex distribution channels
 Dictionary publishers have little control
26
Cellphones
27
Web dictionaries
Traditional publishers vs new players
Business models
–Free + premium
–Advertising
How many hits/month?
–Macmillan 2.5m
–Cambridge UP 30m
–Leo 100m