
1 Corpora, Dictionaries, and points in between in the age of the web
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds & Sussex, UK

2 Outline (October 2009; Kilgarriff: FLTRP)
• Precision and recall
• History of corpus lexicography
• Sketch Engine
  – demo
• Automatic Collocations Dictionary
  – demo
• Electronic dictionaries

3 Find me all the fat cats
• a request for information

4 High recall
• Lots of responses
• Maybe not all good

5 High precision
• Fewer hits
• Higher confidence
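
A minimal sketch of the two measures behind the "fat cats" request, using the standard definitions: precision is the share of returned items that are right, recall is the share of right items that were returned. The cat names and both result lists are invented for illustration and are not from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: the cats that really are fat.
fat_cats = {"Tom", "Garfield", "Felix", "Misty"}

# A broad search returns every fat cat, plus several that are not fat.
broad_search = ["Tom", "Garfield", "Felix", "Misty", "Rex", "Luna", "Bella", "Max"]

# A careful search returns only cats it is sure about, and misses some.
careful_search = ["Tom", "Garfield"]

print(precision_recall(broad_search, fat_cats))    # high recall, lower precision
print(precision_recall(careful_search, fat_cats))  # high precision, lower recall
```

The broad search corresponds to the "lots of responses, maybe not all good" case; the careful search to "fewer hits, higher confidence".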

6 Us precision, them recall

            Recall   Precision
Computers   good     bad
People      bad      good

7 Us precision, them recall
• True in many areas
  – web searching, Google
  – finding an image to illustrate a talk
• Nowhere more so than lexicography

8 Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• antonyms
• meanings
• translations

9 Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs

10 Four ages of corpus lexicography

11 Age 1: Pre-computer
• Oxford English Dictionary: 5 million index cards

12 Age 2: KWIC Concordances
• From 1980
• Computerised
• COBUILD project was innovator
• asian-kwic.html
• the coloured-pens method

13 Age 2: limitations
• as corpora get bigger: too much data
  – 50 lines for a word: read all
  – 500 lines: could read all, but it takes a long time
  – 5000 lines: no

14 Age 3: Collocation statistics
• Problem: too much data, how to summarise?
• Solution: list of words occurring in neighbourhood of headword, with frequencies
• Sorted by salience

15 Collocation listing
For right collocates of save (>5 hits)

word        fr(x+y)   fr(y)     word        fr(x+y)   fr(y)
forests           6     170     life             36    4875
$1.2              6     180     dollars           8    1668
lives            37    1697     costs             7    1719
enormous          6     301     thousands         6    1481
annually          7     447     face              9    2590
jobs             20    2001     estimated         6    2387
money            64    6776     your              7    3141
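
The listing does not name the statistic it is sorted by. One plausible reading, offered here as an assumption, is mutual information in its common form log2(N · fr(x+y) / (fr(x) · fr(y))). For a single headword x, the corpus size N and fr(x) are constants, so ranking by that score is the same as ranking by the ratio fr(x+y) / fr(y). The sketch below computes that ratio from the figures in the listing; N and fr(save) are not given on the slide and are not needed for the ranking.

```python
# (collocate, fr(x+y), fr(y)) as given in the listing for "save",
# read down the left-hand column and then down the right-hand column.
ROWS = [
    ("forests", 6, 170), ("$1.2", 6, 180), ("lives", 37, 1697),
    ("enormous", 6, 301), ("annually", 7, 447), ("jobs", 20, 2001),
    ("money", 64, 6776), ("life", 36, 4875), ("dollars", 8, 1668),
    ("costs", 7, 1719), ("thousands", 6, 1481), ("face", 9, 2590),
    ("estimated", 6, 2387), ("your", 7, 3141),
]


def ratio(row):
    """Share of the collocate's occurrences that follow the headword."""
    _, joint, freq_y = row
    return joint / freq_y


# Salience-style ranking: highest ratio first.
ranked = sorted(ROWS, key=ratio, reverse=True)

# Plain frequency ranking, for comparison: highest fr(x+y) first.
by_frequency = sorted(ROWS, key=lambda row: row[1], reverse=True)

for word, joint, freq_y in ranked:
    print(f"{word:<10} {joint:>3} {freq_y:>5} {joint / freq_y:.4f}")
```

Sorting by the ratio reproduces the order of the listing, whereas sorting by raw frequency would put money and lives at the top.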

16 Collocation statistics
• Which words?
  – next word
  – last word
  – window, +1 to +5; window, -5 to -1
• How sorted?
• most common collocates
  – but for most nouns it's "the"
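
The two choices on this slide, which window to count in and how to sort the result, can be sketched as follows. This is an illustration under stated assumptions, not the method used in any particular tool: the toy corpus is invented, the function names are hypothetical, and mutual information stands in for "salience". The demo uses a window of +1 to +2 only because the toy corpus is so small.

```python
import math
from collections import Counter


def window_collocates(tokens, headword, left=0, right=5):
    """Count words up to `left` tokens before and `right` tokens after each hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != headword:
            continue
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def mutual_information(joint, freq_x, freq_y, n):
    """log2 of observed co-occurrence over what independence would predict."""
    return math.log2(joint * n / (freq_x * freq_y))


# Invented toy corpus, already tokenised.
tokens = ("we must save the forests and save the whales . "
          "the cat sat on the mat . the dog ate the bone .").split()
totals = Counter(tokens)
n = len(tokens)

right = window_collocates(tokens, "save", left=0, right=2)

# Sorted by raw frequency: the function word comes first.
by_freq = right.most_common()

# Sorted by mutual information: content words move above "the".
by_mi = sorted(
    right,
    key=lambda w: mutual_information(right[w], totals["save"], totals[w], n),
    reverse=True,
)

print(by_freq)
print(by_mi)
```

Passing `left=5, right=0` gives the -5 to -1 window instead.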

17 Collocation listing
For right collocates of save (>5 hits)

word        fr(x+y)   fr(y)     word        fr(x+y)   fr(y)
forests           6     170     life             36    4875
$1.2              6     180     dollars           8    1668
lives            37    1697     costs             7    1719
enormous          6     301     thousands         6    1481
annually          7     447     face              9    2590
jobs             20    2001     estimated         6    2387
money            64    6776     your              7    3141

18 Age-3 collocation statistics: limitations
Lists contain
• junk
• unsorted for type
  – MI lists mix adverbs, subjects, objects, prepositions
What we really want:
• noise-free lists
• one list for each grammatical relation

19 Age 4: The word sketch
• Automatic one-page summary of a word’s grammatical and collocational behaviour

20 The Sketch Engine
• Input:
  – any corpus, any language
      • Lemmatised, part-of-speech tagged
  – specification of grammatical relations
• Word sketches integrated with
• Corpus query system
  – Supports complex searching, sorting etc
• First release early 2004
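
The idea of "one list for each grammatical relation" can be sketched in a few lines. This is not the Sketch Engine's implementation: it assumes that (headword, relation, collocate) triples have already been extracted from a lemmatised, part-of-speech-tagged corpus, it ranks by raw count rather than by a salience statistic, and the function name, relation names and triples are all invented for illustration.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword, top_n=3):
    """Group a headword's collocates into one ranked list per relation."""
    by_relation = defaultdict(Counter)
    for head, relation, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common(top_n) for rel, counts in by_relation.items()}


# Hypothetical triples, as a parser or relation grammar might produce them.
TRIPLES = [
    ("save", "object", "money"),
    ("save", "object", "money"),
    ("save", "object", "life"),
    ("save", "subject", "government"),
    ("save", "modifier", "annually"),
    ("spend", "object", "money"),
]

sketch = word_sketch(TRIPLES, "save")
for relation, collocates in sketch.items():
    print(relation, collocates)
```

Keeping the relations apart is what stops objects, subjects and adverbs from being mixed in a single list.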

21 Recap: Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations

22 Thesaurus
• Also near-synonyms
  – are there any true synonyms?
• Distributional: which words share same distributions
  – if corpus contains,
  – 1 pt similarity between wine and beer
  – gather all points; find nearest neighbours
• Sparck Jones, Lin, Grefenstette
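
A minimal sketch of the point-counting idea: two words earn one point of similarity for every (relation, headword) context they share, and a word's thesaurus entry is the set of words with the most points. The triples below are invented for illustration, the function names are hypothetical, and the published methods cited on the slide weight contexts rather than simply counting them.

```python
from collections import defaultdict
from itertools import combinations


def similarity_points(triples):
    """Give each word pair one point per shared (relation, headword) context."""
    contexts = defaultdict(set)
    for relation, head, word in triples:
        contexts[word].add((relation, head))
    points = {}
    for a, b in combinations(sorted(contexts), 2):
        shared = len(contexts[a] & contexts[b])
        if shared:
            points[(a, b)] = shared
    return points


def nearest_neighbours(word, points, top_n=3):
    """Return the words sharing the most contexts with `word`."""
    scored = []
    for (a, b), score in points.items():
        if word == a:
            scored.append((b, score))
        elif word == b:
            scored.append((a, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Hypothetical (relation, headword, word) triples from a parsed corpus.
TRIPLES = [
    ("object_of", "drink", "wine"),
    ("object_of", "drink", "beer"),
    ("object_of", "pour", "wine"),
    ("object_of", "pour", "beer"),
    ("object_of", "drink", "tea"),
    ("object_of", "read", "book"),
]

points = similarity_points(TRIPLES)
print(nearest_neighbours("wine", points))
```

Here wine and beer share two contexts, wine and tea share one, and book shares none with any of them.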

23 Electronic dictionaries
• Conference on them last week
• Rundell quotation
• On
  – PC
  – Handheld
  – Cellphone
  – Web

24 On PCs
• CD-ROMs as added extra
  – No income model
  – Large extra publishing cost
  – No extra income

25 Handhelds
• Students like them, teachers don’t
  – Subversive!
  – Fast to use: used even for conversation
• Many dictionaries on one device
  – Users usually do not know which
  – For publishers
      • Complex distribution channels
      • Dictionary publishers have little control

26 Cellphones

27 Web dictionaries
• Traditional publishers vs new players
• Business models
  – Free + premium
  – Advertising
• How many hits/month?
  – Macmillan 2.5m
  – Cambridge UP 30m
  – Leo 100m

