
1 Evaluating word sketches. Adam Kilgarriff, Lexical Computing Ltd, Lexicography MasterClass Ltd, Universities of Leeds and Sussex

2 Geneva, April 2010, Adam Kilgarriff. Overview: research programme; examples: case study, word sketching, evaluating word sketches

3 What is language?

4 What is language? In our heads

5 What is language? In our heads; in texts and sound signals

6 What is language? In our heads; in texts and sound signals; both

7 Methodology: study language in our heads. Competence; Chomsky; “rationalist” (Descartes, Leibniz)

8 Methodology: study language in our heads. Competence; Chomsky; “rationalist” (Descartes, Leibniz). Odd method for objective science. Practical problems: coverage, arbitrariness

9 Methodology: study text. “Empiricist” (Locke, Hume). Physics: forces, matter. Chemistry: chemicals, bonds. Language: text, speech signals

10 It goes against the grain. What is important about a sentence? Its meaning. Corpus methodology: throw away individual sentence meaning; find patterns

11 Twenty years of rapid ascent. Computer power. Corpora: bigger and bigger data sets. Language technology tools: lemmatizers, POS-taggers, parsers. Machine learning, pattern-finding

12 A virtuous circle. Pattern finding; linguistic processing; corpus; lexicon; part-of-speech tagging; parsing; lemmatizing. More data → gets richer each time round

13 Case study: corpus lexicography - four ages

14 Age 1: Pre-computer. Oxford English Dictionary: 20 million index cards

15 Age 2: KWIC concordances. From 1980; computerised

16 Age 2: KWIC concordance

17 Age 2: KWIC concordances. From 1980; computerised. COBUILD project was the innovator. The coloured-pens method

18 The coloured-pens method

19 Age 2: limitations. As corpora get bigger: too much data. 50 lines for a word: read all. 500 lines: could read all, but it takes a long time. 5,000 lines: no

20 Age 3: Collocation statistics. Problem: too much data - how to summarise? Solution: list of words occurring in the neighbourhood of the headword, with frequencies, sorted by salience

21 Collocation listing. Collocates of save (>5 hits), window 1-5 words to the right of the node word.
Word: your, money, estimated, jobs, face, annually, thousands, enormous, costs, lives, dollars, $1.2, life, forests
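The listing on slides 20-21 can be approximated with a short sketch. It counts the words found 1-5 positions to the right of the node word, keeps those with more than 5 hits, and sorts them by a salience score. The slides do not name the statistic, so the logDice score used here is an assumption, as are the function name `right_window_collocates` and the toy token list.

```python
import math
from collections import Counter


def right_window_collocates(tokens, node, window=5, min_count=5):
    """Rank words seen 1..window positions to the right of `node`.

    Salience is logDice, which is an assumption: the slides only say
    "sorted by salience" without naming a statistic.
    """
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Count each collocate type once per occurrence of the node word.
            for collocate in set(tokens[i + 1 : i + 1 + window]):
                pair_freq[collocate] += 1

    rows = []
    for collocate, f_xy in pair_freq.items():
        if f_xy <= min_count:  # keep only collocates with more than min_count hits
            continue
        score = 14 + math.log2(2 * f_xy / (word_freq[node] + word_freq[collocate]))
        rows.append((collocate, f_xy, round(score, 2)))
    # Highest salience first; ties broken alphabetically for a stable order.
    return sorted(rows, key=lambda r: (-r[2], r[0]))


# Made-up toy corpus, for illustration only.
toy_tokens = ("save money now . " * 6 + "the cat sat . " * 20).split()
for collocate, hits, salience in right_window_collocates(toy_tokens, "save"):
    print(collocate, hits, salience)
```

On a real corpus the tokens would normally be lemmatised first, so that save, saves and saved are counted together.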

22 Age 4: The word sketch. A corpus-derived one-page summary of a word’s grammatical and collocational behaviour

23 Age 4: The word sketch. Large corpus. Parse to find subjects, objects, heads, modifiers etc. One list for each grammatical relation. Statistics to sort each list, as before
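The recipe on slide 23 can be sketched as follows: take parsed (head, relation, collocate) triples, build one list per grammatical relation for the headword, and sort each list by a statistic. This is a minimal illustration, not the Sketch Engine implementation. The parser is assumed to exist upstream, the triples are made up, and the logDice-style score is again an assumption.

```python
import math
from collections import Counter, defaultdict


def word_sketch(triples, headword, top_n=20):
    """Return {relation: [(collocate, frequency, score), ...]} for `headword`.

    `triples` holds (head, relation, collocate) tuples, assumed to come
    from a parser that is not shown here.
    """
    triples = list(triples)
    triple_freq = Counter(triples)
    head_rel_freq = Counter((head, rel) for head, rel, _ in triples)
    collocate_freq = Counter(coll for _, _, coll in triples)

    lists = defaultdict(list)
    for (head, rel, coll), f in triple_freq.items():
        if head != headword:
            continue
        # Dice-style association between (headword, relation) and collocate.
        score = 14 + math.log2(2 * f / (head_rel_freq[(head, rel)] + collocate_freq[coll]))
        lists[rel].append((coll, f, round(score, 2)))

    # One list per relation, most salient collocates first.
    return {
        rel: sorted(rows, key=lambda r: (-r[2], r[0]))[:top_n]
        for rel, rows in lists.items()
    }


# Made-up triples, for illustration only.
toy_triples = (
    [("save", "object", "money")] * 4
    + [("save", "object", "life")] * 2
    + [("save", "subject", "plan")] * 3
    + [("spend", "object", "money")] * 2
)
for relation, rows in word_sketch(toy_triples, "save").items():
    print(relation, rows)
```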

24 Macmillan English Dictionary for Advanced Learners. Ed: Rundell, 2002

25 Euralex 2002

26 Euralex 2002: can I have them for my language, please?

27 The Sketch Engine. Input: any corpus, any language; lemmatised, part-of-speech tagged; specification of grammatical relations. Word sketches integrated with corpus query system. Developer: Pavel Rychly, Brno

28 Users. Dictionary publishers: Oxford UP, Collins, Chambers, Macmillan, Cambridge UP. Universities: teaching, research. Framenet. Language teaching. http://www.sketchengine.co.uk/ Self-registration for free trial account

29 Lexical Computing Ltd. Since 2003. Directors: Adam Kilgarriff (UK), Pavel Rychly (Cz), Diana McCarthy (UK, since Oct 2009). Main activities: Sketch Engine service; corpus development. Research-led

30 (demo)

31 Evaluating word sketches. 10 years, 1999-2009. Feedback: good but anecdotal. Formal evaluation

32 Goal: collocations dictionary. Model: Oxford Collocations Dictionary. Publication-quality. Ask a lexicographer, for 42 headwords and for the 20 best collocates per headword: “should we include this collocation in a published dictionary?”

33 Sample of headwords: nouns, verbs, adjectives, random
High (top 3000): N space, solution, opinion, mass, corporation, leader; V serve, incorporate, mix, desire; Adj high, detailed, open, academic
Mid (3000-9999): N cattle, repayment, fundraising, elder, biologist, sanitation; V grieve, classify, ascertain, implant; Adj adjacent, eldest, prolific, ill
Low (10,000-30,000): N predicament, adulterer, bake, bombshell, candy, shellfish; V slap, outgrow, plow, traipse; Adj neoclassical, votive, adulterous, expandable
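A sample with this shape (three bands, with 6 nouns, 4 verbs and 4 adjectives in each, 42 headwords in total) could be drawn roughly as below. The sketch assumes the bands are ranks in a frequency-ordered lemma list and that the boundary at rank 3000 belongs to the mid band; the slide does not say either. The function name and the toy lemma list are hypothetical.

```python
import random

# Rank bands (1-based, inclusive). The exact boundaries are an assumption.
BANDS = {"high": (1, 2999), "mid": (3000, 9999), "low": (10000, 30000)}
# Headwords per part of speech in each band, as on the slide.
QUOTA = {"N": 6, "V": 4, "Adj": 4}


def sample_headwords(ranked_lemmas, seed=0):
    """Draw a random stratified sample from a frequency-ranked lemma list.

    `ranked_lemmas` is a list of (lemma, pos) pairs, most frequent first.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for band, (first, last) in BANDS.items():
        in_band = ranked_lemmas[first - 1 : last]
        for pos, quota in QUOTA.items():
            pool = [lemma for lemma, p in in_band if p == pos]
            sample[(band, pos)] = rng.sample(pool, quota)
    return sample


# Made-up ranked list, for illustration only.
toy_ranked = [(f"lemma{i}", ("N", "V", "Adj")[i % 3]) for i in range(30000)]
toy_sample = sample_headwords(toy_ranked)
print(sum(len(words) for words in toy_sample.values()))
```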

34 Precision and recall. We test precision. Recall is harder: how do we find all the collocations that the system should have found? Current work: 200 collocates per headword, selected from all the corpora we have, with various parameter settings, plus just-in-time evaluation for 'new' collocates
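The difference between the two measures can be made concrete with a small sketch. Precision needs judgements only on what the system returned. Recall needs a reference set of good collocations, approximated here by pooling candidates from several runs and judging the pool. Reading the "current work" bullet as pooling of this kind is an assumption, and all names and data below are made up.

```python
def precision(returned, good):
    """Share of returned collocates that were judged good."""
    returned = list(returned)
    return sum(1 for c in returned if c in good) / len(returned)


def pooled_recall(returned, pooled_good):
    """Share of the pooled good collocates that this run returned."""
    return len(set(returned) & set(pooled_good)) / len(pooled_good)


def build_pool(runs):
    """Union of the candidates proposed by several runs (corpora, settings)."""
    pool = set()
    for run in runs:
        pool.update(run)
    return pool


# Made-up candidate lists for one headword, from two hypothetical runs.
run_a = ["money", "life", "face", "the"]
run_b = ["money", "energy", "time", "of"]
pool = build_pool([run_a, run_b])

# Hypothetical lexicographer judgements over the whole pool.
pool_judged_good = {"money", "life", "face", "energy", "time"}

print(precision(run_a, pool_judged_good))
print(pooled_recall(run_a, pool_judged_good))
```

Pooled recall is only an estimate: a collocation that no run proposes never enters the pool, which is one reason to judge 'new' collocates as they appear.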

35 Four languages, three families. Dutch: ANW, 102m-word lexicographic corpus. English: UKWaC, 1.5b-word web corpus. Japanese: JpWaC, 400m-word web corpus. Slovene: FidaPlus, 620m-word lexicographic corpus

36 User evaluation. Evaluate the whole system: will it help with my task? E.g. preparing a collocations dictionary. Contrast: developer evaluation. Can I make the system better? Evaluate each module separately. Current work

37 Components. Corpus. NLP tools: segmenter, lemmatiser, POS-tagger. Sketch grammar. Statistics

38 Practicalities. Interface: Good, Good-but (merge to good); Maybe, Maybe-specialised, Bad (merge to bad). For each language: two/three linguists/lexicographers. If they disagree: don't use for computing performance
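The scoring procedure on slide 38 can be sketched as follows: merge the five interface labels into good and bad, drop every collocation on which the judges disagree, and compute the share of good items among the rest. Checking for disagreement after merging rather than before is an assumption, and the judgements shown are made up.

```python
# Five interface labels merged into two classes, as on the slide.
MERGE = {
    "Good": "good",
    "Good-but": "good",
    "Maybe": "bad",
    "Maybe-specialised": "bad",
    "Bad": "bad",
}


def agreed_precision(judgements):
    """Share of good collocations among those the judges agree on.

    `judgements` maps a collocation to the list of labels given by the
    two or three judges. Items with disagreement are left out entirely.
    Returns None when no item has full agreement.
    """
    good = 0
    total = 0
    for labels in judgements.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) != 1:
            continue  # judges disagree after merging: not used for the score
        total += 1
        if merged == {"good"}:
            good += 1
    return good / total if total else None


# Made-up judgements from two judges, for illustration only.
toy_judgements = {
    ("save", "money"): ["Good", "Good-but"],
    ("save", "face"): ["Good", "Maybe"],
    ("save", "forests"): ["Bad", "Maybe-specialised"],
    ("save", "life"): ["Good", "Good"],
}
print(agreed_precision(toy_judgements))
```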

39 Results. Dutch 66%, English 71%, Japanese 87%, Slovene 71%

40 Corpus evaluation. Collocation-finding: typical corpus task. Recall. Hold all else constant: statistic, NLP tools, grammar. Best results: best corpus (for collocation-finding). Pomikalek: de-duplication

41 Other topics. Dante: a new lexical database for English. Corpus building (mostly from the web): instant corpora with WebBootCaT; bigger and better (English): BiWeC and New Model Corpus; Corpus Factory (many languages). Corpus comparison, similarity, evaluation. Statistics: collocations, keyword lists. Word frequency lists. Word senses and lexicography. SADD: semi-automatic dictionary drafting

42 Thank you. http://www.sketchengine.co.uk

43 Words and word senses. Automatic thesauruses: words

44 Words and word senses. Automatic thesauruses: words. Manual thesauruses: simple hierarchy is appealing; homonyms

45 Words and word senses. Automatic thesauruses: words. Manual thesauruses: simple hierarchy is appealing; homonyms; “aha! objects must be word senses”

46 Problems: theoretical; practical

47 Theoretical


50 Wittgenstein: don’t ask for the meaning, ask for the use

51 Practical

52 Problems: practical. A thesaurus is a tool. If the tool organises word senses, you must do WSD before you can use it. WSD: state of the art, optimal conditions: 80%

53 Problems: “To use this tool, first replace one fifth of your input with junk”

54 Avoid word senses

55 Avoid word senses. This word has three meanings/senses

56 Avoid word senses. This word has three meanings/senses. This word has three kinds of use: well founded; empirical; we can build on it

