Published by Kellie Harrington. Modified over 9 years ago.
1
Evaluating word sketches. Adam Kilgarriff. Lexical Computing Ltd; Lexicography MasterClass Ltd; Universities of Leeds and Sussex.
2
Geneva, April 2010. Adam Kilgarriff. Overview: Research programme; Examples: Case study; Word sketching; Evaluating word sketches.
3
What is language? In our heads. In texts and sound signals. Both.
7
Methodology: Study language in our heads. Competence. Chomsky. “Rationalist” (Descartes, Leibniz). Odd method for objective science. Practical problems: coverage, arbitrariness.
9
Methodology: Study text. “Empiricist” (Locke, Hume). Physics: forces, matter. Chemistry: chemicals, bonds. Language: text, speech signals.
10
It goes against the grain. What is important about a sentence? Its meaning. Corpus methodology: throw away individual sentence meaning; find patterns.
11
Twenty years of rapid ascent. Computer power. Corpora: bigger and bigger data sets. Language technology tools: lemmatizers, POS-taggers, parsers. Machine learning, pattern-finding.
12
A virtuous circle. Diagram labels: Pattern finding; Linguistic processing; Corpus; Lexicon; Part-of-speech tagging; Parsing; Lemmatizing. More data → gets richer each time round.
13
Case study: corpus lexicography - four ages
14
Age 1: Pre-computer. Oxford English Dictionary: 20 million index cards.
15
Age 2: KWIC Concordances. From 1980. Computerised.
16
Age 2: KWIC Concordance
17
Age 2: KWIC Concordances. From 1980. Computerised. The COBUILD project was the innovator. The coloured-pens method.
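A keyword-in-context (KWIC) concordance lists every occurrence of a node word with a fixed amount of context on either side, aligned on the node word. The following is a minimal sketch on an invented sentence; the function names `kwic` and `format_kwic` and the `width` parameter are illustrative assumptions, not part of any tool named in the slides.

```python
def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for each hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def format_kwic(lines):
    """Right-align the left context so the node words line up in one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {tok}  {right}" for left, tok, right in lines]


# Invented example text, for illustration only.
text = "we must save money now so that we can save lives later"
lines = kwic(text.split(), "save", width=3)
formatted = format_kwic(lines)
for line in formatted:
    print(line)
```

With 50 hits such a listing can be read in full; with thousands it cannot, which is the limitation the next slides address.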
18
The coloured pens method
20
Age 2: limitations. As corpora get bigger: too much data. 50 lines for a word: read all. 500 lines: could read all, takes a long time. 5000 lines: no.
20
Age 3: Collocation statistics. Problem: too much data - how to summarise? Solution: list of words occurring in the neighbourhood of the headword, with frequencies, sorted by salience.
21
Collocation listing. Collocates of save (>5 hits), window 1-5 words to the right of the node word: your, money, estimated, jobs, face, annually, thousands, enormous, costs, lives, dollars, $1.2, life, forests.
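A collocation listing of this kind can be sketched as follows: count the words in a window to the right of the node word, drop those below a frequency threshold, and sort by a salience score. The slides do not say which salience statistic was used, so this sketch substitutes pointwise mutual information (PMI) purely as an illustration. The function name `right_collocates`, its parameters, and the toy corpus are assumptions.

```python
import math
from collections import Counter


def right_collocates(tokens, node, window=5, min_hits=1):
    """List (collocate, hits, score) for words up to `window` tokens right of `node`.

    The score is PMI, an illustrative stand-in for "salience".
    The slide's ">5 hits" filter would correspond to min_hits=6.
    """
    freq = Counter(tokens)
    pair = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            pair.update(tokens[i + 1:i + 1 + window])
    n = len(tokens)
    rows = []
    for coll, hits in pair.items():
        if hits < min_hits:
            continue
        pmi = math.log2(hits * n / (freq[node] * freq[coll]))
        rows.append((coll, hits, pmi))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Invented toy corpus, for illustration only.
tokens = ("we save money and they save money and "
          "we spend money and they save lives").split()
rows = right_collocates(tokens, "save", window=1)
for coll, hits, score in rows:
    print(f"{coll:8} {hits:3d} {score:6.2f}")
```

PMI favours rare collocates, so in the toy corpus "lives" (one hit) outranks "money" (two hits); a frequency threshold such as the slide's ">5 hits" is one common way to temper that effect.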
22
Age 4: The word sketch. A corpus-derived one-page summary of a word’s grammatical and collocational behaviour.
23
Age 4: The word sketch. Large corpus. Parse to find subjects, objects, heads, modifiers etc. One list for each grammatical relation. Statistics to sort each list, as before.
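The steps on this slide can be sketched as a grouping operation: take (headword, relation, collocate) triples as a parser might produce them, build one list per grammatical relation, and sort each list by an association score. The slides do not name the statistic; this sketch uses logDice (14 + log2 of the Dice coefficient) as one plausible choice. The function name `word_sketch`, the relation labels, and the triples are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def word_sketch(triples, headword, top=3):
    """Group a headword's collocates by grammatical relation, best-scoring first.

    triples: iterable of (headword, relation, collocate), assumed parser output.
    Score: logDice, an assumed choice of statistic.
    """
    triples = list(triples)
    pair = Counter(triples)
    head_rel = Counter((h, r) for h, r, _ in triples)
    coll_rel = Counter((r, c) for _, r, c in triples)
    sketch = defaultdict(list)
    for (h, r, c), f_xy in pair.items():
        if h != headword:
            continue
        dice = 2 * f_xy / (head_rel[(h, r)] + coll_rel[(r, c)])
        sketch[r].append((c, f_xy, 14 + math.log2(dice)))
    return {
        rel: sorted(rows, key=lambda row: row[2], reverse=True)[:top]
        for rel, rows in sketch.items()
    }


# Invented triples, for illustration only.
triples = (
    [("save", "object", "money")] * 2
    + [("save", "object", "life")] * 2
    + [("spend", "object", "money")] * 3
    + [("save", "subject", "government")]
)
sketch = word_sketch(triples, "save")
for rel, rows in sketch.items():
    print(rel, [(c, f, round(s, 2)) for c, f, s in rows])
```

In the toy data "life" outranks "money" as an object of "save" because "money" is also a frequent object of "spend", so it is less distinctive of the headword.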
24
Macmillan English Dictionary for Advanced Learners. Ed: Rundell, 2002.
25
Euralex 2002. Can I have them for my language, please?
27
The Sketch Engine. Input: any corpus, any language; lemmatised, part-of-speech tagged; specification of grammatical relations. Word sketches integrated with corpus query system. Developer: Pavel Rychly, Brno.
28
Users. Dictionary publishers: Oxford UP, Collins, Chambers, Macmillan, Cambridge UP. Universities: teaching, research. Framenet. Language teaching. http://www.sketchengine.co.uk/ Self-registration for free trial account.
29
Lexical Computing Ltd. Since 2003. Directors: Adam Kilgarriff (UK), Pavel Rychly (Cz), Diana McCarthy (UK, since Oct 2009). Main activities: Sketch Engine service; corpus development. Research-led.
30
(demo)
31
Evaluating word sketches. 10 years, 1999-2009. Feedback: good but anecdotal. Formal evaluation.
32
Goal: collocations dictionary. Model: Oxford Collocations Dictionary. Publication-quality. Ask a lexicographer, for 42 headwords, for the 20 best collocates per headword: “should we include this collocation in a published dictionary?”
33
Sample of headwords: nouns, verbs, adjectives; random.
High (top 3000). N: space, solution, opinion, mass, corporation, leader. V: serve, incorporate, mix, desire. Adj: high, detailed, open, academic.
Mid (3000-9999). N: cattle, repayment, fundraising, elder, biologist, sanitation. V: grieve, classify, ascertain, implant. Adj: adjacent, eldest, prolific, ill.
Low (10,000-30,000). N: predicament, adulterer, bake, bombshell, candy, shellfish. V: slap, outgrow, plow, traipse. Adj: neoclassical, votive, adulterous, expandable.
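The sample has the shape of a stratified random draw: three frequency bands, each with six nouns, four verbs and four adjectives, giving 42 headwords. A minimal sketch of such a draw follows, assuming the bands refer to frequency rank and that the high band ends at rank 2999 (the slide's "Top 3000" and "3000-9999" overlap at 3000). The word list is synthetic, and `sample_headwords`, `BANDS` and `QUOTA` are illustrative names.

```python
import random

# Band boundaries as frequency ranks; the 2999/3000 split is an assumption.
BANDS = {"high": (1, 2999), "mid": (3000, 9999), "low": (10000, 30000)}
# Headwords drawn per part of speech in each band (6 + 4 + 4 = 14).
QUOTA = {"N": 6, "V": 4, "Adj": 4}


def sample_headwords(ranked, seed=0):
    """Draw a random sample per (band, part of speech).

    ranked: list of (lemma, pos), most frequent first (rank 1 at index 0).
    """
    rng = random.Random(seed)
    sample = {}
    for band, (lo, hi) in BANDS.items():
        in_band = ranked[lo - 1:hi]
        for pos, k in QUOTA.items():
            pool = [lemma for lemma, p in in_band if p == pos]
            sample[(band, pos)] = rng.sample(pool, k)
    return sample


# Synthetic frequency list: lemma "w<rank>" with parts of speech cycling.
ranked = [(f"w{i}", ("N", "V", "Adj")[i % 3]) for i in range(1, 30001)]
sample = sample_headwords(ranked)
print(sum(len(words) for words in sample.values()))
```

Fixing the seed makes the draw repeatable, which matters when several evaluators must judge the same headwords.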
34
Precision and recall. We test precision. Recall is harder: how do we find all the collocations that the system should have found? Current work: 200 collocates per headword, selected from all the corpora we have and various parameter settings, plus just-in-time evaluation for 'new' collocates.
35
Four languages, three families. Dutch: ANW, 102m-word lexicographic corpus. English: UKWaC, 1.5b web corpus. Japanese: JpWaC, 400m web corpus. Slovene: FidaPlus, 620m lexicographic corpus.
36
User evaluation: evaluate whole system. Will it help with my task? E.g. preparing a collocations dictionary. Contrast: developer evaluation. Can I make the system better? Evaluate each module separately. Current work.
37
Components: corpus; NLP tools (segmenter, lemmatiser, POS-tagger); sketch grammar; statistics.
38
Practicalities. Interface: Good, Good-but: merge to good; Maybe, Maybe-specialised, Bad: merge to bad. For each language: two/three linguists/lexicographers. If they disagree: don't use for computing performance.
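The scoring procedure described here can be sketched as: collapse the five interface labels into good or bad, discard any collocation on which the judges disagree after merging, and report the share of good items among the rest. Treating that share as the precision figure is an assumption based on the earlier "We test precision" slide; the function names and the judgements below are invented for illustration.

```python
GOOD = {"Good", "Good-but"}
BAD = {"Maybe", "Maybe-specialised", "Bad"}


def merge(label):
    """Collapse the five interface labels into 'good' or 'bad'."""
    if label in GOOD:
        return "good"
    if label in BAD:
        return "bad"
    raise ValueError(f"unknown label: {label}")


def precision(judgements):
    """Return (precision, number of agreed items).

    judgements: dict mapping a collocation to one label per judge.
    Items on which the merged labels disagree are left out of the count.
    """
    good = agreed = 0
    for labels in judgements.values():
        merged = {merge(label) for label in labels}
        if len(merged) != 1:
            continue  # judges disagree: do not use for computing performance
        agreed += 1
        if merged == {"good"}:
            good += 1
    return (good / agreed if agreed else None), agreed


# Invented judgements from two judges, for illustration only.
judgements = {
    ("save", "money"): ["Good", "Good-but"],
    ("save", "life"): ["Good", "Good"],
    ("save", "annually"): ["Maybe", "Bad"],
    ("save", "estimated"): ["Good", "Bad"],
}
p, n = precision(judgements)
print(f"precision {p:.2f} over {n} agreed items")
```

Merging before comparing means that "Good" versus "Good-but" counts as agreement, so only a good/bad split removes an item from the calculation.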
39
Results: Dutch 66%; English 71%; Japanese 87%; Slovene 71%.
40
Corpus evaluation. Collocation-finding: typical corpus task. Recall. Hold all else constant: statistic, NLP tools, grammar. Best results: best corpus (for collocation-finding). Pomikalek: de-duplication.
41
Other topics. Dante: a new lexical database for English. Corpus building (mostly from the web): instant corpora with WebBootCaT; bigger and better (English): BiWeC and New Model Corpus; Corpus Factory (many languages). Corpus comparison, similarity, evaluation. Statistics: collocations, keyword lists. Word frequency lists. Word senses and lexicography. SADD: semi-automatic dictionary drafting.
42
Thank you. http://www.sketchengine.co.uk
43
Words and word senses. Automatic thesauruses: words. Manual thesauruses: simple hierarchy is appealing; homonyms; “aha! objects must be word senses”.
46
Problems: theoretical; practical.
47
Theoretical
50
Wittgenstein: Don’t ask for the meaning, ask for the use.
51
Practical
52
Problems: practical. A thesaurus is a tool. If the tool organises word senses, you must do WSD before you can use it. WSD: state of the art, optimal conditions: 80%.
53
Problems: “To use this tool, first replace one fifth of your input with junk”.
54
Avoid word senses. This word has three meanings/senses. This word has three kinds of use: well founded; empirical; we can build on it.