Corpus Evaluation. Adam Kilgarriff, Lexical Computing Ltd. Portsmouth, Nov 2011.


1 Corpus Evaluation. Adam Kilgarriff, Lexical Computing Ltd. Portsmouth, Nov 2011.

2 Then: very few corpora; use what’s there. Now: corpora to spec; choice; need to evaluate.

3 Intrinsic: see what it looks like. Extrinsic (better): embed in a task; how well do you do at the task? It all depends what you want it for.

4 It all depends what you want it for, but ‘general English (/French/Chinese/ …)’: many purposes; not a specialist sublanguage. A decent construct? Not sure, but it has form: general language dictionaries. “How good is a corpus for making them?”

5 General truths: duplicates bad; noise bad; big good; diverse good (good coverage of varieties within research scope, not dominated by any one variety).

6 Word sketch: a corpus-derived one-page summary of a word’s grammatical and collocational behaviour.

7 Macmillan English Dictionary for Advanced Learners. Ed.: Rundell, 2002.

8 Eleven years, 1999-2010. Feedback: good but anecdotal. Formal evaluation.

9 Goal: collocations dictionary. Model: Oxford Collocations Dictionary. Publication-quality. Ask a lexicographer, for 42 headwords, for the 20 best collocates per headword: “Should we include this collocation in a published dictionary?”

10 Sample of headwords: nouns, verbs, adjectives; random.
High (top 3000): N space, solution, opinion, mass, corporation, leader; V serve, incorporate, mix, desire; Adj high, detailed, open, academic.
Mid (3000-9999): N cattle, repayment, fundraising, elder, biologist, sanitation; V grieve, classify, ascertain, implant; Adj adjacent, eldest, prolific, ill.
Low (10,000-30,000): N predicament, adulterer, bake, bombshell, candy, shellfish; V slap, outgrow, plow, traipse; Adj neoclassical, votive, adulterous, expandable.
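
A sample like the one on this slide (three frequency bands, each with 6 nouns, 4 verbs and 4 adjectives, 42 headwords in all) could be drawn roughly as follows. This is only a sketch: the toy lexicon, the function name `sample_headwords`, the seed, and the exact band boundaries (the slide's "top 3000" and "3000-9999" overlap at rank 3000, so the high band is assumed to end at 2999) are illustrative assumptions, not the procedure actually used.

```python
import random

# Frequency-rank bands from the slide; the 2999/3000 boundary is an assumption.
BANDS = {"high": (1, 2999), "mid": (3000, 9999), "low": (10000, 30000)}
# Headwords per part of speech in each band, as counted on the slide.
QUOTA = {"N": 6, "V": 4, "Adj": 4}

def sample_headwords(lexicon, seed=0):
    """Draw a random sample stratified by frequency band and part of speech.

    lexicon: iterable of (lemma, pos, frequency_rank) tuples (hypothetical format).
    Returns a list of (band, lemma, pos, frequency_rank) tuples.
    """
    rng = random.Random(seed)
    sample = []
    for band, (lo, hi) in BANDS.items():
        for pos, n in QUOTA.items():
            pool = [e for e in lexicon if e[1] == pos and lo <= e[2] <= hi]
            # random.sample raises ValueError if the pool has fewer than n entries.
            sample.extend((band, *e) for e in rng.sample(pool, n))
    return sample

# Toy lexicon with invented lemmas, only to make the sketch runnable.
LEXICON = [(f"w{r}", ("N", "V", "Adj")[r % 3], r) for r in range(1, 30001)]
sample = sample_headwords(LEXICON)
print(len(sample))
```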

11 Precision and recall: we tested precision. Recall is harder: how do we find all the collocations that the system should have found?
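
Precision here only needs judgements on what the system returned, which is why it is the easier measure. A minimal sketch, assuming a ranked list of collocates for one headword and a yes/no judgement per collocate; the function name `precision_at_k`, the example collocates and the judgements are hypothetical:

```python
def precision_at_k(ranked_collocates, is_good, k=20):
    """Share of the top-k proposed collocates that were judged good.

    ranked_collocates: collocates for one headword, best first.
    is_good: callable returning True if a collocate was judged acceptable.
    k: how many top collocates to evaluate (the study used the 20 best).
    """
    top = ranked_collocates[:k]
    if not top:
        raise ValueError("no collocates to evaluate")
    return sum(1 for c in top if is_good(c)) / len(top)

# Invented example for the headword "opinion".
ranked = ["strong", "public", "the", "expert"]
judged_good = {"strong", "public", "expert"}
p = precision_at_k(ranked, judged_good.__contains__, k=3)
print(round(p, 2))
```

Recall cannot be computed this way, because its denominator is the set of all good collocations, including those the system never proposed.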

12 Four languages, three families.
Dutch: ANW, 102m-word lexicographic corpus.
English: UKWaC, 1.5b web corpus.
Japanese: JpWaC, 400m web corpus.
Slovene: FidaPlus, 620m lexicographic corpus.

13 User evaluation: evaluate the whole system. Will it help with my task (e.g. preparing a collocations dictionary)? Contrast developer evaluation: can I make the system better? Evaluate each module separately. Current work.

14 Components: corpus; NLP tools (segmenter, lemmatiser, POS-tagger); sketch grammar; statistics.

15 Practicalities. Interface: Good, Good-but (merge to good); Maybe, Maybe-specialised, Bad (merge to bad). For each language: two or three linguists/lexicographers. If they disagree, don't use the item for computing performance.
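
The scoring rule on this slide can be sketched as below. The five labels and the merge into good/bad come from the slide; the data format, the function name `agreed_precision`, the example items, and the choice to test for disagreement after merging (rather than on the raw labels) are assumptions made for illustration.

```python
# Merge the five interface labels into two classes, as on the slide.
MERGE = {
    "Good": "good",
    "Good-but": "good",
    "Maybe": "bad",
    "Maybe-specialised": "bad",
    "Bad": "bad",
}

def agreed_precision(judgements):
    """Precision over the items on which all judges agree after merging.

    judgements: dict mapping a collocation to the list of raw labels,
    one label per judge (hypothetical format).
    Returns None if no item has full agreement.
    """
    kept = []
    for labels in judgements.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) == 1:      # all judges agree
            kept.append(merged.pop())
        # items with disagreement are left out of the computation
    if not kept:
        return None
    return kept.count("good") / len(kept)

# Invented example with two judges.
example = {
    "strong opinion": ["Good", "Good-but"],   # agree: good
    "opinion the": ["Maybe", "Bad"],          # agree: bad
    "legal opinion": ["Good", "Maybe"],       # disagree: dropped
    "public opinion": ["Good", "Good"],       # agree: good
}
score = agreed_precision(example)
print(round(score, 2))
```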

16 Results: Dutch 66%, English 71%, Japanese 87%, Slovene 71%. Two thirds of a collocations dictionary can be gathered automatically.

17 Problem: is it good? Superficially no. Look at concordances: World Cup finals. Solution: ‘commonest string’.

18 Next step: recall. 200 collocates per headword, selected from all the corpora we have and various parameter settings, plus just-in-time evaluation for 'new' collocates. Then, for a sample of headwords: these are the collocations we should get.
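
One way to read this plan is as a pooling approach: candidates from several corpora and settings are pooled and judged, the good ones form the reference set, and each individual run is then scored for recall against it. The sketch below assumes that reading; the function names, the run labels and the example data are hypothetical.

```python
def pooled_gold(runs, judge):
    """Build a reference set from the pooled output of several runs.

    runs: dict mapping a run label (corpus plus settings) to its collocates.
    judge: callable returning True if a collocate is judged good.
    """
    pool = set().union(*runs.values())
    return {c for c in pool if judge(c)}

def recall(system_output, gold):
    """Share of the reference collocations that one run actually found."""
    if not gold:
        raise ValueError("empty reference set")
    return len(set(system_output) & gold) / len(gold)

# Invented example: two runs for one headword.
runs = {
    "corpus_a": ["strong", "public", "the"],
    "corpus_b": ["expert", "public", "of"],
}
good = {"strong", "public", "expert"}
gold = pooled_gold(runs, good.__contains__)
r = recall(runs["corpus_a"], gold)
print(sorted(gold), round(r, 2))
```

A collocate that no run proposed never enters the pool, so recall measured this way is relative to the pool rather than to the language as a whole; the just-in-time evaluation of 'new' collocates on the slide addresses candidates that turn up later.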

19 From sketches to corpora: hold other inputs constant; just one varies; evaluate that one. Hold tools, stats and grammar constant; evaluate corpora.

20 Criteria: duplicates bad; noise bad; big good; diverse good (good coverage of varieties within research scope, not dominated by any one variety). We think so.

21 Over the next year: build test sets. Textbook cases: English: BNC vs UKWaC vs OEC vs Gigaword; Dutch: ANW corpus vs web corpus. Web crawling, deduplication: which parameters give best results?

22 Thank you. http://www.sketchengine.co.uk

