GDEX: Automatically finding good dictionary examples in a corpus
  Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý
  Lexical Computing Ltd, UK
  Masaryk University, Czech Republic
  A&C Black Publishers Ltd., UK
  Macmillan Education, UK
  Lexicography MasterClass Ltd., UK
Users appreciate examples
  Paper: space constraints
  Electronic: no space constraints
    Give lots of examples
  Constraint: cost of selection, editing
Project
  Macmillan English Dictionary
  Licensing arrangement with A&C Black
  Already had 1000 collocation boxes
    See collocationality paper, ELX 2006
    Average 8 per box
  New electronic version
    All 8000 collocations need examples
    Authentic; from corpus
Old method
  Lexicographer
    Gets concordance for collocation
    Reads through until they find a good example
    Cut, paste, edit
New method
  Lexicographer
    Gets sorted concordance
    20 best examples in spreadsheet
    Less reading through
    Tick the first good one, edit
What makes a good example?
  Readable
    EFL users
  Informative
    Typical, for the collocation
    Gives context which helps user understand the target word/phrase
Readability
  70 years of research
  Not just (or mainly) EFL
    Educational theory
    Teaching children to read
    Instruction manuals
    Publishing
Readability tests
  Flesch Reading Ease test (1948)
    Average sentence length, average word length
    In some word processing software
  Many similar measures
  Recent work
    Language modelling from training data
  Target levels
    US grades
    Common European Framework
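As an illustration of the kind of test listed above, here is a minimal Python sketch of the Flesch Reading Ease formula, 206.835 - 1.015 x (words per sentence) - 84.6 x (syllables per word). The syllable counter is a crude assumption (it counts runs of vowels), and the function names are hypothetical, so scores will differ slightly from those of published tools.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of vowel runs (an approximation)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    # Naive sentence and word splitting; an assumption, not a real tokeniser.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    avg_sentence_length = len(words) / len(sentences)
    avg_syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word
```

Short sentences made of short words score high; long sentences with polysyllabic words score low.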
GDEX
  Get concordance for collocation
  For each sentence
    Score it
  Sort
  Show best ones
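The steps on this slide can be sketched in a few lines of Python. This is only an outline of the score, sort, and show loop: `best_examples` and `toy_score` are hypothetical names, the toy scoring function is a stand-in rather than the GDEX scorer, and the default of 20 results follows the "20 best examples" mentioned earlier.

```python
from typing import Callable, List, Tuple


def best_examples(
    concordance: List[str],
    score: Callable[[str], float],
    n: int = 20,
) -> List[Tuple[float, str]]:
    """Score every concordance line, sort best first, keep the top n."""
    scored = [(score(sentence), sentence) for sentence in concordance]
    # Sort on the score only, so ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in scorer: prefers sentences close to five words."""
    return -abs(len(sentence.split()) - 5)
```

Any function that maps a sentence to a number can be passed in as `score`, so the scoring heuristics can change without touching the sorting step.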
GDEX heuristics
  Sentence length (10-26 words)
  Mostly common words: good
  Rare words: bad
  Sentences
    Start with capital, end with one of . ! ?
    No [, ], http, \
  Penalise
    Other punctuation, numbers
    More than 2 or 3 capitals
  Typicality: third collocate is a plus
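A rough Python sketch of how some of these heuristics might be turned into per-sentence scores follows. Everything here is an assumption made for illustration: the feature names, the 0-to-1 scale, the tiny `COMMON_WORDS` set (a real system would use a corpus frequency list), and the cut-off of three capitals. The "other punctuation" and typicality heuristics are left out.

```python
import re

# Hypothetical stand-in for a frequency list of common words.
COMMON_WORDS = {
    "a", "the", "of", "for", "she", "made", "strong", "case",
    "changing", "rules", "game", "today",
}


def heuristic_scores(sentence: str, common_words=COMMON_WORDS) -> dict:
    """Return one score between 0 and 1 per heuristic for a sentence."""
    text = sentence.strip()
    tokens = re.findall(r"[A-Za-z']+", text)
    n = len(tokens)
    common = sum(t.lower() in common_words for t in tokens)
    return {
        # Sentence length between 10 and 26 words.
        "length_ok": 1.0 if 10 <= n <= 26 else 0.0,
        # Share of tokens found in the common-word list.
        "common_share": common / n if n else 0.0,
        # Starts with a capital and ends with . ! or ?
        "well_formed": 1.0 if re.fullmatch(r"[A-Z].*[.!?]", text, re.DOTALL) else 0.0,
        # No square brackets, backslashes, or http.
        "no_junk": 0.0 if re.search(r"[\[\]\\]|http", text) else 1.0,
        # At most three capital letters (assumed reading of "2 or 3").
        "few_capitals": 1.0 if sum(c.isupper() for c in text) <= 3 else 0.0,
        # No digits.
        "no_digits": 0.0 if re.search(r"\d", text) else 1.0,
    }
```

Returning a dictionary keeps each heuristic separate, which makes it straightforward to weight them individually later.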
Weighting
  For each sentence
    Score on each heuristic
    Weight scores
    Add together weighted scores
  How to set weights?
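The weighting step amounts to a weighted sum, sketched below in Python. The function name, the heuristic names, and the example weights are all hypothetical; the slides do not give the actual values.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Multiply each heuristic score by its weight and add the results."""
    # Heuristics without a weight contribute nothing to the total.
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())


# Hypothetical weights, for illustration only.
EXAMPLE_WEIGHTS = {"length_ok": 2.0, "common_share": 4.0}
```

For a sentence scoring 1.0 on `length_ok` and 0.5 on `common_share`, these example weights give 2.0 x 1.0 + 4.0 x 0.5 = 4.0.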
Machine learning
  Two students
    Manually judged 1000 “good examples”
  Weights set to mimic students' choices
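The slides do not say which learning method was used. Purely as an illustration of setting weights from human judgements, the Python sketch below swaps in a simple perceptron: each sentence is a tuple of heuristic scores, each label is 1 (judged good) or 0 (not), and the weights are nudged whenever the current prediction disagrees with the label. The function names, learning rate, and number of passes are assumptions.

```python
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> int:
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0


def fit_weights(
    examples: List[Sequence[float]],
    labels: List[int],
    epochs: int = 20,
    learning_rate: float = 0.1,
) -> Tuple[List[float], float]:
    """Perceptron training: adjust weights whenever a prediction is wrong."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(examples, labels):
            error = label - predict(weights, bias, features)
            if error:
                # Move each weight towards the human judgement.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias
```

A perceptron only finds a perfect fit when the good and bad examples can be separated by a weighted sum; with noisy human judgements a different method may suit better.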
Was it successful?
  Did it save lexicographer time?
    Definitely (says project manager)
  Corpus choice
    Started with BNC but
      Too old
      Not enough examples
      If no good examples in corpus, GDEX can’t help
    Changed to UKWaC
      20 times bigger; from web; contemporary
      Better
      Most web junk filtered out
  Usually a good example in top twenty
GDEX and TALC
  TALC
    Teaching and Language Corpora
    Goal: bring corpora into language teaching
  Usual problem
    Concordances are tough for learners to read
  Way forward
    GDEX examples
    Half way between dictionary and corpus
GDEX: Models for use
  More examples for dictionaries
    Speed up, as with MED
    or
    Fully automatic “more examples”
  Corpus query tool
    Sort concordances, best first
    Now an option in the Sketch Engine
  Automatic collocations dictionary