GDEX: Automatically finding good dictionary examples in a corpus
Auckland 2012. Kilgarriff: GDEX
Users appreciate examples
  Paper: space constraints
  Electronic: no space constraints
    Give lots of examples
  Constraint: cost of selection, editing
Project
  Macmillan English Dictionary
    Already had 1000 collocation boxes
    Average 8 per box
  New electronic version
    All 8000 collocations need examples
    Authentic; from corpus
Old method
  Lexicographer
    Gets concordance for collocation
    Reads through until they find a good example
    Cut, paste, edit
New method
  Lexicographer
    Gets sorted concordance
    20 best examples in spreadsheet
    Less reading through
    Tick the first good one, edit
What makes a good example?
  Readable
    EFL users
  Informative
    Typical, for the collocation
    Gives context which helps user understand the target word/phrase
Readability
  70 years research
  Not just (or mainly) EFL
    Educational theory
    Teaching children to read
    Instruction manuals
    Early work: US military
  Publishing
    People like newspapers and magazines that they find easy to read
Readability tests
  Flesch Reading Ease test, 1948
    Average sentence length, average word length
    In some word processing software
  Many similar measures
  Recent work
    Training data for different reading levels
    Language modelling
  Target levels
    US grades
    Now, increasingly: Common European Framework
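As an illustration of the kind of measure the slide refers to, here is a minimal Python sketch of the Flesch Reading Ease formula, which combines average sentence length (words per sentence) with average word length (syllables per word). The constants are those of the published formula. The function names and the vowel-group syllable counter are assumptions made for this sketch; real implementations usually count syllables with a pronunciation dictionary.

```python
import re


def count_syllables(word: str) -> int:
    """Crude estimate (assumption): one syllable per run of vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    # Naive sentence and word splitting; good enough for a sketch.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = "The cat sat on the mat."
hard = "Institutional considerations necessitate comprehensive organisational restructuring."
easy_score = flesch_reading_ease(easy)
hard_score = flesch_reading_ease(hard)
```

Short sentences made of short words, like `easy`, score high; long words pull the score down sharply, as in `hard`.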
GDEX
  Get concordance for collocation
  For each sentence
    Score it
  Sort
  Show best ones to lexicographer
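The score, sort and show steps can be sketched in a few lines of Python. This is only an outline of the control flow, not the GDEX implementation: `rank_concordance` and `toy_score` are hypothetical names, and the toy scorer looks at sentence length alone (the 10-26 word range is from the heuristics slide).

```python
from typing import Callable, List, Tuple


def rank_concordance(
    sentences: List[str],
    score: Callable[[str], float],
    top_n: int = 20,
) -> List[Tuple[float, str]]:
    """Score every concordance line, sort best first, keep the top_n."""
    scored = [(score(s), s) for s in sentences]
    # Python's sort is stable, so tied lines keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def toy_score(sentence: str) -> float:
    """Placeholder scorer (assumption): 1.0 if the line has 10-26 words, else 0.0."""
    n_words = len(sentence.split())
    return 1.0 if 10 <= n_words <= 26 else 0.0


concordance = [
    "Too short.",
    "This sentence has exactly ten words in it, you see.",
    "Also short!",
]
best = rank_concordance(concordance, toy_score, top_n=2)
```

The lexicographer then sees `best` rather than the raw concordance, so the first acceptable line should appear near the top.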
GDEX heuristics
  Sentence length (10-26 words)
  Mostly common words is good
    Rare words are bad
  Sentences
    Start with capital, end with one of . ! ?
    No [, ], http, \
    Not much other punctuation, numbers
    Not too many capitals
  Typicality: third collocate is a plus
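Several of these heuristics are simple enough to sketch as Python checks. The sketch below is illustrative only: the 10-26 word range and the banned strings come from the slide, while the function name, the tiny `COMMON_WORDS` set (a stand-in for a real frequency list) and the cut-offs for capitals and digits are assumptions. The typicality heuristic is left out because it needs collocation data from the corpus.

```python
import re

# Hypothetical stand-in for a list of common words taken from a frequency list.
COMMON_WORDS = {
    "a", "the", "she", "he", "made", "for", "on", "old", "open", "every",
    "case", "strong", "keeping", "library", "evening", "weekday", "see",
}
BANNED = ("[", "]", "http", "\\")  # strings named on the slide


def heuristic_scores(sentence: str, common_words=COMMON_WORDS) -> dict:
    """Return one score per heuristic, each in the range 0.0 to 1.0."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    n = len(tokens)
    common = sum(t.lower() in common_words for t in tokens)
    capitals = sum(ch.isupper() for ch in sentence)
    digits = sum(ch.isdigit() for ch in sentence)
    stripped = sentence.strip()
    return {
        "length_ok": 1.0 if 10 <= n <= 26 else 0.0,
        "common_share": common / n if n else 0.0,
        "well_formed": 1.0 if stripped[:1].isupper()
        and stripped.endswith((".", "!", "?")) else 0.0,
        "no_banned": 0.0 if any(b in sentence for b in BANNED) else 1.0,
        "few_capitals": 1.0 if capitals <= 2 else 0.0,  # cut-off is an assumption
        "no_digits": 1.0 if digits == 0 else 0.0,  # cut-off is an assumption
    }


good = heuristic_scores(
    "She made a strong case for keeping the old library open on every weekday evening."
)
junk = heuristic_scores("see http://example.com [1]")
```

Keeping each heuristic as a separate named score makes it possible to weight them independently afterwards.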
Weighting
  For each sentence
    Score on each heuristic
    Weight scores
    Add together: weighted score
  How to set weights?
    Two students: manually judged 1000 “good examples”
    Weights set so system makes same choices as students
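The weighted score is a plain weighted sum over the heuristic scores. The slide does not say how the weights were fitted to the students' judgements, so the grid search below is purely an assumed illustration of one way to do it: it picks the weights under which judged-good sentences most often outscore judged-bad ones. All function names and the toy data are hypothetical.

```python
from itertools import product


def weighted_score(scores: dict, weights: dict) -> float:
    """Sum of heuristic scores, each multiplied by its weight."""
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())


def pairwise_agreement(judged, weights) -> float:
    """Share of (good, bad) pairs in which the good sentence scores higher."""
    good = [weighted_score(s, weights) for s, is_good in judged if is_good]
    bad = [weighted_score(s, weights) for s, is_good in judged if not is_good]
    pairs = [(g, b) for g in good for b in bad]
    if not pairs:
        return 0.0
    return sum(g > b for g, b in pairs) / len(pairs)


def grid_search(judged, names, candidates=(0.0, 0.5, 1.0)):
    """Try every combination of candidate weights; keep the first best one."""
    best_weights, best_agreement = None, -1.0
    for combo in product(candidates, repeat=len(names)):
        weights = dict(zip(names, combo))
        agreement = pairwise_agreement(judged, weights)
        if agreement > best_agreement:
            best_weights, best_agreement = weights, agreement
    return best_weights, best_agreement


# Toy judgements (invented for illustration): heuristic scores plus a human verdict.
judged = [
    ({"length_ok": 1.0, "common_share": 0.9}, True),
    ({"length_ok": 1.0, "common_share": 0.3}, False),
    ({"length_ok": 0.0, "common_share": 0.8}, False),
]
weights, agreement = grid_search(judged, ["length_ok", "common_share"])
```

A grid search only scales to a handful of heuristics and coarse weight steps; with more heuristics a standard learning-to-rank or regression method would be the more usual choice.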
Was it successful?
  Did it save lexicographer time?
    Definitely (says project manager)
  Rough guess
    Average number of corpus lines to read until you find a good one:
      Unsorted: 20
      Sorted: 5
Corpus choice
  Started with BNC, but
    Too old
    Not enough examples
    If no good examples in corpus, GDEX can’t help
  Changed to UKWaC
    20 times bigger; from web; contemporary
    Better
    Most web junk filtered out
    Usually a good example in top twenty
GDEX and TALC
  TALC (Teaching and Language Corpora)
    Goal: bring corpora into language teaching
  Usual problem
    Concordances are tough for learners to read
  Way forward
    GDEX examples
    Half way between dictionary and corpus
GDEX: Models for use
  More examples for dictionaries
    Speed up, as with MED
    or
    Fully automatic “more examples”
  Corpus query tool
    Sort concordances, best first
    Now an option in the Sketch Engine
  Automatic collocations dictionary
Recent developments
  Configurable GDEX
    For other languages
    Interface to help set up
  Commonest string
    Between ‘bare collocate’ and example