Evaluating the Waspbench
  A Lexicography Tool Incorporating Word Sense Disambiguation
  Rob Koeling, Adam Kilgarriff, David Tugwell, Roger Evans
  ITRI, University of Brighton
  Credits: UK EPSRC grant WASPS, M34971
Lexicographers need NLP
NLP needs lexicography
Word senses: nowhere truer
  Lexicography – the second hardest part
  NLP
    –Word sense disambiguation (WSD)
        SENSEVAL-1 (1998): 77% Hector
        SENSEVAL-2 (2001): 64% WordNet
    –Machine Translation
        Main cost is lexicography
Synergy
  The WASPBENCH
Inputs and outputs
  Inputs
    –Corpus (processed)
    –Lexicographic expertise
Inputs and outputs
  Outputs
    –Analysis of meaning/translation repertoire
    –Implemented: Word expert
        Can disambiguate
        A “disambiguating dictionary”
Inputs and outputs
  MT needs rules of the form: in context C, S => T
    –Major determinant of MT quality
    –Manual production: expensive
    –Eng oil => Fr huile or pétrole?
        SYSTRAN: 400 rules
  Waspbench output: thousands of rules
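The rule form "in context C, S => T" can be illustrated with a minimal sketch. This is not the Waspbench or SYSTRAN rule format: the `Rule` class, the `translate` function and the context cues (olive, cooking, crude, barrel) are hypothetical names invented here, and a context is reduced to the set of words in the sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One rule of the form: in context C, source word S => target word T."""
    context_cue: str  # C: a word that must occur in the sentence (assumption)
    source: str       # S: the source-language word
    target: str       # T: the target-language translation


# Hypothetical cues for the Eng "oil" => Fr "huile" / "pétrole" example.
RULES = [
    Rule("olive", "oil", "huile"),
    Rule("cooking", "oil", "huile"),
    Rule("crude", "oil", "pétrole"),
    Rule("barrel", "oil", "pétrole"),
]


def translate(source, sentence, rules, default=None):
    """Return the target of the first rule whose cue occurs in the sentence."""
    tokens = set(sentence.lower().split())
    for rule in rules:
        if rule.source == source and rule.context_cue in tokens:
            return rule.target
    return default  # no rule fired


print(translate("oil", "a barrel of crude oil", RULES))
```

With only four rules, most sentences fall through to the default, which hints at why rule sets need to be large.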
Evaluation hard
  Three communities
  No precedents
  The art and craft of lexicography
  MT personpower budgets
Five threads
  as WSD: SENSEVAL
  for lexicography: MED
  expert reports
  Quantitative experiments with human subjects
    –India
        Within-group consistency
    –Leeds
        Comparison with commercial MT
Method
  Human1 creates word experts
  Computer uses word experts to disambiguate test instances
  MT system translates same test instances
  Human2
    –evaluates computer and MT performance on each instance:
    –good / bad / unsure / preferred / alternative
Words
  mid-frequency
    –1,500-20,000 instances in BNC
  At least two clearly distinct meanings
    –Checked with ref to translations into Fr/Ger/Dutch
  33 words
    –16 nouns, 10 verbs, 7 adjs
  around 40 test instances per word
Words
  Nouns: bank, chest, coat, fit, line, lot, mass, party, policy, record, seal, step, term, volume
  Verbs: charge, float, move, observe, offend, post, pray, toast, undermine
  Adjectives: bright, free, funny, hot, moody, strong
Human subjects
  Translation studies students, Univ Leeds
    –Thanks: Tony Hartley
  Native/near-native in English and their other language
  twelve people, working with:
    –Chinese (4), French (3), German (2), Italian (1), Japanese (2)
     (no MT system for Japanese)
  circa four days’ work:
    –introduction/training
    –two days to create word experts
    –two days to evaluate output
Method
  Human1 creates word experts, average 30 mins/word
  Computer uses word experts to disambiguate test instances
  MT system: Babelfish via Altavista translates same test instances
  Human2
    –evaluates computer and MT performance on each instance:
    –good / bad / unsure / preferred / alternative
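One way the per-instance judgments could be turned into percentage columns (Wasps / MT / both / neither / unsure) is sketched below. The mapping from Human2's labels to those columns is an assumption made here, not the procedure stated on the slides: only good / bad / unsure are used, "preferred" and "alternative" are ignored, and any "unsure" on either system puts the instance in the unsure column. The input pairs are toy values, not results from the study.

```python
from collections import Counter

CATEGORIES = ["Wasps", "MT", "both", "neither", "unsure"]


def categorise(wasps, mt):
    """Map one pair of judgments to a results column (assumed mapping)."""
    if "unsure" in (wasps, mt):
        return "unsure"
    if wasps == "good" and mt == "good":
        return "both"
    if wasps == "good":
        return "Wasps"    # only the word-expert output was judged good
    if mt == "good":
        return "MT"       # only the MT output was judged good
    return "neither"


def percentages(judgments):
    """Percentage of test instances falling in each column."""
    counts = Counter(categorise(w, m) for w, m in judgments)
    total = len(judgments)
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    return {c: round(100 * counts[c] / total, 1) for c in CATEGORIES}


# Toy judgments as (Wasps, MT) pairs, invented purely for illustration.
toy = [("good", "bad"), ("good", "good"), ("bad", "bad"),
       ("bad", "good"), ("unsure", "good")]
print(percentages(toy))
```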
Results (%)
  Lang   Wasps   MT   both   neither   unsure
  Ger
  Fr
  Ch
  It
  All
Results by POS (%)
          Wasps   MT   both   neither
  Nouns
  Verbs
  Adjs
Observations
  Grad student users, 4-hour training
  30 mins per (not-too-complex) word
  ‘fuzzy’ words intrinsically harder
  No great inter-subject disparities
    –(it’s the words that vary, not the people)
Conclusion
  WSD can improve MT (using a tool like WASPS)
Future work
  multiwords n>2
  thesaurus
  other source languages
  new corpora, bigger corpora
    –the web