WSD using Optimized Combination of Knowledge Sources. Authors: Yorick Wilks and Mark Stevenson. Presenter: Marian Olteanu.
Introduction. Regular approaches: all words; sample (small trial section). Problems: ambiguity, especially at fine granularity; new senses in text that are not in the dictionary.
Approach. Integrates partial sources of information: part-of-speech, dictionary definitions, pragmatic codes, selectional restrictions. Integration: filters; partial selectors (taggers).
Dictionary for senses: Longman Dictionary of Contemporary English (LDOCE). Two levels: homograph and sense.
Methodology. Preprocessing: part-of-speech tagger (Brill). Part-of-speech filter: eliminate all incompatible homographs; if no sense remains, keep all senses.
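The filter-with-fallback rule on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Sense` record, its field names, and the one-letter part-of-speech labels are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    # Hypothetical record for one LDOCE sense; field names are illustrative.
    homograph: int  # homograph number within the dictionary entry
    sense: int      # sense number within the homograph
    pos: str        # part of speech of the homograph, e.g. "n" or "v"


def pos_filter(senses, tagged_pos):
    """Drop senses whose homograph disagrees with the tagger's part of speech.

    If the filter would remove every sense, all senses are kept instead.
    """
    kept = [s for s in senses if s.pos == tagged_pos]
    return kept if kept else list(senses)


bank = [Sense(1, 1, "n"), Sense(1, 2, "n"), Sense(2, 1, "v")]
verb_only = pos_filter(bank, "v")       # only the verb homograph survives
fallback = pos_filter(bank, "adj")      # nothing matches, so all are kept
```

The fallback matters because a tagging error would otherwise leave the later stages with no candidate senses at all.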
Methodology (cont.). Dictionary definitions, partial tagger: count the number of words that appear both in the definition and in the context; normalize by the length of the definition; return a list of candidate senses.
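A minimal sketch of the overlap score described on this slide, assuming plain lowercase tokenization with no stop-word removal or stemming (the slide does not say how words are matched). The function names and the two toy definitions are made up for the example and are not LDOCE text.

```python
import re


def tokenize(text):
    # Assumed tokenization: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def overlap_score(definition, context):
    """Share of definition tokens that also occur in the context."""
    def_words = tokenize(definition)
    if not def_words:
        return 0.0
    context_words = set(tokenize(context))
    shared = sum(1 for w in def_words if w in context_words)
    return shared / len(def_words)  # normalize by definition length


def rank_senses(definitions, context):
    """Return (sense, score) pairs, best score first."""
    scored = [(sense, overlap_score(text, context))
              for sense, text in definitions.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


definitions = {
    "bank_1": "land along the side of a river",
    "bank_2": "a place where money is kept",
}
ranking = rank_senses(definitions, "he sat on the bank of the river")
```

Dividing by the definition length keeps long definitions from winning simply because they contain more words.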
Methodology (cont.). Pragmatic codes, partial tagger: uses the hierarchy of LDOCE pragmatic codes (subject area). Modified simulated annealing: optimize the number of pragmatic codes of the same type in the sentence. Whole paragraph; only for nouns?
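The optimization step can be illustrated with plain simulated annealing. This sketch is not the modified variant the slide refers to: the objective (number of word pairs sharing a code), the cooling schedule, and the two-letter codes are all assumptions chosen for the example, and the code hierarchy is ignored.

```python
import math
import random
from collections import Counter


def agreement(codes):
    """Number of word pairs that were assigned the same subject code."""
    return sum(c * (c - 1) // 2 for c in Counter(codes).values())


def anneal(candidates, steps=2000, start_temp=1.0, cooling=0.995, seed=0):
    """Pick one code per word so that as many words as possible agree.

    candidates: one list of possible codes per noun in the text.
    """
    rng = random.Random(seed)
    current = [options[0] for options in candidates]
    best = list(current)
    temp = start_temp
    for _ in range(steps):
        i = rng.randrange(len(candidates))
        proposal = list(current)
        proposal[i] = rng.choice(candidates[i])
        delta = agreement(proposal) - agreement(current)
        # Always accept improvements; accept worse states with a
        # probability that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = proposal
            if agreement(current) > agreement(best):
                best = list(current)
        temp = max(temp * cooling, 1e-6)
    return best


# Illustrative codes only, not taken from LDOCE.
candidates = [["GO", "EC"], ["PS", "EC"], ["MS", "EC"]]
chosen = anneal(candidates)
```

On this toy input the only way for the three nouns to agree is the shared code, so the search is expected to settle there; on realistic paragraphs the result depends on the schedule and the seed.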
Methodology (cont.). Selectional restrictions, filter: LDOCE senses are marked with 35 semantic classes (H = human, M = human male, P = plant, etc.). Nouns: their own type; adjectives: the type of the object they modify; adverbs: the type of their modifier; verbs: the types of subject (S), direct object (DO), and indirect object (IO).
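A rough sketch of how such a filter could prune verb senses. Only the class letters H, M and P come from the slide; the one-link hierarchy (M under H), the verb senses, and their restrictions are invented for illustration, and indirect objects are left out to keep the example short.

```python
# Assumed fragment of the class hierarchy: a human male is a human.
PARENT = {"M": "H"}


def is_a(cls, required):
    """True if cls equals required or lies below it in the hierarchy."""
    while cls is not None:
        if cls == required:
            return True
        cls = PARENT.get(cls)
    return False


def filter_verb_senses(verb_senses, subject_classes, object_classes):
    """Keep verb senses whose subject and object restrictions can be met.

    verb_senses: dict mapping sense id to (subject class, object class).
    subject_classes / object_classes: classes of the candidate noun senses
    filling those positions.
    """
    kept = {}
    for sense, (subj_req, obj_req) in verb_senses.items():
        subj_ok = any(is_a(c, subj_req) for c in subject_classes)
        obj_ok = any(is_a(c, obj_req) for c in object_classes)
        if subj_ok and obj_ok:
            kept[sense] = (subj_req, obj_req)
    return kept


# Hypothetical senses: one takes a plant as object, one takes a human.
verb_senses = {"sense_1": ("H", "P"), "sense_2": ("H", "H")}
kept = filter_verb_senses(verb_senses, subject_classes={"M"}, object_classes={"P"})
```

Here a male subject satisfies the human restriction through the hierarchy, while the plant object rules out the second sense.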
Methodology (cont.). Combine knowledge sources: decision lists. Can assign a sense to unknown words, if there is a definition in LDOCE.
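For readers unfamiliar with decision lists, a minimal sketch: rules are ordered by strength and the first rule whose test matches decides the sense. The feature names, weights, and sense labels below are hypothetical, and the slides do not say how the rules are learned or weighted.

```python
def build_decision_list(weighted_rules):
    """Order rules from strongest to weakest.

    weighted_rules: iterable of (weight, feature, value, sense).
    """
    return sorted(weighted_rules, key=lambda rule: rule[0], reverse=True)


def apply_decision_list(rules, features, default=None):
    """Return the sense of the first rule whose feature test matches."""
    for _weight, feature, value, sense in rules:
        if features.get(feature) == value:
            return sense
    return default


# Hypothetical rules: each tests the output of one knowledge source.
rules = build_decision_list([
    (0.5, "pos", "n", "bank_1"),
    (2.0, "pragmatic_choice", "bank_2", "bank_2"),
    (1.2, "overlap_choice", "bank_1", "bank_1"),
])

both = {"pos": "n", "pragmatic_choice": "bank_2", "overlap_choice": "bank_1"}
overlap_only = {"pos": "n", "overlap_choice": "bank_1"}

decided_both = apply_decision_list(rules, both)
decided_overlap = apply_decision_list(rules, overlap_only)
```

Because only the strongest matching rule fires, conflicting sources are resolved by rule order rather than by averaging their votes.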
Evaluation. Create a corpus based on SemCor (200,000 words, tagged with WordNet senses). SENSUS: a merging of LDOCE and WordNet (for machine translation). Still ambiguous: 36,869 out of 85,747 words (personal opinion: strongly biased).
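The corpus counts above show how much an all-words figure depends on the unambiguous tokens. The sketch below assumes that every unambiguous word counts as correct and that the 83.4% quoted on the Results slide refers to ambiguous words only; both are assumptions, not statements from the slides.

```python
def overall_accuracy(total_words, ambiguous_words, accuracy_on_ambiguous):
    """All-words accuracy if every unambiguous word counts as correct."""
    unambiguous = total_words - ambiguous_words
    correct = unambiguous + accuracy_on_ambiguous * ambiguous_words
    return correct / total_words


TOTAL = 85_747      # words in the evaluation corpus
AMBIGUOUS = 36_869  # words still ambiguous after the mapping

ambiguous_share = AMBIGUOUS / TOTAL
all_words = overall_accuracy(TOTAL, AMBIGUOUS, 0.834)
print(f"ambiguous share: {ambiguous_share:.1%}")
print(f"all-words accuracy: {all_words:.1%}")
```

Under these assumptions about 43% of the words are ambiguous and the all-words figure comes out near 92.9%, in line with the 92.8% on the Results slide.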
Results. Baseline: 49.8%. 70% of the 1st sense – correctly tagged. 83.4% accuracy = 92.8% accuracy on all words (!!!). Test by voting: