CS Word Sense Disambiguation
2 Overview
A problem for semantic attachment approaches: what happens when a given lexeme has multiple ‘meanings’?
–Flies [V] vs. Flies [N]
–He robbed the bank. He sat on the bank.
How do we determine the correct ‘sense’ of the word?
Machine Learning
–Supervised methods
»Evaluation
–Lightly supervised and Unsupervised
»Bootstrapping
»Dictionary-based techniques
»Selection restrictions
»Clustering
3 Supervised WSD
Approaches:
–Tag a corpus with correct senses of particular words (lexical sample) or all words (all-words task)
»E.g. SENSEVAL corpora
–Lexical sample:
»Extract features which might predict word sense: POS? Word identity? Punctuation after? Previous word? Its POS?
»Use a machine learning algorithm to produce a classifier which can predict the senses of one word or many
–All-words:
»Use a semantic concordance: each open-class word is labeled with a sense from a dictionary or thesaurus
4 –E.g. SemCor (Brown Corpus), tagged with WordNet senses
5 What Features Are Useful?
“Words are known by the company they keep”
–How much ‘company’ do we need to look at?
–What do we need to know about the ‘friends’?
»POS, lemmas/stems/syntactic categories, …
Collocations: words that frequently appear with the target, identified from large corpora
»federal government, honor code, baked potato
–Position is key
Bag-of-words: words that appear somewhere in a context window
»I want to play a musical instrument so I chose the bass.
–Ordering/proximity not critical
6 Punctuation, capitalization, formatting
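The two feature types above can be sketched in a few lines of Python. This is only an illustration: the function names, the window sizes, the `<pad>` marker and the small vocabulary are assumptions made for the example, not part of the slides.

```python
def collocational_features(tokens, i, window=2):
    """Position-specific features: the word at each fixed offset from the target."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the target word itself
        j = i + offset
        # "<pad>" (an assumed marker) stands in for positions outside the sentence
        feats[f"w{offset:+d}"] = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
    return feats


def bag_of_words_features(tokens, i, vocabulary, window=10):
    """Order-free features: does each vocabulary word occur anywhere near the target?"""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    context = {t.lower() for k, t in enumerate(tokens[lo:hi], start=lo) if k != i}
    return {word: int(word in context) for word in vocabulary}


sentence = "I want to play a musical instrument so I chose the bass".split()
target = sentence.index("bass")

colloc = collocational_features(sentence, target)
bow = bag_of_words_features(sentence, target, ["fishing", "musical", "play", "river"])
print(colloc)
print(bow)
```

The collocational features keep track of where each neighbor sits relative to the target, while the bag-of-words vector only records presence inside the window.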
7 Rule Induction Learners and WSD
Given: for each observation in the training set, a feature vector of values for the independent variables, together with the observed sense
Top-down greedy search driven by information gain: how much will the entropy of the (remaining) data be reduced if we split on this feature?
Produce a set of rules that perform best on the training data, e.g.
–bank 2 if w-1==‘river’ & pos==NP & src==‘Fishing News’…
–…
Easy-to-understand result, but many passes are needed to reach each decision, and the rules are susceptible to over-fitting
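The information-gain criterion that drives the greedy search can be sketched as follows. The toy training set, the feature names `w-1` and `pos`, and the sense labels `bank_1`/`bank_2` are made up for illustration; a real rule learner would add rule construction and pruning on top of this.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of sense labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(examples, feature):
    """Entropy reduction obtained by splitting the examples on one feature."""
    labels = [sense for _, sense in examples]
    partitions = {}
    for feats, sense in examples:
        partitions.setdefault(feats[feature], []).append(sense)
    # Expected entropy after the split, weighted by partition size
    remainder = sum(len(p) / len(examples) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical sense-tagged observations of "bank"
examples = [
    ({"w-1": "river", "pos": "NN"}, "bank_2"),
    ({"w-1": "river", "pos": "NN"}, "bank_2"),
    ({"w-1": "the", "pos": "NN"}, "bank_1"),
    ({"w-1": "the", "pos": "NN"}, "bank_1"),
]

gain_prev_word = information_gain(examples, "w-1")
gain_pos = information_gain(examples, "pos")
print(gain_prev_word, gain_pos)
```

In this toy data the previous word separates the senses perfectly while the POS tag is constant, so a greedy learner would split on `w-1` first.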
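8 Naïve Bayes
$$\hat{s} = \arg\max_{s \in S} p(s \mid V), \quad \text{or} \quad \hat{s} = \arg\max_{s \in S} \frac{p(V \mid s)\, p(s)}{p(V)}$$
where s is one of the senses S possible for a word w, and V is the input vector of feature values v_1, …, v_n for w
Assume the features are independent given s, so the probability of V given s is the product of the probabilities of each feature given s; p(V) is the same for every candidate sense, so it can be dropped
Then
$$\hat{s} = \arg\max_{s \in S} p(s) \prod_{j=1}^{n} p(v_j \mid s)$$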
9 How do we estimate p(s) and p(v_j|s)?
–p(s_i) is the maximum likelihood estimate from a sense-tagged corpus, count(s_i, w_j)/count(w_j): how likely is bank to mean ‘financial institution’ over all instances of bank?
–p(v_j|s) is the maximum likelihood estimate of each feature given a candidate sense, count(v_j, s)/count(s): how likely is the previous word to be ‘river’ when the sense of bank is ‘financial institution’?
Calculate the score for each possible sense and take the highest-scoring sense as the most likely choice
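A minimal sketch of this estimation and scoring procedure is given below. The class name `NaiveBayesWSD`, the toy training data and the sense labels are hypothetical. One deliberate departure from the slides: the counts are smoothed with add-alpha smoothing (an assumption added here), since the plain maximum likelihood estimate gives probability zero to any feature never seen with a sense. Scores are summed in log space to avoid underflow.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesWSD:
    def __init__(self, alpha=1.0):
        # alpha > 0 adds smoothing; this is an assumption, the slides use plain MLE
        self.alpha = alpha

    def fit(self, examples):
        """examples: list of (features, sense) pairs for one target word."""
        self.sense_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        for features, sense in examples:
            self.sense_counts[sense] += 1
            for v in set(features):
                self.feature_counts[sense][v] += 1  # count(v_j, s)
        return self

    def log_score(self, features, sense):
        """log p(s) + sum_j log p(v_j | s), with smoothed estimates."""
        total = sum(self.sense_counts.values())
        score = math.log(self.sense_counts[sense] / total)  # count(s, w) / count(w)
        denom = self.sense_counts[sense] + 2 * self.alpha   # smoothed count(s)
        for v in set(features):
            score += math.log((self.feature_counts[sense][v] + self.alpha) / denom)
        return score

    def predict(self, features):
        """Return the highest-scoring sense."""
        return max(self.sense_counts, key=lambda s: self.log_score(features, s))


# Hypothetical sense-tagged contexts of "bank"
train = [
    (["money", "loan"], "financial"),
    (["robbed", "money"], "financial"),
    (["deposit", "loan"], "financial"),
    (["river", "water"], "river"),
    (["river", "sat"], "river"),
]
model = NaiveBayesWSD().fit(train)
print(model.predict(["river", "fishing"]))
print(model.predict(["loan"]))
```

Even though ‘financial’ has the higher prior in this toy corpus, the feature ‘river’ is far more likely under the ‘river’ sense, which outweighs the prior in the first query.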
10 Decision List Classifiers
Transparent
Like case statements applying tests to input in turn
–fish within window --> bass 1
–striped bass --> bass 1
–guitar within window --> bass 2
–bass player --> bass 2
–Yarowsky ‘96’s approach orders tests by individual accuracy on entire training set based on log-likelihood ratio
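A decision list of this kind can be sketched as below for the two-sense case. The function names, the smoothing constant `alpha` and the toy contexts are assumptions for illustration; each test is scored by the absolute log-likelihood ratio of the two senses given the feature, and tests are tried in descending order of that score.

```python
import math
from collections import Counter, defaultdict


def train_decision_list(examples, senses, alpha=0.1):
    """Build (score, feature, sense) rules sorted by |log-likelihood ratio|."""
    s1, s2 = senses
    counts = defaultdict(Counter)
    for features, sense in examples:
        for f in set(features):
            counts[f][sense] += 1
    rules = []
    for f, c in counts.items():
        # alpha (an assumed smoothing constant) avoids division by zero
        llr = math.log((c[s1] + alpha) / (c[s2] + alpha))
        rules.append((abs(llr), f, s1 if llr > 0 else s2))
    rules.sort(key=lambda r: (-r[0], r[1]))  # strongest evidence first
    return rules


def classify(rules, features, default):
    """Apply the tests in order; the first one that matches decides the sense."""
    present = set(features)
    for _, f, sense in rules:
        if f in present:
            return sense
    return default


# Hypothetical sense-tagged contexts of "bass"
examples = [
    (["fish", "striped"], "bass_1"),
    (["fish", "river"], "bass_1"),
    (["fish", "sea"], "bass_1"),
    (["guitar", "player"], "bass_2"),
    (["guitar", "play"], "bass_2"),
]
rules = train_decision_list(examples, ("bass_1", "bass_2"))
print(classify(rules, ["guitar", "amp"], default="bass_1"))
print(classify(rules, ["fish", "guitar"], default="bass_1"))
```

Only the single strongest matching test is used, which is what makes the result easy to inspect: the rule that fired is the explanation.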
11 Bootstrapping to Get More Labeled Data
Bootstrapping I
–Start with a few labeled instances of target item as seeds to train initial classifier, C
–Use high-confidence classifications of C on unlabeled data as training data
–Iterate
Bootstrapping II
–Start with sentences containing words strongly associated with each sense (e.g. sea and music for bass), either intuitively or from corpus or from dictionary entries, and label those automatically
–One Sense per Discourse hypothesis: a word tends to keep the same sense throughout a discourse, so a confident label can be extended to its other occurrences there
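The loop below is a deliberately simplified sketch that combines the seed words of Bootstrapping II with the iterate-on-confident-labels idea of Bootstrapping I. The function name, the `min_count` threshold and the toy contexts are assumptions; "high confidence" is approximated here as a word that occurs at least `min_count` times with one sense and never with another. The One Sense per Discourse step is not modeled.

```python
from collections import Counter


def bootstrap(contexts, seeds, rounds=5, min_count=2):
    """contexts: list of token lists; seeds: {sense: set of seed words}."""
    indicators = {s: set(ws) for s, ws in seeds.items()}
    labels = {}
    for _ in range(rounds):
        # Label every context that matches indicator words of exactly one sense
        new_labels = {}
        for i, tokens in enumerate(contexts):
            matched = [s for s, ws in indicators.items() if ws & set(tokens)]
            if len(matched) == 1:
                new_labels[i] = matched[0]
        if new_labels == labels:
            break  # nothing changed, so stop iterating
        labels = new_labels
        # Promote words seen often with one sense and never with another
        counts = {s: Counter() for s in indicators}
        for i, s in labels.items():
            counts[s].update(set(contexts[i]))
        for s in indicators:
            for w, c in counts[s].items():
                if c >= min_count and all(counts[o][w] == 0 for o in indicators if o != s):
                    indicators[s].add(w)
    return labels, indicators


# Hypothetical unlabeled contexts of "bass"
contexts = [
    ["caught", "a", "bass", "in", "the", "sea"],
    ["the", "sea", "bass", "was", "caught", "fresh"],
    ["he", "plays", "bass", "in", "a", "music", "band"],
    ["the", "music", "for", "bass", "and", "band"],
    ["we", "caught", "a", "huge", "bass"],
    ["the", "band", "needs", "a", "bass"],
]
labels, indicators = bootstrap(contexts, {"fish": {"sea"}, "music": {"music"}})
print(labels)
print(indicators)
```

The last two contexts contain no seed word, yet they receive labels in the second round because ‘caught’ and ‘band’ were promoted to indicators after the first.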
12 Evaluating WSD
In vivo/end-to-end/task-based/extrinsic vs. in vitro/stand-alone/intrinsic: evaluation within some task (parsing? Q/A? IVR system?) vs. application-independent evaluation
–In vitro metrics: classification accuracy on a held-out test set, or precision/recall/F-measure if not all instances must be labeled
Baseline:
–Most frequent sense?
–Lesk algorithms
Ceiling: human annotator agreement
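The in vitro metrics and the most-frequent-sense baseline can be sketched as follows. The function names and the toy labels are illustrative assumptions; a prediction of `None` stands for an instance the system declined to label. When nothing is left unlabeled, precision and recall both equal plain accuracy.

```python
from collections import Counter


def most_frequent_sense(train_labels):
    """Baseline: always predict the sense seen most often in training."""
    return Counter(train_labels).most_common(1)[0][0]


def evaluate(gold, predicted):
    """Precision, recall and F-measure; None in predicted means 'not labeled'."""
    attempted = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(g == p for g, p in attempted)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical held-out test set for "bass"
gold = ["fish", "fish", "music", "fish"]
system = ["fish", None, "music", "music"]  # the system abstains on one instance

system_scores = evaluate(gold, system)
baseline_sense = most_frequent_sense(["fish", "fish", "music"])
baseline_scores = evaluate(gold, [baseline_sense] * len(gold))
print(system_scores)
print(baseline_scores)
```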
13 Dictionary Approaches
Problem of scale for all ML approaches
–Building a classifier for each word with multiple senses
Machine-readable dictionaries with senses identified and examples
–Simplified Lesk: Retrieve all content words occurring in context of target (e.g. Sailors love to fish for bass.)
–Compute overlap with sense definitions of target entry
»bass 1 : a musical instrument…
»bass 2 : a type of fish that lives in the sea…
14 –Choose sense with most content-word overlap
–Original Lesk: Compare dictionary entries of all content words in context with entries for each sense
Limits:
–Dictionary entries are short; performance is best with longer entries, so…
»Expand with entries of ‘related’ words that appear in the entry
»If a tagged corpus is available, collect all the words appearing in context of each sense of the target word (e.g. all words appearing in sentences with bass 1 ) to form a signature for bass 1
–Weight each word by its inverse frequency of occurrence across all ‘documents’ (e.g. all senses of bass) to capture how discriminating it is for the target word’s senses
–Corpus Lesk performs best of all Lesk approaches
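Simplified Lesk as described above can be sketched in a few lines. The stopword list, the tokenizer and the function names are assumptions made for the example, and the glosses are only the fragments quoted on the slide. There is no stemming, so only exact word matches count, and ties go to the sense listed first.

```python
# A tiny, assumed stopword list; a real system would use a fuller one
STOPWORDS = {"a", "an", "the", "of", "to", "for", "in", "that", "is", "and", "on"}


def content_words(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def simplified_lesk(context, target, sense_glosses):
    """Pick the sense whose gloss shares the most content words with the context."""
    ctx = content_words(context) - {target}
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(ctx & content_words(gloss))
        if overlap > best_overlap:  # ties keep the earlier-listed sense
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


glosses = {
    "bass_1": "a musical instrument",
    "bass_2": "a type of fish that lives in the sea",
}
result = simplified_lesk("Sailors love to fish for bass.", "bass", glosses)
print(result)
```

The only overlapping content word here is ‘fish’, which is enough to select the fish sense; with glosses this short a single shared word often decides the outcome, which is the limitation noted above.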
15 Summary
Many useful approaches developed to do WSD
–Supervised and unsupervised ML techniques
–Novel uses of existing resources (WN, dictionaries)
Future
–More tagged training corpora becoming available
–New learning techniques being tested, e.g. co-training
Next class:
–Ch 18:6-9