Word Sense Disambiguation September 27, 2006 11/11/2018
Word-Sense Disambiguation
Word sense disambiguation refers to the process of selecting the right sense for a word from among the senses that the word is known to have.
Semantic selection restrictions can be used to disambiguate:
- Ambiguous arguments to unambiguous predicates
- Ambiguous predicates with unambiguous arguments
- Ambiguity all around
Word-Sense Disambiguation
We can use selectional restrictions for disambiguation:
- He cooked simple dishes.
- He broke the dishes.
But sometimes selectional restrictions are not enough to disambiguate:
- What kind of dishes do you recommend? (we cannot know which sense is used)
There can be two or more lexemes with multiple senses:
- They serve vegetarian dishes.
Selectional restrictions may also block every reading:
- If you want to kill Turkey, eat its banks.
- Kafayı yedim. (Turkish idiom, literally "I ate the head", meaning "I lost my mind")
These situations leave the system with no possible meanings, and they can indicate a metaphor.
WSD Approaches
- Disambiguation based on manually created rules
- Disambiguation using machine readable dictionaries
- Disambiguation using thesauri
- Disambiguation based on unsupervised machine learning with corpora
Disambiguation based on manually created rules
Weiss’ approach [Weiss 73]:
- Set of rules to disambiguate five words
- Context rule: within 5 words
- Template rule: specific location
- Accuracy: 90%
- IR improvement: 1%
Small & Rieger’s approach [Small 1982]:
- Expert system
WSD and Selection Restrictions
- Ambiguous arguments: Prepare a dish; Wash a dish
- Ambiguous predicates: Serve Denver; Serve breakfast
- Both: Serves vegetarian dishes
WSD and Selection Restrictions
This approach is complementary to the compositional analysis approach.
You need a parse tree and some form of predicate-argument analysis derived from:
- The tree and its attachments
- All the word senses coming up from the lexemes at the leaves of the tree
Ill-formed analyses are eliminated by noting any selection restriction violations.
Problems
- As we saw last time, selection restrictions are violated all the time.
- This doesn’t mean that the sentences are ill-formed or preferred less than others.
- This approach needs some way of categorizing and dealing with the various ways that restrictions can be violated.
Can we take a more statistical approach?
- How likely is dish/crockery to be the object of serve? dish/food?
- A simple approach (baseline): predict the most likely sense
  - Why might this work? When will it fail?
- A better approach: learn from a tagged corpus
  - What needs to be tagged?
- An even better approach: Resnik’s selectional association (1997, 1998)
  - Estimate conditional probabilities of word senses from a corpus tagged only with verbs and their arguments (e.g. ragout is an object of served: Jane served/V ragout/Obj)
How do we get the word sense probabilities?
- For each verb object (e.g. ragout):
  - Look up hypernym classes in WordNet
  - Distribute “credit” for this object sense occurring with this verb among all the classes to which the object belongs
- Examples: Brian served/V the dish/Obj; Jane served/V food/Obj
- If ragout has N hypernym classes in WordNet, add 1/N to each class count (including food) as object of serve
- If tureen has M hypernym classes in WordNet, add 1/M to each class count (including dish) as object of serve
- Pr(c|v) = count(c,v) / count(v), where c is a class and v is the verb
How can this work?
- Ambiguous words have many superordinate classes: John served food/the dish/tuna/curry
- There is a common sense among these which gets “credit” in each instance, eventually dominating the likelihood score
To determine the most likely sense of ‘bass’ in Bill served bass
- Credit has previously been assigned, for every observed object of serve (things like fish and things like musical instruments), to all of that object’s hypernym classes (e.g. ‘fish’ and ‘musical instruments’)
- Find the hypernym classes of bass (including fish and musical instruments)
- Choose the class C with the highest probability, given that the verb is serve
Results:
- Baselines: random choice of word sense is 26.8%; choosing the most frequent sense (NB: requires a sense-labeled training corpus) is 58.2%
- Resnik’s: 44% correct with only pred/arg relations labeled
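The credit-sharing scheme described on the last two slides can be sketched as follows. This is a minimal illustration of the simplified count-based description given here, not Resnik's full selectional association measure. The `HYPERNYMS` table is a hypothetical stand-in for real WordNet lookups, and all class names in it are invented for the example.

```python
from collections import defaultdict

# Hypothetical toy hypernym table standing in for real WordNet lookups.
HYPERNYMS = {
    "ragout": ["food", "dish_as_food"],
    "curry": ["food", "dish_as_food"],
    "tuna": ["food", "fish"],
    "tureen": ["dish_as_crockery", "container"],
    "bass": ["fish", "musical_instrument"],
}


def train(pairs):
    """Give each of an object's N classes 1/N credit per (verb, object) pair."""
    class_count = defaultdict(float)  # (class, verb) -> accumulated credit
    verb_count = defaultdict(float)   # verb -> number of counted objects
    for verb, obj in pairs:
        classes = HYPERNYMS.get(obj, [])
        if not classes:
            continue  # object unknown to the toy table
        verb_count[verb] += 1.0
        for c in classes:
            class_count[(c, verb)] += 1.0 / len(classes)
    return class_count, verb_count


def prob(c, verb, class_count, verb_count):
    """Pr(c | v) = count(c, v) / count(v)."""
    if verb_count.get(verb, 0.0) == 0.0:
        return 0.0
    return class_count.get((c, verb), 0.0) / verb_count[verb]


def best_class(word, verb, class_count, verb_count):
    """Pick the hypernym class of `word` most probable as an object of `verb`."""
    return max(HYPERNYMS[word],
               key=lambda c: prob(c, verb, class_count, verb_count))


pairs = [("serve", "ragout"), ("serve", "tureen"),
         ("serve", "tuna"), ("serve", "curry")]
class_count, verb_count = train(pairs)
print(best_class("bass", "serve", class_count, verb_count))  # fish
```

In this toy corpus, tuna contributes half a count to the fish class, so fish outscores musical_instrument for bass as an object of serve even though bass itself was never observed with that verb.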
Machine Learning Approaches
- Learn a classifier to assign one of the possible word senses for each word
- Acquire knowledge from a labeled or unlabeled corpus
- Human intervention only in labeling the corpus and selecting the set of features to use in training
- Input: feature vectors
  - Target (dependent variable)
  - Context (set of independent variables)
- Output: classification rules for unseen text
WSD Tags
- What’s a tag? A dictionary sense?
- For example, for WordNet an instance of “bass” in a text has 8 possible tags or labels (bass1 through bass8).
WordNet Bass
The noun “bass” has 8 senses in WordNet:
1. bass - (the lowest part of the musical range)
2. bass, bass part - (the lowest part in polyphonic music)
3. bass, basso - (an adult male singer with the lowest voice)
4. sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae)
5. freshwater bass, bass - (any of various North American lean-fleshed freshwater fishes especially of the genus Micropterus)
6. bass, bass voice, basso - (the lowest adult male singing voice)
7. bass - (the member with the lowest range of a family of musical instruments)
8. bass - (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)
Representations
- Most supervised ML approaches require a very simple representation for the input training data.
  - Vectors of sets of feature/value pairs, i.e. files of comma-separated values
- So our first task is to extract training data from a corpus with respect to a particular instance of a target word
  - This typically consists of a characterization of the window of text surrounding the target
Representations
- This is where ML and NLP intersect
  - If you stick to trivial surface features that are easy to extract from a text, then most of the work is in the ML system
  - If you decide to use features that require more analysis (say parse trees), then the ML part may be doing less work (relatively), provided these features are truly informative
Surface Representations
Collocational and co-occurrence information
- Collocational
  - Encode features about the words that appear in specific positions to the right and left of the target word
  - Often limited to the words themselves as well as their part of speech
- Co-occurrence
  - Features characterizing the words that occur anywhere in the window regardless of position
  - Typically limited to frequency counts
Collocational
- Position-specific information about the words in the window
- guitar and bass player stand
  - [guitar, NN, and, CJC, player, NN, stand, VVB]
  - In other words, a vector consisting of [position n word, position n part-of-speech, …]
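As an illustration of the collocational vector above, the following sketch extracts the word and part-of-speech tag at each position around the target. The function name, the `width` parameter, and the `<pad>` placeholder for positions beyond the sentence edge are assumptions made for this example.

```python
def collocational_features(tagged, target_index, width=2):
    """Return [w-2, pos-2, w-1, pos-1, w+1, pos+1, w+2, pos+2] for the target."""
    features = []
    offsets = list(range(-width, 0)) + list(range(1, width + 1))
    for off in offsets:
        i = target_index + off
        if 0 <= i < len(tagged):
            word, pos = tagged[i]
        else:
            word, pos = "<pad>", "<pad>"  # position falls outside the sentence
        features.extend([word, pos])
    return features


sentence = [("guitar", "NN"), ("and", "CJC"), ("bass", "NN"),
            ("player", "NN"), ("stand", "VVB")]
print(collocational_features(sentence, 2))
# ['guitar', 'NN', 'and', 'CJC', 'player', 'NN', 'stand', 'VVB']
```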
Co-occurrence
- Information about the words that occur within the window.
- First derive a set of terms to place in the vector.
- Then note how often each of those terms occurs in a given window.
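A co-occurrence vector of this kind might be computed as in the sketch below. The vocabulary shown is a hypothetical choice of terms; in practice it would be derived from the training corpus.

```python
from collections import Counter


def cooccurrence_vector(window_words, vocabulary):
    """Count how often each vocabulary term occurs anywhere in the window."""
    counts = Counter(w.lower() for w in window_words)
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary of terms chosen for the target word "bass".
vocabulary = ["fishing", "sea", "guitar", "player", "band"]
window = "guitar and bass player stand".split()
print(cooccurrence_vector(window, vocabulary))  # [0, 0, 1, 1, 0]
```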
Supervised Learning
- Training and test sets with words labeled as to correct sense (It was the biggest [fish: bass] I’ve seen.)
- Obtain values of independent variables automatically (POS, co-occurrence information, …)
- Run classifier on training data
- Test on test data
- Result: classifier for use on unlabeled data
Input Features for WSD
- POS tags of target and neighbors
- Surrounding context words (stemmed or not)
- Punctuation, capitalization and formatting
- Partial parsing to identify thematic/grammatical roles and relations
- Collocational information: how likely are the target and its left/right neighbor to co-occur
- Co-occurrence of neighboring words
  - Intuition: how often does sea, or related words, occur with bass?
How do we proceed?
- Look at a window around the word to be disambiguated, in training data
- Which features accurately predict the correct tag?
- Can you think of other features that might be useful in general for WSD?
Input to learner, e.g. Is the bass fresh today?
[w-2, w-2/pos, w-1, w-1/pos, w+1, w+1/pos, w+2, w+2/pos, …]
[is, V, the, DET, fresh, ADJ, today, N, …]
Classifiers
Once we cast the WSD problem as a classification problem, all sorts of techniques are possible:
- Naïve Bayes (the right thing to try first)
- Decision lists
- Decision trees
- Neural nets
- Support vector machines
- Nearest neighbor methods…
Classifiers
The choice of technique depends, in part, on the set of features that have been used:
- Some techniques work better/worse with features with numerical values
- Some techniques work better/worse with features that have large numbers of possible values
  - For example, the feature “the word to the left” has a fairly large number of possible values
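Types of Classifiers
Naïve Bayes
- Choose $\hat{s} = \arg\max_{s} p(s \mid V)$, or by Bayes’ rule $\hat{s} = \arg\max_{s} \frac{p(V \mid s)\, p(s)}{p(V)}$, where s is one of the possible senses and V is the input vector of features $(v_1, \dots, v_n)$
- Assume the features are independent given the sense, so the probability of V is the product of the probabilities of each feature given s: $p(V \mid s) = \prod_{j=1}^{n} p(v_j \mid s)$
- p(V) is the same for every candidate sense, so it can be dropped
- Then $\hat{s} = \arg\max_{s} p(s) \prod_{j=1}^{n} p(v_j \mid s)$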
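A small Naïve Bayes sense classifier along these lines could look like the sketch below. It picks the sense maximizing p(s) times the product of p(v_j | s), computed in log space. The add-one smoothing, the sense labels, and the toy training examples are assumptions introduced for the illustration and are not part of the slide.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """examples: list of (feature_list, sense) pairs."""
    sense_counts = Counter()
    feat_counts = defaultdict(Counter)  # sense -> feature -> count
    vocab = set()
    for feats, sense in examples:
        sense_counts[sense] += 1
        for f in feats:
            feat_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feat_counts, vocab


def classify_nb(feats, model):
    """Return argmax_s log p(s) + sum_j log p(v_j | s)."""
    sense_counts, feat_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log prior p(s)
        # Add-one smoothing (an assumption) keeps unseen features non-zero.
        denom = sum(feat_counts[sense].values()) + len(vocab)
        for f in feats:
            score += math.log((feat_counts[sense][f] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical toy training data for the target word "bass".
examples = [
    (["fish", "river", "caught"], "bass_fish"),
    (["sea", "fish", "fresh"], "bass_fish"),
    (["guitar", "player", "band"], "bass_music"),
    (["music", "guitar", "play"], "bass_music"),
]
model = train_nb(examples)
print(classify_nb(["fresh", "fish"], model))  # bass_fish
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.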
Rule Induction Learners (e.g. Ripper)
- Given a feature vector of values for independent variables associated with observations of values for the training set (e.g. [fishing, NP, 3, …] + bass2)
- Produce a set of rules that perform best on the training data, e.g.
  - bass2 if w-1 == ‘fishing’ & pos == NP
  - …
Decision Lists
- Like case statements: apply tests to the input in turn
  - fish within window --> bass1
  - striped bass --> bass1
  - guitar within window --> bass2
  - bass player --> bass2
  - …
- Yarowsky ’96’s approach orders tests by individual accuracy on the entire training set, based on the log-likelihood ratio
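A decision list can be sketched as an ordered sequence of (test, sense) pairs where the first matching test wins. In the sketch below, bass1 is taken to be the fish sense and bass2 the music sense. The helper names, the default label, and the fixed rule order are assumptions; a real system would order the rules by a score such as the log-likelihood ratio computed from training data.

```python
def in_window(word):
    """Test: `word` occurs anywhere in the context window."""
    return lambda tokens, i: word in tokens


def preceded_by(word):
    """Test: `word` appears immediately to the left of the target."""
    return lambda tokens, i: i > 0 and tokens[i - 1] == word


# Rules are tried in order; the order is assumed given rather than learned.
RULES = [
    (in_window("fish"), "bass1"),
    (preceded_by("striped"), "bass1"),
    (in_window("guitar"), "bass2"),
]


def classify(tokens, target_index, rules=RULES, default="unknown"):
    """Return the sense of the first rule whose test matches."""
    for test, sense in rules:
        if test(tokens, target_index):
            return sense
    return default


print(classify("the guitar and bass player".split(), 3))  # bass2
print(classify("he caught a striped bass".split(), 4))    # bass1
```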
Bootstrapping
Bootstrapping I
- Start with a few labeled instances of the target item as seeds to train an initial classifier, C
- Use high-confidence classifications of C on unlabeled data as training data
- Iterate
Bootstrapping II
- Start with sentences containing words strongly associated with each sense (e.g. sea and music for bass), chosen intuitively, from a corpus, or from dictionary entries
- One Sense per Discourse hypothesis
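Statistical Word-Sense Disambiguation
- $\hat{s} = \arg\max_{s} p(s \mid V)$, where s ranges over the vector of candidate senses and V is the vector representation of the input
- By Bayes’ rule: $\hat{s} = \arg\max_{s} \frac{p(V \mid s)\, p(s)}{p(V)}$
- By making an independence assumption about the features: $p(V \mid s) = \prod_{j=1}^{n} p(v_j \mid s)$
- This means that the probability of the input given a sense is the product of the probabilities of its individual features given that sense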
Problems
- Given these general ML approaches, how many classifiers do I need to perform WSD robustly?
  - One for each ambiguous word in the language
- How do you decide what set of tags/labels/senses to use for a given word?
  - Depends on the application
Unsupervised Learning
- Cluster feature vectors to ‘discover’ word senses using some similarity metric (e.g. cosine distance)
  - Represent each cluster as the average of the feature vectors it contains
  - Label clusters by hand with known senses
  - Classify unseen instances by proximity to these known and labeled clusters
- Evaluation problem
  - What are the ‘right’ senses?
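The centroid and proximity steps can be illustrated with the sketch below, which assumes the clusters have already been found and hand-labeled. It uses cosine similarity (higher means closer) as the proximity measure, and the feature vectors and sense labels are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Represent a cluster as the average of its feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_sense(vector, labeled_centroids):
    """Assign the label of the centroid most similar to `vector`."""
    return max(labeled_centroids,
               key=lambda sense: cosine(vector, labeled_centroids[sense]))


# Hypothetical clusters over the features [fish, sea, guitar, band].
clusters = {
    "fish": [[2, 1, 0, 0], [1, 2, 0, 0]],
    "music": [[0, 0, 2, 1], [0, 0, 1, 1]],
}
centroids = {sense: centroid(vs) for sense, vs in clusters.items()}
print(nearest_sense([0, 0, 1, 0], centroids))  # music
```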
How do you know how many clusters to create?
- Cluster impurity
- Some clusters may not map to ‘known’ senses
Dictionary Approaches
- Problem of scale for all ML approaches
  - Build a classifier for each sense ambiguity
- Machine readable dictionaries (Lesk ’86)
  - Retrieve all definitions of content words occurring in the context of the target (e.g. the happy seafarer ate the bass)
  - Compare for overlap with the sense definitions of the target entry (bass2: a type of fish that lives in the sea)
  - Choose the sense with the most overlap
- Limits: entries are short --> expand entries to ‘related’ words
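The overlap procedure above can be sketched as follows. This is a simplified illustration, not Lesk's original implementation. The toy `DICTIONARY`, the `SENSES` definitions, and the stopword list are hypothetical; only the bass2 gloss and the example sentence come from the slide.

```python
STOPWORDS = {"a", "an", "the", "of", "in", "that", "by", "who", "into"}

# Hypothetical machine-readable dictionary entries for context words.
DICTIONARY = {
    "happy": "feeling pleasure",
    "seafarer": "a person who travels by sea",
    "ate": "took food into the body",
}

# Sense definitions of the target; the bass1 gloss is an assumption.
SENSES = {
    "bass1": "the lowest part of the musical range",
    "bass2": "a type of fish that lives in the sea",
}


def content_words(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    words = {w.strip(".,;:()").lower() for w in text.split()}
    return words - STOPWORDS - {""}


def lesk(context, target, senses, dictionary):
    """Choose the sense whose definition overlaps most with the
    definitions of the content words in the context."""
    signature = set()
    for w in content_words(context):
        if w != target:
            signature |= content_words(dictionary.get(w, ""))
    overlaps = {s: len(signature & content_words(d)) for s, d in senses.items()}
    return max(overlaps, key=overlaps.get), overlaps


print(lesk("the happy seafarer ate the bass", "bass", SENSES, DICTIONARY))
# ('bass2', {'bass1': 0, 'bass2': 1})
```

The single shared word "sea" decides the outcome here, which also shows why short entries limit the method.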
Disambiguation using machine readable dictionaries
Lesk’s approach [Lesk 1988]:
- Senses are represented by different definitions
- Look up the definitions of the context words
- Find co-occurring words
- Select the most similar sense
- Accuracy: 50% - 70%
- Problem: not enough overlapping words between definitions
Disambiguation using machine readable dictionaries
Wilks’ approach [Wilks 1990]:
- Attempt to solve Lesk’s problem
- Expand the dictionary definitions
- Use the Longman Dictionary of Contemporary English (LDOCE)
- More word co-occurrence evidence collected
- Accuracy: between 53% and 85%
Wilks’ approach [Wilks 1990]
[Figure: commonly co-occurring words in LDOCE, from Wilks 1990]
Disambiguation using machine readable dictionaries
Luk’s approach [Luk 1995]:
- Statistical sense disambiguation
- Use definitions from LDOCE
- Co-occurrence data collected from the Brown corpus
- Defining concepts: the 1792 words used to write the definitions of LDOCE
- LDOCE pre-processed: conceptual expansion
Luk’s approach [Luk 1995]: noun “sentence” and its conceptual expansion [Luk 1995]
Entry in LDOCE:
1. (an order given by a judge which fixes) a punishment for a criminal found guilty in court
2. a group of words that forms a statement, command, exclamation, or question, usu. contains a subject and a verb, and (in writing) begins with a capital letter and ends with one of the marks . ! ?
Conceptual expansion:
1. {order, judge, punish, crime, criminal, find, guilt, court}
2. {group, word, form, statement, command, question, contain, subject, verb, write, begin, capital, letter, end, mark}
Luk’s approach [Luk 1995] cont.
Collect co-occurrence data of defining concepts by constructing a two-dimensional Concept Co-occurrence Data Table (CCDT):
- The Brown corpus is divided into sentences
- Collect conceptual co-occurrence data for each defining concept that occurs in the sentence
- Insert the collected data into the Concept Co-occurrence Data Table
Luk’s approach [Luk 1995] cont.
Score each sense S with respect to context C (scoring formula given in [Luk 1995])
Luk’s approach [Luk 1995] cont.
- Select the sense with the highest score
- Accuracy: 77%
- Human accuracy: 71%
Approaches using Roget's Thesaurus [Yarowsky 1992]
- Resources used:
  - Roget's Thesaurus
  - Grolier Multimedia Encyclopedia
- Senses of a word: categories in Roget's Thesaurus
- 1042 broad categories covering areas like tools/machinery or animals/insects
Approaches using Roget's Thesaurus [Yarowsky 1992] cont.
Some words placed into the tools/machinery category [Yarowsky 1992]:
tool, implement, appliance, contraption, apparatus, utensil, device, gadget, craft, machine, engine, motor, dynamo, generator, mill, lathe, equipment, gear, tackle, tackling, rigging, harness, trappings, fittings, accoutrements, paraphernalia, equipage, outfit, appointments, furniture, material, plant, appurtenances, a wheel, jack, clockwork, wheel-work, spring, screw, …
Approaches using Roget's Thesaurus [Yarowsky 1992] cont.
Collect context for each category:
- For each occurrence of each member of the category in the Grolier Encyclopedia, extract the 100 surrounding words
[Figure: sample occurrences of words in the tools/machinery category, from Yarowsky 1992]
Approaches using Roget's Thesaurus [Yarowsky 1992] cont.
- Identify and weight salient words
  [Figure: sample salient words for Roget categories 348 and 414, from Yarowsky 1992]
- To disambiguate a word: sum the weights of all salient words appearing in the context
- Accuracy: 92% when disambiguating 12 words
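The final scoring step, summing salient-word weights per category, might be sketched as below. The category names, salient words, and weight values are invented for illustration and are not taken from Yarowsky's tables. How the weights are estimated from the encyclopedia contexts is not shown.

```python
# Hypothetical salient-word weights per Roget category (values invented).
WEIGHTS = {
    "TOOLS/MACHINERY": {"lift": 2.4, "cable": 1.9, "engine": 2.6},
    "ANIMALS/INSECTS": {"species": 2.3, "nest": 2.9, "egg": 2.2},
}


def score_categories(context_words, weights):
    """Sum the weights of all salient words that appear in the context."""
    return {
        category: sum(salient.get(w, 0.0) for w in context_words)
        for category, salient in weights.items()
    }


def disambiguate(context_words, weights):
    """Pick the category with the highest summed weight."""
    scores = score_categories(context_words, weights)
    return max(scores, key=scores.get)


context = "the crane built a nest and laid an egg".split()
print(disambiguate(context, WEIGHTS))  # ANIMALS/INSECTS
```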
Summary
Many useful approaches have been developed to do WSD:
- Supervised and unsupervised ML techniques
- Novel uses of existing resources (WN, dictionaries)
Future:
- More tagged training corpora becoming available
- New learning techniques being tested, e.g. co-training
Next class: Ch 17:3-5
Introduction to WordNet (1)
- Online thesaurus system
- Synsets: synonymous words
- Hierarchical relationships
Introduction to WordNet (2)
[Figure from Sanderson 2000]
Voorhees’ Disambiguation Experiment
- Calculation of semantic distance: synset and context words
- Word’s sense: the synset closest to the context words
- Retrieval result: worse than without disambiguation
Gonzalo’s IR experiment (1)
Two questions:
- Can WordNet really offer any potential for text retrieval?
- How is text retrieval performance affected by disambiguation errors?
Gonzalo’s IR experiment (2)
- Text collection: summaries and documents
- Experiments:
  1. Standard SMART run
  2. Indexed in terms of word senses
  3. Indexed in terms of synsets
  4. Introduction of disambiguation errors
Gonzalo’s IR experiment (3)
Experiment                          % correct documents retrieved
Indexing by synsets                 62.0
Indexing by word senses             53.2
Indexing by words                   48.0
Indexing by synsets (5% errors)     62.0
Id. with 10% errors                 60.8
Id. with 20% errors                 56.1
Id. with 30% errors                 54.4
Id. with all possible               52.6
Id. with 60% errors                 49.1
Gonzalo’s IR experiment (4)
- Disambiguation with WordNet can improve text retrieval
- The solution lies in a reliable automatic WSD technique
Disambiguation With Unsupervised Learning
Yarowsky’s Unsupervised Method
- One Sense Per Collocation, e.g. plant (manufacturing/life)
- One Sense Per Discourse, e.g. defense (war/sports)
Yarowsky’s Unsupervised Method cont.
Algorithm details:
- Step 1: Store the word and its contexts as lines, e.g. … zonal distribution of plant life …
- Step 2: Identify a few words that represent each sense of the word, e.g. plant (manufacturing/life)
- Step 3a: Get rules from the training set
  - plant + X => A, weight
  - plant + Y => B, weight
- Step 3b: Use the rules created in 3a to classify all occurrences of plant in the sample set
Yarowsky’s Unsupervised Method cont.
- Step 3c: Use the one-sense-per-discourse rule to filter or augment this addition
- Step 3d: Repeat steps 3a-3b-3c iteratively
- Step 4: The training converges on a stable residual set
- Step 5: The result is a set of rules that will be used to disambiguate the word “plant”, e.g.
  - plant + growth => life
  - plant + car => manufacturing
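The iterative loop of steps 2 to 4 can be illustrated with a much-simplified sketch. It is not Yarowsky's actual algorithm: rules are not weighted or ranked by log-likelihood, the one-sense-per-discourse filter of step 3c is omitted, and a collocate becomes a rule as soon as it has been seen with exactly one sense. The toy contexts and seed words are hypothetical.

```python
from collections import Counter, defaultdict


def bootstrap(contexts, seeds, max_iter=10):
    """contexts: list of token sets; seeds: dict collocate -> sense.
    Returns the grown rule set and the final context labels."""
    rules = dict(seeds)
    labels = {}
    for _ in range(max_iter):
        # Step 3b: label contexts whose matching rules agree on one sense.
        new_labels = {}
        for i, ctx in enumerate(contexts):
            senses = {rules[w] for w in ctx if w in rules}
            if len(senses) == 1:
                new_labels[i] = senses.pop()
        if new_labels == labels:
            break  # Step 4: labeling is stable, so stop
        labels = new_labels
        # Step 3a: collect collocates seen with exactly one sense.
        seen = defaultdict(Counter)
        for i, sense in labels.items():
            for w in contexts[i]:
                seen[w][sense] += 1
        for w, per_sense in seen.items():
            if len(per_sense) == 1 and w not in rules:
                rules[w] = next(iter(per_sense))
    return rules, labels


contexts = [
    {"plant", "life", "species"},
    {"plant", "manufacturing", "car"},
    {"plant", "species", "growth"},
    {"plant", "car", "workers"},
    {"plant", "growth", "soil"},
    {"plant", "workers", "strike"},
]
seeds = {"life": "life", "manufacturing": "manufacturing"}
rules, labels = bootstrap(contexts, seeds)
print(rules["growth"], rules["car"])  # life manufacturing
```

In this run the target word itself never becomes a rule, because it occurs with both senses from the first iteration onward.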
Yarowsky’s Unsupervised Method cont.
Advantages of this method:
- Better accuracy compared to other unsupervised methods
- No need for costly hand-tagged training sets (as in supervised methods)
Schütze and Pedersen’s approach [Schütze 1995]
- Source of word sense definitions
  - Not using a dictionary or thesaurus
  - Using only the corpus to be disambiguated (Category B TREC-1 collection)
- Thesaurus construction
  - Collect a (symmetric) term-term matrix C
  - Entry c_ij: number of times that words i and j co-occur in a symmetric window of total size k
  - Use SVD to reduce the dimensionality
Schütze and Pedersen’s approach [Schütze 1995] cont.
- Thesaurus vector: columns of C
- Semantic similarity: cosine between columns
- Thesaurus: associate each word with its nearest neighbors
- Context vector: sum of the thesaurus vectors of the context words
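The construction of the term-term matrix and of a context vector can be sketched as follows. The SVD dimensionality reduction is omitted, the window is given as a half-width on each side (an assumed parameterization of the symmetric window), and the toy sentences are hypothetical.

```python
from collections import defaultdict


def cooccurrence(sentences, half_width=1):
    """C[i][j]: number of times words i and j occur within
    `half_width` positions of each other (symmetric by construction)."""
    C = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo = max(0, i - half_width)
            hi = min(len(tokens), i + half_width + 1)
            for j in range(lo, hi):
                if j != i:
                    C[w][tokens[j]] += 1
    return C


def context_vector(context_words, C, vocab):
    """Sum the thesaurus vectors (rows of C) of the context words."""
    return [sum(C[w][t] for w in context_words if w in C) for t in vocab]


sentences = [["fresh", "sea", "bass"], ["bass", "guitar", "solo"]]
C = cooccurrence(sentences)
vocab = ["fresh", "sea", "bass", "guitar", "solo"]
print(context_vector(["sea", "fresh"], C, vocab))  # [1, 1, 1, 0, 0]
```

Because C is symmetric, its rows and columns coincide, so summing rows is equivalent to summing the column vectors named on the slide.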
Schütze and Pedersen’s approach [Schütze 1995] cont.
Disambiguation algorithm:
- Identify the context vectors corresponding to all occurrences of a particular word
- Partition them into regions of high density
- Tag a sense for each such region
- Disambiguating a word:
  - Compute the context vector of its occurrence
  - Find the closest centroid of a region
  - Assign the occurrence the sense of that centroid
Schütze and Pedersen’s approach [Schütze 1995] cont.
- Accuracy: 90%
- Application to IR: replacing the words by word senses
  - Sense-based retrieval’s average precision over 11 points of recall increased 4% with respect to word-based retrieval
  - Combining the rankings for each document: average precision increased 11%
  - Each occurrence assigned n (2, 3, 4, 5) senses: average precision increased 14% for n=3
Conclusion
- How much can WSD help improve IR effectiveness? Open question
  - Weiss: 1%; Voorhees’ method: negative
  - Krovetz and Croft, Sanderson: only useful for short queries
  - Schütze and Pedersen’s approach and Gonzalo’s experiment: positive result
- WSD must be accurate to be useful for IR
- Schütze and Pedersen’s and Yarowsky’s algorithms: promising for IR
- Luk’s approach: robust to sparse data, suitable for small corpora
References
[Krovetz 92] R. Krovetz & W.B. Croft (1992). Lexical Ambiguity and Information Retrieval, in ACM Transactions on Information Systems, 10(1).
[Gonzalo 1998] J. Gonzalo, F. Verdejo, I. Chugur and J. Cigarran, “Indexing with WordNet synsets can improve Text Retrieval”, Proceedings of the COLING/ACL ’98 Workshop on Usage of WordNet for NLP, Montreal, 1998
[Lesk 1988] M. Lesk, “They said true things, but called them by wrong names” – vocabulary problems in retrieval systems, in Proc. 4th Annual Conference of the University of Waterloo Centre for the New OED, 1988
[Luk 1995] A.K. Luk. “Statistical sense disambiguation with relatively small corpora using dictionary definitions”. In Proceedings of the 33rd Annual Meeting of the ACL, Columbus, Ohio, June 1995. Association for Computational Linguistics.
[Salton 83] G. Salton & M.J. McGill (1983). The SMART and SIRE experimental retrieval systems, in Introduction to Modern Information Retrieval. New York: McGraw-Hill
[Sanderson 1997] Sanderson, M. Word Sense Disambiguation and Information Retrieval, PhD Thesis, Technical Report (TR-1997-7) of the Department of Computing Science at the University of Glasgow, Glasgow G12 8QQ, UK.
[Sanderson 2000] Sanderson, Mark, “Retrieving with Good Sense”, http://citeseer.nj.nec.com/sanderson00retrieving.html, 2000
References cont.
[Schütze 1995] H. Schütze & J.O. Pedersen. “Information retrieval based on word senses”, in Proceedings of the Symposium on Document Analysis and Information Retrieval, 4: 161-175, 1995
[Small 1982] S. Small & C. Rieger, “Parsing and comprehending with word experts (a theory and its realisation)”, in Strategies for Natural Language Processing, W.G. Lehnert & M.H. Ringle, Eds., LEA: 89-148, 1982
[Voorhees 1993] E. M. Voorhees, “Using WordNet™ to disambiguate word sense for text retrieval”, in Proceedings of ACM SIGIR Conference, (16): 171-180, 1993
[Weiss 73] S.F. Weiss (1973). Learning to disambiguate, in Information Storage and Retrieval, 9: 33-41, 1973
[Wilks 1990] Y. Wilks, D. Fass, C. Guo, J.E. Mcdonald, T. Plate, B.M. Slator (1990). Providing Machine Tractable Dictionary Tools, in Machine Translation, 5: 99-154, 1990
[Yarowsky 1992] D. Yarowsky, “Word sense disambiguation using statistical models of Roget’s categories trained on large corpora”, in Proceedings of COLING Conference: 454-460, 1992
[Yarowsky 1994] Yarowsky, D. “Decision lists for lexical ambiguity resolution: Application to Accent Restoration in Spanish and French.” In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, 1994
[Yarowsky 1995] Yarowsky, D. “Unsupervised word sense disambiguation rivaling supervised methods.” In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, Cambridge, MA, 1995