Part 3. Knowledge-based Methods for Word Sense Disambiguation
Outline Task definition –Machine Readable Dictionaries Algorithms based on Machine Readable Dictionaries Selectional Restrictions Measures of Semantic Similarity Heuristic-based Methods
Task Definition Knowledge-based WSD = class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text Resources –Yes Machine Readable Dictionaries Raw corpora –No Manually annotated corpora Scope –All open class words
Machine Readable Dictionaries In recent years, most dictionaries made available in Machine Readable format (MRD) –Oxford English Dictionary –Collins –Longman Dictionary of Ordinary Contemporary English (LDOCE) Thesauri – add synonymy information –Roget Thesaurus Semantic networks – add more semantic relations –WordNet –EuroWordNet
MRD – A Resource for Knowledge-based WSD For each word in the language vocabulary, an MRD provides: –A list of meanings –Definitions (for all word meanings) –Typical usage examples (for most word meanings) WordNet definitions/examples for the noun plant 1.buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles“ 2.a living organism lacking the power of locomotion 3.something planted secretly for discovery by another; "the police used a plant to trick the thieves"; "he claimed that the evidence against him was a plant" 4.an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
MRD – A Resource for Knowledge-based WSD A thesaurus adds: –An explicit synonymy relation between word meanings A semantic network adds: –Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF), antonymy, entailnment, etc. WordNet synsets for the noun “plant ” 1. plant, works, industrial plant 2. plant, flora, plant life WordNet related concepts for the meaning “plant life ” {plant, flora, plant life} hypernym: {organism, being} hypomym: {house plant}, {fungus}, … meronym: {plant tissue}, {plant part} holonym: {Plantae, kingdom Plantae, plant kingdom}
Outline Task definition –Machine Readable Dictionaries Algorithms based on Machine Readable Dictionaries Selectional Restrictions Measures of Semantic Similarity Heuristic-based Methods
Lesk Algorithm (Michael Lesk 1986): Identify senses of words in context using definition overlap Algorithm: 1.Retrieve from MRD all sense definitions of the words to be disambiguated 2.Determine the definition overlap for all possible sense combinations 3.Choose senses that lead to highest overlap Example: disambiguate PINE CONE PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness CONE 1. solid body which narrows to a point 2. something of this shape whether solid or hollow 3. fruit of certain evergreen trees
Lesk Algorithm for More than Two Words? I saw a man who is 98 years old and can still walk and tell jokes –nine open class words: see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3) 43,929,600 sense combinations! How to find the optimal sense combination? Simulated annealing (Cowie, Guthrie, Guthrie 1992) –Define a function E = combination of word senses in a given text. –Find the combination of senses that leads to highest definition overlap (redundancy) 1. Start with E = the most frequent sense for each word 2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E 3. Stop iterating when there is no change in the configuration of senses
Lesk Algorithm: A Simplified Version Original Lesk defintion: measure overlap between sense definitions for all words in context –Identify simultaneously the correct senses for all words in context Simplified Lesk (Kilgarriff & Rosensweig 2000): measure overlap between sense definitions of a word and current context –Identify the correct sense for one word at a time Algorithm for simplified Lesk: 1.Retrieve from MRD all sense definitions of the word to be disambiguated 2.Determine the overlap between each sense definition and the current context 3.Choose the sense that leads to highest overlap
Evaluations of Lesk Algorithm Initial evaluation by M. Lesk –50-70% on short samples of text manually annotated set, with respect to Oxford Advanced Learner’s Dictionary Simulated annealing –47% on 50 manually annotated sentences Evaluation on Senseval-2 all-words data, with back-off to random sense (Mihalcea & Tarau 2004) –Original Lesk: 35% –Simplified Lesk: 47% Evaluation on Senseval-2 all-words data, with back-off to most frequent sense (Vasilescu, Langlais, Lapalme 2004) –Original Lesk: 42% –Simplified Lesk: 58%
Outline Task definition –Machine Readable Dictionaries Algorithms based on Machine Readable Dictionaries Selectional Restrictions Measures of Semantic Similarity Monosemous Equivalents Heuristic-based Methods
Selectional Restrictions A way to constrain the possible meanings of words in a given context E.g. “Wash a dish” vs. “Cook a dish” –WASH-OBJECT vs. COOK-FOOD Capture information about possible relations between semantic classes –Common sense knowledge Alternative terminology –Selectional Restrictions –Selectional Preferences –Selectional Constraints
Acquiring Selectional Restrictions From annotated corpora –Circular relationship with the WSD problem Need WSD to build the annotated corpus Need selectional restrictions to derive WSD From raw corpora –Frequency counts –Information theory measures –Class-to-class relations
Preliminaries: Learning Word-to-Word Relations An indication of the semantic fit between two words 1. Frequency counts –Pairs of words connected by a syntactic relations 2. Conditional probabilities –Condition on one of the words
Learning Selectional Restrictions (1) Word-to-class relations (Resnik 1993) –Quantify the contribution of a semantic class using all the concepts subsumed by that class –where
Learning Selectional Restrictions (2) Determine the contribution of a word sense based on the assumption of equal sense distributions: –e.g. “plant” has two senses --> 50% occurences are sense 1, 50% are sense 2 Example: learning restrictions for the verb “to drink” –Find high-scoring verb-object pairs –Find “prototypical” object classes (high association score)
Learning Selectional Restrictions (3) Class-to-class relations (Agirre and Martinez, 2002) E.g.: “ingest food” is a class-to-class relation for “eat chicken” Learn sense pair frequencies –From an annotated corpus –Using the equal sense distribution assumption Other methods for acquiring selectional restrictions –Bayesian networks (Ciaramita and Johnson, 2000) –Tree cut model (Li and Abe, 1998)
Using Selectional Restrictions for WSD Algorithm: 1.Learn a large set of selectional restrictions for a given syntactic relation R 2. Given a pair of words W 1 -W 2 connected by a relation R 3. Find all selectional restrictions W 1 —C (word-to-class) or C 1 —C 2 (class-to-class) that apply 4. Select the meanings of W 1 and W 2 based on the selected semantic class
Evaluation of Selectional Restrictions for WSD Data set –mainly on verb-object, subject-verb relations extracted from SemCor Compare against random baseline Results (Agirre and Martinez, 2000) –Average results on 8 nouns –Similar figures reported in (Resnik 1997)
Outline Task definition –Machine Readable Dictionaries Algorithms based on Machine Readable Dictionaries Selectional Restrictions Measures of Semantic Similarity Monosemous Equivalents Heuristic-based Methods
Semantic Similarity Words in a discourse must be related in meaning, for the discourse to be coherent (Haliday and Hassan, 1976) Use this property for WSD – Identify related meanings for words that share a common context Context span: 1. Local context: semantic similarity between pairs of words 2. Global context: lexical chains
Semantic Similarity in a Local Context Similarity determined between pairs of concepts, or between a word and its surrounding context Relies on similarity metrics on semantic networks –(Rada et al. 1989) carnivore wild dogwolf bearfeline, felidcanine, canidfissiped mamal, fissiped dachshund hunting doghyena dogdingo hyenadog terrier
Concept-Pair Similarity Metrics (1) Input: two concepts (same part of speech) Output: similarity measure (Leacock and Chodorow 1998) –E.g. Similarity(wolf,dog) = 1 Similarity(wolf,bear) = 0.82 (Resnik 1995) –Define information content, P(C) = probability of seeing a concept of type C –Define similarity between two concepts –Alternatives (Jiang and Conrath 1997), (Lin 1998), D is the taxonomy depth
Concept-Pair Similarity Metrics (2) Find the semantic similarity between concepts, using gloss-based paths across different hierarchies –(Mihalcea and Moldovan 1999) –Applies equally well to words of different parts of speech where CD 12 is the number of common words between the definitions in the hierarchy of C 1 and the hierarchy of C 2 W k is the depth of the concept W k within the hierarchy
Concept-Pair Similarity Metrics for WSD Disambiguate target words based on similarity with one word to the left and one word to the right –(Patwardhan, Banerjee, Pedersen 2002) Evaluation: –1,723 ambiguous nouns from Senseval-2 –Among 5 similarity metrics, (Jiang and Conrath 1997) provide the best precision (39%)
Other Similarity Metrics (1) Conceptual density between nouns (Agirre and Rigau 1995) Measure the “conceptual density” as the overlap between: –the concepts in the hierarchy rooted at C and –the W k words encountered in the context of C, W k = nhyp α, nhyp = total hyponyms of W k, α=0.20 Determine the sense of a word based on highest “conceptual density” with surrounding context Evaluation: –Disambiguate SemCor nouns: 66.4% precision, 88.6% coverage
Other Similarity Metrics (2) Adapted Lesk algorithm –(Banerjee and Pedersen 2002) Measure the overlap between the context of the ambiguous word and the definitions of various senses –Include definitions of related words, e.g. hypernym/hyponym Evaluation: –1,723 ambiguous nouns from Senseval-2 –39.1% precision with the adapted Lesk algorithm
Semantic Similarity in a Global Context Lexical chains (Hirst and St-Onge 1988), (Haliday and Hassan 1976) “A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse” Algorithm for finding lexical chains: 1.Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech. 2.For each such candidate word, and for each meaning for this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain, and the candidate word meaning. 3.If such a chain is found, insert the word in this chain; otherwise, create a new chain.
Lexical Chains for WSD Identify lexical chains in a text –Usually target one part of speech at a time Identify the meaning of words based on their membership to a lexical chain Evaluation: –(Okumura and Honda 1994) lexical chains on five Japanese texts give 63.4% –(Mihalcea and Moldovan 2000) on five SemCor texts give 90% with 60% recall lexical chains “anchored” on monosemous words
Outline Task definition –Machine Readable Dictionaries Algorithms based on Machine Readable Dictionaries Selectional Restrictions Measures of Semantic Similarity Heuristic-based Methods
Most Frequent Sense (1) Identify the most often used meaning and use this meaning by default Word meanings exhibit a Zipfian distribution –E.g. distribution of word senses in SemCor
Most Frequent Sense (2) Method 1: Find the most frequent sense in an annotated corpus Method 2: Find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004) 1.Given a word w, find the top k distributionally similar words N w = {n 1, n 2, …, n k }, with associated similarity scores {dss(w,n 1 ), dss(w,n 2 ), … dss(w,n k )} 2.For each sense ws i of w, identify the similarity with the words n j, using the sense of n j that maximizes this score 3.Rank senses ws i of w based on the total similarity score
One Sense Per Discourse A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowksy 1992) What does this mean? –E.g. The ambiguous word “plant” occurs 10 times in a discourse all instances of “plant” carry the same meaning Evaluation: –8 words with two-way ambiguity, e.g. plant, crane, etc. –98% of the two-word occurrences in the same discourse carry the same meaning The grain of salt: Performance depends on granularity –(Krovetz 1998) experiments with words with more than two senses –Performance of “one sense per discourse” measured on SemCor is approx. 70%
One Sense per Collocation A word tends to preserver its meaning when used in the same collocation (Yarowsky 1993) –Strong for adjacent collocations –Weaker as the distance between words increases An example –The ambiguous word “plant” preserves its meaning in all its occurrences within the collocation “industrial plant”, regardless of the context where this collocation occurs Evaluation: –97% precision on words with two-way ambiguity Finer granularity: –(Martinez and Agirre 2000) tested the “one sense per collocation” hypothesis on text annotated with WordNet senses –70% precision on SemCor words
References (Agirre and Rigau, 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. RANLP (Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences. CONLL (Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. CICLING (Cowie, Guthrie and Guthrie 1992), Cowie, L. and Guthrie, J. A. and Guthrie, L.: Lexical disambiguation using simulated annealing. COLING (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse. DARPA workshop (Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malaproprisms. WordNet: An electronic lexical database, MIT Press. (Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. COLING (Krovetz, 1998) Krovetz, R. More than one sense per discourse. ACL-SIGLEX Workshop (Lesk, 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC (Lin 1998) Lin, D An information theoretic definition of similarity. ICML (Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation and genre/topic variations. EMNLP 2000.
References (Miller et. al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop (Miller, 1995) Miller, G. Wordnet: A lexical database. ACM, 38(11) (Mihalcea and Moldovan, 1999) Mihalcea, R. and Moldovan, D. A method for word sense disambiguation of unrestricted text. ACL (Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS (Mihalcea, Tarau, Figa 2004) R. Mihalcea, P. Tarau, E. Figa PageRank on Semantic Networks with Application to Word Sense Disambiguation, COLING (Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S. and Banerjee, S. and Pedersen, T. Using Measures of Semantic Relatedeness for Word Sense Disambiguation. CICLING (Rada et al 1989) Rada, R. and Mili, H. and Bicknell, E. and Blettner, M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1) (Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to Lexical Relationships. University of Pennsylvania (Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity. IJCAI (Vasilescu, Langlais, Lapalme 2004) F. Vasilescu, P. Langlais, G. Lapalme "Evaluating variants of the Lesk approach for disambiguating words”, LREC (Yarowsky, 1993) Yarowsky, D. One sense per collocation. ARPA Workshop 1993.