Advances in Word Sense Disambiguation
Tutorial at AAAI-2005, July 9, 2005
Rada Mihalcea, University of North Texas
Ted Pedersen, University of Minnesota, Duluth
[Note: slides have been modified/deleted/added]
[For those interested in lexical semantics, I suggest getting the entire tutorial]

2 Definitions
Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities.
– The sense inventory usually comes from a dictionary or thesaurus.
– Knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping approaches
Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.
– Unsupervised techniques

3 Word Senses
The meaning of a word in a given context
Word sense representations
– With respect to a dictionary
  chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"
  chair = the position of professor; "he was awarded an endowed chair in economics"
– With respect to the translation in a second language
  chair = chaise
  chair = directeur
– With respect to the context where it occurs (discrimination)
  "Sit on a chair" / "Take a seat on this chair"
  "The chair of the Math Department" / "The chair of the meeting"

4 Approaches to Word Sense Disambiguation
Knowledge-Based Disambiguation
– use of external lexical resources such as dictionaries and thesauri
– discourse properties
Supervised Disambiguation
– based on a labeled training set
– the learning system has a training set of feature-encoded inputs AND their appropriate sense label (category)
Unsupervised Disambiguation
– based on unlabeled corpora
– the learning system has a training set of feature-encoded inputs BUT NOT their appropriate sense label (category)

5 All Words Word Sense Disambiguation
Attempt to disambiguate all open-class words in a text: "He put his suit over the back of the chair"
Use information from dictionaries
– Definitions / examples for each meaning: find the similarity between definitions and the current context
– Position in a semantic network: find that "table" is closer to "chair/furniture" than to "chair/person"
Use discourse properties
– A word exhibits the same sense in a discourse / in a collocation

6 All Words Word Sense Disambiguation
Minimally supervised approaches
– Learn to disambiguate words using small annotated corpora
– E.g. SemCor: a corpus where all open-class words are disambiguated; 200,000 running words
– Most frequent sense

7 Targeted Word Sense Disambiguation
(we saw this in the previous lecture notes)
Disambiguate one target word: "Take a seat on this chair" / "The chair of the Math Department"
WSD is viewed as a typical classification problem
– use machine learning techniques to train a system
Training:
– Corpus of occurrences of the target word, each occurrence annotated with the appropriate sense
– Build feature vectors: a vector of relevant linguistic features that represents the context (e.g. a window of words around the target word)
Disambiguation:
– Disambiguate the target word in new unseen text

8 Unsupervised Disambiguation
Disambiguate word senses:
– without supporting tools such as dictionaries and thesauri
– without a labeled training text
Without such resources, word senses are not labeled
– We cannot say "chair/furniture" or "chair/person"
We can:
– Cluster/group the contexts of an ambiguous word into a number of groups
– Discriminate between these groups without actually labeling them

9 Unsupervised Disambiguation
Hypothesis: the same senses of a word will have similar neighboring words
Disambiguation algorithm
– Identify context vectors corresponding to all occurrences of a particular word
– Partition them into regions of high density
– Assign a sense to each such region
"Sit on a chair" / "Take a seat on this chair" / "The chair of the Math Department" / "The chair of the meeting"

10 Bounds on Performance
Upper and lower bounds on performance: a measure of how well an algorithm performs relative to the difficulty of the task.
Upper bound:
– Human performance: around 97%-99% with few and clearly distinct senses
– Inter-judge agreement: 95% and up for words with clear and distinct senses; 65%-70% for polysemous words with related senses
Lower bound (or baseline):
– The assignment of a random sense / the most frequent sense
– 90% is excellent for a word with 2 equiprobable senses
– 90% is trivial for a word with 2 senses with probability ratios of 9 to 1

11 References
(Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 1992.
(Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.
(Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11).
(Senseval) Senseval evaluation exercises.

Part 3: Knowledge-based Methods for Word Sense Disambiguation

13 Outline
Task definition
– Machine Readable Dictionaries
Algorithms based on Machine Readable Dictionaries
Selectional Restrictions
Measures of Semantic Similarity
Heuristic-based Methods

14 Task Definition
Knowledge-based WSD = the class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text
Resources
– Yes: Machine Readable Dictionaries, raw corpora
– No: manually annotated corpora
Scope
– All open-class words

15 Machine Readable Dictionaries
In recent years, most dictionaries have been made available in Machine Readable format (MRD)
– Oxford English Dictionary
– Collins
– Longman Dictionary of Contemporary English (LDOCE)
Thesauri – add synonymy information
– Roget's Thesaurus
Semantic networks – add more semantic relations
– WordNet
– EuroWordNet

16 MRD – A Resource for Knowledge-based WSD
For each word in the language vocabulary, an MRD provides:
– A list of meanings
– Definitions (for all word meanings)
– Typical usage examples (for most word meanings)
WordNet definitions/examples for the noun plant:
1. buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles"
2. a living organism lacking the power of locomotion
3. something planted secretly for discovery by another; "the police used a plant to trick the thieves"; "he claimed that the evidence against him was a plant"
4. an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience

17 MRD – A Resource for Knowledge-based WSD
A thesaurus adds:
– An explicit synonymy relation between word meanings
A semantic network adds:
– Hypernymy/hyponymy (IS-A), meronymy (PART-OF), antonymy, entailment, etc.
WordNet synsets for the noun "plant":
1. plant, works, industrial plant
2. plant, flora, plant life
WordNet related concepts for the meaning "plant life": {plant, flora, plant life}
– hypernym: {organism, being}
– hyponym: {house plant}, {fungus}, …
– meronym: {plant tissue}, {plant part}
– holonym: {Plantae, kingdom Plantae, plant kingdom}

18 Outline
Task definition
– Machine Readable Dictionaries
Algorithms based on Machine Readable Dictionaries
Selectional Restrictions
Measures of Semantic Similarity
Heuristic-based Methods

19 Lesk Algorithm
(Michael Lesk 1986): identify the senses of words in context using definition overlap
Algorithm:
1. Retrieve from the MRD all sense definitions of the words to be disambiguated
2. Determine the definition overlap for all possible sense combinations
3. Choose the senses that lead to the highest overlap
Example: disambiguate PINE CONE
PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
CONE
1. solid body which narrows to a point
2. something of this shape whether solid or hollow
3. fruit of certain evergreen trees
Pine#1 ∩ Cone#1 = 0    Pine#2 ∩ Cone#1 = 0
Pine#1 ∩ Cone#2 = 1    Pine#2 ∩ Cone#2 = 0
Pine#1 ∩ Cone#3 = 2    Pine#2 ∩ Cone#3 = 0

20 Lesk Algorithm: A Simplified Version
Original Lesk definition: measure the overlap between sense definitions for all words in context
– Identify simultaneously the correct senses for all words in context
Simplified Lesk (Kilgarriff & Rosensweig 2000): measure the overlap between the sense definitions of a word and the current context
– Identify the correct sense for one word at a time
– Search space significantly reduced

21 Lesk Algorithm: A Simplified Version
Algorithm for simplified Lesk:
1. Retrieve from the MRD all sense definitions of the word to be disambiguated
2. Determine the overlap between each sense definition and the current context
3. Choose the sense that leads to the highest overlap
Example: disambiguate PINE in "Pine cones hanging in a tree"
PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
Pine#1 ∩ Sentence = 1
Pine#2 ∩ Sentence = 0
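As a concrete illustration, here is a minimal sketch of simplified Lesk in Python, using WordNet glosses via NLTK as the sense inventory; the tokenization, stopword filtering, and tie-breaking are our own illustrative choices, not part of the original algorithm.

    # Hedged sketch of simplified Lesk; requires the NLTK 'wordnet'
    # and 'stopwords' data packages.
    from nltk.corpus import wordnet as wn
    from nltk.corpus import stopwords

    STOP = set(stopwords.words("english"))

    def simplified_lesk(word, context):
        """Pick the WordNet sense of `word` whose gloss overlaps most
        with the words of `context` (a plain string)."""
        context_words = {w.lower() for w in context.split()} - STOP
        best_sense, best_overlap = None, -1
        for sense in wn.synsets(word):
            gloss_words = {w.lower() for w in sense.definition().split()} - STOP
            overlap = len(gloss_words & context_words)
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense

    # Example: disambiguate "pine" in a sentence about trees
    print(simplified_lesk("pine", "Pine cones hanging in a tree"))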

22 Evaluations of Lesk Algorithm
Initial evaluation by M. Lesk
– 50-70% on short, manually annotated samples of text, with respect to the Oxford Advanced Learner's Dictionary
Evaluation on Senseval-2 all-words data, with back-off to a random sense (Mihalcea & Tarau 2004)
– Original Lesk: 35%
– Simplified Lesk: 47%
Evaluation on Senseval-2 all-words data, with back-off to the most frequent sense (Vasilescu, Langlais, Lapalme 2004)
– Original Lesk: 42%
– Simplified Lesk: 58%

23 Outline
Task definition
– Machine Readable Dictionaries
Algorithms based on Machine Readable Dictionaries
Selectional Preferences
Measures of Semantic Similarity
Heuristic-based Methods

24 Selectional Preferences
A way to constrain the possible meanings of words in a given context
E.g. "Wash a dish" vs. "Cook a dish"
– WASH-OBJECT vs. COOK-FOOD
Capture information about possible relations between semantic classes
– Common sense knowledge
Alternative terminology
– Selectional Restrictions
– Selectional Preferences
– Selectional Constraints

25 Acquiring Selectional Preferences
From annotated corpora
– We saw this in the previous lecture notes
From raw corpora
– Frequency counts
– Information theory measures
– Class-to-class relations

26 Preliminaries: Learning Word-to-Word Relations
An indication of the semantic fit between two words
1. Frequency counts
– Count pairs of words connected by a syntactic relation
2. Conditional probabilities
– Condition on one of the words
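For example (a hedged illustration; the notation here is ours): counting verb–object pairs in parsed text gives Count(v, n) for a verb v and object noun n, and conditioning on the verb yields

    P(n \mid v) = \frac{Count(v, n)}{Count(v)}

so that a high P(n | v) indicates a good semantic fit between the noun and the verb's object slot.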

27 Learning Selectional Preferences (1)
(p. 14 in Chapter 19; you won't be responsible for this formula)
Word-to-class relations (Resnik 1993)
– Quantify the contribution of a semantic class using all the concepts subsumed by that class
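The formula itself did not survive the slide conversion; the following is the standard formulation from Resnik (1993), given here as a hedged reconstruction. The selectional preference strength of a predicate p is the KL divergence between the class distribution of its argument and the prior class distribution, and the selectional association of p with a class c is the share of that strength contributed by c:

    S(p) = \sum_{c} P(c \mid p) \log \frac{P(c \mid p)}{P(c)}

    A(p, c) = \frac{1}{S(p)} \, P(c \mid p) \log \frac{P(c \mid p)}{P(c)}

where P(c) and P(c | p) are estimated from a corpus by counting, for each observed noun, all the WordNet classes that subsume it.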

28 Outline
Task definition
– Machine Readable Dictionaries
Algorithms based on Machine Readable Dictionaries
Selectional Restrictions
Measures of Semantic Similarity
Heuristic-based Methods

29 Semantic Similarity
Words in a discourse must be related in meaning, for the discourse to be coherent (Halliday and Hasan, 1976)
Use this property for WSD – identify related meanings for words that share a common context

30 See Figure 19.6 in the chapter.
Basic idea: the shorter the path between two senses in a semantic network, the more similar they are.
So you can see that nickel and dime are closer to budget than they are to Richter scale (in Figure 19.6).

31 Semantic Similarity Metrics (1)
Input: two concepts (same part of speech)
Output: similarity measure
(Leacock and Chodorow 1998): similarity based on the length of the path between the concepts, scaled by the taxonomy depth D
– E.g. Similarity(wolf, dog) = 0.60; Similarity(wolf, bear) = 0.42
(Resnik 1995): define information content, where P(C) = probability of seeing a concept of type C in a large corpus
– Probability of seeing a concept = probability of seeing instances of that concept
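The formulas on this slide were lost in transcription; the standard definitions, given here as a hedged reconstruction, are

    Sim_{LCh}(c_1, c_2) = -\log \frac{pathlen(c_1, c_2)}{2D}

for Leacock and Chodorow (1998), where pathlen counts edges in the IS-A hierarchy and D is the taxonomy depth, and

    IC(c) = -\log P(c)

for Resnik's (1995) information content, where P(c) is the probability of encountering an instance of concept c (or any concept it subsumes) in a large corpus.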

32 Semantic Similarity Metrics (2)
Similarity using information content (Resnik 1995)
– Define the similarity between two concepts as the information content of their Least Common Subsumer (LCS)
– Alternatives: (Jiang and Conrath 1997)
Other metrics:
– Similarity using information content (Lin 1998)
– Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999)
– Conceptual density measure between noun semantic hierarchies and the current context (Agirre and Rigau 1995)
– Adapted Lesk algorithm (Banerjee and Pedersen 2002)
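Several of these metrics are available off the shelf. As one illustration (a sketch, assuming NLTK with its WordNet and 'wordnet_ic' data packages installed), the following compares a few of the measures above on the wolf/dog/bear example:

    # Hedged sketch: path/depth- and information-content-based similarity
    # metrics from this slide, via NLTK's WordNet interface.
    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')   # P(c) estimated from the Brown corpus
    wolf, dog, bear = (wn.synset(s) for s in ('wolf.n.01', 'dog.n.01', 'bear.n.01'))

    for other in (dog, bear):
        print(wolf.name(), other.name(),
              'LCh: %.2f' % wolf.lch_similarity(other),              # Leacock & Chodorow 1998
              'Resnik: %.2f' % wolf.res_similarity(other, brown_ic), # Resnik 1995
              'JCn: %.2f' % wolf.jcn_similarity(other, brown_ic),    # Jiang & Conrath 1997
              'Lin: %.2f' % wolf.lin_similarity(other, brown_ic))    # Lin 1998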

33 Most Frequent Sense (1)
Identify the most often used meaning and use this meaning by default
– Example: "plant/flora" is used more often than "plant/factory", so annotate any instance of PLANT as "plant/flora"
Word meanings exhibit a Zipfian distribution
– E.g. the distribution of word senses in SemCor

34 Most Frequent Sense (2)
(you aren't responsible for this)
Method 1: Find the most frequent sense in an annotated corpus
Method 2: Find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004)
1. Given a word w, find the top k distributionally similar words Nw = {n1, n2, …, nk}, with associated similarity scores {dss(w,n1), dss(w,n2), …, dss(w,nk)}
2. For each sense wsi of w, identify its similarity with the words nj, using the sense of nj that maximizes this score
3. Rank the senses wsi of w based on the total similarity score

35 Most Frequent Sense (3)
Word senses
– pipe#1 = tobacco pipe
– pipe#2 = tube of metal or plastic
Distributionally similar words
– N = {tube, cable, wire, tank, hole, cylinder, fitting, tap, …}
For each word in N, find the similarity with pipe#i (using the sense that maximizes the similarity)
– pipe#1 – tube(#3) = 0.3
– pipe#2 – tube(#1) = 0.6
Compute a score for each sense pipe#i
– score(pipe#1) = 0.25
– score(pipe#2) = 0.73
Note: results depend on the corpus used to find distributionally similar words => can find domain-specific predominant senses
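A minimal sketch of this ranking (hedged: McCarthy et al. build the neighbor list with a distributional thesaurus and score senses with a WordNet-based similarity; both are abstracted here as caller-supplied functions, so the names below are stand-ins):

    # Hedged sketch of the McCarthy et al. (2004) predominant-sense ranking.
    # dss_scores maps each distributional neighbor n to dss(w, n);
    # wnss(sense, n) returns the max semantic similarity between `sense`
    # and any sense of neighbor n. Both are assumed, illustrative helpers.
    def predominant_sense(senses, dss_scores, wnss):
        def prevalence(sense):
            score = 0.0
            for n, dss in dss_scores.items():
                contribution = wnss(sense, n)
                # normalize over competing senses, then weight by dss(w, n)
                total = sum(wnss(s, n) for s in senses)
                if total > 0:
                    score += dss * contribution / total
            return score
        return max(senses, key=prevalence)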

36 One Sense Per Discourse
A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)
What does this mean?
– E.g. if the ambiguous word PLANT occurs 10 times in a discourse, all instances of "plant" carry the same meaning
Evaluation:
– 8 words with two-way ambiguity, e.g. plant, crane, etc.
– 98% of the two-word occurrences in the same discourse carry the same meaning
The grain of salt: performance depends on granularity
– (Krovetz 1998) experiments with words with more than two senses
– Performance of "one sense per discourse" measured on SemCor is approx. 70%

37 One Sense per Collocation
A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993)
– Strong for adjacent collocations
– Weaker as the distance between words increases
An example: the ambiguous word PLANT preserves its meaning in all its occurrences within the collocation "industrial plant", regardless of the context where this collocation occurs
Evaluation:
– 97% precision on words with two-way ambiguity
Finer granularity:
– (Martinez and Agirre 2000) tested the "one sense per collocation" hypothesis on text annotated with WordNet senses
– 70% precision on SemCor words

38 References
(Agirre and Rigau, 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. RANLP 1995.
(Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences. CONLL 2001.
(Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. CICLING 2002.
(Cowie, Guthrie and Guthrie 1992) Cowie, L., Guthrie, J. A., and Guthrie, L. Lexical disambiguation using simulated annealing. COLING 1992.
(Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse. DARPA workshop 1992.
(Halliday and Hasan 1976) Halliday, M. and Hasan, R. (1976). Cohesion in English. Longman.
(Galley and McKeown 2003) Galley, M. and McKeown, K. Improving word sense disambiguation in lexical chaining. IJCAI 2003.
(Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. WordNet: An electronic lexical database, MIT Press.
(Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. COLING 1997.
(Krovetz, 1998) Krovetz, R. More than one sense per discourse. ACL-SIGLEX 1998.
(Lesk, 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986.
(Lin 1998) Lin, D. An information theoretic definition of similarity. ICML 1998.

39 References
(Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation and genre/topic variations. EMNLP 2000.
(Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.
(Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11).
(Mihalcea and Moldovan, 1999) Mihalcea, R. and Moldovan, D. A method for word sense disambiguation of unrestricted text. ACL 1999.
(Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000.
(Mihalcea, Tarau, Figa 2004) Mihalcea, R., Tarau, P., and Figa, E. PageRank on semantic networks with application to word sense disambiguation. COLING 2004.
(Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S., Banerjee, S., and Pedersen, T. Using measures of semantic relatedness for word sense disambiguation. CICLING 2003.
(Rada et al. 1989) Rada, R., Mili, H., Bicknell, E., and Blettner, M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1).
(Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to Lexical Relationships. University of Pennsylvania.
(Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity. IJCAI 1995.
(Vasilescu, Langlais, Lapalme 2004) Vasilescu, F., Langlais, P., and Lapalme, G. Evaluating variants of the Lesk approach for disambiguating words. LREC 2004.
(Yarowsky, 1993) Yarowsky, D. One sense per collocation. ARPA Workshop 1993.

Part 4: Supervised Methods of Word Sense Disambiguation [This section has been deleted]

Part 5: Minimally Supervised Methods for Word Sense Disambiguation

42 Outline
Task definition
– What does "minimally" supervised mean?
Bootstrapping algorithms
– Co-training
– Self-training
– Yarowsky algorithm
Using the Web for Word Sense Disambiguation
– Web as a corpus
– Web as collective mind

43 Task Definition
Supervised WSD = learning sense classifiers starting with annotated data
Minimally supervised WSD = learning sense classifiers from annotated data, with minimal human supervision
Examples
– Automatically bootstrap a corpus starting with a few human-annotated examples
– Use monosemous relatives / dictionary definitions to automatically construct sense-tagged data

44 Outline
Task definition
– What does "minimally" supervised mean?
Bootstrapping algorithms
– Co-training
– Self-training
– Yarowsky algorithm
Using the Web for Word Sense Disambiguation
– Web as a corpus
– Web as collective mind

45 Bootstrapping Recipe
Ingredients
– (Some) labeled data
– (Large amounts of) unlabeled data
– (One or more) basic classifiers
Output
– A classifier that improves over the basic classifiers

46 [Diagram: Classifier 1 and Classifier 2 bootstrap from labeled seed examples ("… plants#1 and animals …", "… industry plant#2 …") and unlabeled contexts ("… building the only atomic plant …", "… plant growth is retarded …", "… a herb or flowering plant …", "… a nuclear power plant …", "… building a new vehicle plant …", "… the animal and plant life …", "… the passion-fruit plant …"), producing newly labeled examples such as "… plant#1 growth is retarded …" and "… a nuclear power plant#2 …"]

47 Co-training/Self-training
Given:
– A set L of labeled training examples
– A set U of unlabeled examples
– Classifiers Ci
1. Create a pool of examples U'
– choose P random examples from U
2. Loop for I iterations
– Train Ci on L and label U'
– Select the G most confident examples and add them to L, maintaining the class distribution in L
– Refill U' with examples from U, keeping U' at constant size P
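A minimal self-training sketch of this loop (hedged: `clf` stands for any scikit-learn-style classifier with fit/predict_proba; examples are assumed hashable, e.g. feature tuples, and the pool handling and confidence criterion are illustrative choices):

    # Hedged sketch of the generic loop above; the slide's
    # class-distribution balancing is omitted for brevity.
    import random

    def self_train(clf, L_X, L_y, U, P=100, G=10, I=20):
        pool = random.sample(U, min(P, len(U)))          # 1. create pool U'
        U = [x for x in U if x not in pool]
        for _ in range(I):                               # 2. loop for I iterations
            clf.fit(L_X, L_y)                            # train on L, label U'
            probs = clf.predict_proba(pool)
            # select the G most confident pool examples and add them to L
            best = sorted(range(len(pool)),
                          key=lambda i: probs[i].max(), reverse=True)[:G]
            for i in best:
                L_X.append(pool[i])
                L_y.append(clf.classes_[probs[i].argmax()])
            pool = [x for j, x in enumerate(pool) if j not in set(best)]
            refill = random.sample(U, min(P - len(pool), len(U)))
            U = [x for x in U if x not in refill]
            pool = pool + refill                         # refill U' to size P
        return clf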

48 Co-training
(Blum and Mitchell 1998)
Two classifiers
– independent views
– [the independence condition can be relaxed]
Co-training in Natural Language Learning
– Statistical parsing (Sarkar 2001)
– Co-reference resolution (Ng and Cardie 2003)
– Part of speech tagging (Clark, Curran and Osborne 2003)
– ...

49 Self-training
(Nigam and Ghani 2000)
One single classifier
– Retrain on its own output
Self-training for Natural Language Learning
– Part of speech tagging (Clark, Curran and Osborne 2003)
– Co-reference resolution (Ng and Cardie 2003): several classifiers through bagging

50 Yarowsky Algorithm
(Yarowsky 1995)
Similar to co-training
Relies on two heuristics and a decision list
– One sense per collocation: nearby words provide strong and consistent clues as to the sense of a target word
– One sense per discourse: the sense of a target word is highly consistent within a single document

51 Learning Algorithm
A decision list is used to classify instances of the target word: "the loss of animal and plant species through extinction …"
Classification is based on the highest-ranking rule that matches the target context

LogL  Collocation                     Sense
…     …                               …
9.31  flower (within +/- k words)  →  A (living)
9.24  job (within +/- k words)     →  B (factory)
9.03  fruit (within +/- k words)   →  A (living)
9.02  plant species                →  A (living)
…     …                               …
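The LogL column is the decision-list score. As a hedged note (the formula was not preserved on the slide), Yarowsky ranks each collocation rule by the smoothed log-likelihood ratio of the two senses given that collocation:

    LogL = \left| \log \frac{P(Sense_A \mid collocation)}{P(Sense_B \mid collocation)} \right|

and since the list is sorted by this score, the single highest-ranking rule that matches the context decides the sense.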

52 Bootstrapping Algorithm
– All occurrences of the target word are identified
– A small training set of seed data is tagged with word sense
[Figure: seed examples labeled Sense-A (life) and Sense-B (factory) among untagged residual data]

53 Bootstrapping Algorithm
The seed set grows and the residual set shrinks …

54 Bootstrapping Algorithm
Convergence: stop when the residual set stabilizes

55 Bootstrapping Algorithm
Iterative procedure:
– Train the decision list algorithm on the seed set
– Classify the residual data with the decision list
– Create a new seed set by identifying samples that are tagged with a probability above a certain threshold
– Retrain the classifier on the new seed set
Selecting training seeds
– The initial training set should accurately distinguish among the possible senses
– Strategies:
  Select a single, defining seed collocation for each possible sense, e.g. "life" and "manufacturing" for target plant
  Use words from dictionary definitions
  Hand-label the most frequent collocates
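A hedged sketch of this iterative procedure (the decision-list learner is abstracted into two caller-supplied functions, and the one-sense-per-discourse post-pass is omitted; the threshold value is illustrative):

    # Hedged sketch of the Yarowsky (1995) bootstrapping loop.
    # train(tagged_pairs) -> model and classify(model, context) -> (sense, prob)
    # are assumed, caller-supplied stand-ins for the decision-list learner.
    def yarowsky_bootstrap(train, classify, seeds, residual,
                           threshold=0.95, max_iters=50):
        """seeds: list of (context, sense) pairs; residual: untagged contexts."""
        for _ in range(max_iters):
            model = train(seeds)                         # train on seed set
            labeled = [(c, *classify(model, c)) for c in residual]
            newly_tagged = [(c, s) for c, s, p in labeled if p >= threshold]
            if not newly_tagged:                         # residual set stabilized
                break
            seeds = seeds + newly_tagged                 # grow the seed set
            tagged = {c for c, _ in newly_tagged}
            residual = [c for c in residual if c not in tagged]
        return train(seeds)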

56 Evaluation
Test corpus: extracted from a 460 million word corpus of multiple sources (news articles, transcripts, novels, etc.)
Performance of multiple models compared with:
– supervised decision lists
– the unsupervised learning algorithm of Schütze (1992), based on alignment of clusters with word senses
[Table: per-word accuracies for the supervised, unsupervised (Schütze), and unsupervised bootstrapping models, for words such as plant (living/factory), space (volume/outer), tank (vehicle/container), and motion (legal/physical), plus the average; the numeric scores did not survive transcription]

57 The Web as a Corpus [This topic has been deleted]
Use the Web as a large textual corpus
– Build annotated corpora using monosemous relatives
– Bootstrap annotated corpora starting with a few seeds, similar to (Yarowsky 1995)
Use the (semi-)automatically tagged data to train WSD classifiers

58 References
(Abney 2002) Abney, S. Bootstrapping. Proceedings of ACL 2002.
(Blum and Mitchell 1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. Proceedings of COLT 1998.
(Chklovski and Mihalcea 2002) Chklovski, T. and Mihalcea, R. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of the ACL 2002 workshop on WSD.
(Clark, Curran and Osborne 2003) Clark, S., Curran, J.R., and Osborne, M. Bootstrapping POS taggers using unlabelled data. Proceedings of CoNLL 2003.
(Mihalcea 1999) Mihalcea, R. An automatic method for generating sense tagged corpora. Proceedings of AAAI 1999.
(Mihalcea 2002) Mihalcea, R. Bootstrapping large sense tagged corpora. Proceedings of LREC 2002.
(Mihalcea 2004) Mihalcea, R. Co-training and self-training for word sense disambiguation. Proceedings of CoNLL 2004.
(Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly supervised natural language learning without redundant views. Proceedings of HLT-NAACL 2003.
(Nigam and Ghani 2000) Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM 2000.
(Sarkar 2001) Sarkar, A. Applying co-training methods to statistical parsing. Proceedings of NAACL 2001.
(Yarowsky 1995) Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL 1995.

Part 6: Unsupervised Methods of Word Sense Discrimination

60 Outline
What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts

61 What is Unsupervised Learning?
Unsupervised learning identifies patterns in a large sample of data, without the benefit of any manually labeled examples or external knowledge sources
These patterns are used to divide the data into clusters, where each member of a cluster has more in common with the other members of its own cluster than with members of any other cluster
Note! If you remove the manual labels from supervised data and cluster, you may not discover the same classes as in supervised learning
– Supervised classification identifies features that trigger a sense tag
– Unsupervised clustering finds similarity between contexts

62 Task Definition
Word sense discrimination reduces to the problem of finding the instances of a targeted word that occur in the most similar contexts and placing them in a cluster

63 Agglomerative Clustering
Create a similarity matrix of the instances to be discriminated
– Results in a symmetric "instance by instance" matrix, where each cell contains the similarity score between a pair of instances
– Typically a first-order representation, where similarity is based on the features observed in the pair of instances
Apply an agglomerative clustering algorithm to the matrix
– To start, each instance is its own cluster
– Form a cluster from the most similar pair of instances
– Repeat until the desired number of clusters is obtained
Advantages: high quality clustering
Disadvantages: computationally expensive; must carry out exhaustive pairwise comparisons
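A minimal sketch of this procedure (hedged: the toy contexts, bag-of-words features, cosine similarity, and average-link criterion are illustrative choices; scipy's hierarchical clustering stands in for the agglomerative step):

    # Hedged sketch: pairwise similarity matrix + agglomerative clustering
    # of the contexts of "chair", using scipy and scikit-learn.
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist
    from sklearn.feature_extraction.text import CountVectorizer

    contexts = [
        "sit on a chair",
        "take a seat on this chair",
        "the chair of the math department",
        "the chair of the meeting",
    ]
    X = CountVectorizer().fit_transform(contexts).toarray()
    # Condensed pairwise cosine-distance matrix (distance = 1 - similarity)
    dist = pdist(X, metric="cosine")
    # Average-link agglomerative clustering; cut the tree at 2 clusters
    labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
    print(labels)   # e.g. [1 1 2 2]: furniture contexts vs. person contexts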

64 Measuring Similarity
(you don't need to know these)
Integer values
– Matching Coefficient
– Jaccard Coefficient
– Dice Coefficient
Real values
– Cosine
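The formulas were not preserved on the slide; the standard definitions, stated here as a hedged reconstruction for two feature sets X and Y (and feature vectors x, y for the cosine), are:

    Matching: |X \cap Y|
    Jaccard: \frac{|X \cap Y|}{|X \cup Y|}
    Dice: \frac{2|X \cap Y|}{|X| + |Y|}
    Cosine: \frac{x \cdot y}{\|x\| \, \|y\|}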

65 Evaluation of Unsupervised Methods
If sense-tagged text is available, it can be used for evaluation
– Assume that the sense tags represent "true" clusters, and compare these to the discovered clusters
– Find the mapping of clusters to senses that attains maximum accuracy
Pseudo-words are especially useful, since it is hard to find data that is discriminated
– Pick two words or names from a corpus, and conflate them into one name; then see how well you can discriminate them
Baseline algorithm: group all instances into one cluster; this will reach "accuracy" equal to the majority classifier

66 Sense Discrimination Using Parallel Texts
There is controversy as to what exactly a "word sense" is (e.g., Kilgarriff, 1997)
It is sometimes unclear how fine-grained sense distinctions need to be to be useful in practice
Parallel text may present a solution to both problems!
– Text in one language and its translation into another
Resnik and Yarowsky (1997) suggest that word sense disambiguation concern itself with sense distinctions that manifest themselves across languages
– A "bill" in English may be a "pico" (bird jaw) or a "cuenta" (invoice) in Spanish

67 Parallel Text
Parallel text can be found on the Web, and there are several large corpora available (e.g., UN Parallel Text, Canadian Hansards)
Manual annotation of sense tags is not required!
However, the text must be word aligned (translations identified between the two languages)
– Workshop on Parallel Text, NAACL 2003
Given word-aligned parallel text, sense distinctions can be discovered (e.g., Li and Li, 2002; Diab, 2002)
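The core idea is simple enough to sketch (hedged: the alignment format and the example pairs are invented for illustration): given word-aligned sentence pairs, the instances of an English word are grouped by their aligned translation, and each translation group acts as a discovered sense cluster.

    # Hedged sketch: cluster instances of an English word by the foreign
    # word each instance aligns to, in an invented illustrative format.
    from collections import defaultdict

    def sense_clusters(target, aligned_corpus):
        """aligned_corpus: iterable of (sentence_id, [(en, foreign), ...])."""
        clusters = defaultdict(list)
        for sent_id, alignments in aligned_corpus:
            for en, foreign in alignments:
                if en == target:
                    clusters[foreign].append(sent_id)  # translation = sense label
        return clusters

    corpus = [
        (1, [("bill", "pico")]),     # bird jaw
        (2, [("bill", "cuenta")]),   # invoice
        (3, [("bill", "cuenta")]),
    ]
    print(sense_clusters("bill", corpus))  # {'pico': [1], 'cuenta': [2, 3]}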

68 References
(Diab, 2002) Diab, M. and Resnik, P. An unsupervised method for word sense tagging using parallel corpora. Proceedings of ACL 2002.
(Firth, 1957) Firth, J.R. A synopsis of linguistic theory. In Studies in Linguistic Analysis, Oxford University Press, Oxford.
(Kilgarriff, 1997) Kilgarriff, A. "I don't believe in word senses". Computers and the Humanities (31).
(Li and Li, 2002) Li, H. and Li, C. Word translation disambiguation using bilingual bootstrapping. Proceedings of ACL 2002.
(McQuitty, 1966) McQuitty, L. Similarity analysis by reciprocal pairs for discrete and continuous data. Educational and Psychological Measurement (26).
(Miller and Charles, 1991) Miller, G. and Charles, W. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1).
(Pedersen and Bruce, 1997) Pedersen, T. and Bruce, R. Distinguishing word senses in untagged text. Proceedings of EMNLP-2.
(Purandare and Pedersen, 2004) Purandare, A. and Pedersen, T. Word sense discrimination by clustering contexts in vector and similarity spaces. Proceedings of the Conference on Natural Language Learning.
(Resnik and Yarowsky, 1997) Resnik, P. and Yarowsky, D. A perspective on word sense disambiguation methods and their evaluation. The ACL-SIGLEX Workshop Tagging Text with Lexical Semantics.
(Schütze, 1998) Schütze, H. Automatic word sense discrimination. Computational Linguistics, 24(1).

69 Outline [most of this section deleted]
Where to get the required ingredients?
– Machine Readable Dictionaries
– Machine Learning Algorithms
– Sense Annotated Data
– Raw Data
Where to get WSD software?
How to get your algorithms tested?
– Senseval

70 Senseval
Evaluation of WSD systems
– Senseval 1: 1999 – about 10 teams
– Senseval 2: 2001 – about 30 teams
– Senseval 3: 2004 – about 55 teams
– Senseval 4: 2007(?)
Provides sense-annotated data for many languages, for several tasks
– Languages: English, Romanian, Chinese, Basque, Spanish, etc.
– Tasks: Lexical Sample, All Words, etc.
Provides evaluation software
Provides results of other participating systems

71 Thank You!
Rada Mihalcea
Ted Pedersen