1 Statistical NLP: Lecture 9 Word Sense Disambiguation

2 Overview of the Problem
- Problem: many words have several different meanings or senses ==> there is ambiguity about how they are to be interpreted.
- Task: determine which of the senses of an ambiguous word is invoked in a particular use of the word. This is done by looking at the context of the word's use.
- Note: more often than not, the different senses of a word are closely related.

3 Overview of our Discussion
- Methodology
- Supervised Disambiguation: based on a labeled training set.
- Dictionary-Based Disambiguation: based on lexical resources such as dictionaries and thesauri.
- Unsupervised Disambiguation: based on unlabeled corpora.

4 Methodological Preliminaries
- Supervised versus unsupervised learning: in supervised learning the sense label of each word occurrence in the training data is known; in unsupervised learning it is not.
- Pseudowords: used to generate artificial evaluation data for comparing and improving text-processing algorithms (see the sketch below).
- Upper and lower bounds on performance: used to find out how well an algorithm performs relative to the difficulty of the task.
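
A small illustration of how pseudoword evaluation data can be built: two unambiguous words (the classic pair is "banana" and "door") are conflated into the artificial ambiguous word "banana-door", and the original word is kept as the gold sense label. This is a minimal sketch; the function name and window size are illustrative choices, not from the lecture.

```python
# Sketch: build pseudoword evaluation data by conflating two unambiguous
# words into one artificial ambiguous word; the original word is kept as
# the gold sense label. Window size and function name are illustrative.
def make_pseudoword_data(tokens, word_a="banana", word_b="door", window=5):
    pseudoword = f"{word_a}-{word_b}"
    examples = []
    for i, tok in enumerate(tokens):
        if tok in (word_a, word_b):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
            examples.append((pseudoword, context, tok))  # (form, context, gold sense)
    return examples

tokens = "she opened the door and ate a banana near the door".split()
for example in make_pseudoword_data(tokens):
    print(example)
```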

5 Supervised Disambiguation
- Training set: exemplars in which each occurrence of the ambiguous word w is annotated with a semantic label ==> a classification problem.
- Approaches:
  - Bayesian classification: the context of occurrence is treated as a bag of words without structure, but it integrates information from many words.
  - Information theory: only looks at informative features in the context; these features may be sensitive to text structure.
  - There are many more approaches (see Chapter 16 or the Machine Learning course).

6 Supervised Disambiguation: Bayesian Classification I
- (Gale et al., 1992)'s idea: look at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier does no feature selection; instead, it combines the evidence from all features.
- Bayes decision rule: decide s' if P(s' | C) > P(s_k | C) for all s_k ≠ s'.
- P(s_k | C) is computed by Bayes' rule.

7 Supervised Disambiguation: Bayesian Classification II
- Naïve Bayes assumption: P(C | s_k) = P({v_j | v_j in C} | s_k) = ∏_{v_j in C} P(v_j | s_k).
- The Naïve Bayes assumption is incorrect in the context of text processing, but it is useful.
- Decision rule for Naïve Bayes: decide s' if s' = argmax_{s_k} [log P(s_k) + Σ_{v_j in C} log P(v_j | s_k)].
- P(v_j | s_k) and P(s_k) are computed via maximum-likelihood estimation, perhaps with appropriate smoothing, from the labeled training corpus.
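
To make the decision rule concrete, here is a minimal sketch of a Naïve Bayes disambiguator, assuming a labeled training set of (context words, sense) pairs. The add-one smoothing and the class and variable names are illustrative choices, not prescribed by the lecture.

```python
# Sketch: Naive Bayes word sense disambiguation with add-one smoothing.
import math
from collections import defaultdict, Counter

class NaiveBayesWSD:
    def __init__(self):
        self.sense_counts = Counter()             # count of each sense s_k
        self.word_counts = defaultdict(Counter)   # count of word v_j given sense s_k
        self.vocab = set()

    def train(self, labeled_contexts):
        """labeled_contexts: iterable of (context_words, sense) pairs."""
        for context, sense in labeled_contexts:
            self.sense_counts[sense] += 1
            for word in context:
                self.word_counts[sense][word] += 1
                self.vocab.add(word)

    def disambiguate(self, context):
        """Return argmax_{s_k} [log P(s_k) + sum_{v_j in C} log P(v_j | s_k)]."""
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense, count in self.sense_counts.items():
            score = math.log(count / total)
            n_sense = sum(self.word_counts[sense].values())
            for word in context:
                # add-one smoothing of the MLE estimate of P(v_j | s_k)
                p = (self.word_counts[sense][word] + 1) / (n_sense + len(self.vocab))
                score += math.log(p)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

# Toy usage with invented training data for the ambiguous word "bank".
clf = NaiveBayesWSD()
clf.train([
    (["river", "water", "shore"], "bank/GEO"),
    (["money", "loan", "account"], "bank/FIN"),
])
print(clf.disambiguate(["loan", "interest", "money"]))  # expected: bank/FIN
```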

8 Supervised Disambiguation: An Information-Theoretic Approach
- (Brown et al., 1991)'s idea: find a single contextual feature that reliably indicates which sense of the ambiguous word is being used.
- The Flip-Flop algorithm is used to disambiguate between the different senses of a word, using mutual information as the measure:
  I(X; Y) = Σ_{x in X} Σ_{y in Y} p(x, y) log [p(x, y) / (p(x) p(y))]
- The algorithm works by searching for a partition of senses that maximizes the mutual information, and it stops when the increase becomes insignificant.
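
The mutual information score that Flip-Flop maximizes can be estimated directly from co-occurrence counts. The following is a small sketch of that computation only (not the full Flip-Flop search); the feature/sense observations are invented.

```python
# Sketch: mutual information I(X;Y) in bits, estimated from joint counts of
# (feature value, sense) observations. Toy data below is invented.
import math
from collections import Counter

def mutual_information(pairs):
    joint = Counter(pairs)
    n = sum(joint.values())
    px, py = Counter(), Counter()
    for (x, y), c in joint.items():
        px[x] += c
        py[y] += c
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

observations = [("river", "bank/GEO"), ("river", "bank/GEO"),
                ("loan", "bank/FIN"), ("loan", "bank/FIN"),
                ("money", "bank/FIN")]
print(mutual_information(observations))  # high MI: the feature separates the senses well
```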

9 Dictionary-Based Disambiguation: Overview
- We will be looking at three different methods:
  - Disambiguation based on sense definitions
  - Thesaurus-based disambiguation
  - Disambiguation based on translations in a second-language corpus
- We will also show how a careful examination of the distributional properties of senses can lead to significant improvements in disambiguation.

10 Disambiguation based on sense definitions
- (Lesk, 1986)'s idea: a word's dictionary definitions are likely to be good indicators for the senses they define.
- Express each dictionary sub-definition (sense) of the ambiguous word as a bag of words, and express the context of the ambiguous word as a single bag of words built from the dictionary definitions of the words occurring in that context (all pooled together).
- Disambiguate the ambiguous word by choosing the sub-definition that has the greatest overlap with the words occurring in its context.
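
A minimal sketch of the overlap computation in a simplified Lesk variant, with invented dictionary definitions; a fuller implementation would also pool the definitions of the context words and filter stop words.

```python
# Sketch: simplified Lesk. Score each sense definition by its word overlap
# with the context and return the highest-scoring sense. Definitions below
# are invented, not taken from a real dictionary.
def lesk(context_words, sense_definitions):
    """sense_definitions: dict mapping sense name -> definition text."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        overlap = len(context & set(definition.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

senses = {
    "bank/FIN": "an institution that accepts deposits and lends money",
    "bank/GEO": "the sloping land beside a river or lake",
}
print(lesk("she sat on the bank of the river".split(), senses))  # bank/GEO
```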

11 Thesaurus-Based Disambiguation
- Idea: the semantic categories of the words in a context determine the semantic category of the context as a whole, and this category in turn determines which word senses are used.
- (Walker, 1987): each word is assigned one or more subject codes corresponding to its different meanings. For each subject code, we count the number of context words having the same subject code, and we select the subject code with the highest count.
- (Yarowsky, 1992): adapted the algorithm for words that do not occur in the thesaurus but that are very informative, e.g., Navratilova --> Sports.
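
A hedged sketch of the Walker-style counting step, with a toy subject-code table standing in for a real thesaurus; the codes and words are invented.

```python
# Sketch: pick the subject code of the target word that is most frequent
# among the subject codes of the context words. The table is illustrative.
from collections import Counter

subject_codes = {                 # toy thesaurus: word -> subject codes
    "interest": {"FINANCE", "EMOTION"},
    "loan": {"FINANCE"},
    "rate": {"FINANCE"},
    "hobby": {"EMOTION"},
}

def thesaurus_disambiguate(target, context):
    counts = Counter()
    for word in context:
        for code in subject_codes.get(word, ()):
            if code in subject_codes.get(target, ()):
                counts[code] += 1
    return counts.most_common(1)[0][0] if counts else None

print(thesaurus_disambiguate("interest", ["loan", "rate", "bank"]))  # FINANCE
```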

12 Disambiguation based on translations in a second-language corpus
- (Dagan & Itai, 1991)'s idea: words can be disambiguated by looking at how they are translated in other languages.
- Example: the word "interest" has two translations in German: 1) "Beteiligung" (legal share -- a 50% interest in the company) and 2) "Interesse" (attention, concern -- her interest in mathematics).
- To disambiguate an occurrence of "interest", we identify the phrase it occurs in, search a German corpus for instances of the translated phrase, and assign the meaning associated with the German translation of the word used in that phrase.
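
A toy sketch of the counting idea: for the phrase containing the ambiguous word, compare how often each candidate German translation appears in the corresponding German phrase pattern. The corpus counts and the phrase representation below are invented for illustration.

```python
# Sketch: disambiguate "interest" in "showed interest" by comparing the
# frequency of the candidate German phrases in a second-language corpus.
from collections import Counter

german_phrase_counts = Counter({
    ("zeigte", "Interesse"): 20,    # "showed interest" (attention, concern)
    ("zeigte", "Beteiligung"): 1,   # "showed interest" (legal share)
})

sense_to_translation = {
    "interest/ATTENTION": "Interesse",
    "interest/SHARE": "Beteiligung",
}

def disambiguate(german_companion):
    """Pick the sense whose German translation occurs most often in the phrase."""
    return max(sense_to_translation,
               key=lambda s: german_phrase_counts[(german_companion,
                                                   sense_to_translation[s])])

print(disambiguate("zeigte"))  # interest/ATTENTION
```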

13 One sense per discourse, one sense per collocation
- (Yarowsky, 1995)'s idea: there are constraints between different occurrences of an ambiguous word within a corpus that can be exploited for disambiguation:
  - One sense per discourse: the sense of a target word is highly consistent within any given document.
  - One sense per collocation: nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship.
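
As a small illustration (not Yarowsky's full bootstrapping algorithm), the one-sense-per-discourse constraint can be applied as a post-processing step that relabels every occurrence in a document with that document's majority sense. The function and data are illustrative.

```python
# Sketch: enforce one sense per discourse by relabeling each occurrence in a
# document with the document's majority predicted sense.
from collections import Counter

def one_sense_per_discourse(occurrences):
    """occurrences: list of (doc_id, predicted_sense); returns revised labels."""
    by_doc = {}
    for doc_id, sense in occurrences:
        by_doc.setdefault(doc_id, []).append(sense)
    majority = {doc_id: Counter(senses).most_common(1)[0][0]
                for doc_id, senses in by_doc.items()}
    return [(doc_id, majority[doc_id]) for doc_id, _ in occurrences]

print(one_sense_per_discourse([("d1", "plant/FACTORY"), ("d1", "plant/LIVING"),
                               ("d1", "plant/FACTORY"), ("d2", "plant/LIVING")]))
```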

14 Unsupervised Disambiguation
- Idea: disambiguate word senses without having recourse to supporting tools such as dictionaries and thesauri, and in the absence of labeled text. Simply cluster the contexts of an ambiguous word into a number of groups and discriminate between these groups without labeling them.
- (Schutze, 1998): the probabilistic model is the same Bayesian model as the one used for supervised classification, but the P(v_j | s_k) are estimated using the EM algorithm.
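
A compact sketch of the EM loop for unsupervised sense discrimination in the spirit of this approach: each sense is a Naïve Bayes bag-of-words component, and the E and M steps alternate between soft assignments of contexts to senses and re-estimation of the parameters. Random initialization, add-one smoothing, and a fixed number of iterations are simplifying assumptions made for illustration.

```python
# Sketch: EM for a mixture of Naive Bayes "sense" components over the contexts
# of an ambiguous word.
import math
import random
from collections import Counter

def em_sense_clusters(contexts, k, iterations=20, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for c in contexts for w in c})
    prior = [1.0 / k] * k                                    # P(s_k)
    word_prob = []
    for _ in range(k):                                       # random P(v_j | s_k)
        raw = {w: rng.uniform(1.0, 2.0) for w in vocab}
        z = sum(raw.values())
        word_prob.append({w: v / z for w, v in raw.items()})
    posteriors = []
    for _ in range(iterations):
        # E-step: P(s_k | context) for every context, via Bayes' rule in log space
        posteriors = []
        for c in contexts:
            logs = [math.log(prior[s]) + sum(math.log(word_prob[s][w]) for w in c)
                    for s in range(k)]
            m = max(logs)
            unnorm = [math.exp(l - m) for l in logs]
            z = sum(unnorm)
            posteriors.append([u / z for u in unnorm])
        # M-step: re-estimate P(s_k) and P(v_j | s_k) from the soft counts
        for s in range(k):
            prior[s] = sum(p[s] for p in posteriors) / len(contexts)
            counts = Counter()
            for c, p in zip(contexts, posteriors):
                for w in c:
                    counts[w] += p[s]
            total = sum(counts.values()) + len(vocab)        # add-one smoothing
            word_prob[s] = {w: (counts[w] + 1) / total for w in vocab}
    return posteriors

# Toy usage: contexts of one ambiguous word, clustered into k = 2 sense groups.
contexts = [["river", "water"], ["money", "loan"],
            ["water", "shore"], ["loan", "account"]]
for c, p in zip(contexts, em_sense_clusters(contexts, k=2)):
    print(c, [round(x, 3) for x in p])
```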