Statistical NLP : Lecture 9 Word Sense Disambiguation

Slides:

Advertisements

Similar presentations

Computational Models of Discourse Analysis Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.

Advertisements

January 12, Statistical NLP: Lecture 2 Introduction to Statistical NLP.

Semantic similarity, vector space models and word- sense disambiguation Corpora and Statistical Methods Lecture 6.

Fall 2001 EE669: Natural Language Processing 1 Lecture 7: Word Sense Disambiguation (Chapter 7 of Manning and Schutze) Wen-Hsiang Lu ( 盧文祥 ) Department.

Semi-supervised learning and self-training LING 572 Fei Xia 02/14/06.

CS Word Sense Disambiguation. 2 Overview A problem for semantic attachment approaches: what happens when a given lexeme has multiple ‘meanings’?

CS 4705 Lecture 19 Word Sense Disambiguation. Overview Selectional restriction based approaches Robust techniques –Machine Learning Supervised Unsupervised.

An Overview of Text Mining Rebecca Hwa 4/25/2002 References M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37 th Annual Meeting of.

1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.

Taking the Kitchen Sink Seriously: An Ensemble Approach to Word Sense Disambiguation from Christopher Manning et al.

Co-training LING 572 Fei Xia 02/21/06. Overview Proposed by Blum and Mitchell (1998) Important work: –(Nigam and Ghani, 2000) –(Goldman and Zhou, 2000)

CS 4705 Word Sense Disambiguation. Overview Selectional restriction based approaches Robust techniques –Machine Learning Supervised Unsupervised –Dictionary-based.

Semi-Supervised Natural Language Learning Reading Group I set up a site at: ervised/

Maximum Entropy Model & Generalized Iterative Scaling Arindam Bose CS 621 – Artificial Intelligence 27 th August, 2007.

Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.

Albert Gatt Corpora and Statistical Methods Lecture 9.

1 Statistical NLP: Lecture 10 Lexical Acquisition.

Word Sense Disambiguation Hsu Ting-Wei Presented by Patty Liu.

Natural Language Processing word sense disambiguation Updated 1/12/2005.

Word Sense Disambiguation Many words have multiple meanings –E.g, river bank, financial bank Problem: Assign proper sense to each ambiguous word in text.

Text Classification, Active/Interactive learning.

Statistical NLP: Lecture 8 Statistical Inference: n-gram Models over Sparse Data (Ch 6)

1 Statistical NLP: Lecture 9 Word Sense Disambiguation.

W ORD S ENSE D ISAMBIGUATION By Mahmood Soltani Tehran University 2009/12/24 1.

SYMPOSIUM ON SEMANTICS IN SYSTEMS FOR TEXT PROCESSING September 22-24, Venice, Italy Combining Knowledge-based Methods and Supervised Learning for.

An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee

Word Sense Disambiguation Minho Kim Foundation of Statistical Natural Language Processing.

CS 4705 Lecture 19 Word Sense Disambiguation. Overview Selectional restriction based approaches Robust techniques –Machine Learning Supervised Unsupervised.

Word Sense Disambiguation Kyung-Hee Sung Foundations of Statistical NLP Chapter 7.

CS774. Markov Random Field : Theory and Application Lecture 19 Kyomin Jung KAIST Nov

1 Bayesian Methods. 2 Naïve Bayes New data point to classify: X=(x 1,x 2,…x m ) Strategy: – Calculate P(C i /X) for each class C i. – Select C i for which.

1 Statistical NLP: Lecture 7 Collocations. 2 Introduction 4 Collocations are characterized by limited compositionality. 4 Large overlap between the concepts.

Lecture 21 Computational Lexical Semantics Topics Features in NLTK III Computational Lexical Semantics Semantic Web USCReadings: NLTK book Chapter 10 Text.

Active learning Haidong Shi, Nanyi Zeng Nov,12,2008.

Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏

KNN & Naïve Bayes Hongning Wang Today’s lecture Instance-based classifiers – k nearest neighbors – Non-parametric learning algorithm Model-based.

Overview of Statistical NLP IR Group Meeting March 7, 2006.

Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel.

Naïve Bayes Classifier April 25 th, Classification Methods (1) Manual classification Used by Yahoo!, Looksmart, about.com, ODP Very accurate when.

Introduction to Information Retrieval Introduction to Information Retrieval Lecture 15: Text Classification & Naive Bayes 1.

KNN & Naïve Bayes Hongning Wang

Naive Bayes Classifier. REVIEW: Bayesian Methods Our focus this lecture: – Learning and classification methods based on probability theory. Bayes theorem.

Text Classification and Naïve Bayes Formalizing the Naïve Bayes Classifier.

Intro to NLP - J. Eisner1 Splitting Words a.k.a. “Word Sense Disambiguation”

Unsupervised Learning Part 2. Topics How to determine the K in K-means? Hierarchical clustering Soft clustering with Gaussian mixture models Expectation-Maximization.

Data Mining Practical Machine Learning Tools and Techniques

Semi-Supervised Clustering

Coarse-grained Word Sense Disambiguation

Statistical NLP: Lecture 7

Sentiment analysis algorithms and applications: A survey

Natural Language Processing (NLP)

CSC 594 Topics in AI – Natural Language Processing

Lecture 15: Text Classification & Naive Bayes

Statistical NLP: Lecture 13

Data Mining Lecture 11.

Lecture 21 Computational Lexical Semantics

CSE P573 Applications of Artificial Intelligence Bayesian Learning

Word Sense Disambiguation

Category-Based Pseudowords

Statistical NLP: Lecture 9

Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.

Revision (Part II) Ke Chen

LECTURE 23: INFORMATION THEORY REVIEW

Topic Models in Text Processing

Text Categorization Berlin Chen 2003 Reference:

Natural Language Processing (NLP)

Unsupervised Word Sense Disambiguation Using Lesk algorithm

Stance Classification of Ideological Debates

Statistical NLP: Lecture 10

Natural Language Processing (NLP)

Presentation transcript:

Statistical NLP : Lecture 9 Word Sense Disambiguation (Ch 7)

Overview of the Problem Problem: many words have different meanings or senses ==> there is ambiguity about how they are to be interpreted. Task: to determine which of the senses of an ambiguous word is invoked in a particular use of the word. This is done by looking at the context of the word’s use. Note: more often than not the different senses of a word are closely related.

Overview of our Discussion Methodology Supervised Disambiguation: based on a labeled training set. Dictionary-Based Disambiguation: based on lexical resources such as dictionaries and thesauri. Unsupervised Disambiguation: based on unlabeled corpora.

Methodological Preliminaries Supervised versus Unsupervised Learning: in supervised learning the sense label of a word occurrence is known. In unsupervised learning, it is not known. Pseudowords: used to generate artificial evaluation data for comparison and improvements of text-processing algorithms. Upper and Lower Bounds on Performance: used to find out how well an algorithm performs relative to the difficulty of the task.

Supervised Disambiguation Training set: exemplars where each occurrence of the ambiguous word w is annotated with a semantic label ==> Classification problem. Approaches: Bayesian Classification: the context of occurrence is treated as a bag of words without structure, but it integrates information from Information Theory: only looks at informative features in the context. These features may be sensitive to text structure. There are many more approaches (see Chapter 16 and the Senseval competition).

Supervised Disambiguation: Bayesian Classification (I) (Gale et al, 1992)’s idea: to look at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier does no feature selection. Instead, it combines the evidence from all features. Bayes decision rule: Decide s’ if P(s’|C) > P(sk|C) for sk ≠ s’. P(sk|C) is computed by Bayes’ Rule.

Supervised Disambiguation: Bayesian Classification (II) Naïve Bayes assumption: The Naïve Bayes assumption is incorrect in the context of text processing, but it is useful. Decision Rule for Naïve Bayes: Decide s’ if P(vj |sk) and P(sk) are computed via Maximum- Likelihood Estimation, perhaps with appropriate smoothing, from the labeled training corpus.

Supervised Disambiguation: An Information-Theoretic Approach (Brown et al., 1991)’s idea: to find a single contextual feature that reliably indicates whichsense of the ambiguous word is being used. The Flip-Flop algorithm is used to disambiguate between the different senses of a word using the mutual information as a measure. I(X;Y)=Σx∈XΣy∈Yp(x,y) log p(x,y)/(p(x)p(y)) The algorithm works by searching for a partition of senses that maximizes the mutual information. The algorithm stops when the increase becomes insignificant.

Dictionary-Based Disambiguation: Overview We will be looking at three different methods: Disambiguation based on sense definitions Thesaurus-Based Disambiguation Disambiguation based on translations in a second-language corpus Also, we will show how a careful examination of the distributional properties of senses can lead to significant improvements in disambiguation.

Disambiguation based on sense definitions (Lesk, 1986)’s idea: a word’s dictionary definitions are likely to be good indicators for the sense they define. Express the dictionary sub-definitions of the ambiguous word as sets of bag-of-words and the words occurring in the context of the ambiguous word as single bags-of-words emanating from its dictionary definitions (all pooled together). Disambiguate the ambiguous word by choosing the sub-definition of the ambiguous word that has the greatest overlap with the words occurring in its context.

Thesaurus-Based Disambiguation Idea: the semantic categories of the words in a context determine the semantic category of the context as a whole. This category, in turn, determines which word senses are used. (Walker, 87): each word is assigned one or more subject codes which corresponds to its different meanings. For each subject code, we count the number of words (from the context) having the same subject code. We select the subject code corresponding to the highest count. (Yarowski, 92): adapted the algorithm for words that do not occur in the thesaurus but that are very informative. E.g., Navratilova --> Sports

Disambiguation based on translations in a second-language corpus (Dagan & Itai, 91)’s idea: words can be disambiguated by looking at how they are translated in other languages. Example: the word “interest” has two translations in German: 1) “Beteiligung” (legal share--50% a interest in the company) 2) “Interesse” (attention, concern--her interest in Mathematics). To disambiguate the word “interest”, we identify the sentence it occurs in, search a German corpus for instances of the phrase, and assign the meaning associated with the German use of the word in that phrase.

One sense per discourse, one sense per collocation (Yarowsky, 1995)’s Idea: there are constraints between different occurrences of an ambiguous word within a corpus that can be exploited for disambiguation: One sense per discourse: The sense of a target word is highly consistent within any given document. One sense per collocation: nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship.

Unsupervised Disambiguation Idea: disambiguate word senses without having recourse to supporting tools such as dictionaries and thesauri and in the absence of labeled text. Simply cluster the contexts of an ambiguous word into a number of groups and discriminate between these groups without labeling them. (Schütze, 1998): The probabilistic model is the same Bayesian model as the one used for supervised classification, but the P(vj | sk) are estimated using the EM algorithm.