Word Sense Disambiguation (2002. 1. 18.) Kyung-Hee Sung. Foundations of Statistical NLP, Chapter 7.


Word Sense Disambiguation Kyung-Hee Sung Foundations of Statistical NLP Chapter 7

2 Contents
• Methodological Preliminaries
• Supervised Disambiguation
  – Bayesian classification / An information-theoretic approach
  – Disambiguation based on dictionaries, thesauri and bilingual dictionaries
  – One sense per discourse, one sense per collocation
• Unsupervised Disambiguation

3 Introduction
• Word sense disambiguation
  – Many words have several meanings or senses.
  – Many words have different usages. Ex) Butter may be used as a noun or as a verb.
  – The task of disambiguation is performed by looking at the context of the word's use.

4 Methodological Preliminaries (1/2)
• Supervised learning
  – We know the actual status (sense label) for each piece of data on which we train. (Learning from labeled data)
  – Classification task
• Unsupervised learning
  – We do not know the classification of the data in the training sample. (Learning from unlabeled data)
  – Clustering task

5 Methodological Preliminaries (2/2)
• Pseudowords: artificial ambiguous words
  – Used to test the performance of disambiguation algorithms. Ex) banana-door
• Performance estimation
  – Upper bound: human performance
  – Lower bound (baseline): the performance of the simplest possible algorithm, usually the assignment of all contexts to the most frequent sense.
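As a rough illustration of both ideas (the function names and the toy data format below are my own, not from the chapter), a pseudoword test set can be built by conflating two real words, and the lower bound can be computed by always choosing the most frequent sense:

import re
from collections import Counter

def make_pseudoword_corpus(sentences, w1, w2, pseudo="banana-door"):
    # Replace every occurrence of w1 or w2 with the pseudoword; the original
    # word serves as the gold "sense" label when evaluating a disambiguator.
    corpus = []
    for sent in sentences:
        for token in sent.split():
            if token in (w1, w2):
                corpus.append((re.sub(rf"\b{token}\b", pseudo, sent), token))
    return corpus

def most_frequent_sense_baseline(labeled_contexts):
    # Lower bound: assign every context to the most frequent sense and
    # report the resulting accuracy.
    senses = Counter(label for _, label in labeled_contexts)
    top_sense, top_count = senses.most_common(1)[0]
    return top_count / len(labeled_contexts)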

6 Supervised Disambiguation
• Bayesian classification
  – Treats the context of occurrence as a bag of words without structure.
  – Integrates information from all words in the context window.
• Information-theoretic approach
  – Looks at only one informative feature in the context, which may be sensitive to text structure.

7 Notations
  Symbol                    Meaning
  w                         an ambiguous word
  s_1, …, s_k, …, s_K       senses of the ambiguous word w (a semantic label of w)
  c_1, …, c_i, …, c_I       contexts of w in a corpus
  v_1, …, v_j, …, v_J       words used as contextual features for disambiguation

8 Bayesian classification (1/2)
• Assumption: we have a corpus where each use of an ambiguous word is labeled with its correct sense.
• Bayes decision rule: minimizes the probability of error
  – Decide s′ if P(s′ | c) > P(s_k | c) for all s_k ≠ s′
  – Using Bayes' rule: P(s_k | c) = P(c | s_k) P(s_k) / P(c)
  – P(c) is constant for all senses, so it is enough to compare P(c | s_k) P(s_k).

9 Bayesian classification (2/2)
• Naive Bayes independence assumption
  – All the structure and linear ordering of words within the context is ignored. → bag of words model
  – The presence of one word in the bag is independent of another.
  – P(c | s_k) = Π_{v_j in c} P(v_j | s_k)
• Decision rule for Naive Bayes
  – Decide s′ if s′ = argmax_{s_k} [ log P(s_k) + Σ_{v_j in c} log P(v_j | s_k) ]
  – P(v_j | s_k) = C(v_j, s_k) / C(s_k) and P(s_k) = C(s_k) / C(w) ← computed by MLE
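A minimal Python sketch of this Naive Bayes disambiguator follows; the class name and the (list of context words, sense) training format are my assumptions, and add-one smoothing is added so that unseen context words do not zero out a sense (the slide's plain MLE would):

import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    # Bag-of-words Naive Bayes sense classifier with add-one smoothing.

    def train(self, labeled_contexts):
        # labeled_contexts: list of (list_of_context_words, sense) pairs
        self.sense_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, sense in labeled_contexts:
            self.sense_counts[sense] += 1
            for v in words:
                self.word_counts[sense][v] += 1
                self.vocab.add(v)
        self.total = sum(self.sense_counts.values())

    def disambiguate(self, context_words):
        # Decide s' = argmax_k [ log P(s_k) + sum_j log P(v_j | s_k) ]
        best_sense, best_score = None, float("-inf")
        for sense, count in self.sense_counts.items():
            score = math.log(count / self.total)                  # log P(s_k)
            denom = sum(self.word_counts[sense].values()) + len(self.vocab)
            for v in context_words:
                score += math.log((self.word_counts[sense][v] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense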

10 An Information-theoretic approach (1/2)
• The Flip-Flop algorithm, applied to finding indicators for disambiguation between two senses:

  find random partition P = {P_1, P_2} of {t_1, …, t_m}
  while (improving) do
      find partition Q = {Q_1, Q_2} of {x_1, …, x_n} that maximizes I(P; Q)
      find partition P = {P_1, P_2} of {t_1, …, t_m} that maximizes I(P; Q)
  end

11 An Information-theoretic approach (2/2)
• To translate prendre (French) based on its object
  – Translations, {t_1, …, t_m} = { take, make, rise, speak }
  – Values of the indicator, {x_1, …, x_n} = { mesure, note, exemple, décision, parole }
  – Relations (English): take a measure, take notes, take an example, make a decision, make a speech, rise to speak

  1. Initial partition: P_1 = { take, rise }, P_2 = { make, speak }
  2. Find partition Q_1 = { mesure, note, exemple }, Q_2 = { décision, parole }
     ← This division gives us the most information for distinguishing P_1 from P_2 (maximizes I(P; Q)).
  3. Find partition P_1 = { take }, P_2 = { make, rise, speak }
     ← Always correct for take.
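A brute-force Python sketch of the Flip-Flop loop, under my own assumptions about the input (a list of (translation, indicator value) pairs extracted from an aligned corpus); it enumerates every binary split, which is only feasible for small sets like the prendre example:

from itertools import combinations
from math import log2
from collections import Counter

def mutual_information(pairs, P1, Q1):
    # I(P;Q) for the binary partitions (P1 vs. remaining translations,
    # Q1 vs. remaining indicator values), estimated from (t, x) pairs.
    n = len(pairs)
    joint = Counter((t in P1, x in Q1) for t, x in pairs)
    marg_t = Counter(t in P1 for t, _ in pairs)
    marg_x = Counter(x in Q1 for _, x in pairs)
    return sum((c / n) * log2(c * n / (marg_t[a] * marg_x[b]))
               for (a, b), c in joint.items())

def best_split(items, score):
    # Exhaustive search over non-trivial binary splits (fine for small sets).
    items = list(items)
    best, best_score = None, float("-inf")
    for r in range(1, len(items)):
        for subset in combinations(items, r):
            s = score(set(subset))
            if s > best_score:
                best, best_score = set(subset), s
    return best, best_score

def flip_flop(pairs, n_iter=10):
    translations = {t for t, _ in pairs}
    values = {x for _, x in pairs}
    P1 = {next(iter(translations))}          # arbitrary initial partition
    Q1, prev = None, float("-inf")
    for _ in range(n_iter):
        Q1, _ = best_split(values, lambda Q: mutual_information(pairs, P1, Q))
        P1, score = best_split(translations,
                               lambda P: mutual_information(pairs, P, Q1))
        if score <= prev:                    # stop when I(P;Q) stops improving
            break
        prev = score
    return P1, Q1

For realistic vocabularies the splits would be found with more efficient clustering rather than enumeration, but the alternating maximization of I(P; Q) is the same idea.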

12 Dictionary-based disambiguation (1/2)
• A word's dictionary definitions are good indicators for the senses they define.

  for all senses s_k of w do
      score(s_k) = overlap(D_k, U_{v_j in c} E_vj)   // number of common words
  end
  choose s′ s.t. s′ = argmax_{s_k} score(s_k)

  Symbol             Meaning
  D_1, …, D_K        dictionary definitions of the senses s_1, …, s_K
  s_j1, …, s_jL      senses of v_j
  E_vj               dictionary definitions of a word v_j: E_vj = U_l D_jl

13 Dictionary-based disambiguation (2/2)
• Simplified example: ash
  – The score is the number of words that are shared by the sense definition and the context.

  Sense                 Definition
  s_1  tree             a tree of the olive family
  s_2  burned stuff     the solid residue left when combustible material is burned

  Scores (s_1, s_2)     Context
  0, 1                  This cigar burns slowly and creates a stiff ash.
  1, 0                  The ash is one of the last trees to come into leaf.
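A compact sketch of the scoring loop above, with assumed data structures (a `definitions` dict of my own devising that maps each sense, and optionally each context word, to its definition text):

def dictionary_disambiguate(senses, context_words, definitions):
    # Lesk-style scoring: choose the sense whose dictionary definition D_k
    # shares the most words with the definitions/words of the context.
    context_bag = set()
    for v in context_words:
        # Fall back to the word itself if it has no stored definition.
        context_bag.update(definitions.get(v, v).lower().split())
    best_sense, best_score = None, -1
    for s in senses:
        overlap = len(set(definitions[s].lower().split()) & context_bag)
        if overlap > best_score:
            best_sense, best_score = s, overlap
    return best_sense

Note that the ash example relies on matching 'burns' to 'burned', so a practical version would also need stemming and stop-word removal, which this sketch omits.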

14 Thesaurus-based disambiguation (1/2)
• Semantic categories of the words in a context determine the semantic category of the context as a whole, and this category in turn determines which word senses are used.
• Each word is assigned one or more subject codes in the dictionary.
  – t(s_k): subject code of sense s_k.
  – The score is the number of context words compatible with the subject code of sense s_k.

  for all senses s_k of w do
      score(s_k) = Σ_{v_j in c} δ(t(s_k), v_j)
  end
  choose s′ s.t. s′ = argmax_{s_k} score(s_k)

15 Thesaurus-based disambiguation (2/2)

  Word       Sense              Roget category     Accuracy
  bass       [beis] musical     MUSIC              99%
             [bæs] fish         ANIMAL, INSECT     100%
  interest   curiosity          REASONING          88%
             advantage          INJUSTICE          34%
             financial          DEBT               90%
             share              PROPERTY           38%

• Self-interest (advantage) is not topic-specific.
• When a sense is spread out over several topics, the topic-based classification algorithm fails.
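A sketch of the subject-code scoring from the previous slide; the toy codes below are invented for illustration and merely stand in for Roget categories:

def thesaurus_disambiguate(senses, context_words, subject_code, word_codes):
    # Count the context words whose thesaurus subject codes include t(s_k).
    best_sense, best_score = None, -1
    for s in senses:
        code = subject_code[s]                                   # t(s_k)
        score = sum(1 for v in context_words
                    if code in word_codes.get(v, set()))         # δ(t(s_k), v_j)
        if score > best_score:
            best_sense, best_score = s, score
    return best_sense

# Hypothetical usage with invented subject codes:
senses = ["bass/music", "bass/fish"]
subject_code = {"bass/music": "MUSIC", "bass/fish": "ANIMAL"}
word_codes = {"guitar": {"MUSIC"}, "plays": {"MUSIC"}, "river": {"ANIMAL"}}
print(thesaurus_disambiguate(senses, "he plays bass guitar".split(),
                             subject_code, word_codes))          # bass/music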

16 Disambiguation based on translations in a second-language corpus
• In order to disambiguate an occurrence of interest in English (first language), we identify the phrase it occurs in and search a German (second language) corpus for instances of the phrase.
  – The English phrase 'showed interest': show (E) → 'zeigen' (G)
  – 'zeigen' (G) will only occur with Interesse (G), since 'legal shares' are usually not shown.
  – We can conclude that interest in the phrase 'to show interest' belongs to the sense 'attention, concern'.

17 One sense per discourse constraint
• The sense of a target word is highly consistent within any given document.
  – If the first occurrence of plant is a use of the sense 'living being', then later occurrences are likely to refer to living beings too.

  for all documents d_m do
      determine the majority sense s_k of w in d_m
      assign all occurrences of w in d_m to s_k
  end
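A direct translation of this loop into Python for a single document; the input format is my assumption (the tentative sense assigned to each occurrence of w in the document):

from collections import Counter

def one_sense_per_discourse(doc_occurrences):
    # doc_occurrences: list of tentatively assigned senses of w in one document.
    # Reassign every occurrence to the document's majority sense.
    majority, _ = Counter(doc_occurrences).most_common(1)[0]
    return [majority] * len(doc_occurrences)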

18 One sense per collocation constraint
• Word senses are strongly correlated with certain contextual features, like other words in the same phrasal unit.
  – Collocational features f_m are ranked according to the ratio P(s_k1 | f_m) / P(s_k2 | f_m), estimated from C(s_k, f_m) ← the number of occurrences of sense s_k with collocation f_m.
  – Relying on only the strongest feature has the advantage that no integration of different sources of evidence is necessary.
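A small sketch of ranking collocational features by that ratio; the smoothing constant alpha is my addition to avoid division by zero, and the input is assumed to be (feature, sense) pairs taken from a sense-tagged corpus:

import math
from collections import Counter

def rank_collocations(tagged_occurrences, s1, s2, alpha=0.1):
    # Rank features f by the (smoothed) log ratio P(s1 | f) / P(s2 | f);
    # the strongest feature present in a new context decides the sense.
    counts = Counter((feature, sense) for feature, sense in tagged_occurrences)
    features = {f for f, _ in counts}
    ranking = []
    for f in features:
        ratio = (counts[(f, s1)] + alpha) / (counts[(f, s2)] + alpha)
        ranking.append((abs(math.log(ratio)), f))
    return sorted(ranking, reverse=True)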

19 Unsupervised Disambiguation (1/3)
• There are situations in which even such a small amount of information is not available.
• Sense tagging requires that some characterization of the senses be provided. However, sense discrimination can be performed in a completely unsupervised fashion.
• Context-group discrimination: a completely unsupervised algorithm that clusters unlabeled occurrences.

20 An EM algorithm (1/2)
1. Initialize the parameters of the model μ (the P(v_j | s_k) and P(s_k)) randomly.
   Compute the log likelihood of the corpus C:
   l(C | μ) = Σ_i log Σ_k P(c_i | s_k) P(s_k)
2. While l(C | μ) is improving, repeat:
   (a) E-step. Estimate h_ik, the posterior probability that sense s_k generated context c_i:
       h_ik = P(c_i | s_k) P(s_k) / Σ_k′ P(c_i | s_k′) P(s_k′)
       where P(c_i | s_k) = Π_{v_j in c_i} P(v_j | s_k)  ← Naive Bayes assumption

21 An EM algorithm (2/2)
   (b) M-step. Re-estimate the parameters P(v_j | s_k) and P(s_k) by way of MLE:
       P(v_j | s_k) = Σ_{i: v_j in c_i} h_ik / Σ_j Σ_{i: v_j in c_i} h_ik
       P(s_k) = Σ_i h_ik / I
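A self-contained Python sketch of the whole EM loop; the fixed number of iterations and the small smoothing floor are my simplifications (the slides stop when l(C | μ) stops improving):

import math
import random
from collections import defaultdict

def em_sense_discrimination(contexts, K, n_iter=50, seed=0):
    # contexts: list of bags (lists) of context words; K: number of senses.
    rng = random.Random(seed)
    vocab = sorted({v for c in contexts for v in c})
    # 1. Random initialization of the model parameters mu.
    p_sense = [1.0 / K] * K
    p_word = []
    for _ in range(K):
        weights = {v: rng.random() + 0.01 for v in vocab}
        z = sum(weights.values())
        p_word.append({v: w / z for v, w in weights.items()})

    for _ in range(n_iter):
        # E-step: h_ik proportional to P(s_k) * prod_{v_j in c_i} P(v_j | s_k).
        h = []
        for c in contexts:
            log_post = [math.log(p_sense[k]) +
                        sum(math.log(p_word[k][v]) for v in c)
                        for k in range(K)]
            m = max(log_post)
            unnorm = [math.exp(lp - m) for lp in log_post]
            z = sum(unnorm)
            h.append([u / z for u in unnorm])
        # M-step: re-estimate P(v_j | s_k) and P(s_k) from the soft counts.
        for k in range(K):
            soft = defaultdict(float)
            for c, h_i in zip(contexts, h):
                for v in c:
                    soft[v] += h_i[k]
            total = sum(soft.values())
            # Tiny floor keeps log defined for words with zero soft count.
            p_word[k] = {v: (soft[v] + 1e-6) / (total + 1e-6 * len(vocab))
                         for v in vocab}
            p_sense[k] = sum(h_i[k] for h_i in h) / len(contexts)
    return p_sense, p_word, h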

22 Unsupervised Disambiguation (2/3)
• Unsupervised disambiguation can be easily adapted to produce distinctions between usage types.
  – Ex) The distinction between physical banks (in the context of bank robberies) and banks as abstract corporations (in the context of corporate mergers).
• The unsupervised algorithm splits dictionary senses into fine-grained contextual variants.
  – Usually, the induced clusters do not line up well with dictionary senses. Ex) 'lawsuit' → 'civil suit', 'criminal suit'

23 Unsupervised Disambiguation (3/3)
• Infrequent senses and senses that have few collocations are hard to isolate in unsupervised disambiguation.
• Results of the EM algorithm: mean accuracy and σ for ten experiments with different initial conditions, for the senses of suit (lawsuit / the suit you wear) and train (line of railroad cars / to teach).
  – The algorithm fails for words whose senses are topic-independent, such as 'to teach' for train.