Word Sense Disambiguation
Ling571: Deep Processing Techniques for NLP
February 23, 2011
Word Sense Disambiguation: Robust Approaches
- Supervised learning approaches: Naïve Bayes
- Dictionary-based approaches
- Bootstrapping approaches: one sense per discourse/collocation
- Similarity-based approaches: thesaurus-based techniques
- Unsupervised approaches
- Why they work, why they don't
Disambiguation Features
Key question: what are the features?
- Part of speech: of the target word and its neighbors
- Morphologically simplified form
- Words in the neighborhood
  - How big a neighborhood? Is there a single optimal size? Why?
- (Possibly shallow) syntactic analysis, e.g. predicate-argument relations, modification, phrases
- Collocation vs. co-occurrence features
  - Collocation: words in a specific relation (predicate-argument, 1 word +/-)
  - Co-occurrence: bag of words
WSD Evaluation
- Ideally, end-to-end evaluation with the WSD component embedded in a system
  - Demonstrates the real impact of the technique, but is difficult, expensive, and application-specific
- Typically, intrinsic, sense-based evaluation: accuracy, precision, recall
  - SENSEVAL/SEMEVAL: all-words and lexical-sample tasks
- Baselines: most frequent sense, Lesk
- Topline: human inter-rater agreement: 75-80% on fine-grained senses; ~90% coarse-grained
Naïve Bayes Approach
- Supervised learning approach
- Input: feature vector f, plus a sense label s
- Best sense = most probable sense given the feature vector:
  ŝ = argmax_{s in S} P(s | f)
Naïve Bayes Approach
- Issue: data sparseness: the full feature vector is rarely seen in training
- "Naïve" assumption: features are independent given the sense:
  ŝ = argmax_{s in S} P(s) Π_j P(f_j | s)
Training the NB Classifier
- Estimate P(s), the prior:
  P(s_i) = count(s_i, w_j) / count(w_j)
- Estimate P(f_j | s):
  P(f_j | s) = count(f_j, s) / count(s)
- Issues:
  - Underflow => use log probabilities
  - Sparseness => smoothing (see the sketch below)
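A minimal sketch of both steps, assuming training instances arrive as (features, sense) pairs; the add-one smoothing and the plain-Python counting are illustrative choices, not a particular toolkit's API.

```python
import math
from collections import Counter, defaultdict

def train_nb(instances):
    """instances: iterable of (features, sense) pairs, where features
    is a list of strings. Collects the counts behind P(s) and P(f_j|s)."""
    sense_counts = Counter()            # count(s), for the prior P(s)
    feat_counts = defaultdict(Counter)  # count(f_j, s), for P(f_j | s)
    vocab = set()
    for features, sense in instances:
        sense_counts[sense] += 1
        for f in features:
            feat_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feat_counts, vocab

def classify_nb(features, sense_counts, feat_counts, vocab):
    """argmax_s log P(s) + sum_j log P(f_j | s): log probs avoid
    underflow; add-one smoothing handles unseen (f_j, s) pairs."""
    n = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / n)     # log P(s)
        total = sum(feat_counts[sense].values())
        for f in features:
            score += math.log((feat_counts[sense][f] + 1) /
                              (total + len(vocab)))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```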
Dictionary-Based Approach: the (Simplified) Lesk Algorithm
- "How to tell a pine cone from an ice cream cone"
- Compute the context "signature" of the word to disambiguate: the words in the surrounding sentence(s)
- Compare overlap against the dictionary entries for each sense
- Select the sense with the highest (non-stopword) overlap; a sketch follows
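A minimal sketch, assuming NLTK's WordNet glosses and examples as the dictionary and NLTK's English stopword list; both corpora must be downloaded first.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))

def simplified_lesk(word, sentence):
    """Score each WordNet sense of `word` by its non-stopword overlap
    with the sentence context; return the best-scoring sense."""
    context = {w.lower().strip(".,") for w in sentence.split()} - STOP
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        # Signature: the sense's gloss plus its example sentences
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = len((signature - STOP) & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```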
Applying Lesk
"The bank can guarantee deposits will eventually cover future tuition costs because it invests in mortgage securities."
- Bank_1 (financial institution): overlap = 2 (e.g. "deposits", "mortgage")
- Bank_2 (sloping river edge): overlap = 0
=> Select Bank_1
Improving Lesk
- The overlap score weights all words equally (excluding stopwords)
- But not all words are equally informative:
  - Overlap with unusual/specific words: strong evidence
  - Overlap with common/non-specific words: weak evidence
- Employ corpus weighting: IDF, inverse document frequency
  idf_i = log(N_doc / nd_i)
  where N_doc = number of documents and nd_i = number of documents containing word i
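One way to realize this, assuming a document collection is available as a list of token sets: replace the raw overlap count from the Lesk sketch above with the summed idf of the shared words.

```python
import math

def idf_weights(docs):
    """docs: list of token sets. Returns idf_i = log(N_doc / nd_i)."""
    n_doc = len(docs)
    df = {}
    for doc in docs:
        for w in doc:
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n_doc / nd) for w, nd in df.items()}

def weighted_overlap(signature, context, idf):
    """Sum idf over shared words: rare, specific words now count
    for more than common, non-specific ones."""
    return sum(idf.get(w, 0.0) for w in signature & context)
```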
Minimally Supervised WSD: Yarowsky's Algorithm (1995)
- Bootstrapping approach: use a small labeled seed set to iteratively train
- Builds on 2 key insights:
  - One sense per discourse: a word appearing multiple times in a text keeps the same sense
    - e.g., in a corpus of bass instances, each document uses a single sense
  - One sense per collocation: local phrases select a single sense
    - fish -> bass_1; play -> bass_2
Yarowsky's Algorithm: Training Decision Lists
1. Pick seed instances and tag them
2. Find collocations: word to the left, word to the right, word within +/-k
   (A) Calculate informativeness of each collocation on the tagged set; order the rules
   (B) Tag new instances with the rules
   (C) Apply one sense per discourse
   (D) If instances are still unlabeled, go to 2
3. Apply one sense per discourse
Disambiguation: apply the first rule that matches (see the sketch below)
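A compressed sketch of the bootstrapping loop, assuming each instance is a set of collocation features and seeds is a partial labeling (instance index -> sense). The smoothed log-likelihood-ratio scoring and the 0.1 and 2.0 constants are illustrative, and the one-sense-per-discourse step is omitted for brevity.

```python
import math
from collections import defaultdict

def rule_scores(instances, labels):
    """Score each rule (feature -> sense) on the currently tagged set by
    a smoothed log-likelihood ratio; return rules, most reliable first."""
    counts = defaultdict(lambda: defaultdict(float))
    for i, feats in enumerate(instances):
        if i not in labels:
            continue
        for f in feats:
            counts[f][labels[i]] += 1
    rules = []
    for f, by_sense in counts.items():
        senses = sorted(by_sense, key=by_sense.get, reverse=True)
        top = by_sense[senses[0]]
        rest = sum(by_sense[s] for s in senses[1:])
        score = math.log((top + 0.1) / (rest + 0.1))  # smoothed LLR
        rules.append((score, f, senses[0]))
    return sorted(rules, reverse=True)

def bootstrap(instances, seeds, threshold=2.0, max_iters=10):
    """Grow the labeled set out from the seeds until nothing changes."""
    labels = dict(seeds)
    for _ in range(max_iters):
        rules = rule_scores(instances, labels)
        changed = False
        for i, feats in enumerate(instances):
            if i in labels:
                continue
            for score, f, sense in rules:  # first matching rule wins
                if score < threshold:
                    break                  # remaining rules score lower
                if f in feats:
                    labels[i] = sense
                    changed = True
                    break
        if not changed:
            break
    return labels, rules
```

At test time, disambiguation is the same first-rule-matched scan over the final sorted rule list.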
Yarowsky Decision List [figure: example decision-list rules]
Iterative Updating [figure: labeled set growing from the seeds over iterations]
Label the First Use of "Plant"
Biological example: "There are more kinds of plants and animals in the rainforests than anywhere else on Earth. Over half of the millions of known species of plants and animals live in the rainforest. Many are found nowhere else. There are even plants and animals in the rainforest that we have not yet discovered."
Industrial example: "The Paulus company was founded in ... Since those days the product range has been the subject of constant expansions and is brought up continuously to correspond with the state of the art. We're engineering, manufacturing and commissioning world-wide ready-to-run plants packed with our comprehensive know-how. Our product range includes pneumatic conveying systems for carbon, carbide, sand, lime and many others. We use reagent injection in molten metal for the ..."
Sense Choice with Collocational Decision Lists
- Create an initial decision list, rules ordered by informativeness (log-likelihood ratio)
- Check nearby word groups (collocations):
  - Biology sense: "animal" among the nearby words
  - Industry sense: "manufacturing" among the nearby words
- Result: correct selection; 95% accuracy on pairwise (two-sense) tasks
Word Similarity
- Synonymy: true propositional substitutability is rare and slippery
- Word similarity (semantic distance): a looser, more flexible notion
  - Appropriate to applications: IR, summarization, MT, essay scoring
  - Don't need a binary +/- synonym decision; want terms/documents with high similarity
- Similarity differs from relatedness
- Approaches: thesaurus-based; distributional
Thesaurus-Based Techniques
- Key idea: shorter path length in the thesaurus => smaller semantic distance
  - Words are similar to their parents and siblings in the tree; further away, less similar
- pathlen(c_1, c_2) = number of edges in the shortest route between the nodes
- sim_path(c_1, c_2) = -log pathlen(c_1, c_2) [Leacock & Chodorow]
- Problem 1: we rarely know which sense, and thus which node, is intended
  - Solution: assume the most similar pair of senses and estimate (see the sketch below):
    wordsim(w_1, w_2) = max over senses c_1 of w_1, c_2 of w_2 of sim(c_1, c_2)
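A small sketch over NLTK's WordNet graph; shortest_path_distance counts edges, and the max over sense pairs implements the wordsim estimate above. (NLTK's built-in path_similarity instead uses 1/(pathlen + 1); the -log form here follows the slide.)

```python
import math
from nltk.corpus import wordnet as wn

def sim_path(c1, c2):
    """sim_path(c1, c2) = -log pathlen(c1, c2), pathlen in edges."""
    dist = c1.shortest_path_distance(c2)  # None if no connecting path
    if dist is None:
        return float("-inf")
    return -math.log(max(dist, 1))        # guard against identical synsets

def wordsim(w1, w2):
    """Max over sense pairs, since the intended senses are unknown."""
    return max((sim_path(c1, c2)
                for c1 in wn.synsets(w1)
                for c2 in wn.synsets(w2)),
               default=float("-inf"))

# e.g. wordsim("nickel", "money") vs. wordsim("nickel", "standard")
```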
Path Length Problem
- Links in WordNet are not uniform: a single edge can span very different semantic distances
- Example: nickel->money and nickel->standard are both at distance 5, yet the pairs are not equally similar
Resnik's Similarity Measure
- Solution 1: build a position-specific similarity measure
  - Not general
- Solution 2: add corpus information: an information-content measure
  - P(c): probability that a randomly selected word is an instance of concept c
    P(c) = (Σ_{w in words(c)} count(w)) / N
  - words(c): words subsumed by concept c; N: number of words in the corpus
Resnik's Similarity Measure
- Information content of a node: IC(c) = -log P(c)
- Least common subsumer (LCS): the lowest node in the hierarchy that subsumes both nodes
- Similarity measure: sim_resnik(c_1, c_2) = -log P(LCS(c_1, c_2)) = IC(LCS(c_1, c_2))
- Issue: what matters is arguably not the content of the LCS itself, but the difference between the nodes' content and that of their LCS
  - This motivates measures (e.g., Lin) that compare the IC of the nodes to the IC of the LCS
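NLTK ships Resnik's measure (and Lin's) along with precomputed information-content tables; a brief sketch, assuming the wordnet and wordnet_ic data are downloaded. The specific synset names are illustrative.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content estimated from the Brown corpus: IC(c) = -log P(c)
brown_ic = wordnet_ic.ic("ic-brown.dat")

c1 = wn.synset("nickel.n.02")  # intended here as the coin sense
c2 = wn.synset("money.n.01")

# sim_resnik(c1, c2) = IC(LCS(c1, c2))
print(c1.res_similarity(c2, brown_ic))

# Lin's measure compares the nodes' own IC to that of the LCS:
# 2 * IC(LCS) / (IC(c1) + IC(c2))
print(c1.lin_similarity(c2, brown_ic))
```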