CS626-460: Language Technology for the Web/Natural Language Processing
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lecture 12: WSD approaches (contd.)


CONCEPTUAL DENSITY (EXAMPLE)

The jury(2) praised the administration(3) and operation(8) of Atlanta Police Department(1).

Step 1: Make a lattice of the nouns in the context, their senses and hypernyms.
Step 2: Compute the conceptual density of the resultant concepts (sub-hierarchies).
Step 3: Select the concept with the highest CD.
Step 4: Select the senses below the selected concept as the correct senses for the respective words.

[Figure: a lattice over the senses and hypernyms of "jury", "administration", "operation" and "police department" — including the concepts division, administrative_unit, committee, local department, government department, department and body — with the computed CD values marked on the candidate sub-hierarchies.]

CONCEPTUAL DENSITY FORMULA

Wish list:
- The conceptual distance between two words should be proportional to the length of the path between the two words in the hierarchical tree (WordNet).
- The conceptual distance between two words should be proportional to the depth of the concepts in the hierarchy.

Conceptual Distance(S1, S2) = [length of the path between S1 and S2] / [height of the lowest common ancestor of S1 and S2]

[Figure: an example sub-tree rooted at "entity", with "finance" and "location" beneath it and "money", "bank-1", "bank-2" as leaves; d marks the depth and h the height of the concept "location".]

The conceptual density formula uses:
- c = concept
- nhyp = mean number of hyponyms
- h = height of the sub-hierarchy
- m = number of senses of the word plus the senses of context words contained in the sub-hierarchy
- CD = conceptual density
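The CD formula itself appeared as an image in the original slide and did not survive transcription. A reconstruction consistent with the variables above (in the spirit of Agirre and Rigau's conceptual density; an assumption on my part, not a verbatim copy of the slide):

$$CD(c, m) = \frac{\sum_{i=0}^{m-1} nhyp^{i}}{\sum_{i=0}^{h-1} nhyp^{i}}$$

Intuitively, the numerator is the area a sub-hierarchy would need in order to contain the m relevant senses, and the denominator is the total area of a sub-hierarchy of height h, with nhyp acting as the mean branching factor.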

WSD USING RANDOM WALK ALGORITHM

Step 1: Add a vertex for each possible sense of each word in the text.
Step 2: Add weighted edges using definition-based semantic similarity (Lesk's method).
Step 3: Apply a graph-based ranking algorithm to find the score of each vertex (i.e., of each word sense).
Step 4: Select the vertex (sense) with the highest score.

[Figure: the sense graph for the words "bell", "ring", "church", "Sunday", with three candidate senses (S1-S3) per word, weighted edges a-l between them, and the resulting vertex scores, e.g., 0.46.]
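Putting the four steps together, here is a minimal runnable sketch (assuming NLTK with its WordNet data and the networkx package are installed; the function names are illustrative, and raw gloss-word overlap stands in for Lesk similarity):

```python
# A sketch of graph-based WSD: one vertex per sense, gloss-overlap edges,
# PageRank scores, highest-scoring sense per word wins.
import networkx as nx
from nltk.corpus import wordnet as wn

def lesk_overlap(syn1, syn2):
    """Edge weight: number of words shared by the two sense definitions."""
    d1 = set(syn1.definition().lower().split())
    d2 = set(syn2.definition().lower().split())
    return len(d1 & d2)

def random_walk_wsd(words):
    g = nx.Graph()
    senses = {w: wn.synsets(w) for w in words}
    # Step 1: one vertex per candidate sense of each word.
    for ss in senses.values():
        g.add_nodes_from(ss)
    # Step 2: weighted edges between senses of *different* words.
    for w1 in words:
        for w2 in words:
            if w1 >= w2:
                continue
            for s1 in senses[w1]:
                for s2 in senses[w2]:
                    overlap = lesk_overlap(s1, s2)
                    if overlap:
                        g.add_edge(s1, s2, weight=overlap)
    # Step 3: graph-based ranking (PageRank honours the edge weights).
    scores = nx.pagerank(g, weight="weight")
    # Step 4: for each word, keep its highest-scoring sense.
    return {w: max(ss, key=lambda s: scores.get(s, 0.0))
            for w, ss in senses.items() if ss}

print(random_walk_wsd(["bell", "ring", "church", "sunday"]))
```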

Page and Brin, 1998

"We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one."
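The quoted formula translates directly into a few lines of Python; the toy link structure and the iteration count below are made up for illustration:

```python
# A sketch of the iterative PageRank computation quoted above.
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it points to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    out_degree = {p: len(links[p]) for p in pages}  # C(p)
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # Pages T1..Tn that point to A each contribute PR(Ti)/C(Ti).
            incoming = sum(pr[t] / out_degree[t]
                           for t in pages if a in links[t])
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_web))
```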

Page and Brin 1998 (contd.)

"PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the "random surfer" will get bored and request another random page."

KB APPROACHES – COMPARISONS

Algorithm | Accuracy
WSD using Selectional Restrictions | 44% on the Brown corpus
Lesk's algorithm | 50-60% on short samples of "Pride and Prejudice" and some "news stories"
WSD using conceptual density | 54% on the Brown corpus
WSD using random walk algorithms | 54% on the SemCor corpus, which has a baseline accuracy of 37%
Walker's algorithm | 50% when tested on 10 highly polysemous English words

KB APPROACHES – OBSERVATIONS

Drawbacks of WSD using selectional restrictions:
- Needs an exhaustive knowledge base.

Drawbacks of overlap-based approaches:
- Dictionary definitions are generally very small.
- Dictionary entries rarely take into account the distributional constraints of different word senses (e.g., selectional preferences, kinds of prepositions, etc.); for instance, "cigarette" and "ash" may never co-occur in a dictionary definition.
- They suffer from the problem of sparse match.
- Proper nouns are not present in an MRD, so these approaches fail to capture the strong clues provided by proper nouns. E.g., in "Sachin Tendulkar plays cricket", "Sachin Tendulkar" is a strong indicator of the category "sports".

TOPICS TO BE COVERED

Knowledge-based approaches
- WSD using selectional preferences (or restrictions)
- Overlap-based approaches

Machine learning based approaches
- Supervised approaches
- Semi-supervised algorithms
- Unsupervised algorithms

Hybrid approaches

NAÏVE BAYES

ŝ = argmax_{s ∈ senses} Pr(s | V_w)

V_w is a feature vector consisting of:
- POS of w
- Semantic and syntactic features of w
- Collocations (set of words around it): typically the next word (+1), the next-to-next word (+2), -2, -1, and their POS tags
- Co-occurrence statistics (number of times a word occurs in the bag of words around w)

Applying Bayes' rule and the naive independence assumption:

ŝ = argmax_{s ∈ senses} Pr(s) · Π_{i=1..n} Pr(V_w^i | s)
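A minimal sketch of this decision rule, computed in log space to avoid numerical underflow; the toy probability tables are hypothetical, and estimating the real ones is covered on the next slide:

```python
# A sketch of the Naive Bayes decision rule: argmax over senses of
# Pr(s) * prod_i Pr(f_i | s), scored in log space.
import math

def best_sense(features, senses, prior, likelihood):
    def score(s):
        return math.log(prior[s]) + sum(
            math.log(likelihood[s].get(f, 1e-6))  # ad-hoc floor for unseen features
            for f in features)
    return max(senses, key=score)

prior = {"bank/finance": 0.7, "bank/river": 0.3}
likelihood = {
    "bank/finance": {"money": 0.3, "deposit": 0.2, "water": 0.01},
    "bank/river":   {"water": 0.4, "fish": 0.2, "money": 0.01},
}
print(best_sense(["money", "deposit"], list(prior), prior, likelihood))
```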

ESTIMATING PARAMETERS

The parameters in probabilistic WSD are Pr(s) and Pr(V_w^i | s). Senses are marked with respect to a sense repository (WordNet).

Pr(s) = count(s) / count(all senses)
Pr(V_w^i | s) = Pr(V_w^i, s) / Pr(s) = c(V_w^i, s, w) / c(s, w)
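A sketch of how these counts turn into the two parameter tables; the add-one smoothing is an assumption of mine, not something stated on the slide:

```python
# A sketch of parameter estimation for Naive Bayes WSD from a
# sense-tagged corpus.
from collections import Counter, defaultdict

def estimate(tagged_examples):
    """tagged_examples: list of (sense, feature_list) pairs for one target word."""
    sense_count = Counter()
    feat_count = defaultdict(Counter)
    for sense, feats in tagged_examples:
        sense_count[sense] += 1
        feat_count[sense].update(feats)
    total = sum(sense_count.values())
    prior = {s: c / total for s, c in sense_count.items()}        # Pr(s)
    vocab = {f for counter in feat_count.values() for f in counter}
    likelihood = {}
    for s in sense_count:
        n = sum(feat_count[s].values())
        # Pr(V_w^i | s), add-one smoothed over the feature vocabulary
        likelihood[s] = {f: (feat_count[s][f] + 1) / (n + len(vocab))
                         for f in vocab}
    return prior, likelihood

prior, likelihood = estimate([("bank/finance", ["money", "deposit"]),
                              ("bank/finance", ["money", "loan"]),
                              ("bank/river",   ["water", "fish"])])
print(prior["bank/finance"])   # 2/3
```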

DECISION LIST ALGORITHM

- Collect a large set of collocations for the ambiguous word.
- Calculate word-sense probability distributions for all such collocations.
- Calculate the log-likelihood ratio:

  Log( Pr(Sense-A | Collocation_i) / Pr(Sense-B | Collocation_i) )

  A higher log-likelihood means more predictive evidence.
- Order the collocations in a decision list, with the most predictive collocations ranked highest.

This assumes there are only two senses for the word; of course, it can easily be extended to 'k' senses.
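A sketch of list construction under the two-sense assumption; the collocation counts and the small smoothing constant eps are hypothetical:

```python
# A sketch of decision-list construction: rank collocations by the
# log-likelihood ratio defined above.
import math

def build_decision_list(colloc_counts, eps=0.1):
    """colloc_counts: {collocation: (count with Sense-A, count with Sense-B)}."""
    entries = []
    for colloc, (a, b) in colloc_counts.items():
        llr = abs(math.log((a + eps) / (b + eps)))  # smoothed log-likelihood ratio
        sense = "Sense-A" if a > b else "Sense-B"
        entries.append((llr, colloc, sense))
    entries.sort(reverse=True)  # most predictive collocations ranked highest
    return entries

counts = {"plant growth": (12, 0),   # Sense-A: living organism
          "plant cell":   (8, 1),
          "power plant":  (0, 15)}   # Sense-B: factory
print(build_decision_list(counts))
```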

DECISION LIST ALGORITHM (CONTD.)

Classification of a test sentence is based on the highest-ranking collocation found in the test sentence.

E.g., "… plucking flowers affects plant growth …"

[Table: sample training data and the resultant decision list for the ambiguous word "plant".]
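Classification is then a single scan down the ranked list; the example entries below are hypothetical scores, not values from the lecture:

```python
# A sketch of decision-list classification: the single highest-ranked
# collocation present in the test sentence decides the sense.
def classify(sentence, decision_list):
    for llr, colloc, sense in decision_list:  # list is sorted, best evidence first
        if colloc in sentence:
            return sense
    return None  # in practice, fall back to the most frequent sense

dlist = [(8.1, "plant growth", "living organism"),
         (7.4, "power plant", "factory")]
print(classify("... plucking flowers affects plant growth ...", dlist))
```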

EXEMPLAR-BASED WSD (k-NN)

An exemplar-based classifier is constructed for each word to be disambiguated.

Step 1: From each sense-marked sentence containing the ambiguous word, a training example is constructed using:
- POS of w as well as POS of neighboring words
- Local collocations
- Co-occurrence vector
- Morphological features
- Subject-verb syntactic dependencies

Step 2: Given a test sentence containing the ambiguous word, a test example is constructed in the same way.
Step 3: The test example is compared to all training examples and the k closest training examples are selected.
Step 4: The sense that is most prevalent amongst these k examples is selected as the correct sense.
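A minimal sketch of steps 2-4 over set-valued feature vectors; the overlap-based distance is one common choice, not necessarily the one used in the cited work, and the exemplars are hypothetical:

```python
# A sketch of exemplar-based (k-NN) disambiguation.
from collections import Counter

def knn_sense(test_feats, exemplars, k=3):
    """exemplars: list of (feature_set, sense) built from sense-marked sentences."""
    # Step 3: rank training examples by distance (here: features that differ).
    ranked = sorted(exemplars, key=lambda ex: len(test_feats ^ ex[0]))
    # Step 4: majority vote among the k closest exemplars.
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]

exemplars = [({"POS:NN", "colloc:river",   "morph:sg"}, "bank/river"),
             ({"POS:NN", "colloc:money",   "morph:sg"}, "bank/finance"),
             ({"POS:NN", "colloc:deposit", "morph:sg"}, "bank/finance")]
print(knn_sense({"POS:NN", "colloc:money", "morph:pl"}, exemplars))
```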

WSD USING SVMs

An SVM is a binary classifier which finds the hyperplane with the largest margin that separates the training examples into two classes. Since SVMs are binary classifiers, a separate classifier is built for each sense of the word.

Training phase: Using a tagged corpus, an SVM is trained for every sense of the word, using the following features:
- POS of w as well as POS of neighboring words
- Local collocations
- Co-occurrence vector
- Features based on syntactic relations (e.g., headword, POS of headword, voice of headword, etc.)

Testing phase: Given a test sentence, a test example is constructed using the above features and fed as input to each binary classifier. The correct sense is selected based on the labels returned by the classifiers.
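A sketch of this train/test setup using scikit-learn (my choice of toolkit, not one named in the lecture), with one binary LinearSVC per sense and the most confident classifier winning at test time:

```python
# A sketch of one-vs-rest SVM-based WSD with dict-valued features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

train = [({"pos_-1": "DT", "colloc": "river",   "head": "bank"}, "bank/river"),
         ({"pos_-1": "JJ", "colloc": "money",   "head": "bank"}, "bank/finance"),
         ({"pos_-1": "DT", "colloc": "deposit", "head": "bank"}, "bank/finance"),
         ({"pos_-1": "DT", "colloc": "fish",    "head": "bank"}, "bank/river")]

vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in train])
senses = sorted({sense for _, sense in train})

# Training phase: a separate binary classifier per sense.
classifiers = {}
for s in senses:
    y = [1 if label == s else 0 for _, label in train]
    classifiers[s] = LinearSVC().fit(X, y)

# Testing phase: build the same features, pick the sense whose
# classifier returns the largest margin.
test = vec.transform([{"pos_-1": "DT", "colloc": "money", "head": "bank"}])
print(max(senses, key=lambda s: classifiers[s].decision_function(test)[0]))
```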

SUPERVISED APPROACHES – COMPARISONS

Approach | Average Precision | Average Recall | Corpus | Average Baseline Accuracy
Naïve Bayes | 64.13% | Not reported | Senseval-3 – All Words Task | 60.90%
Decision Lists | 96% | Not applicable | Tested on a set of 12 highly polysemous English words | 63.9%
Exemplar-based disambiguation (k-NN) | 68.6% | Not reported | WSJ6 containing 191 content words | 63.7%
SVM | 72.4% | — | Senseval-3 – Lexical Sample Task (used for disambiguation of 57 words) | 55.2%

SUPERVISED APPROACHES – OBSERVATIONS

General comments:
- Use corpus evidence instead of relying on dictionary-defined senses.
- Can capture important clues provided by proper nouns, because proper nouns do appear in a corpus.

Naïve Bayes:
- Suffers from data sparseness.
- Since the scores are a product of probabilities, some weak features might pull down the overall score for a sense.
- A large number of parameters need to be trained.

Decision Lists:
- A word-specific classifier: a separate classifier needs to be trained for each word.
- Uses the single most predictive feature, which eliminates the drawback of Naïve Bayes.

SUPERVISED APPROACHES – OBSERVATIONS

Exemplar-based k-NN:
- A word-specific classifier.
- Will not work for unknown words which do not appear in the corpus.
- Uses a diverse set of features (including morphological features and subject-verb pairs).

SVM:
- A word-sense-specific classifier.
- Gives the highest improvement over the baseline accuracy.
- Uses a diverse set of features.

Some Indian thought (7th century A.D.)
