Semantic similarity, vector space models and word-sense disambiguation
Corpora and Statistical Methods, Lecture 6


Word sense disambiguation Part 2

What are word senses?
- Cognitive definition: a mental representation of meaning; used in psychological experiments; relies on introspection (notoriously deceptive)
- Dictionary-based definition: adopt the sense definitions given in a dictionary; the most frequently used resource is WordNet

WordNet
- Taxonomic representation of words ("concepts")
- Each word belongs to a synset, which contains near-synonyms
- Each synset has a gloss
- Words with multiple senses (polysemy) belong to multiple synsets
- Synsets are organised by hyponymy (IS-A) relations
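A minimal sketch of querying WordNet through NLTK (assuming the nltk package and its wordnet data are installed): list the noun synsets of interest together with their member lemmas, glosses and hypernyms.

```python
# Minimal sketch using NLTK's WordNet interface (assumes nltk is installed
# and the wordnet data has been fetched, e.g. via nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

# All noun synsets that the word "interest" belongs to (one per sense).
for synset in wn.synsets('interest', pos=wn.NOUN):
    # Each synset groups near-synonyms (lemmas) and carries a gloss.
    lemmas = [lemma.name() for lemma in synset.lemmas()]
    print(synset.name(), lemmas)
    print('  gloss:', synset.definition())
    # Synsets are organised by hyponymy (IS-A): hypernyms are the parents.
    print('  hypernyms:', [h.name() for h in synset.hypernyms()])
```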

How many senses? Example: interest
- pay 3% interest on a loan
- showed interest in something
- purchased an interest in a company
- the national interest...
- have X's best interest at heart
- have an interest in word senses
- The economy is run by business interests

WordNet entry for interest (noun)
1. a sense of concern with and curiosity about someone or something... (Synonym: involvement)
2. the power of attracting or holding one's interest... (Synonym: interestingness)
3. a reason for wanting something done (Synonym: sake)
4. a fixed charge for borrowing money...
5. a diversion that occupies one's time and thoughts... (Synonym: pastime)
6. a right or legal share of something; a financial involvement with something (Synonym: stake)
7. (usually plural) a social group whose members control some field of activity and who have common aims (Synonym: interest group)

Some issues
- Are all these really distinct senses? Is WordNet too fine-grained?
- Would native speakers distinguish all of these as different?
- Cf. the distinction between sense ambiguity and underspecification (vagueness): one could argue that there are fewer senses, but that these are underspecified out of context

Translation equivalence
- Many WSD applications rely on translation equivalence
- Given a parallel corpus (e.g. English-German): if word w in English has n translations in German, then each translation represents a sense
- e.g. German translations of interest:
  - Zins: financial charge (WordNet sense 4)
  - Anteil: stake in a company (WordNet sense 6)
  - Interesse: all other senses

Some terminology
- WSD task: given an ambiguous word, find the intended sense in context
- Sense tagging: the task of labelling words as belonging to one sense or another; needs some a priori characterisation of the senses of each relevant word
- Discrimination: distinguishes between occurrences of a word based on sense, without necessarily labelling them explicitly

Some more terminology
Two types of WSD task:
- Lexical sample task: focuses on disambiguating a small set of target words, using an inventory of the senses of those words
- All-words task: focuses on entire texts and a lexicon, where every word in the text has to be disambiguated
- Serious data sparseness problems!

Approaches to WSD
- All methods rely on training data. Basic idea: given word w in context c, learn how to predict sense s of w based on various features of w
- Supervised learning: training data is labelled with the correct senses; can do sense tagging
- Unsupervised learning: training data is unlabelled, but many other knowledge sources are used; cannot do sense tagging, since this requires a priori senses

Supervised learning
- Words in the training data are labelled with their senses:
  - She pays 3% interest/INTEREST-MONEY on the loan.
  - He showed a lot of interest/INTEREST-CURIOSITY in the painting.
- Similar to POS tagging: given a corpus tagged with senses, define features that indicate one sense over another, and learn a model that predicts the correct sense given the features

Features (e.g. plant)
- Neighbouring words: plant life, manufacturing plant, assembly plant, plant closure, plant species
- Content words in a larger window: animal, equipment, employee, automatic
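A hedged sketch of extracting the two feature types above (immediately neighbouring words, and content words in a wider window) for one occurrence of a target word; the window size and stop-word list are arbitrary choices for illustration, not prescribed by the lecture.

```python
# Illustrative sketch: collocational and wider-window features for one
# occurrence of a target word. Window size and stop list are arbitrary.
STOP = {'the', 'a', 'an', 'of', 'to', 'in', 'and', 'is', 'was'}

def extract_features(tokens, i, window=10):
    """tokens: tokenised context; i: index of the target word."""
    feats = {}
    # Immediately neighbouring words (e.g. "plant life", "assembly plant").
    if i > 0:
        feats['prev_word=' + tokens[i - 1].lower()] = 1
    if i + 1 < len(tokens):
        feats['next_word=' + tokens[i + 1].lower()] = 1
    # Content words in a larger window (bag of words, position ignored).
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        w = tokens[j].lower()
        if j != i and w.isalpha() and w not in STOP:
            feats['ctx=' + w] = 1
    return feats

tokens = 'union representatives visited the manufacturing plant after the closure'.split()
print(extract_features(tokens, tokens.index('plant')))
```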

Other features
- Syntactically related words, e.g. object, subject...
- Topic of the text: is it about SPORT? POLITICS?
- Part-of-speech tag, and surrounding part-of-speech tags

Some principles proposed (Yarowsky 1995)
- One sense per discourse: typically, all occurrences of a word will have the same sense in the same stretch of discourse (e.g. the same document)
- One sense per collocation: nearby words provide clues as to the sense, depending on the distance and syntactic relationship; e.g. plant life: all (?) occurrences of plant+life will indicate the botanic sense of plant

Training data
- SENSEVAL: shared-task competition datasets available for WSD, among other things; annotated corpora in many languages
- Pseudo-words: create a training corpus by artificially conflating words, e.g. replacing all occurrences of man and hammer with man-hammer; an easy way to create training data
- Multilingual parallel corpora: translated texts aligned at the sentence level; the translation indicates the sense
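A small sketch of the pseudo-word idea: conflate every occurrence of man and hammer into man-hammer while keeping the original word as the gold "sense" label. The tokenisation here is deliberately naive.

```python
# Sketch of pseudo-word construction: every occurrence of "man" or "hammer"
# is replaced by the conflated token "man-hammer", while the original word
# is kept as the gold label for evaluation.
def make_pseudoword_corpus(sentences, word1='man', word2='hammer'):
    pseudo = word1 + '-' + word2
    examples = []
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok in (word1, word2):
                conflated = tokens[:i] + [pseudo] + tokens[i + 1:]
                # (conflated context, position of target, gold label)
                examples.append((conflated, i, tok))
    return examples

corpus = ['the man walked his dog', 'she hit the nail with a hammer']
for context, i, gold in make_pseudoword_corpus(corpus):
    print(' '.join(context), '->', gold)
```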

Data representation
- Example sentence: An electric guitar and bass player stand off to one side...
- Target word: bass
- Possible senses: fish, musical instrument...
- Relevant features are represented as vectors; see the sketch below
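The example vector on the original slide is not reproduced in the transcript; the sketch below shows one plausible encoding of the bass example as a binary vector over a small feature inventory that is invented purely for illustration.

```python
# Illustrative encoding of the "bass" example as a binary feature vector.
# The feature inventory below is invented for the example; a real system
# would derive it from the training data.
FEATURES = ['prev=and', 'next=player', 'ctx:guitar', 'ctx:electric',
            'ctx:fish', 'ctx:river', 'pos=NN']

def to_vector(active_features):
    return [1 if f in active_features else 0 for f in FEATURES]

# "An electric guitar and bass player stand off to one side..."
active = {'prev=and', 'next=player', 'ctx:guitar', 'ctx:electric', 'pos=NN'}
print(to_vector(active))   # [1, 1, 1, 1, 0, 0, 1]
```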

Supervised methods

Naïve Bayes classifier
- Identify the features (F): e.g. surrounding words, and other cues apart from the surrounding context
- Combine the evidence from all features
- Decision rule: decide on sense s' iff s' = argmax_s P(s|F)
- Example: drug, with F = words in context; medication sense: price, prescription, pharmaceutical; illegal-substance sense: alcohol, illicit, paraphernalia

Using Bayes' rule
- We usually don't know P(s_k|F) directly, but by Bayes' rule P(s_k|F) = P(F|s_k) P(s_k) / P(F), and from training data we can compute P(s_k) (the prior) and P(F|s_k)
- P(F) can be eliminated because it is constant for all senses in the corpus, so we choose the sense maximising P(F|s_k) P(s_k)

The independence assumption
- It's called "naïve" because it assumes P(F|s_k) = Π_j P(f_j|s_k), i.e. all features are assumed to be independent given the sense
- Obviously, this is often not true: e.g. finding illicit in the context of drug may not be independent of finding pusher (cf. our discussion of collocations!). Also, topics often constrain word choice.

Training the naïve Bayes classifier
- We need to estimate from the sense-tagged corpus: P(s) for all senses s of w, and P(f|s) for all features f
- Using relative frequencies: P(s_k) = C(s_k) / C(w) and P(f_j|s_k) = C(f_j, s_k) / C(s_k)
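A minimal Naïve Bayes WSD sketch combining the training estimates and the decision rule above. Add-one smoothing and log-space scoring are assumptions added here to keep the example robust; the drug example mirrors the earlier slide.

```python
# Minimal Naive Bayes WSD sketch. Training data is a list of
# (bag_of_context_words, sense) pairs. Add-one smoothing is an assumption
# added to avoid zero probabilities; the slides do not specify it.
import math
from collections import Counter, defaultdict

def train(examples):
    sense_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for context, sense in examples:
        sense_counts[sense] += 1
        for w in context:
            feat_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, feat_counts, vocab

def disambiguate(context, sense_counts, feat_counts, vocab):
    total = sum(sense_counts.values())
    best, best_score = None, float('-inf')
    for sense, count in sense_counts.items():
        # log P(s) + sum_j log P(f_j | s), with add-one smoothing
        score = math.log(count / total)
        denom = sum(feat_counts[sense].values()) + len(vocab)
        for w in context:
            score += math.log((feat_counts[sense][w] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

examples = [
    (['price', 'prescription', 'pharmaceutical'], 'medication'),
    (['alcohol', 'illicit', 'paraphernalia'], 'illegal_substance'),
]
model = train(examples)
print(disambiguate(['illicit', 'alcohol'], *model))   # illegal_substance
```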

Information-theoretic measures
- Find the single most informative feature to predict a sense
- E.g. using a parallel corpus: prendre (FR) can translate as take or make; prendre une décision: make a decision; prendre une mesure: take a measure [to...]
- The informative feature in this case is the direct object: mesure indicates take, décision indicates make
- Problem: we need to identify which value of the feature indicates which sense

Brown et al.'s algorithm
- Given: translations T of word w
- Given: values X of a useful feature (e.g. mesure, décision as values of the direct object)
- Step 1: start with a random partition P of T
- Step 2: while improving, do:
  - create a partition Q of X that maximises I(P;Q)
  - find a partition P of T that maximises I(P;Q)
- Comment: relies on mutual information to find clusters of translations that map to clusters of feature values
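A small sketch of the quantity being maximised, the mutual information I(P;Q), computed from joint counts of translation cluster vs. feature-value cluster. The observation counts below are invented for the prendre example.

```python
# Sketch of the quantity the algorithm maximises: the mutual information
# I(P;Q) between a partition P of the translations and a partition Q of
# the feature values, estimated from joint counts. Counts are invented.
import math
from collections import Counter

def mutual_information(pairs):
    """pairs: list of (p_cluster, q_cluster) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    p_marg = Counter(p for p, _ in pairs)
    q_marg = Counter(q for _, q in pairs)
    mi = 0.0
    for (p, q), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((p_marg[p] / n) * (q_marg[q] / n)))
    return mi

# Translation clusters {take} vs {make} against direct objects mesure/décision
observations = [('take', 'mesure')] * 40 + [('make', 'décision')] * 35 + \
               [('take', 'décision')] * 5
print(mutual_information(observations))
```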

Using dictionaries and thesauri
- Lesk (1986): one of the first to exploit dictionary definitions; the definition corresponding to a sense can contain words which are good indicators for that sense
- Method:
  1. Given: ambiguous word w with senses s_1...s_n and glosses g_1...g_n
  2. Given: the word w in context c
  3. Compute the overlap between c and each gloss
  4. Select the maximally matching sense
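A minimal simplified-Lesk sketch. It substitutes WordNet glosses (via NLTK) for the machine-readable dictionary Lesk originally used, and counts raw word overlap without any stop-word filtering; NLTK also ships its own implementation in nltk.wsd.lesk.

```python
# Simplified Lesk sketch: pick the sense whose WordNet gloss shares the most
# words with the context. Lesk (1986) used a machine-readable dictionary;
# WordNet glosses are substituted here for convenience.
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_tokens):
    context = set(t.lower() for t in context_tokens)
    best_sense, best_overlap = None, -1
    for synset in wn.synsets(word):
        gloss = set(synset.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = synset, overlap
    return best_sense

sentence = 'the bank can guarantee deposits will eventually cover future tuition costs'.split()
print(simplified_lesk('bank', sentence))
```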

Expanding a dictionary
- Problem with Lesk: dictionary definitions often don't contain sufficient information, and not all words in dictionary definitions are good informants
- Solution: use a thesaurus with subject/topic categories, e.g. Roget's Thesaurus

Using topic categories
- Suppose every sense s_k of word w has a subject/topic t_k
- w can be disambiguated by identifying the words related to t_k in the thesaurus
- Problems: general-purpose thesauri don't list domain-specific topics, and several potentially useful words can be left out; e.g. in "... Navratilova plays great tennis ..." the proper name is a useful indicator of the topic SPORT

Expanding a thesaurus: Yarowsky
1. Given: context c and topic t
2. For all contexts and topics, compute p(c|t) using Naïve Bayes, by comparing the words pertaining to t in the thesaurus with the words in c; if p(c|t) > α, assign topic t to context c
3. For all words in the vocabulary, update the list of contexts in which the word occurs: assign topic t to each word in c
4. Finally, compute p(w|t) for all w in the vocabulary; this gives the "strength of association" of w with t

Yarowsky 1992: some results

Word      Sense          Roget topic    Accuracy
star      space object   UNIVERSE       96%
star      celebrity      ENTERTAINER    95%
star      shape          INSIGNIA       82%
sentence  punishment     LEGAL_ACTION   99%
sentence  set of words   GRAMMAR        98%

Bootstrapping
- Yarowsky (1995) proposed the one-sense-per-discourse and one-sense-per-collocation constraints
- Yarowsky's method:
  1. Select the strongest collocational feature in a specific context
  2. Disambiguate based only on this feature (similar to the information-theoretic method discussed earlier)

One sense per collocation
1. For each sense s of w, initialise F, the set of collocations found in s's dictionary definition
2. One sense per collocation: identify the set of contexts containing collocates of s; for each sense s of w, update F to contain those collocates f such that P(s|f) / P(s'|f) > α for all s' ≠ s (where α is a threshold)
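A hedged sketch of the collocate-selection step under the reconstruction above: a collocation is kept for sense s only if the estimated ratio P(s|f)/P(s'|f) clears the threshold α for every competing sense. The smoothing constant and the example counts are invented.

```python
# Sketch of collocation selection: from occurrences labelled by the current
# seed set, keep collocations f whose sense ratio P(s|f)/P(s'|f) exceeds a
# threshold alpha for every other sense s'. Counts and smoothing are invented.
from collections import Counter, defaultdict

def reliable_collocations(labelled, senses, alpha=5.0):
    """labelled: list of (collocation, sense) pairs from seed-labelled data."""
    counts = defaultdict(Counter)      # collocation -> Counter over senses
    reliable = {s: set() for s in senses}
    for colloc, sense in labelled:
        counts[colloc][sense] += 1
    for colloc, by_sense in counts.items():
        total = sum(by_sense.values()) + 0.1 * len(senses)
        prob = {s: (by_sense[s] + 0.1) / total for s in senses}
        for s in senses:
            if all(prob[s] / prob[o] > alpha for o in senses if o != s):
                reliable[s].add(colloc)
    return reliable

data = [('plant life', 'LIVING')] * 20 + [('manufacturing plant', 'FACTORY')] * 18 + \
       [('plant closure', 'FACTORY')] * 3 + [('plant closure', 'LIVING')] * 1
print(reliable_collocations(data, ['LIVING', 'FACTORY']))
```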

One sense per discourse
3. For each document: find the majority sense of w out of those found in the previous step, and assign all occurrences of w that majority sense
- This is implemented as a post-processing step; it reduces the error rate by ca. 27%
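A small sketch of this post-processing step: within each document, every occurrence of the target word is relabelled with the document's majority sense. The per-occurrence labels below are hypothetical outputs of the collocation step.

```python
# One-sense-per-discourse post-processing: relabel every occurrence of the
# target word in a document with that document's majority sense.
from collections import Counter

def one_sense_per_discourse(doc_labels):
    """doc_labels: dict mapping doc_id -> list of sense labels for w."""
    relabelled = {}
    for doc_id, labels in doc_labels.items():
        majority = Counter(labels).most_common(1)[0][0]
        relabelled[doc_id] = [majority] * len(labels)
    return relabelled

labels = {'doc1': ['PLANT-FACTORY', 'PLANT-FACTORY', 'PLANT-LIVING'],
          'doc2': ['PLANT-LIVING', 'PLANT-LIVING']}
print(one_sense_per_discourse(labels))
```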

Unsupervised disambiguation

Preliminaries
- Recall: unsupervised learning can do sense discrimination, not tagging; it is akin to clustering occurrences that have the same sense
- e.g. Brown et al. (1991) cluster the translations of a word, which is akin to clustering senses

Brown et al.'s method
- Preliminary categorisation:
  1. Set P(w|s) randomly for all words w and senses s of w
  2. Compute, for each context c of w, the probability P(c|s) that the context was generated by sense s
- Use (1) and (2) as a preliminary estimate, then re-estimate iteratively to find the best fit to the corpus
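A much-simplified sketch of this iterative re-estimation, treating senses as latent mixture components over bag-of-words contexts (an EM-style procedure). The random initialisation, smoothing constants and number of iterations are assumptions made for the example, not details from the lecture.

```python
# Much-simplified sense discrimination sketch: senses are latent mixture
# components over bags of context words, re-estimated EM-style.
import math, random
from collections import defaultdict

def discriminate(contexts, n_senses=2, iters=20, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for c in contexts for w in c})
    # P(w|s) initialised randomly; P(s) uniform.
    p_w = [{w: rng.random() + 0.1 for w in vocab} for _ in range(n_senses)]
    for dist in p_w:
        z = sum(dist.values())
        for w in dist:
            dist[w] /= z
    p_s = [1.0 / n_senses] * n_senses
    for _ in range(iters):
        counts = [defaultdict(float) for _ in range(n_senses)]
        sense_mass = [1e-10] * n_senses
        for c in contexts:
            # E-step: responsibility P(s|c) of each sense for this context
            scores = [p_s[s] * math.prod(p_w[s][w] for w in c)
                      for s in range(n_senses)]
            z = sum(scores) or 1e-300
            for s in range(n_senses):
                r = scores[s] / z
                sense_mass[s] += r
                for w in c:
                    counts[s][w] += r
        # M-step: re-estimate P(w|s) and P(s), with tiny smoothing
        for s in range(n_senses):
            total = sum(counts[s].values()) + 1e-6 * len(vocab)
            p_w[s] = {w: (counts[s][w] + 1e-6) / total for w in vocab}
            p_s[s] = sense_mass[s] / len(contexts)
    return p_w, p_s

contexts = [['river', 'water', 'fishing'], ['money', 'loan', 'deposit'],
            ['water', 'river', 'boat'], ['loan', 'interest', 'money']]
p_w, p_s = discriminate(contexts)
for c in contexts:
    scores = [p_s[s] * math.prod(p_w[s][w] for w in c) for s in range(2)]
    print(c, '-> sense', scores.index(max(scores)))
```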

Characteristics of unsupervised disambiguation
- Can adapt easily to new domains not covered by a dictionary or a pre-labelled corpus
- Very useful for information retrieval
- If there are many senses (e.g. 20 senses for word w), the algorithm will split contexts into fine-grained sets
- NB: can go awry with infrequent senses

Some issues with WSD

The task definition
- The WSD task traditionally assumes that a word has one and only one sense in a given context. Is this true?
- Kilgarriff (1993) argues that co-activation (one word displaying more than one sense) is frequent: "this would bring competition to the licensed trade", where competition = "act of competing" and "people/organisations who are competing"

Systematic polysemy
- Not all senses are so easy to distinguish, e.g. competition in the "agent competing" vs. "act of competing" sense; the polysemy here is systematic
- Compare bank/bank, where the senses are utterly distinct (and most linguists wouldn't consider this a case of polysemy, but of homonymy)
- Can translation equivalence help here? It depends on whether the polysemy is systematic in all languages

Logical metonymy
- Metonymy = the use of a word to stand for something else, e.g. "the pen is mightier than the sword", where pen = the press
- Logical metonymy arises due to systematic polysemy: good cook vs. good book; enjoy the paper vs. enjoy the cake
- Should WSD distinguish these? How could it do so?

Which words/usages count?
- Many proper names are identical to common nouns (cf. Brown, Bush, ...)
- This presents a WSD algorithm with systematic ambiguity and reduces performance
- Also, names are good indicators of the senses of neighbouring words, but exploiting this requires a priori categorisation of names: Brown's green stance vs. the cook's green curry