Lecture 22 Word Similarity


CSCE 771 Natural Language Processing
Lecture 22: Word Similarity
Topics: word similarity; thesaurus-based word similarity; introduction to distributional word similarity
Readings: NLTK book Chapter 2 (WordNet); Text Chapter 20
April 8, 2013

Overview
Last Time (Programming): features in NLTK; NL queries → SQL; NLTK support for interpretations and models; propositional and predicate logic support; Prover9
Today: last lecture's slides 25-29; Computational Lexical Semantics
Readings: Text 19, 20; NLTK Book: Chapter 10
Next Time: Computational Lexical Semantics II

Figure 20.1 Possible sense tags for bass
Chapter 20: Word Sense Disambiguation (WSD)
Machine translation
Supervised vs. unsupervised learning
Semantic concordance: a corpus with words tagged with sense tags

Feature Extraction for WSD
Feature vectors
Collocation: [w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_i, POS_i, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}]
Bag-of-words: unordered set of neighboring words
Represent sets of the most frequent content words with a membership vector, e.g. [0,0,1,0,0,0,1] means the 3rd and 7th most frequent content words are present
Window of nearby words/features
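The two feature types above can be sketched in a few lines of Python. This is only an illustration: the example sentence, its POS tags, the padding symbol and the seven-word vocabulary are assumptions chosen for the demo, not data from the lecture.

```python
def collocational_features(tokens, tags, i, window=2):
    """Words and POS tags at fixed offsets around position i (target included)."""
    feats = []
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats.extend([tokens[j], tags[j]])
        else:
            feats.extend(["<pad>", "<pad>"])  # offset falls outside the sentence
    return feats


def bag_of_words_features(tokens, i, vocabulary, window=10):
    """Binary membership vector over a fixed list of frequent content words."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    neighbors = set(tokens[lo:i] + tokens[i + 1:hi])
    return [1 if word in neighbors else 0 for word in vocabulary]


# Hypothetical example sentence with hand-assigned tags.
tokens = "an electric guitar and bass player stand off to one side".split()
tags = ["DT", "JJ", "NN", "CC", "NN", "NN", "VB", "RP", "TO", "CD", "NN"]
target = tokens.index("bass")

# Hypothetical list of frequent content words seen near "bass".
vocab = ["fishing", "big", "sound", "player", "fly", "rod", "guitar"]

colloc = collocational_features(tokens, tags, target)
bow = bag_of_words_features(tokens, target, vocab)
print(colloc)
print(bow)
```

The collocational vector keeps position information, while the bag-of-words vector only records which vocabulary words occur somewhere in the window.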

Naïve Bayes Classifier
w: word vector
s: sense tag vector
f: feature vector [w_i, POS_i] for i = 1, …, n
Approximate by frequency counts
But how practical?

Looking for a practical formula. Still not practical

Naïve == Assume Independence Now practical, but realistic?
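The progression on the last three slides (exact posterior, Bayes' rule, independence assumption) corresponds to the standard naive Bayes derivation. The sketch below uses the slides' symbols s and f; the index j = 1, …, n over individual features and the sense inventory S are notational assumptions.

```latex
\begin{align*}
\hat{s} &= \operatorname*{argmax}_{s \in S} P(s \mid \vec{f}) \\
        &= \operatorname*{argmax}_{s \in S} \frac{P(\vec{f} \mid s)\, P(s)}{P(\vec{f})} \\
        &= \operatorname*{argmax}_{s \in S} P(\vec{f} \mid s)\, P(s) \\
        &\approx \operatorname*{argmax}_{s \in S} P(s) \prod_{j=1}^{n} P(f_j \mid s)
\end{align*}
```

The first line is impractical because few feature vectors ever repeat in training data; the second is still impractical because P(f | s) is just as sparse. Only the last line, which treats the features as independent given the sense, can be estimated from counts.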

Training = count frequencies. Maximum likelihood estimator (20.8)
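A minimal sketch of "training = counting" follows. It estimates P(s) and P(f_j | s) by relative frequency and scores senses in log space. Two things here are assumptions rather than part of the slide: the add-one smoothing (added so that unseen features do not give log 0) and the four toy training examples.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (feature_list, sense). Returns the count tables."""
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, sense in examples:
        sense_counts[sense] += 1
        for f in features:
            feature_counts[sense][f] += 1
            vocabulary.add(f)
    return sense_counts, feature_counts, vocabulary


def classify(features, model):
    """Pick the sense maximizing log P(s) + sum_j log P(f_j | s)."""
    sense_counts, feature_counts, vocabulary = model
    total = sum(sense_counts.values())
    scores = {}
    for sense, count in sense_counts.items():
        # Add-one smoothing (an assumption, not on the slide).
        denom = sum(feature_counts[sense].values()) + len(vocabulary)
        score = math.log(count / total)
        for f in features:
            score += math.log((feature_counts[sense][f] + 1) / denom)
        scores[sense] = score
    return max(scores, key=scores.get)


# Hypothetical training data for two senses of "bass".
examples = [
    (["fishing", "rod", "river"], "fish"),
    (["caught", "fishing", "lake"], "fish"),
    (["guitar", "player", "band"], "music"),
    (["play", "guitar", "sound"], "music"),
]
model = train_naive_bayes(examples)
print(classify(["guitar", "band"], model))
print(classify(["fishing", "lake"], model))
```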

Decision List Classifiers
With Naïve Bayes it is hard for humans to examine the decisions and understand them
Decision list classifiers are like a "case" statement: a sequence of (test, returned-sense-tag) pairs

Figure 20.2 Decision List Classifier Rules
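The "case statement" reading of a decision list can be sketched as follows. The rules are simplified to "word occurs in the context window" tests, and the particular rules, their order, and the sense labels bass1 (fish) and bass2 (music) are illustrative assumptions, not a copy of Figure 20.2.

```python
def make_decision_list(rules, default_sense):
    """rules: ordered list of (test_word, sense); the first matching test wins."""
    def classify(context_words):
        for test_word, sense in rules:
            if test_word in context_words:
                return sense
        return default_sense  # no test fired
    return classify


# Hypothetical rules, strongest evidence first.
bass_rules = [
    ("fish", "bass1"),
    ("striped", "bass1"),
    ("guitar", "bass2"),
    ("piano", "bass2"),
    ("river", "bass1"),
]
classify_bass = make_decision_list(bass_rules, default_sense="bass1")

print(classify_bass("he plays bass guitar in a band".split()))
print(classify_bass("striped bass are common in this river".split()))
```

Because evaluation stops at the first test that fires, a person can read off exactly which rule produced a decision, which is the interpretability advantage the slide contrasts with naive Bayes.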

WSD Evaluation, baselines, ceilings
Extrinsic evaluation: evaluating embedded NLP in end-to-end applications (in vivo)
Intrinsic evaluation: evaluating WSD by itself (in vitro)
Sense accuracy
Corpora: SemCor, SENSEVAL, SEMEVAL
Baseline: most frequent sense (WordNet sense 1)
Ceiling: gold standard, human experts with discussion and agreement

Similarity of Words or Senses
Generally we will be saying "words" but giving the similarity of word senses
Similarity vs. relatedness (ex. similarity, ex. relatedness)
Similarity of words
Similarity of phrases/sentences (not usually done)

Figure 20.3 Simplified Lesk Algorithm gloss/sentence overlap

Simplified Lesk example The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable rate mortgage securities.
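A minimal sketch of Simplified Lesk applied to this sentence is shown below. The two sense signatures are glosses and example sentences written from memory of the WordNet entries for bank, and the stopword list is ad hoc; treat both as assumptions. The most frequent sense is listed first and serves as the default when no overlap is found.

```python
import re

# Small ad hoc stopword list (an assumption; any function-word list would do).
STOPWORDS = {"a", "an", "the", "of", "on", "in", "at", "and", "that", "he",
             "they", "my", "it", "up", "into", "will", "can", "because"}


def content_words(text, exclude=()):
    """Lowercased word set without stopwords or excluded words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS and w not in exclude}


def simplified_lesk(target, sentence, senses):
    """senses: list of (sense_id, gloss plus examples), most frequent first."""
    context = content_words(sentence, exclude={target})
    best_sense, max_overlap = senses[0][0], 0
    for sense_id, signature_text in senses:
        signature = content_words(signature_text, exclude={target})
        overlap = len(context & signature)
        if overlap > max_overlap:
            best_sense, max_overlap = sense_id, overlap
    return best_sense, max_overlap


senses = [
    ("bank1", "a financial institution that accepts deposits and channels the "
              "money into lending activities. he cashed a check at the bank. "
              "that bank holds the mortgage on my home"),
    ("bank2", "sloping land (especially the slope beside a body of water). "
              "they pulled the canoe up on the bank. he sat on the bank of "
              "the river and watched the currents"),
]
sentence = ("The bank can guarantee deposits will eventually cover future "
            "tuition costs because it invests in adjustable rate mortgage "
            "securities.")
result = simplified_lesk("bank", sentence, senses)
print(result)
```

With these signatures the financial sense wins because "deposits" and "mortgage" appear both in the sentence and in its signature, while the river-bank sense shares no content words with the sentence.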

Corpus Lesk
Using equal weights on all words just does not seem right
Weights are applied to the overlap words
Inverse document frequency: idf_i = log(N_docs / number of docs containing w_i)
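The idf weighting above can be sketched as follows. Each "document" here is a tokenized gloss or example sentence; the four toy documents and the helper name weighted_overlap are assumptions made for illustration.

```python
import math


def idf_weights(documents):
    """idf_i = log(N_docs / number of docs containing w_i)."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def weighted_overlap(context, signature, idf):
    """Sum the idf weights of the words shared by context and signature."""
    return sum(idf.get(w, 0.0) for w in set(context) & set(signature))


# Hypothetical mini-collection of glosses and example sentences.
documents = [
    ["the", "bank", "holds", "the", "mortgage"],
    ["the", "river", "bank"],
    ["the", "deposits", "at", "the", "bank"],
    ["the", "canoe", "on", "the", "river"],
]
idf = idf_weights(documents)
score = weighted_overlap(["the", "mortgage", "deposits"],
                         ["the", "bank", "mortgage"], idf)
print(idf["the"], idf["mortgage"], score)
```

A word such as "the" that occurs in every document gets weight log(1) = 0, so it no longer contributes to the overlap, while a rare word such as "mortgage" contributes the most.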

SENSEVAL competitions http://www.senseval.org/ Check the Senseval-3 website.

SemEval-2 -Evaluation Exercises on Semantic Evaluation - ACL SigLex event

Task, Name, Area
#1 Coreference Resolution in Multiple Languages. Area: Coref
#2 Cross-Lingual Lexical Substitution. Area: Cross-Lingual, Lexical Substitu
#3 Cross-Lingual Word Sense Disambiguation. Area: Cross-Lingual, Word Senses
#4 VP Ellipsis - Detection and Resolution. Area: Ellipsis
#5 Automatic Keyphrase Extraction from Scientific Articles
#6 Classification of Semantic Relations between MeSH Entities in Swedish Medical Texts
#7 Argument Selection and Coercion. Area: Metonymy
#8 Multi-Way Classification of Semantic Relations Between Pairs of Nominals
#9 Noun Compound Interpretation Using Paraphrasing Verbs. Area: Noun compounds
#10 Linking Events and their Participants in Discourse. Area: Semantic Role Labeling, Information Extraction
#11 Event Detection in Chinese News Sentences. Area: Semantic Role Labeling, Word Senses
#12 Parser Training and Evaluation using Textual Entailment
#13 TempEval 2. Area: Time Expressions
#14 Word Sense Induction
#15 Infrequent Sense Identification for Mandarin Text to Speech Systems
#16 Japanese WSD. Area: Word Senses
#17 All-words Word Sense Disambiguation on a Specific Domain (WSD-domain)
#18 Disambiguating Sentiment Ambiguous Adjectives. Area: Word Senses, Sentim

20.4.2 Selectional Restrictions and Preferences
verb eat → theme (object) has feature Food+
Katz and Fodor (1963) used this idea to rule out senses that were not consistent
WSD of dish
(20.12) "In our house, everybody has a career and none of them includes washing dishes," he says.
(20.13) In her tiny kitchen, Ms. Chen works efficiently, stir-frying several simple dishes, including …
Verbs: wash, stir-fry
wash → washable+
stir-fry → edible+

Resnik's model of Selectional Association
How much does a predicate tell you about the semantic class of its arguments?
eat vs. was, is, to be …
The selectional preference strength of a verb is indicated by two distributions:
P(c): how likely the direct object is to be in class c
P(c|v): the distribution of expected semantic classes for the particular verb v
The greater the difference between these distributions, the more information the verb provides.

Relative entropy: Kullback-Leibler divergence
Given two distributions P and Q:
D(P || Q) = Σ_x P(x) log( P(x) / Q(x) ) (eq 20.16)
Selectional preference strength:
S_R(v) = D( P(c|v) || P(c) ) = Σ_c P(c|v) log( P(c|v) / P(c) )
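As a worked illustration of these definitions, the sketch below computes the selectional preference strength of two verbs over four semantic classes. All probabilities are invented for the example, and log base 2 is an assumption (the slide does not fix the base). The selectional_association function follows Resnik's definition of selectional association as the share of the verb's preference strength contributed by one class.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log2(P(x) / Q(x)); terms with P(x) = 0 are skipped."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)


def selectional_preference_strength(p_class_given_verb, p_class):
    """S_R(v) = D(P(c|v) || P(c))."""
    return kl_divergence(p_class_given_verb, p_class)


def selectional_association(c, p_class_given_verb, p_class):
    """Contribution of class c to S_R(v), normalized by S_R(v)."""
    strength = selectional_preference_strength(p_class_given_verb, p_class)
    term = p_class_given_verb[c] * math.log2(p_class_given_verb[c] / p_class[c])
    return term / strength


# Invented distributions over direct-object classes.
p_class = {"food": 0.25, "artifact": 0.25, "person": 0.25, "abstraction": 0.25}
p_given_eat = {"food": 0.85, "artifact": 0.05, "person": 0.05, "abstraction": 0.05}
p_given_be = dict(p_class)  # "be" tells us nothing about its argument

strength_eat = selectional_preference_strength(p_given_eat, p_class)
strength_be = selectional_preference_strength(p_given_be, p_class)
assoc_food = selectional_association("food", p_given_eat, p_class)
print(strength_eat, strength_be, assoc_food)
```

A verb whose argument distribution equals the prior, like the invented "be" here, has strength 0, while a selective verb like "eat" has a large positive strength, most of it contributed by the food class.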

Resnik’s model of Selectional Association

High and Low Selectional Associations – Resnik 1996

20.5 Minimally Supervised WSD: Bootstrapping “supervised and dictionary methods require large hand-built resources” bootstrapping or semi-supervised learning or minimally supervised learning to address the no-data problem Start with seed set and grow it.

Yarowsky algorithm preliminaries
Idea of bootstrapping: "create a larger training set from a small set of seeds"
Heuristics (senses of "bass"):
One sense per collocation: within a sentence, both senses of bass are not used
One sense per discourse: Yarowsky showed that of 37,232 examples of bass occurring in a discourse there was only one sense per discourse

Yarowsky algorithm
Goal: learn a word-sense classifier for a word
Input: Λ0, a small seed set of labeled instances of each sense
Train a classifier on the seed set Λ0
Label the unlabeled corpus V0 with the classifier
Select the examples δ in V0 that you are "most confident in"
Λ1 = Λ0 + δ
Repeat
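The loop above can be sketched with a deliberately simple stand-in classifier. This is not Yarowsky's decision-list learner: here a context word votes for a sense with confidence equal to the fraction of its labeled occurrences carrying that sense, and an example is accepted when its best vote reaches a threshold. The seed examples, the unlabeled contexts for "plant", and the threshold are all assumptions.

```python
from collections import Counter, defaultdict


def train(labeled):
    """Count how often each context word occurs with each sense."""
    counts = defaultdict(Counter)
    for words, sense in labeled:
        for w in words:
            counts[w][sense] += 1
    return counts


def predict(words, counts):
    """Return (sense, confidence) from the single strongest context word."""
    best_sense, best_conf = None, 0.0
    for w in words:
        if w in counts:
            sense, n = counts[w].most_common(1)[0]
            conf = n / sum(counts[w].values())
            if conf > best_conf:
                best_sense, best_conf = sense, conf
    return best_sense, best_conf


def bootstrap(seed_labeled, unlabeled, threshold=0.9, max_iterations=10):
    labeled, remaining = list(seed_labeled), list(unlabeled)
    for _ in range(max_iterations):
        counts = train(labeled)                    # train on current labeled set
        confident, still_unlabeled = [], []
        for words in remaining:                    # label the unlabeled pool
            sense, conf = predict(words, counts)
            if sense is not None and conf >= threshold:
                confident.append((words, sense))   # keep only confident labels
            else:
                still_unlabeled.append(words)
        if not confident:
            break                                  # nothing new: stop
        labeled.extend(confident)                  # grow the labeled set
        remaining = still_unlabeled
    return labeled, remaining


# Hypothetical seeds and contexts for "plant" (target word removed).
seeds = [(["life", "species"], "flora"), (["manufacturing", "workers"], "factory")]
pool = [["species", "animal", "cell"], ["workers", "union", "equipment"],
        ["animal", "growth"], ["equipment", "automated"], ["nuclear", "power"]]
labeled, remaining = bootstrap(seeds, pool)
print(labeled)
print(remaining)
```

In this toy run the context ["animal", "growth"] can only be labeled in the second round, after "animal" has been learned from a first-round example, which is the growth effect the seed-and-repeat loop is meant to produce.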

Figure 20.4 Two senses of plant Plant 1 – manufacturing plant … plant 2 – flora, plant life

2009 Survey of WSD by Navigli, iroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf

Figure 20.5 Samples of bass-sentences from WSJ (Wall Street Journal)

Word Similarity: Thesaurus Based Methods
Figure 20.6 Path distances in the WordNet hierarchy (pruned)

Figure 20.6 Path Based Similarity
sim_path(c1, c2) = 1 / pathlen(c1, c2), where pathlen is the number of edges on the shortest path between c1 and c2, plus 1
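To make the formula concrete without requiring WordNet, here is a sketch on a small hand-built hierarchy. The hierarchy is a toy loosely modeled on the money-related fragment usually shown for this figure; its exact shape is an assumption, and each concept has a single parent, which real WordNet does not guarantee.

```python
# Toy hypernym links: child -> parent (an assumed, simplified hierarchy).
HYPERNYM = {
    "nickel": "coin",
    "dime": "coin",
    "coin": "coinage",
    "coinage": "currency",
    "currency": "medium_of_exchange",
    "money": "medium_of_exchange",
    "fund": "money",
    "budget": "fund",
    "medium_of_exchange": "standard",
}


def ancestors(concept):
    """Path from a concept up to the root, starting with the concept itself."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def lowest_common_subsumer(c1, c2):
    """Most specific concept that is an ancestor of (or equal to) both."""
    up2 = set(ancestors(c2))
    for node in ancestors(c1):
        if node in up2:
            return node
    return None


def path_similarity(c1, c2):
    """1 / (edges on the shortest path + 1); None if the concepts are unconnected."""
    lcs = lowest_common_subsumer(c1, c2)
    if lcs is None:
        return None
    edges = ancestors(c1).index(lcs) + ancestors(c2).index(lcs)
    return 1.0 / (edges + 1)


print(path_similarity("nickel", "coin"))
print(path_similarity("nickel", "dime"))
print(path_similarity("nickel", "money"))
```

The "+ 1" keeps the score finite for identical concepts (similarity 1.0) and makes direct parent and child score 0.5.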

WN hierarchy
# WordNet examples from the NLTK book
import nltk
from nltk.corpus import wordnet as wn

right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')

# Lowest common subsumer (most specific shared hypernym) of each pair
print("LCS(right, minke) =", right.lowest_common_hypernyms(minke))
print("LCS(right, orca) =", right.lowest_common_hypernyms(orca))
print("LCS(right, tortoise) =", right.lowest_common_hypernyms(tortoise))
print("LCS(right, novel) =", right.lowest_common_hypernyms(novel))

# Path similarity (uses the synsets defined on the previous slide)
print("Path similarities")
print(right.path_similarity(minke))
print(right.path_similarity(orca))
print(right.path_similarity(tortoise))
print(right.path_similarity(novel))

Output:
Path similarities
0.25
0.166666666667
0.0769230769231
0.0434782608696

WordNet in NLTK
http://nltk.org/_modules/nltk/corpus/reader/wordnet.html
http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html (partially in Chapter 2 of the NLTK book, but a different version)
http://grey.colorado.edu/mingus/index.php/Objrec_Wordnet.py (code for similarity; runs for a while and produces lots of results)

https://groups.google.com/forum Hi, I was wondering if it is possible for me to use NLTK + wordnet to group (nouns) words together via similar meanings? Assuming I have 2000 words or topics. Is it possible for me to group them together according to similar meanings using NLTK? So that at the end of the day I would have different groups of words that are similar in meaning? Can that be done in NLTK? and possibly be able to detect salient patterns emerging? (trend in topics etc...). Is there a further need for a word classifier based on the CMU BOW toolkit to classify words to get it into categories? or the above group would be good enough? Is there a need to classify words further? How would one classify words in NLTK effectively? Really hope you can enlighten me? FM beautiful 3/4/10

Response from Steven Bird
2010/3/5 Republic <ngfo...@gmail.com>:
> Assuming I have 2000 words or topics. Is it possible for me to group
> them together according to similar meanings using NLTK?
You could compute WordNet similarity (pairwise), so that each word/topic is represented as a vector of distances, which could then be discretized, so each vector would have a form like this: [0,2,3,1,0,0,2,1,3,...]. These vectors could then be clustered using one of the methods in the NLTK cluster package.
> So that at the end of the day I would have different groups of words
> that are similar in meaning? Can that be done in NLTK? and possibly be
> able to detect salient patterns emerging? (trend in topics etc...).
This suggests a temporal dimension, which might mean recomputing the clusters as more words or topics come in. It might help to read the NLTK book sections on WordNet and on text classification, and also some of the other cited material.
-Steven Bird
Steven Bird 3/7/10

More general? (Stack Overflow)
import nltk
from nltk.corpus import wordnet as wn

def all_hyponym_names(synset):
    # Lemma names of every synset below the given one in the hyponym hierarchy
    return set(w.replace("_", " ")
               for s in synset.closure(lambda s: s.hyponyms())
               for w in s.lemma_names())

waiter = wn.synset('waiter.n.01')
employee = wn.synset('employee.n.01')
all_hyponyms_of_waiter = all_hyponym_names(waiter)
all_hyponyms_of_employee = all_hyponym_names(employee)

if 'waiter' in all_hyponyms_of_employee:
    print('employee more general than waiter')
elif 'employee' in all_hyponyms_of_waiter:
    print('waiter more general than employee')

http://stackoverflow.com/questions/...-semantic-hierarchies-relations-in--nltk

help(wn)
…
| res_similarity(self, synset1, synset2, ic, verbose=False)
|     Resnik Similarity:
|     Return a score denoting how similar two word senses are, based on the
|     Information Content (IC) of the Least Common Subsumer (most specific
|     ancestor node).
http://grey.colorado.edu/mingus/index.php/Objrec_Wordnet.py

Similarity based on a hierarchy (=ontology)

Information Content word similarity

Resnik Similarity / WordNet
sim_resnik(c1, c2) = -log P(LCS(c1, c2))
WordNet: res_similarity(self, synset1, synset2, ic, verbose=False)
| Resnik Similarity:
| Return a score denoting how similar two word senses are, based on the
| Information Content (IC) of the Least Common Subsumer (most specific
| ancestor node).
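The formula can be tried on a small hierarchy annotated with concept probabilities. The hierarchy and the P(c) values below are written from memory in the style of the Lin (1998) fragment that Figure 20.7 is based on, so treat them as illustrative assumptions. The natural logarithm is also an assumption; a different base only rescales the scores.

```python
import math

# Toy hypernym links: child -> parent (assumed fragment).
HYPERNYM = {
    "hill": "natural_elevation",
    "coast": "shore",
    "natural_elevation": "geological_formation",
    "shore": "geological_formation",
    "geological_formation": "natural_object",
    "natural_object": "inanimate_object",
    "inanimate_object": "entity",
}

# Illustrative probabilities P(c) of encountering an instance of each concept.
P = {
    "entity": 0.395,
    "inanimate_object": 0.167,
    "natural_object": 0.0163,
    "geological_formation": 0.00176,
    "natural_elevation": 0.000113,
    "shore": 0.0000836,
    "hill": 0.0000189,
    "coast": 0.0000216,
}


def ancestors(concept):
    """Path from a concept up to the root, starting with the concept itself."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def lowest_common_subsumer(c1, c2):
    up2 = set(ancestors(c2))
    for node in ancestors(c1):
        if node in up2:
            return node
    return None


def information_content(concept):
    """IC(c) = -log P(c): rarer concepts carry more information."""
    return -math.log(P[concept])


def resnik_similarity(c1, c2):
    """sim_resnik(c1, c2) = -log P(LCS(c1, c2))."""
    return information_content(lowest_common_subsumer(c1, c2))


print(lowest_common_subsumer("hill", "coast"))
print(round(resnik_similarity("hill", "coast"), 3))
```

Because the score depends only on the LCS, any two concepts whose lowest common subsumer is geological_formation get the same Resnik similarity, however far below it they sit.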

Fig 20.7 WordNet with Lin P(c) values. Change for Resnik!!

Lin variation (1998)
Commonality: IC(common(A,B))
Difference: IC(description(A,B)) - IC(common(A,B))
sim_Lin(A,B) = common(A,B) / description(A,B)
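For concepts in a hierarchy, Lin's ratio is usually instantiated as sim_Lin(c1, c2) = 2 log P(LCS(c1, c2)) / (log P(c1) + log P(c2)). The sketch below evaluates it for hill and coast, whose lowest common subsumer is taken to be geological-formation. The three probabilities are recalled from the Lin (1998) example and should be treated as illustrative assumptions.

```python
import math


def lin_similarity(p_c1, p_c2, p_lcs):
    """sim_Lin = 2 log P(LCS) / (log P(c1) + log P(c2))."""
    return 2 * math.log(p_lcs) / (math.log(p_c1) + math.log(p_c2))


# Illustrative probabilities (assumed values).
p_hill = 0.0000189
p_coast = 0.0000216
p_geological_formation = 0.00176

sim_hill_coast = lin_similarity(p_hill, p_coast, p_geological_formation)
print(round(sim_hill_coast, 2))
```

Unlike the Resnik score, the result is normalized: it is a ratio of two log-probabilities, so the base of the logarithm cancels, and a concept compared with itself scores exactly 1.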

Fig 20.7 Wordnet with Lin P(c) values

Extended Lesk
Based on glosses, including the glosses of hypernyms and hyponyms
Example:
drawing paper: paper that is specially prepared for use in drafting
decal: the art of transferring designs from specially prepared paper to a wood, glass or metal surface
Lesk score = sum of squares of the lengths of common phrases
Example: 1 + 2^2 = 5
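The scoring rule can be sketched as a greedy search: repeatedly find the longest word sequence shared by the two glosses, add the square of its length, and blank it out in both glosses. This sketch covers only a single pair of glosses and does not filter out function-word overlaps, which a fuller implementation would likely do; the function names are assumptions.

```python
import re


def longest_common_phrase(a, b):
    """Length and start positions of the longest shared contiguous word run."""
    best_len, best_i, best_j = 0, 0, 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while (i + k < len(a) and j + k < len(b)
                   and a[i + k] is not None and a[i + k] == b[j + k]):
                k += 1
            if k > best_len:
                best_len, best_i, best_j = k, i, j
    return best_len, best_i, best_j


def extended_lesk_overlap(gloss1, gloss2):
    """Sum of squared lengths of common phrases, longest phrases first."""
    a = re.findall(r"[a-z]+", gloss1.lower())
    b = re.findall(r"[a-z]+", gloss2.lower())
    score = 0
    while True:
        length, i, j = longest_common_phrase(a, b)
        if length == 0:
            return score
        score += length ** 2
        # Replace the matched phrase with a marker so it cannot match again
        # and its neighbors cannot merge into a new phrase.
        a[i:i + length] = [None]
        b[j:j + length] = [None]


drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood, glass or metal surface")
overlap_score = extended_lesk_overlap(drawing_paper, decal)
print(overlap_score)
```

For the two glosses on the slide, the phrase "specially prepared" contributes 2^2 = 4 and the single word "paper" contributes 1, giving 5.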

Figure 20.8 Summary of Thesaurus Similarity measures

Wordnet similarity functions path_similarity()? lch_similarity()? wup_similarity()? res_similarity()? jcn_similarity()? lin_similarity()?

Problems with thesaurus-based methods
We don't always have a thesaurus
Even when we do, there are problems with recall: missing words, missing phrases
Thesauri work less well for verbs and adjectives, which have less hyponymy structure
Distributional Word Similarity D. Jurafsky

Distributional models of meaning
Vector-space models of meaning
Offer higher recall than hand-built thesauri, though probably with less precision

Word Similarity: Distributional Methods
20.31 tezguino example
A bottle of tezguino is on the table.
Everybody likes tezguino.
tezguino makes you drunk.
We make tezguino out of corn.
What do you know about tezguino?

Term-document matrix
Collection of documents
Identify a collection of important, discriminatory terms (words)
Matrix: terms × documents, with term frequency tf_{w,d} as entries
Each document is a vector in Z^V (Z = integers; N = natural numbers would be more accurate but perhaps misleading)
Example

Example term-document matrix
Subset of terms = {battle, soldier, fool, clown}
           As You Like It   Twelfth Night   Julius Caesar   Henry V
battle            1               1               8            15
soldier           2               2              12            36
fool             37              58               1             5
clown             6             117               0             0
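One common way to compare the rows of such a matrix is cosine similarity; the measure is not named on this slide, so its use here is an assumption. The counts below follow the textbook's version of this example, with one value per play for every term; treat them as assumed values.

```python
import math

# Rows: terms; columns: As You Like It, Twelfth Night, Julius Caesar, Henry V.
term_vectors = {
    "battle": [1, 1, 8, 15],
    "soldier": [2, 2, 12, 36],
    "fool": [37, 58, 1, 5],
    "clown": [6, 117, 0, 0],
}


def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


sim_battle_soldier = cosine(term_vectors["battle"], term_vectors["soldier"])
sim_fool_clown = cosine(term_vectors["fool"], term_vectors["clown"])
sim_battle_clown = cosine(term_vectors["battle"], term_vectors["clown"])
print(round(sim_battle_soldier, 3), round(sim_fool_clown, 3), round(sim_battle_clown, 3))
```

Terms that concentrate in the same plays (battle and soldier in the histories, fool and clown in the comedies) come out as similar, while battle and clown do not.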

Figure 20.9 Term in context matrix for word similarity window of 20 words – 10 before 10 after from Brown corpus

Pointwise Mutual Information
Use a tf-idf (term frequency, inverse document frequency) rating instead of raw counts: the idf intuition again
Pointwise mutual information (PMI): do events x and y co-occur more often than if they were independent?
PMI(X, Y) = log2( P(X, Y) / (P(X) P(Y)) )
PMI between words
Positive PMI between two words (PPMI): negative PMI values are replaced by 0

Computing PPMI
Matrix F with W (words) rows and C (contexts) columns
f_ij is the frequency of w_i in c_j
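A sketch of the computation follows, assuming the usual estimates p_ij = f_ij / N, p_i* = (row sum) / N and p_*j = (column sum) / N, where N is the sum of all counts, and PPMI_ij = max(log2(p_ij / (p_i* p_*j)), 0). The small count matrix is the apricot/pineapple/digital/information example as recalled from the textbook; treat the numbers as assumptions.

```python
import math

words = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
# counts[i][j] = f_ij, frequency of word i in context j (assumed values).
counts = [
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [2, 1, 0, 1, 0],
    [1, 6, 0, 4, 0],
]


def ppmi_matrix(counts):
    """Positive PMI for every cell of a word-by-context count matrix."""
    total = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, f in enumerate(row):
            if f == 0:
                out_row.append(0.0)  # PMI would be -infinity; PPMI clips to 0
                continue
            p_ij = f / total
            p_i = row_totals[i] / total
            p_j = col_totals[j] / total
            out_row.append(max(math.log2(p_ij / (p_i * p_j)), 0.0))
        result.append(out_row)
    return result


ppmi = ppmi_matrix(counts)
for word, row in zip(words, ppmi):
    print(word, [round(x, 2) for x in row])
```

For example, information with data gives log2((6/19) / ((11/19)(7/19))), about 0.57, while information with computer has negative PMI and is clipped to 0.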

Example: computing PPMI

Figure 20.10

Figure 20.11

Figure 20.12

Figure 20.13

Figure 20.14

Figure 20.15

Figure 20.16

How to do it in NLTK: http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf
NLTK 3.0a1 released: February 2013. This version adds support for NLTK's graphical user interfaces. http://nltk.org/nltk3-alpha/
Question: Which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words? I want to use a function for word clustering and the Yarowsky algorithm to find similar collocations in a large text.
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Linguistics
http://en.wikipedia.org/wiki/Portal:Linguistics
http://en.wikipedia.org/wiki/Yarowsky_algorithm
http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html