SI485i : NLP Set 11 Distributional Similarity slides adapted from Dan Jurafsky and Bill MacCartney.

Distributional methods
Firth (1957): "You shall know a word by the company it keeps!"
Example from Nida (1975), noted by Lin:
  A bottle of tezgüino is on the table
  Everybody likes tezgüino
  Tezgüino makes you drunk
  We make tezgüino out of corn
Intuition: just from these contexts, a human could guess the meaning of tezgüino.
So we should look at the surrounding contexts and see what other words occur in similar contexts.

Fill-in-the-blank on Google
You can get a quick-and-dirty impression of what words show up in a given context by putting a * in your Google query.

Context vector
Consider a target word w.
Suppose we had one binary feature f_i for each of the N words v_i in the lexicon, meaning "word v_i occurs in the neighborhood of w."
Then w = (f_1, f_2, f_3, …, f_N).
If w = tezgüino, v_1 = bottle, v_2 = drunk, v_3 = matrix:
  w = (1, 1, 0, …)
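A minimal sketch (not from the slides) of how such binary context vectors might be built; the toy corpus, window size, and tokenization are illustrative assumptions.

```python
from collections import defaultdict

# Toy corpus (hypothetical): each sentence is a list of lowercase tokens.
corpus = [
    "a bottle of tezgüino is on the table".split(),
    "everybody likes tezgüino".split(),
    "tezgüino makes you drunk".split(),
    "we make tezgüino out of corn".split(),
]

def binary_context_vector(target, sentences, window=2):
    """Return {context_word: 1} for every word seen within `window`
    tokens of `target` anywhere in the corpus (the binary features f_i)."""
    features = defaultdict(int)
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    features[sent[j]] = 1
    return dict(features)

print(binary_context_vector("tezgüino", corpus))
# e.g. {'bottle': 1, 'of': 1, 'is': 1, 'everybody': 1, 'likes': 1, ...}
```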

Intuition
Define two words by these sparse feature vectors.
Apply a vector distance metric.
Call two words similar if their vectors are similar.

Distributional similarity
So we just need to specify three things:
  1. How the co-occurrence terms are defined
  2. How terms are weighted (Boolean? Frequency? Logs? Mutual information?)
  3. What vector similarity metric to use (Euclidean distance? Cosine? Jaccard? Dice?)

1. Defining co-occurrence vectors
We could use windows of neighboring words (bag-of-words).
We generally remove stopwords, but we lose any sense of syntax.
Instead, use the words that occur in particular grammatical relations.
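As an illustration of the window-based option (not from the slides), a count-based co-occurrence table with a small hand-picked stopword list; the corpus, window size, and stopwords are assumptions.

```python
from collections import Counter, defaultdict

STOPWORDS = {"a", "of", "the", "is", "on", "you", "we", "out"}  # illustrative only

def cooccurrence_counts(sentences, window=2):
    """Map each word to a Counter over words seen within `window` tokens,
    after dropping stopwords (bag-of-words style)."""
    counts = defaultdict(Counter)
    for sent in sentences:
        content = [t for t in sent if t not in STOPWORDS]
        for i, w in enumerate(content):
            lo, hi = max(0, i - window), min(len(content), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][content[j]] += 1
    return counts

corpus = [
    "everybody likes tezgüino".split(),
    "tezgüino makes you drunk".split(),
]
print(cooccurrence_counts(corpus)["tezgüino"])
```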

Defining co-occurrence vectors
"The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities." — Zellig Harris (1968)
Idea: parse the sentence and extract grammatical dependencies.

Co-occurrence vectors based on grammatical dependencies
For the word cell: a vector of N × R features, where R is the number of dependency relations.
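A sketch of extracting (relation, governor) features of the kind described here, assuming spaCy and its en_core_web_sm model are installed; the sentences and feature naming are illustrative, not the slides' own pipeline.

```python
import spacy
from collections import Counter, defaultdict

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def dependency_features(texts):
    """For each word, count features of the form (relation-of, head word),
    e.g. in "drink wine", wine gets the feature ('dobj-of', 'drink')."""
    feats = defaultdict(Counter)
    for doc in nlp.pipe(texts):
        for tok in doc:
            if tok.dep_ not in ("ROOT", "punct"):
                feats[tok.lemma_][(f"{tok.dep_}-of", tok.head.lemma_)] += 1
    return feats

feats = dependency_features(["The cell absorbs water.", "A cell divides quickly."])
print(feats["cell"])
# e.g. Counter({('nsubj-of', 'absorb'): 1, ('nsubj-of', 'divide'): 1})
```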

2. Weighting the counts ("measures of association with context")
We have been using the frequency count of a feature as its weight or value, but we could use any function of this frequency.
Consider one feature f = (r, w') = (obj-of, attack):
  P(f | w) = count(f, w) / count(w)
  assoc_prob(w, f) = P(f | w)

Frequency-based problems
"drink it" is more common than "drink wine"!
But "wine" is a better "drinkable" thing than "it".
We need to control for expected frequency: instead of the raw count, normalize by how often the two words would co-occur by chance.
Objects of the verb drink:

Weighting: Mutual Information
Mutual information between random variables X and Y:
  I(X; Y) = Σ_x Σ_y P(x, y) log2 [ P(x, y) / (P(x) P(y)) ]
Pointwise mutual information: a measure of how often two events x and y occur together, compared with what we would expect if they were independent:
  PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]

Weighting: Mutual Information
Pointwise mutual information: how often two events x and y occur together, compared with what we would expect if they were independent:
  PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
PMI between a target word w and a feature f:
  assoc_PMI(w, f) = log2 [ P(w, f) / (P(w) P(f)) ]
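A small sketch (not from the slides) of computing PMI-weighted associations from raw co-occurrence counts; the toy counts are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical counts of (word, feature) pairs, where feature = (relation, verb).
pair_counts = Counter({
    ("wine", ("obj-of", "drink")): 2,
    ("it",   ("obj-of", "drink")): 3,
    ("it",   ("obj-of", "see")):   6,
    ("wine", ("obj-of", "pour")):  1,
})

total = sum(pair_counts.values())
word_counts, feat_counts = Counter(), Counter()
for (w, f), c in pair_counts.items():
    word_counts[w] += c
    feat_counts[f] += c

def pmi(word, feature):
    """PMI(w, f) = log2( P(w, f) / (P(w) * P(f)) )."""
    p_wf = pair_counts[(word, feature)] / total
    if p_wf == 0:
        return float("-inf")
    p_w = word_counts[word] / total
    p_f = feat_counts[feature] / total
    return math.log2(p_wf / (p_w * p_f))

print(pmi("wine", ("obj-of", "drink")))  # ~0.68: positive association
print(pmi("it", ("obj-of", "drink")))    # ~-0.32: frequent but uninformative
```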

Mutual information intuition
Objects of the verb drink

Summary: weightings
See Manning and Schuetze (1999) for more.

3. Defining vector similarity
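The similarity formulas from this slide are not in the transcript; as an illustrative sketch, here are cosine and Jaccard similarity over sparse count vectors represented as dictionaries (the example vectors are assumptions).

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def jaccard(u, v):
    """(Weighted) Jaccard similarity: min overlap over max overlap."""
    keys = set(u) | set(v)
    num = sum(min(u.get(k, 0), v.get(k, 0)) for k in keys)
    den = sum(max(u.get(k, 0), v.get(k, 0)) for k in keys)
    return num / den if den else 0.0

apple = {"eat": 3, "juice": 2, "tree": 1}
pear  = {"eat": 2, "juice": 1, "tree": 2}
print(cosine(apple, pear), jaccard(apple, pear))
```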

Summary of similarity measures

Evaluating similarity measures
Intrinsic evaluation:
  Correlation with word similarity ratings from humans
Extrinsic (task-based, end-to-end) evaluation:
  Malapropism (spelling error) detection
  Word sense disambiguation (WSD)
  Essay grading
  Plagiarism detection
  Taking TOEFL multiple-choice vocabulary tests
  Language modeling in some application
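A sketch of the intrinsic evaluation: compare a measure's scores against human ratings with Spearman's rank correlation. The word pairs, ratings, and model scores below are made up, and scipy is assumed to be available.

```python
from scipy.stats import spearmanr

# Hypothetical human similarity ratings (0-10 scale) for word pairs.
human = {("car", "automobile"): 9.5, ("cup", "mug"): 8.0,
         ("car", "tree"): 1.5, ("cup", "car"): 2.0}

# Hypothetical scores from our distributional similarity measure.
model = {("car", "automobile"): 0.91, ("cup", "mug"): 0.74,
         ("car", "tree"): 0.20, ("cup", "car"): 0.35}

pairs = list(human)
rho, pval = spearmanr([human[pr] for pr in pairs],
                      [model[pr] for pr in pairs])
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```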

An example of detected plagiarism