Discussion of assigned readings Lecture 13

How does ‘smoothing’ help in Bayesian word sense disambiguation? How do you do this smoothing? Most words appear rarely (remember Heaps' law): the more data you see, the more previously unseen words you encounter. Now imagine you use the word before and the word after the target word as features. When you test your classifier, you will see context words you never saw during training.

How does the naïve Bayes classifier work? Compute the probability of the observed context assuming each sense, by multiplying the probabilities of the individual features. A word not seen in training will have zero probability, which zeroes out the whole product for that sense. Choose the most probable sense.
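As a rough illustration of why unseen context words are a problem, here is a minimal sketch of an unsmoothed naïve Bayes sense classifier. The training examples, the target word "bass", and all function names are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (sense, context words) pairs for "bass".
TRAIN = [
    ("fish", ["caught", "river"]),
    ("fish", ["caught", "lake"]),
    ("music", ["played", "guitar"]),
]

def train(examples):
    """Count how often each sense and each (sense, context word) pair occurs."""
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)
    for sense, context in examples:
        sense_counts[sense] += 1
        feature_counts[sense].update(context)
    return sense_counts, feature_counts

def score(sense, context, sense_counts, feature_counts):
    """P(sense) times the product of unsmoothed P(word | sense)."""
    total = sum(feature_counts[sense].values())
    p = sense_counts[sense] / sum(sense_counts.values())
    for word in context:
        p *= feature_counts[sense][word] / total  # 0 for an unseen word
    return p

def classify(context, sense_counts, feature_counts):
    """Choose the most probable sense for the observed context."""
    return max(sense_counts,
               key=lambda s: score(s, context, sense_counts, feature_counts))

SENSE_COUNTS, FEATURE_COUNTS = train(TRAIN)
```

With this toy data the context ["caught", "river"] is classified as the fish sense, but ["caught", "boat"] scores zero for every sense because "boat" never occurred in training. That is the situation smoothing is meant to repair.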

Lesson 2: zeros or not? Zipf’s Law: A small number of events occur with high frequency; a large number of events occur with low frequency. You can quickly collect statistics on the high-frequency events, but you might have to wait an arbitrarily long time to get valid statistics on low-frequency events. Result: Our estimates are sparse! We have no counts at all for the vast bulk of things we want to estimate. Some of the zeros in the table are really zeros, but others are simply low-frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN! How to address this? Answer: Estimate the likelihood of unseen N-grams!

Dealing with unknown words Training: - Assume a fixed vocabulary (e.g. all words that occur at least 5 times in the corpus) - Replace all other words by a token <UNK> - Estimate the model on this corpus Testing: - Replace all unknown words by <UNK> - Run the model
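The training and testing recipe above could be sketched as follows. The function names and the `<UNK>` constant are illustrative; the threshold of 5 is the one given as an example on the slide.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(tokens, min_count=5):
    """Fixed vocabulary: all words occurring at least min_count times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}

def replace_unknown(tokens, vocab):
    """Map every out-of-vocabulary token to <UNK> (used in training and testing)."""
    return [word if word in vocab else UNK for word in tokens]
```

The same `replace_unknown` call is applied to the training corpus before estimating the model and to the test text before running it, so `<UNK>` ends up with a real count and a non-zero probability.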

Smoothing is like Robin Hood: Steal from the rich and give to the poor (in probability mass)

Laplace smoothing Also called add-one smoothing Just add one to all the counts! Very simple MLE estimate: Laplace estimate:
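The two formulas on this slide are not preserved in the transcript. A minimal sketch, assuming the usual unigram forms P_MLE(w) = c(w) / N and P_Laplace(w) = (c(w) + 1) / (N + V), where N is the number of tokens and V the vocabulary size (these symbols are assumptions, not taken from the slide):

```python
from collections import Counter

def mle(counts, word):
    """Maximum likelihood estimate: relative frequency, zero for unseen words."""
    n = sum(counts.values())
    return counts[word] / n

def laplace(counts, word, vocab_size):
    """Add-one estimate: every word in the vocabulary gets one extra count."""
    n = sum(counts.values())
    return (counts[word] + 1) / (n + vocab_size)
```

For counts a=3, b=1 and a three-word vocabulary {a, b, c}, the unseen word c moves from probability 0 under MLE to 1/7 under add-one, while the smoothed probabilities still sum to 1.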

Why do pseudo-words give optimistic results for WSD? Example: banana-door. Find sentences containing either of the words. Replace each occurrence of either word with a new symbol. Here is a big annotated corpus! The correct sense is the original word. The problem is that different senses of the same real word are often semantically related, whereas pseudo-words are homonymous rather than polysemous, so they are easier to tell apart.

Let’s correct this. Question: Which is better, lexical sample or decision tree classifier, in terms of system performance and results? The lexical sample task is a way of formulating the task of WSD: a small pre-selected set of target words. Classifiers used to solve the task: naïve Bayes, decision trees, support vector machines.

What is relative entropy? KL divergence/relative entropy
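The formula on this slide is not preserved in the transcript. The standard definition, with P and Q standing for two distributions over the same events x (notation assumed here), is:

```latex
D(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
```

It is zero when the two distributions are identical and grows as P puts mass where Q puts little; it is not symmetric in P and Q.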

Selectional preference strength. eat x [FOOD]; be y [PRETTY-MUCH-ANYTHING]. Compare the distribution of expected semantic classes (FOOD, PEOPLE, LIQUIDS) with the distribution of expected semantic classes for a particular verb. The greater the difference between these distributions, the more information the verb is giving us about possible objects.
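One common way to measure "the difference between these distributions" is relative entropy (KL divergence) between P(class | verb) and the overall P(class). The sketch below assumes that choice; the probability values are invented for illustration only.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; terms with p(c) = 0 contribute nothing."""
    return sum(p[c] * math.log2(p[c] / q[c]) for c in p if p[c] > 0)

# Hypothetical distributions over semantic classes of the direct object.
PRIOR = {"FOOD": 0.2, "PEOPLE": 0.5, "LIQUIDS": 0.3}        # P(class)
GIVEN_EAT = {"FOOD": 0.9, "PEOPLE": 0.05, "LIQUIDS": 0.05}  # P(class | eat)
GIVEN_BE = {"FOOD": 0.2, "PEOPLE": 0.5, "LIQUIDS": 0.3}     # P(class | be)

def preference_strength(given_verb, prior=PRIOR):
    """Selectional preference strength of a verb as D(P(c|v) || P(c))."""
    return kl_divergence(given_verb, prior)
```

With these made-up numbers "eat" has a large preference strength and "be" has strength zero, matching the intuition that "be" tells us almost nothing about its object.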

Bootstrapping WSD algorithm. Question: after defining plant under life and manufacturing, how can the algorithm go on to define animal and microscopic from life? Does it do that recursively? Basically, how do words other than those we set out to label get labeled?

Bootstrapping. What if you don’t have enough data to train a system? Pick a word that you as an analyst think will co-occur with your target word in a particular sense. Grep through your corpus for your target word and the hypothesized word. Assume that the target tag is the right one. For bass: assume play occurs with the music sense and fish occurs with the fish sense.
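A minimal sketch of the seed-labeling step, plus one possible way the seed set could grow so that further sentences get labeled. The corpus, the growth rule (a word seen with only one sense in at least `min_count` labeled sentences becomes a new seed), and all names are assumptions for illustration, not the published algorithm.

```python
from collections import Counter

def label_with_seeds(sentences, target, seeds):
    """Label sentences that contain the target and seed words of exactly one sense."""
    labeled = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if target not in tokens:
            continue
        senses = {seeds[t] for t in tokens if t in seeds}
        if len(senses) == 1:
            labeled.append((senses.pop(), tokens))
    return labeled

def grow_seeds(labeled, seeds, target, min_count=2):
    """Add words that co-occur with only one sense in >= min_count labeled sentences."""
    by_word = {}
    for sense, tokens in labeled:
        for t in set(tokens):
            by_word.setdefault(t, Counter())[sense] += 1
    grown = dict(seeds)
    for word, sense_counts in by_word.items():
        if word == target or word in grown or len(sense_counts) != 1:
            continue
        (sense, n), = sense_counts.items()
        if n >= min_count:
            grown[word] = sense
    return grown

SEEDS = {"play": "music", "fish": "fish"}
CORPUS = [
    "We play bass with the band",
    "They play bass guitar in a band",
    "Fish like bass swim upstream",
    "You can fish for bass here",
    "The band needs a bass",
]
```

In this toy corpus the first pass labels four sentences. "band" then becomes a new seed for the music sense, and a second pass with the grown seeds also labels "The band needs a bass". Repeating the two steps is one way that words other than the original seeds come to drive the labeling.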

Sentences extracted using “fish” and “play”

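SimLin: why multiply by 2? Similarity is the ratio of the information in Common(A,B) to the information in Description(A,B). The common part is stated once for each of the two concepts, hence the factor of 2.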

Word similarity. Synonymy is a binary relation: two words are either synonymous or not. We want a looser metric: word similarity or word distance. Two words are more similar if they share more features of meaning. Actually these are really relations between senses: instead of saying “bank is like fund”, we say Bank1 is similar to fund3 and Bank2 is similar to slope5. We’ll compute them over both words and senses.

Two classes of algorithms. Thesaurus-based algorithms: based on whether words are “nearby” in WordNet. Distributional algorithms: compare words based on their contexts. “I like having X for dinner”: what are the possible values of X?

Thesaurus-based word similarity. We could use anything in the thesaurus: meronymy, glosses, example sentences. In practice, by “thesaurus-based” we just mean using the is-a/subsumption/hypernym hierarchy. Word similarity versus word relatedness: similar words are near-synonyms; related words could be related in any way. Car, gasoline: related, not similar. Car, bicycle: similar.

Path based similarity Two words are similar if nearby in thesaurus hierarchy (i.e. short path between them)

Refinements to path-based similarity. pathlen(c1,c2) = number of edges in the shortest path between the sense nodes c1 and c2. simpath(c1,c2) = -log pathlen(c1,c2). wordsim(w1,w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of sim(c1,c2).
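The three definitions above can be sketched on a tiny tree-shaped hierarchy. The hierarchy and sense inventory below are hypothetical (loosely modeled on a coin/money fragment), and the code assumes every node has a single parent.

```python
import math

# Hypothetical toy is-a hierarchy: child -> parent.
PARENT = {
    "nickel": "coin", "dime": "coin", "coin": "coinage",
    "coinage": "currency", "currency": "medium_of_exchange",
    "money": "medium_of_exchange", "medium_of_exchange": "standard",
    "standard": "entity",
    "nickel_metal": "metal", "metal": "substance", "substance": "entity",
}
# Hypothetical sense inventory: word -> sense nodes.
SENSES = {"nickel": ["nickel", "nickel_metal"], "dime": ["dime"], "money": ["money"]}

def ancestors(node):
    """The node itself followed by its chain of hypernyms up to the root."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def pathlen(c1, c2):
    """Number of edges on the shortest path between two sense nodes."""
    depth_in_c2 = {n: i for i, n in enumerate(ancestors(c2))}
    for i, n in enumerate(ancestors(c1)):
        if n in depth_in_c2:
            return i + depth_in_c2[n]
    raise ValueError("no common ancestor")

def sim_path(c1, c2):
    """-log pathlen; identical nodes (pathlen 0) are treated as infinitely similar."""
    n = pathlen(c1, c2)
    return math.inf if n == 0 else -math.log(n)

def word_sim(w1, w2):
    """Maximum sense-level similarity over all sense pairs of the two words."""
    return max(sim_path(c1, c2) for c1 in SENSES[w1] for c2 in SENSES[w2])
```

In this toy tree, pathlen(nickel, money) and pathlen(nickel, standard) are both 5, so the measure cannot tell the two pairs apart even though nickel and money feel closer.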

Problem with basic path-based similarity. It assumes each link represents a uniform distance. Nickel to money seems closer than nickel to standard. Instead, we want a metric which lets us represent the cost of each edge independently.

Information content similarity metrics. Let’s define P(c) as the probability that a randomly selected word in a corpus is an instance of concept c. Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy. P(root) = 1. The lower a node in the hierarchy, the lower its probability.

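Information content similarity. Train by counting in a corpus: 1 instance of “dime” could count toward the frequency of coin, currency, standard, etc. More formally: P(c) = (sum of count(w) over all words w subsumed by concept c) / N, where N is the total number of word tokens counted.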

Information content similarity: WordNet hierarchy augmented with probabilities P(c)

Information content: definitions. IC(c) = -log P(c). Lowest common subsumer LCS(c1,c2): the lowest node in the hierarchy that subsumes (is a hypernym of) both c1 and c2.

Resnik method. The similarity between two words is related to their common information: the more two words have in common, the more similar they are. Resnik: measure the common information as the information content of the lowest common subsumer of the two nodes. simresnik(c1,c2) = -log P(LCS(c1,c2)).

SimLin(c1,c2) = 2 x log P(LCS(c1,c2)) / (log P(c1) + log P(c2)). SimLin(hill,coast) = 2 x log P(geological-formation) / (log P(hill) + log P(coast)) = .59
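A minimal sketch of the Resnik and Lin measures. The concept probabilities below do not appear in this transcript; they are assumed values chosen so that the hill/coast example comes out at the .59 shown on the slide. The lowest common subsumer is passed in by hand rather than looked up in a hierarchy.

```python
import math

# Assumed concept probabilities P(c) (not given in the transcript).
P = {
    "geological-formation": 0.00176,
    "hill": 0.0000189,
    "coast": 0.0000216,
}

def information_content(c):
    """IC(c) = -log P(c)."""
    return -math.log(P[c])

def sim_resnik(lcs):
    """Resnik: information content of the lowest common subsumer."""
    return information_content(lcs)

def sim_lin(c1, c2, lcs):
    """Lin: 2 * log P(LCS) / (log P(c1) + log P(c2))."""
    return 2 * math.log(P[lcs]) / (math.log(P[c1]) + math.log(P[c2]))
```

Because the numerator and denominator are both logs of probabilities, the Lin value does not depend on the base of the logarithm, whereas the Resnik value does.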

Extended Lesk. Two concepts are similar if their glosses contain similar words. Drawing paper: paper that is specially prepared for use in drafting. Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface. For each n-word phrase that occurs in both glosses, add a score of n^2. Here “paper” and “specially prepared” overlap, for a score of 1 + 4 = 5.
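The gloss-overlap score can be sketched with a greedy procedure: repeatedly find the longest phrase shared by the two glosses, add its squared length, and remove it. This is a simplified sketch under stated assumptions: it compares only the two glosses themselves and does not filter out overlaps made of function words.

```python
def longest_common_phrase(a, b):
    """Return (length, start_in_a, start_in_b) of the longest shared token run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

def overlap_score(gloss1, gloss2):
    """Sum of n^2 over the shared n-word phrases, found greedily longest-first."""
    a, b = gloss1.lower().split(), gloss2.lower().split()
    score = 0
    while True:
        n, i, j = longest_common_phrase(a, b)
        if n == 0:
            return score
        score += n * n
        # Replace the matched phrase with unique markers that never match again.
        a[i:i + n] = [object()]
        b[j:j + n] = [object()]

DRAWING_PAPER = "paper that is specially prepared for use in drafting"
DECAL = ("the art of transferring designs from specially prepared paper "
         "to a wood or glass or metal surface")
```

For the two glosses on the slide, the greedy pass first removes "specially prepared" (score 4) and then "paper" (score 1), giving 5.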

Summary: thesaurus-based similarity

Problems with thesaurus-based methods. We don’t have a thesaurus for every language. Even if we do, many words are missing. They rely on hyponym info: strong for nouns, but lacking for adjectives and even verbs. Alternative: distributional methods for word similarity.

Distributional methods for word similarity. A bottle of tezgüino is on the table. Everybody likes tezgüino. Tezgüino makes you drunk. We make tezgüino out of corn. Intuition: just from these contexts a human could guess the meaning of tezgüino. So we should look at the surrounding contexts and see what other words have similar contexts.

Context vector. Consider a target word w. Suppose we had one binary feature fi for each of the N words vi in the lexicon, which means “word vi occurs in the neighborhood of w”. Then w = (f1, f2, f3, …, fN). If w = tezgüino, v1 = bottle, v2 = drunk, v3 = matrix: w = (1, 1, 0, …).

Intuition. Define two words by these sparse feature vectors. Apply a vector distance metric. Say that two words are similar if their two vectors are similar.
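A small sketch of binary context vectors and one possible vector metric (cosine). The window size of 3 tokens on each side is an assumption made here so that the tezgüino example reproduces the vector (1, 1, 0); the slides do not define "neighborhood".

```python
import math

def context_vector(target, sentences, lexicon, window=3):
    """Binary vector: 1 if the lexicon word occurs within `window` tokens of target."""
    seen = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                seen.update(tokens[max(0, i - window):i])
                seen.update(tokens[i + 1:i + 1 + window])
    return [1 if v in seen else 0 for v in lexicon]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

SENTENCES = [
    "a bottle of tezgüino is on the table",
    "everybody likes tezgüino",
    "tezgüino makes you drunk",
    "we make tezgüino out of corn",
]
LEXICON = ["bottle", "drunk", "matrix"]
```

Calling `context_vector("tezgüino", SENTENCES, LEXICON)` gives the vector from the slide, and two words can then be compared by the cosine of their vectors.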

Distributional similarity. So we just need to specify 3 things: how the co-occurrence terms are defined; how terms are weighted (frequency? logs? mutual information?); and what vector distance metric we should use (cosine? Euclidean distance?).

Defining co-occurrence vectors He drinks X every morning Idea: parse the sentence, extract syntactic dependencies:

Co-occurrence vectors based on dependencies

Measures of association with context. We have been using the frequency of some feature as its weight or value, but we could use any function of this frequency. Let’s consider one feature f = (r, w’) = (obj-of, attack). P(f|w) = count(f,w) / count(w). Assocprob(w,f) = P(f|w).

Weighting: Mutual Information. Pointwise mutual information: a measure of how often two events x and y occur together, compared with what we would expect if they were independent. PMI between a target word w and a feature f:
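The PMI formula itself is not preserved in the transcript. A sketch assuming the usual definition, PMI(w, f) = log2( P(w, f) / (P(w) P(f)) ), with probabilities estimated from (word, feature) co-occurrence counts; the counts below are invented.

```python
import math
from collections import Counter

def pmi(pair_counts, w, f):
    """log2 of observed joint probability over the product of the marginals."""
    total = sum(pair_counts.values())
    p_wf = pair_counts[(w, f)] / total
    p_w = sum(c for (w2, _), c in pair_counts.items() if w2 == w) / total
    p_f = sum(c for (_, f2), c in pair_counts.items() if f2 == f) / total
    if p_wf == 0:
        return -math.inf  # the pair was never observed together
    return math.log2(p_wf / (p_w * p_f))

# Hypothetical counts of (target word, (relation, context word)) pairs.
COUNTS = Counter({
    ("tea", ("obj-of", "drink")): 2,
    ("car", ("obj-of", "buy")): 2,
})
```

Here "tea" as object of "drink" occurs twice as often as independence would predict, so its PMI is 1 bit; a pair that never co-occurs gets negative infinity, which is one reason practical systems clip or smooth these values.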

Mutual information intuition Objects of the verb drink

Lin is a variant on PMI. Pointwise mutual information: how often two events x and y occur together, compared with what we would expect if they were independent. PMI between a target word w and a feature f: Lin measure: breaks down the expected value for P(f) differently:

Similarity measures

What is the baseline algorithm (Lin & Hovy paper)? Good question, they don’t say! Possibilities: random selection; first N words.

TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages

Unsupervised Discourse Segmentation. Hearst (1997): 21-paragraph science news article called “Stargazers”. Goal: produce the following subtopic segments:

Intuition of cohesion-based segmentation. Sentences or paragraphs in a subtopic are cohesive with each other, but not with paragraphs in a neighboring subtopic. Thus if we measured the cohesion between each pair of neighboring sentences, we might expect a ‘dip’ in cohesion at subtopic boundaries.

TextTiling (Hearst 1997). Tokenization: each space-delimited word is converted to lower case; throw out stop list words; stem the rest; group into pseudo-sentences of length w=20. Lexical Score Determination: cohesion score, the average similarity (cosine measure) across each gap (20 pseudo-sentences). Boundary Identification.
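The three stages could be sketched roughly as follows. This is a simplified sketch, not Hearst's exact procedure: stemming is omitted, the stop list is a tiny hypothetical one, the cohesion score compares the k pseudo-sentences on each side of a gap, and the boundary cutoff is a caller-supplied parameter rather than a published threshold.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}  # hypothetical stop list

def cosine(c1, c2):
    """Cosine similarity between two bags of words (Counters)."""
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def gap_scores(blocks, k):
    """Cohesion at each gap: similarity of the k blocks before and after it."""
    scores = []
    for g in range(1, len(blocks)):
        left = Counter(t for b in blocks[max(0, g - k):g] for t in b)
        right = Counter(t for b in blocks[g:g + k] for t in b)
        scores.append(cosine(left, right))
    return scores

def depth_scores(scores):
    """Depth of each valley: distance up to the nearest peak on both sides."""
    depths = []
    for i, s in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        depths.append((scores[left] - s) + (scores[right] - s))
    return depths

def text_tile(text, w=20, k=10, cutoff=0.5):
    """Return indices of pseudo-sentences that start a new segment."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    blocks = [tokens[i:i + w] for i in range(0, len(tokens), w)]
    depths = depth_scores(gap_scores(blocks, k))
    return [g + 1 for g, d in enumerate(depths) if d > cutoff]
```

On a text whose first half is about astronomy and second half about fishing, the cohesion score dips to zero at the topic change, and that gap is the one returned.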

TextTiling algorithm

Cosine

Could you use stemming to compare synsets (distances between words)? E.g. musician → music. Is there a way to deal with inconsistent granularities of relations?

Zipf's law and Heaps' law. Both are related to the fact that there are a lot of words that one would see very rarely. This means that when we build language models (estimate probabilities of words) we will have unreliable estimates for many of them. This is why we were talking about smoothing!

Heaps' law: estimating the number of terms. M = kT^b, where M is the vocabulary size (number of terms) and T is the number of tokens. Typically 30 < k < 100 and b = 0.5. Linear relation between vocabulary size and number of tokens in log-log space.

Zipf’s law: modeling the distribution of terms. The collection frequency of the ith most common term is proportional to 1/i. If the most frequent term occurs cf1 times, then the second most frequent term has half as many occurrences, the third most frequent term has a third as many, etc.
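Both laws are simple enough to write down directly. The sketch assumes the Heaps' law form M = k * T^b with the parameter ranges quoted above and the idealized Zipf form cf_i = cf_1 / i; the default k = 44 is just one value inside the quoted range.

```python
import math

def heaps_vocabulary_size(num_tokens, k=44.0, b=0.5):
    """Heaps' law estimate of vocabulary size M for a collection of T tokens."""
    return k * num_tokens ** b

def zipf_frequency(cf1, rank):
    """Idealized Zipf estimate of the collection frequency at a given rank."""
    return cf1 / rank
```

Taking logs of the Heaps' law form gives log M = log k + b log T, which is the straight line in log-log space mentioned on the slide, with slope b.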

Bigram Model. Approximate P(unicorn | the mythical) by P(unicorn | mythical). Markov assumption: the probability of a word depends only on a limited history. Generalization: the probability of a word depends only on the n previous words (trigrams, 4-grams, …). The higher n is, the more data is needed to train; backoff models…

A Simple Example: bigram model. P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
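The product above can be computed from counts with unsmoothed maximum likelihood estimates, P(word | prev) = count(prev, word) / count(prev). The three-sentence training corpus below is made up for illustration.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count contexts and bigrams, padding each sentence with <start> and <end>."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<start>"] + sentence.split() + ["<end>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def sentence_probability(sentence, contexts, bigrams):
    """Product of unsmoothed bigram probabilities; 0.0 if any bigram is unseen."""
    tokens = ["<start>"] + sentence.split() + ["<end>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0
        p *= bigrams[(prev, word)] / contexts[prev]
    return p

CORPUS = ["I want to eat Chinese food", "I want to eat", "I want Chinese food"]
CONTEXTS, BIGRAMS = train_bigrams(CORPUS)
```

With this corpus the example sentence gets probability 2/3 x 1/2 = 1/3 (all other factors are 1), while a sentence containing any unseen bigram, such as "I eat food", gets probability zero.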

Generating WSJ

Maximum tf normalization: wasn’t useful for summarization, why discuss it then?

Accuracy, precision and recall

Accuracy: a problematic measure for IR evaluation. Accuracy = (tp+tn)/(tp+tn+fp+fn). 99.9% of the documents will be nonrelevant, so high performance is trivially achieved by labeling every document nonrelevant.

Precision

Recall

Precision/Recall trade-off. Which is more important depends on the user's needs. Typical web users: high precision in the first page of results. Paralegals and intelligence analysts: need high recall, and are willing to tolerate some irrelevant documents as a price.

F-measure
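The formulas for precision, recall and F-measure are not preserved in the transcript. A sketch assuming the standard definitions in terms of true/false positives and negatives, with the general F-measure weighted by beta (beta = 1 gives the harmonic mean):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all decisions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of relevant documents that are retrieved."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0
```

For a collection of 10,000 documents with 10 relevant ones, a system that returns nothing has accuracy 0.999 but recall 0, which is the problem with accuracy noted above.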

Explain what vector representation means? I tried explaining it to my mother and had difficulty as to what this "vector" implies. She kept saying she thought of vectors in terms of graphs.

How can NLP be used in spam?

Using the entire string as feature? NO! Topic categorization: classify the document into semantic topics. The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie. One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.

Selectional preference strength. How do you find verbs that are strongly associated with a given subject? eat x [FOOD]; be y [PRETTY-MUCH-ANYTHING]. Compare the distribution of expected semantic classes (FOOD, PEOPLE, LIQUIDS) with the distribution of expected semantic classes for a particular verb. The greater the difference between these distributions, the more information the verb is giving us about possible objects.