Natural Language Processing


Natural Language Processing Giuseppe Attardi Language Modeling IP notice: some slides from: Dan Jurafsky, Jim Martin, Sandiway Fong, Dan Klein

Outline Language Modeling (N-grams) N-gram Intro The Chain Rule The Shannon Visualization Method Evaluation: Perplexity Smoothing: Laplace (Add-1) Add-prior

Probabilistic Language Model Goal: assign a probability to a sentence Machine Translation: P(high winds tonite) > P(large winds tonite) Spell Correction: “The office is about fifteen minuets from my house” P(about fifteen minutes from) > P(about fifteen minuets from) Speech Recognition: P(I saw a van) >> P(eyes awe of an) Summarization, question-answering, etc.

Why Language Models We have an English speech recognition system, which answer is better? Speech Interpretation speech recognition system speech cognition system speck podcast histamine スピーチ が 救出 ストン Language models tell us the answer!

Language Modeling We want to compute P(w1,w2,w3,w4,w5…wn) = P(W) = the probability of a sequence Alternatively we want to compute P(w5|w1,w2,w3,w4) = the probability of a word given some previous words The model that computes P(W) or P(wn|w1,w2…wn-1) is called the language model. A better term for this would be “The Grammar” But “Language model” or LM is standard

Computing P(W) How to compute this joint probability: P(“the”, “other”, “day”, “I”, “was”, “walking”, “along”, “and”, “saw”, “a”, “lizard”) Intuition: let’s rely on the Chain Rule of Probability

The Chain Rule Recall the definition of conditional probabilities: P(B|A) = P(A,B) / P(A) Rewriting: P(A,B) = P(A)P(B|A) More generally: P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C) In general: P(x1,x2,x3,…xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1…xn-1)

The Chain Rule applied to joint probability of words in sentence P(“the big red dog was”) = P(the) • P(big|the) • P(red|the big) • P(dog|the big red) • P(was|the big red dog)

Obvious estimate How to estimate P(the | its water is so transparent that)? P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)

Unfortunately There are a lot of possible sentences We will never be able to get enough data to compute the statistics for those long prefixes P(lizard|the,other,day,I,was,walking,along,and,saw,a) or P(the|its water is so transparent that)

Markov Assumption Make the simplifying assumption P(lizard|the,other,day,I,was,walking,along,and,saw,a) ≈ P(lizard|a) or maybe P(lizard|the,other,day,I,was,walking,along,and,saw,a) ≈ P(lizard|saw,a)

Markov Assumption So for each component in the product, replace with the approximation (assuming a prefix of N): P(wn|w1…wn-1) ≈ P(wn|wn-N+1…wn-1) Bigram model: P(wn|w1…wn-1) ≈ P(wn|wn-1)

N-gram models We can extend to trigrams, 4-grams, 5-grams In general this is an insufficient model of language because language has long-distance dependencies: “The computer which I had just put into the machine room on the fifth floor crashed.” But we can often get away with N-gram models

Estimating bigram probabilities The Maximum Likelihood Estimate: P(wi|wi-1) = C(wi-1 wi) / C(wi-1)

An example <s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s> This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set|Model)
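The bigram estimates for this three-sentence corpus can be checked with a short script. This is a minimal sketch, assuming whitespace tokenization; the helper name `bigram_mle` is introduced here for illustration and is not part of the slides.

```python
from collections import Counter

# The three training sentences from the example above.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))


def bigram_mle(word, prev):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]


print(bigram_mle("I", "<s>"))   # 2 of the 3 sentences start with "I"
print(bigram_mle("Sam", "am"))  # "am" occurs twice, once followed by "Sam"
print(bigram_mle("do", "I"))    # "I" occurs three times, once followed by "do"
```

Each estimate is simply a relative frequency, which is why any bigram that never occurs in these three sentences gets probability 0.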

Maximum Likelihood Estimates The Maximum Likelihood Estimate of some parameter of a model M from a training set T is the estimate that maximizes the likelihood of the training set T given the model M Suppose the word “Chinese” occurs 400 times in a corpus of a million words (e.g. the Brown corpus) What is the probability that a random word from some other text will be “Chinese”? The MLE estimate is 400/1000000 = .0004 This may be a bad estimate for some other corpus But it is the estimate that makes it most likely that “Chinese” will occur 400 times in a million word corpus.

Maximum Likelihood We want to estimate the probability, p, that individuals are infected with a certain kind of parasite. The maximum likelihood method (discrete distribution): 1. Write down the probability of each observation by using the model parameters 2. Write down the probability of all the data 3. Find the value of the parameter(s) that maximizes this probability [Table: individuals 1 to 10; columns Ind., Infected, Probability of observation; an infected individual contributes p, an uninfected one 1 – p]

Maximum likelihood We want to estimate the probability, p, that individuals are infected with a certain kind of parasite. Likelihood function: the product of the per-individual probabilities, L(p) = p^k (1 – p)^(10 – k) for k infected individuals out of 10 Find the value of the parameter(s) that maximizes this probability [Table: individuals 1 to 10; columns Ind., Infected, Probability of observation]

Computing the MLE Set the derivative to 0: Solutions: p = 0 (minimum) p = 0.6 (maximum)
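The individual rows of the observation table are not available here, so the following derivation is a sketch under one assumption: 6 of the 10 individuals are infected, which is the only count consistent with the stated maximum p = 0.6.

```latex
L(p) = p^{6}(1-p)^{4}
\qquad
\frac{dL}{dp} = 6p^{5}(1-p)^{4} - 4p^{6}(1-p)^{3}
             = p^{5}(1-p)^{3}\,(6 - 10p) = 0
\;\Longrightarrow\;
p = 0 \ \text{or}\ p = 1 \ (L = 0,\ \text{minima}),
\qquad
p = \tfrac{6}{10} = 0.6 \ (\text{maximum})
```

Under that assumption the maximum likelihood estimate is the observed relative frequency of infection, the same pattern as the count ratios used for n-gram probabilities.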

More examples: Berkeley Restaurant Project can you tell me about any good cantonese restaurants close by mid priced thai food is what i’m looking for tell me about chez panisse can you give me a listing of the kinds of food that are available i’m looking for a good place to eat breakfast when is caffe venezia open during the day

Raw bigram counts Out of 9222 sentences

Raw bigram probabilities Normalize by unigrams (divide each bigram count by C(wi-1), the unigram count of its first word): Result:

Bigram estimates of sentence probabilities P(<s> i want english food </s>) = P(i|<s>) × P(want|i) × P(english|want) × P(food|english) × P(</s>|food) = .000031

What kinds of knowledge? P(english|want) = .0011 P(chinese|want) = .0065 P(to|want) = .66 P(eat | to) = .28 P(food | to) = 0 P(want | spend) = 0 P(i | <s>) = .25

Practical Issues Compute in log space Avoid underflow Adding is faster than multiplying log(p1 • p2 • p3 • p4) = log(p1) + log(p2) + log(p3) + log(p4)
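A small sketch of the log-space computation follows. Only 0.25 and 0.0011 appear in this transcript; the other three bigram probabilities are assumed values taken from the textbook version of the Berkeley Restaurant example, and the 100-word "long sequence" is invented to show underflow.

```python
import math

# Bigram probabilities for "<s> i want english food </s>".
# 0.25 and 0.0011 are from the slides; 0.33, 0.5 and 0.68 are assumptions.
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

log_prob = sum(math.log(p) for p in probs)  # add logs instead of multiplying
print(log_prob)
print(math.exp(log_prob))                   # back to a probability, about 3.1e-05

# A long sequence of small probabilities underflows when multiplied directly,
# while the sum of logs stays perfectly representable.
long_sequence = [1e-5] * 100
print(math.prod(long_sequence))                  # underflows to 0.0
print(sum(math.log(p) for p in long_sequence))   # a finite negative number
```

Because log is monotonic, comparing two sentences by log probability gives the same ranking as comparing them by probability.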

Shannon’s Game What if we turn these models around and use them to generate random sentences that are like the sentences from which the model was derived? Jim Martin

The Shannon Visualization Method Generate random sentences: Choose a random bigram (<s>, w) according to its probability Now choose a random bigram (w, x) according to its probability And so on until we choose </s> Then string the words together: (<s>, I) (I, want) (want, to) (to, eat) (eat, Chinese) (Chinese, food) (food, </s>) gives “I want to eat Chinese food”
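The sampling procedure can be sketched in a few lines. The toy bigram table below is an illustrative assumption (it is not the Berkeley Restaurant estimates), and `generate_sentence` is a hypothetical helper name.

```python
import random

# Toy bigram model: P(next | current). Words and probabilities are
# illustrative assumptions, not estimates from a real corpus.
bigram_model = {
    "<s>": {"I": 0.8, "Chinese": 0.2},
    "I": {"want": 1.0},
    "want": {"to": 0.7, "Chinese": 0.3},
    "to": {"eat": 1.0},
    "eat": {"Chinese": 0.6, "food": 0.4},
    "Chinese": {"food": 1.0},
    "food": {"</s>": 1.0},
}


def generate_sentence(model, rng):
    """Sample words one bigram at a time, starting at <s>, until </s>."""
    word, words = "<s>", []
    while True:
        successors = model[word]
        word = rng.choices(list(successors), weights=list(successors.values()))[0]
        if word == "</s>":
            return words
        words.append(word)


print(" ".join(generate_sentence(bigram_model, random.Random(0))))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; with a model trained on real text the same loop produces the "approximating Shakespeare" style of output.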

Approximating Shakespeare

Shakespeare as corpus N = 884,647 tokens, V = 29,066 Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table) Quadrigrams: What's coming out looks like Shakespeare because it is Shakespeare

The Wall Street Journal is not Shakespeare (no offense)

Lesson 1: the perils of overfitting N-grams only work well for word prediction if the test corpus looks like the training corpus In real life, it often doesn’t We need to train robust models, adapt to test set, etc.

Train and Test Corpora A language model must be trained on a large corpus of text to estimate good parameter values. Model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate). Ideally, the training (and test) corpus should be representative of the actual application data. May need to adapt a general model to a small amount of new (in-domain) data by adding highly weighted small corpus to original training data.

Smoothing

Smoothing Since there are a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so MLE incorrectly assigns zero to many parameters (aka sparse data). If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero (i.e. infinite perplexity). In practice, parameters are smoothed (aka regularized) to reassign some probability mass to unseen events. Adding probability mass to unseen events requires removing it from seen ones (discounting) in order to maintain a joint distribution that sums to 1.

Smoothing is like Robin Hood: Steal from the rich and give to the poor (in probability mass) Slide from Dan Klein

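Laplace smoothing Also called add-one smoothing Just add one to all the counts! Very simple MLE estimate: P(wi|wi-1) = C(wi-1 wi) / C(wi-1) Laplace estimate: P(wi|wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V) Reconstructed counts: c*(wi-1 wi) = (C(wi-1 wi) + 1) × C(wi-1) / (C(wi-1) + V)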

Laplace smoothed bigram counts Berkeley Restaurant Corpus

Laplace-smoothed bigrams

Reconstituted counts  

Note big change to counts C(want to) went from 608 to 238! P(to|want) from .66 to .26! Discount d = c*/c d for “chinese food” = .10 A 10x reduction! So in general, Laplace is a blunt instrument Laplace smoothing is not used for N-grams, as we have much better methods Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially for pilot studies and in domains where the number of zeros isn’t so huge.
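The numbers quoted for "want to" can be reproduced with a short sketch. C(want to) = 608 is from the slide; C(want) = 927 and V = 1446 are assumed here from the textbook version of the Berkeley Restaurant corpus, and the function names are hypothetical.

```python
def laplace_prob(bigram_count, context_count, vocab_size):
    """Add-one estimate: (C(prev word) + 1) / (C(prev) + V)."""
    return (bigram_count + 1) / (context_count + vocab_size)


def reconstituted_count(bigram_count, context_count, vocab_size):
    """Count c* that would yield the Laplace probability under plain MLE."""
    return laplace_prob(bigram_count, context_count, vocab_size) * context_count


# 608 is from the slide; 927 and 1446 are assumed corpus statistics.
c_want_to, c_want, V = 608, 927, 1446

print(round(c_want_to / c_want, 2))                      # MLE P(to|want): 0.66
print(round(laplace_prob(c_want_to, c_want, V), 2))      # Laplace P(to|want): 0.26
print(round(reconstituted_count(c_want_to, c_want, V)))  # c*(want to): 238
```

The large drop comes from adding V = 1446 to a context count of only 927: most of the mass for the context "want" is handed to bigrams that were never observed.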

Add-k Add a small fraction k instead of 1, e.g. k = 0.01: P(wi|wi-1) = (C(wi-1 wi) + k) / (C(wi-1) + kV)

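Even better: Bayesian unigram prior smoothing for bigrams Maximum Likelihood Estimation: P(wi|wi-1) = C(wi-1 wi) / C(wi-1) Laplace Smoothing: P(wi|wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V) Bayesian Prior Smoothing: P(wi|wi-1) = (C(wi-1 wi) + P(wi)) / (C(wi-1) + 1)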

Lesson 2: zeros or not? Zipf’s Law: A small number of events occur with high frequency A large number of events occur with low frequency You can quickly collect statistics on the high frequency events You might have to wait an arbitrarily long time to get valid statistics on low frequency events Result: Our estimates are sparse! No counts at all for the vast bulk of things we want to estimate! Some of the zeroes in the table are really zeros But others are simply low frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN! How to address? Answer: Estimate the likelihood of unseen N-grams! Slide from B. Dorr and J. Hirschberg

Zipf's law f ∝ 1/r (f proportional to 1/r, where f is the frequency of a word and r its rank): there is a constant k such that f · r = k

Zipf's Law for the Brown Corpus

Zipf's law: interpretation Principle of least effort: both the speaker and the hearer in communication try to minimize effort: Speakers tend to use a small vocabulary of common (shorter) words Hearers prefer a large vocabulary of rarer, less ambiguous words Zipf's law is the result of this compromise Other laws … Number of meanings m of a word obeys the law: m ∝ √f Inverse relationship between frequency and length

Practical Issues We do everything in log space Avoid underflow (also adding is faster than multiplying)

Language Modeling Toolkits SRILM http://www.speech.sri.com/projects/srilm/ IRSTLM KenLM

Google N-Gram Release

Google Book N-grams http://ngrams.googlelabs.com/

Google N-Gram Release serve as the incoming 92 serve as the incubator 99 serve as the independent 794 serve as the index 223 serve as the indication 72 serve as the indicator 120 serve as the indicators 45 serve as the indispensable 111 serve as the indispensible 40 serve as the individual 234 http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Evaluation and Perplexity

Evaluation Train parameters of our model on a training set. How do we evaluate how well our model works? Look at the model's performance on some new data This is what happens in the real world; we want to know how our model performs on data we haven’t seen Use a test set: a dataset which is different from our training set Then we need an evaluation metric to tell us how well our model is doing on the test set. One such metric is perplexity

Evaluating N-gram models Best evaluation for an N-gram model: Put model A in a task (language identification, speech recognizer, machine translation system) Run the task, get an accuracy for A (how many languages identified correctly, or Word Error Rate, etc.) Put model B in the task, get accuracy for B Compare accuracy for A and B This is extrinsic evaluation

Language Identification task Create an N-gram model for each language Compute the probability of a given text under each model: P_lang1(text), P_lang2(text), P_lang3(text) Select the language with the highest probability: lang = argmax_l P_l(text)
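The argmax rule can be sketched with character bigram models. Everything specific here is an assumption made for illustration: the two tiny training texts, the add-one smoothing, the alphabet size of 256, and the helper names.

```python
import math
from collections import Counter


def train_char_bigrams(text):
    """Count character bigrams and their left contexts for one language."""
    chars = list(text)
    return Counter(zip(chars, chars[1:])), Counter(chars[:-1])


def log_prob(text, model, alphabet_size=256):
    """Add-one smoothed log P(text) under a character bigram model."""
    bigrams, contexts = model
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + alphabet_size))
        for a, b in zip(text, text[1:])
    )


# Tiny made-up training texts; a real system needs far larger corpora.
models = {
    "en": train_char_bigrams("the cat sat on the mat and the dog ate the food"),
    "it": train_char_bigrams("il gatto mangia il cibo e il cane dorme sul letto"),
}


def identify(text):
    """lang = argmax over languages l of P_l(text)."""
    return max(models, key=lambda lang: log_prob(text, models[lang]))


print(identify("the dog sat on the mat"))
print(identify("il cane mangia il cibo"))
```

Smoothing matters even in this toy: without it, a single unseen character pair would give the text probability zero under every model.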

Difficulty of extrinsic (in-vivo) evaluation of N-gram models Extrinsic evaluation is really time-consuming: it can take days to run an experiment So, as a temporary solution, in order to run experiments, we often evaluate N-grams with an intrinsic evaluation, an approximation called perplexity But perplexity is a poor approximation unless the test data looks just like the training data So it is generally only useful in pilot experiments (and generally not sufficient to publish)

Perplexity The intuition behind perplexity as a measure is the notion of surprise. How surprised is the language model when it sees the test set? Where surprise is a measure of... Gee, I didn’t see that coming... The more surprised the model is, the lower the probability it assigned to the test set The higher the probability, the less surprised it was

Perplexity A measure of how well a model “fits” the test data. Uses the probability that the model assigns to the test corpus. Normalizes for the number of words in the test corpus and takes the inverse. Measures the weighted average branching factor in predicting the next word (lower is better).

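Perplexity Perplexity: PP(W) = P(w1 w2 … wN)^(-1/N) Chain rule: PP(W) = (∏i 1 / P(wi|w1…wi-1))^(1/N) For bigrams: PP(W) = (∏i 1 / P(wi|wi-1))^(1/N) Minimizing perplexity is the same as maximizing probability The best language model is one that best predicts an unseen test set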

Perplexity as branching factor How hard is the task of recognizing digits ‘0,1,2,3,4,5,6,7,8,9’ Perplexity: 10
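The branching-factor reading can be checked numerically. This is a minimal sketch assuming a model that gives each of the ten digits probability 1/10; the function name `perplexity` and the test string length of 25 are arbitrary choices.

```python
import math


def perplexity(word_probs):
    """Inverse probability of the test sequence, normalized by its length N.

    Computed in log space: exp(-(1/N) * sum(log p_i)).
    """
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)


# A model that assigns probability 1/10 to every digit in the test string.
digit_probs = [0.1] * 25
print(perplexity(digit_probs))  # approximately 10, whatever the length
```

The result does not depend on the length of the digit string, which is the point of normalizing by N.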

Lower perplexity = better model Model trained on 38 million words from the Wall Street Journal (WSJ) using a 19,979 word vocabulary. Evaluation on a disjoint set of 1.5 million WSJ words.
N-gram Order   Unigram   Bigram   Trigram
Perplexity     962       170      109

Unknown Words How to handle words in the test corpus that did not occur in the training data, i.e. out of vocabulary (OOV) words? Train a model that includes an explicit symbol for an unknown word (<UNK>): Choose a vocabulary in advance and replace other words in the training corpus with <UNK>, or Replace the first occurrence of each word in the training data with <UNK>.

Unknown Words handling Training of <UNK> probabilities Create a fixed lexicon L of size V Any training word not in L changed to <UNK> Now we train its probabilities like a normal word At decoding time In text input: use <UNK> probabilities for any word not in training
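The fixed-lexicon recipe can be sketched as follows. Choosing the lexicon as the most frequent training words is one common option and an assumption here, as are the helper names and the toy data.

```python
from collections import Counter

UNK = "<UNK>"


def build_lexicon(training_tokens, size):
    """Fixed lexicon L: the `size` most frequent training words (an assumption)."""
    return {word for word, _ in Counter(training_tokens).most_common(size)}


def replace_oov(tokens, lexicon):
    """Map every token outside the lexicon to <UNK>."""
    return [token if token in lexicon else UNK for token in tokens]


train = "the cat sat on the mat the cat ate".split()
lexicon = build_lexicon(train, size=2)  # "the" (3 times) and "cat" (2 times)

# Training text after replacement: <UNK> is now counted like any other word.
print(replace_oov(train, lexicon))
# At decoding time the same mapping is applied to the input text.
print(replace_oov("the dog sat".split(), lexicon))
```

After this preprocessing the n-gram counts are collected as usual, so `<UNK>` gets its own probabilities without any special case in the estimator.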

Smoothing

Advanced LM stuff Current best smoothing algorithm: Kneser-Ney smoothing Other stuff: Interpolation Backoff Variable-length n-grams Class-based n-grams (Clustering, Hand-built classes) Cache LMs Topic-based LMs Sentence mixture models Skipping LMs Parser-based LMs Word Embeddings

Backoff and Interpolation If we are estimating: Trigram P(z|xy) but C(xyz) is zero Use info from: Bigram P(z|y) Or even: Unigram P(z) How to combine the trigram/bigram/unigram info?

Backoff versus interpolation Backoff: use trigram if you have it, otherwise bigram, otherwise unigram Interpolation: mix all three

Backoff Only use lower-order model when data for higher-order model is unavailable Recursively back off to weaker models until data is available: P_katz(z|x,y) = P*(z|x,y) if C(x,y,z) > 0, otherwise α(x,y) · P_katz(z|y) Where P* is a discounted probability estimate to reserve mass for unseen events and the α's are back-off weights (see book for details).

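Interpolation Simple interpolation: P̂(wn|wn-2 wn-1) = λ1 P(wn|wn-2 wn-1) + λ2 P(wn|wn-1) + λ3 P(wn), with λ1 + λ2 + λ3 = 1 Lambdas conditional on context: P̂(wn|wn-2 wn-1) = λ1(wn-2 wn-1) P(wn|wn-2 wn-1) + λ2(wn-2 wn-1) P(wn|wn-1) + λ3(wn-2 wn-1) P(wn)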
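Simple interpolation is a weighted sum of the three estimates. The sketch below uses hypothetical weights (0.6, 0.3, 0.1) and made-up component probabilities; in practice the weights would be tuned on held-out data.

```python
def interpolate(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram and unigram estimates.

    `lambdas` holds the trigram, bigram and unigram weights, in that order;
    the default values are illustrative, not tuned.
    """
    lambda_tri, lambda_bi, lambda_uni = lambdas
    if abs(lambda_tri + lambda_bi + lambda_uni - 1.0) > 1e-9:
        raise ValueError("lambdas must sum to 1")
    return lambda_tri * p_trigram + lambda_bi * p_bigram + lambda_uni * p_unigram


# An unseen trigram (MLE 0) still receives mass from the lower orders.
print(interpolate(p_trigram=0.0, p_bigram=0.02, p_unigram=0.001))
```

Requiring the weights to sum to 1 keeps the mixture a valid probability distribution as long as each component is one.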

How to set the lambdas? Use a held-out corpus: split the data into Training Data, Held-Out Data and Test Data Choose lambdas which maximize the probability of the held-out data, i.e. fix the N-gram probabilities (estimated on the training data), then search for lambda values that, when plugged into the previous equation, give the largest probability for the held-out set Can use EM (Expectation Maximization) to do this search

Intuition of backoff+discounting How much probability to assign to all the zero trigrams? Use Good-Turing or other discounting algorithm How to divide that probability mass among different contexts? Use the N-1 gram estimates What do we do for the unigram words not seen in training? Out Of Vocabulary = OOV words

Problem for N-Grams: Long Distance Dependencies Sometimes local context does not provide enough predictive clues, due to the presence of long-distance dependencies. Syntactic dependencies “The man next to the large oak tree near the grocery store on the corner is tall.” “The men next to the large oak tree near the grocery store on the corner are tall.” Semantic dependencies “The bird next to the large oak tree near the grocery store on the corner flies rapidly.” “The man next to the large oak tree near the grocery store on the corner talks rapidly.” More complex models of language are needed to handle such dependencies.

ARPA format

Language Models Language models assign a probability that a sentence is a legal string in a language. They are useful as a component of many NLP systems, such as ASR, OCR, and MT. Simple N-gram models are easy to train on unsupervised corpora and can provide useful estimates of sentence likelihood. MLE gives inaccurate parameters for models trained on sparse data. Smoothing techniques adjust parameter estimates to account for unseen (but not impossible) events.

Exercise Write two programs train-unigram: Creates a unigram model test-unigram: Reads a unigram model and calculates entropy and coverage for the test set Test them on test/01-train-input.txt and test/01-test-input.txt Train the model on data/wiki-en-train.word Calculate entropy and coverage on data/wiki-en-test.word Report your scores next week

Pseudo code: train-unigram
create a map counts
create a variable total_count = 0
for each line in the training_file
    split line into an array of words
    append “</s>” to the end of words
    for each word in words
        add 1 to counts[word]
        add 1 to total_count
open the model_file for writing
for each word, count in counts
    probability = counts[word]/total_count
    print word, probability to model_file
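One possible Python rendering of the train-unigram pseudocode is sketched below. The function names and the model file layout (one "word probability" pair per line, separated by a space) are assumptions, and the two-line corpus is made up.

```python
from collections import Counter


def train_unigram(training_lines):
    """Return {word: probability} by MLE, counting one </s> per line."""
    counts = Counter()
    for line in training_lines:
        words = line.split() + ["</s>"]
        counts.update(words)
    total_count = sum(counts.values())
    return {word: count / total_count for word, count in counts.items()}


def write_model(model, model_path):
    """Write one 'word probability' pair per line (assumed file layout)."""
    with open(model_path, "w", encoding="utf-8") as model_file:
        for word, probability in sorted(model.items()):
            model_file.write(f"{word} {probability}\n")


model = train_unigram(["a b c", "a b d"])
print(model["a"])  # 2 of the 8 tokens, counting the two </s> symbols
```

To train on a file, the lines can be read with `open(path, encoding="utf-8")` and passed to `train_unigram` directly, since a file object iterates over its lines.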

Pseudo-code: test-unigram
(λ1, λunk = 1 – λ1 and the vocabulary size V are fixed in advance)
Load model
create a map probabilities
for each line in model_file
    split line into w and P
    set probabilities[w] = P
Test and print
set W = 0, H = 0, unk = 0
for each line in test_file
    split line into an array of words
    append “</s>” to the end of words
    for each w in words
        add 1 to W
        set P = λunk / V
        if probabilities[w] exists
            set P += λ1 * probabilities[w]
        else
            add 1 to unk
        add -log2 P to H
print “entropy = ” + H/W
print “coverage = ” + (W-unk)/W
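A possible Python rendering of the test-unigram pseudocode follows. The defaults λ1 = 0.95 and V = 1,000,000 are assumptions (the slide does not give values), as are the function name and the small model and test set used to exercise it.

```python
import math


def evaluate_unigram(probabilities, test_lines, lambda_1=0.95, vocab_size=1_000_000):
    """Return (entropy in bits per word, coverage) for a unigram model.

    lambda_1 and vocab_size are assumed defaults, not values from the slides.
    """
    lambda_unk = 1 - lambda_1
    W = unk = 0
    H = 0.0
    for line in test_lines:
        for w in line.split() + ["</s>"]:
            W += 1
            P = lambda_unk / vocab_size  # mass reserved for unknown words
            if w in probabilities:
                P += lambda_1 * probabilities[w]
            else:
                unk += 1
            H += -math.log2(P)
    return H / W, (W - unk) / W


# Made-up model and test set: "e" is the only out-of-vocabulary word.
probabilities = {"a": 0.25, "b": 0.25, "c": 0.125, "d": 0.125, "</s>": 0.25}
entropy, coverage = evaluate_unigram(probabilities, ["a c", "e"])
print("entropy =", entropy)
print("coverage =", coverage)
```

Interpolating with the uniform unknown-word distribution keeps every test word's probability above zero, so the entropy stays finite even when coverage is below 1.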

Summary Language Modeling (N-grams) N-grams The Chain Rule The Shannon Visualization Method Evaluation: Perplexity Smoothing: Laplace (Add-1) Add-k Add-prior