Statistical NLP, Winter 2009
Language models, part II: Smoothing
Roger Levy
With thanks to Dan Klein and Jason Eisner

Recap: Language Models
Why are language models useful?
Samples of generated text
What are the main challenges in building n-gram language models?
Discounting versus backoff/interpolation

Smoothing
We often want to make estimates from sparse statistics, e.g. the observed counts for P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total).
Smoothing flattens spiky distributions so they generalize better, e.g. reallocating those counts to 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total), so that unseen continuations such as "attack", "man", or "outcome" get some mass.
Very important all over NLP, but easy to do badly!
We'll illustrate with bigrams today (h = previous word, but the history could be anything).

Vocabulary Size
Key issue for language models: open or closed vocabulary?
A closed vocabulary means you can fix, in advance, the set of words that may appear in your training set.
An open vocabulary means that you need to hold out probability mass for any possible word.
Generally managed by fixing a vocabulary list; words not in this list are OOVs (out-of-vocabulary words).
When would you want an open vocabulary? When would you want a closed vocabulary?
How to set the vocabulary size V?
By external factors (e.g. the word list of a speech recognizer)
Using statistical estimates?
Note the difference between estimating the unknown-token rate and the probability of a given unknown word.
Practical considerations:
In many cases, open vocabularies use multiple types of OOVs (e.g., numbers & proper names).
For the programming assignment, it's OK to assume there is only one unknown word type, UNK.
UNK may be quite common in new text!
UNK stands in for all unknown word types (a minimal sketch of this mapping follows below).
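
To make this concrete, here is a minimal sketch (not from the slides; the threshold and helper names are my own) of fixing a vocabulary from training counts and mapping everything else to a single UNK type:

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(train_tokens, min_count=2):
    """Keep words seen at least min_count times; everything else will map to UNK."""
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def map_oov(tokens, vocab):
    """Replace out-of-vocabulary tokens with the single UNK type."""
    return [w if w in vocab else UNK for w in tokens]

# Train on UNK-mapped training text, then apply the same mapping to new text.
train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train)
print(map_oov("the dog sat on the mat".split(), vocab))
```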

Five types of smoothing
Today we'll cover:
Add-δ smoothing (Laplace)
Simple interpolation
Good-Turing smoothing
Katz smoothing
Kneser-Ney smoothing

Smoothing: Add- (for bigram models) c number of word tokens in training data c(w) count of word w in training data c(w-1,w) joint count of the w-1,w bigram V total vocabulary size (assumed known) Nk number of word types with count k One class of smoothing functions (discounting): Add-one / delta: If you know Bayesian statistics, this is equivalent to assuming a uniform prior Another (better?) alternative: assume a unigram prior: How would we estimate the unigram model?

Linear Interpolation
One way to ease the sparsity problem for n-grams is to use less-sparse (n-1)-gram estimates.
General linear interpolation (shown here for bigrams), with a single mixing constant λ:
P(w | w-1) = λ P_bigram(w | w-1) + (1 - λ) P_unigram(w)
Having a single global mixing constant is generally not ideal.
A better yet still simple alternative is to vary the mixing constant as a function of the conditioning context:
P(w | w-1) = λ(w-1) P_bigram(w | w-1) + (1 - λ(w-1)) P_unigram(w)
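
A minimal sketch (my own, assuming maximum-likelihood component models and a single global λ) of the interpolated bigram estimate:

```python
from collections import Counter

def interpolated_bigram(tokens, lam=0.7):
    """P(w | w_prev) = lam * P_ML(w | w_prev) + (1 - lam) * P_ML(w)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(w_prev, w):
        p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
        p_uni = unigrams[w] / total
        return lam * p_bi + (1 - lam) * p_uni

    return prob

p = interpolated_bigram("the cat sat on the mat".split(), lam=0.7)
print(p("the", "cat"), p("mat", "the"))  # seen bigram vs. unigram backoff effect
```

In practice the component models would themselves be smoothed and λ tuned on held-out data, as the next slide describes.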

Held-Out Data
[Figure: the corpus split into Training Data, Held-Out Data, and Test Data, with held-out log-likelihood LL plotted as a function of the smoothing parameter]
Important tool for getting models to generalize: when we have a small number of parameters that control the degree of smoothing, we set them to maximize the (log-)likelihood of held-out data.
Can use any optimization technique (line search or EM usually easiest); a simple grid-search sketch follows below.
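
A minimal sketch (my own, reusing the interpolated_bigram function from the earlier sketch) of setting the mixing weight by grid search over held-out log-likelihood:

```python
import math

def heldout_loglik(prob_fn, heldout_tokens):
    """Sum log P(w | w_prev) over held-out bigrams; floor probabilities to avoid log(0)."""
    return sum(math.log(max(prob_fn(w_prev, w), 1e-12))
               for w_prev, w in zip(heldout_tokens, heldout_tokens[1:]))

def tune_lambda(train_tokens, heldout_tokens, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the lambda on a coarse grid that maximizes held-out log-likelihood."""
    scores = {lam: heldout_loglik(interpolated_bigram(train_tokens, lam), heldout_tokens)
              for lam in grid}
    return max(scores, key=scores.get)

train = "the cat sat on the mat the cat slept on the mat".split()
heldout = "the cat sat on the mat".split()
print(tune_lambda(train, heldout))
```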

Held-Out Reweighting
What's wrong with unigram-prior smoothing? Let's look at some real bigram counts [Church and Gale 91]:

Count in 22M words   Actual c* (next 22M)   Add-one's c*   Add-0.0000027's c*
1                    0.448                  2/7e-10        ~1
2                    1.25                   3/7e-10        ~2
3                    2.24                   4/7e-10        ~3
4                    3.23                   5/7e-10        ~4
5                    4.21                   6/7e-10        ~5
Mass on new          9.2%                   ~100%
Ratio of 2*/1*       2.8                    1.5            ~2

Big things to notice:
Add-one vastly overestimates the fraction of new bigrams.
Add-0.0000027 still underestimates the ratio 2*/1*.
One solution: use held-out data to predict the map from c to c* (sketched below).
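
A minimal sketch (my own) of held-out estimation of adjusted counts: for each training count k, average how often bigrams with that training count actually occur in a held-out sample.

```python
from collections import Counter, defaultdict

def heldout_cstar(train_tokens, heldout_tokens):
    """Map training count k -> average held-out count c* of bigrams seen k times in training."""
    train_counts = Counter(zip(train_tokens, train_tokens[1:]))
    heldout_counts = Counter(zip(heldout_tokens, heldout_tokens[1:]))
    totals, types = defaultdict(float), defaultdict(int)
    for bigram, k in train_counts.items():
        totals[k] += heldout_counts[bigram]
        types[k] += 1
    return {k: totals[k] / types[k] for k in sorted(totals)}

train = "the cat sat on the mat the cat slept".split()
heldout = "the cat sat on the mat the dog slept".split()
print(heldout_cstar(train, heldout))
```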

Good-Turing smoothing
Motivation: how can we estimate how likely events we haven't yet seen are to occur?
Insight: singleton events are our best indicator for this probability.
Generalizing the insight: cross-validated models. We want to estimate P(wi) on the basis of the corpus C minus wi, but we can't just do this naively (why not?)
[Figure: the training corpus C with a single token wi held out]

Good-Turing Reweighting I
Take each of the c training words out in turn: c training sets of size c-1, each with a held-out set of size 1.
What fraction of held-out words (tokens) are unseen in training? N1/c
What fraction of held-out words are seen k times in training? (k+1)Nk+1/c
So in the future we expect (k+1)Nk+1/c of the words to be those with training count k.
There are Nk words with training count k, so each should occur with probability (k+1)Nk+1/(cNk), or equivalently with expected count (k+1)Nk+1/Nk.
[Figure: leave-one-out diagram pairing each count class Nk with Nk+1, down to N0 paired with N1]

Good-Turing Reweighting II
Problem: what about "the"? (say its count is 4417)
For small k, Nk > Nk+1.
For large k, the Nk are too jumpy, and zeros wreck the estimates.
Simple Good-Turing [Gale and Sampson]: replace the empirical Nk with a best-fit regression (e.g., a power law) once the count counts get unreliable.

Good-Turing Reweighting III
Hypothesis: a count of k should be adjusted to k* = (k+1)Nk+1/Nk
Not bad!

Count in 22M words   Actual c* (next 22M)   GT's c*
1                    0.448                  0.446
2                    1.25                   1.26
3                    2.24
4                    3.23                   3.24
Mass on new          9.2%
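
A minimal sketch (my own) of the adjustment k* = (k+1)Nk+1/Nk computed from bigram count-of-counts; it omits the Simple Good-Turing regression for unreliable Nk described above:

```python
from collections import Counter

def good_turing_adjusted_counts(tokens):
    """Return {k: k*} where k* = (k+1) * N_{k+1} / N_k, using bigram count-of-counts."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    count_of_counts = Counter(bigram_counts.values())  # N_k
    adjusted = {}
    for k, n_k in sorted(count_of_counts.items()):
        n_k_plus_1 = count_of_counts.get(k + 1, 0)
        if n_k_plus_1 > 0:  # a real implementation would smooth N_k here (Simple Good-Turing)
            adjusted[k] = (k + 1) * n_k_plus_1 / n_k
    return adjusted

tokens = "the cat sat on the mat the cat slept on the mat the dog sat".split()
print(good_turing_adjusted_counts(tokens))
```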

Katz & Kneser-Ney smoothing
Our last two smoothing techniques to cover (for n-gram models).
Each of them combines discounting & backoff in an interesting way.
The approaches are substantially different, however.

Katz Smoothing
Katz (1987) extended the idea of Good-Turing (GT) smoothing to higher-order models, incorporating backoff.
Here we'll focus on the backoff procedure.
Intuition: when we've never seen an n-gram, we want to back off (recursively) to the lower-order (n-1)-gram.
So, for bigrams, we want to do:
P(w | w-1) = P_ML(w | w-1) if c(w-1, w) > 0, otherwise P(w)
But we can't do this (why not?)

Katz Smoothing II
We can't use the naive backoff above, because the backed-off mass for unseen bigrams would be added on top of unsmoothed ML estimates that already sum to 1.
But if we use GT-discounted estimates P*(w | w-1), we do have probability mass left over for the unseen bigrams.
There are a couple of ways of using this. We could do:
P_katz(w | w-1) = P*(w | w-1) if c(w-1, w) > 0, otherwise α(w-1) P(w)
where α(w-1) spreads the left-over mass over the words never seen after w-1.
See the textbooks and Chen & Goodman (1998) for more details.
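
A minimal sketch (my own, with a fixed absolute discount standing in for the Good-Turing discounting the slides describe) of this backoff structure:

```python
from collections import Counter, defaultdict

def backoff_bigram(tokens, discount=0.5):
    """Katz-style backoff: discounted estimates for seen bigrams, left-over mass to unigrams."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    vocab = set(tokens)

    seen_after = defaultdict(set)
    for w_prev, w in bigrams:
        seen_after[w_prev].add(w)

    def p_uni(w):
        return unigrams[w] / total

    def prob(w_prev, w):
        if bigrams[(w_prev, w)] > 0:
            # discounted ("starred") estimate for seen bigrams
            return (bigrams[(w_prev, w)] - discount) / unigrams[w_prev]
        # alpha(w_prev): left-over mass, renormalized over unseen continuations
        leftover = discount * len(seen_after[w_prev]) / unigrams[w_prev]
        unseen_mass = sum(p_uni(v) for v in vocab - seen_after[w_prev])
        return leftover * p_uni(w) / unseen_mass if unseen_mass > 0 else 0.0

    return prob

p = backoff_bigram("the cat sat on the mat the cat slept".split())
print(p("the", "cat"), p("the", "slept"))  # seen vs. backed-off bigram
```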

Kneser-Ney Smoothing I
Something's been very broken all this time.
Shannon game: There was an unexpected ____? delay? Francisco?
"Francisco" is more common than "delay" … but "Francisco" always follows "San".
Solution: Kneser-Ney smoothing.
In the backoff model, we don't want the unigram probability of w. Instead, we want the probability that w appears as a novel continuation, i.e. proportional to the number of distinct words it follows: |{w-1 : c(w-1, w) > 0}|.
Every bigram type was a novel continuation the first time it was seen.

Kneser-Ney Smoothing II
One more aspect to Kneser-Ney: absolute discounting.
Save ourselves some time and just subtract 0.75 (or some d) from each observed count.
Maybe have a separate value of d for very low counts.
More on the board; a sketch combining both ideas follows below.
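
A minimal sketch (my own, not the exact formulation from the lecture) of interpolated Kneser-Ney for bigrams, combining the absolute discount d with the continuation probability from the previous slide:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney: discounted bigram estimate plus continuation-based unigram."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)

    left_contexts = defaultdict(set)   # w -> distinct words preceding w
    continuations = defaultdict(set)   # w_prev -> distinct words following w_prev
    for w_prev, w in bigrams:
        left_contexts[w].add(w_prev)
        continuations[w_prev].add(w)
    num_bigram_types = len(bigrams)

    def p_continuation(w):
        # fraction of bigram types that end in w
        return len(left_contexts[w]) / num_bigram_types

    def prob(w_prev, w):
        discounted = max(bigrams[(w_prev, w)] - d, 0.0) / unigrams[w_prev]
        # back-off weight: the mass removed by discounting w_prev's observed continuations
        lam = d * len(continuations[w_prev]) / unigrams[w_prev]
        return discounted + lam * p_continuation(w)

    return prob

p = kneser_ney_bigram("the cat sat on the mat the cat slept on the mat".split())
print(p("the", "cat"), p("the", "sat"))  # frequent vs. unseen-after-"the" continuation
```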

What Actually Works?
Trigrams: unigrams and bigrams give too little context; trigrams are much better (when there's enough data); 4- and 5-grams are usually not worth the cost (which is more than it seems, due to how speech recognizers are constructed).
Good-Turing-like methods for count adjustment: absolute discounting, Good-Turing, held-out estimation, Witten-Bell.
Kneser-Ney equalization for lower-order models.
See the [Chen & Goodman] reading for tons of graphs! [Graphs from Joshua Goodman]

Data >> Method?
Having more data is always good…
… but so is picking a better smoothing mechanism!
N > 3 often not worth the cost (greater than you'd think)

Beyond N-Gram LMs
Caching models: recent words are more likely to appear again.
Skipping models.
Clustering models: condition on word classes when words are too sparse.
Trigger models: condition on a bag of history words (e.g., maxent).
Structured models: use parse structure (we'll see these later).