N-Grams Chapter 4 Part 2

Today
- Word prediction task
- Language modeling (N-grams)
- N-gram intro
- The chain rule
- Model evaluation
- Smoothing

Zero Counts
- Back to Shakespeare: Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
- So, 99.96% of the possible bigrams were never seen (they have zero entries in the table).
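A quick check of the arithmetic behind the 99.96% figure, assuming Shakespeare's vocabulary of roughly 29,000 word types (that vocabulary size is carried over from the Part 1 material, not this slide):

```latex
% V ~ 29,066 word types is an assumption taken from the Part 1 slides.
\[
V^2 \approx 29{,}066^2 \approx 8.45 \times 10^8, \qquad
\frac{300{,}000}{8.45 \times 10^8} \approx 0.04\%,
\]
so about $100\% - 0.04\% = 99.96\%$ of the possible bigrams never occur in the corpus.
```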

Zero Counts
- Some of those zeros are really zeros: things that really can't or shouldn't happen.
- On the other hand, some of them are just rare events. If the training corpus had been a little bigger they would have had a count (but probably a count of 1!).
- Zipf's Law (long-tail phenomenon):
  - A small number of events occur with high frequency.
  - A large number of events occur with low frequency.
  - You can quickly collect statistics on the high-frequency events.
  - You might have to wait an arbitrarily long time to get valid statistics on low-frequency events.
- Result: our estimates are sparse! We have no counts at all for the vast bulk of things we want to estimate.
- Answer: estimate the likelihood of unseen (zero-count) N-grams.

Laplace Smoothing
- Also called add-one smoothing: just add one to all the counts! Very simple (a short code sketch follows below).
- MLE estimate: P(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})
- Laplace estimate: P_Laplace(w_i | w_{i-1}) = (c(w_{i-1} w_i) + 1) / (c(w_{i-1}) + V)
- Reconstructed counts: c*(w_{i-1} w_i) = (c(w_{i-1} w_i) + 1) · c(w_{i-1}) / (c(w_{i-1}) + V)
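A minimal sketch of add-one smoothing for bigram counts kept in plain dictionaries. The toy corpus, function names, and the choice to include <s> and </s> in V are illustrative assumptions, not anything from the slides:

```python
from collections import defaultdict

def train_bigram_counts(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w in tokens:
            unigrams[w] += 1
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[(w1, w2)] += 1
    return unigrams, bigrams

def laplace_bigram_prob(w1, w2, unigrams, bigrams, V):
    """Add-one (Laplace) estimate: (c(w1 w2) + 1) / (c(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

# Toy usage (illustrative corpus, not the BeRP data from the slides)
sents = [["i", "want", "chinese", "food"], ["i", "want", "to", "eat"]]
uni, bi = train_bigram_counts(sents)
V = len(uni)  # vocabulary size; here it includes <s> and </s>
print(laplace_bigram_prob("i", "want", uni, bi, V))
```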

Bigram Counts Before Smoothing
- (From the Part 1 slides.) V = 1446.
- E.g., "I want" occurred 827 times.

Laplace-Smoothed Bigram Counts

Laplace-Smoothed Bigram Probabilities

Reconstituted Counts

Big Change to the Counts!
- C(want to) went from 608 to 238!
- P(to|want) went from .66 to .26!
- Discount d = c*/c; d for "chinese food" = .10, a 10x reduction!
- So in general, Laplace is a blunt instrument.
- Laplace smoothing is not used for N-grams, as there are better methods.
- Despite its flaws, Laplace is still used to smooth other probabilistic models in NLP because it is simple, especially for pilot studies in domains where the number of zeros isn't so huge.
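As a worked check of the "want to" numbers, assuming the unigram count c(want) = 927 from the Part 1 BeRP counts (that count is an assumption carried over from the standard example, not shown on this slide) and V = 1446:

```latex
% Assumes c(want) = 927 from the Part 1 BeRP counts; not shown on this slide.
\begin{align*}
P_{\mathrm{MLE}}(\text{to} \mid \text{want}) &= \frac{608}{927} \approx 0.66 \\
P_{\mathrm{Laplace}}(\text{to} \mid \text{want}) &= \frac{608 + 1}{927 + 1446} = \frac{609}{2373} \approx 0.26 \\
c^{*}(\text{want to}) &= \frac{(608 + 1) \cdot 927}{927 + 1446} \approx 238
\end{align*}
```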

Better Smoothing
- The intuition used by many smoothing algorithms (Good-Turing, Kneser-Ney, Witten-Bell) is to use the count of things we've seen once to help estimate the count of things we've never seen.
- We'll just cover the basic idea.

Good-Turing Idea
- Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass.
- You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.
- How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18
- Assuming so, how likely is it that the next species is trout? Must be less than 1/18.
- Slide adapted from Josh Goodman
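A minimal sketch of this idea on the fishing example, using the basic Good-Turing adjusted count c* = (c+1) · N_{c+1} / N_c. A real implementation would also smooth the count-of-counts N_c (which can be zero for large c); that is omitted here:

```python
from collections import Counter

# Observed catch: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel (18 fish)
counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())                    # total fish caught = 18
count_of_counts = Counter(counts.values())  # N_c: how many species were seen c times

# Probability mass reserved for unseen species: N_1 / N = 3/18
p_unseen = count_of_counts[1] / N
print("P(new species) =", p_unseen)

# Good-Turing adjusted count for a species seen c times: c* = (c+1) * N_{c+1} / N_c
def gt_count(c):
    return (c + 1) * count_of_counts[c + 1] / count_of_counts[c]

# Trout was seen once; its adjusted count pushes its probability below 1/18
c_star_trout = gt_count(1)                  # = 2 * N_2 / N_1 = 2 * 1 / 3
print("P(trout) =", c_star_trout / N)
```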

Back-off Methods
- Notice that:
  - N-grams are more precise than (N-1)-grams (remember the Shakespeare example).
  - But also, N-grams are sparser than (N-1)-grams.
- How to combine things? Attempt N-grams and back off to (N-1)-grams if counts are not available.
- E.g., attempt prediction using 4-grams, and back off to trigrams (or bigrams, or unigrams) if counts are not available (a code sketch follows below).
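A minimal sketch of this "use the longest context that has counts" recursion. Proper back-off methods such as Katz back-off also discount and renormalize the reserved probability mass, which is omitted here, and the data layout (one dict of tuple-keyed counts) is an illustrative assumption:

```python
def backoff_prob(word, context, counts, total_words):
    """P(word | context) using the longest suffix of the context that has counts.

    `counts` maps word tuples of any length (unigrams, bigrams, trigrams, ...)
    to raw counts; `context` is a tuple of preceding words. Illustrative only:
    real back-off methods (e.g., Katz) also discount and renormalize.
    """
    while context:
        num = counts.get(context + (word,), 0)
        den = counts.get(context, 0)
        if num > 0 and den > 0:
            return num / den
        context = context[1:]                       # back off to a shorter context
    return counts.get((word,), 0) / total_words     # unigram fallback

# Toy usage with hand-made counts (not from the slides)
counts = {("i",): 2, ("want",): 2, ("to",): 1,
          ("i", "want"): 2, ("want", "to"): 1, ("i", "want", "to"): 1}
print(backoff_prob("to", ("i", "want"), counts, total_words=5))    # trigram available: 0.5
print(backoff_prob("to", ("you", "want"), counts, total_words=5))  # backs off to bigram: 0.5
```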

Interpolation
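The slide's formula is not reproduced in the transcript; for reference, the standard simple linear interpolation for trigrams mixes trigram, bigram, and unigram estimates with weights that sum to one. Unlike back-off, it always mixes all orders:

```latex
\[
\hat{P}(w_n \mid w_{n-2}\, w_{n-1}) =
  \lambda_1\, P(w_n \mid w_{n-2}\, w_{n-1})
  + \lambda_2\, P(w_n \mid w_{n-1})
  + \lambda_3\, P(w_n),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1 .
\]
```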

How to Set the Lambdas?
- Use a held-out, or development, corpus.
- Choose lambdas which maximize the probability of some held-out data.
- One possibility: use the expectation maximization (EM) algorithm (see Chapter 6; not required).
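A simpler alternative to EM, just to make "maximize the probability of held-out data" concrete, is a brute-force grid search over lambda settings. Here trigram_p, bigram_p, and unigram_p stand for already-estimated probability functions and, like everything else in this sketch, are illustrative assumptions rather than anything from the slides:

```python
import itertools
import math

def heldout_loglik(lams, heldout_trigrams, trigram_p, bigram_p, unigram_p):
    """Log likelihood of held-out (w1, w2, w3) triples under the interpolated model."""
    l1, l2, l3 = lams
    total = 0.0
    for w1, w2, w3 in heldout_trigrams:
        p = l1 * trigram_p(w3, (w1, w2)) + l2 * bigram_p(w3, (w2,)) + l3 * unigram_p(w3)
        total += math.log(p) if p > 0 else float("-inf")
    return total

def best_lambdas(heldout_trigrams, trigram_p, bigram_p, unigram_p, step=0.1):
    """Brute-force search over (lambda1, lambda2, lambda3) triples summing to 1."""
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    candidates = [(l1, l2, round(1.0 - l1 - l2, 10))
                  for l1, l2 in itertools.product(grid, grid)
                  if l1 + l2 <= 1.0 + 1e-9]
    return max(candidates,
               key=lambda lams: heldout_loglik(lams, heldout_trigrams,
                                               trigram_p, bigram_p, unigram_p))
```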

Practical Issue
- We do everything in log space.
- Avoid underflow (also, adding is faster than multiplying).
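For example, multiplying a long chain of small probabilities underflows to zero, while summing their logs stays well-behaved (toy numbers, purely illustrative):

```python
import math

probs = [1e-5] * 100                 # 100 small conditional probabilities

product = 1.0
for p in probs:
    product *= p                     # underflows to 0.0 well before the end
print(product)                       # 0.0

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                      # about -1151.3, a perfectly ordinary float
```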