N-Grams Chapter 4 Part 2

Today
- Word prediction task
- Language modeling (N-grams)
- N-gram intro
- The chain rule
- Model evaluation
- Smoothing

Zero Counts
- Back to Shakespeare: Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
- So, 99.96% of the possible bigrams were never seen (they have zero entries in the table).
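A quick check of the arithmetic behind the 99.96% figure, assuming Shakespeare's vocabulary of roughly 29,000 word types (that vocabulary size is carried over from the Part 1 material, not this slide):

```latex
% V ~ 29,066 word types is an assumption taken from the Part 1 slides.
\[
V^2 \approx 29{,}066^2 \approx 8.45 \times 10^8, \qquad
\frac{300{,}000}{8.45 \times 10^8} \approx 0.04\%,
\]
so about $100\% - 0.04\% = 99.96\%$ of the possible bigrams never occur in the corpus.
```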

Zero Counts
- Some of those zeros are really zeros: things that really can't or shouldn't happen.
- On the other hand, some of them are just rare events. If the training corpus had been a little bigger they would have had a count (but probably a count of 1!).
- Zipf's Law (long-tail phenomenon):
  - A small number of events occur with high frequency.
  - A large number of events occur with low frequency.
  - You can quickly collect statistics on the high-frequency events.
  - You might have to wait an arbitrarily long time to get valid statistics on low-frequency events.
- Result: our estimates are sparse! We have no counts at all for the vast bulk of things we want to estimate.
- Answer: estimate the likelihood of unseen (zero-count) N-grams.

Laplace Smoothing
- Also called add-one smoothing: just add one to all the counts! Very simple (a short code sketch follows below).
- MLE estimate: P(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})
- Laplace estimate: P_Laplace(w_i | w_{i-1}) = (c(w_{i-1} w_i) + 1) / (c(w_{i-1}) + V)
- Reconstructed counts: c*(w_{i-1} w_i) = (c(w_{i-1} w_i) + 1) · c(w_{i-1}) / (c(w_{i-1}) + V)
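A minimal sketch of add-one smoothing for bigram counts kept in plain dictionaries. The toy corpus, function names, and the choice to include <s> and </s> in V are illustrative assumptions, not anything from the slides:

```python
from collections import defaultdict

def train_bigram_counts(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w in tokens:
            unigrams[w] += 1
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[(w1, w2)] += 1
    return unigrams, bigrams

def laplace_bigram_prob(w1, w2, unigrams, bigrams, V):
    """Add-one (Laplace) estimate: (c(w1 w2) + 1) / (c(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

# Toy usage (illustrative corpus, not the BeRP data from the slides)
sents = [["i", "want", "chinese", "food"], ["i", "want", "to", "eat"]]
uni, bi = train_bigram_counts(sents)
V = len(uni)  # vocabulary size; here it includes <s> and </s>
print(laplace_bigram_prob("i", "want", uni, bi, V))
```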

Bigram Counts Before Smoothing
- (From the Part 1 slides.) V = 1446.
- E.g., "I want" occurred 827 times.

Laplace-Smoothed Bigram Counts

Laplace-Smoothed Bigram Probabilities

Reconstituted Counts

Big Change to the Counts!
- C(want to) went from 608 to 238!
- P(to|want) went from .66 to .26!
- Discount d = c*/c; d for "chinese food" = .10, a 10x reduction!
- So in general, Laplace is a blunt instrument.
- Laplace smoothing is not used for N-grams, as there are better methods.
- Despite its flaws, Laplace is still used to smooth other probabilistic models in NLP because it is simple, especially for pilot studies in domains where the number of zeros isn't so huge.
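As a worked check of the "want to" numbers, assuming the unigram count c(want) = 927 from the Part 1 BeRP counts (that count is an assumption carried over from the standard example, not shown on this slide) and V = 1446:

```latex
% Assumes c(want) = 927 from the Part 1 BeRP counts; not shown on this slide.
\begin{align*}
P_{\mathrm{MLE}}(\text{to} \mid \text{want}) &= \frac{608}{927} \approx 0.66 \\
P_{\mathrm{Laplace}}(\text{to} \mid \text{want}) &= \frac{608 + 1}{927 + 1446} = \frac{609}{2373} \approx 0.26 \\
c^{*}(\text{want to}) &= \frac{(608 + 1) \cdot 927}{927 + 1446} \approx 238
\end{align*}
```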

Better Smoothing
- The intuition used by many smoothing algorithms (Good-Turing, Kneser-Ney, Witten-Bell) is to use the count of things we've seen once to help estimate the count of things we've never seen.
- We'll just cover the basic idea.

Good-Turing Idea
- Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass.
- You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.
- How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18
- Assuming so, how likely is it that the next species is trout? Must be less than 1/18.
- Slide adapted from Josh Goodman
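A minimal sketch of this idea on the fishing example, using the basic Good-Turing adjusted count c* = (c+1) · N_{c+1} / N_c. A real implementation would also smooth the count-of-counts N_c (which can be zero for large c); that is omitted here:

```python
from collections import Counter

# Observed catch: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel (18 fish)
counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())                    # total fish caught = 18
count_of_counts = Counter(counts.values())  # N_c: how many species were seen c times

# Probability mass reserved for unseen species: N_1 / N = 3/18
p_unseen = count_of_counts[1] / N
print("P(new species) =", p_unseen)

# Good-Turing adjusted count for a species seen c times: c* = (c+1) * N_{c+1} / N_c
def gt_count(c):
    return (c + 1) * count_of_counts[c + 1] / count_of_counts[c]

# Trout was seen once; its adjusted count pushes its probability below 1/18
c_star_trout = gt_count(1)                  # = 2 * N_2 / N_1 = 2 * 1 / 3
print("P(trout) =", c_star_trout / N)
```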

Back-off Methods
- Notice that:
  - N-grams are more precise than (N-1)-grams (remember the Shakespeare example).
  - But also, N-grams are sparser than (N-1)-grams.
- How to combine things? Attempt N-grams and back off to (N-1)-grams if counts are not available.
- E.g., attempt prediction using 4-grams, and back off to trigrams (or bigrams, or unigrams) if counts are not available (a code sketch follows below).
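A minimal sketch of this "use the longest context that has counts" recursion. Proper back-off methods such as Katz back-off also discount and renormalize the reserved probability mass, which is omitted here, and the data layout (one dict of tuple-keyed counts) is an illustrative assumption:

```python
def backoff_prob(word, context, counts, total_words):
    """P(word | context) using the longest suffix of the context that has counts.

    `counts` maps word tuples of any length (unigrams, bigrams, trigrams, ...)
    to raw counts; `context` is a tuple of preceding words. Illustrative only:
    real back-off methods (e.g., Katz) also discount and renormalize.
    """
    while context:
        num = counts.get(context + (word,), 0)
        den = counts.get(context, 0)
        if num > 0 and den > 0:
            return num / den
        context = context[1:]                       # back off to a shorter context
    return counts.get((word,), 0) / total_words     # unigram fallback

# Toy usage with hand-made counts (not from the slides)
counts = {("i",): 2, ("want",): 2, ("to",): 1,
          ("i", "want"): 2, ("want", "to"): 1, ("i", "want", "to"): 1}
print(backoff_prob("to", ("i", "want"), counts, total_words=5))    # trigram available: 0.5
print(backoff_prob("to", ("you", "want"), counts, total_words=5))  # backs off to bigram: 0.5
```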

Interpolation
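The slide's formula is not reproduced in the transcript; for reference, the standard simple linear interpolation for trigrams mixes trigram, bigram, and unigram estimates with weights that sum to one. Unlike back-off, it always mixes all orders:

```latex
\[
\hat{P}(w_n \mid w_{n-2}\, w_{n-1}) =
  \lambda_1\, P(w_n \mid w_{n-2}\, w_{n-1})
  + \lambda_2\, P(w_n \mid w_{n-1})
  + \lambda_3\, P(w_n),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1 .
\]
```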

How to Set the Lambdas?
- Use a held-out, or development, corpus.
- Choose lambdas which maximize the probability of some held-out data.
- One possibility: use the expectation maximization (EM) algorithm (see Chapter 6; not required).
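A simpler alternative to EM, just to make "maximize the probability of held-out data" concrete, is a brute-force grid search over lambda settings. Here trigram_p, bigram_p, and unigram_p stand for already-estimated probability functions and, like everything else in this sketch, are illustrative assumptions rather than anything from the slides:

```python
import itertools
import math

def heldout_loglik(lams, heldout_trigrams, trigram_p, bigram_p, unigram_p):
    """Log likelihood of held-out (w1, w2, w3) triples under the interpolated model."""
    l1, l2, l3 = lams
    total = 0.0
    for w1, w2, w3 in heldout_trigrams:
        p = l1 * trigram_p(w3, (w1, w2)) + l2 * bigram_p(w3, (w2,)) + l3 * unigram_p(w3)
        total += math.log(p) if p > 0 else float("-inf")
    return total

def best_lambdas(heldout_trigrams, trigram_p, bigram_p, unigram_p, step=0.1):
    """Brute-force search over (lambda1, lambda2, lambda3) triples summing to 1."""
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    candidates = [(l1, l2, round(1.0 - l1 - l2, 10))
                  for l1, l2 in itertools.product(grid, grid)
                  if l1 + l2 <= 1.0 + 1e-9]
    return max(candidates,
               key=lambda lams: heldout_loglik(lams, heldout_trigrams,
                                               trigram_p, bigram_p, unigram_p))
```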

Practical Issue
- We do everything in log space.
- Avoid underflow (also, adding is faster than multiplying).
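For example, multiplying a long chain of small probabilities underflows to zero, while summing their logs stays well-behaved (toy numbers, purely illustrative):

```python
import math

probs = [1e-5] * 100                 # 100 small conditional probabilities

product = 1.0
for p in probs:
    product *= p                     # underflows to 0.0 well before the end
print(product)                       # 0.0

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                      # about -1151.3, a perfectly ordinary float
```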