Estimating N-gram Probabilities Language Modeling.

Presentation transcript:

Estimating N-gram Probabilities Language Modeling

Dan Jurafsky Estimating bigram probabilities

Dan Jurafsky An example Small corpus of three sentences (with sentence-boundary markers <s> and </s>):
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
The maximum-likelihood estimate of a bigram probability is the bigram count divided by the count of the preceding word: P(wi | wi-1) = count(wi-1 wi) / count(wi-1). For example, P(I | <s>) = 2/3, P(am | I) = 2/3, P(Sam | am) = 1/2.
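A minimal Python sketch (not from the slides) of this maximum-likelihood estimation on the toy corpus above:

```python
from collections import Counter, defaultdict

# Toy corpus from the slide, with sentence-boundary markers.
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = defaultdict(Counter)
unigram_counts = Counter()

for sent in sentences:
    tokens = sent.split()
    unigram_counts.update(tokens)
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1

def bigram_prob(prev, cur):
    """Maximum-likelihood estimate: count(prev, cur) / count(prev)."""
    return bigram_counts[prev][cur] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))   # 2/3
print(bigram_prob("I", "am"))    # 2/3
print(bigram_prob("am", "Sam"))  # 1/2
```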

Dan Jurafsky More examples: Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day

Dan Jurafsky Raw bigram counts Table 1 shows the bigram counts from the Berkeley Restaurant Project, for a matrix selected from a random set of seven words. Note that the majority of the values are zero.

Dan Jurafsky Raw bigram probabilities Table 2 shows the unigram counts. Table 3 shows the bigram probabilities after normalization (dividing each row of bigram counts by the corresponding unigram counts).
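A small sketch of this row-wise normalization; the counts below are illustrative stand-ins, not the actual Berkeley Restaurant Project tables:

```python
# Hypothetical bigram and unigram counts (illustrative only) used to show
# how a count matrix is normalized row by row into probabilities.
bigram_counts = {
    "i":    {"want": 827, "i": 5,   "eat": 9},
    "want": {"to": 608,   "food": 6},
}
unigram_counts = {"i": 2533, "want": 927}

bigram_probs = {
    prev: {cur: c / unigram_counts[prev] for cur, c in row.items()}
    for prev, row in bigram_counts.items()
}

print(round(bigram_probs["i"]["want"], 3))   # ~0.33
print(round(bigram_probs["want"]["to"], 3))  # ~0.66
```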

Dan Jurafsky Bigram estimates of sentence probabilities Here are a few other useful probabilities: P(i|<s>) = 0.25, P(english|want) = 0.0011, P(food|english) = 0.5, P(</s>|food) = 0.68. Compute the bigram probability of a sentence by using the information above together with Table 3: P(<s> I want english food </s>) = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food) = 0.25 × 0.33 × 0.0011 × 0.5 × 0.68 = 0.000031
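A short sketch multiplying the bigram probabilities quoted above to reproduce this sentence probability:

```python
# Bigram probabilities quoted on the slide (<s> and </s> are the
# sentence-start and sentence-end markers).
bigram_p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

tokens = ["<s>", "i", "want", "english", "food", "</s>"]

prob = 1.0
for prev, cur in zip(tokens, tokens[1:]):
    prob *= bigram_p[(prev, cur)]

print(prob)  # ~3.1e-05, i.e. about 0.000031
```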

Dan Jurafsky What kinds of knowledge? P(english|want) = 0.0011, P(chinese|want) = 0.0065, P(to|want) = 0.66, P(eat|to) = 0.28, P(food|to) = 0, P(want|spend) = 0, P(i|<s>) = 0.25. These numbers encode both facts about the world and facts about grammar. For example, "want" as a main verb requires an infinitive complement (to + verb), which is why P(to|want) is high; and two main verbs in a row are not allowed in English, which is why P(want|spend) = 0.

Dan Jurafsky Practical Issues Since probabilities are (by definition) less than or equal to 1, the more of them we multiply together, the smaller the product becomes; multiplying enough N-gram probabilities together would result in numerical underflow. We therefore do everything in log space: instead of multiplying raw probabilities we add their logarithms, since log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4. This avoids underflow, and adding is also faster than multiplying.
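A brief illustration of the underflow problem and the log-space fix (the probability values are made up for demonstration):

```python
import math

# Multiplying many small probabilities underflows to 0.0,
# while summing their logs stays well-behaved.
probs = [0.001] * 200          # 200 hypothetical bigram probabilities

product = 1.0
for p in probs:
    product *= p
print(product)                  # 0.0 -- underflowed

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                 # about -1381.55, no underflow
print(log_prob / len(probs))    # per-word log probability, about -6.91
```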

Dan Jurafsky Language Modeling Toolkits SRILM

Dan Jurafsky Google N-Gram Release, August 2006 …

Dan Jurafsky Google N-Gram Release

Dan Jurafsky Google Book N-grams

Evaluation and Perplexity Language Modeling

Dan Jurafsky Evaluation: How good is our model? Does our language model prefer good sentences to bad ones? It should assign higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” sentences. There are two types of evaluation: intrinsic evaluation and extrinsic evaluation.

Dan Jurafsky An Intrinsic Evaluation of N-gram models The first way of comparing N-gram models A and B. (An N-gram model uses a training set, or training corpus, such as the Google N-gram corpus, to compute its probabilities.) Test each model’s performance on a test set that the models have never seen: data that is separate from the training set and totally unused during training. Then use an evaluation metric that tells us how well each model does on that test set.

Dan Jurafsky Extrinsic evaluation of N-gram models Another way of comparing N-gram models A and B:
1. Put each N-gram model (A and B) into a task such as a spelling corrector, a speech recognizer, or an MT system.
2. Run the task and get an accuracy for model A and for model B. How? By counting by hand how many misspelled words were corrected properly (spelling corrector), by seeing which model gives the more accurate transcription (speech recognizer), or by counting by hand how many words were translated correctly (MT).
3. Compare the accuracy of model A with that of model B by hand.

Dan Jurafsky Difficulty of extrinsic evaluation of N-gram models Extrinsic evaluation is time-consuming; it can take days or weeks. So we sometimes use intrinsic evaluation instead. But intrinsic evaluation gives a bad approximation if the test data is part of the training set. We avoid this by choosing test data that is as large as possible, is not part of the training set, and is unseen (unused) during training. (For example, we can divide a large corpus into a training set and a test set.)

Dan Jurafsky More about intrinsic evaluation In practice we don’t use raw probability as our metric for evaluating language models, but a variant called perplexity. The perplexity of a language model on a test set (sometimes written PP for short) is the inverse probability of the test set, normalized by the number of words.

Dan Jurafsky Perplexity Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 … wN)^(-1/N) = ( 1 / P(w1 w2 … wN) )^(1/N)
We can use the chain rule to expand the probability of W:
PP(W) = ( product over i = 1…N of 1 / P(wi | w1 … wi-1) )^(1/N)
If we are computing the perplexity of W with a bigram language model we get:
PP(W) = ( product over i = 1…N of 1 / P(wi | wi-1) )^(1/N)
Minimizing perplexity is the same as maximizing probability.
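A minimal sketch of this perplexity computation (the helper function and its inputs are illustrative, not from the slides); it reuses the bigram probabilities quoted earlier and normalizes in log space:

```python
import math

def bigram_perplexity(test_tokens, bigram_p):
    """Perplexity of a token sequence under a bigram model given as a
    dict mapping (previous_word, word) -> probability.  Assumes every
    needed bigram has a nonzero probability (i.e. a smoothed model)."""
    log_prob = 0.0
    n = 0
    for prev, cur in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(bigram_p[(prev, cur)])
        n += 1
    # PP(W) = exp( -(1/N) * sum_i log P(w_i | w_{i-1}) )
    return math.exp(-log_prob / n)

bigram_p = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.33, ("want", "english"): 0.0011,
    ("english", "food"): 0.5, ("food", "</s>"): 0.68,
}
tokens = ["<s>", "i", "want", "english", "food", "</s>"]
print(bigram_perplexity(tokens, bigram_p))  # about 8.0
```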

Dan Jurafsky Perplexity as branching factor Let’s suppose a sentence consisting of N random digits. What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit? PP(W) = ((1/10)^N)^(-1/N) = (1/10)^(-1) = 10. So the perplexity equals the branching factor: at every position there are 10 equally likely next digits.
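A quick numerical check of this claim under a uniform digit model (the sequence length is arbitrary):

```python
import math

# Sanity check: a sequence of N random digits under a uniform digit model
# (P = 1/10 per digit) has perplexity 10, regardless of N.
N = 1000
log_prob = N * math.log(1 / 10)       # log P(w_1 ... w_N)
perplexity = math.exp(-log_prob / N)
print(perplexity)                     # ~10.0
```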

Dan Jurafsky Lower perplexity = better model Example: We trained unigram, bigram, and trigram grammars on 38 million words (training set) from the Wall Street Journal, using a 19,979-word vocabulary. We then computed the perplexity of each of these models on a test set of 1.5 million words:
N-gram order:  Unigram  Bigram  Trigram
Perplexity:    962      170     109