Natural Language Processing: Language Models

Language Models
Formal grammars (e.g. regular, context-free) give a hard, binary model of the legal sentences in a language. For NLP, a probabilistic model that gives the probability that a string is a member of a language is more useful. To specify a proper probability distribution, the probabilities of all sentences in the language must sum to 1.

Uses of Language Models
– Speech recognition: "I ate a cherry" is a more likely sentence than "Eye eight uh Jerry"
– OCR & handwriting recognition: more probable sentences are more likely to be correct readings
– Machine translation: more likely sentences are probably better translations
– Generation: more likely sentences are probably better NL generations
– Context-sensitive spelling correction: "Their are problems wit this sentence."

Completion Prediction
A language model also supports predicting the completion of a sentence:
– Please turn off your cell _____
– Your program does not ______
Predictive text input systems can guess what you are typing and give choices on how to complete it.

Probability
P(X) means the probability that X is true:
– P(baby is a boy) ≈ 0.5 (fraction of all babies that are boys)
– P(baby is named John) ≈ a much smaller fraction (fraction of all babies named John)
[Venn diagram: Babies ⊃ Baby boys ⊃ babies named John]

Probability
P(X|Y) means the probability that X is true when we already know Y is true:
– P(baby is named John | baby is a boy) ≈ …
– P(baby is a boy | baby is named John) ≈ 1

Probability
P(X|Y) = P(X, Y) / P(Y)
– P(baby is named John | baby is a boy) = P(baby is named John, baby is a boy) / P(baby is a boy) = … / 0.5 = …

Bayes Rule
Bayes rule: P(X|Y) = P(Y|X) P(X) / P(Y)
– P(named John | boy) = P(boy | named John) P(named John) / P(boy)

Word Sequence Probabilities
Given a word sequence (sentence) w_1 w_2 ... w_n, its probability under the chain rule is:
P(w_1 ... w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_n | w_1 ... w_{n-1})
The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N-1)-order Markov model.

N-Gram Models
Estimate the probability of each word given its prior context:
– P(phone | Please turn off your cell)
The number of parameters required grows exponentially with the number of words of prior context. An N-gram model uses only N-1 words of prior context:
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
Bigram approximation: P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-1})
N-gram approximation: P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-N+1} ... w_{n-1})

Estimating Probabilities
N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences. To have a consistent probabilistic model, append a unique start symbol (<s>) and end symbol (</s>) to every sentence and treat these as additional words.
Bigram: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
N-gram: P(w_n | w_{n-N+1} ... w_{n-1}) = C(w_{n-N+1} ... w_{n-1} w_n) / C(w_{n-N+1} ... w_{n-1})

Generative Model and MLE
An N-gram model can be seen as a probabilistic automaton for generating sentences:
  Initialize the sentence with N-1 <s> symbols.
  Until </s> is generated, stochastically pick the next word based on the conditional probability of each word given the previous N-1 words.
Relative frequency estimates can be proven to be maximum likelihood estimates (MLE), since they maximize the probability that the model M generates the training corpus T.
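
A minimal sketch of relative-frequency (MLE) estimation and stochastic generation for the bigram case. The toy corpus, function names, and <s>/</s> handling are illustrative assumptions, not taken from the slides:

    import random
    from collections import defaultdict

    def train_bigram_lm(sentences):
        """Estimate bigram probabilities by relative frequency (MLE)."""
        bigram_counts = defaultdict(lambda: defaultdict(int))
        for sentence in sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            for prev, curr in zip(words, words[1:]):
                bigram_counts[prev][curr] += 1
        # Normalize each row by the total count of its history word.
        return {prev: {w: c / sum(nexts.values()) for w, c in nexts.items()}
                for prev, nexts in bigram_counts.items()}

    def generate(probs, max_len=20):
        """Stochastically pick each next word given the previous one, until </s>."""
        words, prev = [], "<s>"
        while len(words) < max_len:
            candidates, weights = zip(*probs[prev].items())
            nxt = random.choices(candidates, weights=weights)[0]
            if nxt == "</s>":
                break
            words.append(nxt)
            prev = nxt
        return " ".join(words)

    corpus = ["I want to eat Chinese food", "I want to eat lunch"]
    lm = train_bigram_lm(corpus)
    print(generate(lm))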

Example
Estimate the likelihood of the sentence: I want to eat Chinese food
P(I want to eat Chinese food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(</s> | food)
What do we need to calculate these likelihoods?
– Bigram probabilities for each word pair in the sentence
– Calculated from a large corpus
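
Continuing that sketch, the sentence probability is just the product of its bigram probabilities; working in log space avoids underflow. This reuses the illustrative `lm` dictionary built in the training sketch above:

    import math

    def sentence_logprob(probs, sentence):
        """Sum of log_2 bigram probabilities; -inf if any bigram is unseen."""
        words = ["<s>"] + sentence.split() + ["</s>"]
        logp = 0.0
        for prev, curr in zip(words, words[1:]):
            p = probs.get(prev, {}).get(curr, 0.0)
            if p == 0.0:
                return float("-inf")  # one unseen bigram zeroes out the whole sentence
            logp += math.log2(p)
        return logp

    print(sentence_logprob(lm, "I want to eat Chinese food"))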

Corpus
A language model must be trained on a large corpus of text to estimate good parameter values. The model can be evaluated by its ability to assign high probability to a disjoint (held-out) test corpus; testing on the training corpus would give an optimistically biased estimate. Ideally, the training (and test) corpus should be representative of the actual application data.

Terminology
– Types: number of distinct words in a corpus (vocabulary size)
– Tokens: total number of words
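
A quick illustration of the type/token distinction (toy sentence, purely for illustration):

    text = "to be or not to be"
    tokens = text.split()
    types = set(tokens)
    print(len(tokens))  # 6 tokens
    print(len(types))   # 4 types: to, be, or, not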

Early Bigram Probabilities from BERP.001Eat British.03Eat today.007Eat dessert.04Eat Indian.01Eat tomorrow.04Eat a.02Eat Mexican.04Eat at.02Eat Chinese.05Eat dinner.02Eat in.06Eat lunch.03Eat breakfast.06Eat some.03Eat Thai.16Eat on

More bigram probabilities from BERP:
<s> I .25, <s> I'd .06, <s> Tell .04, <s> I'm .02
I want .32, I would .29, I don't .08, I have .04
Want to .65, Want a .05, Want some .04, Want Thai .01
To eat .26, To have .14, To spend .09, To be .02
British food .60, British restaurant .15, British cuisine .01, British lunch .01

Back to our sentence: I want to eat Chinese food
[Table of bigram counts over the words I, Want, To, Eat, Chinese, Food, Lunch]

Relative Frequencies
Normalization: divide each row's counts by the appropriate unigram count for w_{n-1}.
Computing the bigram probability of "I I":
– C(I, I) / C(I)
– p(I | I) = 8 / 3437 ≈ .0023
[Table of unigram counts for I, Want, To, Eat, Chinese, Food, Lunch]
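
The normalization step as a one-line calculation, using the counts quoted on the slide (everything else is illustrative):

    # p(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
    bigram_count_I_I = 8      # C(I, I) from the BERP bigram counts
    unigram_count_I = 3437    # C(I) from the BERP unigram counts
    print(bigram_count_I_I / unigram_count_I)  # ≈ 0.0023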

Approximating Shakespeare
Generating sentences with random unigrams:
– Every enter now severally so, let
– Hill he late speaks; or! a more to leg less first you enter
With bigrams:
– What means, sir. I confess she? then all sorts, he is trim, captain.
– Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
With trigrams:
– Sweet prince, Falstaff shall die.
– This shall forbid it should be branded, if renown made it empty.

Approximating Shakespeare
With quadrigrams:
– What! I will go seek the traitor Gloucester.
– Will you not tell me who I am?
– What's coming out here looks like Shakespeare because it is Shakespeare.
Note: as we increase the value of N, the accuracy of an n-gram model increases, since the choice of the next word becomes increasingly constrained.

Evaluation
Perplexity and entropy: how do you estimate how well your language model fits a corpus once you're done?

Random Variables
A random variable is defined by the probabilities of each possible value in the population.
Discrete random variable:
– Whole number (0, 1, 2, 3, etc.)
– Countable, finite number of values; jumps from one value to the next and cannot take values in between
Continuous random variable:
– Whole or fractional number
– Obtained by measuring
– Infinite number of values in an interval; too many to list like a discrete variable

Discrete Random Variables
For example:
– # of girls in a family
– # of correct answers on a given exam
– …

Probability Mass Function
– x = value of the random variable (outcome)
– p(x) = probability associated with that value
The values are mutually exclusive (no overlap) and collectively exhaustive (nothing left out), with 0 ≤ p(x) ≤ 1 and Σ p(x) = 1.

Measures
Expected value:
– Mean of the probability distribution
– Weighted average of all possible values
– μ = E(X) = Σ x p(x)
Variance:
– Weighted average squared deviation about the mean
– σ² = V(X) = E[(x − μ)²] = Σ (x − μ)² p(x)
– σ² = V(X) = E(X²) − [E(X)]²
Standard deviation:
– σ = √σ² = SD(X)

Perplexity and Entropy
Information-theoretic metrics, useful for measuring how well a grammar or language model (LM) models a natural language or a corpus.
Entropy: with two LMs and a corpus, which LM is the better match for the corpus? How much information is there (in e.g. a grammar or LM) about what the next word will be? For a random variable X ranging over e.g. bigrams and a probability function p(x), the entropy of X is the expected negative log probability:
H(X) = − Σ_x p(x) log_2 p(x)
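
A minimal sketch of the entropy formula as code; the two example distributions are assumptions, not from the slides:

    import math

    def entropy(probs):
        """H(X) = -sum over x of p(x) * log2 p(x); zero-probability outcomes contribute nothing."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
    print(entropy([0.25] * 4))   # uniform over 4 outcomes: 2.0 bits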

Entropy Example
Horse race: 8 horses, and we want to send a bet to the bookie. In the naive way, that is a 3-bit message. Can we do better? Suppose we know the distribution of the bets placed:
Horse 1: 1/2    Horse 5: 1/64
Horse 2: 1/4    Horse 6: 1/64
Horse 3: 1/8    Horse 7: 1/64
Horse 4: 1/16   Horse 8: 1/64

Entropy Example (continued)
The entropy of the random variable X gives a lower bound on the average number of bits needed to encode a bet:
H(X) = − Σ_i p(horse_i) log_2 p(horse_i) = 1/2·1 + 1/4·2 + 1/8·3 + 1/16·4 + 4·(1/64)·6 = 2 bits
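
As a sanity check on the arithmetic above, a short illustrative computation (variable names assumed):

    import math

    # Horse-race distribution from the slide: 1/2, 1/4, 1/8, 1/16, and four horses at 1/64.
    horse_probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
    H = -sum(p * math.log2(p) for p in horse_probs)
    print(H)        # 2.0 bits, versus 3 bits for the naive fixed-length encoding
    print(2 ** H)   # 4.0, the perplexity of this skewed distribution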

Perplexity
Perplexity is 2^H: the weighted average number of choices a random variable has to make. In the previous example, the uniform 3-bit encoding corresponds to perplexity 2^3 = 8, while the skewed betting distribution gives 2^2 = 4.

Entropy of a Language
Measuring over all sequences of length n in a language L:
H(w_1, ..., w_n) = − Σ_{W ∈ L} p(w_1, ..., w_n) log p(w_1, ..., w_n)
Entropy rate (per-word entropy):
(1/n) H(w_1, ..., w_n) = − (1/n) Σ_{W ∈ L} p(w_1, ..., w_n) log p(w_1, ..., w_n)

Cross Entropy and Perplexity
Given the actual probability distribution of the language p and a model m, we want to measure how well m predicts p:
H(p, m) = lim_{n→∞} − (1/n) Σ_{W ∈ L} p(w_1, ..., w_n) log m(w_1, ..., w_n)
Perplexity(p, m) = 2^{H(p, m)}

Perplexity
Better models m of the unknown distribution p will tend to assign higher probabilities m(x_i) to the test events. Thus they have lower perplexity: they are less surprised by the test sample.
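
A rough sketch of how per-word cross entropy and perplexity could be measured on a held-out sample for the bigram model built in the earlier training sketch; the uniform fallback for unseen bigrams is a placeholder, not a real smoother:

    import math

    def perplexity(probs, test_sentences, vocab_size):
        """Per-word perplexity 2^H over a held-out sample."""
        log_sum, n_words = 0.0, 0
        for sentence in test_sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            for prev, curr in zip(words, words[1:]):
                p = probs.get(prev, {}).get(curr, 1.0 / vocab_size)  # crude fallback for unseen bigrams
                log_sum -= math.log2(p)
                n_words += 1
        return 2 ** (log_sum / n_words)

    # `lm` is the bigram model from the earlier sketch; 1616 is the BERP vocabulary size.
    print(perplexity(lm, ["I want to eat lunch"], vocab_size=1616))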

Example [slide from Philipp Koehn]

Comparison of 1-gram through 4-gram LMs [slide from Philipp Koehn]

Smoothing
Words follow a Zipfian distribution:
– A small number of words occur very frequently
– A large number are seen only once
A zero probability for one bigram causes a zero probability for the entire sentence. So how do we estimate the likelihood of unseen n-grams?

Steal from the rich and give to the poor (in probability mass). [Slide from Dan Klein]

Add-One Smoothing
For all possible n-grams, add a count of one:
P_add-1 = (C + 1) / (N + V)
– C = count of the n-gram in the corpus
– N = count of its history
– V = vocabulary size
But there are many more unseen n-grams than seen n-grams.
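
A sketch of add-one (Laplace) smoothing for bigram probabilities; the function and the toy counts are assumptions for illustration:

    from collections import defaultdict

    def add_one_bigram_probs(bigram_counts, vocab):
        """P(w | h) = (C(h, w) + 1) / (C(h) + V), with C(h) summed from the bigram counts."""
        V = len(vocab)
        history_counts = defaultdict(int)
        for (h, _), c in bigram_counts.items():
            history_counts[h] += c
        probs = defaultdict(dict)
        for h in vocab:
            for w in vocab:
                c = bigram_counts.get((h, w), 0)
                probs[h][w] = (c + 1) / (history_counts[h] + V)
        return probs

    bigrams = {("eat", "Chinese"): 2, ("eat", "food"): 1, ("Chinese", "food"): 3}
    p = add_one_bigram_probs(bigrams, vocab=["eat", "Chinese", "food"])
    print(p["eat"]["Chinese"])      # (2 + 1) / (3 + 3) = 0.5
    print(sum(p["eat"].values()))   # each row sums to 1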

Add-One Smoothing
[Tables: unsmoothed bigram counts and unsmoothed normalized bigram probabilities; rows = 1st word, columns = 2nd word; vocabulary size = 1616]

Add-One Smoothing
[Tables: add-one smoothed bigram counts and add-one normalized bigram probabilities]

Add-One Problems
Bigrams starting with "Chinese" are boosted by a factor of about 8 (1829 / 213)!
[Tables: unsmoothed vs. add-one smoothed bigram counts]

Add-One Problems
Every previously unseen n-gram is given a low probability, but there are so many of them that too much probability mass is given to unseen events. Adding 1 to a frequent bigram does not change much, but adding 1 to low-count bigrams (including unseen ones) boosts them too much. In NLP applications, which are very sparse, add-one actually gives far too much of the probability space to unseen events.

Witten-Bell Smoothing
Intuition:
– An unseen n-gram is one that just has not occurred yet
– When it does occur, it will be its first occurrence
– So give unseen n-grams the probability of seeing a new n-gram

Witten-Bell: Unigram Case
N: number of tokens; T: number of types (distinct observed words), which can differ from V (number of words in the dictionary).
Probability mass reserved for unseen unigrams: T / (N + T)
Probability of a seen unigram w: c(w) / (N + T)

Witten-Bell: Bigram Case
Conditioning on the previous word w:
Probability mass reserved for unseen bigrams starting with w: T(w) / (N(w) + T(w)), shared equally among the Z(w) unseen continuations, i.e. T(w) / (Z(w) (N(w) + T(w)))
Probability of a seen bigram w w': c(w, w') / (N(w) + T(w))

Witten-Bell Example
The original counts were:
[Table of bigram counts]
– T(w) = number of different seen bigram types starting with w
– We have a vocabulary of 1616 words, so we can compute Z(w) = number of unseen bigram types starting with w: Z(w) = 1616 − T(w)
– N(w) = number of bigram tokens starting with w

Witten-Bell Example
WB smoothed probabilities:
[Table of Witten-Bell smoothed bigram probabilities]
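
A rough sketch of Witten-Bell smoothing for the bigram case as formulated above, with T(w) = seen continuation types of w and Z(w) = V − T(w); the data and function names are illustrative:

    from collections import defaultdict

    def witten_bell_bigram(bigram_counts, vocab):
        """p(w2 | w) = c(w, w2) / (N(w) + T(w)) if seen, else T(w) / (Z(w) * (N(w) + T(w)))."""
        V = len(vocab)
        N = defaultdict(int)   # N(w): bigram tokens starting with w
        T = defaultdict(int)   # T(w): distinct continuations seen after w
        for (w, _), c in bigram_counts.items():
            N[w] += c
            T[w] += 1
        def prob(w, w2):
            Z = V - T[w]       # Z(w): unseen continuations of w
            if (w, w2) in bigram_counts:
                return bigram_counts[(w, w2)] / (N[w] + T[w])
            return T[w] / (Z * (N[w] + T[w])) if T[w] > 0 else 1.0 / V
        return prob

    bigrams = {("eat", "Chinese"): 2, ("eat", "lunch"): 4, ("Chinese", "food"): 3}
    p = witten_bell_bigram(bigrams, vocab=["eat", "Chinese", "lunch", "food"])
    print(p("eat", "lunch"))   # seen:   4 / (6 + 2) = 0.5
    print(p("eat", "food"))    # unseen: 2 / (2 * (6 + 2)) = 0.125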

Back-off
So far, we gave the same probability to all unseen n-grams.
– We have never seen the bigrams:
  journal of: P_unsmoothed(of | journal) = 0
  journal from: P_unsmoothed(from | journal) = 0
  journal never: P_unsmoothed(never | journal) = 0
– All models so far will give the same probability to all 3 bigrams.
But intuitively, "journal of" is more probable because:
– "of" is more frequent than "from" and "never"
– unigram probability P(of) > P(from) > P(never)

Back-off
Observation:
– A unigram model suffers less from data sparseness than a bigram model
– A bigram model suffers less from data sparseness than a trigram model
– …
So use a lower-order model to estimate the probability of unseen n-grams. If we have several models of how the history predicts what comes next, we can combine them in the hope of producing an even better model.

Linear Interpolation
Address sparseness in a trigram model by mixing it with bigram and unigram models. Also called:
– linear interpolation
– finite mixture models
– deleted interpolation
Combine linearly:
P_li(w_n | w_{n-2}, w_{n-1}) = λ1 P(w_n) + λ2 P(w_n | w_{n-1}) + λ3 P(w_n | w_{n-2}, w_{n-1})
– where 0 ≤ λ_i ≤ 1 and Σ_i λ_i = 1
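
A small sketch of interpolated trigram estimation with fixed weights; the λ values and the toy probability dictionaries are assumptions for illustration (in practice the λs are tuned on held-out data):

    def interpolated_prob(w, w1, w2, unigram_p, bigram_p, trigram_p,
                          lambdas=(0.1, 0.3, 0.6)):
        """P_li(w | w2, w1) = l1*P(w) + l2*P(w | w1) + l3*P(w | w2, w1)."""
        l1, l2, l3 = lambdas               # must sum to 1
        p1 = unigram_p.get(w, 0.0)
        p2 = bigram_p.get((w1, w), 0.0)
        p3 = trigram_p.get((w2, w1, w), 0.0)
        return l1 * p1 + l2 * p2 + l3 * p3

    # Toy distributions.
    uni = {"food": 0.01}
    bi = {("Chinese", "food"): 0.2}
    tri = {("eat", "Chinese", "food"): 0.5}
    print(interpolated_prob("food", "Chinese", "eat", uni, bi, tri))
    # 0.1*0.01 + 0.3*0.2 + 0.6*0.5 = 0.361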

Back-off Smoothing
Smoothing of conditional probabilities, e.g. p(Angeles | to, Los): if "to Los Angeles" is not in the training corpus, the smoothed probability p(Angeles | to, Los) is identical to p(York | to, Los). However, the actual probability is probably close to the bigram probability p(Angeles | Los).

Back-off Smoothing
(Wrong) back-off smoothing of trigram probabilities:
  if C(w', w'', w) > 0:       P*(w | w', w'') = P(w | w', w'')
  else if C(w'', w) > 0:      P*(w | w', w'') = P(w | w'')
  else if C(w) > 0:           P*(w | w', w'') = P(w)
  else:                       P*(w | w', w'') = 1 / #words

Back-off Smoothing
Problem: the scheme above is not a probability distribution.
Solution: combine back-off with smoothing (discounting):
  if C(w_1, ..., w_k, w) > 0:
      P(w | w_1, ..., w_k) = C*(w_1, ..., w_k, w) / N
  else:
      P(w | w_1, ..., w_k) = α(w_1, ..., w_k) P(w | w_2, ..., w_k)
where C* is a discounted count and α(w_1, ..., w_k) distributes the left-over probability mass.
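
The α-weighted scheme above requires properly estimated discounts. As a much simpler illustration of the recursive control flow, here is a "stupid back-off" style sketch: it returns scores rather than a true probability distribution, and all names and counts are assumptions:

    def stupid_backoff(words, counts, alpha=0.4):
        """Score(w | history): relative frequency if the n-gram was seen,
        otherwise alpha times the score for the shortened history.
        `words` is a tuple (w_1, ..., w_k, w); `counts` maps tuples of any
        order (including single words and the empty tuple for the corpus size)
        to their counts."""
        history, w = words[:-1], words[-1]
        if counts.get(words, 0) > 0:
            return counts[words] / counts[history]   # relative frequency of the full n-gram
        if not history:
            return 0.0                               # word never seen at all
        return alpha * stupid_backoff(words[1:], counts, alpha)

    counts = {
        (): 12,                                      # total tokens
        ("Los",): 3, ("Angeles",): 2, ("to",): 4,
        ("to", "Los"): 2, ("Los", "Angeles"): 2,
        ("to", "Los", "Angeles"): 0,                 # trigram unseen
    }
    print(stupid_backoff(("to", "Los", "Angeles"), counts))
    # trigram unseen -> 0.4 * C(Los, Angeles) / C(Los) = 0.4 * 2/3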