1
Natural Language Processing Language Model
2
Language Models Formal grammars (e.g. regular, context-free) give a hard, binary model of the legal sentences in a language. For NLP, a probabilistic model of a language that gives the probability that a string is a member of the language is more useful. To specify a correct probability distribution, the probabilities of all sentences in the language must sum to 1.
3
Uses of Language Models Speech recognition – “I ate a cherry” is a more likely sentence than “Eye eight uh Jerry” OCR & Handwriting recognition – More probable sentences are more likely correct readings Machine translation – More likely sentences are probably better translations Generation – More likely sentences are probably better NL generations Context sensitive spelling correction – “Their are problems wit this sentence.”
4
Completion Prediction A language model also supports predicting the completion of a sentence – Please turn off your cell _____ – Your program does not ______ Predictive text input systems can guess what you are typing and give choices on how to complete it
5
Probability P(X) means the probability that X is true – P(baby is a boy) ≈ 0.5 (% of all babies that are boys) – P(baby is named John) ≈ 0.001 (% of all babies named John)
6
Probability P(X|Y) means the probability that X is true when we already know Y is true – P(baby is named John | baby is a boy) ≈ 0.002 – P(baby is a boy | baby is named John) ≈ 1
7
Probability P(X|Y) = P(X, Y) / P(Y) – P(baby is named John | baby is a boy) = P(baby is named John, baby is a boy) / P(baby is a boy) = 0.001 / 0.5 = 0.002
8
Bayes Rule Bayes rule: P(X|Y) = P(Y|X) P(X) / P(Y) – P(named John | boy) = P(boy | named John) P(named John) / P(boy)
9
Word Sequence Probabilities Given a word sequence (sentence) w_1 … w_n, its probability by the chain rule is: P(w_1 … w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) … P(w_n | w_1 … w_{n-1}). The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N-1)-order Markov model.
10
N-Gram Models Estimate the probability of each word given its prior context – P(phone | Please turn off your cell) The number of parameters required grows exponentially with the number of words of prior context An N-gram model uses only N-1 words of prior context. – Unigram: P(phone) – Bigram: P(phone | cell) – Trigram: P(phone | your cell) Bigram approximation: P(w_1 … w_n) ≈ ∏_k P(w_k | w_{k-1}) N-gram approximation: P(w_1 … w_n) ≈ ∏_k P(w_k | w_{k-N+1} … w_{k-1})
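A minimal sketch of the bigram approximation, assuming a hypothetical `bigram_prob` dict of estimated probabilities (not part of the original slides):

```python
# Sketch: score a sentence with the bigram approximation
# P(w_1 ... w_n) ~= product over k of P(w_k | w_{k-1}).
# bigram_prob is a hypothetical dict mapping (previous_word, word) -> probability.

def bigram_sentence_prob(words, bigram_prob):
    prob = 1.0
    for prev, cur in zip(words[:-1], words[1:]):
        prob *= bigram_prob.get((prev, cur), 0.0)  # unseen bigrams get 0 here (smoothing comes later)
    return prob

# Example: P(phone | cell) contributes one factor of the product
# bigram_sentence_prob(["please", "turn", "off", "your", "cell", "phone"], bigram_prob)
```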
11
Estimating Probabilities N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences. To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words. Bigram: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) N-gram: P(w_n | w_{n-N+1} … w_{n-1}) = C(w_{n-N+1} … w_n) / C(w_{n-N+1} … w_{n-1})
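A small sketch of the relative-frequency estimate on a toy corpus of tokenized sentences, with <s> and </s> appended as described above (the toy corpus and function name are assumptions for illustration):

```python
from collections import Counter

def estimate_bigrams(sentences):
    """Relative-frequency (MLE) bigram estimates with <s> and </s> markers."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        unigram_counts.update(words)
        bigram_counts.update(zip(words[:-1], words[1:]))
    # P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
    return {(prev, cur): count / unigram_counts[prev]
            for (prev, cur), count in bigram_counts.items()}

probs = estimate_bigrams([["i", "want", "to", "eat"],
                          ["i", "want", "chinese", "food"]])
print(probs[("i", "want")])   # 1.0 in this toy corpus
print(probs[("want", "to")])  # 0.5
```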
12
Generative Model and MLE An N-gram model can be seen as a probabilistic automaton for generating sentences: initialize the sentence with N-1 <s> symbols; until </s> is generated, stochastically pick the next word based on the conditional probability of each word given the previous N-1 words. Relative frequency estimates can be proven to be maximum likelihood estimates (MLE) since they maximize the probability that the model M will generate the training corpus T.
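A sketch of this generative procedure for the bigram case (N = 2), assuming the `probs` dictionary from the previous sketch:

```python
import random

def generate_sentence(probs):
    """Sample words from a bigram model until </s> is produced (N - 1 = 1 <s> symbol)."""
    sentence, prev = [], "<s>"
    while True:
        # Candidate continuations of the current history and their conditional probabilities
        candidates = [(cur, p) for (hist, cur), p in probs.items() if hist == prev]
        words, weights = zip(*candidates)
        nxt = random.choices(words, weights=weights)[0]
        if nxt == "</s>":
            return sentence
        sentence.append(nxt)
        prev = nxt

# print(generate_sentence(probs))
```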
13
Example Estimate the likelihood of the sentence: I want to eat Chinese food P(I want to eat Chinese food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(</s> | food) What do we need to calculate these likelihoods? – Bigram probabilities for each word pair sequence in the sentence – Calculated from a large corpus
14
Corpus A language model must be trained on a large corpus of text to estimate good parameter values Model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate) Ideally, the training (and test) corpus should be representative of the actual application data
15
Terminology Types: number of distinct words in a corpus (vocabulary size) Tokens: total number of words
16
Early Bigram Probabilities from BERP (Berkeley Restaurant Project) – bigrams starting with "eat":
Eat on .16 | Eat some .06 | Eat lunch .06 | Eat dinner .05 | Eat at .04 | Eat a .04 | Eat Indian .04 | Eat today .03 | Eat Thai .03 | Eat breakfast .03 | Eat in .02 | Eat Chinese .02 | Eat Mexican .02 | Eat tomorrow .01 | Eat dessert .007 | Eat British .001
17
More BERP bigram probabilities, grouped by first word:
<s> I .25 | <s> I'd .06 | <s> Tell .04 | <s> I'm .02
I want .32 | I would .29 | I don't .08 | I have .04
Want to .65 | Want a .05 | Want some .04 | Want Thai .01
To eat .26 | To have .14 | To spend .09 | To be .02
British food .60 | British restaurant .15 | British lunch .01 | British cuisine .01
18
Back to our sentence: I want to eat Chinese food. BERP bigram counts (row = first word, column = second word):
          I     want   to    eat   Chinese  food  lunch
I         8     1087   0     13    0        0     0
want      3     0      786   0     6        8     6
to        3     0      10    860   3        0     12
eat       0     0      2     0     19       2     52
Chinese   2     0      0     0     0        120   1
food      19    0      17    0     0        0     0
lunch     4     0      0     0     0        1     0
19
Relative Frequencies Normalization: divide each row's counts by the appropriate unigram count for w_{n-1} Computing the bigram probability of "I I" – C(I, I) / C(I) – P(I | I) = 8 / 3437 = .0023 BERP unigram counts: I 3437 | want 1215 | to 3256 | eat 938 | Chinese 213 | food 1506 | lunch 459
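The same normalization, sketched in code using the BERP numbers from the tables above (only the "I" row is shown):

```python
# Normalize one row of the bigram count table by the unigram count of the history word.
unigram_counts = {"I": 3437, "want": 1215, "to": 3256, "eat": 938,
                  "Chinese": 213, "food": 1506, "lunch": 459}
bigram_counts_I = {"I": 8, "want": 1087, "to": 0, "eat": 13,
                   "Chinese": 0, "food": 0, "lunch": 0}

p_given_I = {w: c / unigram_counts["I"] for w, c in bigram_counts_I.items()}
print(round(p_given_I["I"], 4))     # 0.0023, as computed above
print(round(p_given_I["want"], 2))  # 0.32, matching the earlier probability table
```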
20
Approximating Shakespeare Generating sentences with random unigrams... – Every enter now severally so, let – Hill he late speaks; or! a more to leg less first you enter With bigrams... – What means, sir. I confess she? then all sorts, he is trim, captain. – Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Trigrams – Sweet prince, Falstaff shall die. – This shall forbid it should be branded, if renown made it empty.
21
Approximating Shakespeare Quadrigrams – What! I will go seek the traitor Gloucester. – Will you not tell me who I am? – What's coming out here looks like Shakespeare because it is Shakespeare Note: As we increase the value of N, the accuracy of an n-gram model increases, since choice of next word becomes increasingly constrained
22
Evaluation Perplexity and entropy: how do you estimate how well your language model fits a corpus once you're done?
23
Random Variables A variable defined by the probabilities of each possible value in the population Discrete random variable – whole-number values (0, 1, 2, 3, etc.) – a countable, finite number of values – jumps from one value to the next and cannot take any values in between Continuous random variable – whole or fractional values – obtained by measuring – an infinite number of values in an interval, too many to list as with a discrete variable
24
Discrete Random Variables For example: # of girls in the family # of correct answers on a given exam …
25
Probability Mass Function – x = value of the random variable (outcome) – p(x) = probability associated with that value Values are mutually exclusive (no overlap) and collectively exhaustive (nothing left out): 0 ≤ p(x) ≤ 1 and Σ_x p(x) = 1
26
Measures Expected value – mean of the probability distribution – weighted average of all possible values – μ = E(X) = Σ_x x p(x) Variance – weighted average squared deviation about the mean – σ² = V(X) = E[(X − μ)²] = Σ_x (x − μ)² p(x) – σ² = V(X) = E(X²) − [E(X)]² Standard deviation – σ = √σ² = SD(X)
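A small sketch of these measures for a discrete distribution given as a dict mapping values to probabilities (the example pmf is an assumption for illustration):

```python
import math

def expected_value(pmf):
    return sum(x * p for x, p in pmf.items())              # E(X) = sum_x x * p(x)

def variance(pmf):
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())  # V(X) = E[(X - mu)^2]

pmf = {0: 0.25, 1: 0.5, 2: 0.25}    # e.g. number of girls in a two-child family
print(expected_value(pmf))           # 1.0
print(variance(pmf))                 # 0.5
print(math.sqrt(variance(pmf)))      # standard deviation ~= 0.707
```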
27
Perplexity and Entropy Information-theoretic metrics – Useful in measuring how well a grammar or language model (LM) models a natural language or a corpus Entropy: with 2 LMs and a corpus, which LM is the better match for the corpus? How much information is there (in e.g. a grammar or LM) about what the next word will be? For a random variable X ranging over e.g. bigrams and a probability function p(x), the entropy of X is the expected negative log probability: H(X) = −Σ_x p(x) log2 p(x)
28
Entropy – Example Horse race – 8 horses, we want to send a bet to the bookie In the naïve way – a 3-bit message Can we do better? Suppose we know the distribution of the bets placed, i.e.: Horse 1: 1/2, Horse 2: 1/4, Horse 3: 1/8, Horse 4: 1/16, Horse 5: 1/64, Horse 6: 1/64, Horse 7: 1/64, Horse 8: 1/64
29
Entropy – Example The entropy of the random variable X gives a lower bound on the average number of bits needed per bet – H(X) = −Σ_i p(i) log2 p(i) = 1/2·1 + 1/4·2 + 1/8·3 + 1/16·4 + 4·(1/64·6) = 2 bits
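The same computation as a sketch: the entropy of the bet distribution comes out to 2 bits, versus 3 bits for the naïve uniform code:

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

horse_bets = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(entropy(horse_bets))   # 2.0 bits
print(entropy([1/8] * 8))    # 3.0 bits for the uniform distribution
```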
30
Perplexity The weighted average number of choices a random variable has to make: 2^H(X) In the previous example, with 8 equally likely horses, it is 8
31
Entropy of a Language Measuring over all the sequences of length n in a language L: H(w_1, …, w_n) = −Σ_{W in L} p(w_1, …, w_n) log p(w_1, …, w_n) Entropy rate (per-word entropy): (1/n) H(w_1, …, w_n)
32
Cross Entropy and Perplexity Given the true probability distribution of the language p and a model m, we want to measure how well m predicts p Cross-entropy: H(p, m) = lim_{n→∞} −(1/n) Σ_{W in L} p(w_1, …, w_n) log m(w_1, …, w_n), with H(p) ≤ H(p, m); perplexity = 2^{H(p, m)}
33
Perplexity Better models m of the unknown distribution p will tend to assign higher probabilities m(x_i) to the test events. Thus, they have lower perplexity: they are less surprised by the test sample
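A sketch of measuring a model on held-out data: cross-entropy is approximated by the average negative log probability the model assigns to the test events, and perplexity is 2 to that power (assumes a hypothetical `model_prob` function that returns m(x_i) and never returns zero, e.g. because the model is smoothed):

```python
import math

def cross_entropy(test_events, model_prob):
    """H(p, m) ~= -(1/N) * sum_i log2 m(x_i), estimated on the test sample."""
    log_probs = [math.log2(model_prob(x)) for x in test_events]
    return -sum(log_probs) / len(log_probs)

def perplexity(test_events, model_prob):
    return 2 ** cross_entropy(test_events, model_prob)

# A better model assigns higher m(x_i) to the test events -> lower perplexity.
print(perplexity(["a", "b", "c", "d"], lambda x: 0.25))  # 4.0
print(perplexity(["a", "b", "c", "d"], lambda x: 0.5))   # 2.0
```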
34
Example Slide from Philipp Koehn
35
Comparison of 1- to 4-gram LMs Slide from Philipp Koehn
36
Smoothing Words follow a Zipfian distribution – a small number of words occur very frequently – a large number are seen only once A zero probability for one bigram causes a zero probability for the entire sentence So… how do we estimate the likelihood of unseen n-grams?
37
Steal from the rich and give to the poor (in probability mass) Slide from Dan Klein
38
Add-One Smoothing For all possible n-grams, add a count of one: P = (C + 1) / (N + V), where – C = count of the n-gram in the corpus – N = count of the history – V = vocabulary size But there are many more unseen n-grams than seen n-grams
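A sketch of add-one (Laplace) smoothing for bigrams, using the C, N, V symbols above; the example numbers reuse the BERP-style counts from these slides (V = 1616), and the data structures are assumptions for illustration:

```python
def add_one_bigram_prob(prev, word, bigram_counts, unigram_counts, vocab_size):
    """P(word | prev) = (C + 1) / (N + V): C = bigram count, N = history count, V = vocabulary size."""
    c = bigram_counts.get((prev, word), 0)
    n = unigram_counts.get(prev, 0)
    return (c + 1) / (n + vocab_size)

bigram_counts = {("I", "want"): 1087}
unigram_counts = {"I": 3437}
print(add_one_bigram_prob("I", "want", bigram_counts, unigram_counts, 1616))   # ~0.215 (was ~0.32 unsmoothed)
print(add_one_bigram_prob("I", "lunch", bigram_counts, unigram_counts, 1616))  # unseen bigram: ~0.0002, no longer 0
```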
39
Add-One Smoothing Unsmoothed bigram counts and unsmoothed normalized bigram probabilities (tables indexed by 1st word × 2nd word; vocabulary = 1616 words)
40
Add-One Smoothing Add-one smoothed bigram counts and add-one normalized bigram probabilities (the same tables after smoothing)
41
Add-One Problems The denominator for bigrams starting with "Chinese" grows by a factor of about 8 (1829 / 213, i.e. from C(Chinese) = 213 to C(Chinese) + V = 1829), so their estimates change drastically (compare the unsmoothed and add-one smoothed bigram counts)
42
Add-One Problems Every previously unseen n-gram is given a low probability, but there are so many of them that too much probability mass is given to unseen events Adding 1 to a frequent bigram does not change much, but adding 1 to low-count bigrams (including unseen ones) boosts them too much! In NLP applications, which are very sparse, add-one actually gives far too much of the probability space to unseen events
43
Witten-Bell Smoothing Intuition: – An unseen n-gram is one that just did not occur yet – When it does happen, it will be its first occurrence – So give to unseen n-grams the probability of seeing a new n-gram
44
Witten-Bell – Unigram Case N: number of tokens T: number of types (different observed words) – can be different from V (number of words in the dictionary) Total probability of unseen unigrams: T / (N + T) Probability of a seen unigram w: c(w) / (N + T)
45
Witten-Bell – Bigram Case Conditioning on the history w: total probability of unseen bigrams starting with w: T(w) / (N(w) + T(w)) Probability of a seen bigram (w, v): c(w, v) / (N(w) + T(w))
46
Witten-Bell Example The original counts were the unsmoothed BERP bigram counts – T(w) = number of different seen bigram types starting with w We have a vocabulary of 1616 words, so we can compute Z(w) = number of unseen bigram types starting with w: Z(w) = 1616 − T(w) N(w) = number of bigram tokens starting with w
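A sketch of the Witten-Bell bigram estimate built from the quantities defined above (seen bigrams get c / (N(w) + T(w)); unseen bigrams share T(w) / (N(w) + T(w)) equally among the Z(w) unseen types). The data structures and toy counts are assumptions for illustration:

```python
def witten_bell_bigram_prob(prev, word, bigram_counts, vocab_size=1616):
    """Witten-Bell bigram estimate conditioned on the history `prev`."""
    continuations = {w2: c for (w1, w2), c in bigram_counts.items() if w1 == prev}
    n = sum(continuations.values())   # N(w): bigram tokens starting with prev
    t = len(continuations)            # T(w): seen bigram types starting with prev
    z = vocab_size - t                # Z(w): unseen bigram types starting with prev
    if word in continuations:
        return continuations[word] / (n + t)
    return t / (z * (n + t))          # unseen-mass T/(N+T) split over the Z unseen types

# Seen and unseen continuations of "eat", with toy counts:
counts = {("eat", "lunch"): 52, ("eat", "Chinese"): 19, ("eat", "food"): 2}
print(witten_bell_bigram_prob("eat", "lunch", counts))  # seen bigram
print(witten_bell_bigram_prob("eat", "pizza", counts))  # unseen bigram: small but nonzero
```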
47
Witten-Bell Example WB smoothed probabilities (table computed from the T(w), Z(w), and N(w) values above)
48
Back-off So far, we gave the same probability to all unseen n-grams – we have never seen the bigrams "journal of", "journal from", "journal never": P_unsmoothed(of | journal) = 0, P_unsmoothed(from | journal) = 0, P_unsmoothed(never | journal) = 0 – all models so far will give the same probability to all 3 bigrams but intuitively, "journal of" is more probable because... – "of" is more frequent than "from" & "never" – unigram probability P(of) > P(from) > P(never)
49
Back-off Observation: – a unigram model suffers less from data sparseness than a bigram model – a bigram model suffers less from data sparseness than a trigram model – … So use a lower-order model estimate to estimate the probability of unseen n-grams If we have several models of how the history predicts what comes next, we can combine them in the hope of producing an even better model
50
Linear Interpolation Solve the sparseness in a trigram model by mixing with bigram and unigram models Also called: – linear interpolation – finite mixture models – deleted interpolation Combine linearly: P_li(w_n | w_{n-2}, w_{n-1}) = λ_1 P(w_n) + λ_2 P(w_n | w_{n-1}) + λ_3 P(w_n | w_{n-2}, w_{n-1}) – where 0 ≤ λ_i ≤ 1 and Σ_i λ_i = 1
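A sketch of this interpolation with fixed, hand-set weights (in practice the λ values are tuned on held-out data; the dictionaries and weights here are assumptions for illustration):

```python
def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P_li(w | w1, w2) = l1*P(w) + l2*P(w | w2) + l3*P(w | w1, w2), with sum(lambdas) = 1."""
    l1, l2, l3 = lambdas
    return (l1 * p_uni.get(w, 0.0)
            + l2 * p_bi.get((w2, w), 0.0)
            + l3 * p_tri.get((w1, w2, w), 0.0))

# Even if the trigram is unseen (its term contributes 0), the bigram and
# unigram terms keep the interpolated probability nonzero.
```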
51
Back-off Smoothing Smoothing of conditional probabilities p(Angeles | to, Los) If "to Los Angeles" is not in the training corpus, the smoothed probability p(Angeles | to, Los) is identical to p(York | to, Los) However, the actual probability is probably close to the bigram probability p(Angeles | Los)
52
Back-off Smoothing (Wrong) back-off smoothing of trigram probabilities:
if C(w', w'', w) > 0: P*(w | w', w'') = P(w | w', w'')
else if C(w'', w) > 0: P*(w | w', w'') = P(w | w'')
else if C(w) > 0: P*(w | w', w'') = P(w)
else: P*(w | w', w'') = 1 / #words
53
Back-off Smoothing Problem: this is not a probability distribution Solution: combination of back-off and smoothing (discounting):
if C(w_1, ..., w_k, w) > 0: P(w | w_1, ..., w_k) = C*(w_1, ..., w_k, w) / N
else: P(w | w_1, ..., w_k) = α(w_1, ..., w_k) P(w | w_2, ..., w_k)
where C* is a discounted count and α(w_1, ..., w_k) distributes the left-over probability mass to the backed-off estimates
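A sketch of the recursive back-off scheme, with a fixed back-off weight standing in for the properly computed α and discounted counts C* (so this is closer to a "stupid back-off" shortcut than to exact Katz back-off; the weight value, data structures, and example call are assumptions for illustration):

```python
def backoff_prob(w, history, counts, context_counts, vocab_size, alpha=0.4):
    """Simplified recursive back-off.
    counts maps n-gram tuples to counts; context_counts maps history tuples to
    counts, with the empty tuple () mapped to the total number of tokens."""
    ngram = history + (w,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / context_counts[history]  # relative frequency of the seen n-gram
    if history:
        # Back off to the shorter history, scaled by a fixed weight alpha.
        return alpha * backoff_prob(w, history[1:], counts, context_counts, vocab_size, alpha)
    return 1 / vocab_size  # last resort: uniform over the vocabulary

# e.g. backoff_prob("Angeles", ("to", "Los"), counts, context_counts, vocab_size=1616)
```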