Language Modeling
Roadmap (for next two classes)
- Review LM evaluation metrics: entropy, perplexity
- Smoothing: Good-Turing, backoff and interpolation, absolute discounting, Kneser-Ney
Language Model Evaluation Metrics
Applications
Entropy and perplexity
Entropy: measures information content, in bits.
$H(p) = \sum_x p(x) \times \left(-\log_2 p(x)\right)$
$-\log_2 p(x)$ is the message length under an ideal code. Use $\log_2$ if you want to measure in bits!
Cross entropy: measures the ability of a trained model $m$ to compactly represent test data $w_1^n$.
$\frac{1}{n} \sum_{i=1}^{n} -\log_2 m(w_i \mid w_1^{i-1})$
This is the average negative log-probability of the test data.
Perplexity: measures the average branching factor.
$2^{\text{cross entropy}}$
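A minimal sketch of these quantities in Python; the toy distribution is an assumption for illustration, not from the slides:

```python
import math

def entropy(p):
    """H(p) = sum_x p(x) * -log2 p(x), in bits."""
    return sum(px * -math.log2(px) for px in p.values() if px > 0)

# toy distribution, assumed for illustration
p = {"a": 0.5, "b": 0.25, "c": 0.25}
print(entropy(p))       # 1.5 bits
print(2 ** entropy(p))  # 2^H ~ 2.83: the average branching factor
```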
Language model perplexity
Recipe (see the sketch below):
- Train a language model on training data
- Get negative log-probabilities of the test data and compute their average
- Exponentiate!
Perplexity correlates rather well with:
- Speech recognition error rates
- MT quality metrics
Perplexities for word-based LMs are normally between, say, 50 and 1000. You need to drop perplexity by a significant fraction (not an absolute amount) to make a visible impact.
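A hedged sketch of this recipe for a unigram model; the toy corpus is invented, and add-one smoothing is used only so that no test word gets probability zero:

```python
import math
from collections import Counter

train = "the cat sat on the mat the cat ate".split()
test  = "the cat sat on the mat".split()

counts, total = Counter(train), len(train)
V = len(counts)

def prob(w):
    # add-one smoothed unigram probability (unseen words get a small non-zero value)
    return (counts[w] + 1) / (total + V)

cross_entropy = sum(-math.log2(prob(w)) for w in test) / len(test)
perplexity = 2 ** cross_entropy
print(cross_entropy, perplexity)
```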
Parameter estimation What is it?
The model form is fixed (coin unigrams, word bigrams, ...); we have observations H H H T T H T H H and want to find the parameters.
Maximum Likelihood Estimation: pick the parameters that assign the most probability to our training data.
c(H) = 6; c(T) = 3
P(H) = 6/9 = 2/3; P(T) = 3/9 = 1/3
MLE picks the parameters that are best for the training data... but these don't generalize well to test data: zeros!
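The coin example in a few lines of Python, a sketch of the MLE computation above:

```python
from collections import Counter

obs = list("HHHTTHTHH")                     # H H H T T H T H H
counts = Counter(obs)                       # c(H) = 6, c(T) = 3
mle = {x: c / len(obs) for x, c in counts.items()}
print(mle)                                  # {'H': 0.667, 'T': 0.333} (approximately)
```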
Smoothing Take mass from seen events, give to unseen events
Robin Hood for probability models. MLE is at one end of the spectrum; the uniform distribution is at the other. We need to pick a happy medium, and yet maintain a valid distribution:
$\sum_x p(x) = 1, \qquad p(x) \ge 0 \;\; \forall x$
Smoothing techniques
- Laplace
- Good-Turing
- Backoff
- Mixtures
- Interpolation
- Kneser-Ney
Laplace
From MLE: $P(x) = \frac{c(x)}{\sum_{x'} c(x')}$
To Laplace: $P(x) = \frac{c(x) + 1}{\sum_{x'} \left(c(x') + 1\right)}$
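A minimal sketch of the MLE and Laplace estimators over unigram counts; the tiny corpus and the extra unseen word in the vocabulary are assumptions for illustration:

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
vocab = set(corpus) | {"dog"}                         # include one unseen word
c = Counter(corpus)
N, V = sum(c.values()), len(vocab)

p_mle     = {x: c[x] / N for x in vocab}              # unseen "dog" gets probability 0
p_laplace = {x: (c[x] + 1) / (N + V) for x in vocab}  # every count is bumped by 1
print(p_mle["dog"], p_laplace["dog"])
```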
Good-Turing Smoothing
New idea: Use counts of things you have seen to estimate those you haven't
Good-Turing Josh Goodman Intuition
Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass. You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.
How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18
Assuming so, how likely is it that the next species is trout? It must be less than 1/18.
(Slide adapted from Josh Goodman, Dan Jurafsky)
Some more hypotheticals
[Table: hypothetical catch counts by species (salmon, trout, cod, rockfish, snapper, skate, bass) for Puget Sound, Lake Washington, and Greenlake; most cell values did not survive extraction.]
How likely is it to find a new fish in each of these places?
Good-Turing Smoothing
New idea: Use counts of things you have seen to estimate those you haven't.
Good-Turing approach: use the frequency of singletons to re-estimate the frequency of zero-count n-grams.
Notation: $N_c$ is the frequency of frequency $c$, i.e. the number of n-grams which appear $c$ times.
$N_0$: number of n-grams with count 0; $N_1$: number of n-grams with count 1.
$N_c = \sum_{x : c(x) = c} 1$
Good-Turing Smoothing
Estimate the probability of things which occur $c$ times with the probability of things which occur $c+1$ times.
Discounted counts: steal mass from seen cases to provide for the unseen:
$c^* = (c+1)\,\frac{N_{c+1}}{N_c}$
MLE: $P(x) = \frac{c(x)}{N}$    GT: $P(x) = \frac{c^*(x)}{N}$
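A sketch of this re-estimation on the fishing counts from earlier. It uses the raw $N_c$ bins, so it is only defined when $N_{c+1}$ is non-zero (the later slides address that):

```python
from collections import Counter

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())            # 18 fish
Nc = Counter(catch.values())       # frequency of frequencies: {1: 3, 2: 1, 3: 1, 10: 1}

def c_star(c):
    # c* = (c + 1) * N_{c+1} / N_c   (undefined when N_{c+1} = 0)
    return (c + 1) * Nc[c + 1] / Nc[c]

p_unseen = Nc[1] / N               # mass reserved for unseen species: 3/18
p_trout  = c_star(1) / N           # GT estimate for a species seen once: ~0.037 < 1/18
print(p_unseen, p_trout)
```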
GT Fish Example
Enough about the fish... how does this relate to language?
Name some linguistic situations where the number of new words would differ.
Different languages:
- Chinese has almost no morphology; Turkish has a lot of morphology. Lots of new words in Turkish!
Different domains:
- Airplane maintenance manuals: controlled vocabulary
- Random web posts: uncontrolled vocabulary
Bigram Frequencies of Frequencies and GT Re-estimates
Good-Turing Smoothing
N-gram counts to conditional probability:
$P(w_i \mid w_1 \dots w_{i-1}) = \frac{c^*(w_1 \dots w_i)}{c^*(w_1 \dots w_{i-1})}$
Use $c^*$ from the GT estimate.
Additional Issues in Good-Turing
General approach: the estimate of $c^*$ for $N_c$ depends on $N_{c+1}$. What if $N_{c+1} = 0$? More zero-count problems. This is not uncommon: e.g. in the fish example there are no species with count 4.
Modifications Simple Good-Turing
Compute the $N_c$ bins, then smooth $N_c$ to replace the zeroes: fit a linear regression in log space, $\log N_c = a + b \log c$.
What about large $c$'s? Those counts should be reliable, so assume $c^* = c$ if $c$ is large, e.g. $c > k$ (Katz: $k = 5$).
Typically combined with other approaches.
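A rough numpy sketch of the log-space regression; the $N_c$ values are invented (note the missing $c = 4$ bin), and a full Simple Good-Turing implementation (e.g. Gale and Sampson's) adds further details such as when to switch from raw to smoothed bins:

```python
import numpy as np

Nc = {1: 120, 2: 40, 3: 15, 5: 6, 6: 4}      # observed frequency-of-frequency bins (no c = 4)

cs = np.array(sorted(Nc))
b, a = np.polyfit(np.log(cs), np.log([Nc[c] for c in cs]), 1)   # fit log Nc = a + b log c

def smoothed_Nc(c):
    return np.exp(a + b * np.log(c))

c = 3
c_star = (c + 1) * smoothed_Nc(c + 1) / smoothed_Nc(c)          # works even though N_4 = 0
print(c_star)
```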
Backoff and Interpolation
Another really useful source of knowledge: if we are estimating the trigram p(z|x,y) but count(xyz) is zero, use info from:
- the bigram p(z|y)
- or even the unigram p(z)
How can we combine this trigram, bigram, and unigram info in a valid fashion?
Backoff vs. Interpolation
Backoff: use the trigram if you have it, otherwise the bigram, otherwise the unigram.
Interpolation: always mix all three.
Backoff
Bigram distribution: $P(b \mid a) = \frac{c(ab)}{c(a)}$. But $c(a)$ could be zero...
What if we fell back (or "backed off") to a unigram distribution?
$P(b \mid a) = \begin{cases} \frac{c(ab)}{c(a)} & \text{if } c(a) > 0 \\ P(b) & \text{otherwise} \end{cases}$
Also $c(ab)$ could be zero...
Backoff: What's wrong with this distribution?
$P(b \mid a) = \begin{cases} \frac{c(ab)}{c(a)} & \text{if } c(ab) > 0 \\ P(b) & \text{if } c(ab) = 0,\ c(a) > 0 \\ P(b) & \text{if } c(a) = 0 \end{cases}$
It doesn't sum to one! We need to steal mass...
Backoff
$P(b \mid a) = \begin{cases} \frac{c(ab) - D}{c(a)} & \text{if } c(ab) > 0 \\ \alpha(a)\, P(b) & \text{if } c(ab) = 0,\ c(a) > 0 \\ P(b) & \text{if } c(a) = 0 \end{cases}$
$\alpha(a) = \dfrac{1 - \sum_{b' : c(ab') \neq 0} P(b' \mid a)}{\sum_{b' : c(ab') = 0} P(b')}$
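A sketch of this discounted backoff for bigrams. The tiny corpus and the fixed discount D = 0.5 are assumptions (Katz backoff derives the discount from Good-Turing instead), and the alpha computation assumes the context has at least one unseen continuation:

```python
from collections import Counter

corpus = "a b a c a b b c".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
N, V = len(corpus), set(corpus)
D = 0.5                                        # assumed fixed discount

def p_uni(b):
    return unigrams[b] / N

def alpha(a):
    seen = [b for b in V if bigrams[(a, b)] > 0]
    left_over = 1 - sum((bigrams[(a, b)] - D) / unigrams[a] for b in seen)
    return left_over / sum(p_uni(b) for b in V if bigrams[(a, b)] == 0)

def p_backoff(b, a):
    if bigrams[(a, b)] > 0:
        return (bigrams[(a, b)] - D) / unigrams[a]
    if unigrams[a] > 0:
        return alpha(a) * p_uni(b)
    return p_uni(b)

print(sum(p_backoff(b, "a") for b in V))       # ~1.0: the discounted mass is redistributed
```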
Mixtures
Given distributions $P_1(x)$ and $P_2(x)$, pick any number $\lambda$ between 0 and 1. Then
$P(x) = \lambda P_1(x) + (1 - \lambda) P_2(x)$
is a distribution. (Laplace is a mixture!)
Interpolation
Simple interpolation:
$\hat{P}(w_i \mid w_{i-1}) = \lambda\, \frac{c(w_{i-1} w_i)}{c(w_{i-1})} + (1 - \lambda)\, \frac{c(w_i)}{N}, \qquad \lambda \in [0, 1]$
Or, pick the interpolation value based on context:
$\hat{P}(w_i \mid w_{i-1}) = \lambda(w_{i-1})\, \frac{c(w_{i-1} w_i)}{c(w_{i-1})} + (1 - \lambda(w_{i-1}))\, \frac{c(w_i)}{N}$
Intuition: put higher weight on more frequent n-grams.
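A minimal sketch of simple interpolation with a fixed weight; the corpus and lambda = 0.7 are assumptions, and the context-dependent variant would simply make the weight a function of the previous word:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate the mat".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
N = len(corpus)
lam = 0.7                                      # assumed interpolation weight

def p_interp(w, w_prev):
    p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    p_un = unigrams[w] / N
    return lam * p_bi + (1 - lam) * p_un

print(p_interp("cat", "the"), p_interp("ate", "the"))  # unseen bigrams still get unigram mass
```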
How to Set the Lambdas? Use a held-out, or development, corpus
Choose the lambdas which maximize the probability of some held-out data: fix the n-gram probabilities, then search for the lambda values that, when plugged into the previous equation, give the largest probability for the held-out set. EM can be used to do this search.
Kneser-Ney Smoothing Most commonly used modern smoothing technique
Intuition: improving backoff.
"I can't see without my reading ______"
Compare P(Francisco|reading) vs. P(glasses|reading):
- P(Francisco|reading) backs off to P(Francisco)
- P(glasses|reading) > 0
- The high unigram frequency of Francisco can push the backed-off estimate above P(glasses|reading)
- However, Francisco appears in few contexts, glasses in many
So interpolate based on the number of contexts: words seen in more contexts are more likely to appear in new ones.
Kneser-Ney Smoothing: bigrams
Modeling diversity of contexts:
$c_{div}(w) = $ # of contexts in which $w$ occurs $= |\{v : c(vw) > 0\}|$
So $c_{div}(\text{glasses}) \gg c_{div}(\text{Francisco})$
$P_{div}(w_i) = \frac{c_{div}(w_i)}{\sum_{w'} c_{div}(w')}$
Kneser-Ney Smoothing: bigrams
With $c_{div}$ and $P_{div}$ as defined above, backoff:
$P(w_i \mid w_{i-1}) = \begin{cases} \frac{c(w_{i-1} w_i) - D}{c(w_{i-1})} & \text{if } c(w_{i-1} w_i) > 0 \\ \alpha(w_{i-1})\, P_{div}(w_i) & \text{otherwise} \end{cases}$
Kneser-Ney Smoothing: bigrams
Interpolation:
$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - D}{c(w_{i-1})} + \beta(w_{i-1})\, P_{div}(w_i)$
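A sketch of the interpolated form. The corpus and D = 0.75 are assumptions, the discounted count is clipped at zero, and beta(w_{i-1}) is taken to be D times the number of distinct continuations of w_{i-1} divided by c(w_{i-1}); that choice is not stated on the slide, but it is the standard normalizer that makes the distribution sum to one:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate the mat".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
D = 0.75                                           # assumed discount

c_div = Counter(w for (v, w) in bigrams)           # distinct left contexts of each word
n_cont = Counter(v for (v, w) in bigrams)          # distinct continuations of each context
total_div = sum(c_div.values())

def p_div(w):
    return c_div[w] / total_div

def p_kn(w, w_prev):
    discounted = max(bigrams[(w_prev, w)] - D, 0) / unigrams[w_prev]
    beta = D * n_cont[w_prev] / unigrams[w_prev]   # assumed normalizer
    return discounted + beta * p_div(w)

print(p_kn("mat", "the"), p_kn("ate", "the"))      # seen bigram vs. unseen bigram falling back on P_div
```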
OOV words: <UNK> word
Out Of Vocabulary = OOV words.
We don't use GT smoothing for these, because GT assumes we know the number of unseen events. Instead, create an unknown word token <UNK>.
Training <UNK> probabilities (see the sketch below):
- Create a fixed lexicon L of size V
- At the text normalization phase, change any training word not in L to <UNK>
- Now train its probabilities like a normal word
At decoding time, if the input is text: use the <UNK> probabilities for any word not seen in training, plus an additional penalty! <UNK> predicts the class of unknown words; we then still need to pick a member.
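A sketch of the training-side normalization; the minimum-count cutoff used to build the lexicon L is an assumption (keeping the top V most frequent words is equally common):

```python
from collections import Counter

train = "the cat sat on the mat the dog sat".split()
counts = Counter(train)

L = {w for w, c in counts.items() if c >= 2}      # fixed lexicon (assumed cutoff: count >= 2)

normalized = [w if w in L else "<UNK>" for w in train]
print(normalized)                                 # rare words become <UNK>; train on this text

def normalize_at_decode(w):
    # at decoding time, map any out-of-lexicon word to <UNK> before probability lookup
    return w if w in L else "<UNK>"
```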
Class-Based Language Models
Variant of n-gram models using classes or clusters.
Motivation: sparseness. In a flight application, instead of P(ORD|to), P(JFK|to), ..., use P(airport_name|to). Relate the probability of an n-gram to word classes and a class n-gram.
IBM clustering: assume each word belongs to a single class, and (see the sketch below)
$P(w_i \mid w_{i-1}) \approx P(c_i \mid c_{i-1}) \times P(w_i \mid c_i)$
Learn by MLE from data.
Where do classes come from?
- Hand-designed for the application (e.g. ATIS)
- Automatically induced clusters from a corpus
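A sketch of the IBM-style decomposition with a hand-designed word-to-class map; the classes and the tiny corpus are invented for illustration:

```python
from collections import Counter

word2class = {"fly": "VERB", "to": "PREP", "ORD": "AIRPORT", "JFK": "AIRPORT"}
corpus = "fly to ORD fly to JFK".split()
classes = [word2class[w] for w in corpus]

class_bigrams = Counter(zip(classes, classes[1:]))
class_unigrams = Counter(classes)
word_counts = Counter(corpus)

def p_class_bigram(w, w_prev):
    c, c_prev = word2class[w], word2class[w_prev]
    p_cc = class_bigrams[(c_prev, c)] / class_unigrams[c_prev]  # P(c_i | c_{i-1})
    p_wc = word_counts[w] / class_unigrams[c]                   # P(w_i | c_i)
    return p_cc * p_wc                                          # P(w_i | w_{i-1}) ~ product

print(p_class_bigram("JFK", "to"))   # 0.5, even though the bigram "to JFK" occurs only once
```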
LM Adaptation Challenge: Need LM for new domain
We have little in-domain data.
Intuition: much of language is pretty general, so we can build from a "general" LM + in-domain data.
Approach: LM adaptation
- Train on a large domain-independent corpus
- Adapt with a small in-domain data set
What large corpus? Web counts! e.g. Google n-grams
Incorporating Longer Distance Context
Why use longer context?
- N-grams are an approximation
- Model size
- Sparseness
What sorts of information are in longer context?
- Priming
- Topic
- Sentence type
- Dialogue act
- Syntax
Long Distance LMs
Bigger n!
- 284M words: up to 6-grams improve; 7- to 20-grams are no better
Cache n-grams (sketched below):
- Intuition (priming): a word used previously is more likely to be used again
- Incrementally create a "cache" unigram model on the test corpus
- Mix it with the main n-gram LM
Topic models:
- Intuition: text is about some topic, so on-topic words are likely
- $P(w \mid h) \approx \sum_t P(w \mid t)\, P(t \mid h)$
Non-consecutive n-grams: skip n-grams, triggers, variable-length n-grams
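A sketch of the cache idea: a unigram model grown incrementally over the test words seen so far, mixed into the main LM. The mixing weight and the toy main model (uniform over an assumed 1000-word vocabulary) are assumptions:

```python
from collections import Counter

def cached_stream_probs(test_words, p_main, mix=0.1):
    """Yield a probability for each test word, mixing a growing cache unigram model into p_main."""
    cache = Counter()
    for i, w in enumerate(test_words):
        p_cache = cache[w] / i if i > 0 else 0.0      # unigram model over the words seen so far
        yield (1 - mix) * p_main(w) + mix * p_cache
        cache[w] += 1                                 # update the cache after predicting

probs = list(cached_stream_probs("x y x x z".split(), p_main=lambda w: 1 / 1000))
print(probs)   # the repeated word "x" gets an increasing boost from the cache
```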
Language Models
N-gram models: a finite approximation of an infinite context history.
Issues: zeroes and other sparseness.
Strategies: smoothing
- Add-one, add-δ, Good-Turing, etc.
- Use partial n-grams: interpolation, backoff
Refinements: class, cache, topic, and trigger LMs