Language Modeling
Roadmap (for the next two classes)
- Review of LMs: what are they? How (and where) are they used? How are they trained?
- Evaluation metrics: entropy, perplexity
- Smoothing: Good-Turing, backoff and interpolation, absolute discounting, Kneser-Ney
What is a language model?
A model that assigns a probability to a sequence of words, rooted in Claude Shannon's information-theoretic view of communication as transmitted signals carrying information. Language modeling has many ties to cryptography and information theory. In practice we most often use n-gram models.
Applications
What word sequence (English) does this phoneme sequence correspond to: AY D L AY K T UW R EH K AH N AY S B IY CH
Goal of the LM: P("I'd like to recognize speech") > P("I'd like to wreck a nice beach")

Phoneme | Example | Transcription
AH | hut | HH AH T
AY | hide | HH AY D
B | be | B IY
CH | cheese | CH IY Z
D | dee | D IY
EH | Ed | EH D
IY | eat | IY T
K | key | K IY
L | lee | L IY
N | knee | N IY
P | pee | P IY
S | sea | S IY
T | tea | T IY
UW | two | T UW
Z | zee | Z IY
Why n-gram LMs?

Language Model Evaluation Metrics
Entropy and perplexity
- Entropy measures the information content in a distribution, i.e. the uncertainty.
- If I can predict the next word before it arrives, there is no information content; zero uncertainty means the signal carries zero information.
- Entropy asks: how many bits of additional information do I need, on average, to guess the next symbol?
- Perplexity is the average branching factor: if the message carries zero information, the branching factor is 1; if each symbol needs one bit, the branching factor is 2; if each symbol needs two bits, the branching factor is 4.
- Entropy and perplexity measure the same thing (uncertainty / information content) on different scales.
Information in a fair coin flip

event | probability | entropy contribution (bits)
heads | 0.5 | 0.5
tails | 0.5 | 0.5
total | 1.0 | entropy = 1 bit, perplexity = 2
Information in a single fair die

event | probability | entropy contribution (bits)
1 | 1/6 | 0.43
2 | 1/6 | 0.43
3 | 1/6 | 0.43
4 | 1/6 | 0.43
5 | 1/6 | 0.43
6 | 1/6 | 0.43
total | 1.0 | entropy ≈ 2.58 bits, perplexity = 6

Information in the sum of two dice

event | probability | entropy contribution (bits)
2 | 1/36 | 0.14
3 | 1/18 | 0.23
4 | 1/12 | 0.30
5 | 1/9 | 0.35
6 | 5/36 | 0.40
7 | 1/6 | 0.43
8 | 5/36 | 0.40
9 | 1/9 | 0.35
10 | 1/12 | 0.30
11 | 1/18 | 0.23
12 | 1/36 | 0.14
total | 1.0 | entropy ≈ 3.27 bits, perplexity ≈ 9.68
Entropy of a distribution
H(p) = - Σ_x p(x) log2 p(x)

Computing Entropy
Equivalently, H(p) = Σ_x p(x) log2(1 / p(x)), where log2(1 / p(x)) is the ideal code length for symbol x.
Entropy example

Perplexity
PP(p) = 2^H(p), so a distribution with entropy H corresponds to an average branching factor of 2^H.
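As an illustration (not part of the original slides), here is a minimal Python sketch that computes entropy and perplexity for the coin, die, and two-dice distributions above; the printed values match the tables.

```python
import math
from collections import Counter
from itertools import product

def entropy(dist):
    """Entropy in bits of a distribution given as {event: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def perplexity(dist):
    """Perplexity is 2 raised to the entropy (the average branching factor)."""
    return 2 ** entropy(dist)

coin = {"heads": 0.5, "tails": 0.5}
die = {face: 1 / 6 for face in range(1, 7)}

# Distribution over the sum of two fair dice.
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))
two_dice = {s: c / 36 for s, c in sums.items()}

for name, dist in [("coin", coin), ("die", die), ("two dice", two_dice)]:
    print(f"{name}: H = {entropy(dist):.2f} bits, perplexity = {perplexity(dist):.2f}")
# coin: H = 1.00 bits, perplexity = 2.00
# die: H = 2.58 bits, perplexity = 6.00
# two dice: H = 3.27 bits, perplexity = 9.68
```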
BIG SWITCH: Cross entropy and language model perplexity
The Train/Test Split and Entropy

Cross entropy
Cross entropy, formally: H(p, q) = - Σ_x p(x) log2 q(x). For a language model q evaluated on a test corpus of N words, the per-word cross entropy is approximated by -(1/N) Σ_i log2 q(w_i | w_1 ... w_{i-1}).
Language model perplexity
Recipe:
- Train a language model on training data.
- Get negative log probabilities of the test data and compute the average per word.
- Exponentiate!
Perplexity correlates rather well with speech recognition error rates and MT quality metrics. Perplexities for word-based LMs normally fall between roughly 50 and 1000, and you need to drop perplexity by a significant fraction (not an absolute amount) to make a visible impact.
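The recipe can be made concrete with a short sketch (mine, not the slides'): train a model on toy training text, then compute test-set perplexity by averaging negative log probabilities and exponentiating. For brevity this uses an add-one-smoothed unigram model; the corpora are placeholders.

```python
import math
from collections import Counter

train = "the cat sat on the mat the dog sat on the log".split()
test = "the cat sat on the log".split()

vocab = set(train)
counts = Counter(train)
V, N = len(vocab), len(train)

def unigram_prob(word):
    # Add-one (Laplace) smoothing so unseen test words get nonzero probability.
    return (counts[word] + 1) / (N + V)

# Average negative log2 probability over the test tokens, then exponentiate.
neg_logprobs = [-math.log2(unigram_prob(w)) for w in test]
cross_entropy = sum(neg_logprobs) / len(test)
perplexity = 2 ** cross_entropy
print(f"cross entropy = {cross_entropy:.2f} bits/word, perplexity = {perplexity:.2f}")
```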
Parameter estimation and smoothing

Tasks
Parameter estimation

Observations | Parameters (Heads, Tails)
TTTTTTTTTTTT | (0.00, 1.00)
HHTHTHTTTHTH | (0.50, 0.50)
HHHTHHHTHHHH | (0.75, 0.25)
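A minimal sketch (my addition) of maximum-likelihood estimation for the coin example: the estimate is just the relative frequency of heads and tails in an observation sequence.

```python
def mle_coin(observations):
    """Return (P(Heads), P(Tails)) estimated by relative frequency."""
    heads = observations.count("H")
    tails = observations.count("T")
    total = heads + tails
    return heads / total, tails / total

for obs in ["TTTTTTTTTTTT", "HHTHTHTTTHTH"]:
    p_h, p_t = mle_coin(obs)
    print(f"{obs}: P(Heads) = {p_h:.2f}, P(Tails) = {p_t:.2f}")
```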
Parameter estimation techniques

Maximum Likelihood has problems :/

Smoothing

Smoothing techniques

Mixtures / interpolation

Laplace as a mixture
BERP Corpus Bigrams: original bigram probabilities

BERP Smoothed Bigrams: smoothed bigram probabilities from the BERP corpus
Laplace Smoothing Example
Consider the case where |V| = 100K, C(bigram w1 w2) = 10, and C(trigram w1 w2 w3) = 9.
P_MLE(w3 | w1 w2) = 9/10 = 0.9
P_LAP(w3 | w1 w2) = (9 + 1) / (10 + 100K) ≈ 0.0001
Too much probability mass is 'shaved off' for the zeroes, giving too sharp a change in probabilities; this is problematic in practice.
Add-δ Smoothing
Problem: adding 1 moves too much probability mass.
Proposal: add a smaller fractional mass δ:
P_add-δ(w_i | w_{i-1}) = (C(w_{i-1} w_i) + δ) / (C(w_{i-1}) + δ|V|)
Issues: you need to pick δ, and it still performs badly.
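A small sketch of add-δ smoothing for a conditional probability (my own; with δ = 1 it reproduces the Laplace numbers above, and the counts and vocabulary size are the example's).

```python
def add_delta_prob(count_ngram, count_prefix, vocab_size, delta=1.0):
    """Add-delta estimate of P(w | prefix); delta=1 is Laplace (add-one) smoothing."""
    return (count_ngram + delta) / (count_prefix + delta * vocab_size)

V = 100_000
print(9 / 10)                                # MLE: 0.9
print(add_delta_prob(9, 10, V, delta=1.0))   # Laplace: ~0.0001
print(add_delta_prob(9, 10, V, delta=0.01))  # a smaller delta shaves off less mass
```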
Good-Turing Smoothing
New idea: use counts of things you have seen to estimate those you haven't.
Good-Turing approach: use the frequency of singletons to re-estimate the frequency of zero-count n-grams.
Notation: N_c is the frequency of frequency c, i.e. the number of n-grams which appear c times (N_0 = number of n-grams with count 0, N_1 = number of n-grams with count 1).
Estimate the probability of things which occur c times using the count of things which occur c+1 times:
c* = (c + 1) N_{c+1} / N_c
Good-Turing: Josh Goodman's intuition
Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass. You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.
How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18.
Assuming so, how likely is it that the next species is trout? It must be less than 1/18.
Slide adapted from Josh Goodman, Dan Jurafsky
GT Fish Example

Bigram Frequencies of Frequencies and GT Re-estimates

Good-Turing Smoothing: from n-gram counts to a conditional probability, use c* from the GT estimate in place of the raw count:
P_GT(w_i | w_{i-1}) = c*(w_{i-1} w_i) / C(w_{i-1})
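Here is a short sketch (mine) applying Good-Turing to the fish example: the probability of an unseen species is N_1/N, and seen counts are discounted via c* = (c+1) N_{c+1} / N_c.

```python
from collections import Counter

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())              # 18 fish caught
Nc = Counter(catch.values())         # frequency of frequencies: {10: 1, 3: 1, 2: 1, 1: 3}

# Probability that the next catch is a previously unseen species: N_1 / N = 3/18.
p_unseen = Nc[1] / N
print(f"P(new species) = {p_unseen:.3f}")

def gt_adjusted_count(c):
    """Good-Turing re-estimated count c*; falls back to c when N_{c+1} is 0."""
    if Nc[c + 1] == 0:
        return c
    return (c + 1) * Nc[c + 1] / Nc[c]

# Discounted probability for trout (seen once): c* = 2 * N_2 / N_1 = 2/3.
p_trout = gt_adjusted_count(1) / N
print(f"P(trout) = {p_trout:.3f}  (less than 1/18 = {1/18:.3f})")
```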
Backoff and Interpolation
Another really useful source of knowledge: if we are estimating the trigram p(z|x,y) but count(xyz) is zero, use information from the bigram p(z|y), or even the unigram p(z). How do we combine this trigram, bigram, and unigram information in a valid fashion?

Backoff vs. Interpolation
Backoff: use the trigram if you have it, otherwise the bigram, otherwise the unigram.
Interpolation: always mix all three.
Interpolation
Simple interpolation:
P_interp(z | x, y) = λ3 P(z | x, y) + λ2 P(z | y) + λ1 P(z), with λ1 + λ2 + λ3 = 1
Lambdas can also be conditioned on the context; the intuition is to put higher weight on more frequent n-grams.

How to Set the Lambdas?
Use a held-out (development) corpus. Fix the n-gram probabilities, then search for the lambda values that, when plugged into the equation above, give the largest probability for the held-out set. EM can be used to do this search.
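A compact sketch (not from the slides) of interpolation with a grid search for the lambda on held-out data; in practice EM is used, but the grid search shows the same idea of maximizing held-out likelihood. For brevity it mixes only bigram and unigram models, and the toy corpora are placeholders.

```python
import math
from collections import Counter

train = "the cat sat on the mat the dog sat on the log".split()
heldout = "the cat sat on the log".split()

unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
V, N = len(set(train)), len(train)

def p_uni(w):
    return (unigrams[w] + 1) / (N + V)      # add-one smoothed unigram

def p_bi(w, prev):
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def interp(w, prev, lam):
    return lam * p_bi(w, prev) + (1 - lam) * p_uni(w)

def heldout_loglik(lam):
    return sum(math.log2(interp(w, prev, lam))
               for prev, w in zip(heldout, heldout[1:]))

# Grid search: pick the lambda that maximizes held-out log-likelihood.
best = max((l / 10 for l in range(10)), key=heldout_loglik)
print(f"best lambda = {best}, held-out log-likelihood = {heldout_loglik(best):.2f}")
```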
Katz Backoff
Note: we use P* (discounted probabilities) and α weights on the backoff values. Why not just use regular MLE estimates? Because if we backed off to a lower-order n-gram with undiscounted estimates, the sum over all w_i in an n-gram context would exceed 1: too much probability mass.
Solution: use the P* discounts to save mass for the lower-order n-grams, and apply α weights so that the backed-off probabilities sum to exactly the amount saved. Details in 4.7.1.
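A hedged sketch of Katz-style backoff for bigrams: for simplicity it uses an absolute discount d in place of the Good-Turing discounted counts that Katz backoff actually prescribes, but it shows how P* reserves mass and α redistributes it so each context's probabilities sum to 1. The corpus and discount value are placeholders.

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the log".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)
d = 0.5  # absolute discount standing in for the Good-Turing discount

def p_unigram(w):
    return unigrams[w] / N

def p_star(w, prev):
    """Discounted bigram probability for seen bigrams."""
    return max(bigrams[(prev, w)] - d, 0) / unigrams[prev]

def alpha(prev):
    """Backoff weight: leftover discounted mass, renormalized over unseen continuations."""
    seen = [w for w in unigrams if bigrams[(prev, w)] > 0]
    leftover = 1.0 - sum(p_star(w, prev) for w in seen)
    unseen_unigram_mass = 1.0 - sum(p_unigram(w) for w in seen)
    return leftover / unseen_unigram_mass if unseen_unigram_mass > 0 else 0.0

def p_katz(w, prev):
    if bigrams[(prev, w)] > 0:
        return p_star(w, prev)
    return alpha(prev) * p_unigram(w)

# Probabilities over the whole vocabulary sum to 1 for any context.
print(sum(p_katz(w, "the") for w in unigrams))   # ~1.0
print(p_katz("dog", "sat"))                      # backed-off estimate for an unseen bigram
```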
Toolkits
Two major language modeling toolkits: SRILM and the CMU-Cambridge toolkit. Both are publicly available with similar functionality:
- Training: create a language model from a text file.
- Decoding: compute the perplexity/probability of a text.
OOV words: the <UNK> word
Out Of Vocabulary = OOV words. We don't use GT smoothing for these, because GT assumes we know the number of unseen events. Instead: create an unknown-word token <UNK>.
Training of <UNK> probabilities:
- Create a fixed lexicon L of size V.
- At the text normalization phase, change any training word not in L to <UNK>.
- Now we train its probabilities like a normal word.
At decoding time: if the input text contains a word not seen in training, use the <UNK> probabilities for it.
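A brief sketch (my own) of the <UNK> recipe: fix a lexicon of the V most frequent training word types, replace everything else with <UNK> before counting, and map unseen test words to <UNK> at decoding time. The toy data and lexicon size are placeholders.

```python
from collections import Counter

train = "the cat sat on the mat the dog sat on the log".split()
V = 5  # keep the 5 most frequent word types; everything else becomes <UNK>

lexicon = {w for w, _ in Counter(train).most_common(V)}

def normalize(tokens):
    return [w if w in lexicon else "<UNK>" for w in tokens]

norm_train = normalize(train)
counts = Counter(norm_train)   # <UNK> is now trained like a normal word

test = "the bird sat on the mat".split()
print(normalize(test))         # 'bird' (and any other OOV word) maps to <UNK>
```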
Google N-Gram Release

serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234

Google Caveat
Remember the lesson about test sets and training sets: test sets should be similar to the training set (drawn from the same distribution) for the probabilities to be meaningful. So the Google corpus is fine if your application deals with arbitrary English text on the Web; if not, a smaller domain-specific corpus is likely to yield better results.
Class-Based Language Models
A variant of n-gram models using classes or clusters.
Motivation: sparseness. In a flight application, instead of estimating P(ORD|to), P(JFK|to), ..., estimate P(airport_name|to): relate the probability of an n-gram to word classes and a class n-gram.
IBM clustering: assume each word belongs to a single class, and
P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) × P(w_i | c_i)
learned by MLE from data.
Where do the classes come from? Either hand-designed for the application (e.g. ATIS), or automatically induced clusters from a corpus.
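A minimal sketch (my addition) of the IBM-style class bigram factorization, with a hand-designed word-to-class map standing in for induced clusters; the corpus, counts, and classes are toy placeholders.

```python
from collections import Counter

corpus = "fly to ORD fly to JFK drive to ORD".split()
word2class = {"fly": "VERB", "drive": "VERB", "to": "PREP", "ORD": "AIRPORT", "JFK": "AIRPORT"}

classes = [word2class[w] for w in corpus]
class_uni = Counter(classes)
class_bi = Counter(zip(classes, classes[1:]))
word_counts = Counter(corpus)

def p_class_bigram(w, prev):
    """P(w | prev) ~ P(class(w) | class(prev)) * P(w | class(w))."""
    c, c_prev = word2class[w], word2class[prev]
    p_cc = class_bi[(c_prev, c)] / class_uni[c_prev]
    p_wc = word_counts[w] / class_uni[c]
    return p_cc * p_wc

print(p_class_bigram("JFK", "to"))   # class sharing gives both airports comparable estimates
print(p_class_bigram("ORD", "to"))
```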
LM Adaptation
Challenge: we need an LM for a new domain but have little in-domain data.
Intuition: much of language is pretty general, so we can build from a 'general' LM plus in-domain data.
Approach: LM adaptation. Train on a large domain-independent corpus, then adapt with a small in-domain data set. What large corpus? Web counts, e.g. Google n-grams.
Incorporating Longer Distance Context
Why use longer context? N-grams are an approximation, constrained by model size and sparseness.
What sorts of information live in longer context? Priming, topic, sentence type, dialogue act, syntax.
Long Distance LMs
- Bigger n: with 284M words, n-grams up to 6 improve results; 7-20 are no better.
- Cache n-grams. Intuition (priming): a word used previously is more likely to be used again. Incrementally build a 'cache' unigram model on the test corpus and mix it with the main n-gram LM.
- Topic models. Intuition: text is about some topic, and on-topic words are more likely: P(w|h) ≈ Σ_t P(w|t) P(t|h).
- Non-consecutive n-grams: skip n-grams, triggers, variable-length n-grams.
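As an illustration of the cache idea (mine, not the slides'), a sketch that mixes an incrementally updated cache unigram model with a fixed base unigram model; the mixing weight and toy data are assumptions.

```python
from collections import Counter

base_counts = Counter("the cat sat on the mat".split())
base_total = sum(base_counts.values())
V = len(base_counts)

def p_base(w):
    return (base_counts[w] + 1) / (base_total + V)   # add-one smoothed base unigram

def cached_stream_probs(test_tokens, lam=0.3):
    """Yield P(w) mixing the base model with a cache built from the words seen so far."""
    cache = Counter()
    for w in test_tokens:
        if cache:
            p_cache = cache[w] / sum(cache.values())
            p = lam * p_cache + (1 - lam) * p_base(w)
        else:
            p = p_base(w)
        yield w, p
        cache[w] += 1   # update the cache after scoring the word (priming effect)

for w, p in cached_stream_probs("the mat the mat".split()):
    print(f"P({w}) = {p:.3f}")   # repeated words get a boost from the cache
```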
Language Models
- N-gram models: a finite approximation of an infinite context history.
- Issues: zeroes and other sparseness.
- Strategies: smoothing (add-one, add-δ, Good-Turing, etc.); using partial n-grams via interpolation and backoff.
- Refinements: class, cache, topic, and trigger LMs.
Kneser-Ney Smoothing
The most commonly used modern smoothing technique. Intuition: improving backoff.
'I can't see without my reading ...' Compare P(Francisco | reading) vs. P(glasses | reading). If both bigrams are unseen, P(Francisco | reading) backs off to P(Francisco), and the high unigram frequency of Francisco can make that backed-off estimate larger than P(glasses | reading), even though the latter should clearly be greater than zero and larger. However, Francisco appears in few contexts, while glasses appears in many.
Fix: interpolate based on the number of contexts a word appears in; words seen in more contexts are more likely to appear in new ones.
Kneser-Ney Smoothing: modeling the diversity of contexts
Continuation probability:
P_continuation(w) = |{w' : C(w' w) > 0}| / |{(w', w'') : C(w' w'') > 0}|
(the number of distinct contexts that w completes, normalized by the total number of distinct bigram types).
Backoff and interpolation then use this continuation probability in place of the raw unigram; the interpolated form is
P_KN(w_i | w_{i-1}) = max(C(w_{i-1} w_i) - d, 0) / C(w_{i-1}) + λ(w_{i-1}) P_continuation(w_i)
where d is an absolute discount and λ(w_{i-1}) normalizes the reserved mass.
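A compact sketch (my own) of interpolated Kneser-Ney for bigrams, following the equations above; the discount d and the toy corpus are assumptions.

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the log".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
d = 0.75  # absolute discount, a common default

# Continuation counts: in how many distinct contexts does each word appear?
continuation = Counter(w for (_, w) in bigrams)
num_bigram_types = len(bigrams)

def p_continuation(w):
    return continuation[w] / num_bigram_types

def p_kn(w, prev):
    """Interpolated Kneser-Ney bigram probability."""
    discounted = max(bigrams[(prev, w)] - d, 0) / unigrams[prev]
    # lambda(prev): discount mass reserved for prev, spread over continuation probabilities.
    distinct_continuations = len([b for b in bigrams if b[0] == prev])
    lam = (d / unigrams[prev]) * distinct_continuations
    return discounted + lam * p_continuation(w)

print(p_kn("mat", "the"))   # seen bigram
print(p_kn("dog", "sat"))   # unseen bigram: falls back on the continuation probability
```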
Issues
- Relative frequency: we typically compute the count of a sequence and divide by the count of its prefix.
- Corpus sensitivity: Shakespeare vs. the Wall Street Journal yield very different models, and generated text can look very unnatural.
- N-grams: unigrams capture little, bigrams capture collocations, trigrams approximate phrases.
Additional Issues in Good-Turing
General approach: the estimate of c* for N_c depends on N_{c+1}. What if N_{c+1} = 0? More zero-count problems, and not uncommon: e.g., in the fish example there are no species with count 4.

Modifications: Simple Good-Turing
- Compute the N_c bins, then smooth the N_c values to replace zeroes: fit a linear regression in log space, log(N_c) = a + b log(c).
- What about large c's? They should already be reliable, so assume c* = c if c is large, e.g. c > k (Katz: k = 5).
- Typically combined with other interpolation/backoff.
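A short sketch (mine) of the Simple Good-Turing idea: fit log(N_c) = a + b log(c) by least squares and use the fitted, smoothed N_c values in the c* formula, so empty frequency-of-frequency bins no longer cause zeroes. The example counts are placeholders.

```python
import math

# Observed frequency-of-frequency counts N_c (note the gap at c = 4).
Nc = {1: 120, 2: 40, 3: 24, 5: 10, 6: 8}

# Least-squares fit of log(N_c) = a + b * log(c) over the observed bins.
xs = [math.log(c) for c in Nc]
ys = [math.log(n) for n in Nc.values()]
n = len(xs)
b = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / \
    (n * sum(x * x for x in xs) - sum(xs) ** 2)
a = (sum(ys) - b * sum(xs)) / n

def smoothed_Nc(c):
    return math.exp(a + b * math.log(c))

def c_star(c):
    """Good-Turing re-estimate using the smoothed N_c values."""
    return (c + 1) * smoothed_Nc(c + 1) / smoothed_Nc(c)

for c in range(1, 6):
    print(f"c = {c}: c* = {c_star(c):.2f}")   # works even though N_4 was never observed
```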