Part 5: Language Model
CSE717, Spring 2008, CUBS, University at Buffalo
Examples of Good & Bad Language Models
Excerpt from Herman, a comic strip by Jim Unger (comic panels not reproduced in this transcript)
What’s a Language Model?
A language model is a probability distribution over word sequences, e.g.:
P(“And nothing but the truth”) ≈ 0.001
P(“And nuts sing on the roof”) ≈ 0
What’s a language model for?
Speech recognition, handwriting recognition, spelling correction, optical character recognition, machine translation (and anyone else doing statistical modeling over word sequences).
The Equation
The observation can be image features (handwriting recognition), acoustics (speech recognition), a word sequence in another language (MT), etc.
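The equation itself appears on the slide only as an image and does not survive in the extracted text; the following is a sketch of the standard noisy-channel decision rule such slides usually present, with O the observation, W the word sequence, and P(W) the language model.

```latex
% Reconstructed sketch of the standard decision rule (not from the extracted
% text): pick the word sequence W that best explains the observation O.
\begin{align*}
\hat{W} \;=\; \arg\max_{W} P(W \mid O)
        \;=\; \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)}
        \;=\; \arg\max_{W} P(O \mid W)\,P(W)
\end{align*}
```

The denominator P(O) does not depend on W, which is why it can be dropped from the maximization.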
How Language Models Work
P(“And nothing but the truth”) is hard to compute directly, so decompose the probability with the chain rule:
P(“and nothing but the truth”) = P(“and”) P(“nothing” | “and”) P(“but” | “and nothing”) P(“the” | “and nothing but”) P(“truth” | “and nothing but the”)
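A minimal Python sketch of this decomposition; cond_prob is a hypothetical stand-in (not from the slides) for any estimator of P(word | history).

```python
# Chain rule: P(w1..wn) = prod_i P(w_i | w_1..w_{i-1}).
def sentence_prob(words, cond_prob):
    prob = 1.0
    for i, word in enumerate(words):
        history = tuple(words[:i])        # everything seen so far
        prob *= cond_prob(word, history)
    return prob

# P("and nothing but the truth") = P(and) * P(nothing|and) * P(but|and nothing)
#   * P(the|and nothing but) * P(truth|and nothing but the)
sentence = "and nothing but the truth".split()
```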
The Trigram Approximation
Assume each word depends only on the previous two words:
P(“the” | “and nothing but”) ≈ P(“the” | “nothing but”)
P(“truth” | “and nothing but the”) ≈ P(“truth” | “but the”)
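In code, the approximation is just a truncation of the history to its last two words; trigram_prob is again a hypothetical estimator, not part of the original slides.

```python
# Trigram (second-order Markov) assumption: condition on at most the
# two previous words instead of the full history.
def trigram_sentence_prob(words, trigram_prob):
    prob = 1.0
    for i, word in enumerate(words):
        history = tuple(words[max(0, i - 2):i])   # last two words only
        prob *= trigram_prob(word, history)
    return prob
```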
How to Find the Probabilities?
Count from real text:
Pr(“the” | “nothing but”) ≈ c(“nothing but the”) / c(“nothing but”)
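A small sketch of this counting estimate, using the example sentence from the perplexity slide below as a toy corpus (a real model would of course be trained on far more text).

```python
from collections import Counter

# Maximum-likelihood estimate from counts:
# Pr("the" | "nothing but") ~ c("nothing but the") / c("nothing but")
corpus = "the whole truth and nothing but the truth".split()   # toy corpus

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_trigram_prob(z, x, y):
    """Estimate Pr(z | x y) by relative frequency."""
    if bigram_counts[(x, y)] == 0:
        return 0.0                        # history never seen
    return trigram_counts[(x, y, z)] / bigram_counts[(x, y)]

print(mle_trigram_prob("the", "nothing", "but"))   # 1.0 in this tiny corpus
```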
Evaluation
How can you tell a good language model from a bad one? Run a speech recognizer (or your application of choice) and calculate the word error rate. Drawbacks: it is slow and specific to your recognizer.
Perplexity: An Example
Data: “the whole truth and nothing but the truth”
Lexicon: L = {the, whole, truth, and, nothing, but}
Model 1: unigram, Pr(L_1) = … = Pr(L_6) = 1/6
Model 2: unigram, Pr(“the”) = Pr(“truth”) = 1/4, Pr(“whole”) = Pr(“and”) = Pr(“nothing”) = Pr(“but”) = 1/8
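A worked computation of the two models’ perplexities on this 8-word sample; the probability assignments come straight from the slide, and PP = P(data)^(-1/N) is the standard perplexity definition.

```python
import math

data = "the whole truth and nothing but the truth".split()   # N = 8 words

model1 = {w: 1 / 6 for w in ["the", "whole", "truth", "and", "nothing", "but"]}
model2 = {"the": 1 / 4, "truth": 1 / 4,
          "whole": 1 / 8, "and": 1 / 8, "nothing": 1 / 8, "but": 1 / 8}

def perplexity(model, words):
    # PP = exp( -(1/N) * sum_i log P(w_i) ) for a unigram model
    log_prob = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob / len(words))

print(perplexity(model1, data))   # 6.0
print(perplexity(model2, data))   # 2**2.5 ~ 5.66
```

Model 2 gives the lower perplexity because it matches the empirical word frequencies of the sample (“the” and “truth” each occur 2/8 of the time, the rest 1/8).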
Perplexity: Is Lower Better?
Remarkable fact: the “true” model for the data has the lowest possible perplexity, so the lower the perplexity, the closer we are to the true model. Perplexity correlates well with the error rate of the recognition task; it correlates better when both models are trained on the same data, and poorly when the training data changes.
Smoothing
The count-based estimates are terrible on test data: if there are no occurrences of xyz, i.e. C(xyz) = 0, the estimated probability is 0, and a single zero such as P(“sing” | “nuts”) = 0 leads to infinite perplexity!
Smoothing: Add-One
Add-one smoothing and add-delta smoothing; the formulas on the slide do not survive in this transcript (standard forms are reconstructed below). Simple add-one smoothing does not perform well: the probability of rarely seen events is over-estimated.
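Assuming a vocabulary of size V (a symbol not in the original text), the standard forms of the two smoothing rules are:

```latex
% Reconstructed standard add-one and add-delta smoothing of the trigram
% estimate; V is the vocabulary size (assumed notation).
\begin{align*}
P_{\text{add-one}}(z \mid x\,y)    &= \frac{C(xyz) + 1}{C(xy) + V} \\[4pt]
P_{\text{add-}\delta}(z \mid x\,y) &= \frac{C(xyz) + \delta}{C(xy) + \delta V}
\end{align*}
```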
Smoothing: Simple Interpolation
Interpolate the trigram, bigram, and unigram estimates for the best combination. Almost good enough.
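A sketch of what interpolating the three estimates means; the weights λ_i (notation assumed, not from the slide) sum to one and are typically tuned on held-out data.

```latex
% Linear interpolation of trigram, bigram, and unigram estimates
% (sketch; the lambda_i are tuning weights, assumed notation).
\[
P_{\text{interp}}(z \mid x\,y)
  = \lambda_3\, P(z \mid x\,y) + \lambda_2\, P(z \mid y) + \lambda_1\, P(z),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
\]
```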
Smoothing: Redistribution of Probability Mass (Backing Off) [Katz87]
Discount the observed n-gram counts, then redistribute the discounted probability mass over the (n-1)-gram model.
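A hedged reconstruction of the back-off scheme the slide diagrams: a discounted estimate P* for seen trigrams, with the freed mass redistributed over the bigram model via a normalizer α(xy) (the diagram and equations themselves are not in the extracted text).

```latex
% Katz-style back-off (reconstructed sketch): use the discounted estimate
% P* when the trigram was observed, otherwise back off to the bigram model
% with weight alpha(xy) chosen so the probabilities sum to one.
\[
P_{\text{katz}}(z \mid x\,y) =
\begin{cases}
  P^{*}(z \mid x\,y) & \text{if } C(xyz) > 0,\\[2pt]
  \alpha(x\,y)\, P_{\text{katz}}(z \mid y) & \text{otherwise.}
\end{cases}
\]
```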
Linear Discount
The discount factor can be determined by the relative frequency of singletons, i.e., events observed exactly once in the data [Ney95].
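A sketch of the linear discount with the singleton-based factor described here; n_1 (the number of events observed exactly once) and N (the total number of observed events) are assumed notation, not from the slide.

```latex
% Linear discounting (sketch): shrink every observed relative frequency by
% a constant factor lambda, estimated from the relative frequency of
% singletons as suggested by [Ney95].
\[
P_{\text{lin}}(z \mid x\,y) = (1 - \lambda)\,\frac{C(xyz)}{C(xy)},
\qquad \lambda \approx \frac{n_1}{N}
\]
```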
More General Formulation
Drawback of the linear discount: the counts of frequently observed events are modified the most, which goes against the “law of large numbers.”
Generalization: make the discount a function of y, determined by cross-validation. This requires more data, and the computation is expensive.
Absolute Discounting
The discount is an absolute value subtracted from each count. It works pretty well and is easier than linear discounting.
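A sketch of absolute discounting in the same notation: a fixed amount D (typically between 0 and 1) is subtracted from every nonzero count, and the freed mass again goes to the lower-order model via a normalizer α(xy); the exact formula is reconstructed, not taken from the slide.

```latex
% Absolute discounting (sketch): subtract a constant D from each nonzero
% count instead of scaling by a factor; the leftover mass is redistributed
% over the lower-order (bigram) model.
\[
P_{\text{abs}}(z \mid x\,y)
  = \frac{\max\!\bigl(C(xyz) - D,\; 0\bigr)}{C(xy)}
  + \alpha(x\,y)\, P(z \mid y)
\]
```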
References
[1] Katz, S., “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE Trans. on Acoustics, Speech, and Signal Processing 35(3):400-401, 1987.
[2] Ney, H., Essen, U., Kneser, R., “On the estimation of ‘small’ probabilities by leaving-one-out,” IEEE Trans. on PAMI 17(12):1202-1212, 1995.
[3] Goodman, J., “A tutorial on language modeling: The State of the Art in Language Modeling,” research.microsoft.com/~joshuago/lm-tutorial-public.ppt