Introduction to Language Models (語言模型簡介). Presented by Wen-Hung Tsai, Speech Lab, CSIE, NTNU, 2005/07/13
What is Language Modeling?
Language modeling (LM) is the art of determining the probability of word sequences. Given a word sequence W = w1 w2 ... wn, the probability can be decomposed into a product of conditional probabilities:
P(W) = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wn | w1 ... wn-1)
n-gram Language Modeling
The conditional probability P(wi | w1 ... wi-1) has far too many parameters to estimate directly: on the order of |V|^i, where V denotes the vocabulary. The n-gram assumption: the probability of word wi depends only on the previous n-1 words. For a trigram model:
P(wi | w1 ... wi-1) ≈ P(wi | wi-2 wi-1)
n-gram Language Modeling
Maximum likelihood estimate:
P(wi | wi-2 wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)
where C(wi-2 wi-1 wi) represents the number of occurrences of wi-2 wi-1 wi in the training corpus, and similarly for C(wi-2 wi-1). There are many three-word sequences that never occur in the training corpus; consider the sequence "party on Tuesday": what is P(Tuesday | party on)? This is the data sparseness problem.
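As a concrete illustration, here is a minimal Python sketch of the maximum likelihood trigram estimate. The toy corpus and the function name p_mle are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch: maximum likelihood trigram estimation on a toy corpus.
from collections import Counter

corpus = "we will have a party on tuesday and a party on friday".split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(w1, w2, w3):
    """P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2), or 0 if the history is unseen."""
    history = bigram_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / history if history else 0.0

print(p_mle("party", "on", "tuesday"))  # 0.5: "party on" occurs twice, once followed by "tuesday"
print(p_mle("party", "on", "monday"))   # 0.0: the data sparseness problem in action
```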
Smoothing
The training corpus might not contain any instance of the phrase, so C(party on Tuesday) would be 0, while there might still be 20 instances of the phrase "party on"; the maximum likelihood estimate then gives P(Tuesday | party on) = 0. Smoothing techniques take some probability away from observed occurrences and redistribute it to unseen events. Imagine we have "party on Stan Chen's birthday" in the training data, and it occurs only one time.
Smoothing
By taking some probability away from some words, such as "Stan", and redistributing it to other words, such as "Tuesday", zero probabilities can be avoided. Common techniques: Katz smoothing, Jelinek-Mercer smoothing (deleted interpolation), and Kneser-Ney smoothing.
Smoothing: simple models
Add-one smoothing: pretend each trigram occurs once more than it actually does:
P(wi | wi-2 wi-1) = (C(wi-2 wi-1 wi) + 1) / (C(wi-2 wi-1) + |V|)
Add-delta smoothing: add a fractional count δ instead of one:
P(wi | wi-2 wi-1) = (C(wi-2 wi-1 wi) + δ) / (C(wi-2 wi-1) + δ|V|)
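A small sketch of add-one / add-delta smoothing for trigrams, under the same assumptions as the earlier sketch (toy corpus, illustrative names); delta = 1 gives add-one smoothing.

```python
# Add-one / add-delta smoothing: every trigram gets an extra pseudo-count of delta.
from collections import Counter

corpus = "we will have a party on tuesday and a party on friday".split()
vocab = set(corpus)

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_add_delta(w1, w2, w3, delta=1.0):
    """(C(w1 w2 w3) + delta) / (C(w1 w2) + delta * |V|)."""
    return ((trigram_counts[(w1, w2, w3)] + delta) /
            (bigram_counts[(w1, w2)] + delta * len(vocab)))

print(p_add_delta("party", "on", "tuesday"))       # pulled down from the MLE value of 0.5
print(p_add_delta("party", "on", "monday"))        # small but no longer zero
print(p_add_delta("party", "on", "monday", 0.01))  # a smaller delta moves less probability mass
```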
Simple interpolation: combine the trigram, bigram, and unigram estimates,
P_interp(wi | wi-2 wi-1) = λ P(wi | wi-2 wi-1) + μ P(wi | wi-1) + (1 - λ - μ) P(wi), where 0 ≤ λ, μ ≤ 1
In practice, the uniform distribution 1/|V| is also interpolated; this ensures that no word is assigned probability 0.
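A hedged sketch of this interpolation, including the uniform term; the corpus and the fixed weights are illustrative assumptions (in practice the weights are tuned on held-out data).

```python
# Linear interpolation of trigram, bigram, unigram, and uniform estimates.
from collections import Counter

corpus = "we will have a party on tuesday and a party on friday".split()
vocab = set(corpus)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def mle(num, den):
    return num / den if den else 0.0

def p_interp(w1, w2, w3, weights=(0.5, 0.3, 0.15, 0.05)):
    """l1*P(w3|w1 w2) + l2*P(w3|w2) + l3*P(w3) + l4/|V|, weights summing to 1."""
    l1, l2, l3, l4 = weights
    return (l1 * mle(trigrams[(w1, w2, w3)], bigrams[(w1, w2)]) +
            l2 * mle(bigrams[(w2, w3)], unigrams[w2]) +
            l3 * unigrams[w3] / len(corpus) +
            l4 / len(vocab))  # the uniform term keeps every vocabulary word above zero

print(p_interp("party", "on", "tuesday"))
print(p_interp("party", "on", "we"))  # an unseen trigram still gets nonzero probability
```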
Katz smoothing
Katz smoothing is based on the Good-Turing formula. Let n_r represent the number of n-grams that occur exactly r times in the training corpus. The discounted count is
r* = (r + 1) n_{r+1} / n_r
and the probability estimate for an n-gram with r counts is r* / N, where N is the size of the training data. The size of the training data remains the same: the discounted counts, together with the mass set aside for unseen n-grams, still sum to N.
Katz smoothing
Summing the discounted counts over all seen n-grams gives
Σ_{r≥1} n_r r* = Σ_{r≥1} (r + 1) n_{r+1} = N - n_1
(the bucket with the largest count contributes nothing, since its (r + 1) n_{r+1} = 0). Let N represent the total size of the training set; the left-over probability is therefore 1 - (N - n_1)/N = n_1/N.
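A sketch of this Good-Turing bookkeeping on a made-up bigram corpus: the count-of-counts n_r, the discounted counts r*, and the left-over mass n_1/N. In real Katz smoothing the n_r are themselves smoothed and only small counts are discounted; this sketch skips that.

```python
# Good-Turing quantities: count-of-counts, discounted counts, left-over mass.
from collections import Counter

corpus = "a b a a c b b a d a b c a a b b c a b c".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
N = sum(bigram_counts.values())                    # total number of bigram tokens

count_of_counts = Counter(bigram_counts.values())  # n_r: how many bigrams occur exactly r times

def discounted_count(r):
    """r* = (r + 1) * n_{r+1} / n_r (0 when n_{r+1} is 0, as for the largest count)."""
    if count_of_counts[r] == 0:
        return 0.0
    return (r + 1) * count_of_counts[r + 1] / count_of_counts[r]

# The left-over probability mass for unseen bigrams comes out to n_1 / N.
leftover = 1.0 - sum(n_r * discounted_count(r) for r, n_r in count_of_counts.items()) / N
print(leftover, count_of_counts[1] / N)            # the two numbers agree
```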
Katz backoff smoothing
Consider a bigram model and a probability such as Pkatz(Francisco | on). Since the phrase San Francisco is fairly common, the unigram probability of Francisco will also be fairly high. This means that under Katz smoothing, which backs off to the unigram, Pkatz(Francisco | on) will also be fairly high. But the word Francisco occurs in exceedingly few contexts, and its probability of occurring in a new context is very low.
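A skeletal sketch of the backoff idea: use the discounted bigram estimate when the bigram was seen, otherwise back off to the unigram, scaled by a factor alpha so that the conditional distribution sums to 1. A crude fixed discount stands in for the Good-Turing discounting of real Katz smoothing, and the corpus and names are illustrative assumptions.

```python
# Backoff with a fixed discount, standing in for full Katz smoothing.
from collections import Counter

corpus = "we live on tuesday street in san francisco near san francisco bay".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)
D = 0.5  # stand-in discount

def p_unigram(w):
    return unigrams[w] / N

def p_backoff(w1, w2):
    if bigrams[(w1, w2)] > 0:
        return (bigrams[(w1, w2)] - D) / unigrams[w1]
    # mass freed by discounting the bigrams seen after w1 ...
    seen = [w for (u, w) in bigrams if u == w1]
    freed = D * len(seen) / unigrams[w1]
    # ... spread over unseen continuations in proportion to their unigram probability
    alpha = freed / (1.0 - sum(p_unigram(w) for w in seen))
    return alpha * p_unigram(w2)

print(p_backoff("on", "tuesday"))    # seen bigram: discounted relative frequency
print(p_backoff("on", "francisco"))  # unseen bigram: backed-off, inflated by the unigram count of "francisco"
```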
Kneser-Ney smoothing KN smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than the number of occurrences of the word. Thus, the probability PKN(Francisco | on) would be fairly low, while for a word like Tuesday that occurs in many contexts, PKN(Tuesday | on) would be relatively high, even if the phrase on Tuesday did not occur in the training data
Kneser-Ney smoothing
Backoff Kneser-Ney smoothing:
P_KN(wi | wi-1) = max(C(wi-1 wi) - D, 0) / C(wi-1), if C(wi-1 wi) > 0
P_KN(wi | wi-1) = α(wi-1) · |{v | C(v wi) > 0}| / Σ_w |{v | C(v w) > 0}|, otherwise
where |{v | C(v wi) > 0}| is the number of distinct words v that wi can occur after (the number of contexts wi occurs in), D is the discount, and α(wi-1) is a normalization constant such that the probabilities sum to 1.
Kneser-Ney smoothing: a toy example. Vocabulary V = {a, b, c, d}; training corpus:
b b a a c c d a a b b b b c c a a b c
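The continuation counts for this toy corpus can be computed with a few lines of Python; the code is mine, only the corpus comes from the slide.

```python
# Kneser-Ney continuation counts: how many distinct words each word follows.
from collections import Counter, defaultdict

corpus = "b b a a c c d a a b b b b c c a a b c".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))

contexts = defaultdict(set)          # word -> set of distinct left neighbours
for v, w in bigram_counts:
    contexts[w].add(v)

continuation = {w: len(vs) for w, vs in contexts.items()}
total = sum(continuation.values())   # = number of distinct bigram types

for w in "abcd":
    # P_continuation(w): the fraction of bigram types that end in w
    print(w, continuation.get(w, 0), continuation.get(w, 0) / total)
```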
Kneser-Ney smoothing
Interpolated models always combine both the higher-order and the lower-order distribution. Interpolated Kneser-Ney smoothing:
P_IKN(wi | wi-1) = max(C(wi-1 wi) - D, 0) / C(wi-1) + λ(wi-1) · |{v | C(v wi) > 0}| / Σ_w |{v | C(v w) > 0}|
where λ(wi-1) is a normalization constant such that the probabilities sum to 1.
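A hedged sketch of interpolated Kneser-Ney for bigrams on the same toy corpus, with an assumed discount D = 0.5; λ(w1) is set to exactly the mass freed by discounting, so each conditional distribution sums to 1.

```python
# Interpolated Kneser-Ney for bigrams on the toy corpus from the slide.
from collections import Counter, defaultdict

corpus = "b b a a c c d a a b b b b c c a a b c".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
history_counts = Counter(corpus[:-1])   # C(w1) counted as a bigram history
D = 0.5                                 # assumed discount

left_contexts = defaultdict(set)        # for the continuation distribution
followers = defaultdict(set)            # distinct words seen after each history
for v, w in bigram_counts:
    left_contexts[w].add(v)
    followers[v].add(w)

bigram_types = len(bigram_counts)

def p_continuation(w):
    return len(left_contexts[w]) / bigram_types

def p_kn(w1, w2):
    discounted = max(bigram_counts[(w1, w2)] - D, 0) / history_counts[w1]
    lam = D * len(followers[w1]) / history_counts[w1]   # mass freed by discounting
    return discounted + lam * p_continuation(w2)

print(p_kn("a", "b"), p_kn("a", "d"))     # "a d" never occurs, but gets continuation mass
print(sum(p_kn("a", w) for w in "abcd"))  # 1.0: the distribution is normalized
```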
Kneser-Ney smoothing
Instead of a single discount D, use multiple discounts: one for n-grams with one count, another for two counts, and another for three or more counts (a separate discount for every possible count would have too many parameters). This three-discount variant is Modified Kneser-Ney smoothing.
Jelinek-Mercer smoothing
Combines different n-gram orders by linearly interpolating all three models whenever computing a trigram probability:
P_JM(wi | wi-2 wi-1) = λ1 P(wi | wi-2 wi-1) + λ2 P(wi | wi-1) + λ3 P(wi), with λ1 + λ2 + λ3 = 1
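The interpolation weights are normally estimated on held-out data. Below is a rough sketch in which a simple grid search stands in for the EM ("deleted interpolation") procedure that is usually used; the training text, held-out text, and grid are illustrative assumptions.

```python
# Jelinek-Mercer interpolation with weights chosen on held-out data (grid search).
from collections import Counter
import itertools
import math

train = "we will have a party on tuesday and a party on friday".split()
heldout = "we will have a party on friday".split()

unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
trigrams = Counter(zip(train, train[1:], train[2:]))
V = len(set(train))

def mle(num, den):
    return num / den if den else 0.0

def p_jm(w1, w2, w3, l1, l2, l3):
    rest = 1.0 - l1 - l2 - l3          # remaining weight goes to the uniform distribution
    return (l1 * mle(trigrams[(w1, w2, w3)], bigrams[(w1, w2)]) +
            l2 * mle(bigrams[(w2, w3)], unigrams[w2]) +
            l3 * unigrams[w3] / len(train) +
            rest / V)

def heldout_logprob(l1, l2, l3):
    return sum(math.log(p_jm(w1, w2, w3, l1, l2, l3))
               for w1, w2, w3 in zip(heldout, heldout[1:], heldout[2:]))

grid = [0.1, 0.3, 0.5, 0.7]
best = max((c for c in itertools.product(grid, repeat=3) if sum(c) < 1.0),
           key=lambda c: heldout_logprob(*c))
print(best)   # the weight combination that maximizes held-out likelihood
```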
Absolute discounting
Absolute discounting subtracts a fixed discount D ≤ 1 from each nonzero count and redistributes the freed mass through the lower-order distribution:
P_abs(wi | wi-1) = max(C(wi-1 wi) - D, 0) / C(wi-1) + λ(wi-1) P(wi)
where λ(wi-1) is chosen so that the probabilities sum to 1.
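A short sketch of absolute discounting for bigrams on the earlier toy corpus; unlike Kneser-Ney it interpolates with the ordinary unigram distribution rather than the continuation distribution. The discount value is an assumption.

```python
# Absolute discounting for bigrams, backing off to the plain unigram distribution.
from collections import Counter, defaultdict

corpus = "b b a a c c d a a b b b b c c a a b c".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
history_counts = Counter(corpus[:-1])
unigram_counts = Counter(corpus)
followers = defaultdict(set)
for v, w in bigram_counts:
    followers[v].add(w)
D = 0.5

def p_abs(w1, w2):
    discounted = max(bigram_counts[(w1, w2)] - D, 0) / history_counts[w1]
    lam = D * len(followers[w1]) / history_counts[w1]   # mass freed by discounting
    return discounted + lam * unigram_counts[w2] / len(corpus)

print(p_abs("a", "b"), p_abs("a", "d"))   # compare with the Kneser-Ney values above
```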
Witten-Bell Discounting
Key concept (things seen once): use the count of things you've seen once to help estimate the count of things you've never seen. So we estimate the total probability mass of all the zero-count N-grams as the number of observed types divided by the number of tokens plus observed types:
T / (N + T)
where N is the number of tokens and T is the number of observed types.
Witten-Bell Discounting
T / (N + T) gives the total "probability of unseen N-grams"; we need to divide this up among all the zero-count N-grams. We could just choose to divide it equally:
p* = T / (Z (N + T)) for each unseen N-gram
where Z is the total number of N-grams with count zero.
Witten-Bell Discounting
Alternatively, we can represent the smoothed counts directly as:
c* = (T / Z) · N / (N + T) if c = 0, and c* = c · N / (N + T) if c > 0
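A tiny sketch of these formulas at the unigram level (the slides state them for N-grams in general): N tokens, T observed types, Z unseen types, with the unseen mass T/(N+T) shared equally. The vocabulary and corpus are made up for illustration.

```python
# Witten-Bell discounting for unigrams: unseen mass T/(N+T), shared over Z unseen types.
from collections import Counter

vocab = set("abcdefgh")              # assume an 8-word vocabulary
corpus = "a b a a c b b a d a b c a a b b c a b c".split()
counts = Counter(corpus)

N = len(corpus)                      # number of tokens
T = len(counts)                      # number of observed types
Z = len(vocab) - T                   # number of types with zero count

def smoothed_count(w):
    """c* = c * N/(N+T) for seen words, (T/Z) * N/(N+T) for unseen words."""
    c = counts[w]
    return (c if c > 0 else T / Z) * N / (N + T)

def p_wb(w):
    return smoothed_count(w) / N     # the smoothed counts still sum to N

print(p_wb("a"), p_wb("e"))          # a seen word vs. an unseen word
print(sum(p_wb(w) for w in vocab))   # 1.0
```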
Witten-Bell Discounting
For bigrams: T is the number of bigram types, and N is the number of bigram tokens.
Evaluation A LM that assigned equal probability to 100 words would have perplexity 100
Evaluation
In general, the perplexity of a LM is equal to the geometric average of the inverse probability of the words measured on test data:
PP = (Π_{i=1..N} 1 / P(wi | w1 ... wi-1))^(1/N)
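A minimal sketch of computing perplexity (and the corresponding entropy) on test data. Here the "model" is just a function returning P(wi | history); the uniform model over 100 words reproduces the perplexity-100 example above. Names and test data are illustrative.

```python
# Perplexity = geometric average of the inverse conditional word probabilities.
import math

def perplexity(model, test_words):
    log_prob = 0.0
    for i, w in enumerate(test_words):
        log_prob += math.log2(model(w, test_words[:i]))
    cross_entropy = -log_prob / len(test_words)   # average bits per word
    return 2 ** cross_entropy, cross_entropy

uniform_100 = lambda w, history: 1.0 / 100
pp, h = perplexity(uniform_100, ["any"] * 50)
print(pp, h)   # 100.0 and about 6.64 bits
```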
Evaluation
The "true" model for any data source will have the lowest possible perplexity. The lower the perplexity of our model, the closer it is, in some sense, to the true model. Entropy is simply log2 of perplexity: the average number of bits per word that would be necessary to encode the test data using an optimal coder.
Evaluation
Example: reducing entropy from 5 bits to 4 bits takes perplexity from 32 to 16, a 50% reduction; reducing entropy from 5 to 4.5 bits takes perplexity from 32 to about 22.6, roughly a 29% reduction.
entropy reduction (bits):  .01    .1    .16   .2    .3    .4    .5    .75   1
perplexity reduction:      0.69%  6.7%  10%   13%   19%   24%   29%   41%   50%
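The table follows from the definition: a reduction of x bits in entropy multiplies perplexity by 2^(-x), i.e. a perplexity reduction of 1 - 2^(-x). A quick check in Python:

```python
# Entropy reduction (bits) vs. the corresponding perplexity reduction.
for x in (0.01, 0.1, 0.16, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0):
    print(f"{x:<5} {100 * (1 - 2 ** -x):.1f}%")

# The worked example: entropy 5 -> 4 halves perplexity (32 -> 16, 50%),
# while entropy 5 -> 4.5 gives perplexity 2**4.5, about 22.6 (a ~29% reduction).
print(2 ** 4, 2 ** 4.5)
```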