
1 Introduction to Language Models (語言模型簡介)
Presented by Wen-Hung Tsai, Speech Lab, CSIE, NTNU, 2005/07/13

2 What is Language Modeling?
Language Modeling (LM) is the art of determining the probability of a word sequence. Given a word sequence W = w1 w2 ... wn, the probability can be decomposed into a product of conditional probabilities:
P(W) = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wn | w1 ... wn-1)

3 n-gram Language Modeling
The number of parameters of P(wi | w1 ... wi-1) is very large: |V|^i, where V denotes the vocabulary.
n-gram assumption: the probability of word wi depends only on the previous n-1 words.
Trigram: P(wi | w1 ... wi-1) ≈ P(wi | wi-2 wi-1)

4 n-gram Language Modeling
Maximum likelihood estimate:
P(wi | wi-2 wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)
C(wi-2 wi-1 wi) denotes the number of occurrences of wi-2 wi-1 wi in the training corpus, and similarly for C(wi-2 wi-1).
Many three-word sequences never occur in the training corpus. Consider the sequence "party on Tuesday": what is P(Tuesday | party on)?
Data sparseness
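
The MLE can be computed directly from corpus counts. A minimal Python sketch (the toy corpus and function name are illustrative, not from the slides):

    from collections import Counter

    def mle_trigram_prob(corpus, w1, w2, w3):
        """MLE estimate: P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2)."""
        trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
        bigrams = Counter(zip(corpus, corpus[1:]))
        if bigrams[(w1, w2)] == 0:
            return 0.0                      # context never seen; undefined without smoothing
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

    corpus = "party on saturday and party on sunday".split()
    print(mle_trigram_prob(corpus, "party", "on", "tuesday"))   # 0.0 -> data sparseness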

5 Smoothing
The training corpus might not contain any instances of the phrase, so C(party on Tuesday) would be 0, while there might still be 20 instances of the phrase "party on", so P(Tuesday | party on) = 0.
Smoothing techniques take some probability away from some occurrences.
Imagine we have "party on Stan Chen's birthday" in the training data, and it occurs only once.

6 Smoothing
By taking some probability away from some words, such as "Stan", and redistributing it to other words, such as "Tuesday", zero probabilities can be avoided.
Katz smoothing
Jelinek-Mercer smoothing (deleted interpolation)
Kneser-Ney smoothing

7 Smoothing: simple models
Add-one smoothing: pretend each trigram occurs once more than it actually does:
P(wi | wi-2 wi-1) = (1 + C(wi-2 wi-1 wi)) / (|V| + C(wi-2 wi-1))
Add-delta smoothing: add a small δ instead of 1:
P(wi | wi-2 wi-1) = (δ + C(wi-2 wi-1 wi)) / (δ|V| + C(wi-2 wi-1))
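
A minimal sketch of add-delta smoothing (δ = 1 gives add-one); the counting setup mirrors the MLE sketch above and the toy data are illustrative:

    from collections import Counter

    def add_delta_prob(corpus, vocab, w1, w2, w3, delta=1.0):
        """(delta + C(w1 w2 w3)) / (delta * |V| + C(w1 w2)); delta = 1 is add-one smoothing."""
        trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
        bigrams = Counter(zip(corpus, corpus[1:]))
        return (delta + trigrams[(w1, w2, w3)]) / (delta * len(vocab) + bigrams[(w1, w2)])

    corpus = "party on saturday and party on sunday".split()
    vocab = set(corpus) | {"tuesday"}
    print(add_delta_prob(corpus, vocab, "party", "on", "tuesday"))   # small but nonzero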

8 Simple interpolation
Pinterp(wi | wi-2 wi-1) = λ P(wi | wi-2 wi-1) + μ P(wi | wi-1) + (1 − λ − μ) P(wi), where 0 ≤ λ, μ ≤ 1
In practice, the uniform distribution 1/|V| is also interpolated; this ensures that no word is assigned probability 0.
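
A minimal sketch of the interpolation, assuming trigram, bigram, and unigram estimates are already available; the λ values are illustrative:

    def interpolated_prob(p_tri, p_bi, p_uni, vocab_size,
                          lam3=0.6, lam2=0.25, lam1=0.1):
        """Linear interpolation of trigram, bigram, unigram, and uniform estimates."""
        lam0 = 1.0 - lam3 - lam2 - lam1          # weight on the uniform distribution
        return lam3 * p_tri + lam2 * p_bi + lam1 * p_uni + lam0 * (1.0 / vocab_size)

    # even if all three MLE estimates are 0, the result is nonzero:
    print(interpolated_prob(0.0, 0.0, 0.0, vocab_size=10000))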

9 Katz smoothing
Katz smoothing is based on the Good-Turing formula.
Let nr represent the number of n-grams that occur exactly r times in the training data. The discounted count is
r* = (r + 1) nr+1 / nr
The probability estimate for an n-gram with r counts is r*/N, where N is the size of the training data; N remains the same as for the unsmoothed estimates.

10 Katz smoothing
Summing the discounted counts over all seen n-grams gives Σr≥1 nr r* = Σr≥1 (r + 1) nr+1 = N − n1, since (r + 1) nr+1 = 0 for the largest observed r.
Let N represent the total size of the training set; the total discount is therefore n1, and the left-over probability mass n1/N is redistributed to unseen n-grams.
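
A minimal sketch of the Good-Turing discounted counts and the left-over mass, assuming a table of counts-of-counts nr is already available (the numbers below are toy values, purely illustrative):

    def good_turing_discounted_count(r, n):
        """r* = (r + 1) * n[r+1] / n[r], where n[r] is the number of n-grams seen r times."""
        return (r + 1) * n.get(r + 1, 0) / n[r]

    # toy counts-of-counts: n1 n-grams seen once, n2 seen twice, ...
    n = {1: 100, 2: 40, 3: 20, 4: 10}
    N = sum(r * nr for r, nr in n.items())           # total size of the training data
    for r in sorted(n):
        print(r, good_turing_discounted_count(r, n)) # the largest r gets r* = 0
    print("left-over probability mass:", n[1] / N)   # n1 / N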

11 Katz backoff smoothing

12 Katz backoff smoothing
Consider a bigram model of a phrase such as Pkatz(Francisco | on). Since the phrase San Francisco is fairly common, the unigram probability P(Francisco) will also be fairly high. This means that, using Katz smoothing, Pkatz(Francisco | on) will also be fairly high. But the word Francisco occurs in exceedingly few contexts, and its probability of occurring in a new one is very low.
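
A schematic sketch of the backoff idea behind these slides: use the discounted higher-order estimate when the bigram was seen, otherwise fall back to a scaled unigram estimate. The discount here is a simple fixed (absolute) discount rather than the Good-Turing discount Katz actually uses, and the function name and D value are illustrative:

    from collections import Counter

    def backoff_bigram_prob(corpus, w_prev, w, D=0.5):
        """Backoff bigram estimate: discounted bigram if seen, else scaled unigram."""
        bigrams = Counter(zip(corpus, corpus[1:]))
        contexts = Counter(corpus[:-1])              # counts of w_prev as a bigram context
        unigrams = Counter(corpus)
        N = len(corpus)
        if bigrams[(w_prev, w)] > 0:
            return (bigrams[(w_prev, w)] - D) / contexts[w_prev]
        # mass reserved by discounting every seen bigram that starts with w_prev
        seen = {u for (v, u) in bigrams if v == w_prev}
        reserved = D * len(seen) / contexts[w_prev]
        # spread the reserved mass over the unigram probabilities of the unseen words
        unseen_mass = sum(c for v, c in unigrams.items() if v not in seen) / N
        return reserved / unseen_mass * unigrams[w] / N

Because the backoff distribution here is the plain unigram estimate, a frequent word like Francisco still gets a high backoff probability, which is exactly the weakness the next slide addresses.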

13 Kneser-Ney smoothing
KN smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than the number of occurrences of the word. Thus, the probability PKN(Francisco | on) would be fairly low, while for a word like Tuesday that occurs in many contexts, PKN(Tuesday | on) would be relatively high, even if the phrase "on Tuesday" did not occur in the training data.

14 Kneser-Ney smoothing
Backoff Kneser-Ney smoothing:
PBKN(wi | wi-1) = (C(wi-1 wi) − D) / C(wi-1)                            if C(wi-1 wi) > 0
PBKN(wi | wi-1) = α(wi-1) |{v | C(v wi) > 0}| / Σw |{v | C(v w) > 0}|   otherwise
where |{v | C(v wi) > 0}| is the number of distinct words v that wi can follow (the number of contexts wi occurs in), D is the discount, and α is a normalization constant such that the probabilities sum to 1.

15 Kneser-Ney smoothing
Toy example with V = {a, b, c, d} and training corpus:
b b a a c c d a a b b b b c c a a b c
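
A small sketch that computes the context counts |{v | C(v w) > 0}| used by the Kneser-Ney backoff distribution, reading the slide's token list as one training sequence:

    from collections import defaultdict

    corpus = "b b a a c c d a a b b b b c c a a b c".split()
    contexts = defaultdict(set)          # word -> set of words seen immediately before it
    for prev, cur in zip(corpus, corpus[1:]):
        contexts[cur].add(prev)

    for w in "abcd":
        print(w, len(contexts[w]))       # a: 4, b: 2, c: 3, d: 1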

16 Kneser-Ney smoothing
Interpolated models always combine the higher-order and the lower-order distributions. Interpolated Kneser-Ney smoothing:
PIKN(wi | wi-1) = max(C(wi-1 wi) − D, 0) / C(wi-1) + λ(wi-1) |{v | C(v wi) > 0}| / Σw |{v | C(v w) > 0}|
where λ(wi-1) is a normalization constant such that the probabilities sum to 1.
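
A minimal sketch of interpolated Kneser-Ney for bigrams with a single fixed discount (the value D = 0.75 is illustrative); it reuses the toy corpus from slide 15:

    from collections import Counter, defaultdict

    def interpolated_kn(corpus, w_prev, w, D=0.75):
        """Interpolated Kneser-Ney bigram estimate with one fixed discount D."""
        bigrams = Counter(zip(corpus, corpus[1:]))
        contexts = Counter(corpus[:-1])                  # counts of w_prev as a context
        preceding = defaultdict(set)                     # word -> distinct words seen before it
        for v, u in bigrams:
            preceding[u].add(v)
        p_cont = len(preceding[w]) / len(bigrams)        # continuation probability of w
        seen_after = {u for (v, u) in bigrams if v == w_prev}
        lam = D * len(seen_after) / contexts[w_prev]     # weight on the continuation distribution
        return max(bigrams[(w_prev, w)] - D, 0) / contexts[w_prev] + lam * p_cont

    corpus = "b b a a c c d a a b b b b c c a a b c".split()
    print(interpolated_kn(corpus, "a", "b"))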

17 Kneser-Ney smoothing
Modified Kneser-Ney smoothing uses multiple discounts: one for n-grams with one count, another for those with two counts, and another for those with three or more counts. (Using a separate discount for every count would have too many parameters.)

18 Jelinek-Mercer smoothing
Combines different n-gram orders by linearly interpolating all three models (trigram, bigram, and unigram) whenever computing the trigram probability.

20 Absolute discounting
Absolute discounting subtracts a fixed discount D ≤ 1 from each nonzero count.

20 Witten-Bell Discounting
Key concept, things seen once: use the count of things you've seen once to help estimate the count of things you've never seen.
So we estimate the total probability mass of all the zero-count N-grams as the number of observed types divided by the number of tokens plus observed types:
T / (N + T)
where N is the number of tokens and T is the number of observed types.

21 Witten-Bell Discounting
T/(N+T) gives the total "probability of unseen N-grams"; we need to divide this up among all the zero-count N-grams.
We could simply divide it equally:
p = T / (Z (N + T))
where Z is the total number of N-grams with count zero.
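
A minimal sketch of this equal-division form of Witten-Bell, assuming a flat table of item counts and a fixed vocabulary (the toy counts and vocabulary are illustrative):

    def witten_bell_probs(counts, vocab):
        """Seen items get c/(N+T); the T/(N+T) unseen mass is split equally over the Z zero-count items."""
        N = sum(counts.values())            # number of tokens
        T = len(counts)                     # number of observed types
        Z = len(vocab) - T                  # number of items with count zero
        return {w: counts[w] / (N + T) if w in counts else T / (Z * (N + T))
                for w in vocab}

    counts = {"party": 3, "on": 3, "tuesday": 1}
    probs = witten_bell_probs(counts, vocab={"party", "on", "tuesday", "saturday", "sunday"})
    print(sum(probs.values()))              # sums to 1 (up to float rounding)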

22 Witten-Bell Discounting
Alternatively, we can represent the smoothed counts directly:
ci* = (T/Z) · N / (N + T)   if ci = 0
ci* = ci · N / (N + T)      if ci > 0

23 Witten-Bell Discounting

24 Witten-Bell Discounting
For bigrams: T is the number of bigram types and N is the number of bigram tokens.

25 Evaluation
A language model that assigns equal probability to 100 words would have perplexity 100.

26 Evaluation In general, the perplexity of a LM is equal to the geometric average of the inverse probability of the words measured on test data:

27

28 Evaluation
The "true" model for any data source will have the lowest possible perplexity. The lower the perplexity of our model, the closer it is, in some sense, to the true model.
Entropy is simply log2 of perplexity: the average number of bits per word that would be necessary to encode the test data using an optimal coder.
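
A short sketch of the entropy/perplexity relation on the same uniform 1/100 toy model used above (the numbers are illustrative):

    import math

    def entropy_bits(word_probs):
        """Average bits per word: -(1/N) * sum(log2 P(wi))."""
        return -sum(math.log2(p) for p in word_probs) / len(word_probs)

    h = entropy_bits([0.01] * 20)           # the uniform 1/100 model again
    print(h, 2 ** h)                        # about 6.64 bits, i.e. perplexity 100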

29 Evaluation entropy : 54 perplexity : 3216 50%
reduction entropy .01 .1 .16 .2 .3 .4 .5 .75 1 perplexity 0.69% 6.7% 10% 13% 19% 24% 29% 41% 50% entropy : 54 perplexity : 32 % entropy : 54.5 perplexity : 32 %

