
1 Introduction to Language Models (語言模型簡介)
Presented by Wen-Hung Tsai, Speech Lab, CSIE, NTNU, 2005/07/13

2 What is Language Modeling?
Language Modeling (LM) is the art of determining the probability of a word sequence. Given a word sequence W = w1 w2 ... wn, the probability can be decomposed into a product of conditional probabilities:
P(W) = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wn | w1 ... wn-1)

3 n-gram Language Modeling
The number of parameters of P(wi | w1 ... wi-1) is very large: |V|^i, where V denotes the vocabulary.
n-gram assumption: the probability of word wi depends only on the previous n-1 words.
Trigram: P(wi | w1 ... wi-1) ≈ P(wi | wi-2 wi-1)

4 n-gram Language Modeling
Maximum likelihood estimate:
P(wi | wi-2 wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)
C(wi-2 wi-1 wi) denotes the number of occurrences of wi-2 wi-1 wi in the training corpus, and similarly for C(wi-2 wi-1).
Many three-word sequences never occur in the training corpus. Consider the sequence "party on Tuesday": what is P(Tuesday | party on)?
Data sparseness
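
The MLE can be computed directly from corpus counts. A minimal Python sketch (the toy corpus and function name are illustrative, not from the slides):

    from collections import Counter

    def mle_trigram_prob(corpus, w1, w2, w3):
        """MLE estimate: P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2)."""
        trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
        bigrams = Counter(zip(corpus, corpus[1:]))
        if bigrams[(w1, w2)] == 0:
            return 0.0                      # context never seen; undefined without smoothing
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

    corpus = "party on saturday and party on sunday".split()
    print(mle_trigram_prob(corpus, "party", "on", "tuesday"))   # 0.0 -> data sparseness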

5 Smoothing
The training corpus might not contain any instances of the phrase, so C(party on Tuesday) would be 0, while there might still be 20 instances of the phrase "party on", so P(Tuesday | party on) = 0.
Smoothing techniques take some probability away from some occurrences.
Imagine we have "party on Stan Chen's birthday" in the training data, and it occurs only once.

6 Smoothing
By taking some probability away from some words, such as "Stan", and redistributing it to other words, such as "Tuesday", zero probabilities can be avoided.
Katz smoothing
Jelinek-Mercer smoothing (deleted interpolation)
Kneser-Ney smoothing

7 Smoothing: simple models
Add-one smoothing: pretend each trigram occurs once more than it actually does:
P(wi | wi-2 wi-1) = (1 + C(wi-2 wi-1 wi)) / (|V| + C(wi-2 wi-1))
Add-delta smoothing: add a small δ instead of 1:
P(wi | wi-2 wi-1) = (δ + C(wi-2 wi-1 wi)) / (δ|V| + C(wi-2 wi-1))
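
A minimal sketch of add-delta smoothing (δ = 1 gives add-one); the counting setup mirrors the MLE sketch above and the toy data are illustrative:

    from collections import Counter

    def add_delta_prob(corpus, vocab, w1, w2, w3, delta=1.0):
        """(delta + C(w1 w2 w3)) / (delta * |V| + C(w1 w2)); delta = 1 is add-one smoothing."""
        trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
        bigrams = Counter(zip(corpus, corpus[1:]))
        return (delta + trigrams[(w1, w2, w3)]) / (delta * len(vocab) + bigrams[(w1, w2)])

    corpus = "party on saturday and party on sunday".split()
    vocab = set(corpus) | {"tuesday"}
    print(add_delta_prob(corpus, vocab, "party", "on", "tuesday"))   # small but nonzero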

8 Simple interpolation
Pinterp(wi | wi-2 wi-1) = λ P(wi | wi-2 wi-1) + μ P(wi | wi-1) + (1 − λ − μ) P(wi), where 0 ≤ λ, μ ≤ 1
In practice, the uniform distribution 1/|V| is also interpolated; this ensures that no word is assigned probability 0.
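
A minimal sketch of the interpolation, assuming trigram, bigram, and unigram estimates are already available; the λ values are illustrative:

    def interpolated_prob(p_tri, p_bi, p_uni, vocab_size,
                          lam3=0.6, lam2=0.25, lam1=0.1):
        """Linear interpolation of trigram, bigram, unigram, and uniform estimates."""
        lam0 = 1.0 - lam3 - lam2 - lam1          # weight on the uniform distribution
        return lam3 * p_tri + lam2 * p_bi + lam1 * p_uni + lam0 * (1.0 / vocab_size)

    # even if all three MLE estimates are 0, the result is nonzero:
    print(interpolated_prob(0.0, 0.0, 0.0, vocab_size=10000))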

9 Katz smoothing
Katz smoothing is based on the Good-Turing formula.
Let nr represent the number of n-grams that occur exactly r times in the training data. The discounted count is
r* = (r + 1) nr+1 / nr
The probability estimate for an n-gram with r counts is r*/N, where N is the size of the training data; N remains the same as for the unsmoothed estimates.

10 Katz smoothing
Summing the discounted counts over all seen n-grams gives Σr≥1 nr r* = Σr≥1 (r + 1) nr+1 = N − n1, since (r + 1) nr+1 = 0 for the largest observed r.
Let N represent the total size of the training set; the total discount is therefore n1, and the left-over probability mass n1/N is redistributed to unseen n-grams.
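
A minimal sketch of the Good-Turing discounted counts and the left-over mass, assuming a table of counts-of-counts nr is already available (the numbers below are toy values, purely illustrative):

    def good_turing_discounted_count(r, n):
        """r* = (r + 1) * n[r+1] / n[r], where n[r] is the number of n-grams seen r times."""
        return (r + 1) * n.get(r + 1, 0) / n[r]

    # toy counts-of-counts: n1 n-grams seen once, n2 seen twice, ...
    n = {1: 100, 2: 40, 3: 20, 4: 10}
    N = sum(r * nr for r, nr in n.items())           # total size of the training data
    for r in sorted(n):
        print(r, good_turing_discounted_count(r, n)) # the largest r gets r* = 0
    print("left-over probability mass:", n[1] / N)   # n1 / N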

11 Katz backoff smoothing

12 Katz backoff smoothing
Consider a bigram model of a phrase such as Pkatz(Francisco | on). Since the phrase San Francisco is fairly common, the unigram probability P(Francisco) will also be fairly high. This means that, using Katz smoothing, Pkatz(Francisco | on) will also be fairly high. But the word Francisco occurs in exceedingly few contexts, and its probability of occurring in a new one is very low.
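
A schematic sketch of the backoff idea behind these slides: use the discounted higher-order estimate when the bigram was seen, otherwise fall back to a scaled unigram estimate. The discount here is a simple fixed (absolute) discount rather than the Good-Turing discount Katz actually uses, and the function name and D value are illustrative:

    from collections import Counter

    def backoff_bigram_prob(corpus, w_prev, w, D=0.5):
        """Backoff bigram estimate: discounted bigram if seen, else scaled unigram."""
        bigrams = Counter(zip(corpus, corpus[1:]))
        contexts = Counter(corpus[:-1])              # counts of w_prev as a bigram context
        unigrams = Counter(corpus)
        N = len(corpus)
        if bigrams[(w_prev, w)] > 0:
            return (bigrams[(w_prev, w)] - D) / contexts[w_prev]
        # mass reserved by discounting every seen bigram that starts with w_prev
        seen = {u for (v, u) in bigrams if v == w_prev}
        reserved = D * len(seen) / contexts[w_prev]
        # spread the reserved mass over the unigram probabilities of the unseen words
        unseen_mass = sum(c for v, c in unigrams.items() if v not in seen) / N
        return reserved / unseen_mass * unigrams[w] / N

Because the backoff distribution here is the plain unigram estimate, a frequent word like Francisco still gets a high backoff probability, which is exactly the weakness the next slide addresses.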

13 Kneser-Ney smoothing
KN smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than the number of occurrences of the word. Thus, the probability PKN(Francisco | on) would be fairly low, while for a word like Tuesday that occurs in many contexts, PKN(Tuesday | on) would be relatively high, even if the phrase "on Tuesday" did not occur in the training data.

14 Kneser-Ney smoothing
Backoff Kneser-Ney smoothing:
PBKN(wi | wi-1) = (C(wi-1 wi) − D) / C(wi-1)                            if C(wi-1 wi) > 0
PBKN(wi | wi-1) = α(wi-1) |{v | C(v wi) > 0}| / Σw |{v | C(v w) > 0}|   otherwise
where |{v | C(v wi) > 0}| is the number of distinct words v that wi can follow (the number of contexts wi occurs in), D is the discount, and α is a normalization constant such that the probabilities sum to 1.

15 Kneser-Ney smoothing
Toy example with V = {a, b, c, d} and training corpus:
b b a a c c d a a b b b b c c a a b c
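
A small sketch that computes the context counts |{v | C(v w) > 0}| used by the Kneser-Ney backoff distribution, reading the slide's token list as one training sequence:

    from collections import defaultdict

    corpus = "b b a a c c d a a b b b b c c a a b c".split()
    contexts = defaultdict(set)          # word -> set of words seen immediately before it
    for prev, cur in zip(corpus, corpus[1:]):
        contexts[cur].add(prev)

    for w in "abcd":
        print(w, len(contexts[w]))       # a: 4, b: 2, c: 3, d: 1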

16 Kneser-Ney smoothing
Interpolated models always combine the higher-order and the lower-order distributions. Interpolated Kneser-Ney smoothing:
PIKN(wi | wi-1) = max(C(wi-1 wi) − D, 0) / C(wi-1) + λ(wi-1) |{v | C(v wi) > 0}| / Σw |{v | C(v w) > 0}|
where λ(wi-1) is a normalization constant such that the probabilities sum to 1.
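
A minimal sketch of interpolated Kneser-Ney for bigrams with a single fixed discount (the value D = 0.75 is illustrative); it reuses the toy corpus from slide 15:

    from collections import Counter, defaultdict

    def interpolated_kn(corpus, w_prev, w, D=0.75):
        """Interpolated Kneser-Ney bigram estimate with one fixed discount D."""
        bigrams = Counter(zip(corpus, corpus[1:]))
        contexts = Counter(corpus[:-1])                  # counts of w_prev as a context
        preceding = defaultdict(set)                     # word -> distinct words seen before it
        for v, u in bigrams:
            preceding[u].add(v)
        p_cont = len(preceding[w]) / len(bigrams)        # continuation probability of w
        seen_after = {u for (v, u) in bigrams if v == w_prev}
        lam = D * len(seen_after) / contexts[w_prev]     # weight on the continuation distribution
        return max(bigrams[(w_prev, w)] - D, 0) / contexts[w_prev] + lam * p_cont

    corpus = "b b a a c c d a a b b b b c c a a b c".split()
    print(interpolated_kn(corpus, "a", "b"))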

17 Kneser-Ney smoothing
Modified Kneser-Ney smoothing uses multiple discounts: one for n-grams with one count, another for those with two counts, and another for those with three or more counts. (Using a separate discount for every count would have too many parameters.)

18 Jelinek-Mercer smoothing
Combines different n-gram orders by linearly interpolating all three models (trigram, bigram, and unigram) whenever computing the trigram probability.

20 Absolute discounting
Absolute discounting subtracts a fixed discount D ≤ 1 from each nonzero count.

20 Witten-Bell Discounting
Key concept, things seen once: use the count of things you've seen once to help estimate the count of things you've never seen.
So we estimate the total probability mass of all the zero-count N-grams as the number of observed types divided by the number of tokens plus observed types:
T / (N + T)
where N is the number of tokens and T is the number of observed types.

21 Witten-Bell Discounting
T/(N+T) gives the total "probability of unseen N-grams"; we need to divide this up among all the zero-count N-grams.
We could simply divide it equally:
p = T / (Z (N + T))
where Z is the total number of N-grams with count zero.
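
A minimal sketch of this equal-division form of Witten-Bell, assuming a flat table of item counts and a fixed vocabulary (the toy counts and vocabulary are illustrative):

    def witten_bell_probs(counts, vocab):
        """Seen items get c/(N+T); the T/(N+T) unseen mass is split equally over the Z zero-count items."""
        N = sum(counts.values())            # number of tokens
        T = len(counts)                     # number of observed types
        Z = len(vocab) - T                  # number of items with count zero
        return {w: counts[w] / (N + T) if w in counts else T / (Z * (N + T))
                for w in vocab}

    counts = {"party": 3, "on": 3, "tuesday": 1}
    probs = witten_bell_probs(counts, vocab={"party", "on", "tuesday", "saturday", "sunday"})
    print(sum(probs.values()))              # sums to 1 (up to float rounding)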

22 Witten-Bell Discounting
Alternatively, we can represent the smoothed counts directly:
ci* = (T/Z) · N / (N + T)   if ci = 0
ci* = ci · N / (N + T)      if ci > 0

23 Witten-Bell Discounting

24 Witten-Bell Discounting
For bigrams: T is the number of bigram types and N is the number of bigram tokens.

25 Evaluation
A language model that assigns equal probability to 100 words would have perplexity 100.

26 Evaluation In general, the perplexity of a LM is equal to the geometric average of the inverse probability of the words measured on test data:

27

28 Evaluation
The "true" model for any data source will have the lowest possible perplexity. The lower the perplexity of our model, the closer it is, in some sense, to the true model.
Entropy is simply log2 of perplexity: the average number of bits per word that would be necessary to encode the test data using an optimal coder.
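
A short sketch of the entropy/perplexity relation on the same uniform 1/100 toy model used above (the numbers are illustrative):

    import math

    def entropy_bits(word_probs):
        """Average bits per word: -(1/N) * sum(log2 P(wi))."""
        return -sum(math.log2(p) for p in word_probs) / len(word_probs)

    h = entropy_bits([0.01] * 20)           # the uniform 1/100 model again
    print(h, 2 ** h)                        # about 6.64 bits, i.e. perplexity 100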

29 Evaluation entropy : 54 perplexity : 3216 50%
reduction entropy .01 .1 .16 .2 .3 .4 .5 .75 1 perplexity 0.69% 6.7% 10% 13% 19% 24% 29% 41% 50% entropy : 54 perplexity : 32 % entropy : 54.5 perplexity : 32 %

