Chapter 6. Statistical Inference: n-gram Models over Sparse Data
2005. 1. 13
이동훈 huni77@pusan.ac.kr
Foundations of Statistical Natural Language Processing
2 / 20 Table of Contents
- Introduction
- Bins: Forming Equivalence Classes
  - Reliability vs. Discrimination
  - N-gram models
- Statistical Estimators
  - Maximum Likelihood Estimation (MLE)
  - Laplace’s law, Lidstone’s law and the Jeffreys-Perks law
  - Held out estimation
  - Cross-validation (deleted estimation)
  - Good-Turing estimation
- Combining Estimators
  - Simple linear interpolation
  - Katz’s backing-off
  - General linear interpolation
- Conclusions
3 / 20 Introduction
Object of Statistical NLP
- Perform statistical inference for the field of natural language.
Statistical inference in general consists of:
- Taking some data generated by an unknown probability distribution.
- Making some inferences about this distribution.
This chapter divides the problem into three areas:
- Dividing the training data into equivalence classes.
- Finding a good statistical estimator for each equivalence class.
- Combining multiple estimators.
4 / 20 Bins: Forming Equivalence Classes [1/2]
Reliability vs. Discrimination
- “large green ___________” → tree? mountain? frog? car?
- “swallowed the large green ________” → pill? broccoli?
- Larger n: more information about the context of the specific instance (greater discrimination).
- Smaller n: more instances in the training data, better statistical estimates (more reliability).
5 / 20 Bins: Forming Equivalence Classes [2/2]
N-gram models
- An “n-gram” is a sequence of n words.
- Predicting the next word: Markov assumption – only the prior local context (the last few words) affects the next word.
- Selecting an n (vocabulary size V = 20,000 words):

  n            | Number of bins
  2 (bigrams)  | 400,000,000
  3 (trigrams) | 8,000,000,000,000
  4 (4-grams)  | 1.6 × 10^17
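These bin counts are simply V^n: 20,000^2 = 4 × 10^8 bigram bins, 20,000^3 = 8 × 10^12 trigram bins, and 20,000^4 = 1.6 × 10^17 4-gram bins.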
6 / 20 Statistical Estimators [1/3]
- Given the observed training data, how do you develop a model (probability distribution) to predict future events?
- The target feature is a probability estimate: estimating the unknown probability distribution of n-grams.
7 / 20 Statistical Estimators [2/3]
Notation for the statistical estimation chapter:

  N              | Number of training instances
  B              | Number of bins the training instances are divided into
  w_1n           | An n-gram w_1…w_n in the training text
  C(w_1…w_n)     | Frequency of the n-gram w_1…w_n in the training text
  r              | Frequency of an n-gram
  f(·)           | Frequency estimate of a model
  N_r            | Number of bins that have r training instances in them
  T_r            | Total count of n-grams of frequency r in further data
  h              | ‘History’ of preceding words
8 / 20 Statistical Estimators[3/3] Example - Instances in the training corpus: “inferior to ________”
9 / 20 Maximum Likelihood Estimation (MLE) [1/2]
Definition
- Use the relative frequency as the probability estimate.
Example: the corpus contains 10 training instances of “comes across”.
- 8 times it was followed by “as”: P(as) = 0.8
- Once each by “more” and “a”: P(more) = 0.1, P(a) = 0.1
- Any word not among the above three: P(x) = 0.0
Formula (see below)
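In the chapter’s notation, the MLE estimates are the relative frequencies:

  P_MLE(w_1…w_n) = C(w_1…w_n) / N
  P_MLE(w_n | w_1…w_(n-1)) = C(w_1…w_n) / C(w_1…w_(n-1))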
10 / 20 Maximum Likelihood Estimation (MLE) [2/2]
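As a concrete illustration, here is a minimal Python sketch of MLE bigram estimation; the toy corpus and names are hypothetical, not from the slides:

  from collections import Counter

  # Toy corpus; in practice this would be a large training text.
  corpus = "comes across as comes across as comes across more".split()

  bigram_counts = Counter(zip(corpus, corpus[1:]))   # C(w_prev w)
  history_counts = Counter(corpus[:-1])              # C(w_prev), counted as bigram histories

  def p_mle(w_prev, w):
      # Relative frequency: P_MLE(w | w_prev) = C(w_prev w) / C(w_prev).
      # Unseen bigrams get probability 0.0 - exactly MLE's sparse-data problem.
      if history_counts[w_prev] == 0:
          return 0.0
      return bigram_counts[(w_prev, w)] / history_counts[w_prev]

  print(p_mle("across", "as"))   # 2/3 for this toy corpus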
11 / 20 Laplace’s law, Lidstone’s law and the Jeffreys-Perks law [1/2]
Laplace’s law
- Adds a little bit of probability space to unseen events by adding one to every count.
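In the chapter’s notation, with B bins and N training instances, Laplace’s law is:

  P_Lap(w_1…w_n) = (C(w_1…w_n) + 1) / (N + B)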
12 / 20 Laplace’s law, Lidstone’s law and the Jeffreys-Perks law [2/2]
Lidstone’s law and the Jeffreys-Perks law
- Lidstone’s law: add some positive value λ (rather than 1) to every count.
- Jeffreys-Perks law: λ = 0.5, also called ELE (Expected Likelihood Estimation).
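The corresponding estimate is:

  P_Lid(w_1…w_n) = (C(w_1…w_n) + λ) / (N + Bλ)

with λ = 1 giving Laplace’s law and λ = 1/2 giving the Jeffreys-Perks (ELE) estimate.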
13 / 20 Held out estimation
Validate by holding out part of the training data.
- C_1(w_1…w_n) = frequency of w_1…w_n in the training data
- C_2(w_1…w_n) = frequency of w_1…w_n in the held out data
- T = number of tokens in the held out data
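Restated in this notation (a sketch following the chapter’s treatment): for n-grams with training frequency r = C_1(w_1…w_n),

  T_r = Σ over {w_1…w_n : C_1(w_1…w_n) = r} of C_2(w_1…w_n)
  P_ho(w_1…w_n) = T_r / (N_r T)

i.e. the average held out count of n-grams seen r times in training, normalized by the held out size T.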
14 / 20 Cross-validation (deleted estimation) [1/2]
Use data for both training and validation.
- Divide the training data into 2 parts, A and B.
- Train on A, validate on B → Model 1
- Train on B, validate on A → Model 2
- Combine the two models: Model 1 + Model 2 → Final Model
15 / 20 Cross-validation (deleted estimation) [2/2]
- Cross validation: the training data is used both as initial training data and as held out data.
- On large training corpora, deleted estimation works better than held-out estimation.
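Writing 0 and 1 for the two halves of the training data, the deleted estimate is commonly stated as (a sketch in the chapter’s notation):

  P_del(w_1…w_n) = (T_r^01 + T_r^10) / (N (N_r^0 + N_r^1)),  where r = C(w_1…w_n)

Here N_r^0 is the number of n-grams with frequency r in part 0 and T_r^01 is the total count in part 1 of those same n-grams; the 10 terms are defined symmetrically.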
16 / 20 Good-Turing estimation
- Suitable for a large number of observations from a large vocabulary.
- Works well for n-grams.
- r* is an adjusted frequency; E denotes the expectation of a random variable.
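The adjusted frequency and the resulting probability estimate are:

  r* = (r + 1) E[N_(r+1)] / E[N_r]
  P_GT(w_1…w_n) = r* / N,  where r = C(w_1…w_n)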
17 / 20 Combining Estimators [1/3]
Basic idea
- Consider how to combine multiple probability estimates from various different models.
- How can you develop a model that uses different-length n-grams as appropriate?
Simple linear interpolation
- A weighted combination of trigram, bigram and unigram estimates (see the formula below).
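With non-negative weights that sum to one, the interpolated trigram estimate is:

  P_li(w_n | w_(n-2) w_(n-1)) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_(n-1)) + λ_3 P_3(w_n | w_(n-2) w_(n-1)),
  where 0 ≤ λ_i ≤ 1 and Σ_i λ_i = 1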
18 / 20 Combining Estimators [2/3]
Katz’s backing-off
- Used to smooth or to combine information sources.
- If the n-gram appeared more than k times: use the (discounted) n-gram estimate.
- If it appeared k or fewer times: back off to the estimate from a shorter n-gram.
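In outline (a sketch following the chapter’s presentation, with d a discount and α a normalizing back-off weight, both depending on the history):

  P_bo(w_i | w_(i-n+1)…w_(i-1)) =
    (1 − d) C(w_(i-n+1)…w_i) / C(w_(i-n+1)…w_(i-1))   if C(w_(i-n+1)…w_i) > k
    α P_bo(w_i | w_(i-n+2)…w_(i-1))                   otherwise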
19 / 20 Combining Estimators [3/3]
General linear interpolation
- The weights are a function of the history.
- A very general (and commonly used) way to combine models.
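For k component models the general form is:

  P_li(w | h) = Σ_(i=1..k) λ_i(h) P_i(w | h),
  where 0 ≤ λ_i(h) ≤ 1 and Σ_i λ_i(h) = 1 for every history h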
20 / 20 Conclusions
- Problems of sparse data are addressed with Good-Turing estimation, linear interpolation or back-off.
- Good-Turing smoothing performs well (Church & Gale, 1991).
- Active research areas: combining probability models and dealing with sparse data.