6. N-GRAMs Artificial Intelligence Laboratory, Pusan National University, Seongja Choi
2 Word Prediction “I’d like to make a collect …” (call? telephone? person-to-person?) Applications: –Spelling error detection –Augmentative communication –Context-sensitive spelling error correction
3 Language Model Language Model (LM) –statistical model of word sequences n-gram: use the previous n–1 words to predict the next word
4 Applications Context-sensitive spelling error detection and correction: “He is trying to fine out.” (“fine” should be “find”) “The design an construction will take a year.” (“an” should be “and”) Machine translation
5 Counting Words in Corpora Corpora (on-line text collections) Which words to count –What we are going to count –Where we are going to find the things to count
6 Brown Corpus 1 million words, 500 texts Varied genres (newspaper, novels, non-fiction, academic, etc.) Assembled at Brown University The first large on-line text collection used in corpus-based NLP research
7 Issues in Word Counting Punctuation symbols (., ? !) Capitalization (“He” vs. “he”, “Bush” vs. “bush”) Inflected forms (“cat” vs. “cats”) –Wordform: cat, cats, eat, eats, ate, eating, eaten –Lemma (Stem): cat, eat
8 Types vs. Tokens Tokens (N): Total number of running words Types (B): Number of distinct words in a corpus (size of the vocabulary) Example: “They picnicked by the pool, then lay back on the grass and looked at the stars.” –16 word tokens, 14 word types (not counting punctuation) ※ Here “types” will mean wordform types, not lemma types, and punctuation marks will generally be counted as words
9 How Many Words in English? Shakespeare’s complete works –884,647 wordform tokens –29,066 wordform types Brown Corpus –1 million wordform tokens –61,805 wordform types –37,851 lemma types
10 Simple (Unsmoothed) N-grams Task: Estimating the probability of a word First attempt: –Suppose there is no corpus available –Use uniform distribution –Assume: word types = V (e.g., 100,000)
11 Simple (Unsmoothed) N-grams Task: Estimating the probability of a word Second attempt: –Suppose there is a corpus –Assume: word tokens = N, # times w appears in corpus = C(w)
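Written out, the two estimates just described are the standard uniform and relative-frequency (MLE) unigram estimates:

P(w) = \frac{1}{V}  (first attempt: uniform over V word types)
P(w) = \frac{C(w)}{N}  (second attempt: relative frequency in a corpus of N tokens)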
12 Simple (Unsmoothed) N-grams Task: Estimating the probability of a word Third attempt: –Suppose there is a corpus –Assume a word depends on its n–1 previous words
13 Simple (Unsmoothed) N-grams
14 Simple (Unsmoothed) N-grams n-gram approximation: –w_k only depends on its previous n–1 words
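In standard notation, the chain rule and the n-gram (Markov) approximation it motivates are:

P(w_1 w_2 \dots w_k) = \prod_{i=1}^{k} P(w_i \mid w_1 \dots w_{i-1})
P(w_k \mid w_1 \dots w_{k-1}) \approx P(w_k \mid w_{k-n+1} \dots w_{k-1})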
15 Bigram Approximation Example: P(I want to eat British food) = P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) <s>: a special word meaning “start of sentence”
16 Note on Practical Problem Multiplying many probabilities results in a very small number and can cause numerical underflow Use log probabilities (logprobs) instead in the actual computation
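A minimal Python sketch of this trick for the bigram example above; the bigram probabilities here are made-up illustrative values, not estimates from any corpus:

import math

# Hypothetical bigram probabilities P(w_i | w_{i-1}) for the example sentence.
# The numbers are illustrative only.
bigram_probs = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("British", "food"): 0.60,
}

sentence = ["<s>", "I", "want", "to", "eat", "British", "food"]

# Sum of log probabilities instead of a product of probabilities:
# this avoids underflow as sentences get long.
logprob = sum(math.log(bigram_probs[(w1, w2)])
              for w1, w2 in zip(sentence, sentence[1:]))

print(logprob)            # log P(sentence)
print(math.exp(logprob))  # back to a probability (only safe for short sentences)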
17 Estimating N-gram Probability Maximum Likelihood Estimate (MLE)
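The standard MLE estimate, written for the general n-gram case and for bigrams (C(·) denotes a count in the training corpus):

P(w_k \mid w_{k-n+1} \dots w_{k-1}) = \frac{C(w_{k-n+1} \dots w_{k-1} w_k)}{C(w_{k-n+1} \dots w_{k-1})}
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}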
19 Estimating Bigram Probability Example: –C(to eat) = 860 –C(to) = 3256 –P(eat|to) = C(to eat) / C(to) = 860 / 3256 ≈ 0.26
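A small sketch of how such estimates come out of raw counts; the toy corpus below is invented purely for illustration:

from collections import Counter

# Toy corpus, invented for illustration only.
corpus = "i want to eat british food . i want to eat chinese food .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_mle(w1, w2):
    # MLE estimate: P(w2 | w1) = C(w1 w2) / C(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_mle("to", "eat"))       # 1.0 in this tiny corpus
print(bigram_mle("eat", "british"))  # 0.5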
21 Two Important Facts –N-gram models become increasingly accurate as we increase the value of N –They depend very strongly on their training corpus (in particular its genre and its size in words)
22 Smoothing Any particular training corpus is finite Sparse data problem: many valid n-grams never occur in the training corpus Smoothing deals with these zero probabilities
23 Smoothing Smoothing –Re-evaluating zero-probability n-grams and assigning them non-zero probability Also called Discounting –Lowering non-zero n-gram counts in order to assign some probability mass to the zero n-grams
24 Add-One Smoothing for Bigram
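The standard add-one (Laplace) smoothed estimate for bigrams, where V is the vocabulary size (number of word types):

P^{*}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}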
27 Things Seen Once Use the count of things seen once to help estimate the count of things never seen
28 Witten-Bell Discounting
29 Witten-Bell Discounting for Bigram
30 Witten-Bell Discounting for Bigram
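For reference, the usual Witten-Bell bigram formulation (following the standard textbook presentation): T(w_{n-1}) is the number of distinct word types observed after w_{n-1}, Z(w_{n-1}) the number of types never observed after it, and C(w_{n-1}) its token count.

Unseen bigram (C(w_{n-1} w_n) = 0):
P^{*}(w_n \mid w_{n-1}) = \frac{T(w_{n-1})}{Z(w_{n-1}) \, \big( C(w_{n-1}) + T(w_{n-1}) \big)}

Seen bigram (C(w_{n-1} w_n) > 0):
P^{*}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1}) + T(w_{n-1})}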
31 Seen counts vs. unseen counts
33 Good-Turing Discounting for Bigram
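The standard Good-Turing re-estimate, where N_c is the number of bigram types that occur exactly c times in the training corpus:

c^{*} = (c + 1) \, \frac{N_{c+1}}{N_c}

The total probability mass reserved for unseen bigrams is N_1 / N, with N the total number of observed bigram tokens.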
35 Backoff
36 Backoff
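One common way to write the backoff idea (a Katz-style trigram scheme shown as a reference form; P^{*} denotes a discounted estimate and the α weights keep the distribution normalized):

\hat{P}(w_n \mid w_{n-2} w_{n-1}) =
  P^{*}(w_n \mid w_{n-2} w_{n-1})   if C(w_{n-2} w_{n-1} w_n) > 0
  \alpha_1 \, P^{*}(w_n \mid w_{n-1})   else if C(w_{n-1} w_n) > 0
  \alpha_2 \, P^{*}(w_n)   otherwise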
37 Entropy Measure of uncertainty Used to evaluate the quality of n-gram models (how well a language model matches a given language) Entropy H(X) of a random variable X: H(X) = -\sum_{x} p(x) \log_2 p(x) Measured in bits Number of bits needed to encode the information in the optimal coding scheme
38 Example 1
39 Example 2
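A small Python check of the definition; the two distributions below are illustrative, e.g. a uniform 8-outcome variable, whose entropy comes out to exactly 3 bits:

import math

def entropy(probs):
    # H(X) = -sum p(x) * log2 p(x), ignoring zero-probability outcomes
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform8 = [1/8] * 8                                      # 8 equally likely outcomes
skewed   = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]  # same 8 outcomes, unequal

print(entropy(uniform8))  # 3.0 bits
print(entropy(skewed))    # 2.0 bits: less uncertainty than the uniform case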
40 Perplexity
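Perplexity is defined from the entropy H, or equivalently as an inverse probability normalized by the number of words N in the test sequence:

\text{Perplexity}(W) = 2^{H(W)} = P(w_1 w_2 \dots w_N)^{-1/N}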
41 Entropy of a Sequence
42 Entropy of a Language
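In the standard formulation, the per-word entropy of a sequence and the entropy rate of a language L are:

\frac{1}{n} H(W_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)
H(L) = \lim_{n \to \infty} \frac{1}{n} H(W_1^n)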
43 Cross Entropy Used for comparing two language models –p: the actual probability distribution that generated some data –m: a model of p (an approximation to p) Cross entropy of m on p:
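The standard definition of the cross entropy of a model m on the true distribution p, for word sequences drawn from a language L:

H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log m(W_1^n)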
44 Cross Entropy By the Shannon-McMillan-Breiman theorem, this can be estimated from a single, sufficiently long sample drawn from p: H(p, m) \approx -\frac{1}{n} \log m(w_1 w_2 \dots w_n) Property of cross entropy: H(p) ≤ H(p, m) The difference between H(p, m) and H(p) is a measure of how accurate model m is The more accurate the model, the lower its cross-entropy