Language Models & Smoothing
Shallow Processing Techniques for NLP
Ling570, October 19, 2011
Announcements
Career exploration talk: Bill McNeill, Thursday (10/20), 2:30-3:30pm, Thomson 135 & Online (Treehouse URL)
Treehouse meeting: Friday 10/21, 11-12, thesis topic brainstorming
GP Meeting: Friday 10/21, 3:30-5pm, PCAR 291 & Online (…/clmagrad)
Roadmap
Ngram language models
Constructing language models
Generative language models
Evaluation: training and testing; perplexity
Smoothing: Laplace smoothing, Good-Turing smoothing, interpolation & backoff
Ngram Language Models
Independence assumptions moderate data needs: rather than conditioning on all prior words, assume a finite history.
Unigram: probability of a word in isolation.
Bigram: probability of a word given the 1 previous word.
Trigram: probability of a word given the 2 previous words.
N-gram approximation and bigram sequence probability (see the equations below).
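For reference, the standard form of these approximations (a reconstruction in conventional notation; the bigram case is simply the N = 2 instance):

\[
P(w_1 \ldots w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})
\;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-N+1} \ldots w_{i-1}),
\qquad \text{bigram: } \prod_{i=1}^{n} P(w_i \mid w_{i-1})
\]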
Berkeley Restaurant Project Sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what im looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
im looking for a good place to eat breakfast
when is caffe venezia open during the day
Bigram Counts
Out of 9222 sentences. E.g., "i want" occurred 827 times.
Bigram Probabilities
Divide bigram counts by prefix unigram counts to get probabilities.
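A minimal sketch of this computation (not the course's own scripts; the function and corpus here are illustrative):

from collections import Counter

def bigram_probs(sentences):
    """Estimate MLE bigram probabilities: P(w2|w1) = C(w1 w2) / C(w1)."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigram_counts.update(tokens[:-1])              # prefix counts C(w1)
        bigram_counts.update(zip(tokens, tokens[1:]))   # bigram counts C(w1 w2)
    return {(w1, w2): c / unigram_counts[w1]
            for (w1, w2), c in bigram_counts.items()}

probs = bigram_probs(["i want english food", "i want chinese food"])
print(probs[("i", "want")])   # 1.0 in this toy corpus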
Bigram Estimates of Sentence Probabilities
P(<s> i want english food </s>) = P(i|<s>) * P(want|i) * P(english|want) * P(food|english) * P(</s>|food) = .000031
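A small sketch of scoring a sentence with such a table of bigram probabilities (this assumes the bigram_probs helper sketched above; names are illustrative):

import math

def sentence_logprob(sentence, probs):
    """Sum of log2 bigram probabilities, with <s>/</s> padding."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p = probs.get((w1, w2), 0.0)
        if p == 0.0:
            return float("-inf")   # unseen bigram under an unsmoothed MLE model
        total += math.log2(p)
    return total

# e.g. 2 ** sentence_logprob("i want english food", probs) recovers P(<s> i want english food </s>)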
Kinds of Knowledge
What types of knowledge are captured by ngram models?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat|to) = .28
P(food|to) = 0
P(want|spend) = 0
P(i|<s>) = .25
World knowledge, syntax, and discourse.
Probabilistic Language Generation
Coin-flipping models: a sentence is generated by a randomized algorithm. The generator can be in one of several states; flip coins to choose the next state, and flip other coins to decide which letter or word to output.
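One way to picture the coin-flipping view is a toy sampler over bigram probabilities (illustrative only, not from the lecture; it reuses the probs table sketched earlier):

import random

def generate(probs, max_len=20):
    """Sample a word sequence from a dict {(w1, w2): P(w2|w1)}, Shannon-style."""
    word, output = "<s>", []
    for _ in range(max_len):
        candidates = [(w2, p) for (w1, w2), p in probs.items() if w1 == word]
        if not candidates:
            break
        words, weights = zip(*candidates)
        word = random.choices(words, weights=weights)[0]   # "flip coins" for the next state
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)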
Generated Language: Effects of N
1. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD
2. First-order approximation: OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL
3. Second-order approximation: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
Word Models: Effects of N
1. First-order approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
2. Second-order approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
Shakespeare
The Wall Street Journal is Not Shakespeare
Evaluation
Evaluation - General
Evaluation crucial for NLP systems: required for most publishable results; should be integrated early.
Many factors: data, metrics, prior results, …
Evaluation Guidelines
Evaluate your system: use standard metrics; use (standard) training/dev/test sets.
Describing experiments (intrinsic vs extrinsic):
Clearly lay out the experimental setting.
Compare to baseline and previous results.
Perform error analysis.
Show utility in a real application (ideally).
Data Organization
Training:
Training data: used to learn model parameters.
Held-out data: used to tune additional parameters.
Development (dev) set: used to evaluate the system during development; avoid overfitting.
Test data: used for final, blind evaluation.
Typical division of data: 80/10/10 (tradeoffs; cross-validation).
Evaluating LMs
Extrinsic evaluation (aka in vivo): embed alternate models in a system and see which improves the overall application (MT, IR, …).
Intrinsic evaluation: a metric applied directly to the model, independent of the larger application, e.g. perplexity.
Why not just extrinsic?
Perplexity
Intuition: a better model will have a tighter fit to the test data, and will yield higher probability on the test data.
Formally, PP(W) = P(w1 w2 … wN)^(-1/N).
For bigrams: PP(W) = (Π P(wi|wi-1))^(-1/N).
Inversely related to the probability of the sequence: higher probability, lower perplexity.
Can be viewed as the average branching factor of the model.
Perplexity Example
Alphabet: 0,1,…,9; equiprobable: P(X) = 1/10.
PP(W) = ((1/10)^N)^(-1/N) = 10.
If the probability of 0 is higher, PP(W) will be lower.
Thinking about Perplexity
Given some vocabulary V with a uniform distribution, i.e. P(w) = 1/|V|.
Under a unigram LM, the perplexity is PP(W) = ((1/|V|)^N)^(-1/N) = |V|.
Perplexity is the effective branching factor of the language.
Perplexity and Entropy
Given that the cross-entropy of the language can be approximated from the test data (see the equation below), consider the perplexity equation:
PP(W) = P(W)^(-1/N) = 2^(-(1/N) log2 P(W)) = 2^(H(L,P))
where H is the entropy of the language L.
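Written out in standard notation (a reconstruction of the usual derivation, with the cross-entropy approximation made explicit):

\[
H(L,P) \;\approx\; -\tfrac{1}{N}\log_2 P(w_1 \ldots w_N),
\qquad
PP(W) \;=\; P(W)^{-1/N} \;=\; 2^{\log_2 P(W)^{-1/N}} \;=\; 2^{-\frac{1}{N}\log_2 P(W)} \;=\; 2^{H(L,P)}
\]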
Entropy
Information theoretic measure; measures the information in the grammar; conceptually, a lower bound on the number of bits needed to encode.
Entropy: H(X) = -Σx p(x) log2 p(x), where X is a random variable and p is its probability function.
E.g., 8 things: numbering them as a code => 3 bits per transmission.
Alternative: short codes for high-probability items, longer codes for lower-probability items, which can reduce the average number of bits.
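A small sketch of the definition (illustrative):

import math

def entropy(p):
    """H(X) = -sum_x p(x) * log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

print(entropy({i: 1 / 8 for i in range(8)}))   # 3.0 bits: a fixed 3-bit code suffices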
Computing Entropy
Picking horses (Cover and Thomas).
Send a message to identify a horse: 1 of 8.
If all horses are equally likely, p(i) = 1/8, and H = -Σi (1/8) log2 (1/8) = 3 bits.
If some horses are more likely: 1: 1/2; 2: 1/4; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64 each.
Then H = (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/16)(4) + 4*(1/64)(6) = 2 bits.
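A quick check of the two cases using the entropy function sketched above (uniform gives 3 bits, the skewed distribution gives 2 bits):

uniform_horses = {h: 1 / 8 for h in range(1, 9)}
skewed_horses = {1: 1/2, 2: 1/4, 3: 1/8, 4: 1/16, 5: 1/64, 6: 1/64, 7: 1/64, 8: 1/64}
print(entropy(uniform_horses))   # 3.0
print(entropy(skewed_horses))    # 2.0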
Entropy of a Sequence
Basic sequence: the per-word entropy of w1 … wn is (1/n) H(w1 … wn).
Entropy of a language: defined over infinite lengths, H(L) = lim n→∞ (1/n) H(w1 … wn).
Assume the language is stationary & ergodic.
Computing P(s): s is a sentence
Let s = w1 w2 … wn. Assume a bigram model.
P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
~ P(BOS) * P(w1|BOS) * P(w2|w1) * … * P(wn|wn-1) * P(EOS|wn)
Out-of-vocabulary (OOV) words: if an n-gram contains an OOV word, remove that n-gram from the computation and increment oov_count.
N = sent_leng + 1 - oov_count
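A minimal sketch of this per-sentence computation under a bigram model, with the OOV handling described above (the data structures and names are assumptions, not the assignment's actual scripts):

def sentence_logprob_oov(words, bigram_logprob, vocab):
    """Return (log2 P(s), N) for a bigram model with BOS/EOS padding.
    Bigrams containing an OOV word are dropped; N = sent_leng + 1 - oov_count.
    Assumes bigram_logprob maps (w1, w2) -> log2 P(w2|w1) for a smoothed model,
    so every in-vocabulary bigram has a log probability."""
    tokens = ["<s>"] + words + ["</s>"]
    oov_count = sum(1 for w in words if w not in vocab)
    logprob = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        if (w1 not in vocab and w1 != "<s>") or (w2 not in vocab and w2 != "</s>"):
            continue                        # skip n-grams containing an OOV word
        logprob += bigram_logprob[(w1, w2)]
    return logprob, len(words) + 1 - oov_count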
Computing P(s): s is a sentence
Let s = w1 w2 … wn. Assume a trigram model.
P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
~ P(w1|BOS) * P(w2|BOS w1) * … * P(wn|wn-2 wn-1) * P(EOS|wn-1 wn)
Out-of-vocabulary (OOV) words: if an n-gram contains an OOV word, remove that n-gram from the computation and increment oov_count.
N = sent_leng + 1 - oov_count
Computing Perplexity
PP(W) = 2^(-(1/N) log2 P(W)), where W is a set of m sentences: s1, s2, …, sm.
log P(W) = Σi log P(si)
N = word_count + sent_count - oov_count
Perplexity Model Comparison
Compare models with different history lengths.
Homework #4
Building Language Models
Step 1: Count ngrams.
Step 2: Build model: compute probabilities, either MLE or smoothed (Laplace, GT).
Step 3: Compute perplexity.
Steps 2 & 3 depend on model/smoothing choices.
Q1: Counting N-grams
ngram_count.* training_data ngram_count_file
Collect real counts from the training data; output ngrams and real counts c(w1), c(w1, w2), and c(w1, w2, w3).
Given a sentence: John called Mary
Insert BOS and EOS: <s> John called Mary </s>
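A hedged sketch of the counting step (the real ngram_count.* can be written in any language; the file handling and names here are illustrative, and whether <s>/</s> are counted as unigrams is an assumption):

import sys
from collections import Counter

def count_ngrams(lines, max_n=3):
    """Count 1-, 2-, and 3-grams with BOS/EOS markers added to each sentence."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][" ".join(tokens[i:i + n])] += 1
    return counts

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        counts = count_ngrams(f)
    with open(sys.argv[2], "w") as out:
        for n in (1, 2, 3):                                # unigrams, then bigrams, then trigrams
            for ngram, c in counts[n].most_common():       # decreasing count within each chunk
                out.write(f"{c}\t{ngram}\n")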
Q1: Output format (count, then key):
875 a
200 the book
20 thank you very
In chunks: unigrams, then bigrams, then trigrams. Sort in decreasing order of count within each chunk.
Q2: Create Language Model
build_lm.* ngram_count_file lm_file
Store the logprob of ngrams and other parameters in the lm.
There are actually three language models: P(w3), P(w3|w2), and P(w3|w1,w2).
The output file is in a modified ARPA format (see below).
Lines for n-grams are sorted by n-gram counts.
Modified ARPA Format
\data\
ngram 1: type = xx; token = yy
ngram 2: type = xx; token = yy
ngram 3: type = xx; token = yy
\1-grams:
count prob logprob w1
\2-grams:
count prob logprob w1 w2
\3-grams:
count prob logprob w1 w2 w3
# xx is the type count; yy is the token count
# for 1-grams, prob is P(w); for 2-grams, prob is P(w2|w1) and count is C(w1 w2)
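A rough sketch of emitting this format from the counts above (a guess at the exact layout based on the description, not the official solution; prob is the unsmoothed MLE estimate, and log base 10 follows the usual ARPA convention, which the assignment may override):

import math

def write_modified_arpa(counts, out):
    """counts: dict n -> Counter of space-joined n-grams, as built by count_ngrams.
    Writes the \\data\\ header plus 'count prob logprob ngram' lines per order."""
    totals = {n: sum(counts[n].values()) for n in counts}
    out.write("\\data\\\n")
    for n in sorted(counts):
        out.write(f"ngram {n}: type = {len(counts[n])}; token = {totals[n]}\n")
    for n in sorted(counts):
        out.write(f"\\{n}-grams:\n")
        for ngram, c in counts[n].most_common():           # sorted by n-gram count
            if n == 1:
                denom = totals[1]                          # P(w) = c(w) / total tokens
            else:
                hist = " ".join(ngram.split()[:-1])        # P(wn|history) = c(ngram) / c(history)
                denom = counts[n - 1][hist]
            prob = c / denom
            out.write(f"{c} {prob:.10f} {math.log10(prob):.10f} {ngram}\n")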
Q3: Calculating Perplexity
pp.* lm_file n test_file outfile
Compute perplexity for n-gram history n given the model:
sum = 0; count = 0
for each sentence s in test_file:
  for each word wi in s:
    if the n-gram wi-n+1 … wi exists in the model:
      compute P(wi | wi-n+1 … wi-1)
      sum += log2 P(wi | …)
      count++
total = -sum / count
pp(test_file) = 2^total
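A Python rendering of this loop (a sketch under assumptions: lm is a dict from n-gram tuples to log2 probabilities, and each test line is one tokenized sentence):

def compute_perplexity(lm, n, test_lines):
    """Follows the loop above: average negative log2 probability over scorable
    n-grams, then 2 ** total."""
    log_sum, count = 0.0, 0
    for line in test_lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for i in range(1, len(tokens)):
            ngram = tuple(tokens[max(0, i - n + 1):i + 1])   # wi plus up to n-1 words of history
            if ngram in lm:                                   # skip n-grams the model does not cover
                log_sum += lm[ngram]                          # log2 P(wi | history)
                count += 1
    total = -log_sum / count
    return 2 ** total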
Output format
Sent #1: Influential members of the House …
1: log P(Influential | <s>) = -inf (unknown word)
2: log P(members | Influential) = -inf (unseen ngrams)
4: log P(the | members of) = -0.673243382588536
1 sentence, 38 words, 9 OOVs
logprob=-82.8860891791949 ppl=721.341645452964
%%%%%%%%%%%%%%%%
sent_num=50 word_num=1175 oov_num=190
logprob=-2854.78157013778 ave_logprob=-2.75824306293506 pp=573.116699237283
Q4: Compute Perplexity
Compute perplexity for different values of n.