
1 Language Models & Smoothing Shallow Processing Techniques for NLP Ling570 October 19, 2011

2 Announcements Career exploration talk: Bill McNeill, Thursday (10/20), 2:30-3:30pm, Thomson 135 & Online (Treehouse URL). Treehouse meeting: Friday 10/21, 11-12, thesis topic brainstorming. GP Meeting: Friday 10/21, 3:30-5pm, PCAR 291 & Online (…/clmagrad)

3 Roadmap Ngram language models Constructing language models Generative language models Evaluation: Training and Testing Perplexity Smoothing: Laplace smoothing Good-Turing smoothing Interpolation & backoff

4 Ngram Language Models Independence assumptions moderate data needs: approximate the probability given all prior words by assuming a finite history Unigram: probability of a word in isolation Bigram: probability of a word given 1 previous word Trigram: probability of a word given 2 previous words N-gram approximation: P(wn | w1 … wn-1) ≈ P(wn | wn-N+1 … wn-1) Bigram sequence: P(w1 … wn) ≈ ∏_k P(wk | wk-1)

5 Berkeley Restaurant Project Sentences can you tell me about any good cantonese restaurants close by mid priced thai food is what im looking for tell me about chez panisse can you give me a listing of the kinds of food that are available im looking for a good place to eat breakfast when is caffe venezia open during the day

6 Bigram Counts Out of 9222 sentences. E.g., "I want" occurred 827 times.

7 Bigram Probabilities Divide bigram counts by prefix unigram counts to get probabilities.

8 Bigram Estimates of Sentence Probabilities P(<s> i want english food </s>) = P(i|<s>) * P(want|i) * P(english|want) * P(food|english) * P(</s>|food) = .000031
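The computation on this slide is easy to reproduce. Below is a minimal sketch (not the course scripts): it estimates bigram probabilities by MLE from a tiny made-up corpus, then scores a sentence exactly as above, multiplying bigram probabilities including the <s> and </s> transitions.

```python
# Minimal MLE bigram model on a toy corpus (the sentences are made up for illustration).
from collections import Counter

corpus = [
    "i want english food",
    "i want chinese food",
    "i want to eat",
]

BOS, EOS = "<s>", "</s>"
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = [BOS] + sent.split() + [EOS]
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p_bigram(w, prev):
    """MLE: P(w | prev) = C(prev w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(sent):
    """P(s) = product of bigram probabilities, including <s> and </s>."""
    toks = [BOS] + sent.split() + [EOS]
    p = 1.0
    for prev, w in zip(toks, toks[1:]):
        p *= p_bigram(w, prev)
    return p

print(p_sentence("i want english food"))
```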

9 Kinds of Knowledge P(english|want) = .0011 P(chinese|want) = .0065 P(to|want) = .66 P(eat|to) = .28 P(food|to) = 0 P(want|spend) = 0 P(i|<s>) = .25 What types of knowledge are captured by ngram models?

10 Kinds of Knowledge P(english|want) = .0011 P(chinese|want) = .0065 P(to|want) = .66 P(eat|to) = .28 P(food|to) = 0 P(want|spend) = 0 P(i|<s>) = .25 World knowledge What types of knowledge are captured by ngram models?

11 Kinds of Knowledge P(english|want) = .0011 P(chinese|want) = .0065 P(to|want) = .66 P(eat|to) = .28 P(food|to) = 0 P(want|spend) = 0 P(i|<s>) = .25 World knowledge Syntax What types of knowledge are captured by ngram models?

12 Kinds of Knowledge P(english|want) = .0011 P(chinese|want) = .0065 P(to|want) = .66 P(eat|to) = .28 P(food|to) = 0 P(want|spend) = 0 P(i|<s>) = .25 World knowledge Syntax Discourse What types of knowledge are captured by ngram models?

13 Probabilistic Language Generation Coin-flipping models A sentence is generated by a randomized algorithm The generator can be in one of several states Flip coins to choose the next state Flip other coins to decide which letter or word to output

14 Generated Language: Effects of N 1. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD

15 Generated Language: Effects of N 1. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD 2. First-order approximation: OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL

16 Generated Language: Effects of N 1. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD 2. First-order approximation: OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL 3. Second-order approximation: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

17 Word Models: Effects of N 1. First-order approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE

18 Word Models: Effects of N 1. First-order approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE 2. Second-order approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

19 Shakespeare

20 The Wall Street Journal is Not Shakespeare

21 Evaluation

22 Evaluation - General Evaluation crucial for NLP systems Required for most publishable results Should be integrated early Many factors:

23 Evaluation - General Evaluation crucial for NLP systems Required for most publishable results Should be integrated early Many factors: Data Metrics Prior results …..

24 Evaluation Guidelines Evaluate your system Use standard metrics Use (standard) training/dev/test sets Describing experiments: (Intrinsic vs Extrinsic)

25 Evaluation Guidelines Evaluate your system Use standard metrics Use (standard) training/dev/test sets Describing experiments: (Intrinsic vs Extrinsic) Clearly lay out experimental setting

26 Evaluation Guidelines Evaluate your system Use standard metrics Use (standard) training/dev/test sets Describing experiments: (Intrinsic vs Extrinsic) Clearly lay out experimental setting Compare to baseline and previous results Perform error analysis

27 Evaluation Guidelines Evaluate your system Use standard metrics Use (standard) training/dev/test sets Describing experiments: (Intrinsic vs Extrinsic) Clearly lay out experimental setting Compare to baseline and previous results Perform error analysis Show utility in real application (ideally)

28 Data Organization Training: Training data: used to learn model parameters

29 Data Organization Training: Training data: used to learn model parameters Held-out data: used to tune additional parameters

30 Data Organization Training: Training data: used to learn model parameters Held-out data: used to tune additional parameters Development (Dev) set: Used to evaluate system during development Avoid overfitting

31 Data Organization Training: Training data: used to learn model parameters Held-out data: used to tune additional parameters Development (Dev) set: Used to evaluate system during development Avoid overfitting Test data: Used for final, blind evaluation

32 Data Organization Training: Training data: used to learn model parameters Held-out data: used to tune additional parameters Development (Dev) set: Used to evaluate system during development Avoid overfitting Test data: Used for final, blind evaluation Typical division of data: 80/10/10 Tradeoffs Cross-validation
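As a concrete illustration of the 80/10/10 division mentioned above, here is a small sketch; the file name corpus.txt and the fixed random seed are assumptions for the example, not part of the slides.

```python
# Shuffle sentences and split 80/10/10 into train/dev/test.
import random

with open("corpus.txt", encoding="utf-8") as f:
    sentences = f.readlines()

random.seed(0)            # fixed seed so the split is reproducible
random.shuffle(sentences)

n = len(sentences)
train = sentences[: int(0.8 * n)]
dev   = sentences[int(0.8 * n): int(0.9 * n)]
test  = sentences[int(0.9 * n):]
```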

33 Evaluating LMs Extrinsic evaluation (aka in vivo) Embed alternate models in system See which improves overall application MT, IR, …

34 Evaluating LMs Extrinsic evaluation (aka in vivo) Embed alternate models in system See which improves overall application MT, IR, … Intrinsic evaluation: Metric applied directly to model Independent of larger application Perplexity

35 Evaluating LMs Extrinsic evaluation (aka in vivo) Embed alternate models in system See which improves overall application MT, IR, … Intrinsic evaluation: Metric applied directly to model Independent of larger application Perplexity Why not just extrinsic?

36 Perplexity

37 Intuition: A better model will have tighter fit to test data Will yield higher probability on test data

38 Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally, PP(W) = P(w1 w2 … wN)^(-1/N)

39 Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally, PP(W) = P(w1 w2 … wN)^(-1/N) = Nth root of 1/P(w1 w2 … wN)

40 Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally, PP(W) = P(w1 w2 … wN)^(-1/N) = Nth root of ∏_i 1/P(wi | w1 … wi-1) (by the chain rule)

41 Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally, PP(W) = P(w1 w2 … wN)^(-1/N) For bigrams: PP(W) = Nth root of ∏_i 1/P(wi | wi-1)

42 Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally, PP(W) = P(w1 w2 … wN)^(-1/N) For bigrams: PP(W) = Nth root of ∏_i 1/P(wi | wi-1) Inversely related to probability of sequence Higher probability → Lower perplexity

43 Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally, PP(W) = P(w1 w2 … wN)^(-1/N) For bigrams: PP(W) = Nth root of ∏_i 1/P(wi | wi-1) Inversely related to probability of sequence Higher probability → Lower perplexity Can be viewed as average branching factor of model

44 Perplexity Example Alphabet: 0,1,…,9 Equiprobable

45 Perplexity Example Alphabet: 0,1,…,9; Equiprobable: P(X)=1/10

46 Perplexity Example Alphabet: 0,1,…,9; Equiprobable: P(X)=1/10 PP(W) = ((1/10)^N)^(-1/N) = 10

47 Perplexity Example Alphabet: 0,1,…,9; Equiprobable: P(X)=1/10 PP(W) = ((1/10)^N)^(-1/N) = 10 If probability of 0 is higher, PP(W) will be

48 Perplexity Example Alphabet: 0,1,…,9; Equiprobable: P(X)=1/10 PP(W) = ((1/10)^N)^(-1/N) = 10 If probability of 0 is higher, PP(W) will be lower
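A quick way to check these numbers: under the uniform model every digit has probability 1/10, so any digit string has perplexity 10, while a model that puts more mass on 0 gives a 0-heavy string lower perplexity. The sketch below uses a made-up test string and a made-up skewed distribution.

```python
# Perplexity of a digit string under a uniform vs. a 0-heavy distribution.
import math

def perplexity(seq, prob):
    n = len(seq)
    log_p = sum(math.log2(prob[x]) for x in seq)
    return 2 ** (-log_p / n)

uniform = {str(d): 0.1 for d in range(10)}
skewed  = {str(d): (0.91 if d == 0 else 0.01) for d in range(10)}  # sums to 1.0

seq = "0003000200010000"          # mostly zeros (made-up example)
print(perplexity(seq, uniform))   # ~10 (up to float rounding)
print(perplexity(seq, skewed))    # < 10
```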

49 Thinking about Perplexity Given some vocabulary V with a uniform distribution I.e. P(w) = 1/|V|

50 Thinking about Perplexity Given some vocabulary V with a uniform distribution I.e. P(w) = 1/|V| Under a unigram LM, the perplexity is PP(W) = P(w1 w2 … wN)^(-1/N)

51 Thinking about Perplexity Given some vocabulary V with a uniform distribution I.e. P(w) = 1/|V| Under a unigram LM, the perplexity is PP(W) = P(w1 w2 … wN)^(-1/N) = ((1/|V|)^N)^(-1/N) = |V|

52 Thinking about Perplexity Given some vocabulary V with a uniform distribution I.e. P(w) = 1/|V| Under a unigram LM, the perplexity is PP(W) = |V| Perplexity is the effective branching factor of the language

53 Perplexity and Entropy Given that x = 2^(log2 x), consider the perplexity equation: PP(W) = P(W)^(-1/N) =

54 Perplexity and Entropy Given that x = 2^(log2 x), consider the perplexity equation: PP(W) = P(W)^(-1/N) = 2^(log2 P(W)^(-1/N))

55 Perplexity and Entropy Given that x = 2^(log2 x), consider the perplexity equation: PP(W) = P(W)^(-1/N) = 2^(log2 P(W)^(-1/N)) = 2^(-(1/N) log2 P(W))

56 Perplexity and Entropy Given that x = 2^(log2 x), consider the perplexity equation: PP(W) = P(W)^(-1/N) = 2^(log2 P(W)^(-1/N)) = 2^(-(1/N) log2 P(W)) = 2^(H(L,P)) Where H is the entropy of the language L
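Written out in one place, the derivation built up over slides 53-56 is the following (restated here for readability; H(L,P) is read, as on the slide, as the entropy term estimated from W):

```latex
\begin{align*}
PP(W) &= P(w_1 w_2 \ldots w_N)^{-1/N} \\
      &= 2^{\log_2 P(w_1 \ldots w_N)^{-1/N}} \\
      &= 2^{-\frac{1}{N}\log_2 P(w_1 \ldots w_N)} \\
      &= 2^{H(L,P)}
\end{align*}
```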

57 Entropy Information theoretic measure Measures information in grammar Conceptually, lower bound on # bits to encode

58 Entropy Information theoretic measure Measures information in grammar Conceptually, lower bound on # bits to encode Entropy: H(X) = -∑_x p(x) log2 p(x), where X is a random variable and p is its probability function

59 Entropy Information theoretic measure Measures information in grammar Conceptually, lower bound on # bits to encode Entropy: H(X) = -∑_x p(x) log2 p(x), where X is a random variable and p is its probability function E.g. 8 things: numbering them as a code => 3 bits/transmission Alternatively, short codes for high-probability items and longer codes for low-probability ones can reduce the average number of bits

60 Computing Entropy Picking horses (Cover and Thomas) Send message: identify horse - 1 of 8 If all horses equally likely, p(i)

61 Computing Entropy Picking horses (Cover and Thomas) Send message: identify horse - 1 of 8 If all horses equally likely, p(i) = 1/8

62 Computing Entropy Picking horses (Cover and Thomas) Send message: identify horse - 1 of 8 If all horses equally likely, p(i) = 1/8 Some horses more likely: 1: ½; 2: ¼; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64

63 Computing Entropy Picking horses (Cover and Thomas) Send message: identify horse - 1 of 8 If all horses equally likely, p(i) = 1/8 Some horses more likely: 1: ½; 2: ¼; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64 H(X) = ½·1 + ¼·2 + 1/8·3 + 1/16·4 + 4·(1/64·6) = 2 bits (vs. 3 bits in the equiprobable case)
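The horse-race numbers are easy to check in code: the sketch below computes H(X) = -∑ p(x) log2 p(x) for the equiprobable case (3 bits) and for the skewed distribution above (2 bits).

```python
# Entropy of the two horse-race distributions from Cover & Thomas.
import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8
skewed  = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]

print(entropy(uniform))  # 3.0 bits
print(entropy(skewed))   # 2.0 bits
```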

64 Entropy of a Sequence Basic sequence: H(w1 w2 … wn), with per-word entropy rate (1/n) H(w1 w2 … wn) Entropy of language: take the limit over infinite lengths, H(L) = lim n→∞ (1/n) H(w1 w2 … wn) Assume stationary & ergodic

65 Computing P(s): s is a sentence Let s = w1 w2 … wn Assume a bigram model P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)

66 Computing P(s): s is a sentence Let s = w1 w2 … wn Assume a bigram model P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS) ~ P(BOS)*P(w1|BOS)*P(w2|w1)*…*P(wn|wn-1)*P(EOS|wn) Out-of-vocabulary words (OOV): If an n-gram contains an OOV word,

67 Computing P(s): s is a sentence Let s = w1 w2 … wn Assume a bigram model P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS) ~ P(BOS)*P(w1|BOS)*P(w2|w1)*…*P(wn|wn-1)*P(EOS|wn) Out-of-vocabulary words (OOV): If an n-gram contains an OOV word, Remove the n-gram from the computation Increment oov_count

68 Computing P(s): s is a sentence Let s = w1 w2 … wn Assume a trigram model P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS) ~ P(w1|BOS)*P(w2|w1 BOS)*…*P(wn|wn-2 wn-1)*P(EOS|wn-1 wn) Out-of-vocabulary words (OOV): If an n-gram contains an OOV word, Remove the n-gram from the computation Increment oov_count N = sent_leng + 1 – oov_count
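The OOV handling described on slides 66-68 can be sketched as follows. This is one reading of the slides, not the reference implementation: an OOV position is a predicted word outside the vocabulary, any n-gram containing an OOV word is dropped from the sum, and N = sentence length + 1 - oov_count. The names vocab and trigram_prob are assumptions.

```python
import math

BOS, EOS = "<s>", "</s>"

def sentence_logprob(words, vocab, trigram_prob):
    """Return (log2 P(s), oov_count, N) for one sentence under a trigram model.
    vocab is assumed to contain BOS/EOS; trigram_prob maps (w1, w2, w3) to P(w3|w1,w2)."""
    toks = [BOS, BOS] + words + [EOS]
    logprob, oov = 0.0, 0
    for i in range(2, len(toks)):
        w1, w2, w3 = toks[i - 2], toks[i - 1], toks[i]
        if w3 not in vocab:
            oov += 1                      # unknown predicted word
            continue
        if w1 not in vocab or w2 not in vocab:
            continue                      # history contains an OOV word: drop the n-gram
        p = trigram_prob.get((w1, w2, w3))
        if p is None:
            continue                      # unseen n-gram (cf. the pp sketch on slide 80)
        logprob += math.log2(p)
    n = len(words) + 1 - oov              # N = sent_leng + 1 - oov_count
    return logprob, oov, n
```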

69 Computing Perplexity PP(W) = 2^(-(1/N) log2 P(W)) Where W is a set of m sentences: s1, s2, …, sm log2 P(W)

70 Computing Perplexity PP(W) = 2^(-(1/N) log2 P(W)) Where W is a set of m sentences: s1, s2, …, sm log2 P(W) = ∑_i log2 P(si)

71 Computing Perplexity PP(W) = 2^(-(1/N) log2 P(W)) Where W is a set of m sentences: s1, s2, …, sm log2 P(W) = ∑_i log2 P(si) N

72 Computing Perplexity PP(W) = 2^(-(1/N) log2 P(W)) Where W is a set of m sentences: s1, s2, …, sm log2 P(W) = ∑_i log2 P(si) N = word_count + sent_count – oov_count
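Putting the pieces together, corpus-level perplexity follows directly. The sketch below reuses the sentence_logprob function from the earlier example and applies N = word_count + sent_count - oov_count as on this slide.

```python
def corpus_perplexity(sentences, vocab, trigram_prob):
    """sentences: list of token lists; returns 2^(-(1/N) * log2 P(W))."""
    total_logprob, total_oov, word_count = 0.0, 0, 0
    for words in sentences:
        lp, oov, _ = sentence_logprob(words, vocab, trigram_prob)
        total_logprob += lp
        total_oov += oov
        word_count += len(words)
    n = word_count + len(sentences) - total_oov   # N = word_count + sent_count - oov_count
    return 2 ** (-total_logprob / n)
```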

73 Perplexity Model Comparison Compare models with different history

74 Homework #4

75 Building Language Models Step 1: Count ngrams Step 2: Build model – Compute probabilities MLE Smoothed: Laplace, GT Step 3: Compute perplexity Steps 2 & 3 depend on model/smoothing choices

76 Q1: Counting N-grams Collect real counts from the training data: ngram_count.* training_data ngram_count_file Output ngrams and real counts c(w1), c(w1, w2), and c(w1, w2, w3). Given a sentence: John called Mary Insert BOS and EOS: <s> John called Mary </s>
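A minimal counting sketch for Q1 (an illustration, not the ngram_count.* reference script): insert <s> and </s>, collect unigram, bigram, and trigram counts, and write them out as count followed by the n-gram, chunk by chunk, in decreasing order of count.

```python
# Count 1-, 2-, and 3-grams over a file of whitespace-tokenized sentences.
from collections import Counter
import sys

def count_ngrams(lines):
    counts = {1: Counter(), 2: Counter(), 3: Counter()}
    for line in lines:
        toks = ["<s>"] + line.split() + ["</s>"]
        for n in (1, 2, 3):
            counts[n].update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return counts

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:      # training_data
        counts = count_ngrams(f)
    for n in (1, 2, 3):                                  # unigrams, then bigrams, then trigrams
        for ngram, c in counts[n].most_common():         # decreasing count within each chunk
            print(c, " ".join(ngram))
```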

77 Q1: Output Format: count key, e.g.: 875 a … 200 the book … 20 thank you very In chunks – unigrams, then bigrams, then trigrams Sort in decreasing order of count within each chunk

78 Q2: Create Language Model build_lm.* ngram_count_file lm_file Store the logprob of ngrams and other parameters in the lm There are actually three language models: P(w3), P(w3|w2) and P(w3|w1,w2) The output file is in a modified ARPA format (see next slide) Lines for n-grams are sorted by n-gram counts

79 Modified ARPA Format
\data\
ngram 1: type = xx; token = yy
ngram 2: type = xx; token = yy
ngram 3: type = xx; token = yy
\1-grams:
count prob logprob w1
\2-grams:
count prob logprob w1 w2
\3-grams:
count prob logprob w1 w2 w3
# xx is the type count; yy is the token count
# prob is P(w1) for 1-grams, P(w2|w1) for 2-grams, P(w3|w1,w2) for 3-grams
# count is the n-gram count, e.g. C(w1 w2) on a 2-gram line; logprob is the log of prob
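Here is a sketch of writing the header and the n-gram sections in roughly this format. The exact spacing, field order, and log base are my assumptions from the slide, not a specification, and counts/probs are assumed to be dicts keyed by n-gram tuples.

```python
import math

def write_lm(counts, probs, out_path):
    """counts, probs: dicts mapping n-gram tuples to counts and probabilities (> 0)."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\\data\\\n")
        for n in (1, 2, 3):
            grams = [g for g in counts if len(g) == n]
            out.write(f"ngram {n}: type = {len(grams)}; token = {sum(counts[g] for g in grams)}\n")
        for n in (1, 2, 3):
            out.write(f"\n\\{n}-grams:\n")
            # lines sorted by n-gram count, as the previous slide states
            for g in sorted((g for g in counts if len(g) == n), key=counts.get, reverse=True):
                p = probs[g]
                # log10 is an assumption; the slide only says "logprob"
                out.write(f"{counts[g]} {p:.10g} {math.log10(p):.10g} {' '.join(g)}\n")
```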

80 Q3: Calculating Perplexity pp.* lm_file n test_file outfile Compute perplexity for the n-gram history given the model:
sum = 0; count = 0
for each sentence s in test_file:
  for each word wi in s:
    if the n-gram for history n exists:
      compute P(wi | wi-n+1 … wi-1)
      sum += log2 P(wi | …)
      count++
total = -sum / count
pp(test_file) = 2^total
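A runnable version of this loop, under the assumption that the model is simply a dict mapping (history…, word) tuples to probabilities; it mirrors the pseudocode above, not the reference pp.* program.

```python
import math

def perplexity_file(test_sents, n, ngram_prob):
    """test_sents: list of token lists; ngram_prob maps (w_{i-n+1}, ..., w_i) -> probability."""
    total, count = 0.0, 0
    for words in test_sents:
        toks = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(n - 1, len(toks)):
            key = tuple(toks[i - n + 1: i + 1])   # the n-gram ending at position i
            p = ngram_prob.get(key)
            if p is None:                         # n-gram for this history does not exist: skip
                continue
            total += math.log2(p)
            count += 1
    return 2 ** (-total / count) if count else float("inf")
```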

81 Output format Sent #1: Influential members of the House …
1: log P(Influential | <s>) = -inf (unknown word)
2: log P(members | Influential) = -inf (unseen ngrams)
4: log P(the | members of) = -0.673243382588536
1 sentence, 38 words, 9 OOVs, logprob=-82.8860891791949, ppl=721.341645452964
%%%%%%%%%%%%%%%%
sent_num=50 word_num=1175 oov_num=190 logprob=-2854.78157013778 ave_logprob=-2.75824306293506 pp=573.116699237283

82 Q4: Compute Perplexity Compute perplexity for different n

