
Presentation on theme: "Language Modeling 1. Roadmap (for next two classes)  Review LMs  What are they?  How (and where) are they used?  How are they trained?  Evaluation."— Presentation transcript:

1 Language Modeling 1

2 Roadmap (for next two classes)  Review LMs  What are they?  How (and where) are they used?  How are they trained?  Evaluation metrics  Entropy  Perplexity  Smoothing  Good-Turing  Backoff and Interpolation  Absolute Discounting  Kneser-Ney 2

3 What is a language model? A language model assigns a probability to a sequence of words, viewed as signals transmitted over a communication channel (Claude Shannon, Information Theory) Lots of ties to Cryptography and Information Theory We most often use n-gram models 3

4 Applications 4

5  What word sequence (English) does this phoneme sequence correspond to: AY D L AY K T UW  Phoneme key (example word → translation): AH (hut → HH AH T), AY (hide → HH AY D), B (be → B IY), CH (cheese → CH IY Z), D (dee → D IY), EH (Ed → EH D), IY (eat → IY T), K (key → K IY), L (lee → L IY), N (knee → N IY), P (pee → P IY), S (sea → S IY), T (tea → T IY), UW (two → T UW), Z (zee → Z IY) 5

6 Applications  What word sequence (English) does this phoneme sequence correspond to: AY D L AY K T UW R EH K AH N AY S B IY CH  (phoneme key as on slide 5) 6

7 Applications  What word sequence (English) does this phoneme sequence correspond to: AY D L AY K T UW R EH K AH N AY S B IY CH  Goal of LM: P(“I’d like to recognize speech”) > P(“I’d like to wreck a nice beach”)  (phoneme key as on slide 5) 7

8 Why n-gram LMs? 8

9 Language Model Evaluation Metrics 9

10 Entropy and perplexity  Entropy measures the information content in a distribution == the uncertainty 10

11 Entropy and perplexity  Entropy measures the information content in a distribution == the uncertainty  If I can predict the next word before it comes, there’s no information content 11

12 Entropy and perplexity  Entropy measures the information content in a distribution == the uncertainty  If I can predict the next word before it comes, there’s no information content  Zero uncertainty means the signal has zero information  How many bits of additional information do I need to guess the next symbol? 12

13 Entropy and perplexity  Entropy measures the information content in a distribution == the uncertainty  If I can predict the next word before it comes, there’s no information content  Zero uncertainty means the signal has zero information  How many bits of additional information do I need to guess the next symbol?  Perplexity is the average branching factor  If message has zero information, 13

14 Entropy and perplexity  Entropy measures the information content in a distribution == the uncertainty  If I can predict the next word before it comes, there’s no information content  Zero uncertainty means the signal has zero information  How many bits of additional information do I need to guess the next symbol?  Perplexity is the average branching factor  If message has zero information, then branching factor is 1 14

15 Entropy and perplexity  Entropy measures the information content in a distribution == the uncertainty  If I can predict the next word before it comes, there’s no information content  Zero uncertainty means the signal has zero information  How many bits of additional information do I need to guess the next symbol?  Perplexity is the average branching factor  If message has zero information, then branching factor is 1  If message needs one bit, branching factor is 2  If message needs two bits, branching factor is 4 15

16 Entropy and perplexity  Entropy measures the information content in a distribution == the uncertainty  If I can predict the next word before it comes, there’s no information content  Zero uncertainty means the signal has zero information  How many bits of additional information do I need to guess the next symbol?  Perplexity is the average branching factor  If message has zero information, then branching factor is 1  If message needs one bit, branching factor is 2  If message needs two bits, branching factor is 4  Entropy and perplexity measure the same thing (uncertainty / information content) with different scales 16

17-19 Information in a fair coin flip (table filled in over three slides)  event space: heads p = 0.5 (0.5 bits), tails p = 0.5 (0.5 bits)  total: probability 1, entropy 1 bit, perplexity 2

20-23 Information in a single fair die (table filled in over four slides)  event space: each face 1-6 has p = 1/6 (0.43 bits each)  total: probability 1, entropy ≈ 2.58 bits, perplexity 6

24 Information in sum of two dice?  event space (probability, entropy contribution in bits): 2 (1/36, 0.14), 3 (1/18, 0.23), 4 (1/12, 0.30), 5 (1/9, 0.35), 6 (5/36, 0.40), 7 (1/6, 0.43), 8 (5/36, 0.40), 9 (1/9, 0.35), 10 (1/12, 0.30), 11 (1/18, 0.23), 12 (1/36, 0.14)  total: probability 1, entropy 3.27 bits, perplexity 9.68 24
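The entropy and perplexity figures in the coin, die, and two-dice tables follow directly from the definitions (H = -Σ p log2 p, perplexity = 2^H). A minimal Python sketch that reproduces them; the helper names are illustrative, not from the slides:

```python
import math
from fractions import Fraction

def entropy(dist):
    """Entropy in bits: H = -sum_x p(x) * log2(p(x))."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def perplexity(dist):
    """Perplexity = 2^H, the average branching factor."""
    return 2 ** entropy(dist)

coin = {"heads": 0.5, "tails": 0.5}
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Distribution of the sum of two fair dice.
two_dice = {}
for a in range(1, 7):
    for b in range(1, 7):
        two_dice[a + b] = two_dice.get(a + b, Fraction(0)) + Fraction(1, 36)

print(entropy(coin), perplexity(coin))          # 1.0 bit, perplexity 2.0
print(entropy(die), perplexity(die))            # ~2.58 bits, perplexity ~6.0
print(entropy(two_dice), perplexity(two_dice))  # ~3.27 bits, perplexity ~9.68
```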

25 Entropy of a distribution 25

26 Entropy of a distribution 26

27 Computing Entropy 27

28 Computing Entropy  H(X) = -Σx p(x) log2 p(x), where -log2 p(x) is the ideal code length for this symbol 28

29 Computing Entropy 29 Ideal code length for this symbol

30 Entropy example 30

31 Entropy example 31

32 Perplexity  Perplexity = 2^H(X), i.e. 2 raised to the entropy: the average branching factor 32

33 BIG SWITCH: Cross entropy and Language model perplexity 33

34 The Train/Test Split and Entropy 34

35 Cross entropy 35

36 Cross entropy, formally  H(p, q) = -Σx p(x) log2 q(x)  For a test corpus w1 ... wN scored by model q, this is approximated by -(1/N) Σi log2 q(wi | w1 ... wi-1) 36

37 Language model perplexity  Recipe:  Train a language model on training data  Get negative logprobs of test data, compute average  Exponentiate!  Perplexity correlates rather well with:  Speech recognition error rates  MT quality metrics  LM Perplexities for word-based models are normally between say 50 and 1000  Need to drop perplexity by a significant fraction (not absolute amount) to make a visible impact 37
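A minimal sketch of that recipe, assuming an already-trained model exposes a conditional probability function (the names here are illustrative, not tied to any particular toolkit):

```python
import math

def perplexity(model_prob, test_tokens):
    """Perplexity = 2 ** (average negative log2-probability per token).

    model_prob(history, word) -> P(word | history), assumed to come from
    a language model trained on separate training data.
    """
    total_neg_logprob = 0.0
    history = []
    for word in test_tokens:
        p = model_prob(tuple(history), word)
        total_neg_logprob += -math.log2(p)
        history.append(word)
    avg = total_neg_logprob / len(test_tokens)
    return 2 ** avg  # exponentiate; the base must match the base of the log
```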

38 Parameter estimation and smoothing 38

39 Tasks 39

40 Tasks 40

41 Tasks 41

42 Parameter estimation  Observations: HHTHTHTTTHTH; TTTTTTTTTTTT; HHHTHHHTHHHH  Estimated parameters (P(Heads), P(Tails)): (0.00, 1.00); (0.50, 0.50); (0.75, 0.25) 42
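For the coin, maximum-likelihood estimation is just relative frequency. A tiny illustrative sketch:

```python
def mle_coin(flips):
    """MLE parameters: P(Heads) = #H / #flips, P(Tails) = #T / #flips."""
    n = len(flips)
    heads = flips.count("H")
    return heads / n, (n - heads) / n

print(mle_coin("HHTHTHTTTHTH"))  # (0.5, 0.5)
print(mle_coin("TTTTTTTTTTTT"))  # (0.0, 1.0)
```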

43 Parameter estimation techniques 43

44 Maximum Likelihood has problems :/ 44

45 Smoothing 45

46 Smoothing techniques 46

47 Mixtures / interpolation 47

48 Laplace as a mixture 48

49 BERP Corpus Bigrams  Original bigram probabilities

50 BERP Smoothed Bigrams  Smoothed bigram probabilities from the BERP corpus

51 Laplace Smoothing Example  Consider the case where |V| = 100K  C(bigram w1 w2) = 10; C(trigram w1 w2 w3) = 9

52 Laplace Smoothing Example  Consider the case where |V| = 100K  C(bigram w1 w2) = 10; C(trigram w1 w2 w3) = 9  P_MLE =

53 Laplace Smoothing Example  Consider the case where |V| = 100K  C(bigram w1 w2) = 10; C(trigram w1 w2 w3) = 9  P_MLE = 9/10 = 0.9  P_LAP =

54 Laplace Smoothing Example  Consider the case where |V| = 100K  C(bigram w1 w2) = 10; C(trigram w1 w2 w3) = 9  P_MLE = 9/10 = 0.9  P_LAP = (9+1)/(10+100K) ≈ 0.0001

55 Laplace Smoothing Example  Consider the case where |V| = 100K  C(bigram w1 w2) = 10; C(trigram w1 w2 w3) = 9  P_MLE = 9/10 = 0.9  P_LAP = (9+1)/(10+100K) ≈ 0.0001  Too much probability mass ‘shaved off’ for zeroes  Too sharp a change in probabilities  Problematic in practice
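The arithmetic above as a quick check; a sketch where the counts and |V| come straight from the slide:

```python
V = 100_000      # vocabulary size |V|
c_bigram = 10    # C(w1 w2)
c_trigram = 9    # C(w1 w2 w3)

p_mle = c_trigram / c_bigram                  # 0.9
p_laplace = (c_trigram + 1) / (c_bigram + V)  # ~0.0001
print(p_mle, p_laplace)
```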

56 Add-δ Smoothing  Problem: Adding 1 moves too much probability mass

57 Add-δ Smoothing  Problem: Adding 1 moves too much probability mass  Proposal: Add smaller fractional mass δ  P_add-δ(wi | wi-1)

58 Add-δ Smoothing  Problem: Adding 1 moves too much probability mass  Proposal: Add smaller fractional mass δ  P_add-δ(wi | wi-1) = (C(wi-1 wi) + δ) / (C(wi-1) + δ|V|)  Issues:

59 Add-δ Smoothing  Problem: Adding 1 moves too much probability mass  Proposal: Add smaller fractional mass δ  P_add-δ(wi | wi-1) = (C(wi-1 wi) + δ) / (C(wi-1) + δ|V|)  Issues:  Need to pick δ  Still performs badly
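A sketch of the add-δ estimate, with illustrative count-dictionary names; δ itself still has to be tuned on held-out data, and as the slide notes the method performs poorly in practice:

```python
def p_add_delta(w, w_prev, bigram_counts, unigram_counts, vocab_size, delta=0.01):
    """P_add-delta(w | w_prev) = (C(w_prev w) + d) / (C(w_prev) + d * |V|)."""
    numerator = bigram_counts.get((w_prev, w), 0) + delta
    denominator = unigram_counts.get(w_prev, 0) + delta * vocab_size
    return numerator / denominator
```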

60 Good-Turing Smoothing  New idea: Use counts of things you have seen to estimate those you haven’t

61 Good-Turing Smoothing  New idea: Use counts of things you have seen to estimate those you haven’t  Good-Turing approach: Use frequency of singletons to re-estimate frequency of zero-count n-grams

62 Good-Turing Smoothing  New idea: Use counts of things you have seen to estimate those you haven’t  Good-Turing approach: Use frequency of singletons to re-estimate frequency of zero-count n-grams  Notation: N c is the frequency of frequency c  Number of ngrams which appear c times  N 0 : # ngrams of count 0; N 1: # of ngrams of count 1

63 Good-Turing Smoothing  Estimate probability of things which occur c times with the probability of things which occur c+1 times: c* = (c+1) · N_{c+1} / N_c

64 Good-Turing Josh Goodman Intuition  Imagine you are fishing  There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass  You have caught  10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish  How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? Slide adapted from Josh Goodman, Dan Jurafsky

65 Good-Turing Josh Goodman Intuition  Imagine you are fishing  There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass  You have caught  10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish  How likely is it that the next fish caught is from a new species (one not seen in our previous catch)?  3/18  Assuming so, how likely is it that next species is trout? Slide adapted from Josh Goodman, Dan Jurafsky

66 Good-Turing Josh Goodman Intuition  Imagine you are fishing  There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass  You have caught  10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish  How likely is it that the next fish caught is from a new species (one not seen in our previous catch)?  3/18  Assuming so, how likely is it that next species is trout?  Must be less than 1/18 Slide adapted from Josh Goodman, Dan Jurafsky
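The fishing numbers follow from the Good-Turing re-estimate c* = (c+1) · N_{c+1} / N_c and P(unseen) = N_1 / N. A small sketch with the catch counts from the slide (variable names illustrative):

```python
from collections import Counter

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())                  # 18 fish caught
Nc = Counter(catch.values())             # frequency of frequencies: N1=3, N2=1, N3=1, N10=1

p_new_species = Nc[1] / N                # 3/18: mass reserved for unseen species
c_star_trout = (1 + 1) * Nc[2] / Nc[1]   # c* for count-1 species: 2 * 1/3 ≈ 0.67
p_trout = c_star_trout / N               # ≈ 0.037, indeed less than 1/18 ≈ 0.056
print(p_new_species, p_trout)
```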

67 GT Fish Example

68 Bigram Frequencies of Frequencies and GT Re-estimates

69 Good-Turing Smoothing  N-gram counts to conditional probability  P(wi | wi-1) = c*(wi-1 wi) / C(wi-1), with the adjusted count c* from the GT estimate

70 Backoff and Interpolation  Another really useful source of knowledge  If we are estimating:  trigram p(z|x,y)  but count(xyz) is zero  Use info from:

71 Backoff and Interpolation  Another really useful source of knowledge  If we are estimating:  trigram p(z|x,y)  but count(xyz) is zero  Use info from:  Bigram p(z|y)  Or even:  Unigram p(z)

72 Backoff and Interpolation  Another really useful source of knowledge  If we are estimating:  trigram p(z|x,y)  but count(xyz) is zero  Use info from:  Bigram p(z|y)  Or even:  Unigram p(z)  How to combine this trigram, bigram, unigram info in a valid fashion?

73 Backoff Vs. Interpolation  Backoff: use trigram if you have it, otherwise bigram, otherwise unigram

74 Backoff Vs. Interpolation  Backoff: use trigram if you have it, otherwise bigram, otherwise unigram  Interpolation: always mix all three

75 Interpolation  Simple interpolation

76 Interpolation  Simple interpolation: P_interp(z|x,y) = λ1 P(z|x,y) + λ2 P(z|y) + λ3 P(z), with λ1 + λ2 + λ3 = 1  Lambdas conditional on context:  Intuition: Higher weight on more frequent n-grams 76
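A sketch of simple interpolation with fixed (not context-conditioned) lambdas; the component probability functions are assumed given, and the names are illustrative:

```python
def interpolated_prob(z, y, x, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P_interp(z | x, y) = l1*P(z) + l2*P(z | y) + l3*P(z | x, y), with l1+l2+l3 = 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(z) + l2 * p_bi(z, y) + l3 * p_tri(z, x, y)
```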

77 How to Set the Lambdas?  Use a held-out, or development, corpus  Choose lambdas which maximize the probability of some held-out data  I.e. fix the N-gram probabilities  Then search for lambda values  That when plugged into previous equation  Give largest probability for held-out set  Can use EM to do this search
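EM is the usual way to fit the lambdas, but the idea can be shown with a brute-force grid search over the simplex that keeps whichever setting gives the held-out data the highest log-probability. Everything below is an illustrative sketch with assumed names:

```python
import math
from itertools import product

def heldout_logprob(lambdas, heldout_trigrams, p_uni, p_bi, p_tri):
    """Log-probability of held-out (x, y, z) trigrams under the interpolated model."""
    l1, l2, l3 = lambdas
    total = 0.0
    for x, y, z in heldout_trigrams:
        p = l1 * p_uni(z) + l2 * p_bi(z, y) + l3 * p_tri(z, x, y)
        total += math.log(max(p, 1e-12))  # floor only to keep the sketch from log(0)
    return total

def best_lambdas(heldout_trigrams, p_uni, p_bi, p_tri, step=0.1):
    """Grid search over (l1, l2, l3) with l1 + l2 + l3 = 1."""
    grid = [i * step for i in range(int(1 / step) + 1)]
    candidates = [(l1, l2, 1.0 - l1 - l2)
                  for l1, l2 in product(grid, grid) if l1 + l2 <= 1.0 + 1e-9]
    return max(candidates,
               key=lambda ls: heldout_logprob(ls, heldout_trigrams, p_uni, p_bi, p_tri))
```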

78 Katz Backoff

79 Katz Backoff  Note: We used P* (discounted probabilities) and α weights on the backoff values  Why not just use regular MLE estimates?

80 Katz Backoff  Note: We used P* (discounted probabilities) and α weights on the backoff values  Why not just use regular MLE estimates?  Sum over all w i in n-gram context  If we back off to lower n-gram?

81 Katz Backoff  Note: We used P* (discounted probabilities) and α weights on the backoff values  Why not just use regular MLE estimates?  Sum over all w i in n-gram context  If we back off to lower n-gram?  Too much probability mass  > 1

82 Katz Backoff  Note: We used P* (discounted probabilities) and α weights on the backoff values  Why not just use regular MLE estimates?  Sum over all w i in n-gram context  If we back off to lower n-gram?  Too much probability mass  > 1  Solution:  Use P* discounts to save mass for lower order ngrams  Apply α weights to make sure sum to amount saved  Details in 4.7.1
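A bigram-to-unigram sketch of the bookkeeping: discount observed bigrams to get P*, then spread the reserved mass over unseen continuations in proportion to a renormalized unigram model via α. For simplicity this uses a fixed absolute discount rather than Good-Turing counts, so it illustrates the α/P* mechanics, not Katz's exact estimator; all names are illustrative.

```python
def backoff_bigram_prob(w, v, bigram_counts, unigram_counts, d=0.5):
    """P(w | v): discounted bigram estimate P* if (v, w) was seen,
    otherwise alpha(v) times the unigram probability.

    bigram_counts: {(prev, word): count}; unigram_counts: {word: count}.
    A fixed discount d stands in for Good-Turing discounting.
    """
    total = sum(unigram_counts.values())

    def p_uni(u):
        return unigram_counts[u] / total

    # Everything observed after context v, and the count of v as a context.
    followers = {u: c for (a, u), c in bigram_counts.items() if a == v and c > 0}
    c_v = sum(followers.values())

    def p_star(u):  # discounted bigram probability
        return max(followers.get(u, 0) - d, 0.0) / c_v

    if w in followers:
        return p_star(w)

    reserved = 1.0 - sum(p_star(u) for u in followers)    # mass saved by discounting
    unseen_mass = 1.0 - sum(p_uni(u) for u in followers)  # unigram mass of unseen continuations
    alpha = reserved / unseen_mass
    return alpha * p_uni(w)
```

By construction the probabilities over all continuations of v still sum to 1: the seen words get exactly the discounted mass, and α hands the reserved remainder to the unseen ones.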

83 Toolkits  Two major language modeling toolkits  SRILM  Cambridge-CMU toolkit

84 Toolkits  Two major language modeling toolkits  SRILM  Cambridge-CMU toolkit  Publicly available, similar functionality  Training: Create language model from text file  Decoding: Computes perplexity/probability of text

85 OOV words: <UNK> word  Out Of Vocabulary = OOV words

86 OOV words: <UNK> word  Out Of Vocabulary = OOV words  We don’t use GT smoothing for these

87 OOV words: <UNK> word  Out Of Vocabulary = OOV words  We don’t use GT smoothing for these  Because GT assumes we know the number of unseen events  Instead: create an unknown word token <UNK>

88 OOV words: <UNK> word  Out Of Vocabulary = OOV words  We don’t use GT smoothing for these  Because GT assumes we know the number of unseen events  Instead: create an unknown word token <UNK>  Training of <UNK> probabilities  Create a fixed lexicon L of size V  At text normalization phase, any training word not in L changed to <UNK>  Now we train its probabilities like a normal word

89 OOV words: <UNK> word  Out Of Vocabulary = OOV words  We don’t use GT smoothing for these  Because GT assumes we know the number of unseen events  Instead: create an unknown word token <UNK>  Training of <UNK> probabilities  Create a fixed lexicon L of size V  At text normalization phase, any training word not in L changed to <UNK>  Now we train its probabilities like a normal word  At decoding time  If text input: Use <UNK> probabilities for any word not in training
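A sketch of the <UNK> recipe above. Choosing the lexicon by frequency is an assumption made here for concreteness; the slide only says "a fixed lexicon L of size V":

```python
from collections import Counter

def build_lexicon(train_tokens, vocab_size):
    """Keep the vocab_size most frequent training words as the fixed lexicon L."""
    return {w for w, _ in Counter(train_tokens).most_common(vocab_size)}

def normalize(tokens, lexicon):
    """Map any token outside the lexicon to the <UNK> token."""
    return [w if w in lexicon else "<UNK>" for w in tokens]

# Train on normalize(train_tokens, L) so <UNK> gets probabilities like any other
# word; at decoding time, score normalize(test_tokens, L).
```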

90 Google N-Gram Release

91  serve as the incoming 92  serve as the incubator 99  serve as the independent 794  serve as the index 223  serve as the indication 72  serve as the indicator 120  serve as the indicators 45  serve as the indispensable 111  serve as the indispensible 40  serve as the individual 234

92 Google Caveat  Remember the lesson about test sets and training sets... Test sets should be similar to the training set (drawn from the same distribution) for the probabilities to be meaningful.  So... The Google corpus is fine if your application deals with arbitrary English text on the Web.  If not then a smaller domain specific corpus is likely to yield better results.

93 Class-Based Language Models  Variant of n-gram models using classes or clusters

94 Class-Based Language Models  Variant of n-gram models using classes or clusters  Motivation: Sparseness  Flight app.: P(ORD|to),P(JFK|to),.. P(airport_name|to)  Relate probability of n-gram to word classes & class ngram

95 Class-Based Language Models  Variant of n-gram models using classes or clusters  Motivation: Sparseness  Flight app.: P(ORD|to), P(JFK|to), ... P(airport_name|to)  Relate probability of n-gram to word classes & class ngram  IBM clustering: assume each word is in a single class  P(wi|wi-1) ≈ P(ci|ci-1) × P(wi|ci)  Learn by MLE from data  Where do classes come from?

96 Class-Based Language Models  Variant of n-gram models using classes or clusters  Motivation: Sparseness  Flight app.: P(ORD|to), P(JFK|to), ... P(airport_name|to)  Relate probability of n-gram to word classes & class ngram  IBM clustering: assume each word is in a single class  P(wi|wi-1) ≈ P(ci|ci-1) × P(wi|ci)  Learn by MLE from data  Where do classes come from?  Hand-designed for application (e.g. ATIS)  Automatically induced clusters from corpus

97 Class-Based Language Models  (identical to slide 96)
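A sketch of the IBM-style factorization P(wi | wi-1) ≈ P(ci | ci-1) × P(wi | ci), assuming the word-to-class map and the two component distributions have already been estimated by MLE (all names illustrative):

```python
def class_bigram_prob(w, w_prev, word2class, p_class_given_class, p_word_given_class):
    """P(w | w_prev) ≈ P(class(w) | class(w_prev)) * P(w | class(w))."""
    c, c_prev = word2class[w], word2class[w_prev]
    return p_class_given_class(c, c_prev) * p_word_given_class(w, c)

# e.g. with an airport_name class, P(JFK | to) and P(ORD | to) share the class
# bigram P(airport_name | to) and differ only in P(word | class).
```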

98 LM Adaptation  Challenge: Need LM for new domain  Have little in-domain data

99 LM Adaptation  Challenge: Need LM for new domain  Have little in-domain data  Intuition: Much of language is pretty general  Can build from ‘general’ LM + in-domain data

100 LM Adaptation  Challenge: Need LM for new domain  Have little in-domain data  Intuition: Much of language is pretty general  Can build from ‘general’ LM + in-domain data  Approach: LM adaptation  Train on large domain independent corpus  Adapt with small in-domain data set  What large corpus?

101 LM Adaptation  Challenge: Need LM for new domain  Have little in-domain data  Intuition: Much of language is pretty general  Can build from ‘general’ LM + in-domain data  Approach: LM adaptation  Train on large domain independent corpus  Adapt with small in-domain data set  What large corpus?  Web counts! e.g. Google n-grams

102 Incorporating Longer Distance Context  Why use longer context?

103 Incorporating Longer Distance Context  Why use longer context?  N-grams are approximation  Model size  Sparseness

104 Incorporating Longer Distance Context  Why use longer context?  N-grams are approximation  Model size  Sparseness  What sorts of information in longer context?

105 Incorporating Longer Distance Context  Why use longer context?  N-grams are approximation  Model size  Sparseness  What sorts of information in longer context?  Priming  Topic  Sentence type  Dialogue act  Syntax

106 Long Distance LMs  Bigger n!  284M words: <= 6-grams improve; 7-20 no better

107 Long Distance LMs  Bigger n!  284M words: <= 6-grams improve; 7-20 no better  Cache n-gram:  Intuition: Priming: word used previously, more likely  Incrementally create ‘cache’ unigram model on test corpus  Mix with main n-gram LM

108 Long Distance LMs  Bigger n!  284M words: <= 6-grams improve; 7-20 no better  Cache n-gram:  Intuition: Priming: word used previously, more likely  Incrementally create ‘cache’ unigram model on test corpus  Mix with main n-gram LM  Topic models:  Intuition: Text is about some topic, on-topic words likely  P(w|h) ~ Σt P(w|t)P(t|h)

109 Long Distance LMs  Bigger n!  284M words: <= 6-grams improve; 7-20 no better  Cache n-gram:  Intuition: Priming: word used previously, more likely  Incrementally create ‘cache’ unigram model on test corpus  Mix with main n-gram LM  Topic models:  Intuition: Text is about some topic, on-topic words likely  P(w|h) ~ Σt P(w|t)P(t|h)  Non-consecutive n-grams:  skip n-grams, triggers, variable lengths n-grams
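A sketch of the cache idea: keep a running unigram model over the words seen so far in the test document and mix it with the main n-gram LM. The mixing weight and the class/function names are illustrative:

```python
from collections import Counter

class CacheLM:
    """P(w | h) = (1 - lam) * P_base(w | h) + lam * P_cache(w)."""

    def __init__(self, base_prob, lam=0.1):
        self.base_prob = base_prob  # base_prob(history, word) -> probability
        self.lam = lam
        self.cache = Counter()      # unigram counts accumulated over the test text

    def prob(self, history, word):
        p_base = self.base_prob(history, word)
        if not self.cache:
            return p_base
        p_cache = self.cache[word] / sum(self.cache.values())
        return (1 - self.lam) * p_base + self.lam * p_cache

    def observe(self, word):
        """Call after each test word has been processed (incremental cache update)."""
        self.cache[word] += 1
```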

110 Language Models  N-gram models:  Finite approximation of infinite context history  Issues: Zeroes and other sparseness  Strategies: Smoothing  Add-one, add-δ, Good-Turing, etc  Use partial n-grams: interpolation, backoff  Refinements  Class, cache, topic, trigger LMs

111 Kneser-Ney Smoothing  Most commonly used modern smoothing technique  Intuition: improving backoff  I can’t see without my reading……  Compare P(Francisco|reading) vs P(glasses|reading)

112 Kneser-Ney Smoothing  Most commonly used modern smoothing technique  Intuition: improving backoff  I can’t see without my reading……  Compare P(Francisco|reading) vs P(glasses|reading)  P(Francisco|reading) backs off to P(Francisco)

113 Kneser-Ney Smoothing  Most commonly used modern smoothing technique  Intuition: improving backoff  I can’t see without my reading……  Compare P(Francisco|reading) vs P(glasses|reading)  P(Francisco|reading) backs off to P(Francisco)  P(glasses|reading) > 0  Because ‘Francisco’ has a high unigram frequency, its backed-off estimate comes out larger than P(glasses|reading)

114 Kneser-Ney Smoothing  Most commonly used modern smoothing technique  Intuition: improving backoff  I can’t see without my reading……  Compare P(Francisco|reading) vs P(glasses|reading)  P(Francisco|reading) backs off to P(Francisco)  P(glasses|reading) > 0  Because ‘Francisco’ has a high unigram frequency, its backed-off estimate comes out larger than P(glasses|reading)  However, Francisco appears in few contexts, glasses in many

115 Kneser-Ney Smoothing  Most commonly used modern smoothing technique  Intuition: improving backoff  I can’t see without my reading……  Compare P(Francisco|reading) vs P(glasses|reading)  P(Francisco|reading) backs off to P(Francisco)  P(glasses|reading) > 0  Because ‘Francisco’ has a high unigram frequency, its backed-off estimate comes out larger than P(glasses|reading)  However, Francisco appears in few contexts, glasses in many  Interpolate based on # of contexts  Words seen in more contexts are more likely to appear in others

116 Kneser-Ney Smoothing  Modeling diversity of contexts  Continuation probability: P_CONT(w) = |{v : C(v w) > 0}| / |{(v', w') : C(v' w') > 0}|

117 Kneser-Ney Smoothing  Modeling diversity of contexts  Continuation probability  Backoff:

118 Kneser-Ney Smoothing  Modeling diversity of contexts  Continuation probability  Backoff:  Interpolation:
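A sketch of interpolated Kneser-Ney for bigrams, following the continuation-probability idea above: the lower-order term counts how many distinct contexts a word appears in, not how often it occurs. The fixed discount d and the names are illustrative, and unseen words are assumed to have been mapped to <UNK> first:

```python
from collections import defaultdict

def make_kn_bigram(bigram_counts, d=0.75):
    """Interpolated Kneser-Ney for bigrams:
      P_KN(w | v) = max(C(v w) - d, 0) / C(v)  +  lambda(v) * P_cont(w)
      lambda(v)   = d * |{w': C(v w') > 0}| / C(v)
      P_cont(w)   = |{v': C(v' w) > 0}| / |{(v', w'): C(v' w') > 0}|
    """
    context_total = defaultdict(int)   # C(v): count of v as a context
    followers = defaultdict(set)       # distinct word types seen after v
    preceders = defaultdict(set)       # distinct context types seen before w
    for (v, w), c in bigram_counts.items():
        if c > 0:
            context_total[v] += c
            followers[v].add(w)
            preceders[w].add(v)
    n_bigram_types = sum(len(s) for s in followers.values())

    def p_kn(w, v):
        c_v = context_total[v]
        discounted = max(bigram_counts.get((v, w), 0) - d, 0) / c_v
        lam = d * len(followers[v]) / c_v
        p_cont = len(preceders[w]) / n_bigram_types  # diversity of contexts, not raw frequency
        return discounted + lam * p_cont

    return p_kn
```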

119 Issues  Relative frequency  Typically compute count of sequence  Divide by prefix count  Corpus sensitivity  Shakespeare vs Wall Street Journal  Very unnatural  Ngrams  Unigrams: little; bigrams: collocations; trigrams: phrases

120 Additional Issues in Good-Turing  General approach:  Estimate of c * for N c depends on N c+1  What if N c+1 = 0?  More zero count problems  Not uncommon: e.g. fish example, no 4s

121 Modifications  Simple Good-Turing  Compute Nc bins, then smooth Nc to replace zeroes  Fit linear regression in log space  log(Nc) = a + b log(c)  What about large c’s?  Should be reliable  Assume c* = c if c is large, e.g. c > k (Katz: k = 5)  Typically combined with other interpolation/backoff

