
1 Natural Language Processing Language Model

2 Language Models Formal grammars (e.g. regular, context-free) give a hard “binary” model of the legal sentences in a language. For NLP, a probabilistic model of a language, one that gives the probability that a string is a member of the language, is more useful. To specify a correct probability distribution, the probabilities of all sentences in the language must sum to 1.

3 Uses of Language Models
Speech recognition – “I ate a cherry” is a more likely sentence than “Eye eight uh Jerry”
OCR & handwriting recognition – more probable sentences are more likely correct readings
Machine translation – more likely sentences are probably better translations
Generation – more likely sentences are probably better NL generations
Context-sensitive spelling correction – “Their are problems wit this sentence.”

4 Completion Prediction A language model also supports predicting the completion of a sentence: “Please turn off your cell _____”, “Your program does not ______”. Predictive text input systems can guess what you are typing and offer choices for completing it.

5 Probability P(X) means the probability that X is true. P(baby is a boy) ≈ 0.5 (fraction of all babies that are boys). P(baby is named John) ≈ 0.001 (fraction of all babies named John). [Venn diagram on the original slide: John ⊂ Baby boys ⊂ Babies]

6 Probability P(X|Y) means the probability that X is true when we already know Y is true. P(baby is named John | baby is a boy) ≈ 0.002. P(baby is a boy | baby is named John) ≈ 1.

7 Probability P(X|Y) = P(X, Y) / P(Y). P(baby is named John | baby is a boy) = P(baby is named John, baby is a boy) / P(baby is a boy) = 0.001 / 0.5 = 0.002.

8 Bayes Rule Bayes rule: P(X|Y) = P(Y|X) × P(X) / P(Y). P(named John | boy) = P(boy | named John) × P(named John) / P(boy).
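
A minimal sketch in Python checking these identities with the toy numbers from the slides (the probabilities are illustrative values, not real statistics):

```python
# Toy numbers from the slides (illustrative values, not census data).
p_boy = 0.5            # P(baby is a boy)
p_john = 0.001         # P(baby is named John)
p_boy_given_john = 1.0 # P(boy | named John): assume John is exclusively a boy's name

# Conditional probability from the joint: P(John | boy) = P(John, boy) / P(boy)
p_john_and_boy = p_john * p_boy_given_john   # P(John, boy) = P(boy | John) * P(John)
print(p_john_and_boy / p_boy)                # 0.002

# Bayes rule: P(John | boy) = P(boy | John) * P(John) / P(boy)
print(p_boy_given_john * p_john / p_boy)     # 0.002
```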

9 Word Sequence Probabilities Given a word sequence (sentence) w1 ... wn, its probability by the chain rule is: P(w1 ... wn) = P(w1) P(w2 | w1) ... P(wn | w1 ... wn-1). The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N − 1)th-order Markov model.

10 N-Gram Models Estimate the probability of each word given the prior context, e.g. P(phone | Please turn off your cell). The number of parameters required grows exponentially with the number of words of prior context. An N-gram model uses only N − 1 words of prior context: Unigram: P(phone); Bigram: P(phone | cell); Trigram: P(phone | your cell). Bigram approximation: P(w1 ... wn) ≈ ∏k P(wk | wk-1). N-gram approximation: P(w1 ... wn) ≈ ∏k P(wk | wk-N+1 ... wk-1).

11 Estimating Probabilities N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences. To have a consistent probabilistic model, append a unique start symbol (<s>) and end symbol (</s>) to every sentence and treat these as additional words. Bigram: P(wn | wn-1) = C(wn-1 wn) / C(wn-1). N-gram: P(wn | wn-N+1 ... wn-1) = C(wn-N+1 ... wn-1 wn) / C(wn-N+1 ... wn-1).
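
A minimal sketch of these relative-frequency (MLE) estimates in Python, using a tiny made-up corpus; the sentences and helper names (bigram_prob, etc.) are illustrative, not from the slides:

```python
from collections import Counter

# Tiny illustrative corpus (not from the slides).
corpus = [
    "i want to eat chinese food",
    "i want to eat lunch",
    "i want chinese food",
]

bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]   # add start/end symbols
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("i", "want"))      # 1.0  (every "i" is followed by "want")
print(bigram_prob("want", "to"))     # 2/3
print(bigram_prob("eat", "chinese")) # 1/2
```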

12 Generative Model and MLE An N-gram model can be seen as a probabilistic automaton for generating sentences: initialize the sentence with N − 1 <s> symbols, then, until </s> is generated, stochastically pick the next word based on the conditional probability of each word given the previous N − 1 words. Relative frequency estimates can be proven to be maximum likelihood estimates (MLE), since they maximize the probability that the model M will generate the training corpus T.
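
A minimal sketch of this generation loop for a bigram model (N = 2), reusing the kind of MLE counts computed above; the corpus and function names are illustrative:

```python
import random
from collections import Counter, defaultdict

# Tiny illustrative corpus (same idea as above, not from the slides).
corpus = [
    "i want to eat chinese food",
    "i want to eat lunch",
    "i want chinese food",
]

# successors[w] maps each word that follows w to its count.
successors = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(words, words[1:]):
        successors[prev][word] += 1

def generate_sentence():
    """Sample words from P(w | prev) until </s> is produced."""
    prev, out = "<s>", []
    while True:
        candidates = list(successors[prev].keys())
        weights = list(successors[prev].values())
        word = random.choices(candidates, weights=weights)[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)
        prev = word

print(generate_sentence())  # e.g. "i want to eat chinese food"
```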

13 Example Estimate the likelihood of the sentence “I want to eat Chinese food”: P(I want to eat Chinese food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(</s> | food). What do we need to calculate these likelihoods? Bigram probabilities for each word pair sequence in the sentence, calculated from a large corpus.

14 Corpus A language model must be trained on a large corpus of text to estimate good parameter values. The model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate). Ideally, the training (and test) corpus should be representative of the actual application data.

15 Terminology Types: number of distinct words in a corpus (vocabulary size) Tokens: total number of words
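
A one-line sketch of the types/tokens distinction, on an illustrative string:

```python
# "to be or not to be": 6 tokens, 4 types (illustrative example).
text = "to be or not to be"
tokens = text.split()
types = set(tokens)
print(len(tokens), len(types))  # 6 4
```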

16 Early Bigram Probabilities from BERP
Eat on .16        Eat Thai .03
Eat some .06      Eat breakfast .03
Eat lunch .06     Eat in .02
Eat dinner .05    Eat Chinese .02
Eat at .04        Eat Mexican .02
Eat a .04         Eat tomorrow .01
Eat Indian .04    Eat dessert .007
Eat today .03     Eat British .001

17
<s> I .25         Want some .04
<s> I'd .06       Want Thai .01
<s> Tell .04      To eat .26
<s> I'm .02       To have .14
I want .32        To spend .09
I would .29       To be .02
I don't .08       British food .60
I have .04        British restaurant .15
Want to .65       British cuisine .01
Want a .05        British lunch .01

18 Back to our sentence… I want to eat Chinese food
Bigram counts (BERP corpus; rows = first word, columns = second word):
         I     want   to    eat   Chinese  food  lunch
I        8     1087   0     13    0        0     0
want     3     0      786   0     6        8     6
to       3     0      10    860   3        0     12
eat      0     0      2     0     19       2     52
Chinese  2     0      0     0     0        120   1
food     19    0      17    0     0        0     0
lunch    4     0      0     0     0        1     0

19 Relative Frequencies Normalization: divide each row's counts by the appropriate unigram count for wn-1. Unigram counts from BERP: I 3437, want 1215, to 3256, eat 938, Chinese 213, food 1506, lunch 459. Computing the bigram probability of I I: C(I I) / C(I), so p(I|I) = 8 / 3437 = .0023.
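
A minimal sketch that normalizes the BERP counts above into bigram probabilities; the dictionaries just transcribe the two tables, and the sentence probability omits the P(I | <s>) and P(</s> | food) terms since the start/end counts are not in these tables:

```python
# Bigram and unigram counts transcribed from the BERP tables above.
bigram_counts = {
    ("i", "want"): 1087, ("want", "to"): 786, ("to", "eat"): 860,
    ("eat", "chinese"): 19, ("chinese", "food"): 120, ("i", "i"): 8,
}
unigram_counts = {"i": 3437, "want": 1215, "to": 3256,
                  "eat": 938, "chinese": 213, "food": 1506, "lunch": 459}

def bigram_prob(prev, word):
    """P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts.get((prev, word), 0) / unigram_counts[prev]

print(round(bigram_prob("i", "i"), 4))  # 0.0023, as on the slide

# Partial probability of "I want to eat Chinese food"
# (inner bigrams only; start/end terms are not available here).
sentence = "i want to eat chinese food".split()
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= bigram_prob(prev, word)
print(round(prob, 6))  # ≈ 0.000617 — small, as expected for a six-word sentence
```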

20 Approximating Shakespeare Generating sentences with random unigrams... – Every enter now severally so, let – Hill he late speaks; or! a more to leg less first you enter With bigrams... – What means, sir. I confess she? then all sorts, he is trim, captain. – Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Trigrams – Sweet prince, Falstaff shall die. – This shall forbid it should be branded, if renown made it empty.

21 Approximating Shakespeare Quadrigrams – What! I will go seek the traitor Gloucester. – Will you not tell me who I am? – What's coming out here looks like Shakespeare because it is Shakespeare Note: As we increase the value of N, the accuracy of an n-gram model increases, since choice of next word becomes increasingly constrained

22 Evaluation Perplexity and entropy: how do you estimate how well your language model fits a corpus once you're done?

23 Random Variables A random variable is defined by the probabilities of each possible value in the population. Discrete random variable: takes whole-number values (0, 1, 2, 3, etc.), a countable, finite number of values; it jumps from one value to the next and cannot take any values in between. Continuous random variable: takes whole or fractional values, obtained by measuring; there are infinitely many values in an interval, too many to list as you can with a discrete variable.

24 Discrete Random Variables For example: number of girls in the family, number of correct answers on a given exam, …

25 Probability Mass Function x = value of the random variable (outcome); p(x) = probability associated with that value. The values are mutually exclusive (no overlap) and collectively exhaustive (nothing left out), with 0 ≤ p(x) ≤ 1 and Σ p(x) = 1.

26 Measures Expected value: the mean of the probability distribution, a weighted average of all possible values: μ = E(X) = Σ x p(x). Variance: the weighted average squared deviation about the mean: σ² = V(X) = E[(X − μ)²] = Σ (x − μ)² p(x), or equivalently σ² = V(X) = E(X²) − [E(X)]². Standard deviation: σ = √σ² = SD(X).
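
A minimal sketch of these formulas on a small made-up discrete distribution:

```python
import math

# Illustrative discrete distribution p(x) (made up for the example).
dist = {0: 0.2, 1: 0.5, 2: 0.3}

mean = sum(x * p for x, p in dist.items())                      # E(X) = sum x p(x)
var = sum((x - mean) ** 2 * p for x, p in dist.items())         # sum (x - mean)^2 p(x)
var_alt = sum(x ** 2 * p for x, p in dist.items()) - mean ** 2  # E(X^2) - [E(X)]^2
sd = math.sqrt(var)                                             # SD(X)

print(round(mean, 2), round(var, 2), round(var_alt, 2), round(sd, 2))
# 1.1 0.49 0.49 0.7
```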

27 Perplexity and Entropy Information-theoretic metrics, useful for measuring how well a grammar or language model (LM) models a natural language or a corpus. Entropy: with 2 LMs and a corpus, which LM is the better match for the corpus? How much information is there (in e.g. a grammar or LM) about what the next word will be? For a random variable X ranging over e.g. bigrams and a probability function p(x), the entropy of X is the expected negative log probability: H(X) = −Σx p(x) log2 p(x).

28 Entropy – Example Horse race: 8 horses, and we want to send a bet to the bookie. The naïve way takes a 3-bit message. Can we do better? Suppose we know the distribution of the bets placed:
Horse 1: 1/2    Horse 5: 1/64
Horse 2: 1/4    Horse 6: 1/64
Horse 3: 1/8    Horse 7: 1/64
Horse 4: 1/16   Horse 8: 1/64

29 Entropy – Example The entropy of the random variable X gives a lower bound on the expected number of bits: H(X) = −Σi p(i) log2 p(i) = (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/16)(4) + 4 × (1/64)(6) = 2 bits.

30 Perplexity Perplexity is 2^H(X): the weighted average number of choices a random variable has to make. In the previous example, a uniform distribution over the 8 horses would give perplexity 8; with the weighted bet distribution, perplexity is 2^2 = 4.
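
A minimal sketch computing the entropy and perplexity of the horse-race distribution above:

```python
import math

# Bet distribution from the horse-race example.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

entropy = -sum(p * math.log2(p) for p in probs)  # expected negative log probability
perplexity = 2 ** entropy                        # weighted average number of choices

print(entropy, perplexity)      # 2.0 4.0
uniform = [1/8] * 8
print(2 ** -sum(p * math.log2(p) for p in uniform))  # 8.0 for the uniform case
```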

31 Entropy of a Language Measure over all the sequences of length n in a language L: H(w1, ..., wn) = −Σ p(W) log p(W), summed over all sequences W of length n in L. Entropy rate (per-word entropy): (1/n) H(w1, ..., wn); the entropy of the language is the limit of this rate as n → ∞.

32 Cross Entropy and Perplexity Given the probability distribution of the language p and a model m, we want to measure how well m predicts p. Cross entropy: H(p, m) = lim n→∞ −(1/n) Σ p(w1 ... wn) log m(w1 ... wn), which for a long enough sample from the language can be approximated by −(1/n) log m(w1 ... wn); perplexity is 2^H(p, m).

33 Perplexity Better models m of the unknown distribution p will tend to assign higher probabilities m(xi) to the test events. Thus they have lower perplexity: they are less surprised by the test sample.
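
A minimal sketch of evaluating a bigram model this way: per-word cross-entropy as the average negative log2 probability the model assigns to a held-out sentence, and perplexity as 2 to that power. The toy model probabilities below are made up for illustration:

```python
import math

# Toy bigram model m: P(word | prev); made-up values for illustration.
model = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "chinese"): 0.02,
    ("chinese", "food"): 0.56, ("food", "</s>"): 0.10,
}

test = ["<s>"] + "i want to eat chinese food".split() + ["</s>"]
log_prob = sum(math.log2(model[(prev, w)]) for prev, w in zip(test, test[1:]))

n = len(test) - 1                      # number of predicted tokens
cross_entropy = -log_prob / n          # estimate of H(p, m) on the test sample
perplexity = 2 ** cross_entropy

print(round(cross_entropy, 2), round(perplexity, 2))  # ≈ 2.29 4.88
```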

34 Example (slide from Philipp Koehn)

35 Comparison of 1-4-gram LMs (slide from Philipp Koehn)

36 Smoothing Words follow a Zipfian distribution: a small number of words occur very frequently, and a large number are seen only once. A zero probability for a single bigram causes a zero probability for the entire sentence. So... how do we estimate the likelihood of unseen n-grams?

37 Steal from the rich and give to the poor (in probability mass) Slide from Dan Klein

38 Add-One Smoothing For all possible n-grams, add a count of one: P = (C + 1) / (N + V), where C = count of the n-gram in the corpus, N = count of the history, and V = vocabulary size. But there are many more unseen n-grams than seen n-grams.
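
A minimal sketch of add-one (Laplace) smoothing for bigrams, reusing the kind of counts built earlier; the corpus is illustrative:

```python
from collections import Counter

corpus = ["i want to eat chinese food", "i want to eat lunch"]  # illustrative

bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

V = len(unigram_counts)  # vocabulary size (including <s> and </s> here)

def add_one_prob(prev, word):
    """P(word | prev) = (C(prev word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(add_one_prob("eat", "chinese"))  # seen once: (1 + 1) / (2 + 9)
print(add_one_prob("eat", "dinner"))   # unseen: (0 + 1) / (2 + 9)
```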

39 Add-One Smoothing [Tables on the original slide: unsmoothed bigram counts and unsmoothed normalized bigram probabilities; rows = 1st word, columns = 2nd word; vocabulary size = 1616]

40 Add-One Smoothing [Tables on the original slide: add-one smoothed bigram counts and add-one normalized bigram probabilities]

41 Add-One Problems The estimates for bigrams starting with “Chinese” change by a factor of about 8 (1829 / 213: the denominator grows from 213 to 213 + 1616 = 1829). [Tables on the original slide: unsmoothed vs. add-one smoothed bigram counts]

42 Add-One Problems Every previously unseen n-gram is given a low probability, but there are so many of them that too much probability mass is given to unseen events. Adding 1 to a frequent bigram changes it little, but adding 1 to low-count bigrams (including unseen ones) boosts them too much. In NLP applications, which are very sparse, Add-One gives far too much of the probability space to unseen events.

43 Witten-Bell Smoothing Intuition: – An unseen n-gram is one that just did not occur yet – When it does happen, it will be its first occurrence – So give to unseen n-grams the probability of seeing a new n-gram

44 Witten-Bell – Unigram Case N: number of tokens. T: number of types (different observed words), which can be different from V (the number of words in the dictionary). Total probability mass reserved for unseen unigrams: T / (N + T). Probability of a seen unigram w: c(w) / (N + T).

45 Witten-Bell – Bigram Case Conditioned on the previous word w: probability of each unseen bigram starting with w: T(w) / (Z(w) (N(w) + T(w))); probability of a seen bigram w wi: C(w wi) / (N(w) + T(w)). (T(w), Z(w), N(w) are defined on the next slide.)

46 Witten-Bell Example [Table of the original counts omitted from the slide.] T(w) = number of different seen bigram types starting with w. We have a vocabulary of 1616 words, so we can compute Z(w) = number of unseen bigram types starting with w = 1616 − T(w). N(w) = number of bigram tokens starting with w.
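
A minimal sketch of the bigram case with these quantities; the counts are made up, and the formulas follow the seen/unseen split described above (mass T(w)/(N(w)+T(w)) reserved for unseen successors of w, shared equally among the Z(w) unseen types):

```python
# Made-up bigram counts for a single history word w = "chinese" (illustrative).
counts_after_w = {"food": 120, "restaurant": 3, "cuisine": 2}  # seen successors of w
V = 1616                             # vocabulary size, as on the slide

N_w = sum(counts_after_w.values())   # bigram tokens starting with w
T_w = len(counts_after_w)            # distinct seen successor types
Z_w = V - T_w                        # unseen successor types

def witten_bell_prob(word):
    """P(word | w) under Witten-Bell smoothing (sketch)."""
    if word in counts_after_w:                      # seen bigram
        return counts_after_w[word] / (N_w + T_w)
    return T_w / (Z_w * (N_w + T_w))                # share of the unseen mass

print(witten_bell_prob("food"))     # 120 / 128
print(witten_bell_prob("dessert"))  # 3 / (1613 * 128)
# Sanity check: probabilities over the whole vocabulary sum to 1 (up to rounding).
print(sum(counts_after_w.values()) / (N_w + T_w) + Z_w * (T_w / (Z_w * (N_w + T_w))))
```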

47 Witten-Bell Example [Table of the Witten-Bell smoothed probabilities shown on the original slide.]

48 Back-off So far, we gave the same probability to all unseen n-grams. Suppose we have never seen the bigrams:
journal of     P_unsmoothed(of | journal) = 0
journal from   P_unsmoothed(from | journal) = 0
journal never  P_unsmoothed(never | journal) = 0
All the models so far will give the same probability to all 3 bigrams, but intuitively “journal of” is more probable because “of” is more frequent than “from” and “never”: the unigram probability P(of) > P(from) > P(never).

49 Back-off Observation: a unigram model suffers less from data sparseness than a bigram model, a bigram model suffers less than a trigram model, and so on. So use a lower-order model estimate to estimate the probability of unseen n-grams. If we have several models of how the history predicts what comes next, we can combine them in the hope of producing an even better model.

50 Linear Interpolation Solve the sparseness in a trigram model by mixing it with bigram and unigram models. Also called: linear interpolation, finite mixture models, deleted interpolation. Combine linearly: Pli(wn | wn-2, wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2, wn-1), where 0 ≤ λi ≤ 1 and Σi λi = 1.
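
A minimal sketch of the interpolation formula; the component probabilities and the λ weights are made-up placeholders (in practice the λs are tuned on held-out data):

```python
# Illustrative component probabilities for predicting w_n = "food".
p_unigram = 0.002            # P(food)
p_bigram = 0.56              # P(food | chinese)
p_trigram = 0.70             # P(food | eat, chinese)

# Made-up interpolation weights; they must lie in [0, 1] and sum to 1.
lambdas = (0.2, 0.3, 0.5)
assert abs(sum(lambdas) - 1.0) < 1e-9

p_interpolated = (lambdas[0] * p_unigram
                  + lambdas[1] * p_bigram
                  + lambdas[2] * p_trigram)
print(round(p_interpolated, 4))  # 0.5184
```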

51 Back-off Smoothing Smoothing of conditional probabilities, e.g. p(Angeles | to, Los). If “to Los Angeles” is not in the training corpus, the smoothed probability p(Angeles | to, Los) is identical to p(York | to, Los). However, the actual probability is probably close to the bigram probability p(Angeles | Los).

52 Back-off Smoothing (Wrong) back-off smoothing of trigram probabilities:
if C(w', w'', w) > 0: P*(w | w', w'') = P(w | w', w'')
else if C(w'', w) > 0: P*(w | w', w'') = P(w | w'')
else if C(w) > 0: P*(w | w', w'') = P(w)
else: P*(w | w', w'') = 1 / #words

53 Back-off Smoothing Problem: this is not a probability distribution. Solution: combine back-off with smoothing:
if C(w1, ..., wk, w) > 0: P(w | w1, ..., wk) = C*(w1, ..., wk, w) / N
else: P(w | w1, ..., wk) = α(w1, ..., wk) P(w | w2, ..., wk)
where C* is a discounted count and α(w1, ..., wk) is the back-off weight chosen so the distribution sums to 1.
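
A minimal sketch of the back-off idea for trigrams, using the simple unnormalized variant (fall back to the next lower-order estimate scaled by a constant, similar to "stupid back-off") rather than the fully normalized scheme with discounted counts and per-history α weights. The counts and the 0.4 factor are illustrative:

```python
from collections import Counter

# Illustrative counts (made up); in practice these come from a large corpus.
trigram_counts = Counter()                                    # no trigrams observed
bigram_counts = Counter({("los", "angeles"): 40, ("to", "los"): 50})
unigram_counts = Counter({"los": 60, "angeles": 45, "to": 900})
total_tokens = 10_000
ALPHA = 0.4  # fixed back-off factor; the scores are not a true probability distribution

def backoff_score(w1, w2, w3):
    """Score(w3 | w1, w2): back off to lower-order estimates when counts are zero."""
    if trigram_counts[(w1, w2, w3)] > 0 and bigram_counts[(w1, w2)] > 0:
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    if bigram_counts[(w2, w3)] > 0 and unigram_counts[w2] > 0:
        return ALPHA * bigram_counts[(w2, w3)] / unigram_counts[w2]
    return ALPHA * ALPHA * unigram_counts[w3] / total_tokens

print(backoff_score("to", "los", "angeles"))  # backs off to the bigram: 0.4 * 40/60 ≈ 0.267
```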

