
1 Fall 2004
BASIC TECHNIQUES IN STATISTICAL NLP
Word prediction, n-grams, smoothing

2 Statistical Methods in NLE
Two characteristics of natural language make it desirable to endow programs with the ability to LEARN from examples of past use:
– VARIETY (no programmer can really take into account all possibilities)
– AMBIGUITY (we need ways of choosing between alternatives)
Statistical methods are very common in a number of NLE applications.
The simplest application: WORD PREDICTION

3 We are good at word prediction
Stocks plunged this morning, despite a cut in interest …
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall …
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began …

4 Real Spelling Errors
– They are leaving in about fifteen minuets to go to her house.
– The study was conducted mainly be John Black.
– The design an construction of the system will take more than one year.
– Hopefully, all with continue smoothly in my absence.
– Can they lave him my messages?
– I need to notified the bank of this problem.
– He is trying to fine out.

5 The 'cloze' task
Pablo did not get up at seven o'clock, as he always does. He woke up late, at eight o'clock. He dressed quickly and came out of the house barefoot. He entered the garage __ could not open his __ door. Therefore, he had __ go to the office __ bus. But when he __ to pay his fare __ the driver, he realized __ he did not have __ money. Because of that, __ had to walk. When __ finally got into the __, his boss was offended __ Pablo treated him impolitely.

6 Handwriting recognition
From Woody Allen's Take the Money and Run (1969): Allen (a bank robber) walks up to the teller and hands him a note that reads "I have a gun. Give me all your cash." The teller, however, is puzzled, because he reads "I have a gub." "No, it's gun", Allen says. "Looks like 'gub' to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.

7 Applications of word prediction
– Spelling checkers
– Mobile phone texting
– Speech recognition
– Handwriting recognition
– Disabled users

8 Statistics and word prediction
The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of a spelling error.
I.e., to compute P(w | w1 … wn-1) for all words w, and to predict as the next word the one for which this (conditional) probability is highest.
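A minimal sketch of that idea in Python, assuming we already have a table of conditional probabilities; the table, its values, and the predict_next helper are purely illustrative and not part of the original slides:

```python
# Next-word prediction as an argmax over P(w | history).
# The probability values below are made up for illustration.

cond_probs = {
    ("interest",): {"rates": 0.30, "rate": 0.20, "in": 0.05},
    ("a",): {"good": 0.05, "rabbit": 0.01, "cut": 0.002},
}

def predict_next(history, table):
    """Return the word w maximizing P(w | history), or None if the history is unseen."""
    dist = table.get(tuple(history))
    if not dist:
        return None
    return max(dist, key=dist.get)

print(predict_next(["interest"], cond_probs))  # -> 'rates'
```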

9 Using corpora to estimate probabilities
But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY.
The simplest method: the Maximum Likelihood Estimate (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered.
'Maximum' because it doesn't waste any probability mass on events not in the corpus.

10 Maximum Likelihood Estimation for conditional probabilities
In order to estimate P(wn | w1 … wn-1), we can instead use the ratio of counts:
P(wn | w1 … wn-1) = C(w1 … wn) / C(w1 … wn-1)
Cf.: P(A|B) = P(A&B) / P(B)
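A minimal sketch of the MLE estimate from raw counts; the toy corpus and helper functions are illustrative, not from the slides:

```python
from collections import Counter

# MLE: P(w_n | w_1 ... w_{n-1}) = C(w_1 ... w_n) / C(w_1 ... w_{n-1})
corpus = "just the other day i saw a rabbit just the other day".split()

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_cond_prob(tokens, history, word):
    """Relative-frequency estimate of P(word | history)."""
    n = len(history) + 1
    numerator = ngram_counts(tokens, n)[tuple(history) + (word,)]
    denominator = ngram_counts(tokens, n - 1)[tuple(history)]
    return numerator / denominator if denominator else 0.0

print(mle_cond_prob(corpus, ["the"], "other"))  # C(the other) / C(the) = 2 / 2 = 1.0
```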

11 Aside: counting words in corpora
Keep in mind that it's not always so obvious what 'a word' is (cf. yesterday).
In text:
– He stepped out into the hall, was delighted to encounter a brother. (from the Brown corpus)
In speech:
– I do uh main- mainly business data processing
LEMMAS: cats vs. cat
TYPES vs. TOKENS

12 The problem: sparse data
In principle, we would like the n of our models to be fairly large, to model 'long distance' dependencies such as:
– Sue SWALLOWED the large green …
However, in practice most sequences of words of length greater than 3 hardly ever occur in our corpora! (See below.)
(Part of the) solution: we APPROXIMATE the probability of a word given all previous words.

13 The Markov Assumption
The probability of being in a certain state only depends on the previous state:
P(Xn = sk | X1 … Xn-1) = P(Xn = sk | Xn-1)
This is equivalent to the assumption that the next state only depends on the previous m inputs, for finite m.
(N-gram models / Markov models can be seen as probabilistic finite-state automata.)

14 The Markov assumption for language: n-gram models
Making the Markov assumption for word prediction means assuming that the probability of a word only depends on the previous n-1 words (N-GRAM model).

15 Bigrams and trigrams
Typical values of n are 2 or 3 (BIGRAM or TRIGRAM models):
P(wn | w1 … wn-1) ≈ P(wn | wn-2, wn-1)
P(w1, …, wn) ≈ ∏ P(wi | wi-2, wi-1)
What the bigram model means in practice:
– instead of P(rabbit | Just the other day I saw a)
– we use P(rabbit | a)
Unigram: P(dog)
Bigram: P(dog | big)
Trigram: P(dog | the, big)

16 The chain rule
So how can we compute the probability of sequences of words longer than 2 or 3? We use the CHAIN RULE:
P(w1 … wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | w1 … wn-1)
E.g.,
– P(the big dog) = P(the) P(big | the) P(dog | the big)
Then we use the Markov assumption to reduce this to manageable proportions, e.g. with bigrams:
– P(the big dog) ≈ P(the) P(big | the) P(dog | big)

17 Example: the Berkeley Restaurant Project (BERP) corpus
BERP is a speech-based restaurant consultant. The corpus contains user queries; examples include:
– I'm looking for Cantonese food
– I'd like to eat dinner someplace nearby
– Tell me about Chez Panisse
– I'm looking for a good place to eat breakfast

18 Computing the probability of a sentence
Given a corpus like BERP, we can compute the probability of a sentence like "I want to eat Chinese food".
Making the bigram assumption and using the chain rule, the probability can be approximated as follows:
– P(I want to eat Chinese food) ≈ P(I | "sentence start") P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)
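A minimal sketch of this computation, using the BERP bigram values quoted on slide 22 below; the "<s>" token is just an assumed stand-in for "sentence start":

```python
import math

# Bigram approximation of a sentence probability: the product of P(w_i | w_{i-1}).
bigram_prob = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "Chinese"): 0.020, ("Chinese", "food"): 0.56,
}

def sentence_prob(words, probs):
    """Multiply bigram probabilities over the sentence (0 if a bigram is missing)."""
    p = 1.0
    for prev, cur in zip(["<s>"] + words, words):
        p *= probs.get((prev, cur), 0.0)
    return p

p = sentence_prob("I want to eat Chinese food".split(), bigram_prob)
print(p)            # ~0.000151, matching slide 22
print(math.log(p))  # in practice log probabilities are used to avoid underflow
```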

19 Bigram counts

20 How the bigram probabilities are computed
Example: P(I | I)
– C("I", "I") = 8
– C("I") = 8 + 1087 + 13 + … = 3437
– P(I | I) = 8 / 3437 = .0023

21 Bigram probabilities P(. | want)

22 The probability of the example sentence
P(I want to eat Chinese food) ≈ P(I | "sentence start") * P(want | I) * P(to | want) * P(eat | to) * P(Chinese | eat) * P(food | Chinese)
Assume P(I | "sentence start") = .25
P = .25 * .32 * .65 * .26 * .020 * .56 = .000151

23 Examples of actual bigram probabilities computed using BERP

24 The tradeoff between prediction and sparsity: comparing Austen n-grams
Test sentence: "In person she was inferior to"
Under a unigram model, the ranked candidate list P(.) is the same at every position:
– rank 1: the (.034)
– rank 2: to (.032)
– rank 3: and (.030)
– …
– rank 8: was (.015)
– …
– rank 13: she (.011)
– …
– rank 1701: inferior (.00005)

25 Comparing Austen n-grams: bigrams
Test sentence: "In person she was inferior to"
Ranked candidates under each bigram context:
– P(. | person): rank 1 and (.099), rank 2 who (.099), …, rank 23 she (.009)
– P(. | she): rank 1 had (.141), rank 2 was (.122), …
– P(. | was): rank 1 not (.065), rank 2 a (.052), …, inferior: 0
– P(. | inferior): rank 1 to (.212)

26 Comparing Austen n-grams: trigrams
Test sentence: "In person she was inferior to"
Ranked candidates under each trigram context:
– P(. | In, person): context UNSEEN
– P(. | person, she): rank 1 did (.05), rank 2 was (.05)
– P(. | she, was): rank 1 not (.057), rank 2 very (.038), …, inferior: 0
– P(. | was, inferior): context UNSEEN

27 Evaluating n-gram based language models: the Shannon/Miller/Selfridge method
For unigrams:
– Choose a random value r between 0 and 1
– Print out the word w whose interval in the cumulative distribution contains r (i.e., sample w according to P(w)), and repeat
For bigrams:
– First sample a bigram starting from "sentence start", i.e., from P(w | "sentence start")
– Then keep sampling bigrams that begin with the previously generated word, as before
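A minimal sketch of the bigram version of this generation procedure; the toy distributions and the "<s>"/"</s>" boundary markers are illustrative assumptions, not part of the slides:

```python
import random

# Shannon-style generation from a bigram model: repeatedly sample the next
# word from P(w | previous word) until the end-of-sentence marker is drawn.
bigram_dist = {
    "<s>": {"I": 0.6, "Tell": 0.4},
    "I": {"want": 0.7, "saw": 0.3},
    "want": {"to": 1.0},
    "to": {"eat": 0.5, "go": 0.5},
    "eat": {"</s>": 1.0}, "go": {"</s>": 1.0}, "saw": {"</s>": 1.0},
    "Tell": {"me": 1.0}, "me": {"</s>": 1.0},
}

def generate(dist, max_len=20):
    """Sample a word sequence by walking the bigram distribution."""
    word, output = "<s>", []
    for _ in range(max_len):
        words, probs = zip(*dist[word].items())
        word = random.choices(words, weights=probs)[0]  # sample w with prob P(w | prev)
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

print(generate(bigram_dist))  # e.g. "I want to eat"
```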

28 The Shannon/Miller/Selfridge method trained on Shakespeare

29 Approximating Shakespeare, cont'd

30 A more formal evaluation mechanism
– Entropy
– Cross-entropy
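The slide only names the two measures; as a reference point, a sketch of the standard definitions (along the lines of Manning and Schuetze), where p is the true distribution and m is the model, evaluated per word on a test sample of N words:

```latex
% Entropy of a distribution p:
H(p) = -\sum_{x} p(x) \log_2 p(x)

% Per-word cross-entropy of a model m on a test sample w_1 \dots w_N,
% used to compare language models (lower is better):
H(m) \approx -\frac{1}{N} \log_2 m(w_1 \dots w_N)
```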

31 Small corpora?
The entire Shakespeare oeuvre consists of:
– 884,647 tokens (N)
– 29,066 types (V)
– around 300,000 bigram types
All of Jane Austen's novels (on Manning and Schuetze's website, also cc437/data):
– N = 617,091 tokens
– V = 14,585 types

32 Maybe with a larger corpus?
Words such as 'ergativity' are unlikely to be found outside a corpus of linguistics articles.
More generally: Zipf's law.

33 Zipf's law for the Brown corpus

34 Addressing the zeroes
SMOOTHING: re-evaluating some of the zero-probability and low-probability n-grams and assigning them non-zero probabilities
– Add-one
– Witten-Bell
– Good-Turing
BACK-OFF: using the probabilities of lower-order n-grams when higher-order ones are not available
– Backoff
– Linear interpolation

35 Add-one ('Laplace's Law')
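The formula on this slide was an image and did not survive the transcript; a sketch of the standard add-one (Laplace) estimate for bigrams, with V the vocabulary size, consistent with the BERP discussion on the following slides:

```latex
% Add-one (Laplace) smoothing for bigrams, where V is the vocabulary size:
P_{\mathrm{add\text{-}one}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
```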

36 Effect on BERP bigram counts

37 Add-one bigram probabilities

38 The problem

39 The problem
Add-one has a huge effect on probabilities: e.g., P(to | want) went from .65 to .28!
Too much probability gets 'removed' from n-grams actually encountered (more precisely, the 'discount factor' applied to seen n-grams is far too large).

40 Witten-Bell Discounting
How can we get a better estimate of the probabilities of things we haven't seen?
The Witten-Bell algorithm is based on the idea that a zero-frequency N-gram is just an event that hasn't happened yet.
How often do such events happen? We model this by the probability of seeing an N-gram for the first time (we just count the number of times we first encountered a type).

41 Witten-Bell: the equations
Total probability mass assigned to zero-frequency N-grams (NB: T is the number of OBSERVED types, not V):
So each zero N-gram gets the probability:
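The equations themselves were images on the original slide; a sketch of the standard Witten-Bell formulas (following Jurafsky and Martin, ch. 6), where N is the number of observed N-gram tokens, T the number of observed N-gram types, and Z the number of N-grams with zero count:

```latex
% Total probability mass reserved for all zero-frequency N-grams:
\sum_{i : c_i = 0} p_i^{*} = \frac{T}{N + T}

% Shared evenly among the Z unseen N-grams, each one gets:
p_i^{*} = \frac{T}{Z\,(N + T)} \qquad \text{if } c_i = 0

% Seen N-grams are discounted accordingly:
p_i^{*} = \frac{c_i}{N + T} \qquad \text{if } c_i > 0
```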

42 Witten-Bell: why 'discounting'
Now of course we have to take away something (a 'discount') from the probability of the events seen more than once:

43 Witten-Bell for bigrams
We 'relativize' the types to the previous word:
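Again, the formulas were images; a sketch of the conditional version (as in Jurafsky and Martin), where T(w) is the number of distinct word types observed after w, N(w) the number of tokens observed after w, and Z(w) the number of word types never observed after w:

```latex
% Unseen bigrams starting with w_{i-1}:
p^{*}(w_i \mid w_{i-1}) = \frac{T(w_{i-1})}{Z(w_{i-1})\,\bigl(N(w_{i-1}) + T(w_{i-1})\bigr)}
  \qquad \text{if } C(w_{i-1} w_i) = 0

% Seen bigrams are discounted:
p^{*}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1}) + T(w_{i-1})}
  \qquad \text{if } C(w_{i-1} w_i) > 0
```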

44 Add-one vs. Witten-Bell discounts for unigrams in the BERP corpus
Word       Add-One   Witten-Bell
"I"        .68       .97
"want"     .42       .94
"to"       .69       .96
"eat"      .37       .88
"Chinese"  .12       .91
"food"     .48       .94
"lunch"    .22       .91

45 One last discounting method…
The best-known discounting method is GOOD-TURING (Good, 1953).
Basic insight: re-estimate the probability of N-grams with zero counts by looking at the number of N-grams that occurred once.
For example, the revised count for bigrams that never occurred is estimated by dividing N1, the number of bigrams that occurred once, by N0, the number of bigrams that never occurred.
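The slide does not spell out the general formula; a sketch of the usual Good-Turing re-estimated count, where Nc is the number of N-grams occurring exactly c times:

```latex
% Good-Turing re-estimated count for an N-gram seen c times:
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_{c}}

% For c = 0 this gives the N_1 / N_0 revised count mentioned on the slide.
```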

46 Combining estimators
A method often used (generally in combination with discounting methods) is to use lower-order estimates to 'help' with higher-order ones:
– Backoff (Katz, 1987)
– Linear interpolation (Jelinek and Mercer, 1980)
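For reference, a sketch of the standard linear interpolation of unigram, bigram, and trigram estimates (in the Jelinek and Mercer style); the weights sum to one and are typically tuned on held-out data:

```latex
\hat{P}(w_n \mid w_{n-2}, w_{n-1}) =
  \lambda_1\, P(w_n \mid w_{n-2}, w_{n-1})
+ \lambda_2\, P(w_n \mid w_{n-1})
+ \lambda_3\, P(w_n),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```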

47 Backoff: the basic idea
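The slide's formula did not survive the transcript; a minimal sketch of the basic idea only (no discounting or re-normalization, which proper Katz backoff adds, see the next slide): use the trigram estimate when the trigram has been seen, otherwise back off to the bigram, otherwise to the unigram. The count tables are illustrative:

```python
# Simple backoff without discounting: fall back to a lower-order MLE estimate
# whenever the higher-order context was never seen.
trigram_c = {("I", "want", "to"): 5}
bigram_c = {("I", "want"): 6, ("want", "to"): 5, ("want", "some"): 1}
unigram_c = {"I": 10, "want": 6, "to": 8, "some": 2}
total_tokens = sum(unigram_c.values())

def backoff_prob(w1, w2, w3):
    """P(w3 | w1, w2), backing off trigram -> bigram -> unigram."""
    if (w1, w2, w3) in trigram_c:
        return trigram_c[(w1, w2, w3)] / bigram_c[(w1, w2)]
    if (w2, w3) in bigram_c:
        return bigram_c[(w2, w3)] / unigram_c[w2]
    return unigram_c.get(w3, 0) / total_tokens

print(backoff_prob("I", "want", "to"))      # trigram seen: 5/6
print(backoff_prob("you", "want", "some"))  # backs off to the bigram: 1/6
```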

48 Backoff with discounting
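This formula is also missing from the transcript; a sketch of the usual discounted formulation (Katz, 1987, as presented in Jurafsky and Martin), where the tilde marks discounted estimates and the alpha weights distribute the left-over probability mass:

```latex
P_{\mathrm{katz}}(w_n \mid w_{n-2}, w_{n-1}) =
\begin{cases}
\tilde{P}(w_n \mid w_{n-2}, w_{n-1}) & \text{if } C(w_{n-2} w_{n-1} w_n) > 0 \\
\alpha(w_{n-2}, w_{n-1})\,\tilde{P}(w_n \mid w_{n-1}) & \text{else if } C(w_{n-1} w_n) > 0 \\
\alpha(w_{n-1})\,\tilde{P}(w_n) & \text{otherwise}
\end{cases}
```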

49 A more radical solution: the Web as a corpus
Keller and Lapata (2003): using the Web to obtain frequencies for unseen bigrams.
Corpora: the British National Corpus (100M words), Google, Altavista.
Average factor by which Web counts are larger than BNC counts: ~1,000
Percentage of bigrams unseen in the BNC that are also unseen using Google: 2% (7/270)

50 NB: we STILL need smoothing!

51 Readings
Jurafsky and Martin, chapter 6
The Statistics Glossary
Word prediction:
– for mobile phones
– for disabled users
Further reading: Manning and Schuetze, chapter 6 (Good-Turing)

52 Acknowledgments
Some of the material in these slides was taken from lecture notes by Diane Litman and James Martin.

