1 N-Gram – Part 2 ICS 482 Natural Language Processing Lecture 8: N-Gram – Part 2 Husni Al-Muhtaseb

2 In the name of Allah, the Most Gracious, the Most Merciful (بسم الله الرحمن الرحيم) ICS 482 Natural Language Processing Lecture 8: N-Gram – Part 2 Husni Al-Muhtaseb

NLP Credits and Acknowledgment These slides were adapted from presentations by the authors of the book SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, with some modifications from presentations found on the Web by several scholars, including the following

NLP Credits and Acknowledgment If your name is missing, please contact me: muhtaseb At Kfupm.Edu.sa

NLP Credits and Acknowledgment Husni Al-Muhtaseb James Martin Jim Martin Dan Jurafsky Sandiway Fong Song young in Paula Matuszek Mary-Angela Papalaskari Dick Crouch Tracy Kin L. Venkata Subramaniam Martin Volk Bruce R. Maxim Jan Hajič Srinath Srinivasa Simeon Ntafos Paolo Pirjanian Ricardo Vilalta Tom Lenaerts Heshaam Feili Björn Gambäck Christian Korthals Thomas G. Dietterich Devika Subramanian Duminda Wijesekera Lee McCluskey David J. Kriegman Kathleen McKeown Michael J. Ciaraldi David Finkel Min-Yen Kan Andreas Geyer- Schulz Franz J. Kurfess Tim Finin Nadjet Bouayad Kathy McCoy Hans Uszkoreit Azadeh Maghsoodi Khurshid Ahmad Staffan Larsson Robert Wilensky Feiyu Xu Jakub Piskorski Rohini Srihari Mark Sanderson Andrew Elks Marc Davis Ray Larson Jimmy Lin Marti Hearst Andrew McCallum Nick Kushmerick Mark Craven Chia-Hui Chang Diana Maynard James Allan Martha Palmer julia hirschberg Elaine Rich Christof Monz Bonnie J. Dorr Nizar Habash Massimo Poesio David Goss- Grubbs Thomas K Harris John Hutchins Alexandros Potamianos Mike Rosner Latifa Al-Sulaiti Giorgio Satta Jerry R. Hobbs Christopher Manning Hinrich Schütze Alexander Gelbukh Gina-Anne Levow Guitao Gao Qing Ma Zeynep Altan

6 Previous Lectures  Pre-start questionnaire  Introduction and Phases of an NLP system  NLP Applications - Chatting with Alice  Finite State Automata & Regular Expressions & languages  Deterministic & Non-deterministic FSAs  Morphology: Inflectional & Derivational  Parsing and Finite State Transducers  Stemming & Porter Stemmer  20 Minute Quiz  Statistical NLP – Language Modeling  N Grams

7 Today’s Lecture  N-Grams  Bigrams  Smoothing and N-Grams: Add-One smoothing, Witten-Bell smoothing

8 Simple N-Grams  An N-gram model uses the previous N-1 words to predict the next one: P(w_n | w_{n-N+1} ... w_{n-1})  We'll be dealing with P(word | some fixed prefix)  unigrams: P(dog)  bigrams: P(dog | big)  trigrams: P(dog | the big)  quadrigrams: P(dog | the big dopey)

9 Chain Rule  Conditional probability: P(A | B) = P(A ^ B) / P(B), so P(A ^ B) = P(A | B) P(B)  “the dog”: P(the dog) = P(the) P(dog | the)  “the dog bites”: P(the dog bites) = P(the) P(dog | the) P(bites | the dog)

10 Chain Rule the probability of a word sequence is the probability of a conjunctive event. Unfortunately, that’s really not helpful in general. Why?

11 Markov Assumption  P(w_n) can be approximated using only the N-1 previous words of context  This lets us collect statistics in practice  Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past  Order of a Markov model: length of the prior context

12 Language Models and N-grams  Given a word sequence: w_1 w_2 w_3 ... w_n  Chain rule: p(w_1 w_2) = p(w_1) p(w_2|w_1); p(w_1 w_2 w_3) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2); ... ; p(w_1 w_2 w_3 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) ... p(w_n|w_1 ... w_{n-2} w_{n-1})  Note: it’s not easy to collect (meaningful) statistics on p(w_n|w_1 ... w_{n-2} w_{n-1}) for all possible word sequences  Bigram approximation: just look at the previous word only (not all the preceding words); this is the Markov assumption of a finite-length history, giving a 1st-order Markov model: p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_2) ... p(w_n|w_{n-1})  Note: p(w_n|w_{n-1}) is a lot easier to estimate well than p(w_n|w_1 ... w_{n-2} w_{n-1})

13 Language Models and N-grams  Given a word sequence: w_1 w_2 w_3 ... w_n  Chain rule: p(w_1 w_2) = p(w_1) p(w_2|w_1); p(w_1 w_2 w_3) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2); ... ; p(w_1 w_2 w_3 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) ... p(w_n|w_1 ... w_{n-2} w_{n-1})  Trigram approximation (2nd-order Markov model): just look at the preceding two words only: p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) p(w_4|w_2 w_3) ... p(w_n|w_{n-2} w_{n-1})  Note: p(w_n|w_{n-2} w_{n-1}) is a lot easier to estimate well than p(w_n|w_1 ... w_{n-2} w_{n-1}), but harder than p(w_n|w_{n-1})
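The bigram and trigram approximations are the same computation with a different history length: condition each word on at most the previous N-1 words. A minimal sketch in Python (not from the slides), assuming a hypothetical cond_prob(word, history) function that returns the conditional probability:

```python
# Sketch: chain-rule factorization under an order-(n-1) Markov assumption.
# cond_prob is assumed to be any function returning p(word | history_tuple).

def sequence_probability(words, cond_prob, n=2):
    """Approximate p(w_1 ... w_k) with an n-gram model (n=2: bigram, n=3: trigram)."""
    prob = 1.0
    for i, w in enumerate(words):
        history = tuple(words[max(0, i - (n - 1)):i])  # at most the n-1 previous words
        prob *= cond_prob(w, history)
    return prob
```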

14 Corpora  Corpora are (generally online) collections of text and speech  e.g. Brown Corpus (1M words) Wall Street Journal and AP News corpora ATIS, Broadcast News (speech) TDT (text and speech) Switchboard, Call Home (speech) TRAINS, FM Radio (speech)

15 Sample word frequency (count) data from TREC (the Text REtrieval Conference) (from B. Croft, UMass)

16 Counting Words in Corpora  Probabilities are based on counting things, so ….  What should we count?  Words, word classes, word senses, speech acts …?  What is a word? e.g., are cat and cats the same word? September and Sept? zero and oh? Is seventy-two one word or two? AT&T?  Where do we find the things to count?

17 Terminology  Sentence: unit of written language  Utterance: unit of spoken language  Wordform: the inflected form that appears in the corpus  Lemma: lexical forms having the same stem, part of speech, and word sense  Types: number of distinct words in a corpus (vocabulary size)  Tokens: total number of words

18 Training and Testing  Probabilities come from a training corpus, which is used to design the model. narrow corpus: probabilities don't generalize general corpus: probabilities don't reflect task or domain  A separate test corpus is used to evaluate the model

19 Simple N-Grams  An N-gram model uses the previous N-1 words to predict the next one: P(w_n | w_{n-N+1} ... w_{n-1})  We'll be dealing with P(word | some fixed prefix)  unigrams: P(dog)  bigrams: P(dog | big)  trigrams: P(dog | the big)  quadrigrams: P(dog | the big red)

20 Using N-Grams  Recall that P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-N+1} ... w_{n-1})  For a bigram grammar, P(sentence) can be approximated by multiplying all the bigram probabilities in the sequence: P(I want to eat Chinese food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(</s> | food)
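A small sketch of this computation, using illustrative probability values taken from the BERP fragments shown on later slides (the final P(</s> | food) factor is left out because its value is not given here):

```python
# Sketch: scoring a sentence with a bigram grammar by multiplying the
# bigram probabilities, starting from the sentence marker <s>.
bigram_p = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "Chinese"): 0.02, ("Chinese", "food"): 0.56,
}

def bigram_sentence_prob(tokens, probs):
    tokens = ["<s>"] + tokens
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= probs.get((prev, cur), 0.0)   # unseen bigram: probability 0 (hence smoothing later)
    return p

print(bigram_sentence_prob("I want to eat Chinese food".split(), bigram_p))  # about 0.00015
```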

21 Chain Rule  Recall the definition of conditional probabilities: P(A | B) = P(A ^ B) / P(B)  Rewriting: P(A ^ B) = P(A | B) P(B)  Or… P(A ^ B) = P(B | A) P(A)

22 Example  The big red dog  P(The) * P(big | the) * P(red | the big) * P(dog | the big red)  Better: P(The) written as P(The | <s>), conditioning on a start-of-sentence marker  Also use a marker </s> for the end of the sentence

23 General Case  The word sequence from position 1 to n is w_1^n = w_1 w_2 ... w_n  So the probability of a sequence is P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})

24 Unfortunately  That doesn’t help, since it’s unlikely we’ll ever gather the right statistics for the long prefixes.

25 Markov Assumption  Assume that the entire prefix history isn’t necessary.  In other words, an event doesn’t depend on all of its history, just a fixed-length recent history

26 Markov Assumption  So for each component in the product, replace it with its approximation, assuming a prefix of the N-1 previous words: P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

27 N-Grams  The big red dog  Unigrams: P(dog)  Bigrams: P(dog | red)  Trigrams: P(dog | big red)  Four-grams: P(dog | the big red)  In general, we’ll be dealing with P(Word | Some fixed prefix)  Note: the prefix is the previous words

28 N-gram models can be trained by counting and normalizing  Bigram: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})  N-gram: P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})
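A minimal sketch of this counting-and-normalizing step, assuming sentences are padded with <s> and </s> markers (the toy corpus on the next slide is used as input):

```python
# Sketch: maximum-likelihood bigram estimation by counting and normalizing,
# P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
from collections import defaultdict

def train_bigrams(sentences):
    unigram = defaultdict(int)   # C(w_{n-1})
    bigram = defaultdict(int)    # C(w_{n-1} w_n)
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    return {pair: c / unigram[pair[0]] for pair, c in bigram.items()}

corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
probs = train_bigrams(corpus)
print(probs[("<s>", "I")])   # 2/3: two of the three sentences start with "I"
```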

29 An example  I am Sam  Sam I am  I do not like green eggs and ham

30 BERP Bigram Counts (BErkeley Restaurant Project, speech)

            I     want   to    eat   Chinese  food  lunch
  I         8     1087   0     13    0        0     0
  want      3     0      786   0     6        8     6
  to        3     0      10    860   3        0     12
  eat       0     0      2     0     19       2     52
  Chinese   2     0      0     0     0        120   1
  food      19    0      17    0     0        0     0
  lunch     4     0      0     0     0        1     0

31 BERP Bigram Probabilities  Normalization: divide each row’s counts by the appropriate unigram count (e.g. C(I) = 3437)  Computing the probability of I followed by I: p = C(I I) / C(I) = 8 / 3437 = .0023  A bigram grammar is an NxN matrix of probabilities, where N is the vocabulary size; here the rows and columns are again I, want, to, eat, Chinese, food, lunch

32 A Bigram Grammar Fragment from BERP

  Eat on      .16     Eat Thai       .03
  Eat some    .06     Eat breakfast  .03
  Eat lunch   .06     Eat in         .02
  Eat dinner  .05     Eat Chinese    .02
  Eat at      .04     Eat Mexican    .02
  Eat a       .04     Eat tomorrow   .01
  Eat Indian  .04     Eat dessert    .007
  Eat today   .03     Eat British    .001

33 A Bigram Grammar Fragment from BERP (continued)

  <s> I     .25     Want some           .04
  <s> I’d   .06     Want Thai           .01
  <s> Tell  .04     To eat              .26
  <s> I’m   .02     To have             .14
  I want    .32     To spend            .09
  I would   .29     To be               .02
  I don’t   .08     British food        .60
  I have    .04     British restaurant  .15
  Want to   .65     British cuisine     .01
  Want a    .05     British lunch       .01

34 Language Models and N-grams  Example: from unigram frequencies and w_{n-1} w_n bigram frequencies we compute bigram probabilities  The resulting matrix is sparse (mostly zeros), and zero probabilities are unusable, so we’ll need to do smoothing

35 Example  P(I want to eat British food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British) = .25 * .32 * .65 * .26 * .001 * .60 ≈ .0000081 (different from the textbook)  vs. P(I want to eat Chinese food) ≈ .00015
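The two numbers can be checked directly from the bigram fragments on the earlier slides; the Chinese-food factors P(Chinese | eat) = .02 and P(food | Chinese) = .56 are the values quoted elsewhere in this lecture. A short check:

```python
# Sketch: recomputing the two example sentence probabilities.
british = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60   # P(I want to eat British food)
chinese = 0.25 * 0.32 * 0.65 * 0.26 * 0.02 * 0.56    # P(I want to eat Chinese food)
print(british)   # about 8.1e-06
print(chinese)   # about 1.5e-04, roughly 19 times larger
```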

36 Note on Example  Probabilities seem to capture “syntactic” facts and “world knowledge”:  eat is often followed by an NP  British food is not too popular

37 What do we learn about the language?  What's being captured with... P(want | I) =.32 P(to | want) =.65 P(eat | to) =.26 P(food | Chinese) =.56 P(lunch | eat) =.055

38 Some Observations  What is being captured by these bigrams?  P(I | I), as in “I I I want”  P(want | I), as in “I want I want to”  P(I | food), as in “The food I want is”

39  What about  P(I | I) = .0023, as in “I I I I want”  P(I | want) = .0025, as in “I want I want”  P(I | food) = .013, as in “the kind of food I want is...”

40 To avoid underflow use Logs  You don’t really do all those multiplies. The numbers are too small and lead to underflows  Convert the probabilities to logs and then do additions.  To get the real probability (if you need it) go back to the antilog.
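A minimal sketch of the log-space trick, reusing the factors from the earlier example sentence:

```python
# Sketch: add log probabilities instead of multiplying raw probabilities,
# which avoids floating-point underflow on long word sequences.
import math

factors = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]
log_p = sum(math.log(p) for p in factors)   # work in log space
print(log_p)                                # about -11.72
print(math.exp(log_p))                      # antilog, only if the raw probability is needed: ~8.1e-06
```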

41 Generation  Choose N-Grams according to their probabilities and string them together

42 BERP  I want → want to → to eat → eat Chinese → Chinese food → food .
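A minimal sketch of this generation loop, with a small hypothetical bigram distribution standing in for the full BERP model (which would have one weighted list of successors per word):

```python
# Sketch: generating a sentence by repeatedly sampling the next word from
# the bigram distribution of the previous word (illustrative probabilities).
import random

next_word = {
    "<s>": [("I", 1.0)],
    "I": [("want", 1.0)],
    "want": [("to", 1.0)],
    "to": [("eat", 1.0)],
    "eat": [("Chinese", 0.5), ("lunch", 0.5)],
    "Chinese": [("food", 1.0)],
    "lunch": [("</s>", 1.0)],
    "food": [("</s>", 1.0)],
}

def generate(dist, max_len=20):
    word, out = "<s>", []
    while word != "</s>" and len(out) < max_len:
        words, weights = zip(*dist[word])
        word = random.choices(words, weights=weights)[0]
        if word != "</s>":
            out.append(word)
    return " ".join(out)

print(generate(next_word))   # e.g. "I want to eat Chinese food"
```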

43 Some Useful Observations  A small number of events occur with high frequency You can collect reliable statistics on these events with relatively small samples  A large number of events occur with small frequency You might have to wait a long time to gather statistics on the low frequency events

44 Some Useful Observations  Some zeroes are really zeroes Meaning that they represent events that can’t or shouldn’t occur  On the other hand, some zeroes aren’t really zeroes They represent low frequency events that simply didn’t occur in the corpus

45 Problem  Let’s assume we’re using N-grams  How can we assign a probability to a sequence where one of the component n-grams has a value of zero?  Assume all the words are known and have been seen  Options: go to a lower-order n-gram; back off from bigrams to unigrams; replace the zero with something else

46 Add-One  Make the zero counts 1 (by adding one to every count).  Justification: they’re just events you haven’t seen yet; if you had seen them, you would only have seen them once, so make the count equal to 1.

47 Add-one: Example  Unsmoothed bigram counts and the corresponding unsmoothed normalized bigram probabilities (tables indexed by 1st word and 2nd word)

48 Add-one: Example (con’t)  Add-one smoothed bigram counts and the corresponding add-one normalized bigram probabilities

49 The example again  Unsmoothed bigram counts, with V = 1616 word types  Smoothed P(eat | I) = (C(I eat) + 1) / (number of bigrams starting with “I” + number of possible bigrams starting with “I”) = (C(I eat) + 1) / (C(I) + V) = (13 + 1) / (3437 + 1616) ≈ .0028

50 Smoothing and N-grams  Add-One Smoothing: add 1 to all frequency counts  Bigram: p(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)  Reconstructed (smoothed) counts: (C(w_{n-1} w_n) + 1) * C(w_{n-1}) / (C(w_{n-1}) + V)  Remark: add-one causes large changes in some frequencies due to the relative size of V (1616); e.g. want to: 786 → 338 = (786 + 1) * 1215 / (1215 + 1616)
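A minimal sketch of the two formulas on this slide, reproducing the BERP numbers quoted here and on the previous slide:

```python
# Sketch: add-one (Laplace) smoothed bigram probability and the
# corresponding reconstructed count.
def add_one_prob(c_bigram, c_prev, vocab_size):
    """p(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)"""
    return (c_bigram + 1) / (c_prev + vocab_size)

def add_one_count(c_bigram, c_prev, vocab_size):
    """Smoothed count: (C(w_{n-1} w_n) + 1) * C(w_{n-1}) / (C(w_{n-1}) + V)"""
    return (c_bigram + 1) * c_prev / (c_prev + vocab_size)

V = 1616                              # BERP vocabulary size (word types)
print(add_one_count(786, 1215, V))    # "want to": 786 drops to about 338
print(add_one_prob(13, 3437, V))      # "I eat": (13 + 1) / (3437 + 1616) ≈ 0.0028
```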

51 Problem with add-one smoothing  Counts of bigrams starting with Chinese are rescaled by a factor of about 8.6 (1829 / 213): the denominator grows from C(Chinese) = 213 to C(Chinese) + V = 1829, so seen bigrams are heavily discounted while unseen ones are boosted  (Compare the unsmoothed and the add-one smoothed bigram count tables, rows indexed by the 1st word)

52 Problem with add-one smoothing (con’t)  Data from the AP newswire (Church and Gale, 1991): corpus of 22,000,000 bigrams; vocabulary of 273,266 words (i.e. 74,674,306,756 possible bigrams); 74,671,100,000 bigrams were unseen, and each unseen bigram was given a frequency of 0.000295  Comparing f_MLE (freq. from training data), f_empirical (freq. from held-out data) and f_add-one (add-one smoothed freq.): the add-one frequencies are too high for unseen bigrams and too low for seen ones  Total probability mass given to unseen bigrams = (74,671,100,000 x 0.000295) / 22,000,000 ≈ 99.96% !

53 Smoothing and N-grams  Witten-Bell Smoothing: equate zero-frequency items with frequency-1 items; use the frequency of things seen once to estimate the frequency of things we haven’t seen yet; smaller impact than Add-One  Unigram case: a zero-frequency word (unigram) is “an event that hasn’t happened yet”; count the number of word types (T) we’ve observed in the corpus; then p(w) = T / (Z * (N + T))  w is a word with zero frequency  Z = number of zero-frequency words  N = size of the corpus
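A minimal sketch of the unigram case, assuming vocab is the full known vocabulary (so the zero-frequency words can be counted):

```python
# Sketch: Witten-Bell estimates for unigrams.
# Seen words get c(w) / (N + T); the reserved mass T / (N + T) is split
# evenly over the Z zero-frequency words, giving T / (Z * (N + T)) each.
from collections import Counter

def witten_bell_unigram(tokens, vocab):
    counts = Counter(tokens)
    N = len(tokens)                                  # corpus size
    T = len(counts)                                  # number of observed word types
    Z = sum(1 for w in vocab if w not in counts)     # number of zero-frequency words
    def p(w):
        if counts[w] > 0:
            return counts[w] / (N + T)
        return T / (Z * (N + T))
    return p
```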

54 Distributing  The amount to be distributed is T / (N + T)  The number of events with count zero is Z  So distributing evenly gets us p(w) = T / (Z * (N + T)) for each unseen event

55 Smoothing and N-grams  Bigram: p(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) (original)  For zero bigrams (after Witten-Bell): p(w_n | w_{n-1}) = T(w_{n-1}) / (Z(w_{n-1}) * (C(w_{n-1}) + T(w_{n-1})))  T(w_{n-1}) = number of bigram types beginning with w_{n-1}  Z(w_{n-1}) = number of unseen bigrams beginning with w_{n-1} = total number of possible bigrams beginning with w_{n-1} minus the ones we’ve seen = V - T(w_{n-1})  Estimated zero-bigram count: T(w_{n-1}) / Z(w_{n-1}) * C(w_{n-1}) / (C(w_{n-1}) + T(w_{n-1}))  For non-zero bigrams (after Witten-Bell): p(w_n | w_{n-1}) = C(w_{n-1} w_n) / (C(w_{n-1}) + T(w_{n-1}))

56 Smoothing and N-grams  Witten-Bell Smoothing: use the frequency (count) of things seen once to estimate the frequency (count) of things we haven’t seen yet  Bigram: estimated zero-bigram count = T(w_{n-1}) / Z(w_{n-1}) * C(w_{n-1}) / (C(w_{n-1}) + T(w_{n-1}))  T(w_{n-1}) = number of bigram types beginning with w_{n-1}  Z(w_{n-1}) = number of unseen bigrams beginning with w_{n-1}  Remark: smaller changes than Add-One
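A minimal sketch of the bigram formulas from the last two slides (edge cases such as a completely unseen history word are ignored):

```python
# Sketch: Witten-Bell smoothed bigram probabilities.
# bigram_counts maps (w_prev, w) -> count, unigram_counts maps w_prev -> C(w_prev),
# V is the vocabulary size.
def witten_bell_bigram(bigram_counts, unigram_counts, V):
    types_after = {}                      # T(w_prev): number of bigram types seen after w_prev
    for (prev, _w), c in bigram_counts.items():
        if c > 0:
            types_after[prev] = types_after.get(prev, 0) + 1

    def p(w, prev):
        C = unigram_counts[prev]
        T = types_after.get(prev, 0)
        Z = V - T                         # Z(w_prev): unseen bigram types starting with w_prev
        c = bigram_counts.get((prev, w), 0)
        if c > 0:
            return c / (C + T)            # non-zero bigram
        return T / (Z * (C + T))          # zero bigram
    return p
```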

57 Distributing Among the Zeros  If a bigram “w_x w_i” has a zero count, its estimated count is T(w_x) / Z(w_x) * C(w_x) / (C(w_x) + T(w_x)), where Z(w_x) is the number of bigrams starting with w_x that were not seen, C(w_x) is the actual frequency (count) of bigrams beginning with w_x, and T(w_x) is the number of bigram types starting with w_x

58 Thank you  Peace be upon you and the mercy of Allah (السلام عليكم ورحمة الله)