I256 Applied Natural Language Processing Fall 2009 Lecture 7 Practical examples of Graphical Models Language models Sparse data & smoothing Barbara Rosario.


I256 Applied Natural Language Processing Fall 2009 Lecture 7 Practical examples of Graphical Models Language models Sparse data & smoothing Barbara Rosario

Today
Exercises
–Design a graphical model
–Learn parameters for Naïve Bayes
Language models (n-grams)
Sparse data & smoothing methods

Exercise
Let’s design a GM
Problem: topic and subtopic classification
–Each document has one broad semantic topic (e.g. politics, sports, etc.)
–There are several subtopics in each document
–Example: a sports document can contain a part describing a match, a part describing the location of the match, and a part about the persons involved

Exercise
The goal is to classify the overall topic (T) of the documents and all the subtopics (ST_i)
Assumptions:
–The subtopics ST_i depend on the topic T of the document
–The subtopics ST_i are conditionally independent of each other (given T)
–The words of the document w_j depend on the subtopic ST_i and are conditionally independent of each other (given ST_i)
–For simplicity, assume as many subtopic nodes as there are words
What would a GM encoding these assumptions look like?
–Variables? Edges? Joint probability distribution?

Exercise
What if the words of the document also depend directly on the topic T?
–The subtopic persons may be quite different if the overall topic is sports or politics
What if there is an ordering on the subtopics, i.e. ST_i depends on T and also on ST_{i-1}?

Naïve Bayes for topic classification
(Figure: GM with topic node T and word nodes w_1 … w_n)
Recall the general joint probability distribution: P(X_1, …, X_N) = ∏_i P(X_i | Par(X_i))
P(T, w_1, …, w_n) = P(T) P(w_1 | T) P(w_2 | T) … P(w_n | T) = P(T) ∏_i P(w_i | T)
Inference (Testing): compute conditional probabilities P(T | w_1, w_2, …, w_n)
Estimation (Training): given data, estimate P(T) and P(w_i | T)

Exercise
Topic = sport (num words = 15)
D1: 2009 open season
D2: against Maryland Sept
D3: play six games
D3: schedule games weekends
D4: games games games
Topic = politics (num words = 19)
D1: Obama hoping rally support
D2: billion stimulus package
D3: House Republicans tax
D4: cuts spending GOP games
D4: Republicans obama open
D5: political season
Estimate P(w_i | T_j) for each w_i, T_j:
P(obama | T=politics) = P(w=obama, T=politics) / P(T=politics) = (c(w=obama, T=politics)/34) / (19/34) = 2/19
P(obama | T=sport) = P(w=obama, T=sport) / P(T=sport) = (c(w=obama, T=sport)/34) / (15/34) = 0
P(season | T=politics) = P(w=season, T=politics) / P(T=politics) = (c(w=season, T=politics)/34) / (19/34) = 1/19
P(season | T=sport) = P(w=season, T=sport) / P(T=sport) = (c(w=season, T=sport)/34) / (15/34) = 1/15
P(republicans | T=politics) = P(w=republicans, T=politics) / P(T=politics) = c(w=republicans, T=politics)/19 = 2/19
P(republicans | T=sport) = P(w=republicans, T=sport) / P(T=sport) = c(w=republicans, T=sport)/15 = 0/15 = 0
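As a sanity check, here is a minimal Python sketch (not part of the original slides; the document splitting and lowercasing are assumptions made for illustration) that reproduces these estimates by relative-frequency counting over the toy corpus:

```python
# Estimate the Naive Bayes parameters P(T) and P(w | T) by counting.
from collections import Counter

corpus = {
    "sport": "2009 open season against Maryland Sept play six games "
             "schedule games weekends games games games".lower().split(),
    "politics": "Obama hoping rally support billion stimulus package House "
                "Republicans tax cuts spending GOP games Republicans obama open "
                "political season".lower().split(),
}

total_words = sum(len(words) for words in corpus.values())            # 34
prior = {t: len(words) / total_words for t, words in corpus.items()}  # P(T): sport 15/34, politics 19/34
word_counts = {t: Counter(words) for t, words in corpus.items()}

def p_word_given_topic(w, t):
    """MLE estimate P(w | T=t) = c(w, t) / c(t)."""
    return word_counts[t][w] / len(corpus[t])

print(p_word_given_topic("obama", "politics"))  # 2/19 ≈ 0.105
print(p_word_given_topic("obama", "sport"))     # 0.0
print(p_word_given_topic("season", "sport"))    # 1/15 ≈ 0.067
```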

Exercise: inference
What is the topic of new documents:
–Republicans obama season
–games season open
–democrats kennedy house

Exercise: inference
Recall: Bayes decision rule
Decide T_j if P(T_j | c) > P(T_k | c) for T_j ≠ T_k
c is the context, here the words of the document
We want to assign the topic T for which T’ = argmax_{T_j} P(T_j | c)

Exercise: Bayes classification
We compute P(T_j | c) with Bayes rule: P(T_j | c) = P(c | T_j) P(T_j) / P(c)
Because of the dependencies encoded in the GM, P(c | T_j) = ∏_i P(w_i | T_j), so P(T_j | c) ∝ P(T_j) ∏_i P(w_i | T_j)

Exercise: Bayes classification
New sentence: republicans obama season
That is, for each T_j we calculate P(T_j) ∏_i P(w_i | T_j) and see which one is higher
T = politics? P(politics | c) ∝ P(politics) P(republicans | politics) P(obama | politics) P(season | politics) = 19/34 · 2/19 · 2/19 · 1/19 > 0
T = sport? P(sport | c) ∝ P(sport) P(republicans | sport) P(obama | sport) P(season | sport) = 15/34 · 0 · 0 · 1/15 = 0
Choose T = politics
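Continuing the sketch above (it reuses `prior`, `p_word_given_topic`, and `corpus` from that snippet), the decision rule simply picks the topic with the larger product:

```python
# Bayes decision rule: argmax over topics of P(T) * prod_i P(w_i | T).
def score(doc_words, t):
    p = prior[t]
    for w in doc_words:
        p *= p_word_given_topic(w, t)   # becomes 0 if w was never seen with topic t
    return p

doc = "republicans obama season".lower().split()
scores = {t: score(doc, t) for t in corpus}
print(scores)                        # politics: 19/34 * 2/19 * 2/19 * 1/19 > 0, sport: 0
print(max(scores, key=scores.get))   # 'politics'
```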

Exercise: Bayes classification
That is, for each T_j we calculate P(T_j) ∏_i P(w_i | T_j) and see which one is higher
New sentence: democrats kennedy house
T = politics? P(politics | c) ∝ P(politics) P(democrats | politics) P(kennedy | politics) P(house | politics) = 19/34 · 0 · 0 · 1/19 = 0
democrats, kennedy: unseen words → data sparsity
How can we address this?

Today
Exercises
–Design of a GM
–Learn parameters
Language models (n-grams)
Sparse data & smoothing methods

Language Models
Model to assign scores to sentences
Probabilities should broadly indicate likelihood of sentences
–P(I saw a van) >> P(eyes awe of an)
Not grammaticality
–P(artichokes intimidate zippers) ≈ 0
In principle, “likely” depends on the domain, context, speaker…
Adapted from Dan Klein’s CS 288 slides

Language models
Related: the task of predicting the next word
Can be useful for
–Spelling correction: I need to notified the bank
–Machine translation
–Speech recognition
–OCR (optical character recognition)
–Handwriting recognition
–Augmentative communication: computer systems that help the disabled communicate, for example systems that let users choose words with hand movements

Language Models
Model to assign scores to sentences
–Sentence: w_1, w_2, … w_n
–Break the sentence probability down with the chain rule (no loss of generality)
–Too many histories!
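The chain-rule formula itself was on the slide image and is missing from the transcript; the standard decomposition it refers to is:

```latex
P(w_1, w_2, \ldots, w_n)
  = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})
  = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```

“Too many histories” refers to the fact that the conditioning context w_1, …, w_{i-1} grows with i, so almost every history in a corpus is unique.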

Markov assumption: n-gram solution
Markov assumption: only the prior local context (the last “few” n words) affects the next word
N-gram models: assume each word depends only on a short linear history
–Use N−1 words to predict the next one
(Figure: chain of word nodes w_1 … w_i, with w_i depending only on the preceding words, e.g. w_{i-2}, w_{i-1})
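Written out (the formula was again on the slide image), the order-(N−1) Markov assumption is:

```latex
P(w_i \mid w_1, \ldots, w_{i-1}) \;\approx\; P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})
\qquad \text{bigram: } P(w_i \mid w_{i-1}), \quad \text{trigram: } P(w_i \mid w_{i-2}, w_{i-1})
```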

n-gram: Unigrams (n = 1) From Dan Klein’s CS 288 slides

n-gram: Bigrams (n = 2) From Dan Klein’s CS 288 slides

n-gram: Trigrams (n = 3)
(Figure: chain of word nodes w_1, w_2, w_3, …, w_N)
From Dan Klein’s CS 288 slides

Choice of n
In principle we would like the n of the n-gram to be large
–green
–large green
–the large green
–swallowed the large green
–“swallowed” should influence the choice of the next word (mountain is unlikely, pea more likely)
–The crocodile swallowed the large green..
–Mary swallowed the large green..
–And so on…

Discrimination vs. reliability
Looking at longer histories (large n) should allow us to make better predictions (better discrimination)
But it’s much harder to get reliable statistics, since the number of parameters to estimate becomes too large
–The larger the n, the larger the number of parameters to estimate, and the more data needed for statistically reliable estimates

Language Models
N = size of the vocabulary
Unigrams: for each w_i calculate P(w_i): N such numbers → N parameters
Bigrams: for each w_i, w_j calculate P(w_i | w_j): N×N parameters
Trigrams: for each w_i, w_j, w_k calculate P(w_i | w_j, w_k): N×N×N parameters

N-grams and parameters
Assume we have a vocabulary of 20,000 words. Growth in the number of parameters for n-gram models:
–Bigram model: 20,000^2 = 400 million parameters
–Trigram model: 20,000^3 = 8 trillion parameters
–Four-gram model: 20,000^4 = 1.6 × 10^17 parameters

Sparsity
Zipf’s law: most words are rare
–This makes frequency-based approaches to language hard
New words appear all the time; new bigrams appear even more often, and trigrams (and longer n-grams) more often still!
These relative frequency estimates are the MLE (maximum likelihood estimates): the choice of parameters that gives the highest probability to the training corpus

Sparsity
The larger the number of parameters, the more likely it is to get zero probabilities
Note also the product: if we have a single zero for an unseen event, the zero propagates and gives a zero probability for the whole sentence
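A minimal sketch (the toy corpus and test sentences are invented for illustration) of MLE bigram estimation by relative frequency, showing how a single unseen bigram zeroes out the whole sentence probability:

```python
# MLE bigram model via relative frequencies on a toy corpus; one unseen bigram
# drives the probability of the whole sentence to 0.
from collections import Counter

corpus = "i saw a van . i saw a cat . the cat saw a van .".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def p_mle(word, prev):
    """P(word | prev) = c(prev, word) / c(prev); 0 for unseen bigrams."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(words):
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_mle(word, prev)
    return p

print(sentence_prob("i saw a cat .".split()))   # 1/6 > 0: every bigram was observed
print(sentence_prob("i saw the van .".split())) # 0.0: "saw the" (and "the van") never seen
```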

Tackling data sparsity
Discounting or smoothing methods
–Change the probabilities to avoid zeros
–Remember that probability distributions have to sum to 1
–Decrease the non-zero probabilities (seen events) and give the freed-up probability mass to the zero probabilities (unseen events)

Smoothing From Dan Klein’s CS 288 slides

Smoothing
Put probability mass on “unseen events”
–Add-one / add-delta (uniform prior)
–Add-one / add-delta (unigram prior)
–Linear interpolation
–…
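A sketch of add-one / add-delta smoothing for bigrams, reusing `bigram_counts`, `unigram_counts`, and `p_mle` from the bigram snippet above; the vocabulary size V appears in the denominator so the smoothed conditional distribution still sums to 1:

```python
# Add-delta smoothing: P(w | prev) = (c(prev, w) + delta) / (c(prev) + delta * V).
# delta = 1 gives add-one (Laplace) smoothing.
vocab = set(unigram_counts)
V = len(vocab)

def p_add_delta(word, prev, delta=1.0):
    return (bigram_counts[(prev, word)] + delta) / (unigram_counts[prev] + delta * V)

print(p_mle("the", "saw"))               # 0.0: unseen bigram
print(p_add_delta("the", "saw"))         # small but non-zero (1/10 with delta = 1)
print(sum(p_add_delta(w, "saw") for w in vocab))  # 1.0 over the closed vocabulary
```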

Smoothing: Combining estimators
Make a linear combination of multiple probability estimates
–(provided that we weight the contribution of each of them so that the result is another probability function)
Linear interpolation or mixture models
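The interpolation formula was on the slide image; the standard trigram version (with the weights λ_k non-negative, summing to 1, and typically tuned on held-out data) is:

```latex
P_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1})
  = \lambda_1 \hat{P}(w_i)
  + \lambda_2 \hat{P}(w_i \mid w_{i-1})
  + \lambda_3 \hat{P}(w_i \mid w_{i-2}, w_{i-1}),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1,\ \lambda_k \ge 0
```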

Smoothing: Combining estimators Back-off models –Special case of linear interpolation

Smoothing: Combining estimators Back-off models: trigram version
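The back-off formula was also on the slide image; schematically (leaving out the discounting and normalization constants that a proper Katz back-off needs in order to remain a probability distribution), the trigram version uses the trigram estimate when its count is non-zero and otherwise falls back to lower-order estimates:

```latex
P_{\text{bo}}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
\hat{P}(w_i \mid w_{i-2}, w_{i-1}) & \text{if } c(w_{i-2}, w_{i-1}, w_i) > 0 \\
\alpha_1\, \hat{P}(w_i \mid w_{i-1}) & \text{else if } c(w_{i-1}, w_i) > 0 \\
\alpha_2\, \hat{P}(w_i) & \text{otherwise}
\end{cases}
```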

Beyond N-Gram LMs
Discriminative models (n-grams are generative models)
Grammar based
–Syntactic models: use tree models to capture long-distance syntactic effects
–Structural zeros: some n-grams are syntactically forbidden, keep their estimates at zero
Lexical
–Word forms
–Unknown words
Semantic based
–Semantic classes: do statistics at the semantic-class level (e.g., WordNet)
More data (Web)

Summary
Given a problem (topic and subtopic classification, language models): design a GM
Learn parameters from data
But: data sparsity
Need to smooth the parameters