Sequence Models Introduction to Artificial Intelligence COS302 Michael L. Littman Fall 2001

Administration Exams enjoyed Toronto. Letter grades for programs: A (31), B (20), C (4), ? (7). (0 did not imply “incorrect”)

Shannon Game Sue swallowed the large green __. Likely: pepper, frog, pea, pill. Not: idea, beige, running, very.

“AI Complete” Problem My mom told me that playing Monopoly® with toddlers was a bad idea, but I thought it would be ok. I was wrong. Billy chewed on the “Get Out of Jail Free Card”. Todd ran away with the little metal dog. Sue swallowed the large green __.

Language Modeling If we had a way of assigning probabilities to sentences, we could solve this. How? Pr(Sue swallowed the large green cat.) Pr(Sue swallowed the large green odd.) How could such a thing be learned from data?

Why Play This Game? Being able to assign likelihoods to sentences is a useful way of processing language. Speech recognition. Criterion for comparing language models. Techniques useful for other problems.

Statistical Estimation To use statistical estimation: divide data into equivalence classes, then estimate parameters for the different classes.

Conflicting Interests Reliability: lots of data in each class, so a small number of classes. Discrimination: all relevant distinctions made, so a large number of classes.

End Points Unigram model: Pr(w | Sue swallowed the large green ___. ) = Pr(w) Exact match model: Pr(w | Sue swallowed the large green ___. ) = Pr(w | Sue swallowed the large green ___. ) What word would these suggest?

N-grams: Compromise N-grams are simple, powerful. Bigram model: Pr(w | Sue swallowed the large green ___. ) = Pr(w | green ___ ). Trigram model: Pr(w | Sue swallowed the large green ___. ) = Pr(w | large green ___ ). Not perfect: misses “swallowed”. Example guesses: pillow, crystal, caterpillar; iguana, Santa, tigers.
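A minimal sketch in Python of the point above (names are illustrative): an N-gram model conditions only on the last N-1 words of the history, no matter how long the history is.

```python
# Minimal sketch: the N-gram (Markov) approximation truncates the history.
# In practice the resulting conditional would be estimated from corpus counts.

def ngram_context(history, n):
    """Keep only the last n-1 words of the history as the conditioning context."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

history = ["Sue", "swallowed", "the", "large", "green"]
print(ngram_context(history, 2))  # bigram context:  ('green',)
print(ngram_context(history, 3))  # trigram context: ('large', 'green')
```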

Aside: Syntax Can do better with a little bit of knowledge about grammar: Pr(w | Sue swallowed the large green ___. ) = Pr(w | modified by swallowed, the, green ). Example guesses: pill, dye, one, pineapple, dragon, beans, speck, liquid, solution, drink.

Estimating Trigrams Treat sentences independently. Ok? Factors to estimate: Pr(w_1 w_2), Pr(w_j | w_{j-1} w_{j-2}) for each later position j, and Pr(EOS | w_{n-1} w_n). Simple so far.
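A sketch of maximum-likelihood trigram estimation from counts, assuming sentences arrive as lists of tokens; the toy corpus, function names, and the "EOS" marker are illustrative.

```python
from collections import Counter

def trigram_counts(sentences):
    """Collect trigram and context counts, treating sentences independently."""
    tri, ctx, start = Counter(), Counter(), Counter()
    for sent in sentences:
        words = sent + ["EOS"]            # end-of-sentence marker
        start[tuple(words[:2])] += 1      # for the Pr(w1 w2) factor
        for i in range(2, len(words)):
            tri[tuple(words[i-2:i+1])] += 1
            ctx[tuple(words[i-2:i])] += 1
    return tri, ctx, start

def p_mle(w, u, v, tri, ctx):
    """Pr(w | u v), estimated as count(u v w) / count(u v)."""
    return tri[(u, v, w)] / ctx[(u, v)] if ctx[(u, v)] else 0.0

corpus = [["Sue", "swallowed", "the", "large", "green", "pill", "."]]
tri, ctx, start = trigram_counts(corpus)
print(p_mle("green", "the", "large", tri, ctx))   # 1.0 in this toy corpus
```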

Sparsity Pr(w | comes across), estimated from counts in Austen’s works: “as” 8/10, “a” 1/10, “more” 1/10, “the” 0/10. Don’t estimate as zeros! Can use, e.g., Laplace smoothing, or back off to bigram or unigram estimates.
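A sketch of add-one (Laplace) smoothing, assuming the tri/ctx counters from the previous sketch and an illustrative vocabulary size V; a backoff variant would instead fall back to bigram or unigram estimates when the trigram count is zero.

```python
def p_laplace(w, u, v, tri, ctx, V):
    """Pr(w | u v) ~ (count(u v w) + 1) / (count(u v) + V); never exactly zero."""
    return (tri[(u, v, w)] + 1) / (ctx[(u, v)] + V)
```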

Unreliable Words Can’t take much stock in words only seen once (hapax legomena). Change to “UNK”. Generally a small fraction of the tokens and half the types. The boy saw the dog. 5 tokens, 4 types. 5 tokens, 4 types.

Zipf’s Law Frequency is proportional to rank. Thus, extremely long tail!

Word Frequencies in Tom Sawyer

Using Trigrams Hand me the ___ knife now. Candidates: butter, knife.

Counts: me the butter: 88; me the knife: 638; the knife knife: 72; the butter knife: 559; knife knife: 7831; knife knife now: 4; butter knife: 9046; butter knife now: 15.

Markov Model [Diagram: word lattice with transitions Hand → me, me → the, the → butter, the → knife, butter → knife, knife → knife, knife → now.]

General Scheme Pr(w_j = x | w_1 w_2 … EOS) = Pr(w_1 w_2 … x … EOS) / Σ_x' Pr(w_1 w_2 … x' … EOS), where Pr(w_1 w_2 … x … EOS) = Pr(w_1 w_2) … Pr(x | w_{j-1} w_{j-2}) Pr(w_{j+1} | w_{j-1} x) Pr(w_{j+2} | x w_{j+1}) … Pr(EOS | w_{n-1} w_n). Only the three factors containing x depend on the choice of x, so the whole expression is maximized by maximizing Pr(x | w_{j-1} w_{j-2}) Pr(w_{j+1} | w_{j-1} x) Pr(w_{j+2} | x w_{j+1}).
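A sketch of this fill-in-the-blank scheme, reusing the p_mle estimator from the earlier trigram sketch (a smoothed estimator would work the same way); the candidate list and names are illustrative.

```python
def best_fill(prev2, prev1, next1, next2, candidates, tri, ctx):
    """Return the candidate x maximizing the three trigram factors that involve x."""
    def score(x):
        return (p_mle(x, prev2, prev1, tri, ctx)      # Pr(x | w_{j-1} w_{j-2})
                * p_mle(next1, prev1, x, tri, ctx)    # Pr(w_{j+1} | w_{j-1} x)
                * p_mle(next2, x, next1, tri, ctx))   # Pr(w_{j+2} | x w_{j+1})
    return max(candidates, key=score)

# e.g. best_fill("me", "the", "knife", "now", ["butter", "knife"], tri, ctx)
```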

Mutual Information log( Pr(x and y) / (Pr(x) Pr(y)) ). Measures how far two events are from being independent (how much “information” we learn about one from knowing the other).

Mutual Inf. Application A measure of the strength of association between words. Example: is “levied” more strongly associated with “imposed” or “believed”? Since Pr(levied) is the same in both scores, the comparison reduces to comparing Pr(levied | x) = Pr(levied, x) / Pr(x) = count(levied and x) / count(x). “imposed” has the higher score.
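A sketch of pointwise mutual information computed from co-occurrence counts; where the counts come from (corpus, co-occurrence window) is left open, and all names are illustrative.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """log( Pr(x and y) / (Pr(x) Pr(y)) ), with probabilities estimated from counts."""
    if count_xy == 0:
        return float("-inf")
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log(p_xy / (p_x * p_y))
```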

Analogy Idea Find a linking word such that a mutual information score is maximized. Tricky to find the right word. Unclear if any word will have the right effect. traffic flows through the street water flows through the riverbed

What to Learn Reliability/discrimination tradeoff. Definition of N-gram models. How to find the most likely word in an N-gram model. Mutual information.

Homework 7 (due 11/21) 1. Give a maximization scheme for filling in the two blanks in a sentence like “I hate it when ___ goes ___ on me.” Be somewhat rigorous to make the TA’s job easier. 2. more soon