Statistical NLP, Winter 2009. Lecture 2: Language Models. Roger Levy. Many thanks to Dan Klein, Jason Eisner, Joshua Goodman, and Stan Chen.

Language and probability
Birth of computational linguistics: 1950s & machine translation. This was also the birth-time of cognitive science, and an extremely exciting time for mathematics:
Birth of information theory (Shannon 1948)
Birth of formal language theory (Chomsky 1956)
However, the role of probability and information theory was rapidly dismissed from formal linguistics:
"…the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." (Chomsky 1967)
(Also see Miller 1957, or Miller 2003 for a retrospective)

Language and probability (2)
Why was the role of probability in formal accounts of language knowledge dismissed? Chomsky (1957) considered two sentences:
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.
Neither of these sentences has ever appeared before in human discourse, and none of their substrings has either. Hence the probabilities P(W1) and P(W2) cannot be relevant to a native English speaker's knowledge of the difference between these sentences.

Language and probability (3)
But don't be so sure… There's more than one way to color an idea. Language users must be able to generalize beyond their input, and statistical inference is the study of such generalization. Pereira (2000; see also Saul & Pereira, 1997) uses a class-based bigram model to estimate the probabilities of (1) and (2), and the well-formed sentence (1) comes out far more probable than the scrambled (2). Maybe probabilities aren't such a useless model of language knowledge after all.

Language and probability (4)
A class-based model looks something like the formula sketched below. The c_i are unseen (latent) variables that have to be introduced into the model, either through annotation or unsupervised learning; we'll cover these later in the course. For now, we'll focus on the basic problem of the language model: P(W).
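The class-based equation on this slide was an image that did not survive transcription. As a reconstruction, a standard aggregate (class-based) bigram model in the style of Saul & Pereira (1997) has the form below; the slide's original equation may differ in detail:

```latex
\[
P(w_i \mid w_{i-1}) \;=\; \sum_{c_i} P(w_i \mid c_i)\, P(c_i \mid w_{i-1})
\]
```

Each word is generated from a latent class c_i that mediates its dependence on the previous word, so unseen bigrams can still receive probability through shared classes.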

The uses of language models
Once we have a language model, P(W), what can we do with it? Answer: language models are a core component of many applications where the task is to recover an utterance from an input. This is done through Bayes' rule, with P(W) playing the role of the language model (the prior over utterances).

Some practical uses of language models…

The Speech Recognition Problem
We want to predict a sentence given an acoustic sequence. The noisy channel approach: build a generative model of production (encoding); to decode, we use Bayes' rule to rewrite the posterior as the product of an acoustic model and a language model (reconstructed below). Now we have to find the sentence maximizing this product. Why is this progress?
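The decoding equations on this slide were images lost in transcription; the standard noisy-channel decomposition they refer to is:

```latex
\[
\hat{w} \;=\; \arg\max_{w} P(w \mid a)
        \;=\; \arg\max_{w} \frac{P(a \mid w)\,P(w)}{P(a)}
        \;=\; \arg\max_{w} P(a \mid w)\,P(w)
\]
```

Here a is the acoustic sequence, P(a | w) is the acoustic (channel) model, and P(w) is the language model; P(a) can be dropped because it does not depend on w.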

MT System Components
[Diagram: noisy-channel pipeline. A source model P(e) (the language model) generates English e; a channel model P(f | e) (the translation model) produces the observed foreign sentence f; the decoder recovers the best e via argmax_e P(e | f) = argmax_e P(f | e) P(e).]

Computational Psycholinguistics
Language models (structured & otherwise) can help us theorize about the way probabilistic expectations influence sentence comprehension & production (later in the course).
[Diagram: incremental parse of the German fragment "Er hat die Gruppe auf den Berg …", showing competing expectations for the upcoming constituent: Verb? NP? PP-goal? PP-loc? ADVP?]

Other Noisy-Channel Processes: handwriting recognition, OCR, spelling correction & proofreading, translation.

Uses of language models, summary So language models are both practically and theoretically useful! Now, how do we estimate them and use them for inference? We’ll kick off this course by looking at the simplest type of model

Probabilistic Language Models
We want to build models which assign scores to sentences: P(I saw a van) >> P(eyes awe of an). This is not really grammaticality: P(artichokes intimidate zippers) ≈ 0. One option: the empirical distribution over sentences? Problem: it doesn't generalize (at all). Two major components of generalization: backoff (sentences are generated in small steps which can be recombined in other ways) and discounting (allow for the possibility of unseen events).

N-Gram Language Models No loss of generality to break sentence probability down with the chain rule Too many histories! P(??? | No loss of generality to break sentence) ? P(??? | the water is so transparent that) ? N-gram solution: assume each word depends only on a short linear history
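The chain-rule and n-gram equations on this slide were images lost in transcription; the standard forms are:

```latex
\[
P(w_1 \dots w_m) \;=\; \prod_{i=1}^{m} P(w_i \mid w_1 \dots w_{i-1})
                 \;\approx\; \prod_{i=1}^{m} P(w_i \mid w_{i-k+1} \dots w_{i-1})
\]
```

The equality is the exact chain rule; the approximation keeps only the last k-1 words of history (k = 2 gives bigrams, k = 3 trigrams).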

Unigram Models
Simplest case: unigrams. Generative process: pick a word, pick a word, … As a graphical model: independent nodes w_1, w_2, …, w_{n-1}, then STOP. To make this a proper distribution over sentences, we have to generate a special STOP symbol last. (Why? A sketch follows below.)
Examples of sentences sampled from a unigram model:
[fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass.]
[thrift, did, eighty, said, hard, 'm, july, bullish]
[that, or, limited, the]
[]
[after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst, too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, the, companies, which, rivals, an, because, longer, oakes, percent, a, they, three, edward, it, currier, an, within, in, three, wrote, is, you, s., longer, institute, dentistry, pay, however, said, possible, to, rooms, hiding, eggs, approximate, financial, canada, the, so, workers, advancers, half, between, nasdaq]
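A minimal sketch of unigram estimation and generation with an explicit STOP symbol, assuming a pre-tokenized corpus; the function names and the toy corpus are illustrative, not from the lecture:

```python
import random
from collections import Counter

STOP = "</s>"  # special end-of-sentence symbol

def train_unigram(sentences):
    """MLE unigram model: P(w) = count(w) / total tokens (STOP counted once per sentence)."""
    counts = Counter()
    for sent in sentences:
        counts.update(sent)
        counts[STOP] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generate_unigram(model):
    """Sample words independently until STOP is drawn."""
    words, probs = zip(*model.items())
    out = []
    while True:
        w = random.choices(words, weights=probs)[0]
        if w == STOP:
            return out
        out.append(w)

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
model = train_unigram(corpus)
print(generate_unigram(model))
```

Because every sentence ends by drawing STOP, the probabilities of all finite word sequences sum to one, which is why the STOP symbol is needed.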

Bigram Models
Big problem with unigrams: P(the the the the) >> P(I like ice cream)! Solution: condition on the previous word. As a graphical model: a chain START, w_1, w_2, …, w_{n-1}, STOP. Any better? (A sketch follows below.)
Examples of sentences sampled from a bigram model:
[texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
[outside, new, car, parking, lot, of, the, agreement, reached]
[although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty, seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of, american, brands, vying, for, mr., womack, currently, sharedata, incorporated, believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]
[this, would, be, a, record, november]
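A minimal sketch of MLE bigram estimation and generation, again with illustrative names and a toy corpus (not the lecture's code):

```python
import random
from collections import Counter, defaultdict

START, STOP = "<s>", "</s>"

def train_bigram(sentences):
    """MLE bigram model: P(w | prev) = count(prev, w) / count(prev)."""
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = [START] + sent + [STOP]
        for prev, w in zip(padded, padded[1:]):
            counts[prev][w] += 1
    return {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for prev, ctr in counts.items()}

def generate_bigram(model):
    """Walk the Markov chain from START until STOP is generated."""
    out, prev = [], START
    while True:
        dist = model[prev]
        w = random.choices(list(dist), weights=list(dist.values()))[0]
        if w == STOP:
            return out
        out.append(w)
        prev = w

corpus = [["I", "like", "ice", "cream"], ["I", "like", "dogs"]]
model = train_bigram(corpus)
print(generate_bigram(model))
```

Conditioning on the previous word already rules out strings like "the the the the" unless the training data contains the bigram (the, the).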

More N-Gram Examples You can have trigrams, quadrigrams, etc.

Regular Languages? N-gram models are (weighted) regular processes Why can’t we model language like this? Linguists have many arguments why language can’t be merely regular. Long-distance effects: “The computer which I had just put into the machine room on the fifth floor crashed.” Why CAN we often get away with n-gram models? PCFG language models (hopefully later in the course): [This, quarter, ‘s, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices,.] [It, could, be, announced, sometime,.] [Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks,.]

Evaluation What we want to know is: Will our model prefer good sentences to bad ones? That is, does it assign higher probability to “real” or “frequently observed” sentences than “ungrammatical” or “rarely observed” sentences? As a component of Bayesian inference, will it help us discriminate correct utterances from noisy inputs?

Measuring Model Quality
For speech: Word Error Rate (WER), the "right" measure: it is task-error driven, for speech recognition, and for a specific recognizer! For general evaluation, we want a measure which references only good text, not mistake text. (A sketch follows below.)
Correct answer: Andy saw a part of the movie
Recognizer output: And he saw apart of the movie
WER = (insertions + deletions + substitutions) / (length of true sentence); here WER = 4/7 ≈ 57%
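A minimal sketch of WER as a word-level Levenshtein distance over the slide's example; the function name is illustrative:

```python
def word_error_rate(reference, hypothesis):
    """WER = (insertions + deletions + substitutions) / len(reference),
    computed via edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("Andy saw a part of the movie",
                      "And he saw apart of the movie"))  # 4 edits / 7 words ≈ 0.57
```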

Measuring Model Quality
The Shannon Game: how well can we predict the next word? For example:
When I order pizza, I wipe off the ____
Many children are allergic to ____
I saw a ____
with candidate continuations such as grease 0.5, sauce 0.4, dust 0.05, …, mice …, the 1e-100. Unigrams are terrible at this game. (Why?)
Likelihood and the "entropy" measure: really the average cross-entropy of a text (the corpus) according to a model, under the assumption of independence between sentences.

Measuring Model Quality Problem with (cross-)entropy: 0.1 bits of improvement doesn’t sound so good Solution: perplexity Really just the same thing as cross-entropy or likelihood. [Minor technical note: even though our models require a stop step, people typically don’t count it as a symbol when taking these averages.]
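The cross-entropy and perplexity formulas on these two slides were images lost in transcription; the standard definitions are:

```latex
\[
H(W) \;=\; -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1 \dots w_{i-1}),
\qquad
\mathrm{perplexity}(W) \;=\; 2^{H(W)} \;=\; P(w_1 \dots w_N)^{-1/N}
\]
```

A 0.1-bit drop in cross-entropy corresponds to a multiplicative factor of 2^0.1 ≈ 1.07 in perplexity, which is the same information reported on a more intuitive scale.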

Sparsity
Problems with n-gram models: new words appear all the time (e.g. Synaptitute, 132, fuzzificational); new bigrams appear even more often; trigrams or more, still worse! Zipf's Law: types (words) vs. tokens (word occurrences). Broadly, most word types are rare ones. Specifically, if you rank word types by token frequency, frequency is inversely proportional to rank (see the sketch below). This is not special to language: randomly generated character strings have this property (try it!).
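A minimal sketch for checking Zipf's law on any whitespace-tokenized corpus; "corpus.txt" is a placeholder path, not a file from the lecture:

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Rank word types by frequency and report rank * frequency,
    which Zipf's law predicts to be roughly constant."""
    counts = Counter(tokens).most_common(top)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts, start=1)]

tokens = open("corpus.txt").read().lower().split()  # any plain-text corpus
for rank, word, freq, product in zipf_table(tokens):
    print(f"{rank:>4} {word:<15} {freq:>8} {product:>10}")
```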

Smoothing
We often want to make estimates from sparse statistics. Smoothing flattens spiky distributions so they generalize better. Very important all over NLP, but easy to do badly! Illustration with bigrams (h = previous word, could be anything):
Unsmoothed counts, P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total), with no mass left for unseen words (attack, man, outcome, …).
Smoothed counts, P(w | denied the): 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total), so some mass is reserved for unseen words.
[Figure: bar charts of the two distributions over allegations, reports, claims, request, attack, man, outcome, …]

Smoothing
Estimating multinomials: we want to know what words follow some history h. There's some true distribution P(w | h); we saw some small sample of N words from P(w | h), and we want to reconstruct a useful approximation of P(w | h). Counts of events we didn't see are always too low (0 < N P(w | h)), and counts of events we did see are, in aggregate, too high.
Example:
P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 speculation, …, 1 request (13 total)
P(w | affirmed the): 1 award
Two issues:
Discounting: how to reserve mass for what we haven't seen
Interpolation: how to allocate that mass amongst unseen events

Smoothing: Add-δ (for bigram models)
One class of smoothing functions (discounting): add-one / add-δ. If you know Bayesian statistics, this is equivalent to assuming a uniform prior. Another (better?) alternative: assume a unigram prior. How would we estimate the unigram model? (See the sketch below.)
Notation:
c = number of word tokens in training data
c(w) = count of word w in training data
c(w_{-1}, w) = joint count of the w_{-1}, w bigram
V = total vocabulary size (assumed known)
N_k = number of word types with count k
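The add-δ formulas on this slide were images lost in transcription. The standard estimators are P(w | w_{-1}) = (c(w_{-1}, w) + δ) / (c(w_{-1}) + δV) for add-δ (δ = 1 is add-one), and P(w | w_{-1}) = (c(w_{-1}, w) + δ P̂(w)) / (c(w_{-1}) + δ) for the unigram-prior variant, where P̂(w) is the MLE unigram estimate. A minimal sketch, with illustrative names and a toy corpus:

```python
from collections import Counter, defaultdict

def train_counts(sentences):
    """Collect unigram and bigram counts (no sentence padding, for brevity)."""
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for sent in sentences:
        unigrams.update(sent)
        for prev, w in zip(sent, sent[1:]):
            bigrams[prev][w] += 1
    return unigrams, bigrams

def p_add_delta(w, prev, bigrams, V, delta=1.0):
    """Add-delta: P(w | prev) = (c(prev, w) + delta) / (c(prev) + delta * V).
    delta = 1 gives add-one (Laplace), i.e. a uniform prior over the V word types."""
    return (bigrams[prev][w] + delta) / (sum(bigrams[prev].values()) + delta * V)

def p_unigram_prior(w, prev, unigrams, bigrams, delta=1.0):
    """Unigram-prior smoothing: replace the uniform 1/V with the MLE unigram P(w):
    P(w | prev) = (c(prev, w) + delta * P_uni(w)) / (c(prev) + delta)."""
    p_uni = unigrams[w] / sum(unigrams.values())
    return (bigrams[prev][w] + delta * p_uni) / (sum(bigrams[prev].values()) + delta)

sents = [["the", "dog", "barks"], ["the", "dog", "sleeps"], ["a", "cat", "sleeps"]]
uni, bi = train_counts(sents)
V = len(uni)
print(p_add_delta("barks", "dog", bi, V), p_unigram_prior("barks", "dog", uni, bi))
```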

Linear Interpolation
One way to ease the sparsity problem for n-grams is to mix in the less sparse (n-1)-gram estimates: general linear interpolation. Having a single global mixing constant is generally not ideal; a better yet still simple alternative is to vary the mixing constant as a function of the conditioning context (see the sketch below).
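A minimal sketch of simple linear interpolation between bigram and unigram estimates with a single global mixing constant λ; the names and toy counts are illustrative. The context-dependent variant mentioned on the slide would replace the constant lam with a function of the conditioning word:

```python
from collections import Counter, defaultdict

def p_interpolated(w, prev, unigrams, bigrams, lam=0.7):
    """P(w | prev) = lam * P_MLE(w | prev) + (1 - lam) * P_MLE(w).
    A single global lam; a better variant makes lam depend on the context prev
    (e.g., trust the bigram more when prev is frequent and well estimated)."""
    prev_total = sum(bigrams[prev].values())
    p_bigram = bigrams[prev][w] / prev_total if prev_total else 0.0
    p_unigram = unigrams[w] / sum(unigrams.values())
    return lam * p_bigram + (1 - lam) * p_unigram

# Toy counts (same shape as in the smoothing sketch above).
unigrams = Counter({"the": 2, "dog": 2, "barks": 1, "sleeps": 2, "a": 1, "cat": 1})
bigrams = defaultdict(Counter, {"dog": Counter({"barks": 1, "sleeps": 1})})
print(p_interpolated("barks", "dog", unigrams, bigrams))
# For unseen bigrams the bigram term is 0, so the estimate backs off toward the unigram.
```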