
Statistical NLP Spring 2010 Lecture 2: Language Models Dan Klein – UC Berkeley

Speech in a Slide. Frequency gives pitch; amplitude gives volume. Frequencies at each time slice are processed into observation vectors. [Figure: waveform and spectrogram of “s p ee ch l a b”, with the resulting sequence of acoustic feature vectors a12, a13, a14, …]

The Noisy-Channel Model We want to predict a sentence given acoustics: The noisy channel approach: Acoustic model: HMMs over word positions with mixtures of Gaussians as emissions Language model: Distributions over sequences of words (sentences)

Acoustically Scored Hypotheses (n-best list; the log-probability scores did not survive the transcript): the station signs are in deep in english; the stations signs are in deep in english; the station signs are in deep into english; the station 's signs are in deep in english; the station signs are in deep in the english; the station signs are indeed in english; the station 's signs are indeed in english; the station signs are indians in english; the station signs are indian in english; the stations signs are indians in english; the stations signs are indians and english

ASR System Components. Noisy channel: a source generates a word sequence w with probability P(w); the channel produces the observed acoustics a with probability P(a|w); the decoder finds the best w: argmax_w P(w|a) = argmax_w P(a|w) P(w). Language Model: P(w); Acoustic Model: P(a|w).

Translation: Codebreaking?  “Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ ”  Warren Weaver (1955:18, quoting a letter he wrote in 1947)

MT Overview

MT System Components. Noisy channel: a source generates an English sentence e with probability P(e); the channel produces the observed foreign sentence f with probability P(f|e); the decoder finds the best e: argmax_e P(e|f) = argmax_e P(f|e) P(e). Language Model: P(e); Translation Model: P(f|e).

Other Noisy-Channel Processes  Spelling Correction  Handwriting recognition  OCR  More…

Probabilistic Language Models  Goal: Assign useful probabilities P(x) to sentences x  Input: many observations of training sentences x  Output: system capable of computing P(x)  Probabilities should broadly indicate plausibility of sentences  P(I saw a van) >> P(eyes awe of an)  Not grammaticality: P(artichokes intimidate zippers) ≈ 0  In principle, “plausible” depends on the domain, context, speaker…  One option: empirical distribution over training sentences?  Problem: doesn’t generalize (at all)  Two aspects of generalization  Decomposition: break sentences into small pieces which can be recombined in new ways (conditional independence)  Smoothing: allow for the possibility of unseen pieces

N-Gram Model Decomposition  Chain rule: break sentence probability down  Impractical to condition on everything before  P(??? | Turn to page 134 and look at the picture of the) ?  N-gram models: assume each word depends only on a short linear history  Example:
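The chain-rule and n-gram formulas on this slide were images and did not survive the transcript; the standard forms they refer to are:

```latex
% chain rule (exact)
P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})

% n-gram approximation: condition only on a short linear history
P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-k+1} \dots w_{i-1})

% e.g. the bigram case
P(w_1 \dots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```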

N-Gram Model Parameters  The parameters of an n-gram model: the actual conditional probability estimates, we’ll call them θ  Obvious estimate: relative frequency (maximum likelihood) estimate  General approach  Take a training set X and a test set X’  Compute an estimate θ from X  Use it to assign probabilities to other sentences, such as those in X’  Training counts for “the *”: the first, the same, the following, the world, …, the door (the numeric counts did not survive the transcript)
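A minimal sketch of the relative-frequency (maximum likelihood) estimate described above; illustrative code, not the course's, and the sentence-boundary markers are my own choice:

```python
from collections import Counter

def bigram_mle(sentences):
    """Maximum-likelihood bigram estimates P(w | prev) = c(prev, w) / c(prev)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]          # assumed boundary markers
        unigrams.update(words[:-1])                # context counts c(prev)
        bigrams.update(zip(words[:-1], words[1:])) # pair counts c(prev, w)
    return {(prev, w): c / unigrams[prev] for (prev, w), c in bigrams.items()}

# e.g. P("door" | "the") from a toy training set
probs = bigram_mle([["please", "close", "the", "door"],
                    ["the", "door", "is", "open"]])
print(probs[("the", "door")])   # 1.0 -- every "the" is followed by "door" here
```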

Higher Order N-grams?  Trigram context “please close the *”: 3380 please close the door, 1601 please close the window, 1164 please close the new, 1159 please close the gate, 900 please close the browser, …  Compare the shorter contexts “the *” (the first, the same, the following, the world, …, the door) and “close the *” (close the window, close the door, close the gap, close the thread, close the deal); their counts did not survive the transcript  Please close the door vs. Please close the first window on the left

Unigram Models  Simplest case: unigrams, P(w1 … wn) = ∏i P(wi)  Generative process: pick a word, pick a word, … until you pick STOP  As a graphical model: independent draws w1, w2, …, wn-1, STOP  Examples:  [fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass.]  [thrift, did, eighty, said, hard, 'm, july, bullish]  [that, or, limited, the]  []  [after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst, too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, the, companies, which, rivals, an, because, longer, oakes, percent, a, they, three, edward, it, currier, an, within, in, three, wrote, is, you, s., longer, institute, dentistry, pay, however, said, possible, to, rooms, hiding, eggs, approximate, financial, canada, the, so, workers, advancers, half, between, nasdaq]
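A toy sketch of the unigram generative process (pick words independently until STOP); the distribution here is invented for illustration, not estimated from the lecture's corpus:

```python
import random

# Toy unigram distribution (illustrative numbers only).
unigram = {"the": 0.30, "of": 0.15, "futures": 0.05, "inflation": 0.05,
           "quarter": 0.05, "STOP": 0.40}

def sample_unigram_sentence(dist):
    """Draw words i.i.d. from the unigram distribution until STOP is drawn."""
    words = []
    while True:
        w = random.choices(list(dist), weights=dist.values())[0]
        if w == "STOP":
            return words
        words.append(w)

print(sample_unigram_sentence(unigram))  # e.g. ['the', 'of', 'the']
```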

Bigram Models  Big problem with unigrams: P(the the the the) >> P(I like ice cream)!  Condition on previous single word: P(w1 … wn) = ∏i P(wi | wi-1)  Obvious that this should help – in probabilistic terms, we’re using weaker conditional independence assumptions (what’s the cost?)  Any better?  [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]  [outside, new, car, parking, lot, of, the, agreement, reached]  [although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty, seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of, american, brands, vying, for, mr., womack, currently, sharedata, incorporated, believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]  [this, would, be, a, record, november]  As a graphical model: START, w1, w2, …, wn-1, STOP, each word conditioned on the previous one

Regular Languages?  N-gram models are (weighted) regular languages  Many linguistic arguments that language isn’t regular.  Long-distance effects: “The computer which I had just put into the machine room on the fifth floor ___.” (crashed)  Recursive structure  Why CAN we often get away with n-gram models?  PCFG LM (later):  [This, quarter, ‘s, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices,.]  [It, could, be, announced, sometime,.]  [Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks,.]

More N-Gram Examples

Measuring Model Quality  The game isn’t to pound out fake sentences!  Obviously, generated sentences get “better” as we increase the model order  More precisely: using ML estimators, higher order always gives better likelihood on train, but not test  What we really want to know is:  Will our model prefer good sentences to bad ones?  Bad ≠ ungrammatical!  Bad ≈ unlikely  Bad = sentences that our acoustic model really likes but aren’t the correct answer

Measuring Model Quality  The Shannon Game:  How well can we predict the next word?  When I eat pizza, I wipe off the ____  Many children are allergic to ____  I saw a ____  Unigrams are terrible at this game. (Why?)  Example predictive distribution: grease 0.5, sauce 0.4, dust 0.05, …, with tiny probabilities for words like mice and the; derived from counts like 3516 wipe off the excess, 1034 wipe off the dust, 547 wipe off the sweat, 518 wipe off the mouthpiece, …, 120 wipe off the grease, 0 wipe off the sauce, 0 wipe off the mice  “Entropy”: per-word test log likelihood (misnamed)

Measuring Model Quality  Problem with “entropy”:  0.1 bits of improvement doesn’t sound so good  “Solution”: perplexity  Interpretation: average branching factor in model  Important notes:  It’s easy to get bogus perplexities by having bogus probabilities that sum to more than one over their event spaces. 30% of you will do this on HW1.  Even though our models require a stop step, averages are per actual word, not per derivation step.
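The perplexity formula on this slide was an image; the standard definitions it refers to are:

```latex
% per-word average negative log-likelihood on test data w_1 ... w_N
H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1 \dots w_{i-1})

% perplexity: the "average branching factor" interpretation
\mathrm{perplexity} = 2^{H} = P(w_1 \dots w_N)^{-1/N}
```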

Measuring Model Quality  Word Error Rate (WER) = (insertions + deletions + substitutions) / length of true sentence  Example: correct answer: Andy saw a part of the movie; recognizer output: And he saw apart of the movie; WER: 4/7 ≈ 57%  The “right” measure:  Task error driven  For speech recognition  For a specific recognizer!  Common issue: intrinsic measures like perplexity are easier to use, but extrinsic ones are more credible
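A small sketch of the WER computation as word-level edit distance, matching the 4/7 example above (illustrative code, not tied to any particular recognizer):

```python
def wer(reference, hypothesis):
    """Word error rate: (insertions + deletions + substitutions) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("Andy saw a part of the movie",
          "And he saw apart of the movie"))  # 4/7 ≈ 0.57
```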

Sparsity  Problems with n-gram models:  New words appear all the time: Synaptitute, 132,…, multidisciplinarization  New bigrams: even more often  Trigrams or more – still worse!  Zipf’s Law  Types (words) vs. tokens (word occurrences)  Broadly: most word types are rare ones  Specifically:  Rank word types by token frequency  Frequency inversely proportional to rank  Not special to language: randomly generated character strings have this property (try it! See the sketch below)
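A quick way to "try it", as the slide suggests: sample a random character stream and look at the rank-frequency pattern of its space-separated "words" (the alphabet and stream length are arbitrary choices of mine):

```python
import random
from collections import Counter

# Random character stream over a small alphabet including space; treat
# space-separated chunks as "words" and inspect rank vs. frequency.
random.seed(0)
chars = random.choices("abcd ", k=200_000)
words = "".join(chars).split()
counts = Counter(words).most_common()

for rank in (1, 2, 5, 10, 50, 100):
    word, freq = counts[rank - 1]
    print(rank, word, freq)
# Frequency falls off roughly as a power of the rank -- the Zipf-like
# pattern the slide mentions, even though the "text" is random characters.
```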

Parameter Estimation  Maximum likelihood estimates won’t get us very far  Need to smooth these estimates  General method (procedurally)  Take your empirical counts  Modify them in various ways to improve estimates  General method (mathematically)  Often can give estimators a formal statistical interpretation  … but not always  Approaches that are mathematically obvious aren’t always what works  Counts for “wipe off the *”: 3516 excess, 1034 dust, 547 sweat, 518 mouthpiece, …, 120 grease, 0 sauce, 0 mice

Smoothing  We often want to make estimates from sparse statistics:  P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total)  Smoothing flattens spiky distributions so they generalize better:  P(w | denied the): 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)  Very important all over NLP, but easy to do badly!  We’ll illustrate with bigrams today (h = previous word, could be anything).

Puzzle: Unknown Words  Imagine we look at 1M words of text  We’ll see many thousands of word types  Some will be frequent, others rare  Could turn into an empirical P(w)  Questions:  What fraction of the next 1M will be new words?  How many total word types exist?

Language Models  In general, we want to place a distribution over sentences  Basic / classic solution: n-gram models  Question: how to estimate conditional probabilities?  Problems:  Known words in unseen contexts  Entirely unknown words  Many systems ignore this – why?  Often just lump all new words into a single UNK type

Smoothing: Add-One, Etc.  Classic solution: add counts (Laplace smoothing / Dirichlet prior)  Add-one smoothing especially often talked about  For a bigram distribution, can add counts shaped like the unigram:  Can consider hierarchical formulations: trigram is recursively centered on smoothed bigram estimate, etc [MacKay and Peto, 94]  Can be derived from Dirichlet / multinomial conjugacy: prior shape shows up as pseudo-counts  Problem: works quite poorly!
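The smoothing formulas on this slide were images; their standard forms, with V the vocabulary size, k the prior strength, and the empirical unigram distribution written as \hat{P}(w), are:

```latex
% add-one (Laplace) for a bigram distribution
P_{\text{add-}1}(w \mid w_{-1}) = \frac{c(w_{-1}, w) + 1}{c(w_{-1}) + V}

% adding counts shaped like the unigram distribution, with strength k
P(w \mid w_{-1}) = \frac{c(w_{-1}, w) + k\,\hat{P}(w)}{c(w_{-1}) + k}
```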

Linear Interpolation  Problem: the trigram estimate P(w | w-2, w-1) is supported by few counts  Classic solution: mixtures of related, denser histories, e.g.:  The mixture approach tends to work better than the Dirichlet prior approach for several reasons  Can flexibly include multiple back-off contexts, not just a chain  Often multiple weights, depending on bucketed counts  Good ways of learning the mixture weights with EM (later)  Not entirely clear why it works so much better  All the details you could ever want: [Chen and Goodman, 98]
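The interpolation formula referred to above, in its standard trigram form:

```latex
% mixture of trigram, bigram, and unigram estimates; \lambda_i \ge 0, \sum_i \lambda_i = 1
P_{\text{interp}}(w \mid w_{-2}, w_{-1}) =
  \lambda_3\,\hat{P}(w \mid w_{-2}, w_{-1}) +
  \lambda_2\,\hat{P}(w \mid w_{-1}) +
  \lambda_1\,\hat{P}(w)
```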

Held-Out Data  Important tool for optimizing how models generalize:  Set a small number of hyperparameters that control the degree of smoothing by maximizing the (log-)likelihood of held-out data  Can use any optimization technique (line search or EM usually easiest)  Examples:  [Figure: corpus split into Training Data / Held-Out Data / Test Data; plot of held-out likelihood L as a function of the smoothing hyperparameter k]
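A sketch of the held-out tuning loop described above, using a simple line search over one interpolation weight; the helper names and data layout are my own assumptions, not the course's code:

```python
import math

def heldout_loglik(lam, bigram_mle, unigram_mle, heldout_bigrams):
    """Held-out log-likelihood of an interpolated bigram model with weight lam."""
    total = 0.0
    for (prev, w), count in heldout_bigrams.items():
        p = lam * bigram_mle.get((prev, w), 0.0) + (1 - lam) * unigram_mle.get(w, 0.0)
        total += count * math.log(p) if p > 0 else float("-inf")
    return total

def tune_lambda(bigram_mle, unigram_mle, heldout_bigrams, grid=None):
    """Pick the mixture weight that maximizes held-out likelihood (line search)."""
    grid = grid or [i / 20 for i in range(1, 20)]   # 0.05 ... 0.95
    return max(grid, key=lambda lam: heldout_loglik(lam, bigram_mle,
                                                    unigram_mle, heldout_bigrams))
```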

Held-Out Reweighting  What’s wrong with add-d smoothing?  Let’s look at some real bigram counts [Church and Gale 91]:  [Table: for bigrams with low training counts in 22M words, the actual average count in the next 22M words compared with the add-one and add-ε estimates; the numeric entries did not survive the transcript]  Big things to notice:  Add-one vastly overestimates the fraction of new bigrams (mass on new: 9.2% actual vs. ~100% under add-one; add-ε matches the 9.2%)  Add-ε (a tiny added count) vastly underestimates the ratio 2*/1* (it comes out at ~2)  One solution: use held-out data to predict the map of c to c*

Good-Turing Reweighting I  We’d like to not need held-out data (why?)  Idea: leave-one-out validation  Nk: number of types which occur k times in the entire corpus  Take each of the c tokens out of corpus in turn  c “training” sets of size c-1, “held-out” sets of size 1  How many held-out tokens are unseen in training? N1  How many held-out tokens are seen k times in training? (k+1)Nk+1  There are Nk words with training count k  Each should occur with expected count (k+1)Nk+1/Nk  Each should occur with probability (k+1)Nk+1/(cNk)  [Figure: “Training” vs. “Held-Out” columns pairing N1, 2N2, 3N3, … with the corresponding leave-one-out fractions N0/N, N1/N, N2/N, …]
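A sketch of the Good-Turing adjusted counts k* = (k+1)N(k+1)/Nk, computed directly from count-of-counts (no fitting of the Nk values, so it is only sensible for small k):

```python
from collections import Counter

def good_turing_counts(token_counts, max_k=10):
    """Adjusted counts k* = (k+1) * N_{k+1} / N_k from a word->count dictionary."""
    count_of_counts = Counter(token_counts.values())     # N_k
    adjusted = {}
    for k in range(1, max_k + 1):
        if count_of_counts[k] > 0:
            adjusted[k] = (k + 1) * count_of_counts[k + 1] / count_of_counts[k]
    return adjusted

counts = Counter("to be or not to be that is the question".split())
print(good_turing_counts(counts))
# -> {1: 0.67, 2: 0.0}; the zero is the "jumpy N_k" problem the next slide
# addresses, and N_1 / c of the probability mass is reserved for unseen events.
```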

Good-Turing Reweighting II  Problem: what about “the”? (say k=4417)  For small k, Nk > Nk+1  For large k, too jumpy, zeros wreck estimates  Simple Good-Turing [Gale and Sampson]: replace empirical Nk with a best-fit power law once count counts get unreliable  [Figure: the count-of-counts histogram N1, N2, N3, …, N4417 and the fitted curve used in place of the empirical values]

Good-Turing Reweighting III  Hypothesis: counts of k should be k* = (k+1)Nk+1/Nk  Katz Smoothing  Use GT discounted bigram counts (roughly – Katz left large counts alone)  Whatever mass is left goes to empirical unigram  [Table: count in 22M words vs. actual c* (next 22M) vs. GT’s c*; the numeric rows did not survive the transcript, apart from the 9.2% mass on new events]

Kneser-Ney: Discounting  Kneser-Ney smoothing: very successful estimator using two ideas  Idea 1: observed n-grams occur more in training than they will later:  [Table: count in 22M words vs. future c* in the next 22M; numeric entries lost in the transcript]  Absolute Discounting  No need to actually have held-out data; just subtract 0.75 (or some d)  Maybe have a separate value of d for very low counts

Kneser-Ney: Continuation  Idea 2: Type-based fertility  Shannon game: There was an unexpected ____?  delay?  Francisco?  “Francisco” is more common than “delay”  … but “Francisco” always follows “San”  … so it’s less “fertile”  Solution: type-continuation probabilities  In the back-off model, we don’t want the probability of w as a unigram  Instead, want the probability that w is allowed in a novel context  For each word, count the number of bigram types it completes

Kneser-Ney  Kneser-Ney smoothing combines these two ideas  Absolute discounting  Lower order continuation probabilities  KN smoothing repeatedly proven effective  [Teh, 2006] shows KN smoothing is a kind of approximate inference in a hierarchical Pitman-Yor process (and better approximations are superior to basic KN)

What Actually Works?  Trigrams and beyond:  Unigrams, bigrams generally useless  Trigrams much better (when there’s enough data)  4-, 5-grams really useful in MT, but not so much for speech  Discounting  Absolute discounting, Good-Turing, held-out estimation, Witten-Bell  Context counting  Kneser-Ney construction of lower-order models  See [Chen+Goodman] reading for tons of graphs! [Graphs from Joshua Goodman]

Data >> Method?  Having more data is better…  … but so is using a better estimator  Another issue: N > 3 has huge costs in speech recognizers

Tons of Data?

Beyond N-Gram LMs  Lots of ideas we won’t have time to discuss:  Caching models: recent words more likely to appear again  Trigger models: recent words trigger other words  Topic models  A few recent ideas  Syntactic models: use tree models to capture long-distance syntactic effects [Chelba and Jelinek, 98]  Discriminative models: set n-gram weights to improve final task accuracy rather than fit training set density [Roark, 05, for ASR; Liang et al., 06, for MT]  Structural zeros: some n-grams are syntactically forbidden, keep estimates at zero [Mohri and Roark, 06]  Bayesian document and IR models [Daume 06]

Overview  So far: language models give P(s)  Help model fluency for various noisy-channel processes (MT, ASR, etc.)  N-gram models don’t represent any deep variables involved in language structure or meaning  Usually we want to know something about the input other than how likely it is (syntax, semantics, topic, etc)  Next: Naïve-Bayes models  We introduce a single new global variable  Still a very simplistic model family  Lets us model hidden properties of text, but only very non-local ones…  In particular, we can only model properties which are largely invariant to word order (like topic)

Language Model Samples  Unigram:  [fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, quarter]  [that, or, limited, the]  []  [after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst, too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, ……]  Bigram:  [outside, new, car, parking, lot, of, the, agreement, reached]  [although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty, seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of, american, brands, vying, for, mr., womack, currently, share, data, incorporated, believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]  [this, would, be, a, record, november]  PCFG (later):  [This, quarter, ‘s, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices,.]  [It, could, be, announced, sometime,.]  [Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks,.]

Smoothing Dealing with sparsity well: smoothing / shrinkage For most histories P(w | h), relatively few observations Very intricately explored for the speech n-gram case Easy to do badly P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total), smoothed to 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)

Priors on Parameters Most obvious formal solution: use MAP estimate instead of ML estimate for a multinomial P(X) Maximum likelihood estimate: max P(X | θ) MAP estimate: max P(θ | X) Dirichlet priors are a convenient choice Specified by a center θ′ and strength k, Dir(θ′, k) or Dir(kθ′) Mean is center, higher strength means lower variance MAP estimate is then given by the pseudo-count formula reconstructed below
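The Dirichlet/multinomial results the slide appeals to, in their standard form (the final formula was an image in the original):

```latex
% prior Dir(k\theta') over the multinomial \theta; data X with counts c(w), \sum_w c(w) = N
P(\theta \mid X) \propto P(X \mid \theta)\,P(\theta)
  \;\Rightarrow\; \theta \mid X \sim \mathrm{Dir}(k\theta' + c)

% posterior-mean (pseudo-count) estimate; the MAP mode differs only by -1 terms
\mathbb{E}[\theta_w \mid X] = \frac{c(w) + k\,\theta'_w}{N + k}
```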

Smoothing: Add-One, Etc.  With a uniform prior, get estimates of the form  Add-one smoothing especially often talked about  For a bigram distribution, can use a prior centered on the empirical unigram:  Can consider hierarchical formulations in which trigram is centered on smoothed bigram estimate, etc [MacKay and Peto, 94]  Basic idea of conjugacy is convenient: prior shape shows up as pseudo-counts  Problem: works quite poorly!

Linear Interpolation Problem: the trigram estimate P(w | w-2, w-1) is supported by few counts Classic solution: mixtures of related, denser histories, e.g.: The mixture approach tends to work better than the Dirichlet prior approach for several reasons Can flexibly include multiple back-off contexts, not just a chain Good ways of learning the mixture weights with EM (later) Not entirely clear why it works so much better All the details you could ever want: [Chen and Goodman, 98]

Held-Out Data  Important tool for calibrating how models generalize:  Set a small number of hyperparameters that control the degree of smoothing by maximizing the (log-)likelihood of held-out data  Can use any optimization technique (line search or EM usually easiest)  Examples:  [Figure: corpus split into Training Data / Held-Out Data / Test Data; plot of held-out likelihood L as a function of the smoothing hyperparameter k]

Held-Out Reweighting  What’s wrong with unigram-prior smoothing?  Let’s look at some real bigram counts [Church and Gale 91]:  [Table: for bigrams with low training counts in 22M words, the actual average count in the next 22M words compared with the add-one and add-ε estimates; the numeric entries did not survive the transcript]  Big things to notice:  Add-one vastly overestimates the fraction of new bigrams (mass on new: 9.2% actual vs. ~100% under add-one; add-ε matches the 9.2%)  Add-ε (a tiny added count) vastly underestimates the ratio 2*/1* (it comes out at ~2)  One solution: use held-out data to predict the map of c to c*

Kneser-Ney: Discounting. Kneser-Ney smoothing: very successful but slightly ad hoc estimator. Idea: observed n-grams occur more in training than they will later: [Table: count in 22M words vs. Good-Turing c* vs. average count in the next 22M; numeric entries lost in the transcript]. Absolute Discounting: save ourselves some time and just subtract 0.75 (or some d); maybe have a separate value of d for very low counts.

Kneser-Ney: Continuation  Something’s been very broken all this time  Shannon game: There was an unexpected ____?  delay?  Francisco?  “Francisco” is more common than “delay”  … but “Francisco” always follows “San”  Solution: Kneser-Ney smoothing  In the back-off model, we don’t want the probability of w as a unigram  Instead, want the probability that w is allowed in this novel context  For each word, count the number of bigram types it completes

Kneser-Ney Kneser-Ney smoothing combines these two ideas Absolute discounting Lower order models take a special form KN smoothing repeatedly proven effective But we’ve never been quite sure why And therefore never known how to make it better [Teh, 2006] shows KN smoothing is a kind of approximate inference in a hierarchical Pitman-Yor process (and better approximations are superior to basic KN)

What Actually Works?  Trigrams:  Unigrams, bigrams too little context  Trigrams much better (when there’s enough data)  4-, 5-grams often not worth the cost (which is more than it seems, due to how speech recognizers are constructed)  Note: for MT, 5+ often used!  Good-Turing-like methods for count adjustment  Absolute discounting, Good-Turing, held-out estimation, Witten-Bell  Kneser-Ney equalization for lower-order models  See [Chen+Goodman] reading for tons of graphs! [Graphs from Joshua Goodman]

Data >> Method? Having more data is better… … but so is using a better model Another issue: N > 3 has huge costs in speech recognizers

Beyond N-Gram LMs Lots of ideas we won’t have time to discuss: Caching models: recent words more likely to appear again Trigger models: recent words trigger other words Topic models A few recent ideas Syntactic models: use tree models to capture long-distance syntactic effects [Chelba and Jelinek, 98] Discriminative models: set n-gram weights to improve final task accuracy rather than fit training set density [Roark, 05, for ASR; Liang et al., 06, for MT] Structural zeros: some n-grams are syntactically forbidden, keep estimates at zero [Mohri and Roark, 06] Bayesian document and IR models [Daume 06]

Smoothing  Estimating multinomials  We want to know what words follow some history h  There’s some true distribution P(w | h)  We saw some small sample of N words from P(w | h)  We want to reconstruct a useful approximation of P(w | h)  Counts of events we didn’t see are always too low (0 < N P(w | h))  Counts of events we did see are in aggregate to high  Example:  Two issues:  Discounting: how to reserve mass what we haven’t seen  Interpolation: how to allocate that mass amongst unseen events P(w | denied the) 3 allegations 2 reports 1 claims 1 speculation … 1 request 13 total P(w | confirmed the) 1 report

Add-One Estimation  Idea: pretend we saw every word once more than we actually did [Laplace]  Corresponds to a uniform prior over vocabulary  Think of it as taking items with observed count r > 1 and treating them as having count r* < r  Holds out V/(N+V) for “fake” events  N1+/N of which is distributed back to seen words  N0/(N+V) actually passed on to unseen words (nearly all!)  Actually tells us not only how much to hold out, but where to put it  Works astonishingly poorly in practice  Quick fix: add some small ε instead of 1 [Lidstone, Jeffreys]  Slightly better, holds out less mass, still a bad idea

How Much Mass to Withhold?  Remember the key discounting problem:  What count r* should we use for an event that occurred r times in N samples?  r is too big  Idea: held-out data [Jelinek and Mercer]  Get another N samples  See what the average count of items occurring r times is (e.g. doubletons on average might occur 1.78 times)  Use those averages as r*  Much better than add-one, etc.
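A sketch of the held-out estimation recipe above: for items seen r times in the first sample, average their counts in a second sample and use that average as r* (the names and data layout are illustrative assumptions):

```python
from collections import Counter, defaultdict

def heldout_rstar(train_tokens, heldout_tokens):
    """For items seen r times in training, return their average count in held-out data."""
    train_counts = Counter(train_tokens)
    heldout_counts = Counter(heldout_tokens)
    buckets = defaultdict(list)
    for item, r in train_counts.items():
        buckets[r].append(heldout_counts[item])
    return {r: sum(cs) / len(cs) for r, cs in sorted(buckets.items())}

# e.g. doubletons (r = 2) in training might show up ~1.78 times on average
# in the held-out sample; that average is the adjusted count r*.
```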

Speech in a Slide (or Three)  Speech input is an acoustic wave form  [Figure: waveform of “s p ee ch l a b”, with a zoomed view of the “l” to “a” transition]  Graphs from Simon Arnfield’s web tutorial on speech, Sheffield; some later bits from Joshua Goodman’s LM tutorial

Spectral Analysis  Frequency gives pitch; amplitude gives volume  Sampling at ~8 kHz phone, ~16 kHz mic (kHz = 1000 cycles/sec)  Fourier transform of wave displayed as a spectrogram  Darkness indicates energy at each frequency  [Figure: amplitude and frequency (spectrogram) plots for “s p ee ch l a b”]

Acoustic Feature Sequence  Time slices are translated into acoustic feature vectors (~15 real numbers per slice, details later in the term)  Now we have to figure out a mapping from sequences of acoustic observations to words.  [Figure: spectrogram sliced into frames, each mapped to a feature vector a12, a13, a14, …]

The Speech Recognition Problem  We want to predict a sentence given an acoustic sequence:  The noisy channel approach:  Build a generative model of production (encoding)  To decode, we use Bayes’ rule to write  Now, we have to find a sentence maximizing this product  Why is this progress?
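The decoding objective described above (the Bayes-rule formulas were images), consistent with the ASR System Components slide earlier:

```latex
w^* = \arg\max_{w} P(w \mid a)
    = \arg\max_{w} \frac{P(a \mid w)\,P(w)}{P(a)}
    = \arg\max_{w} P(a \mid w)\,P(w)
```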

Beyond N-Gram LMs  Caching Models  Recent words more likely to appear again  Can be disastrous in practice for speech (why?)  Skipping Models  Clustering Models: condition on word classes when words are too sparse  Trigger Models: condition on bag of history words (e.g., maxent)  Structured Models: use parse structure (we’ll see these later)