Introduction to Natural Language Processing (600.465): Language Modeling (and the Noisy Channel). AI-lab, 2003.10.08.


1 Introduction to Natural Language Processing (600.465)
Language Modeling (and the Noisy Channel)
AI-lab, 2003.10.08

2 The Noisy Channel
Prototypical case:
  Input → The channel (adds noise) → Output (noisy)
  0,1,1,1,0,1,0,1,...  →  0,1,1,0,0,1,1,0,...
Model: probability of error (noise), e.g.:
  p(0|1) = .3   p(1|1) = .7   p(1|0) = .4   p(0|0) = .6
The task: known: the noisy output; want to know: the input (decoding)
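
A minimal sketch of decoding a single bit with these channel probabilities. The uniform prior over the input bit is an assumption for illustration; the slide does not specify one.

# Decode one observed (noisy) bit by Bayes' rule: pick the input bit
# maximizing p(observed | sent) * p(sent).
channel = {          # p(observed | sent), values from the slide
    ('0', '1'): 0.3, ('1', '1'): 0.7,
    ('1', '0'): 0.4, ('0', '0'): 0.6,
}
p_in = {'0': 0.5, '1': 0.5}   # assumed uniform prior over the input bit

def decode_bit(observed: str) -> str:
    """Return the most probable input bit given one observed (noisy) bit."""
    scores = {sent: channel[(observed, sent)] * p_in[sent] for sent in ('0', '1')}
    return max(scores, key=scores.get)

print(decode_bit('0'))   # -> '0'  (0.6*0.5 = 0.30 beats 0.3*0.5 = 0.15)
print(decode_bit('1'))   # -> '1'  (0.7*0.5 = 0.35 beats 0.4*0.5 = 0.20)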

3 Noisy Channel Applications
OCR
  – straightforward: text → print (adds noise), scan → image
Handwriting recognition
  – text → neurons, muscles ("noise"), scan/digitize → image
Speech recognition (dictation, commands, etc.)
  – text → conversion to acoustic signal ("noise") → acoustic waves
Machine Translation
  – text in target language → translation ("noise") → source language
Also: Part-of-Speech Tagging
  – sequence of tags → selection of word forms → text

4 Noisy Channel: The Golden Rule of... OCR, ASR, HR, MT, ...
Recall: p(A|B) = p(B|A) p(A) / p(B)   (Bayes formula)
A_best = argmax_A p(B|A) p(A)   (the Golden Rule)
p(B|A): the acoustic/image/translation/lexical model
  – application-specific name
  – will explore later
p(A): the language model
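
A minimal sketch of the Golden Rule as code: score each candidate A by log p(B|A) + log p(A) and take the argmax. The candidate strings and the toy channel and LM probabilities below are invented for illustration; they are not from the lecture.

import math

def decode(observation, candidates, channel_prob, lm_prob):
    """argmax over candidates A of log p(B|A) + log p(A) (log space avoids underflow)."""
    return max(candidates,
               key=lambda a: math.log(channel_prob(observation, a)) + math.log(lm_prob(a)))

# Toy example: two competing transcriptions of the same acoustic input.
channel = {("audio", "recognize speech"): 0.40, ("audio", "wreck a nice beach"): 0.45}
lm      = {"recognize speech": 0.010, "wreck a nice beach": 0.001}

best = decode("audio", list(lm),
              channel_prob=lambda b, a: channel[(b, a)],
              lm_prob=lambda a: lm[a])
print(best)   # -> 'recognize speech' (the LM outweighs the slightly better channel score)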

5 The Perfect Language Model
Sequence of word forms [forget about tagging for the moment]
Notation: A = W = (w_1, w_2, w_3, ..., w_d)
The big (modeling) question: p(W) = ?
Well, we know (Bayes/chain rule):
  p(W) = p(w_1, w_2, w_3, ..., w_d)
       = p(w_1) · p(w_2|w_1) · p(w_3|w_1,w_2) · ... · p(w_d|w_1,w_2,...,w_{d-1})
Not practical (even for short W: too many parameters)
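
For instance, for a three-word sequence (taking the words from the later "LM: an Example" slide) the chain rule expands as

  p(He can buy) = p(He) · p(can | He) · p(buy | He, can)

Each new word conditions on the entire preceding history, which is why the number of parameters explodes as W grows.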

6 Markov Chain
Unlimited memory (cf. previous foil):
  – for w_i, we know all its predecessors w_1, w_2, w_3, ..., w_{i-1}
Limited memory:
  – we disregard "too old" predecessors
  – remember only k previous words: w_{i-k}, w_{i-k+1}, ..., w_{i-1}
  – called "k-th order Markov approximation"
+ stationary character (no change over time):
  p(W) ≅ Π_{i=1..d} p(w_i | w_{i-k}, w_{i-k+1}, ..., w_{i-1}),   d = |W|

7 n-gram Language Models
(n-1)-th order Markov approximation → n-gram LM:
  p(W) =def Π_{i=1..d} p(w_i | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1})
  (w_i: the prediction; w_{i-n+1}, ..., w_{i-1}: the history)
In particular (assume vocabulary |V| = 60k):
  0-gram LM: uniform model, p(w) = 1/|V|, 1 parameter
  1-gram LM: unigram model, p(w), 6 × 10^4 parameters
  2-gram LM: bigram model, p(w_i|w_{i-1}), 3.6 × 10^9 parameters
  3-gram LM: trigram model, p(w_i|w_{i-2},w_{i-1}), 2.16 × 10^14 parameters
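
A quick check of these parameter counts: an n-gram model over a 60k-word vocabulary conditions on n-1 history words, so it stores on the order of |V|^n conditional probabilities, which is exactly the figures quoted above.

V = 60_000
for n in range(1, 5):
    print(f"{n}-gram: ~{float(V ** n):.3g} parameters")
# 1-gram: ~6e+04, 2-gram: ~3.6e+09, 3-gram: ~2.16e+14, 4-gram: ~1.3e+19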

8 LM: Observations
How large n?
  – nothing is enough (theoretically)
  – but anyway: as much as possible (→ close to the "perfect" model)
  – empirically: 3
    parameter estimation? (reliability, data availability, storage space, ...)
    4 is too much: |V| = 60k → 1.296 × 10^19 parameters
    but 6-7 would be (almost) ideal (having enough data): in fact, one can recover the original text from 7-grams!
Reliability ~ (1 / Detail)   (→ need a compromise)
For now, keep word forms (no "linguistic" processing)

9 The Length Issue
∀n: Σ_{w:|w|=n} p(w) = 1  ⇒  Σ_{n=1..∞} Σ_{w:|w|=n} p(w) >> 1  (→ ∞)
We want to model all sequences of words
  – for "fixed"-length tasks: no problem; n fixed, sum is 1
    (tagging, OCR/handwriting, if words are identified ahead of time)
  – for "variable"-length tasks: have to account for (discount) shorter sentences
General model: for each sequence of words of length n, define p'(w) = λ_n p(w)
such that Σ_{n=1..∞} λ_n = 1  ⇒  Σ_{n=1..∞} Σ_{w:|w|=n} p'(w) = 1
e.g., estimate λ_n from data; or use a normal or other distribution
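
A toy illustration of the problem, assuming an invented two-word vocabulary and a unigram model p(a) = p(b) = 0.5 (not from the slides): for every fixed length n the length-n strings sum to 1, so summing over lengths gives far more than 1 unless each length is rescaled by some λ_n.

from itertools import product
from math import prod

p = {"a": 0.5, "b": 0.5}

grand_total = 0.0
for n in range(1, 6):
    # total probability mass assigned to all strings of length n
    mass_n = sum(prod(p[w] for w in seq) for seq in product(p, repeat=n))
    grand_total += mass_n
    print(f"length {n}: total mass {mass_n:.1f}")    # 1.0 for every n
print(f"summed over lengths 1..5: {grand_total:.1f}")  # 5.0  (>> 1)

# Rescaling by lambda_n with sum_n lambda_n = 1 restores a proper distribution;
# the geometric choice lambda_n = 0.5**n below is just one example.
lambdas = [0.5 ** n for n in range(1, 6)]
print(sum(lam * 1.0 for lam in lambdas))   # 0.96875 here; -> 1 as the max length grows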

10 Parameter Estimation
Parameter: numerical value needed to compute p(w|h)
From data (how else?)
Data preparation:
  get rid of formatting etc. ("text cleaning")
  define words (separate but include punctuation, call it "word")
  define sentence boundaries (insert "words" <s> and </s>)
  letter case: keep, discard, or be smart:
    – name recognition
    – number type identification
    [these are huge problems per se!]
  numbers: keep, replace by <num>, or be smart (form ~ pronunciation)
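
A minimal sketch of this kind of preparation. The <s>, </s>, <num> placeholder names follow the slide; the regular expression and the specific choices (lowercasing, digits only as numbers) are just one illustrative setting, not a prescribed recipe.

import re

def prepare(text: str) -> list[str]:
    tokens = ["<s>"]
    for raw in re.findall(r"\w+|[^\w\s]", text):      # split punctuation off as its own "word"
        if raw.isdigit():
            tokens.append("<num>")                     # numbers -> placeholder
        else:
            tokens.append(raw.lower())                 # letter case: discard (one of the options)
        if raw in ".!?":
            tokens += ["</s>", "<s>"]                  # sentence boundaries
    if tokens[-1] == "<s>":
        tokens.pop()
    else:
        tokens.append("</s>")
    return tokens

print(prepare("He can buy 2 cans of soda. Really?"))
# ['<s>', 'he', 'can', 'buy', '<num>', 'cans', 'of', 'soda', '.', '</s>',
#  '<s>', 'really', '?', '</s>']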

11 Maximum Likelihood Estimate
MLE: relative frequency...
  – ...best predicts the data at hand (the "training data")
Trigrams from training data T:
  – count sequences of three words in T: c_3(w_{i-2}, w_{i-1}, w_i)
    [NB: the notation just says that the three words follow each other]
  – count sequences of two words in T: c_2(w_{i-1}, w_i):
    either use c_2(y,z) = Σ_w c_3(y,z,w),
    or count differently at the beginning (& end) of the data!
p(w_i | w_{i-2}, w_{i-1}) =est. c_3(w_{i-2}, w_{i-1}, w_i) / c_2(w_{i-2}, w_{i-1})
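
A minimal sketch of this MLE trigram estimate, using the toy training sentence from the "LM: an Example" slide below; the <s>/</s> padding follows the earlier data-preparation slide.

from collections import Counter

tokens = ["<s>", "<s>", "He", "can", "buy", "the", "can", "of", "soda", ".", "</s>"]

c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))   # trigram counts c_3
c2 = Counter(zip(tokens, tokens[1:]))               # bigram counts  c_2

def p3(w, u, v):
    """MLE estimate p(w | u, v) = c_3(u, v, w) / c_2(u, v)."""
    return c3[(u, v, w)] / c2[(u, v)] if c2[(u, v)] else 0.0

print(p3("buy", "He", "can"))   # 1.0
print(p3("of", "the", "can"))   # 1.0
print(p3("of", "He", "can"))    # 0.0  -- unseen trigram: the sparse-data problem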

12 Character Language Model
Use individual characters instead of words:
  p(W) =def Π_{i=1..d} p(c_i | c_{i-n+1}, c_{i-n+2}, ..., c_{i-1})
Same formulas etc.
Might consider 4-grams, 5-grams, or even more
Good only for language comparison
Transform cross-entropy between letter- and word-based models:
  H_S(p_c) = H_S(p_w) / (avg. # of characters per word in S)
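
As an illustrative example (the numbers are invented, not from the lecture): if H_S(p_w) = 7.5 bits/word and the words of S average 5 characters, then H_S(p_c) = 7.5 / 5 = 1.5 bits/character.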

13 LM: an Example
Training data: He can buy the can of soda.
  – Unigram: p_1(He) = p_1(buy) = p_1(the) = p_1(of) = p_1(soda) = p_1(.) = .125,
    p_1(can) = .25
  – Bigram: p_2(He|<s>) = 1, p_2(can|He) = 1, p_2(buy|can) = .5, p_2(of|can) = .5,
    p_2(the|buy) = 1, ...
  – Trigram: p_3(He|<s>,<s>) = 1, p_3(can|<s>,He) = 1, p_3(buy|He,can) = 1,
    p_3(of|the,can) = 1, ..., p_3(.|of,soda) = 1
  – Entropy: H(p_1) = 2.75, H(p_2) = .25, H(p_3) = 0  → Great?!
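
A small sketch that recomputes these numbers from the training sentence. It reads H(p_n) as the conditional entropy Σ_{h,w} p(h,w) · log2(1 / p(w|h)) with relative frequencies taken from the training data itself; under that reading it reproduces 2.75, 0.25 and 0. The <s>/</s> padding follows the earlier data-preparation slide.

from collections import Counter
from math import log2

tokens = ["<s>", "<s>", "He", "can", "buy", "the", "can", "of", "soda", ".", "</s>"]

def model_entropy(order: int) -> float:
    # events: (history of length order-1, predicted word); boundary tokens are not predicted
    events = [(tuple(tokens[i - order + 1:i]), tokens[i])
              for i in range(order - 1, len(tokens))
              if tokens[i] not in ("<s>", "</s>")]
    joint = Counter(events)                       # c(h, w)
    hist = Counter(h for h, _ in events)          # c(h)
    n = len(events)
    # log2(c(h)/c(h,w)) = -log2 p(w|h)
    return sum((c / n) * log2(hist[h] / c) for (h, _), c in joint.items())

for order in (1, 2, 3):
    print(f"H(p{order}) = {model_entropy(order):.2f}")
# H(p1) = 2.75,  H(p2) = 0.25,  H(p3) = 0.00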

14 LM: an Example (The Problem)
Cross-entropy: S = It was the greatest buy of all.
Even H_S(p_1) fails (= H_S(p_2) = H_S(p_3) = ∞), because:
  – all unigrams but p_1(the), p_1(buy), p_1(of) and p_1(.) are 0,
  – all bigram probabilities are 0,
  – all trigram probabilities are 0.
We want to make all (theoretically possible*) probabilities non-zero.
  * in fact, all: remember our graph from day 1?