1 Language Model (LM) LING 570 Fei Xia Week 4: 10/21/2009

2 LM
Word prediction: predict the next word in a sentence.
–Ex: I’d like to make a collect ___
Statistical models of word sequences are called language models (LMs).
Task:
–Build a statistical model from the training data.
–Given a sentence w_1 w_2 … w_n, estimate its probability P(w_1 … w_n).
Goal: the model should prefer good sentences to bad ones.

3 Some Terms
Corpus: a collection of text or speech.
Words: may or may not include punctuation marks.
Types: the number of distinct words in a corpus.
Tokens: the total number of words in a corpus.

4 Applications of LMs
Speech recognition
–Ex: I bought two/too/to books.
Handwriting recognition
Machine translation
Spelling correction
…

5 Outline
N-gram LM
Evaluation

6 N-gram LM

7 Given a sentence w_1 w_2 … w_n, how do we estimate its probability P(w_1 … w_n)?
The Markov independence assumption: P(w_n | w_1, …, w_{n-1}) depends only on the previous k words.
P(w_1 … w_n) = P(w_1) * P(w_2 | w_1) * … * P(w_n | w_1, …, w_{n-1})
             ≈ P(w_1) * P(w_2 | w_1) * … * P(w_n | w_{n-k+1}, …, w_{n-1})
0th-order Markov model: unigram model
1st-order Markov model: bigram model
2nd-order Markov model: trigram model
…

8 Unigram LM
P(w_1 … w_n) ≈ P(w_1) * P(w_2) * … * P(w_n)
Estimating P(w):
–MLE: P(w) = C(w)/N, where N is the number of tokens
How many states in the FSA?
How many model parameters?

9 Bigram LM
P(w_1 … w_n) = P(BOS w_1 … w_n EOS)
             ≈ P(BOS) * P(w_1 | BOS) * … * P(w_n | w_{n-1}) * P(EOS | w_n)
Estimating P(w_n | w_{n-1}):
–MLE: P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
How many states in the FSA?
How many model parameters?

10 Trigram LM
P(w_1 … w_n) = P(BOS w_1 … w_n EOS)
             ≈ P(BOS) * P(w_1 | BOS) * P(w_2 | BOS, w_1) * … * P(w_n | w_{n-2}, w_{n-1}) * P(EOS | w_{n-1}, w_n)
Estimating P(w_n | w_{n-2}, w_{n-1}):
–MLE: P(w_n | w_{n-2}, w_{n-1}) = C(w_{n-2}, w_{n-1}, w_n) / C(w_{n-2}, w_{n-1})
How many states in the FSA?
How many model parameters?
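Not part of the original slides: a minimal Python sketch of the MLE estimation described on slides 8-10, shown here for the bigram case. The function name and the <BOS>/<EOS> markers are illustrative choices.

from collections import defaultdict

def train_bigram_lm(sentences):
    """Estimate bigram probabilities P(w_n | w_{n-1}) by MLE from tokenized sentences."""
    unigram_counts = defaultdict(int)   # C(w)
    bigram_counts = defaultdict(int)    # C(w_{n-1}, w_n)
    for sent in sentences:
        words = ["<BOS>"] + sent + ["<EOS>"]
        for w in words:
            unigram_counts[w] += 1
        for prev, cur in zip(words, words[1:]):
            bigram_counts[(prev, cur)] += 1
    # MLE: P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
    return {(prev, cur): c / unigram_counts[prev]
            for (prev, cur), c in bigram_counts.items()}

probs = train_bigram_lm([["I", "bought", "two", "books"],
                         ["I", "bought", "a", "book"]])
print(probs[("I", "bought")])    # 1.0
print(probs[("bought", "two")])  # 0.5

The same counting scheme extends to trigrams by conditioning on the previous two words.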

11 Text generation
Unigram: To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
Bigram: What means, sir. I confess she? then all sorts, he is trim, captain.
Trigram: Sweet prince, Falstaff shall die. Harry of Monmouth’s grave.
4-gram: Will you not tell me who I am? It cannot be but so.
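As a hedged illustration of the generation procedure itself (not the code actually used to produce the Shakespeare samples above), one can repeatedly sample the next word from a bigram table like the one built earlier; the toy table below is invented for demonstration.

import random

def generate(bigram_probs, max_len=20):
    """Sample a sentence from a bigram LM, starting at <BOS> and stopping at <EOS>."""
    sentence, prev = [], "<BOS>"
    for _ in range(max_len):
        # Gather the conditional distribution P(. | prev) and sample from it.
        candidates = [(w, p) for (hist, w), p in bigram_probs.items() if hist == prev]
        if not candidates:
            break
        words, weights = zip(*candidates)
        cur = random.choices(words, weights=weights)[0]
        if cur == "<EOS>":
            break
        sentence.append(cur)
        prev = cur
    return " ".join(sentence)

toy = {("<BOS>", "I"): 1.0, ("I", "bought"): 1.0,
       ("bought", "two"): 0.5, ("bought", "a"): 0.5,
       ("two", "books"): 1.0, ("a", "book"): 1.0,
       ("books", "<EOS>"): 1.0, ("book", "<EOS>"): 1.0}
print(generate(toy))  # e.g. "I bought a book"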

12 N-gram LM packages
SRI LM toolkit
CMU LM toolkit
…

13 So far
N-grams:
–Number of states in the FSA: |V|^(N-1)
–Number of model parameters: |V|^N
Remaining issues:
–Data sparseness problem → smoothing
–Unknown words: OOV rate
–Mismatch between training and test data → model adaptation
–Other LM models: structured LM, class-based LM

14 Evaluation

15 Evaluation (in general)
Evaluation is required for almost all CompLing papers.
There are many factors to consider:
–Data
–Metrics
–Results of competing systems
–…
You need to think about evaluation from the very beginning.

16 Rules for evaluation
Always evaluate your system
Use standard metrics
Separate training/dev/test data
Use standard training/dev/test data
Clearly specify the experimental setting
Include baselines and results from competing systems
Perform error analysis
Show the system is useful for real applications (optional)

17 Division of data
Training data:
–True training data: to learn model parameters
–Held-out data: to tune other parameters
Development data: used while developing a system.
Test data: used only once, for the final evaluation.
Dividing the data:
–Common ratio: 80%, 10%, 10%
–N-fold cross-validation
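Not from the slides: a minimal sketch of the 80/10/10 split mentioned above. The function name, the shuffling policy, and the fixed seed are my own illustrative choices.

import random

def split_data(sentences, train=0.8, dev=0.1, seed=0):
    """Shuffle a corpus and split it into training/dev/test portions (e.g. 80/10/10)."""
    data = list(sentences)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_dev = int(n * train), int(n * dev)
    return (data[:n_train],                   # training data
            data[n_train:n_train + n_dev],    # development data
            data[n_train + n_dev:])           # test data

train_set, dev_set, test_set = split_data(range(100))
print(len(train_set), len(dev_set), len(test_set))  # 80 10 10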

18 Standard metrics for LM
Direct evaluation:
–Perplexity
Indirect evaluation:
–ASR
–MT
–…

19 Perplexity Perplexity is based on computing the probabilities of each sentence in the test set. Intuitively, whichever model assigns a higher probability to the test set is a better model.

20 Definition of perplexity
Test data T = s_1 … s_m
Let N be the total number of words in T.
Perplexity(T) = P(T)^(-1/N) = 2^(-(1/N) * lg P(T))
Lower values mean that the model is better.

21 Perplexity

22 Calculating Perplexity
Suppose T consists of m sentences: s_1, …, s_m, so P(T) = P(s_1) * … * P(s_m)
N = word_num + sent_num – oov_num

23 Calculating P(s)
Let s = w_1 … w_n
P(w_1 … w_n) = P(BOS w_1 … w_n EOS)
             = P(w_1 | BOS) * P(w_2 | BOS, w_1) * … * P(w_n | w_{n-2}, w_{n-1}) * P(EOS | w_{n-1}, w_n)
If an n-gram contains an unknown word, skip that n-gram (i.e., remove it from the equation) and increment oov_num.
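Putting slides 20-23 together, here is a minimal Python sketch of the perplexity computation with a trigram scoring function left abstract. The helper names and base-2 logs are my choices; the OOV-skipping and the N = word_num + sent_num – oov_num convention follow the slides.

import math

def sentence_logprob(words, logprob_fn, vocab):
    """Sum lg P(w_i | w_{i-2}, w_{i-1}) over one sentence; skip any trigram that
    contains an unknown word and count the sentence's OOV tokens."""
    padded = ["<BOS>", "<BOS>"] + words + ["<EOS>"]
    total = 0.0
    oov = sum(1 for w in words if w not in vocab)
    for i in range(2, len(padded)):
        trigram = padded[i - 2:i + 1]
        if any(w not in vocab and w not in ("<BOS>", "<EOS>") for w in trigram):
            continue                       # skip n-grams containing an OOV word
        total += logprob_fn(padded[i], padded[i - 2], padded[i - 1])
    return total, oov

def perplexity(sentences, logprob_fn, vocab):
    """Perplexity(T) = 2^(-(1/N) * lg P(T)) with N = word_num + sent_num - oov_num."""
    log_p_T = 0.0
    word_num = sent_num = oov_num = 0
    for sent in sentences:
        lp, oov = sentence_logprob(sent, logprob_fn, vocab)
        log_p_T += lp
        word_num += len(sent)
        sent_num += 1
        oov_num += oov
    N = word_num + sent_num - oov_num
    return 2 ** (-log_p_T / N)

# Toy check of slide 24: a uniform model over a 10-word vocabulary gives perplexity 10.
vocab = {"w%d" % i for i in range(10)}
uniform = lambda w, u, v: math.log2(1.0 / len(vocab))
print(perplexity([["w1", "w2", "w3"]], uniform, vocab))  # 10.0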

24 Some intuition about perplexity
Given a vocabulary V, assume a uniform distribution; i.e., P(w) = 1/|V|.
The perplexity of any test data T under this unigram LM is |V|:
Perplexity(T) = 2^(-(1/N) * N * lg(1/|V|)) = |V|
Perplexity is a measure of the effective “branching factor”.

25 Standard metrics for LM
Direct evaluation:
–Perplexity
Indirect evaluation:
–ASR
–MT
–…

26 ASR
Word error rate (WER):
–System: And he saw apart of the movie
–Gold: Andy saw a part of the movie
–WER = 4/7 (substitute “Andy”→“And”, insert “he”, substitute “a”→“apart”, delete “part”; 4 edits over 7 reference words)
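A minimal sketch of the WER computation via word-level edit distance (standard dynamic programming; the function name is mine). Running it on the example above gives 4 edits over 7 reference words.

def word_error_rate(reference, hypothesis):
    """WER = minimum (substitutions + insertions + deletions) / len(reference)."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution or match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(r)][len(h)] / len(r)

print(word_error_rate("Andy saw a part of the movie",
                      "And he saw apart of the movie"))  # 4/7 ≈ 0.571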

27 Summary
N-gram LMs
Evaluation for LMs:
–Perplexity(T) = 2^(-(1/N) * lg P(T)) = 2^(H(L,P))
–Indirect measures: WER for ASR, BLEU for MT, etc.

28 Next time
Smoothing: J&M
Other LMs: class-based LM, structured LM

29 Additional slides

30 Entropy
Entropy is a measure of the uncertainty associated with a distribution.
It gives the lower bound on the average number of bits it takes to transmit messages.
An example:
–Report the results of horse races.
–Goal: minimize the number of bits needed to encode the results.

31 An example
Uniform distribution over 8 horses: p_i = 1/8, entropy = 3 bits.
Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64), entropy = 2 bits, with optimal code (0, 10, 110, 1110, 111100, 111101, 111110, 111111).
⇒ The uniform distribution has higher entropy.
⇒ MaxEnt: make the distribution as “uniform” as possible.

32 Cross Entropy
Entropy: H(p) = – Σ_x p(x) lg p(x)
Cross entropy: H(p, q) = – Σ_x p(x) lg q(x)
Cross entropy acts as a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).
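A small sketch that checks the horse-race numbers from slide 31 against the definitions above (plain Python, base-2 logs; the helper names are mine).

import math

def entropy(p):
    """H(p) = -sum_x p(x) * lg p(x)"""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * lg q(x); equals H(p) when q == p."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

uniform = [1/8] * 8
skewed  = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy(uniform))                # 3.0 bits
print(entropy(skewed))                 # 2.0 bits
print(cross_entropy(skewed, uniform))  # 3.0 bits, >= H(skewed)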

33 Cross entropy of a language
The cross entropy of a language L under a model P:
H(L, P) = lim_{n→∞} – (1/n) Σ_{w_1…w_n} p(w_1 … w_n) lg P(w_1 … w_n)
If we make certain assumptions that the language is “nice” (stationary and ergodic), then the cross entropy can be calculated from a single long sample:
H(L, P) ≈ – (1/n) lg P(w_1 … w_n)

34 Perplexity
Perplexity(T) = 2^(H(L,P)) = P(w_1 … w_N)^(-1/N)