Factored Language Models. EE517 Presentation, April 19, 2005. Kevin Duh.

Slide 1: Outline
1. Motivation
2. Factored Word Representation
3. Generalized Parallel Backoff
4. Model Selection Problem
5. Applications
6. Tools

Slide 2: Word-based Language Models
Standard word-based language models. How do we get robust n-gram estimates P(w_t | w_t-1, ..., w_t-n+1)?
- Smoothing, e.g. Kneser-Ney, Good-Turing
- Class-based language models
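
As a concrete reference point (a minimal Python sketch, not part of the original slides), the standard maximum-likelihood n-gram estimate is just a ratio of counts, and it is exactly this ratio that becomes unreliable when contexts are rare or unseen:

    from collections import Counter

    def mle_trigram_prob(tokens, u, v, w):
        # Maximum-likelihood estimate of P(w | u, v) = count(u, v, w) / count(u, v).
        # With sparse data these counts are often zero, which is why smoothing
        # (e.g. Kneser-Ney, Good-Turing) and backoff are needed.
        trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
        bigrams = Counter(zip(tokens, tokens[1:]))
        if bigrams[(u, v)] == 0:
            return 0.0  # unseen context: a real LM would back off or smooth here
        return trigrams[(u, v, w)] / bigrams[(u, v)]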

Slide 3: Limitation of Word-based Language Models
Words are treated as inseparable whole units, e.g. "book" and "books" are distinct vocabulary items. This is especially problematic in morphologically rich languages, e.g. Arabic, Finnish, Russian, Turkish:
- Many unseen word contexts
- High out-of-vocabulary rate
- High perplexity
Arabic example (root k-t-b):
  Kitaab       A book
  Kitaab-iy    My book
  Kitaabu-hum  Their book
  Kutub        Books

Slide 4: Arabic Morphology
(Figure: an Arabic word is built from a root and a pattern plus affixes and particles; the example decomposes a form of the root sakan (LIVE) as LIVE + past + 1st-sg-past + particle, "so I lived".)
Scale: ~5000 roots, several hundred patterns, dozens of affixes.

Slide 5: Vocabulary Growth (full word forms)
Source: K. Kirchhoff, et al., "Novel Approaches to Arabic Speech Recognition: Final Report from the JHU Summer Workshop 2002", JHU Tech Report 2002.

Slide 6: Vocabulary Growth (stemmed words)
Source: K. Kirchhoff, et al., "Novel Approaches to Arabic Speech Recognition: Final Report from the JHU Summer Workshop 2002", JHU Tech Report 2002.

Slide 7: Solution: Words as Factors
- Decompose words into "factors" (e.g. stems).
- Build the language model over factors: P(w | factors).
- Two approaches to decomposition:
  - Linear [e.g. Geutner, 1995]
  - Parallel [Kirchhoff et al., JHU Workshop 2002], [Bilmes & Kirchhoff, NAACL/HLT 2003]
(Figure: in the parallel decomposition, word, stem, and morph streams W_t, S_t, M_t run over positions t-2, t-1, t; in the linear decomposition the word is split into prefix, stem, and suffix tokens.)
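
A minimal sketch (illustrative names and factor values, not from the slides) of the parallel decomposition: each position in a sentence carries a bundle of factors rather than a single atomic word:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FactoredWord:
        word: str  # surface form
        stem: str
        root: str
        pos: str

    # A sentence becomes a sequence of factor bundles; the LM may condition on
    # any subset of the factors at the preceding positions.
    sentence = [
        FactoredWord(word="tiqra", stem="tiqra", root="qrA", pos="V"),    # factor values illustrative
        FactoredWord(word="kutubiy", stem="kutub", root="ktb", pos="N"),
        FactoredWord(word="bi", stem="bi", root="bi", pos="P"),
    ]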

Slide 8: Factored Word Representations
Factors may be any word feature. Here we use morphological features, e.g. POS, stem, root, pattern, etc.
(Figure: parallel factor streams W_t, S_t, M_t at positions t-2, t-1, t.)

Slide 9: Advantages of Factored Word Representations
Main advantage: allows robust estimation of the conditional probabilities (i.e. P(w_t | factors of the preceding words)) using backoff.
- Word combinations in a context may not be observed in the training data, but factor combinations often are.
- Simultaneous class assignment: e.g. Kutub (Books), Kitaab-iy (My book), and Kitaabu-hum (Their book) all share factors such as the root.

Slide 10: Example
Training sentence: "lAzim tiqra kutubiy bi sorca" (You have to read my books quickly)
Test sentence: "lAzim tiqra kitAbiy bi sorca" (You have to read my book quickly)
Count(tiqra, kitAbiy, bi) = 0
Count(tiqra, kutubiy, bi) > 0
Count(tiqra, ktb, bi) > 0
P(bi | kitAbiy, tiqra) can back off to P(bi | ktb, tiqra) to obtain a more robust estimate. This is better than backing off all the way to P(bi | tiqra).
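
The backoff decision in this example can be sketched as follows (hypothetical count values chosen to match the pattern above):

    # Hypothetical counts mirroring the example (values are made up).
    word_ctx_counts = {
        ("tiqra", "kutubiy", "bi"): 3,  # seen in training ("my books")
        ("tiqra", "kitAbiy", "bi"): 0,  # unseen at the word level ("my book")
    }
    root_ctx_counts = {
        ("tiqra", "ktb", "bi"): 5,      # both words share the root ktb, which is seen
    }

    def backoff_context(prev_word, prev_root, word):
        # Use the word-level context if it was observed; otherwise fall back to the
        # root-level context rather than dropping the middle word altogether.
        if word_ctx_counts.get(("tiqra", prev_word, word), 0) > 0:
            return ("word-level", ("tiqra", prev_word, word))
        return ("root-level", ("tiqra", prev_root, word))

    print(backoff_context("kitAbiy", "ktb", "bi"))  # -> ('root-level', ('tiqra', 'ktb', 'bi'))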

Slide 11: Language Model Backoff
When an n-gram count is low, use the (n-1)-gram estimate. This ensures more robust parameter estimation with sparse data.
Word-based LM, backoff path (drop the most distant word at each step):
  P(W_t | W_t-1 W_t-2 W_t-3) -> P(W_t | W_t-1 W_t-2) -> P(W_t | W_t-1) -> P(W_t)
Factored LM, backoff graph (multiple backoff paths are possible):
  F | F1 F2 F3
  -> F | F1 F2, F | F1 F3, F | F2 F3
  -> F | F1, F | F2, F | F3
  -> F
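
The contrast between the single word-based backoff path and the factored backoff graph can be sketched by enumerating which conditioning variables may be dropped at each step (a sketch, not SRILM code):

    from itertools import combinations

    def backoff_graph(parents):
        # For a factored model F | F1 F2 F3, any parent may be dropped at each step,
        # giving a graph of contexts rather than a single chain.
        return [[set(c) for c in combinations(parents, k)]
                for k in range(len(parents), -1, -1)]

    for level in backoff_graph(["F1", "F2", "F3"]):
        print([sorted(ctx) for ctx in level])
    # A word-based n-gram instead follows one fixed chain:
    #   {W-1, W-2, W-3} -> {W-1, W-2} -> {W-1} -> {}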

Slide 12: Choosing Backoff Paths
Four methods for choosing a backoff path:
1. Fixed path (chosen a priori)
2. Choose a path dynamically during training
3. Choose multiple paths dynamically during training and combine the results (Generalized Parallel Backoff)
4. Constrained versions of (2) or (3)
(Backoff graph as on the previous slide: F | F1 F2 F3 down to F.)

Slide 13: Generalized Backoff
Katz backoff (word trigram case):
  P_BO(w_t | w_t-1, w_t-2) = d_c * c(w_t-2, w_t-1, w_t) / c(w_t-2, w_t-1)    if c(w_t-2, w_t-1, w_t) > tau
                             alpha(w_t-1, w_t-2) * P_BO(w_t | w_t-1)         otherwise
Generalized backoff:
  P_GBO(f | f_1, f_2) = d_c * c(f, f_1, f_2) / c(f_1, f_2)    if c(f, f_1, f_2) > tau
                        alpha(f_1, f_2) * g(f, f_1, f_2)      otherwise
g() can be any positive function, but some choices of g() make the computation of the backoff weights alpha difficult.
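
A sketch of the generalized backoff rule under assumed names (counts, the discount d, the threshold tau, the backoff weight alpha, and g are all placeholders, not the SRILM implementation):

    def p_gbo(f, parents, counts, d, tau, alpha, g):
        # Generalized backoff: use the discounted relative frequency when the full
        # context is seen often enough; otherwise defer to g(), scaled by the backoff
        # weight alpha so that the distribution still sums to one.
        # (In practice the discount d depends on the count, e.g. via Good-Turing or
        # Kneser-Ney discounting, and alpha is precomputed per context.)
        c_full = counts.get((f,) + tuple(parents), 0)
        c_ctx = counts.get(tuple(parents), 0)
        if c_ctx > 0 and c_full > tau:
            return d * c_full / c_ctx
        return alpha(tuple(parents)) * g(f, parents)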

Slide 14: g() Functions
- A priori fixed path: g is the backoff probability of a single, pre-chosen lower-order model (always drop the same parent).
- Dynamic path, max counts: back off to the node whose context has the highest raw count. Based on raw counts => favors robust estimation.
- Dynamic path, max normalized counts: back off to the node with the highest normalized count c(f, context) / c(context). Based on maximum likelihood => favors statistical predictability.
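
The three choices of g() described above can be sketched like this (assuming a shared count table and a function backoff_dist(f, context) that evaluates the lower-order model; all names are illustrative):

    def g_fixed(f, parents, backoff_dist, drop=0):
        # A priori fixed path: always drop the same parent (e.g. the most distant one).
        return backoff_dist(f, parents[:drop] + parents[drop + 1:])

    def g_max_counts(f, parents, counts, backoff_dist):
        # Dynamic path, max counts: drop the parent so that the remaining context has
        # the largest raw count (favors robust estimation).
        best = max(range(len(parents)),
                   key=lambda i: counts.get(tuple(parents[:i] + parents[i + 1:]), 0))
        return backoff_dist(f, parents[:best] + parents[best + 1:])

    def g_max_normalized(f, parents, counts, backoff_dist):
        # Dynamic path, max normalized counts: drop the parent giving the largest
        # maximum-likelihood estimate c(f, context) / c(context) (favors predictability).
        def ml(i):
            ctx = tuple(parents[:i] + parents[i + 1:])
            return counts.get((f,) + ctx, 0) / max(counts.get(ctx, 0), 1)
        best = max(range(len(parents)), key=ml)
        return backoff_dist(f, parents[:best] + parents[best + 1:])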

Slide 15: Dynamically Choosing Backoff Paths During Training
Choose the backoff path based on g() and statistics of the data.
(Figure: starting from the full model W_t | W_t-1 S_t-1 T_t-1, the backoff node is chosen step by step, e.g. W_t | W_t-1 S_t-1 T_t-1 -> W_t | W_t-1 S_t-1 -> W_t | S_t-1 -> W_t.)

Slide 16: Multiple Backoff Paths: Generalized Parallel Backoff
Choose multiple paths during training and combine the probability estimates.
(Figure: from W_t | W_t-1 S_t-1 T_t-1, back off in parallel to W_t | W_t-1 S_t-1 and W_t | W_t-1 T_t-1 and combine their estimates.)
Options for the combination: average, sum, product, geometric mean, weighted mean.
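
A sketch of the combination step (assuming the probability estimates from the individual backoff paths have already been computed):

    import math

    def combine(estimates, how="mean", weights=None):
        # Generalized parallel backoff: merge the probabilities obtained from several
        # backoff paths into a single value.
        if how == "mean":
            return sum(estimates) / len(estimates)
        if how == "sum":
            return sum(estimates)
        if how == "product":
            return math.prod(estimates)
        if how == "geometric_mean":
            return math.prod(estimates) ** (1.0 / len(estimates))
        if how == "weighted_mean":
            return sum(w * p for w, p in zip(weights, estimates)) / sum(weights)
        raise ValueError(f"unknown combination: {how}")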

Slide 17: Summary: Factored Language Models
FACTORED LANGUAGE MODEL = Factored Word Representation + Generalized Backoff
- Factored word representation: allows a rich feature-set representation of words.
- Generalized (parallel) backoff: enables robust estimation of models with many conditioning variables.

Slide 18: Model Selection Problem
For n-grams, the choice is e.g. bigram vs. trigram vs. 4-gram: a relatively easy search; just try each and note the perplexity on a development set.
For a factored LM, one must choose:
- the initial conditioning factors,
- the backoff graph,
- the smoothing options.
Too many options; an automatic search is needed.
Tradeoff: the factored LM is more general, but it is harder to select a good model that fits the data well.
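
In outline, the automatic search amounts to scoring candidate specifications by development-set perplexity and keeping the best one; the sketch below assumes a placeholder evaluate_perplexity function that trains and scores a factored LM for a given specification:

    def select_model(candidate_specs, evaluate_perplexity):
        # candidate_specs: iterable of (initial_factors, backoff_graph, smoothing) tuples.
        # evaluate_perplexity: trains an FLM for a spec and returns its dev-set perplexity.
        best_spec, best_ppl = None, float("inf")
        for spec in candidate_specs:
            ppl = evaluate_perplexity(spec)
            if ppl < best_ppl:
                best_spec, best_ppl = spec, ppl
        return best_spec, best_ppl

    # A genetic-algorithm search (as in the results later in the talk) explores the same
    # space, but proposes candidate_specs by mutation and crossover rather than
    # exhaustive enumeration.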

Slide 19: Example: Specifying a Factored LM
The initial conditioning factors, the backoff graph, and the smoothing parameters completely specify a factored language model. E.g. with 3 factors in total:
0. Begin with the full backoff-graph structure for 3 factors (from W_t | W_t-1 S_t-1 T_t-1 down to W_t).
1. The initial factors specify the start node, e.g. W_t | W_t-1 S_t-1.

Slide 20: Example: Specifying a Factored LM (continued)
3. Begin with the subgraph rooted at the new start node (W_t | W_t-1 S_t-1, with children W_t | W_t-1, W_t | S_t-1, and W_t).
4. Specify the backoff graph, i.e. which backoff(s) to use at each node (e.g. W_t | W_t-1 S_t-1 -> W_t | W_t-1 -> W_t).
5. Specify the smoothing for each edge.
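
The three ingredients of this example can be collected into a small specification structure; this is a sketch with made-up field names, not the actual flmspec syntax shown later:

    flm_spec = {
        # 1. Initial conditioning factors (the start node of the backoff graph).
        "initial_factors": ("W_t-1", "S_t-1"),
        # 2. Backoff graph: which parent(s) may be dropped at each node.
        "backoff_graph": {
            ("W_t-1", "S_t-1"): [("W_t-1",)],
            ("W_t-1",): [()],
        },
        # 3. Smoothing options, one per backoff edge.
        "smoothing": {
            (("W_t-1", "S_t-1"), ("W_t-1",)): "kneser-ney",
            (("W_t-1",), ()): "kneser-ney",
        },
    }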

Slide 21: Applications of Factored LMs
- Modeling of Arabic, Turkish, Finnish, German, and other morphologically rich languages [Kirchhoff et al., JHU Summer Workshop 2002], [Duh & Kirchhoff, COLING 2004], [Vergyri et al., ICSLP 2004]
- Modeling of conversational speech [Ji & Bilmes, HLT 2004]
- Applied in speech recognition and machine translation
- The general factored LM tools can also be used to obtain smoothed conditional probability tables for applications outside language modeling (e.g. tagging)
- More possibilities (factors can be anything!)

Slide 22: To Explore Further
The factored language model is now part of the standard SRI Language Modeling Toolkit distribution (v1.4.1). Thanks to Jeff Bilmes (UW) and Andreas Stolcke (SRI).
Downloadable at:

Slide 23: fngram Tools
Commands:
  fngram-count -factor-file my.flmspec -text train.txt
  fngram -factor-file my.flmspec -ppl test.txt
train.txt ("Factored LM is fun", each word annotated with its POS factor):
  W-Factored:P-adj W-LM:P-noun W-is:P-verb W-fun:P-adj
my.flmspec:
  W: 2 W(-1) P(-1) my.count my.lm 3
    W1,P1 W1 kndiscount gtmin 1 interpolate
    P1 P1 kndiscount gtmin 1
    0 0 kndiscount gtmin 1


Slide 25: Turkish Language Model
- Newspaper text from the web [Hakkani-Tür, 2000]
- Train: 400K tokens / Dev: 100K / Test: 90K
- Factors obtained from a morphological analyzer
(Table: example decomposition of the word "yararmanlak" into its factors.)

Slide 26: Turkish: Dev Set Perplexity
(Table: dev-set perplexity by n-gram order for the word-based LM, hand-designed FLM, random FLM, and genetic-search FLM, with relative perplexity reduction in %.)
- Factored language models found by genetic algorithms perform best.
- The poor performance of the high-order hand-designed FLM reflects the difficulty of manual search.

Slide 27: Turkish: Eval Set Perplexity
(Table: eval-set perplexity for the same word-based LM, hand FLM, random FLM, and genetic FLM, with relative perplexity reduction in %.)
- The dev-set results generalize to the eval set, i.e. the genetic algorithm did not overfit.
- The best models used Word, POS, Case, and Root factors and parallel backoff.

Slide 28: Arabic Language Model
- LDC CallHome Conversational Egyptian Arabic speech transcripts
- Train: 170K words / Dev: 23K / Test: 18K
- Factors from a morphological analyzer [LDC, 1996], [Darwish, 2002]
(Table: example decomposition of the word "Il+dOr" into its factors.)

Slide 29: Arabic: Dev Set and Eval Set Perplexity
(Tables: dev-set and eval-set perplexities for the word-based LM, hand FLM, random FLM, and genetic FLM, with relative perplexity reduction in %.)
- The best models used all available factors (Word, Stem, Root, Pattern, Morph) and various parallel backoffs.

Slide 30: Word Error Rate (WER) Results
(Table: WER by recognition stage on the dev set and on the eval set (eval 97), word-LM baseline vs. factored LM.)
Factored language models gave a 1.5% improvement in WER.