CMU-Statistical Language Modeling & SRILM Toolkits


CMU-Statistical Language Modeling & SRILM Toolkits Ashwin Acharya ECE 5527 Search and Decoding Instructor: Dr. V. Këpuska

Objective Use the SRILM and CMU (Carnegie Mellon University) toolkits to build different language models. Use the four discounting strategies to build the language models. Perform perplexity tests with both LM toolkits to study each one's performance.

CMU-SLM Toolkit The CMU SLM Toolkit is a set of Unix software tools created to aid statistical language modeling. The key tools provided are used to process textual data into:
- Vocabularies and word frequency lists
- Bigram and trigram counts
- N-gram-related statistics
- Various N-gram language models
Once the language models are created, they can be used to compute:
- Perplexity
- Out-of-vocabulary rate
- N-gram hit ratios
- Backoff distributions

SRILM Toolkit The SRILM toolkit is used for building and applying statistical language models (LMs), primarily for speech recognition, statistical tagging and segmentation, and machine translation. SRILM consists of the following components:
- A set of C++ class libraries implementing language models, supporting data structures, and miscellaneous utility functions.
- A set of executable programs built on top of these libraries to perform standard tasks such as training LMs and testing them on data, and tagging or segmenting text.
- A collection of miscellaneous scripts facilitating minor related tasks.
SRILM runs on UNIX platforms.

Toolkit Environment The two toolkits run in a Unix environment such as Linux. However, Cygwin works well and can be used to simulate this Unix environment on a Windows platform. Download Cygwin for free from the following link: http://www.cygwin.com/

Cygwin Install
- Download the Cygwin installation file
- Execute setup.exe
- Choose “Install from Internet”
- Select the root install directory “C:\cygwin”
- Choose a download site from the list of mirrors

Cygwin Install Make sure all of the following packages are selected during the install process:
- gcc: the compiler
- GNU make: build utility that automatically determines which pieces of a large program need to be recompiled and issues the commands to recompile them
- Tcl toolkit: Tool Command Language
- tcsh: TENEX C shell
- gzip: to read/write compressed files
- GNU awk (gawk): to interpret many of the utility scripts
- binutils: GNU assembler, linker, and binary utilities

SRILM Install
Download the SRILM toolkit, srilm.tgz
Run Cygwin
Unzip srilm.tgz with the following commands:
$ cd /cygdrive/c/cygwin/srilm
$ tar zxvf srilm.tgz
Add the following lines to the top-level Makefile to set up the paths:
SRILM=/cygdrive/c/cygwin/srilm
MACHINE_TYPE=cygwin
Then type the following command to install SRILM:
$ make World
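Put together, a minimal install sketch, assuming the archive was saved under C:\cygwin\srilm and that make World places the Cygwin binaries under bin/cygwin (the exact output directory may differ between SRILM versions):

# Run inside a Cygwin shell
cd /cygdrive/c/cygwin/srilm
tar zxvf srilm.tgz
# Edit the top-level Makefile so that it contains:
#   SRILM=/cygdrive/c/cygwin/srilm
#   MACHINE_TYPE=cygwin
make World
# Make the tools callable from anywhere (assumed location; verify where
# the binaries were actually written on your system)
export PATH=$PATH:/cygdrive/c/cygwin/srilm/bin:/cygdrive/c/cygwin/srilm/bin/cygwin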

CMU-SLM Install
Download the CMU toolkit, CMU-Cam_Toolkit_v2.tgz
Run Cygwin
Unzip CMU-Cam_Toolkit_v2.tgz with the following commands:
$ cd /cygdrive/c/cygwin/CMU
$ tar zxvf CMU-Cam_Toolkit_v2.tgz
Then type the following command to install the CMU toolkit:
$ make all

Training Corpus Use the corpus provided on the Natural Language Toolkit (NLTK) website: http://en.sourceforge.jp/projects/sfnet_nltk/downloads/nltk-lite/0.9/nltk-data-0.9.zip/ It contains a lexicon and two different corpora from the Australian Broadcasting Commission:
- The first is a Rural News corpus
- The second is a Science News corpus

SRILM Procedure Generate the 3-gram count file with the following command:
$ ./ngram-count -vocab abc/en.vocab -text abc/rural.txt -order 3 -write abc/rural.count -unk

Count File
ngram-count: count N-grams and estimate language models.
-vocab file: Read a vocabulary from file. Subsequently, out-of-vocabulary words in counts or text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
-text textfile: Generate N-gram counts from the text file. textfile should contain one sentence unit per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
-order n: Set the maximal order (length) of N-grams to count. This also determines the order of the estimated LM, if any. The default order is 3.
-write file: Write total counts to file.
-unk: Build an “open vocabulary” LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.

SRILM Procedure Creating the language model:
$ ./ngram-count -read abc/rural.count -order 3 -lm abc/rural.lm
-read countsfile: Read N-gram counts from a file. ASCII count files contain one N-gram of words per line, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added.
-lm lmfile: Estimate a backoff N-gram model from the total counts and write it to lmfile.

Discounting: Good-Turing
$ ./ngram-count -read abc/rural.count -order 3 -lm abc/gtrural.lm -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3
-gtnmin count: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with frequency lower than that will effectively be discounted to 0. If n is omitted, the parameter for N-grams of order > 9 is set. NOTE: This option affects not only the default Good-Turing discounting but also the alternative discounting methods described below.
-gtnmax count: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the maximal count of N-grams of order n that are discounted under Good-Turing. All N-grams more frequent than that will receive maximum likelihood estimates. Discounting can be effectively disabled by setting this to 0. If n is omitted, the parameter for N-grams of order > 9 is set.

Discounting: Absolute and Witten-Bell
Absolute discounting:
$ ./ngram-count -read abc/test1.txt -order 3 -lm abc/ruralcd.lm -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5
-cdiscountn discount: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Ney's absolute discounting for N-grams of order n, using discount as the constant to subtract.
Witten-Bell discounting:
$ ./ngram-count -read abc/test1.txt -order 3 -lm abc/ruralwb.lm -wbdiscount1 -wbdiscount2 -wbdiscount3
-wbdiscountn: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Witten-Bell discounting for N-grams of order n. (This is the estimator where the first occurrence of each word is taken to be a sample for the “unseen” event.)

Discounting: Kneser-Ney
$ ./ngram-count -read project/count.txt -order 3 -lm knlm.txt -kndiscount1 -kndiscount2 -kndiscount3
-kndiscountn: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Chen and Goodman's modified Kneser-Ney discounting for N-grams of order n.
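The four models used in the perplexity tests below can be built from the same count file in one pass; a sketch using the flags shown above (the count and model file names follow the earlier examples and should be adjusted to your own layout):

# Good-Turing, with the count ranges used above
./ngram-count -read abc/rural.count -order 3 -lm abc/gtrural.lm \
    -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3
# Absolute (Ney) discounting with constant 0.5
./ngram-count -read abc/rural.count -order 3 -lm abc/cdrural.lm \
    -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5
# Witten-Bell
./ngram-count -read abc/rural.count -order 3 -lm abc/wbrural.lm \
    -wbdiscount1 -wbdiscount2 -wbdiscount3
# Modified Kneser-Ney
./ngram-count -read abc/rural.count -order 3 -lm abc/knrural.lm \
    -kndiscount1 -kndiscount2 -kndiscount3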

Perplexity Perplexity is the most common evaluation metric for N-gram language models. It is relatively cheap compared with end-to-end evaluation, which requires testing the recognition results produced with each language model. Perplexity is the inverse probability of the test set, normalized by the number of words. A better language model assigns a higher probability to the test data, which lowers the perplexity; in other words, a lower perplexity means the test set is more probable under that language model.
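Written out, with w_1 ... w_N the N words of the test set (the standard definition):

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}

that is, the Nth root of the inverse of the probability the model assigns to the test set.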

Test Perplexity Three news articles were chosen at random from the Internet as test data. Commands for the four different 3-gram language models:
$ ./ngram -ppl abc/test1.txt -order 3 -lm project/gtrural.lm
$ ./ngram -ppl abc/test1.txt -order 3 -lm project/cdrural.lm
$ ./ngram -ppl abc/test1.txt -order 3 -lm project/wbrural.lm
$ ./ngram -ppl abc/test1.txt -order 3 -lm project/knrural.lm
Repeat these tests for all three test files with each of the language models, as sketched below.
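A small loop covers every model/test combination; a sketch that assumes the model and test files are named and located as in the commands above:

# Perplexity of each 3-gram LM on each test article
for lm in gtrural cdrural wbrural knrural; do
    for t in test1 test2 test3; do
        echo "== $lm on $t =="
        ./ngram -ppl abc/$t.txt -order 3 -lm project/$lm.lm
    done
done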

Results of Perplexity Test SRILM perplexity test with test2.txt against the language model gtrural.lm (rural corpus with Good-Turing discounting)

Building LM with CMU SLM toolkit The steps below outline how to build a language model using the CMU language model toolkit.

Building LM with CMU SLM toolkit
- text2wfreq: compute the word unigram counts.
- wfreq2vocab: convert the word unigram counts into a vocabulary consisting of a preset number of the most common words.
- text2idngram: convert the text into N-grams of word ids, based on the vocabulary.
- idngram2lm: build the language model from the id N-grams.
- evallm: compute the perplexity of a test corpus under the model.

CMU SLM Procedure Given a large corpus of text in a file a.text, but no specified vocabulary:
Compute the word unigram counts:
cat a.text | text2wfreq > a.wfreq
Convert the word unigram counts into a vocabulary consisting of the 20,000 most common words:
cat a.wfreq | wfreq2vocab -top 20000 > a.vocab
Generate a binary id 3-gram of the training text, based on this vocabulary:
cat a.text | text2idngram -vocab a.vocab > a.idngram
Convert the idngram into a binary format language model:
idngram2lm -idngram a.idngram -vocab a.vocab -binary a.binlm
Compute the perplexity of the language model with respect to some test text b.text:
evallm -binary a.binlm
Reading in language model from file a.binlm
Done.
evallm : perplexity -text b.text
Computing perplexity of the language model with respect to the text b.text
Perplexity = 128.15, Entropy = 7.00 bits
Computation based on 8842804 words.
Number of 3-grams hit = 6806674 (76.97%)
Number of 2-grams hit = 1766798 (19.98%)
Number of 1-grams hit = 269332 (3.05%)
1218322 OOVs (12.11%) and 576763 context cues were removed from the calculation.
evallm : quit
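The same pipeline can also be run as a small non-interactive script; a sketch that reuses the file names above and assumes evallm reads its commands from standard input, as in the interactive session shown:

# Build the vocabulary of the 20,000 most frequent words
cat a.text | text2wfreq > a.wfreq
cat a.wfreq | wfreq2vocab -top 20000 > a.vocab
# Id 3-gram stream, then the binary language model
cat a.text | text2idngram -vocab a.vocab > a.idngram
idngram2lm -idngram a.idngram -vocab a.vocab -binary a.binlm
# Perplexity of b.text under the model
echo "perplexity -text b.text" | evallm -binary a.binlm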

Perplexity Test Results CMU perplexity test with test1.txt against the language model ruralgt.lm (rural corpus with Good-Turing discounting)

Perplexity Results

                    Good-Turing        Absolute           Witten-Bell        Kneser-Ney
                    SRILM    CMU-LM    SRILM    CMU-LM    SRILM    CMU-LM    SRILM    CMU-LM
Rural Corpus
  test 1            383.84   978.88    411.84   1011.08   390.46   1156.11   391.48   1123.38
  test 2            356.32   666.3     352.89   680.38    344.05   796.11    343.51   784.25
  test 3            895.6    1656.68   910.79   1699.01   834.36   2039.16   726.24   1835.67
Science Corpus
  test 1            471.83   811.02    475.37   821.47    451.8    976.16    483.13   933.58
  test 2            83.82    87.3      68.14    87.14     72.08    78.07     48.77    108.75
  test 3            766.38   1501.31   836.04   1512.02   796.51   1885.12   733.17   1743.37

Results Perplexity tests based on the language models built from the Rural News corpus

Results Perplexity tests based on the language models built from the Science News corpus

Conclusion It can clearly be seen that the SRILM toolkit performed better than the CMU toolkit in terms of the perplexity tests. We can also notice that for specific test corpora, certain discounting strategies were better suited than others. For instance, Kneser-Ney discounting performed relatively better than the others on most of the tests. The CMU language models' perplexity results were somewhat on the higher side; a possible fix would be to adjust the discounting ranges. Results could also have been improved by using some form of word tokenization.

References
SRI International, "The SRI Language Modeling Toolkit", http://www.speech.sri.com/projects/srilm/
Cygwin Information and Installation, "Installing and Updating Cygwin", http://www.cygwin.com/
Natural Language Toolkit, http://nltk.sourceforge.net/index.php/Corpora
Carnegie Mellon University, CMU Statistical Language Model Toolkit, http://www.speech.cs.cmu.edu