CMU-Statistical Language Modeling & SRILM Toolkits Ashwin Acharya Ece 5527 Search and Decoding Instructor: Dr. v. Këpuska
Objective Use the SRILM and CMU (Carnegie Mellon University) toolkits to built differing language models. Use the four Discounting Strategies to build the language models. Perform perplexity tests with both LM toolkits to study each ones performance.
CMU-SLM Toolkit The CMU SLM Toolkit is a set of Unix software tools created to aid statistical language modeling. The key tools provided are used to process textual data into: Vocabularies and word frequency lists Bigram and trigram counts N-gram related statistics Various N-gram language model Once the language models are created they can be used to compute: Perplexity Out-of-vocabulary rate N-gram hit ratios Back off distributions
SRILM Toolkit SRILM toolkit is used for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation SRILM consists of the following components: A set of C++ class libraries implementing language models, supporting data structures and miscellaneous utility functions. A set of executable programs built on top of these libraries to perform standard tasks such as training LMs and testing them on data, tagging or segmenting text. A collection of miscellaneous scripts facilitating minor related tasks. SRILM runs on UNIX platforms
Toolkit Environment The two toolkits run in a Unix Environment like Linux However, Cygwin works well and can be used to simulate this Unix environment on a windows platform. Download free Cygwin form following link: http://www.cygwin.com/
Cygwin Install Download the Cygwin installation file Execute setup.exe Choose “Install from Internet” Select root install directory “ C:\cygwin” Choose a download site “mirrors”
Cygwin Install Make sure all of the following packages are selected during the install process: gcc: the compiler GNU make: utility, which determines automatically which pieces of a large program need to be recompiled, and issues the commands to recompile them. TCL toolkit: tool command language Tcsh: TENEX C shell gzip: to read/write compressed file GNU awk(gawk): to interpret many of the utility script Binults: GNU assembler, linker and binary utilities
SRILM Install Download SRILM toolkit, srilm.tgz Run Cygwin Unzip srilm.tgz by following commands: $ cd /cygdrive/c/cygwin/srilm $ tar zxvf srilm.tgz Add following lines to setup the direction in the makefile: SRILM=/cygdrive/c/cygwin/srilm MACHINE_TYPE=cygwin Run Cygwin and type following command to install SRILM: $ Make World
CMU-SLM Install Download CMU toolkit,CMU-Cam_Toolkit_v2.tgz Run Cygwin Unzip srilm.tgz by following commands: $ cd /cygdrive/c/cygwin/CMU $ tar zxvf CMU-Cam_Toolkit_v2.tgz Run Cygwin and type following command to install CMU: $ Make All
Training Corpus Use the corpus provided by National Language Toolkit website. http://en.sourceforge.jp/projects/sfnet_nltk/downloads/nltk-lite/0.9/nltk-data-0.9.zip/ Containing a lexicon and two different corpora from the Australian Broadcasting Commission. The first is a Rural News based corpora The second is a Science News based corpora
SRILM Procedure Generate 3-gram count file by using following commands: $./ngram-count -vocab abc/en.vocab -text abc/rural.txt -order 3 -write abc/rural.count -unk
Count File ngram-count count N-grams and estimate language models -vocab file Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts or text are replaced with the unknown-word token. If this option is not specified all words found are implicitly added to the vocabulary. -text textfile Generate N-gram counts from text file. textfile should contain one sentence unit per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored. -order n Set the maximal order (length) of N-grams to count. This also determines the order of the estimated LM, if any. The default order is 3. -write file Write total counts to file. -unk Build an “open vocabulary” LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.
SRILM Procedure Creating the language model : $ ./ngram-count -read abc/rural.count -order 3 -lm abc/rural.lm -read countsfile Read N-gram counts from a file. ASCII count files contain one N-gram of words per line, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added. -lm lmfile Estimate a backoff N-gram model from the total counts, and write it to lmfile.
Discounting Good-Turing Discount -gtnmin count $ ./ngram-count -read abc/rural.count -order 3 -lm abc/gtrural.lm -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3 -gtnmin count where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with frequency lower than that will effectively be discounted to 0. If n is omitted the parameter for N-grams of order > 9 is set. NOTE: This option affects not only the default Good-Turing discounting but the alternative discounting methods described below as well. -gtnmax count where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the maximal count of N-grams of order n that are discounted under Good-Turing. All N-grams more frequent than that will receive maximum likelihood estimates. Discounting can be effectively disabled by setting this to 0. If n is omitted the parameter for N-grams of order > 9 is set.
Discounting Absolute Discounting Witten-Bell -cdiscountn discount $ ./ngram-count -read abc/test1.txt -order 3 -lm abc/ruralcd.lm -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5 -cdiscountn discount where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Ney's absolute discounting for N-grams of order n, using discount as the constant to subtract. Witten-Bell $ ./ngram-count -read abc/test1.txt -order 3 -lm abc/ruralwb.lm -wbdiscount1 -wbdiscount2 -wbdiscount3 -wbdiscountn where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Witten-Bell discounting for N-grams of order n. (This is the estimator where the first occurrence of each word is taken to be a sample for the “unseen” event.)
Discounting Kneser-Ney Discounting -kndiscountn $ ./ngram-count -read project/count.txt -order 3 -lm knlm.txt -kndiscount1 -kndiscount2 -kndiscount3 -kndiscountn where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Chen and Goodman's modified Kneser-Ney discounting for N-grams of order n.
Perplexity Perplexity is the common evaluation metric for N-gram language models. It is relative cheaper than the end-to-end evaluation which requires to test the recognition results of language models. The perplexity is defined as following: The better language model will assign a higher probability to the test data which lower the perplexity. As the result, the lower the perplexity means the higher probability of the test set according to that specific language model. Normalized by number of words. probability of the test set
Test Perplexity Randomly choose three articles of news from Internet for test data. Commands for four different 3-gram language Models: $ ./ngram -ppl abc/test1.txt -order 3 -lm project/gtrural.lm $ ./ngram -ppl abc/test1.txt -order 3 -lm project/cdrural.lm $ ./ngram -ppl abc/test1.txt -order 3 -lm project/wbrural.lm $ ./ngram -ppl abc/test1.txt -order 3 -lm project/knrural.lm Repeat these tests for all three test data with the other language models
Results of Perplexity Test SRILM Perplexity test with test2.txt with language model gtrural.lm ( rural corpora with Good-Turing Discounting)
Building LM with CMU SLM toolkit The following is a simple flow chart which explains the steps to build a language model using CMU language model toolkit.
Building LM with CMU SLM toolkit text2wfreq: compute the word unigram counts. wfreq2vocab: convert the word unigram counts into a vocabulary consisting of the preset number of common words. text2idngram: Convert the idngram into a binary format language model. idngram2lm: Build the language model. evallm: compute the perplexity of the test corpus.
CMU SLM Procedure Given a large corpus of text in a file a.text, but no specified vocabulary Compute the word unigram counts cat a.text | text2wfreq > a.wfreq Convert the word unigram counts into a vocabulary consisting of the 20,000 most common words cat a.wfreq | wfreq2vocab -top 20000 > a.vocab Generate a binary id 3-gram of the training text, based on this vocabulary cat a.text | text2idngram -vocab a.vocab > a.idngram Convert the idngram into a binary format language model idngram2lm -idngram a.idngram -vocab a.vocab -binary a.binlm Compute the perplexity of the language model, with respect to some test text b.text evallm -binary a.binlm Reading in language model from file a.binlm Done. evallm : perplexity -text b.text Computing perplexity of the language model with respect to the text b.text Perplexity = 128.15, Entropy = 7.00 bits Computation based on 8842804 words. Number of 3-grams hit = 6806674 (76.97%) Number of 2-grams hit = 1766798 (19.98%) Number of 1-grams hit = 269332 (3.05%) 1218322 OOVs (12.11%) and 576763 context cues were removed from the calculation. evallm : quit
Perplexity Test Results CMU Perplexity test with test1.txt with language model ruralgt.lm ( rural corpora with Good-Turing Discounting)
Perplexity Results Rural Corpus Test Discounting Good-Turing Absolute Absolute Witten-Bell Kneser-Ney SRILM CMU-LM test 1 383.84 978.88 411.84 1011.08 390.46 1156.11 391.48 1123.38 test 2 356.32 666.3 352.89 680.38 344.05 796.11 343.51 784.25 test 3 895.6 1656.68 910.79 1699.01 834.36 2039.16 726.24 1835.67 Science Corpus 471.83 811.02 475.37 821.47 451.8 976.16 483.13 933.58 83.82 87.3 68.14 87.14 72.08 78.07 48.77 108.75 766.38 1501.31 836.04 1512.02 796.51 1885.12 733.17 1743.37
Results Perplexity tests based on the language model from the Rural News Corpora
Results Perplexity tests based on the language model from the Science News Corpora
Conclusion It can clearly be seen the SRILM toolkit performed the better that the CMU toolkit in terms of the perplexity tests. We can also notice that for specific test corpora certain discounting strategies were better suited that the others. For instance the Kneser-Ney discounting seemed to perform relatively better than the others on most of the tests. The CMU language models perplexity results seemed to be a little on the higher side a fix could be to adjust the discounting ranges. Results could have been improved by using some sort of word tokenization.
References SRI International, “The SRI Language Modeling Toolkit”, http://www.speech.sri.com/projects/srilm/ Cygwin Information and Installation, “Installing and Updating Cygwin”, http://www.cygwin.com/ National Language Toolkit, http://nltk.sourceforge.net/index.php/Corpora Carnegie Mellon University, CMU Statistical language Model Toolkit, http://www.speech.cs.cmu.edu