A New Framework for Language Model Training
David Huggins-Daines
January 19, 2006

Overview
  - Current tools
  - Requirements for new framework
  - User Interface
  - Examples
  - Design and API

Current status of LM training
  - The CMU SLM toolkit
      - Efficient implementation of basic algorithms
      - Doesn't handle all tasks of building an LM
          - Text normalization
          - Vocabulary selection
          - Interpolation/adaptation
      - Requires an expert to "put the pieces together"
  - Lots of scripts
      - SimpleLM, Communicator, CALO, etc.
  - Other LM toolkits
      - SRILM, Lemur, others?

Requirements
  LM training should be:
  - Repeatable
      - An "end-to-end" rebuild should produce the same result
  - Configurable
      - It should be easy to change parameters and rebuild the entire model to see their effect
  - Flexible
      - Should support many types of source texts and methods of training
  - Extensible
      - Modular structure to allow new methods and data sources to be easily implemented

Tasks of building an LM
  - Normalize source texts
      - They come in many different formats!
      - LM toolkit expects a stream of words
      - What is a "word"?
          - Compound words, acronyms
          - Non-lexemes (filler words, pauses, disfluencies)
      - What is a "sentence"?
          - Segmentation of input data
      - Annotate source texts with class tags
  - Select a vocabulary
      - Determine optimal vocabulary size
      - Collect words from training texts
      - Define vocabulary classes
      - Vocabulary closure
      - Build a dictionary (pronunciation modeling)
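
The normalization and vocabulary-selection tasks above can be sketched in a few lines. This is a minimal illustration in Python, not the toolkit's Perl code: the filler list, the function names, and the `min_count` cutoff are all assumptions made for the example. It drops non-lexemes, counts words, excludes words seen fewer than `min_count` times, and reports the resulting OOV rate.

```python
from collections import Counter

# Hypothetical non-lexeme list; a real filter would be corpus-specific.
FILLERS = {"uh", "um"}

def normalize(line):
    """Lowercase, split on whitespace, and drop filler words."""
    return [w for w in line.lower().split() if w not in FILLERS]

def select_vocabulary(sentences, min_count=2, max_size=None):
    """Keep words seen at least min_count times, most frequent first."""
    counts = Counter(w for s in sentences for w in s)
    kept = [w for w, c in counts.most_common() if c >= min_count]
    if max_size is not None:
        kept = kept[:max_size]
    return set(kept)

def oov_rate(sentences, vocab):
    """Fraction of tokens that fall outside the vocabulary."""
    tokens = [w for s in sentences for w in s]
    if not tokens:
        return 0.0
    return sum(w not in vocab for w in tokens) / len(tokens)

corpus = ["uh the cat sat", "the cat ran", "um a dog sat"]
sentences = [normalize(line) for line in corpus]
vocab = select_vocabulary(sentences)  # min_count=2 excludes singletons
print(sorted(vocab), oov_rate(sentences, vocab))
```

Raising `min_count` or lowering `max_size` shrinks the vocabulary and raises the OOV rate, which is the trade-off behind "determine optimal vocabulary size".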

Tasks, continued
  - Estimate N-Gram model(s)
      - Choose the appropriate smoothing parameters
      - Find the appropriate divisions of the training set
  - Interpolate N-Gram models
      - Use a held-out set representative of the test set
      - Find weights for the different models which maximize likelihood (minimize perplexity) on this domain
  - Evaluate language model
      - Jointly minimize perplexity and OOV rate (they tend to move in opposite directions)
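
One common way to find interpolation weights that maximize held-out likelihood is expectation-maximization (EM). The slides do not say which method the framework uses, so the sketch below is an assumption, written in Python for illustration. The function names and the toy probabilities are hypothetical. Each "stream" holds the probability one model assigns to each held-out token, and all probabilities are assumed to be strictly positive.

```python
import math

def em_weights(streams, iterations=50):
    """Estimate mixture weights by EM.

    streams[j][i] is the probability model j assigns to held-out token i.
    """
    k = len(streams)
    n = len(streams[0])
    weights = [1.0 / k] * k  # start from a uniform mixture
    for _ in range(iterations):
        totals = [0.0] * k
        for i in range(n):
            mix = sum(weights[j] * streams[j][i] for j in range(k))
            for j in range(k):
                # Posterior share of token i attributed to model j
                totals[j] += weights[j] * streams[j][i] / mix
        weights = [t / n for t in totals]
    return weights

def perplexity(streams, weights):
    """Perplexity of the weighted mixture on the held-out tokens."""
    n = len(streams[0])
    log_sum = 0.0
    for i in range(n):
        log_sum += math.log(sum(w * s[i] for w, s in zip(weights, streams)))
    return math.exp(-log_sum / n)

# Toy per-token probabilities from two hypothetical models
model_a = [0.20, 0.10, 0.30, 0.05]
model_b = [0.05, 0.20, 0.10, 0.40]
streams = [model_a, model_b]

weights = em_weights(streams)
print(weights, perplexity(streams, weights))
```

Each EM iteration cannot decrease the held-out likelihood, so the final perplexity is no worse than that of the uniform starting mixture.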
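
A Simple Switchboard Example
  [The XML configuration shown on this slide is not preserved in this text. Its annotations, in order:]
  - Top-level tag: there must be only one
  - A set of transcripts
  - The input filter to use
  - A list of files
  - Exclude singletons
  - Backreference to a named object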

A More Complicated Example
  (Interpolation of ICSI and Switchboard)
  [The XML configuration shown on this slide is not preserved in this text. Surviving element contents: swb.test.lsn, icsi.test.mrt, BRAZIL, cmu.test.trs. Its annotations, in order:]
  - Vocabularies can be nested (merged)
  - Files can be listed directly in element contents
  - Words can be listed directly in element contents
  - Held-out set for interpolation
  - Interpolate previously named LMs

Command-line Interface
  - lm_train
      - "Runs" an XML configuration file
  - build_vocab
      - Build vocabularies, normalize transcripts
  - ngram_train
      - Train individual N-Gram models
  - ngram_test
      - Evaluate N-Gram models
  - ngram_interpolate
      - Interpolate and combine N-Gram models
  - ngram_pronounce
      - Build a pronunciation lexicon from a language model or vocabulary

Programming Interface
  - NGramFactory
      - Builds an NGramModel from an XML specification (as seen previously)
  - NGramModel
      - Trains a single N-Gram LM from some transcripts
  - Vocabulary
      - Builds a vocabulary from transcripts or other vocabularies
  - InputFilter
      - Subclassed into InputFilter::CMU, InputFilter::ICSI, InputFilter::HUB5, InputFilter::ISL, etc.
      - Reads transcripts in some format and outputs a word stream

Design in Plain English
  - NGramFactory builds an NGramModel
  - NGramModel has a Vocabulary
  - NGramModel and Vocabulary can have Transcripts
  - NGramModel and Vocabulary use an InputFilter (or maybe they don't)
  - NGramModel can merge two other NGramModels using a set of Transcripts
  - Vocabulary can merge another Vocabulary
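
The object relationships above can be mirrored in a small sketch. This is illustrative Python, not the toolkit's Perl API: the class names are borrowed from the slides, but every method name, the dictionary-based spec (standing in for the XML), and the add-one-smoothed unigram model (standing in for a real N-Gram model) are assumptions.

```python
from collections import Counter

class InputFilter:
    """Turns one transcript line into a list of words."""
    def tokenize(self, line):
        return line.split()

class Vocabulary:
    """Collects word counts from transcripts or from other vocabularies."""
    def __init__(self, input_filter=None):
        self.input_filter = input_filter or InputFilter()
        self.counts = Counter()

    def add_transcript(self, lines):
        for line in lines:
            self.counts.update(self.input_filter.tokenize(line))

    def merge(self, other):
        self.counts.update(other.counts)

    def words(self, min_count=1):
        return {w for w, c in self.counts.items() if c >= min_count}

class NGramModel:
    """Unigram stand-in with add-one smoothing; it has a Vocabulary."""
    def __init__(self, vocabulary, input_filter=None):
        self.vocabulary = vocabulary
        self.input_filter = input_filter or vocabulary.input_filter
        self.counts = Counter()

    def train(self, lines):
        known = self.vocabulary.words()
        for line in lines:
            for w in self.input_filter.tokenize(line):
                self.counts[w if w in known else "<UNK>"] += 1

    def prob(self, word):
        known = self.vocabulary.words()
        key = word if word in known else "<UNK>"
        total = sum(self.counts.values())
        return (self.counts[key] + 1) / (total + len(known) + 1)

class NGramFactory:
    """Builds a model from a spec (a dict here, XML in the slides)."""
    @staticmethod
    def build(spec):
        filt = spec.get("filter") or InputFilter()
        vocab = Vocabulary(filt)
        vocab.add_transcript(spec["transcripts"])
        model = NGramModel(vocab, filt)
        model.train(spec["transcripts"])
        return model

swb = Vocabulary()
swb.add_transcript(["hello there", "hello again"])
icsi = Vocabulary()
icsi.add_transcript(["meeting again"])
swb.merge(icsi)  # Vocabulary can merge another Vocabulary

model = NGramFactory.build({"transcripts": ["the cat sat", "the dog sat"]})
print(sorted(swb.words()), model.prob("the"))
```

Keeping the filter as a separate object is what lets one model or vocabulary read several transcript formats without changing its own code.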

A very simple InputFilter
  (InputFilter/Simple.pm)

    use strict;                      # please !!!
    package InputFilter::Simple;
    require InputFilter;
    use base 'InputFilter';          # Subclass of InputFilter

    sub process_transcript {
        my ($self, $file) = @_;
        local ($_, *FILE);           # (This is just good practice)
        # Read the input file
        open FILE, "<$file" or die "Failed to open $file: $!";
        while (<FILE>) {
            chomp;
            my @words = split;       # Tokenize, normalize, etc.
            # Pass each sentence (here, @words) to the InputFilter output
            # method; the call itself is not preserved in this text.
        }
        close FILE;
    }

    1;

Where to get it
  - Currently in CVS on fife.speech
      - :ext:fife.speech.cs.cmu.edu:/home/CVS
      - module LMTraining
  - Future: CPAN and cmusphinx.org
  - Possibly integrated with the CMU SLM toolkit in the future

Stuff TODO
  - Class LM support
      - Communicator-style class tags are recognized and supported
      - NGramModel will build .lmctl and .probdef files
      - However, this requires normalizing the files to a transcript first, then running the semi-automatic Communicator tagger
      - Automatic tagging would be nice…
  - Support for languages other than English
      - Text normalization conventions
      - Word segmentation (for Asian languages)
      - Character set support (case conversions, etc.)
      - Unicode (also a CMU-SLM problem)
Questions?