Arthur Kunkle ECE 5525 Fall 2008
Introduction and Motivation
A Large Vocabulary Speech Recognition (LVSR) system converts speech data into textual transcriptions. This system will serve as a test bed for developing new speech recognition technologies. This design presentation assumes basic knowledge of the tasks an LVSR system must accomplish, as well as some in-depth knowledge of the HTK framework.
System Technologies
- HMM Toolkit (HTK)
- Cygwin UNIX Emulation Environment
- Practical Extraction and Report Language (Perl)
- Subversion Configuration Management Tool
System Requirements
The LVSR shall…
1. Be capable of incorporating prepared data that conforms to a standard HTK interface (defined in “System Design”).
2. Automatically generate language and acoustic models from all available conforming input data.
3. Be configurable to use multiple processors and/or remote computers to share the workload for model re-estimation and testing.
4. Have a scheduling mechanism to run different configuration profiles and create a separate results directory for each, containing the acoustic and language models.
5. Record all HTK tool output for a “run” in time-stamped log files.
6. Merge language models together and determine the optimum weighting for the models by measuring model perplexity.
7. Notify a list of users with information regarding run errors and completion status.
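Requirements 4 and 5 can be illustrated with a minimal sketch of how a run might name its results directory and log files. The layout (`<profile>_<timestamp>/logs/<tool>.log`) and the function name `run_log_path` are assumptions made for illustration; they are not part of the design above.

```python
from datetime import datetime
from pathlib import Path


def run_log_path(results_root, profile_name, tool_name, started_at):
    """Return the log file path for one HTK tool within one run.

    Hypothetical layout: one directory per (profile, start time) pair,
    with one log file per tool under a "logs" sub-directory.
    """
    stamp = started_at.strftime("%Y%m%d_%H%M%S")
    run_dir = Path(results_root) / f"{profile_name}_{stamp}"
    return run_dir / "logs" / f"{tool_name}.log"


if __name__ == "__main__":
    path = run_log_path("results", "Basic", "HERest", datetime.now())
    print(path.as_posix())
```

Keeping the timestamp in the directory name rather than in each file name means all output from one run stays together, which may simplify the reporting step.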
System Design
The following directory structure will capture each stage of the workflow on the left:
Data Preparation Phase 1
HTK needs the following items, which are custom to each corpus:
- Dictionary – (OPTIONAL) The list of all words found in both the testing and training files of the corpus, with their phonetic pronunciations. Should be “ _dict.txt”.
- Word List – The list of all unique words found in the transcriptions. “ _word_list.txt”
- Training Data List – The list of all MFCC data files contributed by the source, using their absolute locations on disk. Rename all utterance files to be “corpus_name>_ _.mfcc”.
- “Plain” MLFs – These include only the words of each utterance. Always create this, regardless of whether timing information is available.
- “Timed” MLFs – (OPTIONAL) These include the time boundaries of the words/phones that appear. The times must be converted to HTK timing as well (HTK uses time units of 100 ns).
- Audio Data – Convert WAV/NIST/SPHERE format into MFCC using common parameters. Make sure that HTK's maximum length is observed, splitting as necessary.
A custom Perl script is used to handle each source:
    # Corpus location on disk
    Location: F:/CORPORA/TIMIT
    # Sound-splitting threshold (in HTK units)
    UtteranceSplit: 300
    # Coding parameter config reference
    CodingConfigFile: standard_mfcc_cfg.txt
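The conversion to HTK timing mentioned for “Timed” MLFs can be sketched as follows. This is a minimal illustration, assuming the source boundaries are given either in seconds or as sample indices (TIMIT, for example, marks boundaries in samples at 16 kHz). The function names are hypothetical, and the actual per-source scripts are written in Perl.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # one HTK time unit is 100 ns


def seconds_to_htk(seconds):
    """Convert a time in seconds to HTK 100 ns units."""
    return int(round(seconds * HTK_UNITS_PER_SECOND))


def samples_to_htk(sample_index, sample_rate_hz):
    """Convert a sample index to HTK 100 ns units (integer arithmetic)."""
    return sample_index * HTK_UNITS_PER_SECOND // sample_rate_hz


def format_label_line(start_s, end_s, label):
    """Format one timed label entry as 'start end label' in HTK units."""
    return f"{seconds_to_htk(start_s)} {seconds_to_htk(end_s)} {label}"


if __name__ == "__main__":
    print(format_label_line(0.0, 1.25, "hello"))
    print(samples_to_htk(16000, 16000))
```

At 16 kHz one sample corresponds to 625 HTK units, so the integer division in `samples_to_htk` is exact for that rate.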
Data Preparation Phase 2
Data from the individual sources must be merged together. Common data such as dictionaries should be added here.
- Dictionary – The list of all words found in all files contributed to the corpus, with their phonetic pronunciations.
- Indexed Data Files – All the files from the individual sources will be merged into a common area, and their filenames will be transformed to a common naming scheme.
- Word List
- Training Data List
- Testing Data List
- “Plain” MLFs – These include only the words of each utterance. Always create this, regardless of whether timing information is available.
- “Timed” MLFs – (OPTIONAL) These include the time boundaries of the words/phones that appear. The times must be converted to HTK timing as well (HTK uses time units of 100 ns).
- Transcription Files – Transcription files formatted for direct use by the Language Modeling process.
- Grammar File – By default, this step will generate an “open” grammar from the final word list, in which any word can legally follow any other word. This grammar is used to test the acoustic models only.
    # Phone-set information
    PhoneSet: TIMIT
    # Coding parameter config reference
    CodingConfigFile: standard_mfcc_cfg.txt
    # Parameters to determine percentage of input data that is TRAIN/TEST
    # must add to 100
    TrainDataPercent: 80
    TestDataPercent: 20
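One way the TrainDataPercent and TestDataPercent settings could be applied is a seeded random split of the merged utterance list. This is only a sketch under that assumption: the function name, the fixed seed, and the choice of random selection are illustrative, and how the test subset should be chosen is still an open design question.

```python
import random


def split_train_test(utterances, train_percent, test_percent, seed=0):
    """Split utterance names into (train, test) lists by percentage.

    The input is sorted before shuffling so that the same seed always
    yields the same split, regardless of the original file order.
    """
    if train_percent + test_percent != 100:
        raise ValueError("TrainDataPercent and TestDataPercent must add to 100")
    shuffled = sorted(utterances)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    names = [f"utt_{i:03d}.mfcc" for i in range(10)]
    train, test = split_train_test(names, 80, 20)
    print(len(train), len(test))
```

A fixed seed keeps the split reproducible across runs, so results from different configuration profiles are measured on the same test utterances.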
Acoustic Model Generation
The Acoustic Model generation phase will generate multiple versions of HMM definition files that model the input utterances at the phone and tri-phone levels.
1. Create the prototype HMM.
2. Create the first HMM models for all phones.
3. Tie the states for the silence model.
4. Re-align the models to use all word pronunciations.
5. Create tri-phone HMM models.
6. Use decision-tree-based clustering to tie tri-phone model parameters.
7. Split the Gaussian mixtures used for each state.
    #Acoustic Training Configuration Profiles
    ProfileName: Basic
    #settings for pruning and floor values
    VarianceFloor: 0.01
    PruningThresholds:
    RealignPruneThreshold:
    #Which corpus contains bootstrap data for iteration 1
    BootstrapCorpus: TIMIT
    #how many calls to HERest to do in between major AM steps
    ReestimationCount: 2
    #file for Tree based clustering logic
    TreeEditFile: basic_tree.hed
    #determine target mixtures to apply at end of training
    GuassianMixtures: 8
    MixtureStepSize: 2
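The final mixture-splitting step can be sketched as a schedule of intermediate mixture counts derived from the target mixture count and MixtureStepSize. The interpretation here is an assumption: the count is raised to successive multiples of the step size until the target is reached, with re-estimation between steps. The `MU` line follows the HHEd mixture-increment command form; the state range `2-4` assumes three emitting states per model, which the profile does not specify.

```python
def mixture_schedule(target, step):
    """Return the mixture counts to step through, ending at the target.

    Assumed interpretation: multiples of `step` below the target,
    followed by the target itself. A count of 1 is skipped because
    models start with a single Gaussian per state.
    """
    if target < 1 or step < 1:
        raise ValueError("target and step must be positive")
    counts = [n for n in range(step, target, step) if n > 1]
    if target > 1:
        counts.append(target)
    return counts


def hhed_mixup_command(n_mix, states="2-4"):
    """Build an HHEd edit-script line that raises the mixture count."""
    return f"MU {n_mix} {{*.state[{states}].mix}}"


if __name__ == "__main__":
    for n in mixture_schedule(8, 2):
        print(hhed_mixup_command(n))
```

With the “Basic” profile values (8 and 2) this yields four splitting stages; each stage would be followed by ReestimationCount passes of HERest.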
Language Model Generation
This phase of development will create an n-gram language model that predicts a symbol in a sequence given its n-1 predecessors.
1. The training text is scanned, and n-grams are counted and stored in grammar files.
2. Words outside the vocabulary are mapped to an “Out-of-Vocabulary” class. Other class mappings are applied for class-based language models.
3. The counts in the resulting grammar files are used to compute n-gram probabilities, which are stored in the language model files.
4. The goodness of the language model is measured by calculating its perplexity against testing text from the corpus.
    #these settings dictate the Language Model generation process for all sources
    MaxNewWords:
    NGramBufferSize:
    #will generate up to N gram models
    NToGenerate: 4
    FoFLevels: 32
    #must include N-1 cutoff values
    Cutoffs: 1, 2, 3
    #how much this LM should contrib to the overall model
    OverallContribution: 0.5
    #class-model configuration items
    ClassAmount: 150
    ClusterIterations: 1
    ClassContribution: 0.7
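The perplexity measurement in step 4, and its use for choosing a weighting between two models, can be illustrated with a small sketch. It assumes each model has already assigned a probability to every word of the test text; the function names and the simple search over a list of candidate weights are illustrative, not the method prescribed by this design.

```python
import math


def perplexity(word_probs):
    """Perplexity of a test text given the probability of each word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


def interpolate(probs_a, probs_b, weight_a):
    """Linearly interpolate two models' per-word probabilities."""
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(probs_a, probs_b)]


def best_weight(probs_a, probs_b, candidates):
    """Pick the candidate weight for model A that minimizes perplexity."""
    return min(candidates,
               key=lambda w: perplexity(interpolate(probs_a, probs_b, w)))


if __name__ == "__main__":
    model_a = [0.9, 0.1]  # made-up per-word probabilities
    model_b = [0.1, 0.9]
    print(best_weight(model_a, model_b, [0.0, 0.25, 0.5, 0.75, 1.0]))
```

A model that assigns a uniform probability of 1/4 to every test word has a perplexity of 4, which is a quick sanity check on the formula.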
Model Testing
The final phase of the system will test the acoustic and language models generated up to this point. The results will be cataloged according to the timestamp and the profile name.
1. Recognition using acoustic models only and an “open” grammar (i.e., no LM applied).
2. Recognition using both AM and LM.
    # standard HMM/LM testing parameters
    WordInsertionPenalty: 0.0
    GrammarScaleFactor: 5.0
    HMMNumbersToTest: 19
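The roles of WordInsertionPenalty and GrammarScaleFactor can be sketched as a combined hypothesis score. This assumes the two settings are passed to the recognizer in the usual way, where language model log probabilities are scaled by the grammar scale factor and a fixed penalty is added per word. The function below is illustrative and is not taken from the testing sub-system.

```python
def combined_log_score(acoustic_logprob, lm_logprob, n_words,
                       grammar_scale, word_insertion_penalty):
    """Combine acoustic and language model log scores for one hypothesis.

    The LM log probability is scaled, and the per-word penalty is added
    once for every word in the hypothesis.
    """
    return (acoustic_logprob
            + grammar_scale * lm_logprob
            + n_words * word_insertion_penalty)


if __name__ == "__main__":
    # Made-up scores for a three-word hypothesis.
    print(combined_log_score(-100.0, -4.0, 3, 5.0, 0.0))
```

In the acoustic-only test with an “open” grammar, the language model term contributes nothing to the ranking of hypotheses, so only the acoustic score and the insertion penalty matter.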
Milestones
The following actions are given in order, with a time estimate for each:
1. TIMIT Data Prep : 6 hours
2. AMI Data Prep : 10 hours
3. Phase 2 Data Prep Sub-System : 20 hours
4. Acoustic Model Sub-System : 20 hours
5. Model Testing Sub-System : 12 hours
6. Language Model Sub-System : 15 hours
7. RTE ‘06 Data Prep : 14 hours
8. Scheduling / Reporting : 14 hours
9. Extra Features / Refactoring : 16 hours
10. Profile Authoring : 4 hours
Total Effort Estimate: 131 hours
Open Issues/Questions
- Can Acoustic and Language Model generation be run in parallel after a common data preparation workflow?
- Right now, all data input into the LVSR is tagged as training data. What is the best way to choose a subset of data for testing only?
  - Have a percentage configuration value and pick random utterances?
  - Have a configurable list of specific utterances set aside?
  - If a source (corpus) specifies a testing set, should we use this by default?
- Which workflow makes more sense for multiple-source LM generation?
  - Generate a source-specific word-level LM, generate a source-specific class-level LM, and interpolate them together. Then combine with the other source-specific LMs.
  - Use all training text to create a single word-level LM, generate a class-level LM, then combine them into the final LM.
- The proposed architecture is static, requiring the process to be restarted when new data is introduced. What requirements exist for dynamically adding new data to existing models?