Juicer: A weighted finite-state transducer speech decoder
D. Moore¹, J. Dines¹, M. Magimai Doss¹, J. Vepa¹, O. Cheng¹ and T. Hain²
¹ IDIAP Research Institute
² Department of Computer Science, University of Sheffield
Overview
- The speech decoding problem
- Why develop another decoder?
- WFST theory and practice
- What is Juicer?
- Benchmarking experiments
- The future of Juicer
The speech decoding problem
- Given a recording and models of speech & language, generate a text transcription of what was said
- [Figure: recording and models fed into the decoder, which outputs the transcription "She had your dark suit…"]
The speech decoding problem
- Or… [two figure-only slides with further example decoder outputs]
The speech decoding problem
ASR system building blocks:
- Grammar: N-gram language model
- Lexical knowledge: pronunciation dictionary
- Phonetic knowledge: context dependency, phonological rules
- Acoustic knowledge: state distributions
Naive combination of these knowledge sources leads to a large, inefficient representation of the search space
The speech decoding problem
- The main issue in decoding is carrying out an efficient search of the space defined by the knowledge sources
- Two ways we can do this:
  - Avoid performing redundant search
  - Don't pursue unpromising hypotheses
- An additional issue: flexibility of the decoder
Why develop another decoder?
- Need for a state-of-the-art speech decoder that is also suitable for on-going research
- At present, such software is not freely available to the research community
- Open-source development and distribution framework
WFST theory and practice
- A WFST maps sequences of input symbols to sequences of output symbols
- Each transition carries an input/output label pair and an associated weight
- In the example: input sequence I = {a b c d} maps to output sequence O = {X Y Z W}, with the path weight a function of all transition weights associated with that path, f(0.1, 0.2, 0.5, 0.1)
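As a concrete illustration of the example above, here is a minimal sketch in Python. It assumes the tropical semiring, in which the path weight f(0.1, 0.2, 0.5, 0.1) is simply the sum of the transition weights; the slide itself does not specify f, so the weight accumulation is an assumption.

```python
from typing import Dict, List, Tuple

# Minimal WFST sketch: transitions carry (input label, output label,
# weight, next state).  Assumes the tropical semiring, i.e. weights
# along a path are accumulated by addition.
transitions: Dict[int, List[Tuple[str, str, float, int]]] = {
    0: [("a", "X", 0.1, 1)],
    1: [("b", "Y", 0.2, 2)],
    2: [("c", "Z", 0.5, 3)],
    3: [("d", "W", 0.1, 4)],
}

def transduce(inputs: List[str], start: int = 0):
    """Follow the (deterministic) transitions for the input sequence,
    returning the output sequence and the accumulated path weight."""
    state, outputs, weight = start, [], 0.0
    for sym in inputs:
        arcs = [t for t in transitions.get(state, []) if t[0] == sym]
        if not arcs:
            raise ValueError(f"no transition for {sym!r} from state {state}")
        _, out, w, nxt = arcs[0]
        outputs.append(out)
        weight += w          # tropical semiring: "multiply" == add
        state = nxt
    return outputs, weight

print(transduce(["a", "b", "c", "d"]))
# (['X', 'Y', 'Z', 'W'], 0.9)  up to floating-point rounding
```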
WFST theory and practice
WFST operations:
- Composition: combination of transducers
- Determinisation: only one transition per input label leaving any given state
- Minimisation: least number of states and transitions
- Weight pushing to aid in minimisation
WFST theory and practice
Composition
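The composition example on this slide is a figure and is not reproduced here. As a rough sketch of the idea, the toy Python below pairs up output labels of one transducer with matching input labels of another (ignoring epsilon handling, which real FSM toolkits such as the AT&T and MIT tools must deal with); the transducers, labels and weights are invented for illustration.

```python
from itertools import product

# Each transducer is a list of arcs (src, input, output, weight, dst).
# Assumes the tropical semiring (weights add) and no epsilon labels.
A = [(0, "a", "x", 0.5, 1), (1, "b", "y", 1.0, 2)]
B = [(0, "x", "P", 0.25, 1), (1, "y", "Q", 0.75, 2)]

def compose(A, B):
    """Naive composition A∘B: an arc exists wherever an output label
    of A matches an input label of B; the new states are state pairs."""
    arcs = []
    for (sa, i, m1, wa, da), (sb, m2, o, wb, db) in product(A, B):
        if m1 == m2:
            arcs.append(((sa, sb), i, o, wa + wb, (da, db)))
    return arcs

for arc in compose(A, B):
    print(arc)
# ((0, 0), 'a', 'P', 0.75, (1, 1))
# ((1, 1), 'b', 'Q', 1.75, (2, 2))
```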
WFST theory and practice
Determinisation
WFST theory and practice
Weight pushing & minimisation
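As a brief reminder of what weight pushing does (this is the standard definition from the WFST literature, not taken from the slides): with d(q) the shortest distance from state q to the final states in the tropical semiring, each arc e with source p(e) and destination n(e) is reweighted as

```latex
w'(e) \;=\; w(e) \;+\; d\bigl(n(e)\bigr) \;-\; d\bigl(p(e)\bigr)
```

This leaves total path weights unchanged (up to a constant absorbed at the initial state) while moving weight towards the start of the transducer, which makes equivalent states easier to merge during minimisation.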
WFST theory and practice
WFST and speech decoding:
- ASR system building blocks: grammar, lexical knowledge, phonetic knowledge, acoustic knowledge
- Each of these knowledge sources has a WFST representation
WFST theory and practice
WFST and speech decoding: requires some special considerations
- The composition of lexicon and grammar, L ∘ G, cannot in general be determinised directly
- Nor can the context dependency transducer
- (G, L and C denote the WFSTs for the grammar, lexicon and context dependency respectively)
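The slides do not give the construction formula explicitly; for reference, the recipe commonly used in the WFST literature (e.g. Mohri, Pereira & Riley), in which auxiliary disambiguation symbols added to L make the determinisations well-defined, is roughly

```latex
N \;=\; \min\bigl(\det\bigl(C \circ \det(L \circ G)\bigr)\bigr)
```

where N is the optimised recognition network whose arc counts appear in the benchmarking tables later (the L ∘ G and C ∘ L ∘ G columns).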
WFST theory and practice
WFST and speech decoding
Pros:
- Flexibility
- Simple decoder architecture
- Optimised search space
Cons:
- Transducer size
- Knowledge sources are fixed during composition
- WFST-only knowledge sources
What is Juicer?
- A time-synchronous Viterbi decoder
- Tools for WFST construction
- An interface between 3rd party FSM tools
What is Juicer?
Decoder:
- Pruning: beam search, histogram
- 1-best output: word and model timing information
- Lattice generation: phone-level lattice output
- State-to-phone transducer is not optimised: incorporated at run time
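To make the decoder description concrete, here is a minimal sketch of time-synchronous Viterbi decoding with beam pruning over a transducer. The network format, scoring function and beam value are illustrative assumptions, not Juicer's actual internals, and histogram pruning is not shown.

```python
# Minimal sketch of time-synchronous Viterbi beam search (token passing)
# over a hypothetical decoding network.

def decode(arcs, score, num_frames, start=0, beam=100.0):
    """arcs[state] -> list of (next_state, model_id, word_or_None);
    score(model_id, frame) -> acoustic log-likelihood.
    Returns (log_prob, word_sequence) of the best surviving hypothesis."""
    tokens = {start: (0.0, [])}               # state -> (log_prob, words)
    for t in range(num_frames):
        new_tokens = {}
        for state, (logp, words) in tokens.items():
            for nxt, model, word in arcs.get(state, []):
                p = logp + score(model, t)
                w = words + [word] if word is not None else words
                if nxt not in new_tokens or p > new_tokens[nxt][0]:
                    new_tokens[nxt] = (p, w)  # Viterbi: keep best per state
        if not new_tokens:
            break
        best = max(p for p, _ in new_tokens.values())
        # Beam pruning: drop hypotheses far below the current best score
        tokens = {s: pw for s, pw in new_tokens.items() if pw[0] > best - beam}
    return max(tokens.values(), key=lambda pw: pw[0])

# Toy usage: a two-frame network that emits the single word "hi"
toy_arcs = {0: [(1, "m1", None)], 1: [(2, "m2", "hi")]}
print(decode(toy_arcs, lambda model, t: -1.0, num_frames=2))  # (-2.0, ['hi'])
```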
What is Juicer?
WFST tools:
- gramgen: word-loop, word-pair, N-gram language models
- lexgen: multiple pronunciations
- cdgen: monophone, word-internal n-phone, cross-word triphone; HTK CDHMM and hybrid HMM/ANN model support
- build-wfst: composition, determinisation and minimisation using 3rd party tools (AT&T, MIT)
Benchmarking experiments
Experiments were conducted in order to:
- Compare with existing state-of-the-art decoders
- Assess the current capabilities and limitations of the decoder
- Guide future development and research directions
Benchmarking experiments
20k Wall Street Journal task:
- Equivalent performance at wide beam settings
- HDecode wins out at narrow beam-widths
- Only part of the story…
Benchmarking experiments
…but what's the catch? Composition of large static networks:
- is practically infeasible due to memory limitations
- is slow
- and may not always be necessary

Word error rates (%):

| System     | TOT  | Sub  | Del  | Ins |
|------------|------|------|------|-----|
| P1.HDecode | 41.1 | 21.1 | 14.7 | 5.3 |
| P1.Juicer  | 43.5 | 23.0 | 13.7 | 7.8 |
| P2.HDecode | 33.1 | 15.9 | 13.4 | 3.9 |
| P2.Juicer  | 34.5 | 16.9 | 13.6 | 4.0 |

Network construction:

| Language model | # arcs G   | # arcs L | # arcs C  | FSM tool   | # arcs L ∘ G | # arcs C ∘ L ∘ G | Time required |
|----------------|------------|----------|-----------|------------|--------------|------------------|---------------|
| Pruned-07      | 4,145,199  | 127,048  | 1,065,766 | AT&T + MIT | 7,008,333    | 14,945,731       | 30 mins       |
| Pruned-08      | 13,692,081 |          |           | MIT        | 23,160,795   | 50,654,758       | 1:44          |
| Pruned-09      | 35,895,383 |          |           |            | 59,626,339   | 120,060,629      | 5:38          |
| Unpruned       | 98,288,579 |          |           |            | DNF          |                  | 10:33+        |
Benchmarking experiments
AMI Meeting Room Recogniser:
- Decoding for the NIST Rich Transcription evaluations
- Juicer uses pruned LMs
- Good trade-off between RTF and WER
- [Figure: chosen operating point]
The future of Juicer
Further benchmarking:
- Testing against HDecode
- Trade-off between pruned LMs and performance
Added capabilities:
- 'On the fly' network expansion
- Word lattice generation
- Support for MLLR transforms, feature transforms
Distribution and support:
- Currently only available to AMI and IM2 partners
Summary
I have presented today…
- WFST theory and practice
- The Juicer tools and decoder
- Preliminary experiments
*** but more importantly ***
- We hope to have generated interest in Juicer
Questions?