Brief Overview of Different Versions of Sphinx
Arthur Chan
Introduction
  The software side of the recognizer is very important.
  Research always requires correct use of the software.
  Sphinx II + III + IV + SphinxTrain ~= 100k lines of code.
  Each of them is fairly complex.
This presentation
  Introduction
  History of Sphinx
    Sphinx I
    Sphinx II
    Sphinx III
    SphinxTrain
    Sphinx IV
  How do I get the source code?
    Versioning
    Three rules for not getting lost among the different recognizers
  Where can I get “official” information?
  Outlook for each recognizer
  Conclusion
Brief history of Sphinx
  Largely adapted from:
    Rita’s “The Sphinx Speech Recognition Systems”
    Kevin et al.’s “Speech Recognition: Past, Present and Future”
    final.html
Before Sphinx
  Dragon
    One of the first uses of HMMs in speech recognition
    One of the first uses of a “purely statistical model” in speech
    Expressed knowledge as an HMM network
  Harpy
    One of the first uses of beam search
    Used phonemes to represent words
Sphinx I
  Before Sphinx ...
    According to AT&T’s literature, the concept of speaker independence was proposed in 1979.
    At that time, most systems were either
      speaker dependent, or
      speaker independent but restricted to a very small domain (<100 words).
  Sphinx I was therefore outstanding:
    Accuracy was 90% on Resource Management.
Sphinx I (1987)
  By Kai-Fu Lee and Roberto Bisiani
  Key developers included Hsiao-Wuen Hon and Fil Alleva
  Written in C
  Continuous speech recognizer using discrete HMMs with 3 codebooks of size 256
  Used a simple word-pair grammar
  Generalized triphones
  Real-time on a Sun3 or DEC 3000
  Where is the source code? A good antique!
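To make the "discrete HMM with 3 codebooks" idea concrete, here is a minimal sketch of the vector quantization step such a system relies on: each frame is split into three feature streams, and each stream is replaced by the index of its nearest codeword, so the HMM sees three discrete symbols per frame. The tiny codebooks, the stream contents, and the function names are hypothetical illustrations and are not taken from the Sphinx I source (a real codebook would have 256 entries).

```python
def nearest_codeword(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize_frame(streams, codebooks):
    """Map one frame (one vector per stream) to one discrete symbol per codebook."""
    return tuple(nearest_codeword(v, cb) for v, cb in zip(streams, codebooks))


# Hypothetical toy codebooks: three streams, a few codewords each.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)],
    [(0.0,), (0.5,), (1.0,)],
    [(-1.0,), (1.0,)],
]

# One hypothetical frame, already split into its three feature streams.
frame = [(0.9, 1.2), (0.4,), (-0.4,)]
symbols = quantize_frame(frame, codebooks)
print(symbols)
```

The discrete HMM then only needs a table of output probabilities per symbol and per stream, which is much cheaper to evaluate than a continuous density.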
Sphinx II (1992)
  By Xuedong Huang
  Hardwired to a 5-state Bakis topology
  3-gram language models
  Decision-tree tying of HMMs (by Mei-Yuh Hwang)
  90% on the WSJ task (0 or 1?)
Fast Beam Search v. X
  FBS-6: flat-lexicon decoder
  FBS-7: lexicon-tree-based decoder
  FBS-8 decoder (written by Ravi Mosur; see his 1996 thesis)
    Supports multiple types of beam pruning
    Lexical tree
    Tricks in GMM computation
    Machine optimization: loop unrolling
    Predictive codebook computation
    Phoneme lookahead
    Best-path search
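As a reminder of what beam pruning does, the following is a minimal sketch, not the FBS code: in each frame, only hypotheses whose log score is within a fixed beam of the best score survive. Applying the same function with different widths (for example a wide beam for HMM states and a narrower one for word exits) is one way to get "multiple types" of beam; the state names, scores, and widths below are hypothetical.

```python
def beam_prune(scores, beam_width):
    """Keep entries whose log score is within `beam_width` of the best one."""
    if not scores:
        return {}
    threshold = max(scores.values()) - beam_width
    return {name: s for name, s in scores.items() if s >= threshold}


# Hypothetical log-probability scores of the active HMM states in one frame.
active_states = {"AH_1": -10.0, "AH_2": -12.5, "B_1": -40.0, "K_3": -25.0}

# A wide beam for states and a narrower one for word exits (values assumed).
survivors = beam_prune(active_states, beam_width=20.0)
word_exits = beam_prune(active_states, beam_width=5.0)

print(sorted(survivors))
print(sorted(word_exits))
```

A narrower beam speeds up the search at the cost of possibly pruning the correct path, which is the usual speed/accuracy trade-off in these decoders.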
Other facts about Sphinx II
  It was licensed at the beginning (apparently as far back as 1995).
  In 2000 it was open-sourced on SourceForge under a Berkeley-style (BSD) license:
    You can incorporate Sphinx’s source code.
    You don’t need to open your own source code (no recursive legal binding).
    Similar to the LGPL.
  In 2001, a major alpha release by Kevin ensured portability across several platforms.
Sphinx III flat lexicon decoder (“s3”, “s3flat”, “s3slow”)
  Sphinx III (by Ravi Mosur)
    Flat lexicon
    Supports both CHMM and SCHMM
    “Poor-man’s” trigram
      Uses only the most likely first word; this avoids a D^2 expansion of the word lattice
    Arbitrary HMM topology
    Very accurate; used in evaluations on BN and other tasks
  Derivatives of the search include
    N-best generator
    Aligner
    Phone recognizer
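One way to read the “poor-man’s” trigram idea is sketched below, under stated assumptions: when the search leaves a word w2 and enters a word w3, an exact trigram search would keep a separate copy of w2 for every possible predecessor w1, whereas the approximation keeps only the best-scoring predecessor and uses it as the trigram history. The data layout, function names, and toy language model are hypothetical and are not taken from the s3 source.

```python
def best_history(word_exits):
    """For each word w2, keep only its best-scoring exit and that exit's predecessor w1.

    word_exits: list of (w1, w2, log_score) tuples for paths that end in w2.
    """
    best = {}
    for w1, w2, score in word_exits:
        if w2 not in best or score > best[w2][1]:
            best[w2] = (w1, score)
    return best


def extend(word_exits, next_words, trigram_logprob):
    """Extend each word exit into `next_words`, using one history per exit word."""
    extended = []
    for w2, (w1, score) in best_history(word_exits).items():
        for w3 in next_words:
            extended.append((w2, w3, score + trigram_logprob(w1, w2, w3)))
    return extended


# Hypothetical toy language model with a flat back-off score.
toy_lm = {("the", "cat", "sat"): -1.0, ("a", "cat", "sat"): -0.5, ("the", "dog", "sat"): -2.0}


def toy_trigram(w1, w2, w3):
    return toy_lm.get((w1, w2, w3), -10.0)


# Two competing paths end in "cat"; only the better one ("the cat") is extended.
exits = [("the", "cat", -5.0), ("a", "cat", -7.0), ("the", "dog", -6.0)]
result = extend(exits, ["sat"], toy_trigram)
print(result)
```

The price of the approximation is that a path with a worse acoustic score but a better trigram probability is never considered, which is why it is only a "poor man's" trigram.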
Sphinx III tree lexicon decoder (“s3.x”, “s3fast”, “s3inaccurate”)
  What is s3.x actually?
    A “spin-off” of the Sphinx III flat lexicon decoder’s source code
    First used in the BN 10x RT evaluation in 1999
  From s3.0 -> s3.2
    Uses a tree lexicon with unigram lookahead
    Lexical tree with approximations to avoid memory problems
    One of the first decoders in the world to use sub-vector quantization to speed up GMM computation
(cont.)
  From s3.2 -> s3.3 (Rita, Ricky)
    Live-mode recognizer (livedecode) and simulator (livepretend)
  From s3.3 -> s3.4 (Evandro, Arthur C, Jahanzeb)
    4 levels of speed-up for GMM computation; phoneme lookahead
    Bug fixes in live mode
  From s3.4 -> s3.5 (Evandro, Arthur C, Yitao)
    (Tentative) Speaker adaptation + documentation
Facts about S3
  A Java version exists -> sphinx3j
  Open-sourced around 2002
  Maintained continuously by Evandro from 2001 to now
  s3.5 is the current active branch of S3 development
SphinxTrain
  Equally important and very complex, but not well understood.
  What is SphinxTrain?
    A collection of ~40 tools for Sphinx 2, 3 and 4 acoustic model training
    A set of Perl scripts to run the training
    Sphinx 2 and 3 use slightly different model formats
Mini-history
  Baum-Welch and Viterbi trainers existed a very long time ago.
  Training tools in general were neither systematic nor structured.
  Out of the chaos, Eric Thayer first pulled everything together to create the SphinxTrain package.
  Rita
    Did numerous bug fixes and modifications of the current trainer
    Pioneered the use of automatic question generation (make_quest)
    Built a set of training scripts for RM (the 0*/ scripts)
    Wrote the first systematic tutorial on training
  Ricky
    Refined the code and wrote the first set of Perl scripts for training
    He made a PHD out of it too (PHD = Push Here Dummy!)
  Alan and Kevin
    Put the code on SourceForge
    Alan built a set of training scripts that can “run through”
Sphinx IV
  Why Sphinx IV?
  Too many limitations in SphinxTrain and Sphinx III
    Only N-grams are supported
    Approximation of triphones
    Fast GMM computation can be very troublesome to understand
    bw doesn’t skip silence
    We rely heavily on forced alignment in training
Sphinx IV (cont.)
  (By no means complete ...)
  Lead design: Bhiksha (MERL)
  Lead team developer: Willie Walker (Sun)
  Key developers: Evandro, Rita, Phillip Kwok and Paul Lamere
  Many heavyweight speech advisors: Evandro, Rita, Ravi, Bhiksha, Pedro Moreno ...
Is Sphinx IV good?
  A very accurate, very fast, very versatile and very nicely packaged Java-based speech recognizer
  Internal benchmarks on RM and WSJ 5k show it to be faster and more accurate than s3.3 (under 1xRT and 10% better)
  Supports N-gram, FSM and FSG
  Will provide facilities such as confidence scoring
  Still under development (only the first alpha release so far)
  Trainer is not stable
Summary of the recognizers and trainers
  Sphinx I -> obsolete
  Sphinx II -> we are using the fast recognizer now
  Sphinx III -> the following coexist:
    S3 flat
    S3 fast (s3.4 stable, s3.5 devel)
  SphinxTrain (0.92 in CVS)
  Sphinx IV
    Recognizer is alpha-released
    Trainer is not yet stable
How can I get version X of Sphinx?
  Official web page of Sphinx
    Gives announcements and news on development
    Some documentation is there
  For the tarballs
    Releases:
      sphinx2-0.4.tgz (s2)
      sphinx3-0.1.tgz (s3.3)
      sphinx3-0.4-rc2.tgz (s3.4 release candidate II)
      sphinx4-0.1alpha-src.zip (s4)
Rule 2: If it doesn’t exist in CVS, officially it doesn’t exist
  Simply speaking, no one actually supports or maintains it.
  Software that falls into this category:
    CMU LM Toolkit (we haven’t touched it for a while)
      We may do so in the future.
    Phoenix (distributed somewhere else)
    Training scripts in csh
      Rita still actively supports them.
Rule 1: If there are no tarballs, they are in CVS
  ANYONE can get the following modules through CVS by using the following command:
    cvs -z3 -d <repository> co modulename
  where modulename is one of:
    SphinxTrain -> SphinxTrain
    archive_s3 -> s3 + s3.0 + s3.2 + s3.3
    sphinx2 -> development version of sphinx2
    sphinx3 -> approximately s3.4; s3.5 will be developed from this base
    share -> approximately cepview + lm3g2dmp
    sphinx3j -> the Java version of sphinx3
    Sphinx4 -> development version of sphinx4
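For anyone checking out several modules at once, a small helper along these lines may save typing. It is only a sketch: it builds the command lines and prints them rather than running anything, the module names are copied from this slide, and the repository location is not given here, so it is passed in as a parameter with an obviously fake placeholder value. `-z3`, `-d` and `co` are standard cvs options; everything else is an assumption.

```python
# Module names and descriptions as listed on the slide.
MODULES = {
    "SphinxTrain": "SphinxTrain",
    "archive_s3": "s3 + s3.0 + s3.2 + s3.3",
    "sphinx2": "development version of sphinx2",
    "sphinx3": "approximately s3.4 (base for s3.5)",
    "share": "cepview + lm3g2dmp",
    "sphinx3j": "Java version of sphinx3",
    "Sphinx4": "development version of sphinx4",
}


def checkout_command(module, cvsroot):
    """Build (but do not run) the cvs checkout command for one module."""
    if module not in MODULES:
        raise ValueError(f"unknown module: {module}")
    # -z3: compression level, -d: repository location, co: checkout
    return ["cvs", "-z3", "-d", cvsroot, "co", module]


# Placeholder only; substitute the real repository location.
CVSROOT_PLACEHOLDER = "CVSROOT_GOES_HERE"

for name, description in MODULES.items():
    command = checkout_command(name, CVSROOT_PLACEHOLDER)
    print(" ".join(command), "  #", description)
```

Because the function returns an argument list rather than a shell string, it could later be handed to a process runner without any quoting concerns.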
Rule 3: You may need other modules to complete your task
  SphinxTrain relies heavily on forced alignment, so you also need s3-align.
  Any s3 recognizer requires the LM in DMP format, so you need the tool lm3g2dmp, which can be found in sphinx2 or share.
Where can I get more information about the recognizers?
  People to ask
    s2: Evandro, Ravi
    S3 flat: Evandro, Ravi, ArthurC
    S3 tree: Evandro, Ravi, ArthurC
    SphinxTrain: Rita, Evandro, Ravi, ArthurC, Rong, Ziad, Murali
    S4: S4’s developers on SourceForge: Willie, Paul, Phillip, Bhiksha, Rita, Evandro
Web pages to look up
  Rita’s web page
    Contains the training manual
  Twiki web page for Sphinx 4 design
    bin/cmusphinx/twiki/view/Sphinx4/WebHome/
  ArthurC’s web page
    Risked his life to write a manual for Sphinx 3.4
    Also collects some information on each Sphinx
Outlook for all recognizers
  Sphinx II
    Sorry, we won’t support it much.
    Reason: s3.4 and s4 have been shown to have very good speed and accuracy.
  Sphinx III
    The only active branch is s3.5
    Moderate changes in s3flat
    Motivated by project CALO
    This quarter: make adaptation work
  SphinxTrain
    Write a set of scripts for continuous HMM training
    The silence deletion problem will be fixed
(cont.)
  sphinxDoc
    Chapters 1 and 2 completed (*sigh*, still 7 left)
    Only being written when Arthur C is procrastinating and doesn’t want to read or play video games
    Will be ready around Sep or Oct
  Sphinx IV
    Alpha release
    Trainer will be fixed
  Argus
    Incorporates the advantages of many speech recognizers
    Not yet started
Conclusion
  This presentation
    Summarizes the current code status of Sphinx and SphinxTrain
    We still have a lot of work to do ...
  Next presentation
    s3 or s3.4, from main to the search