Development of CMU Sphinx from 2004 to July 2006: An Observer’s Perspective. Arthur Chan, Evandro Gouvea, David Huggins-Daines, Mosur Ravishankar, Alex Rudnicky, Yitao Sun
The Main Theme for Today: What is the role of Software in Speech Recognition?
[For the off-line viewer] [This is Arthur Chan’s conclusion: Joint consideration of 3 software components is crucial.] Read on, you’ll see his argument.
Perspective Mainly Arthur Chan’s observation –Two Roles As a developer –“The Grand Janitor” As an observer of events –A historian.
What is CMU Sphinx? Definition 1 : –Large vocabulary speech recognizers with high accuracy and speed performance. Definition 2 : –A collection of tools and resources that enables developers/researchers to build successful speech recognition systems
Family of CMU Sphinx Decoders –Sphinx {II – IV} –PocketSphinx (by Dave since Oct 2005) Acoustic Model Trainer –SphinxTrain Language Model Trainer (new) Documentation –Hieroglyphs –Robust/SphinxTrain Tutorial
The Sphinx Developers Sphinx is maintained by –Volunteer programmers/researchers who like speech recognition –All contributions go to the same codebase –Goal : Sustainable development of Sphinx Sphinx Developer Meetings are held –Regularly (as in an aperiodic function) –Secretly (in the sense that everyone knows) –to decide the way to go in Sphinx
Outline (~30 pages) Software of Speech Recognition –How should we develop? What should a comprehensive software do? CMU Sphinx, Before/After Lessons Learned (Optional) Team and Structure.
Software of Speech Recognition Systems
The Old Black Box Speech Recognizer Acoustic Signal Word Sequence Legend: The Black Box
What It Means to Software Philosophy behind the old black box “When you don’t know, search.” The old Black Box: –Strongly focuses on the decoder –Tends to ignore other important components (e.g. models)
The Noisy Channel Point of View Decoder
What it means to software We need to represent and estimate parameters of the acoustic model We need to represent and estimate parameters of the language model Given the models, we need to search through all possible word sequences (i.e., decoding).
A New Black Box Speech Recognizer Acoustic Signal Word Sequence Acoustic Model Language Model AM Trainer LM Trainer
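In equation form, the noisy-channel view above is the standard Bayes decision rule. The symbols are notation chosen here, not taken from the slides: A stands for the acoustic observations and W for a candidate word sequence.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

P(A | W) is what the acoustic model supplies, P(W) is what the language model supplies, and the arg max over W is the decoder's search. P(A) does not depend on W, so it drops out.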
The New Black Box Philosophy of the New Black Box “When you don’t know, search with your knowledge.” Advantages of the New Black Box –Programmers tend to consider the problems jointly –Reduces communication issues between module owners
The New Black Box vs The Old Black Box The Old Black Box –Narrows the way we think about the problem –Motivates research solely on search algorithms The New Black Box –Doesn’t ignore the fact that search is important –But gives proper emphasis to all the necessary components
Current CMU Sphinx thinks in terms of The New Black Box
Before : CMU Sphinx (2004 Jan)
Sphinx and Friends (2004 Jan) Sphinx Siblings Acoustic Signal Word Sequence Acoustic Model Language Model SphinxTrain CMU-C LM Toolkit V2
Issues at the time Sleeping Decoders: (Sphinx Siblings) –Strength: Comprehensive product line –Issues: Decoders came in many versions, and code tended to be duplicated Sphinx 2 -> fast but not accurate Sphinx 3.0 -> very accurate but very slow Sphinx 3.3 -> accurate, faster than Sphinx 3.0 but slower than 1xRT Sphinx 4 -> not yet completed
Issues at the time (cont.) AM Trainer (SphinxTrain) –Strength: it works –Weakness: what we supported was simple Where is speaker adaptation? LM Trainer (CMU-Cambridge LM Toolkit V2) –Strength: it works –Weakness: software was sleeping -> development has stopped Important functionalities weren’t in the package: e.g. LM Interpolation
General Comments at the time: “Sphinx cannot do feature Y.” “You have no ideas what you are up to.” “No one is working on Sphinx any more.” “Our job is not difficult but very challenging” –Prof. Alex Rudnicky “Sphinx is cursed.” (I made this one up. ) “The riddle of Sphinx couldn’t be solved” –Made up by Arthur Toth in SphinxLunch
After : CMU Sphinx (now)
Sphinx and Friends (now) Sphinx Brothers Acoustic Signal Word Sequence Acoustic Model Language Model SphinxTrain Debugged CMU-C LM Toolkit V3 alpha
Sphinx Brothers now Sphinx 2 –Could now use CDHMM –Could now use FST Sphinx 3.X (gimmicky name of Sphinx 3) –Could run faster if given the magic tuning string –Merging of Sphinx 3.0 and Sphinx 3.3 –Supports speaker adaptation –Re-architected
Sphinx Brothers now (cont.) Sphinx 4 –With the great effort of Sun developers and, mainly, super speech advisors –Beta completed –Quite popular with users and new startups PocketSphinx (by Dave) –Newly added member of the family –First open source embedded LVCSR
Project L: Project Ladon Goal: Extension and redevelopment of CMU-Cambridge LM Toolkit V2 Final product: CMU-Cambridge LM Toolkit Version 3 (alpha)
Story of V3: 3 “Young” Persons and their Inspiring Stories Young Student - wrote the Perl scripts –Utterly frustrated by LM training, decided to write a set of new Perl scripts Young Faculty - convinced us to license the code under BSD –Wanted to see the LM toolkit under BSD again but had no time. Young Staff – added 32-bit LM support –Had nothing to do on the flight back to HK. –Wanted to do something he thought was useless.
Functions of V3 alpha Support for more than 65k words (32-bit LM) Perl wrapper by Young Student –One-step LM training –Simplified process of LM interpolation and class-based LM training New functionalities –LM interpolation (lm_combine) (by Wen, Moss, Dave) –Random text generation in 3-gram (by Arthur Toth) –Modified Kneser-Ney smoothing (by Prof. Yannick Estève from LIUM)
Blessing for this change Support by the license Permissions from all copyright owners –Prof. Rosenfeld (also makes decisions on licensing issues for CMU) –Dr. Robinson (also makes decisions on licensing issues for Cambridge) –Dr. Clarkson –Blessing mails sent to public mailing list V3 will be re-licensed under BSD
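As an illustration of what LM interpolation means, here is a minimal sketch of linear interpolation between two toy unigram models. It is not the lm_combine implementation: the function names, the toy probabilities and the 0.5 weight are all assumptions made for this example, and a real n-gram toolkit also has to handle back-off weights and higher-order contexts.

```python
def interpolate(p_a, p_b, weight_a):
    """Linearly interpolate two probability estimates for the same event."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be in [0, 1]")
    return weight_a * p_a + (1.0 - weight_a) * p_b


def interpolate_models(lm_a, lm_b, weight_a):
    """Interpolate two unigram models given as {word: probability} dicts.

    Words missing from one model are treated as having probability 0 there.
    """
    vocab = set(lm_a) | set(lm_b)
    return {
        word: interpolate(lm_a.get(word, 0.0), lm_b.get(word, 0.0), weight_a)
        for word in vocab
    }


# Hypothetical toy models over the same three-word vocabulary.
lm_a = {"speech": 0.5, "recognition": 0.3, "sphinx": 0.2}
lm_b = {"speech": 0.2, "recognition": 0.2, "sphinx": 0.6}

mixed = interpolate_models(lm_a, lm_b, weight_a=0.5)
```

Because both inputs sum to one and the weights sum to one, the interpolated model is again a proper distribution.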
SphinxTrain now Now supports speaker adaptation –MLLR, –MAP, –VTLN Fixed many bugs –Still have many to go Integrated into the tutorials. NR code finally removed, we could distribute it now.
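For readers unfamiliar with MLLR, its best-known form adapts each Gaussian mean with a shared affine transform, new_mean = A * mean + b. The sketch below only applies a given transform; estimating A and b from adaptation data is omitted, the numbers are made up, and none of this is taken from SphinxTrain's code.

```python
def mllr_adapt_mean(A, b, mean):
    """Apply the affine transform A @ mean + b to one Gaussian mean.

    A is a square matrix given as a list of rows; b and mean are lists.
    """
    return [
        sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
        for row, b_i in zip(A, b)
    ]


# Hypothetical 2-dimensional transform and speaker-independent means.
A = [[1.0, 0.1],
     [0.0, 0.9]]
b = [0.5, -0.2]
means = [[1.0, 2.0], [0.0, -1.0]]

# The same transform is shared by every mean in the regression class.
adapted = [mllr_adapt_mean(A, b, m) for m in means]
```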
Technology explored in last few years Search / GMM Computation Speaker Adaptation and Normalization Embedded Speech Recognition AM Training LM Training
Future Opportunities - Think the Three Modules Together Technology –N-gram (N>3) (LMtk + SX) –On-line adaptation (SX + ST) –On-line training (SX + ST) Software –Integrated package with comprehensive support on SR (SX + ST + LMtk) –Dictation (SX + ST + LMtk)
Before/After, the difference Spent more time securing training (both AM and LM) Architecture has been re-thought within modules and across modules. Our food-chain is secured in the repository –AM, LM and Decoder code are under one codebase (cmusphinx)
Some Good Signs SourceForge’s Project of the Month (March 2006) Starting to be decently competitive again Someone used our decoder(s) and they look happy –Users actually say “Thank You”. Some companies used our recognizer –(Some of them dare to make profits.)
Some Observations We still need to catch up in accuracy. –Mainly through better algorithmic support for domain-specific development Some Observations –Today’s 10xRT system becomes a 1xRT system 5 years later –Today’s most accurate system becomes the baseline (BL) of next year’s most accurate system Now seems to be just another starting point.
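The "xRT" figures above are read here as real-time factors, that is, processing time divided by audio duration (an assumption about the slide's shorthand). Under that reading, the small sketch below works out what a 10xRT to 1xRT improvement over 5 years implies per year. The function names are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (1.0 means real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def annual_speedup(start_xrt, end_xrt, years):
    """Constant yearly speed-up factor that takes start_xrt to end_xrt."""
    return (start_xrt / end_xrt) ** (1.0 / years)


# 50 s of computation for 5 s of audio is a 10xRT system.
rtf = real_time_factor(50.0, 5.0)

# Going from 10xRT to 1xRT in 5 years needs about a 1.58x speed-up per year.
yearly = annual_speedup(10.0, 1.0, 5)
```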
Conclusion on Our Technology CMU Sphinx = Open Source SR in BSD
Lessons Learned
Lesson 1 Anyone who tries to solve a legacy problem becomes a legacy problem –Corollary 1: Many legacy decisions could actually be clever –Corollary 2: Not every change is good
Lesson 2 : on Research Most of the WER decline comes from better acoustic models and language models –Corollary 1: Actually the trainers are the key piece of development. –Corollary 2: We should now focus on 1) acoustic segmenter, 2) speaker adaptation and 3) discriminative training.
Lesson 3: on Development Why does some of our code never go into Sphinx? –Code without source control is close to useless –Corollary 1: If you want your code to survive, check in. –Corollary 2: If you don’t know what source control is, you probably need to learn it.
Lesson 4: My Favorite, the current Sphinx Motto “Never Over/Under-estimate yourself, you never know what kind of mess you could make.” –Dr. Evandro Gouvêa
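WER (word error rate) is the yardstick behind this lesson. A minimal sketch of the usual definition, word-level edit distance divided by the number of reference words, is given below. It is a plain illustration with a hypothetical function name, not the scoring tool the team used, and it ignores text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```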
Acknowledgement – Current Team: Arthur, David, Evandro, Yitao
Hiring: The Grand Janitor 2nd – Mixture of Several Jobs. Release Manager - Kick other people to fix various things Speech Scientist – Tell users to give up when they randomly read some useless papers. System Architect - Rewrite the code in many different ways but do the same thing Mediator of Conflicts - Write pseudo-philosophical comments Core Developer - Write crappy code and occasionally debug it Advisor – Do what Dr. Phil does on your friends, your users and most importantly, your boss(es) and ex-bosses
Acknowledgement – Advisors: Alex, Rich, Alan, Ravi
Acknowledgement – CMU-Cambridge LM Toolkit Contributors: –David Huggins-Daines, –Ananlada Chotimongkol, –Arthur Toth, –Xu Wen, –Prof. Yannick Estève from LIUM.
Discussion
Thanks
Backup
The Organization of the team
How does it work? The Wrong Model 1, A leader yells: “Sphinx Team Assemble!!” 2, The team then assembles and follows the commands of the leader. 3, Things get done. 4, Once again Sphinx Team has saved the day!
How does it really work? 1-3/10 steps 1, Someone in the team dreams up a new feature. 2, He communicates with the team: –“What do you guys think?” 3, Developers start to give their “two cents” on the problem, e.g. –Arthur: “According to Harry G. Frankfurt, what you talk about is B.S.” –Evandro: “Don’t underestimate yourself, you don’t know what kind of mess you will make.” –Dave: “That doesn’t sound like the best idea……”
The guy doesn’t give up and others give OK (4-6/10 steps) 4, He goes on to implement the code. 5, Check the code in. 6, Peer review happens right after code check-in, example comments: –Arthur: “That is not the right balance according to Yin and Yang.” –Evandro: “I wonder whether you know C programming.” –Dave: “What is the rationale behind your change?” –Yitao: “*Sigh*, I need to recompile Speechalyzer and Smartnote again.”
Automatic Tests (the final tests) 7, Run make check –Make sure there is no FAIL in testing –Requires passing 70 to 80 tests. 8, Standard regression tests (make perf-std) –Run tests on 3 corpora and make sure the results match past results 9, Machines automate both 7 and 8 –mails sent to everyone daily 10, The code could finally screw up people around the world!
“The Sphinx Developers” Members are all funded by CMU. –different purposes, but check in to the same codebase Common goal priority: –Accuracy –Speed & Accuracy trade off –Memory –Interface –Features –User-Friendliness
Characteristics of our Development The role of manager/lead developer is significantly weakened Release could take some time –requires good release management Good architecture is very important Requires skillful and knowledgeable programmers Highly practical: results are worth more than words and opinions
Missions of the team Take care of CMU’s daily needs for quality SR Continue to improve the system Bridge industry and academia.
Conclusion on Team Current development is –Decentralized –Automated –Skill-demanding We probably want to keep it this way
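Step 8 can be pictured as comparing today's results on each corpus against stored numbers from the past. The sketch below is only an illustration of that idea and makes no claim about what make perf-std actually does: the function name, the corpus names, the WER values and the tolerance are all hypothetical.

```python
def find_regressions(baseline, current, tolerance=0.0):
    """Return {corpus: (baseline_wer, current_wer)} for every mismatch.

    A corpus is flagged when its current result is missing or differs from
    the stored baseline by more than `tolerance`.
    """
    mismatches = {}
    for corpus, base_wer in baseline.items():
        cur_wer = current.get(corpus)
        if cur_wer is None or abs(cur_wer - base_wer) > tolerance:
            mismatches[corpus] = (base_wer, cur_wer)
    return mismatches


# Hypothetical stored results for three corpora, and a new run.
baseline = {"corpus_a": 0.120, "corpus_b": 0.185, "corpus_c": 0.310}
current = {"corpus_a": 0.120, "corpus_b": 0.192, "corpus_c": 0.310}

failures = find_regressions(baseline, current, tolerance=0.001)
```

An automated job could mail the `failures` dictionary to the team whenever it is non-empty, which matches the spirit of step 9.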