Progress of Sphinx 3.X, From X=4 to X=5. By Arthur Chan, Evandro Gouvea, Yitao Sun, David Huggins-Daines, Jahanzeb Sherwani.

Presentation transcript:

Progress of Sphinx 3.X, From X=4 to X=5
By Arthur Chan, Evandro Gouvea, Yitao Sun, David Huggins-Daines, Jahanzeb Sherwani

What is CMU Sphinx?
- Definition 1: a large-vocabulary speech recognizer with high accuracy and speed.
- Definition 2: a collection of tools and resources that enables developers and researchers to build successful speech recognizers.

Brief History of Sphinx
A more detailed version can be found at,
Sphinx I 1992 Sphinx II 1996 Sphinx III “S3 slow” 1999 Sphinx III “S3 fast” or S Sphinx becomes open-source Sphinx IV Development Initiated -S Jul S Oct S3.5 RCII

What is Sphinx 3.X?
- An extension of Sphinx 3’s recognizers.
- “Sphinx 3.X (X=5)” means “Sphinx 3.5”. It helps to confuse people more.
- Provides functionality such as:
  - Real-time speech recognition
  - Speaker adaptation
  - Developer application interfaces (APIs)
- 3.X (X>3) is motivated by Project CALO.

Development History of Sphinx 3.X
- S3: Sphinx 3 flat-lexicon recognizer (s3 slow)
- S3.2: Sphinx 3 tree-lexicon recognizer (s3 fast)
- S3.3: with live-mode demo
- S3.4: fast GMM computation; support for class-based LM; some support for dynamic LM
- S3.5: some support for speaker adaptation; live-mode APIs; Sphinx 3 and Sphinx 3.X code merge

This talk
- A general summary of what’s going on.
- Less technical than the 3.4 talk: folks were so confused by the jargon in speech recognition’s black magic.
- More on code development, less on acoustic modeling. Reason: I do not have much time to do both.
  - (Incorrect version): “We need to adopt the latest technology to clown 2 to 3 Arthur Chan(s) for the CALO project.” –Prof. Alex Rudnicky, in one CALO meeting in 2004
  - (“Kindly” corrected by Prof. Alan Black): “We need to adopt the latest technology to clone 2 to 3 Arthur Chan(s) for the CALO project.” –Prof. Alex Rudnicky, in one CALO meeting in 2004
- More from a project point of view: speech recognition software easily shows the phenomena described in “The Mythical Man-Month”.

This talk (outline)
- Sphinx 3.X, the recognizer (from X=4 to X=5) (~10 pages)
  - Accuracy and Speed (5 pages)
  - Speaker Adaptation (1 page)
  - Application Interfaces (APIs) (2 pages)
  - Architecture (2 pages)
- Sphinx as a collection of resources (~10 pages)
  - Code distribution and management (3 pages)
  - Infrastructure of Training (1 page)
  - SphinxTrain: tools for training acoustic models (1 page)
  - Documentation (3 pages)
  - Team and Organization (2 pages)
- Development plan for Sphinx 3.X (X >= 6) (2 pages)
- Relationship between speech recognition and other speech research (4 pages)

Accuracy and Speed
- Why Sphinx 3.X? Why not Sphinx 2?
- Because of the limited computation available in the 90s, S2 only supports a restricted version of the semi-continuous HMM (SCHMM).
- S3.X supports the fully continuous HMM (FCHMM).
- The accuracy improvement is around 30% relative. You will see benchmarking results two slides later.
- Speed: S3.X is still slower than S2, but in many tasks it seems to have become reasonable to use it. (You can find the results a few slides later.)

Speed
- Fast search techniques
  - Lexical tree search (s3.2)
  - Viterbi beam tuning and histogram beam pruning (s3.2) (Ravi’s talk)
  - Phoneme look-ahead (s3.4, by Jahanzeb)
- Fast GMM computation techniques (s3.4)
  - Using the measurement in the literature, that means a 75%-90% reduction in GMM computation with fast GMM computation + pruning.
  - <10% relative degradation can usually be achieved on a clean database.
  - Further detail: “Four-Layer Categorization Scheme of Fast GMM Computation Techniques”, A. Chan et al.
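
The two pruning ideas named on this slide can be illustrated with a minimal sketch. This is not Sphinx code: the function names, the score values and the thresholds are all assumptions made for illustration. Beam pruning keeps every hypothesis whose log score is within a fixed width of the best one; histogram pruning caps the number of surviving hypotheses. A production decoder would typically bin scores into a histogram instead of sorting, so the sort here is a simplification.

```python
def beam_prune(scores, beam_width):
    """Keep hypotheses whose log score is within beam_width of the best one."""
    if not scores:
        return {}
    best = max(scores.values())
    return {state: s for state, s in scores.items() if s >= best - beam_width}


def histogram_prune(scores, max_active):
    """Keep at most max_active hypotheses, highest log score first.

    Simplification: a full sort stands in for the histogram of scores
    that a real decoder would use to find the cut-off cheaply.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max_active])


# Hypothetical log-likelihood scores of the active states in one frame.
frame_scores = {"s1": -10.0, "s2": -12.5, "s3": -30.0, "s4": -11.0, "s5": -14.0}

after_beam = beam_prune(frame_scores, beam_width=5.0)   # drops s3
after_hist = histogram_prune(after_beam, max_active=3)  # keeps the best three
print(sorted(after_beam), list(after_hist))
```

Tightening either threshold trades accuracy for speed, which is why the slides describe tuning as a separate step.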

Accuracy Benchmarking (Communicator Task)
- Test platform: 2.2 GHz Pentium IV
- CMU Communicator task; vocabulary size: 3k; perplexity: ~90
- All tuning was done without sacrificing more than 5% of performance.
- The batch-mode decoder (decode) is used.
- Sphinx 2 (tuned with speed-up techniques): WER 17.8% (0.34xRT)
- Baseline results, Sphinx 3.X, 32-Gaussian FCGMM: WER: % (2.40xRT)
- Baseline results, Sphinx 3.X, 64-Gaussian FCGMM: WER 11.7% (~3.67xRT)
- Tuned Sphinx 3.X, 64-Gaussian FCGMM: WER: % (0.87xRT), % (1.17xRT)
- Rong can make it better: boosting training results: 10.5%
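
As a worked example of how the figures on this slide relate to the "around 30% relative" claim, the sketch below computes a relative WER reduction from the two WER values that survive in the transcript (17.8% for Sphinx 2, 11.7% for the 64-Gaussian Sphinx 3.X baseline). It also shows the usual definition of the real-time factor (xRT) as processing time divided by audio duration. The function names and the timing values are assumptions for illustration only.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline word error rate that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


def real_time_factor(processing_seconds, audio_seconds):
    """xRT: values below 1.0 mean faster than real time."""
    return processing_seconds / audio_seconds


# WER values taken from the slide: Sphinx 2 vs. Sphinx 3.X (64 Gaussians).
reduction = relative_wer_reduction(17.8, 11.7)
print(f"relative WER reduction: {reduction:.1%}")

# Hypothetical timing: 8.7 s of decoding for 10 s of audio.
rtf = real_time_factor(8.7, 10.0)
print(f"real-time factor: {rtf:.2f}xRT")
```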

Accuracy/Speed Benchmarking (WSJ Task)
- Test platform: 2.2G Pentium
- Vocabulary size: 5k; standard NVP task
- Trained on both WSJ0 and WSJ1
- Sphinx 2: 14.5% (?)
- Sphinx 3.X, 8-Gaussian FCGMM: un-tuned 7.3% at 1.6xRT; tuned 8.29% at 0.52xRT

Accuracy/Speed Benchmarking (Future Plan)
- Issue 1: Large variance in GMM computation. Average performance is good; the worst case can be disastrous.
- Issue 2: Tuning requires a black magician. Automatic tuning is necessary.
- Issue 3: We still need to work on larger databases (e.g. WSJ 20k, BN); the training setup needs to be dug up.
- Issue 4: Speed-up on a noisy corpus is tricky. Results are not satisfactory (20-30% degradation in accuracy).

Speaker Adaptation
- Starting to support MLLR-based speaker adaptation: y=Ax+b, with A and b estimated in a maximum-likelihood fashion (Leggetter 94).
- Current functionality of Sphinx 3.X + SphinxTrain:
  - Estimation of the transformation matrix
  - Transforming means offline
  - Transforming means online
- The decoder only supports a single regression class.
- The code gives exactly the same results as Sam Joo’s code.
- Not fully benchmarked yet; still experimental.
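
The mean transformation y=Ax+b with a single regression class can be sketched as follows. This only illustrates applying an already estimated transform to Gaussian means; the maximum-likelihood estimation of A and b is not shown, and the function names and numeric values are assumptions, not Sphinx or SphinxTrain code.

```python
def transform_mean(A, b, mean):
    """Return A * mean + b for one Gaussian mean vector."""
    return [
        sum(a_ij * x_j for a_ij, x_j in zip(row, mean)) + b_i
        for row, b_i in zip(A, b)
    ]


def adapt_means(A, b, means):
    """Single regression class: the same (A, b) is applied to every mean."""
    return [transform_mean(A, b, mean) for mean in means]


# Hypothetical 2-dimensional transform and speaker-independent means.
A = [[1.0, 0.1],
     [0.0, 0.9]]
b = [0.5, -0.2]
means = [[1.0, 2.0], [0.0, 1.0]]

adapted = adapt_means(A, b, means)
print(adapted)
```

With several regression classes, each Gaussian would instead look up the (A, b) pair of its own class before being transformed.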

Live-mode APIs
- Thanks to Yitao
- A set of C APIs that provide recognition functionality
- Close to Sphinx 2’s style of APIs
- Speech recognition resource initialization/un-initialization
- Functions for utterance-level begin/end/process waveforms
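
The call sequence this slide describes (initialize resources, begin an utterance, feed waveform chunks, end the utterance, un-initialize) can be sketched with a toy stand-in. The real APIs are in C and their names are not given in the slides, so every class and method name below is hypothetical, and the "decoder" only counts samples instead of recognizing speech.

```python
class LiveDecoderSketch:
    """Toy stand-in for a live-mode recognizer; all names are hypothetical."""

    def __init__(self):
        # Resource initialization: a real decoder would load models here.
        self.initialized = True
        self.in_utterance = False
        self.samples_seen = 0

    def begin_utterance(self):
        if self.in_utterance:
            raise RuntimeError("utterance already in progress")
        self.in_utterance = True
        self.samples_seen = 0

    def process_waveform(self, samples):
        if not self.in_utterance:
            raise RuntimeError("call begin_utterance first")
        # A real decoder would extract features and advance the search here.
        self.samples_seen += len(samples)

    def end_utterance(self):
        if not self.in_utterance:
            raise RuntimeError("no utterance in progress")
        self.in_utterance = False
        return self.samples_seen

    def close(self):
        # Resource un-initialization.
        self.initialized = False


decoder = LiveDecoderSketch()
decoder.begin_utterance()
for chunk in ([0] * 160, [0] * 160, [0] * 80):  # hypothetical audio chunks
    decoder.process_waveform(chunk)
total = decoder.end_utterance()
decoder.close()
print(total)
```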

Live-mode APIs: What is missing?
What we lack:
- Dynamic LM addition and deletion: part of the plan for s3.6
- Finite state machine implementation: part of the plan for s3.X where X=8 or 9
- End-pointer integration and APIs: Ziad Al Bawab’s model-based classifier, now available as a customized version, s3ep

Architecture
- “Code duplication is the root of many evils”
- Four tools of s3 are now incorporated into S3.5:
  - align: an aligner
  - allphone: a phoneme recognizer
  - astar: lattice to N-best generation
  - dag: lattice best-path search
- Many thanks to Dr. Carl Quillen of MIT Lincoln
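
To make "lattice best-path search" concrete, here is a minimal sketch of finding the highest-scoring path through a word lattice stored as a directed acyclic graph. It is not the implementation of the dag tool: the data layout, the function name and the example scores are assumptions, and nodes are assumed to be numbered so that every edge goes from a lower to a higher node id.

```python
def best_path(num_nodes, edges):
    """Best path from node 0 to node num_nodes - 1 in a DAG.

    edges: list of (src, dst, word, log_score) with src < dst.
    Returns (total_log_score, list_of_words).
    """
    neg_inf = float("-inf")
    best = [neg_inf] * num_nodes
    back = [None] * num_nodes
    best[0] = 0.0
    # Sorting by src visits edges in topological order because src < dst.
    for src, dst, word, score in sorted(edges):
        if best[src] == neg_inf:
            continue  # src is unreachable from the start node
        candidate = best[src] + score
        if candidate > best[dst]:
            best[dst] = candidate
            back[dst] = (src, word)
    # Follow back-pointers from the final node to recover the word sequence.
    words = []
    node = num_nodes - 1
    while back[node] is not None:
        node, word = back[node]
        words.append(word)
    return best[num_nodes - 1], list(reversed(words))


# Hypothetical three-node lattice with competing word hypotheses.
lattice = [
    (0, 1, "go", -1.0),
    (0, 1, "no", -2.0),
    (1, 2, "home", -1.5),
    (0, 2, "gnome", -4.0),
]
score, words = best_path(3, lattice)
print(score, words)
```

An N-best generator such as the astar tool described above would enumerate several high-scoring paths rather than only the single best one.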

Architecture: Next Step
- decode_anytopo will be the next
- Things we may incorporate someday:
  - SphinxTrain
  - CMU-Cambridge LM Toolkit
  - lm3g2dmp and cepview

Code Distribution and Management
- Distribution: Internal Release -> RC I -> RC II ... -> RC N
- If no one yells during the calm-down period of RC N, then a tarball is put on the SourceForge web page.
- At every release, the distribution has to go through compilation on ~10 platforms.
- The first announcement is usually made during the RC period.
- The web page is maintained by Evandro (<- extremely sane)

Digression: Other versions of Sphinx 3.X
- Code that does not satisfy the design goals of the software:
  - S3 slow w/ GMM Computation
  - S3.5 with end-pointer
- CMU researchers’ code and implementations, e.g. according to legend, Rita has >10 versions of Sphinx and SphinxTrain.

Code Management
- Concurrent Versions System (CVS) is used in Sphinx. It is also used in other projects, e.g. CALO and Festival.
- A very effective way to tie resources and knowledge together.
- Problem: there are still a lot of separate versions of the code at CMU that are not in Sphinx’s CVS.
- Please kindly contact us if you work on something using Sphinx or derived from Sphinx.

Infrastructure of Training
- A need for persistence and version control: baselines were lost after several years.
- Setups will now be available in CVS for:
  - Communicator (11.5%)
  - WSJ 5k NVP (7.3%)
  - ICSI Phase 3 Training
- Far from the state of the art; we need to re-engineer and do archeology.
- Will add more tasks to the archive.
- You are welcome to change the setup if you don’t like it, but you need to check in what you have done.

SphinxTrain
- SphinxTrain has never been officially released; it is still a work in progress.
- For Sphinx 3.X (X>=5), the corresponding timestamp of SphinxTrain will also be published.
- Recent progress:
  - Better on-line help
  - Added support for adaptation
  - Better support in Perl scripts for FCHMM (Evandro)
  - Silence deletion in Baum-Welch training (experimental)

Hieroglyph: Using Sphinx for building speech recognizers
- Project Hieroglyph: an effort to build a complete set of documentation for using Sphinx, SphinxTrain and the CMU LM Toolkit for building speech applications.
- Largely based on Evandro, Rita, Ravi and Roni’s docs.
- “Editor”: Arthur Chan <- does a lot of editing
- Authors: Arthur, David, Evandro, Rita, Ravi, Roni, Yitao

Hieroglyph: An outline
- Chapter 1: Licensing of Sphinx, SphinxTrain and LM Toolkit
- Chapter 2: Introduction to Sphinx
- Chapter 3: Introduction to Speech Recognition
- Chapter 4: Recipe of Building Speech Application using Sphinx
- Chapter 5: Different Software Toolkits of Sphinx
- Chapter 6: Acoustic Model Training
- Chapter 7: Language Model Training
- Chapter 8: Search Structure and Speed-up of the Speech recognizer
- Chapter 9: Speaker Adaptation
- Chapter 10: Research using Sphinx
- Chapter 11: Development using Sphinx
- Appendix A: Command Line Information
- Appendix B: FAQ

Hieroglyph: Status
Still in the drafting stage:
- Chapter I: License and use of Sphinx, SphinxTrain and CMU LM Toolkit (1st draft, 3rd rev)
- Chapter II: Introduction to Sphinx, SphinxTrain and CMU LM Toolkit (1st draft, 1st rev)
- Chapter VIII: Search Structure and Speed-up of Sphinx's recognizers (1st draft, 1st rev)
- Chapter IX: Speaker adaptation using Sphinx (1st draft, 2nd rev)
- Chapter XI: Development using Sphinx (1st draft, 1st rev)
- Appendix A.2: Full SphinxTrain Command Line Information (1st draft, 2nd rev)
Writing quality: low
The 1st draft will be completed ½ year later (hopefully)

Team and Organization
- “Sphinx Developers”: a group of volunteers who maintain and enhance Sphinx and related resources
- Current members:
  - Arthur Chan (Project Manager / Coordinator)
  - Evandro Gouvea (Maintainer / Developer)
  - David Huggins-Daines (Developer)
  - Yitao Sun (Developer)
  - Ravi Mosur (Speech Advisor)
  - Alex Rudnicky (Speech Advisor)
- All of you: application developers, modeling experts, linguists, users

Team and Organization
We need help! Several positions are still available for volunteers:
- Project Manager: enable development of Sphinx. Translation: kick/fix miscellaneous people (lightly) every day.
- Maintainer: ensure the integrity of Sphinx code and resources. Translation: a good chance for you to understand life more.
- Tester: enable test-based development in Sphinx. Translation: a good way to increase blood pressure.
- Developers: incorporate state-of-the-art technology into Sphinx. Translation: deal with legacy code and start to write legacy code yourself.
For your projects, you can also send us temp people.
Regular meetings are scheduled biweekly, though if we are too busy, we just skip them.

Next 6 months: Sphinx 3.6
- More refined speaker adaptation
- More support for dynamic LM
- More speed-up of the code
- Better documentation (complete 1st draft of Hieroglyph?)
- Confidence measure (?)

If we still survive and have a full team……
Roadmap of Sphinx 3.X (X>6)
- X=7: decoder/trainer code merge; FSG implementation; confidence annotation
- X=8: trainer fixes; LM manipulation support
- X=9: better covariance modeling and speaker adaptation; Hieroglyph completed
- X>=10: to move on, innovation is necessary.

Speech Recognition and Other Research
- The goal of Sphinx: support innovation and development of new speech applications.
- A conscious and correct decision in long-term speech recognition research.
- In speech synthesis: an aligner is important for unit selection.
- In parsing/dialog modeling: Sphinx 3.X still has a lot of errors! We still need Phoenix! (robust parser) We still need Ravenclaw House! (dialog manager)
- In speech applications: a good recognizer is the basis.

Cost of Research in Speech Recognition
- A 30% WER reduction is usually perceivable to users, i.e. it roughly translates to 1-2 good algorithmic improvements.
- In a well-educated research group: known techniques usually require ½ year to implement and test; unknown techniques take more time (1 year per innovation).
- Experienced developers: 1 month to implement known techniques; 3 months to innovate.

Therefore……
- It still makes sense to continuously support speech recognizer development and acoustic modeling improvement.
- To consolidate, what we were lacking:
  1. Code and project management: a multi-developer environment is strictly essential.
  2. Transfer of research to development.
  3. Acoustic modeling research: discriminative training, speaker adaptation.

Future of Sphinx 3.X
- ICSLP 2004: “From Decoding Driven to Detection-Based Paradigms for Automatic Speech Recognition” by Prof. Chin-Hui Lee
- Speech recognition: still an open problem in 2004
- Role of speech recognition in speech applications: still largely unknown; requires open minds to understand

Conclusion
- We’ve done something in 2004: our effort is starting to make a difference.
- We still need to do more in 2005:
  - Making Sphinx 3.X a backbone of speech application development
  - Consolidating the current research and development in Sphinx
  - Seeking ways toward sustainable development

Thank you!