Network Training for Continuous Speech Recognition


Network Training for Continuous Speech Recognition • Author: Issac John Alphonso, Inst. for Signal and Info. Processing, Dept. of Electrical and Computer Eng., Mississippi State University • Contact Information: Box 0452, Mississippi State, Mississippi 39762; Tel: 662-325-8335; Fax: 662-325-2298; Email: alphonso@isip.msstate.edu; URL: isip.msstate.edu/publications/books/msstate_theses/2003/network_training/ Good morning. I would like to welcome everyone to my Master's defense presentation.

INTRODUCTION: ORGANIZATION • Motivation: Why do we need a new training paradigm? • Theory: Review of the EM-based supervised training framework. • Network Training: The differences between network training and traditional training. • Experiments: Verification of the approach using industry-standard databases (e.g., TIDigits, Alphadigits, and Resource Management). This presentation is broken down into four major sections: motivation, network training, experiments, and conclusions.

INTRODUCTION: MOTIVATION • A traditional trainer uses an EM-based framework to estimate the parameters of a speech recognition system. • EM-based parameter estimation is performed in several complicated stages that are prone to human error. • A network trainer reduces the complexity of the training process by employing a soft decision criterion. • A network trainer achieves comparable performance and retains the robustness of the EM-based framework. The traditional training framework has proven to be a very successful and robust means of re-estimating the parameters of a speech recognition system, so why do we need a new training paradigm? The biggest problems with the traditional framework have always been the complexity of the training process and the degree of supervision needed to yield robust models.
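The EM-based parameter estimation referred to above can be illustrated with a toy example. The sketch below is not the trainer described in the thesis; it runs EM on a two-component, one-dimensional Gaussian mixture, showing the same E-step (posterior responsibilities) and M-step (weighted re-estimation) structure that Baum-Welch applies to HMM state distributions. The function name, initialization, and variance floor are illustrative assumptions:

```python
import math
import random

def em_gmm_1d(data, n_iter=50):
    """Minimal EM for a two-component 1-D Gaussian mixture."""
    # Crude initialization from the data range (an assumption for this toy).
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def gauss(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample.
        resp = []
        for x in data:
            p = [w[k] * gauss(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: weighted re-estimation of weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-3)  # variance floor, common in HMM trainers
    return w, mu, var

# Synthetic data: two clusters centered at 0.0 and 5.0.
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(5.0, 1.0) for _ in range(200)])
w, mu, var = em_gmm_1d(data)
```

Each iteration is guaranteed not to decrease the data likelihood, which is the property that makes the multi-stage traditional recipe robust once its inputs are correct.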

NETWORK TRAINER: TRAINING RECIPE Stages: Flat Start → Context-Independent (CI) Training → State Tying → Context-Dependent (CD) Training. • The flat-start stage segments the acoustic signal and seeds the speech and non-speech models. • The context-independent stage inserts an optional silence model between words. • The state-tying stage clusters the model parameters via linguistic rules to compensate for sparse training data. • The context-dependent stage is similar to the context-independent stage, except that words are modeled using phonetic context. The flat-start stage segments the acoustic signal and learns the acoustic representation of the words and silence components.
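The state-tying stage described above can be sketched in miniature. This is a deliberately tiny, hypothetical example, not the actual linguistic rules of the recipe: the `VOWELS` set, the `left-center+right` triphone notation, and the single broad-class question in `tie_key` are all assumptions. The idea is that triphone models whose contexts answer the same linguistic question share one tied set of parameters, so sparse triphones still get robustly estimated models:

```python
# Hypothetical broad class used by the tying question (illustrative subset).
VOWELS = {"aa", "ae", "ah", "iy", "uw"}

def tie_key(triphone):
    """Map a triphone 'left-center+right' to a tied-model key using one
    broad-class question (vowel vs. consonant) on each context."""
    left, rest = triphone.split("-")
    center, right = rest.split("+")
    return (center,
            "V" if left in VOWELS else "C",
            "V" if right in VOWELS else "C")

# Triphones with the same key are pooled into one tied model.
triphones = ["aa-t+iy", "ae-t+uw", "s-t+iy", "aa-t+s"]
tied = {}
for t in triphones:
    tied.setdefault(tie_key(t), []).append(t)
```

Here `aa-t+iy` and `ae-t+uw` land in the same class (vowel context on both sides of `t`), so their training data is shared; real recipes use many such questions arranged in a decision tree.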

NETWORK TRAINER: FLEXIBLE TRANSCRIPTIONS Network Trainer: SILENCE HAVE SILENCE. Traditional Trainer: sil hh ae v sil. • The network trainer uses word-level transcriptions, which do not impose restrictions on the word pronunciation. • The traditional trainer uses phone-level transcriptions, which use the canonical pronunciation of each word. • Using orthographic transcriptions removes the need to deal directly with phonetic contexts during training. Training a speech recognizer is a supervised learning process, which means we require labels (transcriptions) and observations (features).

NETWORK TRAINER: FLEXIBLE TRANSCRIPTIONS • The network trainer uses a silence word, which precludes the need to insert silence into the phonetic pronunciation. • The traditional trainer deals with silence between words by explicitly specifying it in the phonetic pronunciation. Using a global model to learn the silence between words requires a multi-path model that accounts for both long and short silence durations between words.

NETWORK TRAINER: DUAL SILENCE MODELLING • Multi-Path: The multi-path silence model is used between words. • Single-Path: The single-path silence model is used at utterance ends.
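The difference between the two silence topologies can be sketched as transition matrices. The three-emitting-state layout below is an illustrative assumption, not the thesis' actual models; the point is that the multi-path model's skip transition lets it absorb a very short inter-word pause, while the single-path model must traverse every state:

```python
from collections import deque

# Rows are from-states, columns are to-states; 1 = transition allowed.
# State 0 is a non-emitting entry state, the last state is the exit.

# Single-path: strict left-to-right, used at utterance bounds, so the
# model always dwells long enough to avoid an underestimated silence.
single_path = [
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]

# Multi-path: adds a skip from the entry state to the last emitting
# state, so a short pause between words costs only one frame.
multi_path = [
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]

def min_states_visited(topo):
    """Shortest entry-to-exit path length (BFS), i.e. the minimum number
    of emitting states the model must occupy."""
    n = len(topo)
    dist = {0: 0}
    q = deque([0])
    while q:
        u = q.popleft()
        for v in range(n):
            if topo[u][v] and v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist[n - 1] - 1  # subtract the non-emitting entry state
```

With this sketch the single-path model must emit at least three frames of silence, while the multi-path model can get away with one, which is exactly the long-versus-short duration trade-off the slides describe.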

NETWORK TRAINER: DUAL SILENCE MODELLING • Using an optional silence model at utterance boundaries caused a bad segmentation of the acoustic signal, which resulted in poor performance. • We tried seeding the silence model using an example observation, but that also resulted in poor recognition performance after flat start. • The network trainer therefore uses a fixed silence at utterance bounds and an optional silence between words. We use a fixed silence at utterance bounds to avoid an underestimated silence model.

NETWORK TRAINER: DUAL SILENCE MODELLING • Using an optional silence model at utterance boundaries worked on a small data set; however, the same results do not scale up to large data sets. • Network training uses a single-path silence at utterance bounds and a multi-path silence between words. We use a single-path silence at utterance bounds to avoid uncertainty in modeling silence.

EXPERIMENTS: TIDIGITS WER COMPARISON

Stage                 WER     Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   7.7%    0.1%             2.5%            5.0%
Network Trainer       7.6%                     2.4%

• The network trainer achieves comparable performance to the traditional trainer. • The network trainer converges in word error rate to the traditional trainer. • The substitution rate also indicates comparable performance (model confusion).
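The word error rates in these tables combine three error types: WER = (S + D + I) / N, where N is the number of reference words. As a minimal sketch of how the counts are obtained (the example sentences are made up, and this is a generic Levenshtein alignment, not the scoring tool used in the thesis):

```python
def wer_counts(ref, hyp):
    """Align ref and hyp word lists by minimum edit distance and return
    (substitutions, deletions, insertions)."""
    # dp[i][j] = (total_edits, subs, dels, ins) for ref[:i] vs hyp[:j].
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)          # unmatched reference words: deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)          # unmatched hypothesis words: insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # correct word, no edit
            else:
                sub = dp[i - 1][j - 1]
                dele = dp[i - 1][j]
                ins = dp[i][j - 1]
                best = min(sub, dele, ins, key=lambda t: t[0])
                if best is sub:
                    dp[i][j] = (best[0] + 1, best[1] + 1, best[2], best[3])
                elif best is dele:
                    dp[i][j] = (best[0] + 1, best[1], best[2] + 1, best[3])
                else:
                    dp[i][j] = (best[0] + 1, best[1], best[2], best[3] + 1)
    _, s, d, ins = dp[-1][-1]
    return s, d, ins

ref = "one two three four".split()
hyp = "one to three four five".split()
s, d, i = wer_counts(ref, hyp)
wer = (s + d + i) / len(ref)   # one substitution, one insertion: 2/4 = 0.5
```

This is why the slides break each system's WER into insertion, deletion, and substitution rates: the three components sum to the WER and reveal where the errors come from.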

EXPERIMENTS: AD (ALPHADIGITS) WER COMPARISON

Stage                 WER      Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   38.0%    0.8%             3.0%            34.2%
Network Trainer       35.3%                     2.2%

• The network trainer achieves comparable performance to the traditional trainer. • The network trainer converges in word error rate to the traditional trainer. • The substitution rate also indicates comparable performance (model confusion).

EXPERIMENTS: RM (RESOURCE MANAGEMENT) WER COMPARISON

Stage                 WER      Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   25.7%    1.9%             6.7%            17.1%
Network Trainer       27.5%    2.6%             7.1%            17.9%

• The network trainer achieves comparable performance to the traditional trainer. • It is important to note that the 1.8% degradation in performance is not significant (MAPSSWE test). • The substitution rate also indicates comparable performance (model confusion).

CONCLUSIONS: SUMMARY • Explored the effectiveness of a novel training recipe in the re-estimation process for speech recognition. • Analyzed performance on three databases. • For TIDigits, at 7.6% WER, the performance of the network trainer was better by about 0.1%. • For OGI Alphadigits, at 35.3% WER, the performance of the network trainer was better by about 2.7%. • For Resource Management, at 27.5% WER, performance degraded by about 1.8% (not significant).