Network Training for Continuous Speech Recognition
Author: Issac John Alphonso
Institute for Signal and Information Processing
Department of Electrical and Computer Engineering
Mississippi State University

Contact Information:
Box 0452, Mississippi State University, Mississippi State, Mississippi
Tel:
Fax:
URL: isip.msstate.edu/publications/books/msstate_theses/2003/network_training/

INTRODUCTION ABSTRACT A traditional trainer uses an expectation maximization (EM) based supervised training framework to estimate the parameters of a speech recognition system. EM-based parameter estimation for speech recognition is performed using several complicated stages of iterative re-estimation. These stages are prone to human error. This thesis describes a new network training paradigm that reduces the complexity of the training process while retaining the robustness of the EM-based supervised training framework. The network trainer achieves recognition performance comparable to a traditional trainer while alleviating the need for complicated systems and training recipes.

INTRODUCTION ORGANIZATION
Motivation: Why do we need a new training paradigm?
Theoretical Background: Review of the EM-based supervised training framework.
Network Training: The differences between network training and traditional training.
Experiments: Verification of the approach using industry-standard databases (e.g., TIDigits, Alphadigits and Resource Management).
Outline: Motivation, Theoretical Background, Network Training, Experiments, Conclusion & Future Work.

INTRODUCTION MOTIVATION
A traditional trainer uses an EM-based framework to estimate the parameters of a speech recognition system.
EM-based parameter estimation is performed in several complicated stages, which are prone to human error.
A network trainer reduces the complexity of the training process by employing a soft decision criterion.
A network trainer achieves comparable performance and retains the robustness of the EM-based framework.

THEORETICAL BACKGROUND COMMUNICATION THEORETIC APPROACH
[Figure: communication channel model — a message source passes through a linguistic channel, an articulatory channel and an acoustic channel; the observables at each stage are the message, words, sounds and features.]
Maximum likelihood formulation for speech recognition: P(W|A) = P(A|W) P(W) / P(A)
Objective: minimize the word error rate.
Approach: maximize P(W|A) during training.
Components:
P(A|W): acoustic model (HMMs/GMMs)
P(W): language model (statistical, FSNs, etc.)
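As a brief clarification (a standard identity added here, not part of the original slide), the denominator P(A) does not depend on the word sequence, so maximizing the posterior reduces to maximizing the product of the acoustic and language model scores:

$$\hat{W} \;=\; \arg\max_{W} P(W \mid A) \;=\; \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)} \;=\; \arg\max_{W} P(A \mid W)\,P(W).$$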

THEORETICAL BACKGROUND MAXIMUM LIKELIHOOD
The approach treats the parameters of the model as fixed quantities whose values need to be estimated.
The model parameters are estimated by maximizing the log likelihood of observing the training data:

$$\log P(O \mid \lambda) \;=\; \sum_{t=1}^{T} \log P(o_t \mid \lambda)$$

where $\lambda$ denotes the model parameters and $o_t$ the observation at time $t$.
The estimation of the parameters is computationally tractable due to the availability of efficient algorithms.
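A minimal sketch of the quantity being maximized, using a single diagonal-covariance Gaussian as a stand-in acoustic model (the thesis uses HMM/GMM models; the function name and toy model below are illustrative assumptions, not the ISIP implementation):

```python
import numpy as np

def gaussian_log_likelihood(frames, mean, var):
    """log P(O | lambda) = sum_t log N(o_t; mean, diag(var)).

    frames: (T, D) array of feature vectors o_1..o_T
    mean, var: (D,) arrays of the model's mean and diagonal variances
    """
    d = frames.shape[1]
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + np.sum(np.log(var)))
    per_frame = log_norm - 0.5 * np.sum((frames - mean) ** 2 / var, axis=1)
    return np.sum(per_frame)

# For this toy model the maximum likelihood estimates are the sample statistics.
frames = np.random.randn(100, 13)            # e.g., 100 frames of 13-dim features
mean_ml, var_ml = frames.mean(axis=0), frames.var(axis=0)
print(gaussian_log_likelihood(frames, mean_ml, var_ml))
```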

THEORETICAL BACKGROUND EXPECTATION MAXIMIZATION
A general framework that can be used to determine maximum likelihood estimates of the model parameters.
The algorithm iteratively re-estimates the model by maximizing Baum's auxiliary function:

$$Q(\lambda, \bar{\lambda}) \;=\; \sum_{q} P(O, q \mid \lambda)\, \log P(O, q \mid \bar{\lambda})$$

where the sum runs over all state sequences $q$, $\lambda$ is the current model and $\bar{\lambda}$ the re-estimated model.
Each iteration is guaranteed not to decrease the likelihood, so the algorithm converges to a (local) maximum likelihood estimate.
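To make the E-step/M-step structure concrete, here is a minimal, self-contained EM sketch for a two-component one-dimensional Gaussian mixture (an illustrative toy, not the HMM/GMM re-estimation used by the trainer; all names are hypothetical):

```python
import numpy as np

def em_gmm_1d(x, iters=20):
    """Two-component 1-D GMM trained with EM (illustrative only)."""
    w = np.array([0.5, 0.5])                      # mixture weights
    mu = np.array([x.min(), x.max()])             # crude initial means
    var = np.array([x.var(), x.var()])            # initial variances
    for _ in range(iters):
        # E-step: responsibilities gamma[n, k] = P(component k | x_n, current params)
        lik = np.stack([w[k] / np.sqrt(2 * np.pi * var[k])
                        * np.exp(-0.5 * (x - mu[k]) ** 2 / var[k])
                        for k in range(2)], axis=1)
        gamma = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the expected sufficient statistics
        n_k = gamma.sum(axis=0)
        w = n_k / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / n_k
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return w, mu, var

x = np.concatenate([np.random.normal(-2.0, 1.0, 500),
                    np.random.normal(3.0, 1.0, 500)])
print(em_gmm_1d(x))
```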

THEORETICAL BACKGROUND HIDDEN MARKOV MODELS
A random process that consists of a set of states and their corresponding transition probabilities:
The prior probabilities: $\pi_j = P(\text{state } j \text{ at } t = 0)$
The state transition probabilities: $a_{ij} = P(\text{state } j \text{ at time } t+1 \mid \text{state } i \text{ at time } t)$
The state emission probabilities: $b_j(O_t) = P(O_t \mid \text{state } j \text{ at time } t)$
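A minimal sketch of how these three parameter sets might be represented and used to score an observation sequence with the forward algorithm (a generic discrete-observation HMM in the probability domain; the ISIP system's actual data structures and log-domain implementation will differ):

```python
import numpy as np

class ToyHMM:
    """Discrete-observation HMM: pi (priors), A (transitions), B (emissions)."""
    def __init__(self, pi, A, B):
        self.pi = np.asarray(pi)   # pi[j]   = P(state j at t = 0)
        self.A = np.asarray(A)     # A[i, j] = P(state j at t+1 | state i at t)
        self.B = np.asarray(B)     # B[j, k] = P(observation symbol k | state j)

    def likelihood(self, obs):
        """P(O | model) via the forward algorithm (sums over all state paths)."""
        alpha = self.pi * self.B[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ self.A) * self.B[:, o]
        return alpha.sum()

hmm = ToyHMM(pi=[1.0, 0.0],
             A=[[0.6, 0.4], [0.0, 1.0]],      # left-to-right topology
             B=[[0.9, 0.1], [0.2, 0.8]])      # two discrete symbols
print(hmm.likelihood([0, 0, 1, 1]))
```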

NETWORK TRAINER TRAINING RECIPE
The flat start stage segments the acoustic signal and seeds the speech and non-speech models.
The context-independent stage inserts an optional silence model between words.
The state-tying stage clusters the model parameters via linguistic rules to compensate for sparse training data.
The context-dependent stage is similar to the context-independent stage (words are modeled using context).
Pipeline: Flat Start → CI Training → State Tying → CD Training (a sketch of this sequence follows).
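A minimal sketch of how the four stages could be sequenced, each stage consuming the models produced by the previous one (the stage functions and their signatures are hypothetical placeholders, not the ISIP trainer's actual API):

```python
# Hypothetical stage functions; names and arguments are illustrative only.
def flat_start(features, transcripts):
    """Uniformly segment the audio and seed speech / non-speech models."""
    return {"stage": "flat_start"}

def ci_training(models, features, transcripts, passes=4):
    """Re-estimate context-independent models (optional silence between words)."""
    return dict(models, stage="ci_training")

def state_tying(models, linguistic_rules):
    """Cluster (tie) model parameters via linguistic rules for sparse data."""
    return dict(models, stage="state_tying")

def cd_training(models, features, transcripts, passes=4):
    """Re-estimate context-dependent (tied) models."""
    return dict(models, stage="cd_training")

def train(features, transcripts, rules):
    models = flat_start(features, transcripts)
    models = ci_training(models, features, transcripts)
    models = state_tying(models, rules)
    models = cd_training(models, features, transcripts)
    return models

print(train(features=None, transcripts=None, rules=None))
```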

NETWORK TRAINER TRANSCRIPTIONS
Traditional Trainer (phone level): sil hh ae v sil
Network Trainer (word level): SILENCE HAVE SILENCE
The network trainer uses word-level transcriptions, which do not impose restrictions on the word pronunciation.
The traditional trainer uses phone-level transcriptions, which encode the canonical pronunciation of each word.
Using orthographic (word-level) transcriptions removes the need to deal directly with phonetic contexts during training; a small lexicon-expansion sketch follows.
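A minimal sketch of the distinction: given a lexicon, a word-level transcription can be expanded into phones automatically, so the training transcripts themselves never need phonetic detail (the lexicon entries and function below are illustrative assumptions):

```python
# Illustrative lexicon: each word maps to one or more pronunciations.
LEXICON = {
    "HAVE": [["hh", "ae", "v"]],
    "SILENCE": [["sil"]],
}

def expand_word_transcription(words, lexicon=LEXICON):
    """Expand a word-level transcription into phones (first pronunciation only)."""
    phones = []
    for word in words:
        phones.extend(lexicon[word][0])
    return phones

# Word-level transcript used by the network trainer ...
words = ["SILENCE", "HAVE", "SILENCE"]
# ... versus the phone-level transcript a traditional trainer would require.
print(expand_word_transcription(words))   # ['sil', 'hh', 'ae', 'v', 'sil']
```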

NETWORK TRAINER SILENCE MODELS
[Figure: multi-path and single-path silence model topologies.]
The multi-path silence model is used between words.
The single-path silence model is used at utterance ends.
A toy transition-matrix sketch of the two topologies appears below.
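One way to picture the two topologies is through their state transition matrices: a single-path model must pass through every state in order, while a multi-path model adds alternative routes (e.g., skip transitions) so short inter-word silences can bypass states. The three-state size and the probabilities below are illustrative assumptions, not the thesis's actual silence models:

```python
import numpy as np

# Single-path silence: strictly left-to-right, every state must be visited.
single_path = np.array([
    [0.5, 0.5, 0.0],   # state 0 -> self or state 1
    [0.0, 0.5, 0.5],   # state 1 -> self or state 2
    [0.0, 0.0, 1.0],   # state 2 -> self (model exit handled separately)
])

# Multi-path silence: a skip transition allows very short silences between words.
multi_path = np.array([
    [0.4, 0.3, 0.3],   # state 0 may skip directly to state 2
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])

# Each row of a transition matrix is a probability distribution.
assert np.allclose(single_path.sum(axis=1), 1.0)
assert np.allclose(multi_path.sum(axis=1), 1.0)
```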

NETWORK TRAINER DURATION MODELING
The network trainer uses a silence word, which precludes the need to insert silence into the phonetic pronunciation.
The traditional trainer deals with silence between words by explicitly specifying it in the phonetic pronunciation.
[Figure: example network trainer and traditional trainer representations of inter-word silence.]

NETWORK TRAINER PRONUNCIATION MODELING
A pronunciation network precludes the need to use a single canonical pronunciation for each word.
The pronunciation network has the added advantage of being able to generalize to unseen pronunciations.
[Figure: example pronunciation network (network trainer) versus single canonical pronunciation (traditional trainer).]

NETWORK TRAINER OPTIONAL SILENCE MODELING The network trainer uses a fixed silence at utterance bounds and an optional silence between words. We use a fixed silence at utterance bounds to avoid an underestimated silence model.

NETWORK TRAINER SILENCE DURATION MODELING Network training uses a single-path silence at utterance bounds and a multi-path silence between words. We use a single-path silence at utterance bounds to avoid uncertainty in modeling silence.

EXPERIMENTS SPEECH DATABASES
[Figure: word error rate (0%–40%) versus level of difficulty for common task types: digits, continuous digits, command and control, letters and numbers, read speech, broadcast news, conversational speech.]

EXPERIMENTS TIDIGITS DATABASE
Collected by Texas Instruments in 1983 to establish a common baseline for connected digit recognition tasks.
Includes digits from 'zero' through 'nine' and 'oh' (an alternative pronunciation for 'zero').
The corpus consists of 326 speakers (111 men, 114 women and 101 children).

EXPERIMENTS TIDIGITS: WER COMPARISON

Stage                 WER     Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   7.7%    0.1%             2.5%            5.0%
Network Trainer       7.6%    0.1%             2.4%            5.0%

The network trainer achieves comparable performance to the traditional trainer.
The network trainer converges in word error rate to the traditional trainer.

EXPERIMENTS TIDIGITS: LIKELIHOOD COMPARISON
[Figure: average log likelihood versus training iterations for the network trainer (dashed) and the traditional trainer (solid).]

EXPERIMENTS ALPHADIGITS (AD) DATABASE
Collected by the Oregon Graduate Institute (OGI) using the CSLU T1 data collection system.
Includes letters ('a' through 'z') and numbers ('zero' through 'nine' and 'oh').
The database consists of 2,983 speakers (1,419 men, 1,533 women and 30 children).

EXPERIMENTS AD: WER COMPARISON

Stage                 WER      Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   38.0%    0.8%             3.0%            34.2%
Network Trainer       35.3%    0.8%             2.2%            34.2%

The network trainer achieves comparable performance to the traditional trainer.
The network trainer converges in word error rate to the traditional trainer.

EXPERIMENTS AD: LIKELIHOOD COMPARISON
[Figure: average log likelihood versus training iterations for the network trainer (dashed) and the traditional trainer (solid).]

EXPERIMENTS RESOURCE MANAGEMENT (RM) DATABASE
Collected by the Defense Advanced Research Projects Agency (DARPA).
Includes a collection of spoken sentences pertaining to a naval resource management task.
The database consists of 80 speakers, each reading two 'dialect' sentences plus 40 sentences from the RM text corpus.

EXPERIMENTS RM: WER COMPARISON

Stage                 WER      Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   25.7%    1.9%             6.7%            17.1%
Network Trainer       27.5%    2.6%             7.1%            17.9%

The network trainer achieves comparable performance to the traditional trainer.
The 1.8% absolute degradation in performance is not statistically significant (MAPSSWE test).

EXPERIMENTS RM: LIKELIHOOD COMPARISON
[Figure: average log likelihood versus training iterations for the network trainer (dashed) and the traditional trainer (solid).]

CONCLUSIONS SUMMARY
Explored the effectiveness of a novel training recipe in the re-estimation process of speech recognition systems.
Analyzed performance on three databases:
For TIDigits, at 7.6% WER, the network trainer was better by about 0.1% absolute.
For OGI Alphadigits, at 35.3% WER, the network trainer was better by about 2.7% absolute.
For Resource Management, at 27.5% WER, performance degraded by about 1.8% absolute (not statistically significant).

CONCLUSIONS FUTURE WORK
The results presented use single-mixture context-independent models for training and recognition.
An efficient tree-based decoder is currently under development, and context-dependent results are planned.
The databases presented all use a single pronunciation for each word in the lexicon.
The ability to run large databases like Switchboard, which have multiple pronunciations, requires a tree-based decoder.

APPENDIX PROGRAM OF STUDY

Course No.   Title                                 Semester
CS 8990      Probabilistic Expert Systems          Spring 2000
ST 8253      Linear Regression                     Fall 2000
ECE 8990     Pattern Recognition                   Spring 2001
ECE 8990     Information Theory                    Spring 2001
CS 8990      Reinforcement Learning                Fall 2001
CS 8663      Neural Computing                      Fall 2001
ECE 8990     Random Signals and Systems            Fall 2001
ECE 8990     Fundamentals of Speech Recognition    Spring 2002
ECE 8000     Research/Thesis

APPENDIX ACKNOWLEDGEMENTS
I would like to thank Dr. Joe Picone for his mentoring and guidance throughout my graduate program. I would also like to thank Jon Hamaker for his valuable suggestions throughout my thesis. Finally, I would like to thank my co-workers at the Institute for Signal and Information Processing (ISIP) for all their help.
