Hybrid Systems for Continuous Speech Recognition Issac Alphonso Institute for Signal and Information Processing Mississippi State University

Abstract Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs suffer from an inability to learn discriminative information and are prone to over-fitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. SVMs have been shown to provide significant improvements in performance on small pattern recognition tasks compared to a number of conventional approaches. In this presentation, I will describe some of the work that I have done in implementing a kernel-based speech recognition system (this builds on work done by Aravind Ganapathiraju). I will then describe our work in using kernel-based machines as acoustic models in large vocabulary speech recognition systems. Finally, I will show that SVMs perform better than Gaussian mixture-based HMMs in open-loop recognition.

Bio Issac Alphonso is an M.S. graduate of the Department of Electrical and Computer Engineering at Mississippi State University (MSU), where he worked under the supervision of Dr. Joe Picone. He has been a member of the Institute for Signal and Information Processing (ISIP) at MSU since beginning his graduate studies. Mr. Alphonso's work as a graduate student has revolved around exploring new acoustic modeling techniques for continuous speech recognition systems. His most recent work has been the implementation of a hybrid hierarchical decoder that employs kernel-based techniques, such as support vector machines, to replace the underlying Gaussian distributions in hidden Markov models. His thesis work examines a new network training framework that reduces the complexity of the training process while retaining the robustness of the expectation-maximization based supervised training framework.

Outline
What we do and how we fit in the big picture
The acoustic modeling problem for speech
Structural risk minimization
Support vector classifiers
Coupling vector machines to ASR systems
Proof of concept and experiments

Technology
Focus: speech recognition
First public-domain LVCSR system
Goal: accelerate research
Extensible, modular design (C++, Java)
Easy to use (docs, tutorials, toolkits)
Benefits: technology, standard benchmarks

Approach
[Diagram: three toolkit families and their strengths]
Research tools (Matlab, Octave, Python): rapid prototyping, "fair" evaluations, ease of use, lightweight programming
ASR systems (HTK, SPHINX, CSLU): efficiency in memory, hyper-real-time training, parallel processing, data-intensive workloads
ISIP: IFCs, Java apps, toolkits

ASR Problem
The front-end maintains the information important for modeling in a reduced parameter set.
The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
The search engine uses the knowledge sources and models to choose among competing hypotheses.

Acoustic Confusability
Requires reasoning under uncertainty!
Regions of overlap represent classification error.
Reduce overlap by introducing acoustic and linguistic context.
[Figure: comparison of "aa" in "lOck" and "iy" in "bEAt" for SWB]

Probabilistic Formulation
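In the standard Bayesian formulation, presumably what this slide showed, recognition is the usual noisy-channel decomposition:

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\,P(W)
```

where A is the acoustic observation sequence, P(A | W) is the acoustic model, and P(W) is the language model; P(A) is constant across hypotheses and can be dropped.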

Acoustic Modeling - HMMs
HMMs model temporal variation in the transition probabilities of the state machine.
GMM emission densities are used to account for variations in speaker, accent, and pronunciation.
Sharing model parameters is a common strategy to reduce complexity.
[Figure: left-to-right HMM with states s0 through s4; word models for THREE, TWO, FIVE, EIGHT]

Hierarchical Search
Each node in the hierarchy can dynamically expand to explore sub-networks at the next level.
HMMs are employed at the lowest level of the search hierarchy.
Word networks can generalize to unseen pronunciation variants in the data.

Statistical Models
Each state in the HMM is associated with a statistical model (except the non-emitting start and stop states).
The statistical model can implement any pdf that follows a defined interface contract.
The statistical model can transparently take the form of a GMM or SVM.
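A minimal sketch of what such an interface contract might look like (the actual ISIP implementation is in C++; the class and method names here are hypothetical):

```python
from abc import ABC, abstractmethod
import numpy as np

class StatisticalModel(ABC):
    """Interface contract: any emission model must be able to score a frame."""
    @abstractmethod
    def log_likelihood(self, x: np.ndarray) -> float:
        ...

class DiagonalGMM(StatisticalModel):
    """Gaussian mixture with diagonal covariances."""
    def __init__(self, weights, means, variances):
        # weights: (K,), means/variances: (K, d)
        self.w, self.mu, self.var = weights, means, variances

    def log_likelihood(self, x):
        ll = (np.log(self.w)
              - 0.5 * np.sum(np.log(2.0 * np.pi * self.var), axis=1)
              - 0.5 * np.sum((x - self.mu) ** 2 / self.var, axis=1))
        m = ll.max()  # log-sum-exp over mixture components
        return float(m + np.log(np.exp(ll - m).sum()))

class SigmoidSVM(StatisticalModel):
    """SVM distance mapped through a sigmoid to a posterior-like score."""
    def __init__(self, svm, a, b):
        self.svm, self.a, self.b = svm, a, b  # svm: any object with decision_function

    def log_likelihood(self, x):
        f = float(self.svm.decision_function(x.reshape(1, -1))[0])
        return -np.log1p(np.exp(self.a * f + self.b))  # log of 1/(1+exp(a*f+b))
```

The decoder only ever calls log_likelihood, which is what lets the GMM and SVM be swapped transparently.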

Maximum Likelihood Training
Data-driven modeling supervised only by a word-level transcription.
Approach: maximum likelihood estimation.
The EM algorithm is used to improve our estimates: convergence to a local maximum is guaranteed, but there is no guard against overfitting!
Computationally efficient training algorithms (Forward-Backward) have been crucial.
Decision trees are used to optimize parameter sharing, minimize system complexity, and integrate additional linguistic knowledge.
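As a concrete illustration of why Forward-Backward matters, here is a minimal log-space forward pass (a sketch, not the ISIP implementation; the backward pass is symmetric, and together they yield the EM state occupancies):

```python
import numpy as np

def forward_log_likelihood(log_A, log_pi, log_B):
    """Forward algorithm in log space.

    log_A:  (S, S) log transition probabilities
    log_pi: (S,)   log initial-state probabilities
    log_B:  (T, S) per-frame log emission scores, e.g. from GMMs
    Returns log P(O | model) in O(T * S^2) rather than O(S^T).
    """
    T, S = log_B.shape
    alpha = log_pi + log_B[0]
    for t in range(1, T):
        # log-sum-exp over predecessor states; exp(log_A) recovers A
        m = alpha.max()
        alpha = np.log(np.exp(alpha - m) @ np.exp(log_A)) + m + log_B[t]
    m = alpha.max()
    return float(m + np.log(np.exp(alpha - m).sum()))
```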

Drawbacks of Current Approach
ML convergence does not translate to optimal classification.
Errors arise from incorrect modeling assumptions.
Finding the optimal decision boundary requires only one parameter!

Drawbacks of Current Approach
Data are not separable by a hyperplane: a nonlinear classifier is needed.
Gaussian MLE models tend toward the center of mass; overtraining leads to poor generalization.

Structural Risk Minimization
The VC dimension is a measure of the complexity of the learning machine.
A higher VC dimension gives a looser bound on the actual risk, thus penalizing a more complex model (Vapnik).
Expected risk: not possible to estimate, since P(x,y) is unknown.
Empirical risk: related to the expected risk through the VC dimension, h.
Approach: choose the machine that gives the least upper bound on the actual risk.
[Figure: bound on the expected risk as the sum of the empirical risk and the VC confidence, plotted against VC dimension h; the expected-risk optimum lies where the sum is minimized]
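The risk formulas on this slide are presumably the standard ones (following Burges' tutorial, cited in the references):

```latex
R(\alpha) = \int \tfrac{1}{2}\,\lvert y - f(x,\alpha)\rvert \, dP(x,y), \qquad
R_{\mathrm{emp}}(\alpha) = \frac{1}{2\ell}\sum_{i=1}^{\ell} \lvert y_i - f(x_i,\alpha)\rvert
```

and, with probability at least 1 - eta,

```latex
R(\alpha) \le R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\,\bigl(\log(2\ell/h) + 1\bigr) - \log(\eta/4)}{\ell}}
```

where the square-root term is the VC confidence: it grows with the VC dimension h, penalizing more complex machines.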

Support Vector Machines
Hyperplanes C0 to C2 achieve zero empirical risk; C0 generalizes optimally.
The data points that define the boundary are called support vectors.
Optimization (separable data): quadratic optimization of a Lagrange functional minimizes the risk criterion (maximizes the margin). Only a small portion of the training points become support vectors.
[Figure: two classes with separating hyperplanes; margin hyperplanes H1 and H2 flank the optimal classifier C0, oriented along the weight vector w]
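In the standard separable-case formulation, which the slide's equations presumably followed, the hyperplane and constraints are

```latex
\mathbf{w}\cdot\mathbf{x} + b = 0, \qquad
y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \;\; \forall i,
```

maximizing the margin 2/||w|| is equivalent to minimizing (1/2)||w||^2 subject to these constraints, and the final classifier is

```latex
f(\mathbf{x}) = \operatorname{sign}\Bigl(\sum_{i} \alpha_i y_i\, (\mathbf{x}_i \cdot \mathbf{x}) + b\Bigr),
```

where only the support vectors have nonzero Lagrange multipliers alpha_i.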

SVMs as Nonlinear Classifiers
Data for practical applications are typically not separable using a hyperplane in the original input feature space.
Transform the data to a higher dimension where a hyperplane classifier is sufficient to model the decision surface.
Kernels are used for this transformation.
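The final classifier, in the standard kernel form, simply replaces the inner product with a kernel evaluation:

```latex
f(\mathbf{x}) = \operatorname{sign}\Bigl(\sum_{i} \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b\Bigr),
\qquad \text{e.g. } K(\mathbf{x}, \mathbf{z}) = \exp\bigl(-\gamma\,\lVert \mathbf{x} - \mathbf{z}\rVert^2\bigr).
```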

Experimental Progression
Proof of concept on speech classification using the Deterding vowel corpus
Coupling the SVM classifier to the ASR system
Results on the OGI Alphadigits corpus

Vowel Classification
Deterding vowel data: 11 vowels spoken in an "h*d" context; 10 log-area parameters; 528 train, 462 test vectors.

Approach                    % Error    # Parameters
SVM: polynomial kernels       49%
K-nearest neighbor            44%
Gaussian node network         44%
SVM: RBF kernels              35%       83 SVs
Separable mixture models      30%
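For reference, a minimal modern reproduction of the RBF-kernel experiment using scikit-learn (the original work used its own SVM trainer; the file names and hyperparameters below are placeholders):

```python
import numpy as np
from sklearn.svm import SVC

# Deterding vowel data: 10 log-area features, 11 vowel classes,
# 528 train / 462 test vectors. Loading is assumed; any array
# pair (X, y) with those shapes works here.
X_train, y_train = np.load("vowel_train_X.npy"), np.load("vowel_train_y.npy")
X_test, y_test = np.load("vowel_test_X.npy"), np.load("vowel_test_y.npy")

clf = SVC(kernel="rbf", gamma=0.5, C=10.0)  # hyperparameters illustrative only
clf.fit(X_train, y_train)
error = 1.0 - clf.score(X_test, y_test)
print(f"test error: {error:.1%}, support vectors: {clf.n_support_.sum()}")
```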

Coupling to ASR
Data size: 30 million frames of data in the training set. Solution: segmental phone models.
Source for segmental data: use the HMM system in a bootstrap procedure. A segment-based decoder could also be built.
Probabilistic decoder coupling: SVM outputs are converted to posteriors with a sigmoid fit.
[Figure: a phone segment ("hh aa r y uw") of k frames split into three regions of 0.3*k, 0.4*k, and 0.3*k frames; the mean of each region forms the segmental feature vector]
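A sketch of the segmental feature computation in the figure, assuming a (k x d) matrix of frame features per phone segment (the function name and guard logic are mine):

```python
import numpy as np

def segmental_features(frames: np.ndarray) -> np.ndarray:
    """frames: (k, d) array for one phone segment, k >= 3.
    Returns a 3*d vector: region means over the first 30%,
    middle 40%, and last 30% of the segment."""
    k = frames.shape[0]
    b1 = max(1, int(round(0.3 * k)))            # end of region 1
    b2 = min(k - 1, max(b1 + 1, int(round(0.7 * k))))  # end of region 2
    regions = (frames[:b1], frames[b1:b2], frames[b2:])
    return np.concatenate([r.mean(axis=0) for r in regions])
```

Collapsing each variable-length segment to a fixed-length vector is what lets a fixed-dimension classifier like the SVM replace the frame-by-frame GMM.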

Coupling to ASR System
[Block diagram: features (mel-cepstra) feed HMM recognition, which produces segment information and an N-best list; the segmental converter turns these into segmental features, and the hybrid decoder produces the final hypothesis]

N-Best Rescoring
A word-internal N-gram decoder is used to generate the N-best word-graphs.
The word-graphs carry the HMM and LM scores, which are used in the rescoring process.
The SVM score, computed during rescoring, is used as an additional knowledge source.
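The rescoring arithmetic amounts to a log-linear combination of the knowledge sources; a sketch, with illustrative weights and field names (in practice the weights are tuned on held-out data):

```python
def rescore(nbest, lm_weight=12.0, svm_weight=1.0):
    """nbest: list of dicts with 'hmm', 'lm', and 'svm' log scores.
    Returns the entry with the best combined score."""
    def combined(entry):
        return entry["hmm"] + lm_weight * entry["lm"] + svm_weight * entry["svm"]
    return max(nbest, key=combined)
```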

Alphadigit Recognition
OGI Alphadigits: continuous, telephone-bandwidth letters and numbers (e.g., "A19B4E").
3329 utterances were rescored using 10-best lists generated by the HMM decoder.
SVMs require a sigmoid posterior estimate to produce likelihoods; the sigmoid parameters are estimated from a large held-out set.
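A sketch of that sigmoid posterior estimate: Platt's model is p(y=1 | f) = 1 / (1 + exp(A*f + B)); here the parameters are fit with an off-the-shelf logistic regression on held-out SVM distances rather than Platt's original procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_sigmoid(f_heldout: np.ndarray, y_heldout: np.ndarray):
    """Fit p(y=1 | f) = 1 / (1 + exp(A*f + B)) from held-out SVM
    distances f and 0/1 labels y. Logistic regression learns
    coefficients equivalent to (-A, -B)."""
    lr = LogisticRegression()
    lr.fit(f_heldout.reshape(-1, 1), y_heldout)
    A, B = -lr.coef_[0, 0], -lr.intercept_[0]
    return A, B

def posterior(f, A, B):
    return 1.0 / (1.0 + np.exp(A * f + B))
```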

SVM Alphadigit Recognition

Transcription    Segmentation    SVM      HMM
N-best           Hypothesis      11.0%    11.9%
N-best + Ref     Reference        3.3%     6.3%

The HMM system uses cross-word state-tied triphones with 16-mixture Gaussian models.
The SVM system uses monophone models with segmental features.
A system combination experiment yields another 1% reduction in error.

Summary
We are the first speech group to apply kernel machines to the acoustic modeling problem.
Performance exceeds that of the HMM/GMM system, with a bit of HMM interaction.
Algorithms that scale to increased data sizes are key.

Acknowledgments
Collaborators: Naveen Parihar and Joe Picone at Mississippi State
Consultants: Aravind Ganapathiraju (Conversay) and Jonathan Hamaker (Microsoft)

References
A. Ganapathiraju, "Support Vector Machines for Speech Recognition," Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, 2002.
J. Platt, "Fast Training of Support Vector Machines using Sequential Minimal Optimization," Advances in Kernel Methods, MIT Press, 1999.
V.N. Vapnik, Statistical Learning Theory, John Wiley, New York, NY, USA, 1998.
C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," AT&T Bell Laboratories, November 1999.

Accomplishments
Developed a set of Java-based graphical tools used to demonstrate fundamental concepts in signal processing and speech recognition (…s/applets/).
Developed a set of Tcl-Tk based graphical tools used to transcribe, segment, and analyze speech recognition databases.
Developed a generalized network-based speech recognition trainer, which is part of my master's thesis work.
Developed a hybrid HMM/SVM system used to rescore N-best word-graphs, based on work by Aravind Ganapathiraju.
Worked as part of a team to design and implement a public-domain HMM-based speech recognition system.