1
A Sparse Modeling Approach to Speech Recognition Using Kernel Machines Jon Hamaker hamaker@isip.msstate.edu Institute for Signal and Information Processing Mississippi State University
2
Abstract Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs suffer from an inability to learn discriminative information and are prone to over-fitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. SVMs have been shown to provide significant improvements in performance on small pattern recognition tasks compared to a number of conventional approaches. SVMs, however, require ad hoc (and unreliable) methods to couple them to probabilistic learning machines. Probabilistic Bayesian learning machines, such as the relevance vector machine (RVM), are fairly new approaches that attempt to overcome the deficiencies of SVMs by explicitly accounting for sparsity and statistics in their formulation. In this presentation, we describe both of these modeling approaches in brief. We then describe our work to integrate them as acoustic models in large vocabulary speech recognition systems. Particular attention is given to algorithms for training these learning machines on large corpora. In each case, we find that both SVM- and RVM-based systems perform better than Gaussian mixture-based HMMs in open-loop recognition. We further show that the RVM-based solution performs on par with the SVM system using an order of magnitude fewer parameters. We conclude with a discussion of the remaining hurdles for providing this technology in a form amenable to current state-of-the-art recognizers.
3
Bio Jon Hamaker is a Ph.D. candidate in the Department of Electrical and Computer Engineering at Mississippi State University under the supervision of Dr. Joe Picone. He has been a senior member of the Institute for Signal and Information Processing (ISIP) at MSU since 1996. Mr. Hamaker's research work has revolved around automatic structural analysis and optimization methods for acoustic modeling in speech recognition systems. His most recent work has been in the application of kernel machines as replacements for the underlying Gaussian distribution in hidden Markov acoustic models. His dissertation work compares the popular support vector machine with the relatively new relevance vector machine in the context of a speech recognition system. Mr. Hamaker has co-authored 4 journal papers (2 under review), 22 conference papers, and 3 invited presentations during his graduate studies at Mississippi State (http://www.isip.msstate.edu/publications). He also spent two summers as an intern at Microsoft in the recognition engine group.
4
Outline
- The acoustic modeling problem for speech
- Current state-of-the-art
- Discriminative approaches
- Structural optimization and Occam's Razor
- Support vector classifiers
- Relevance vector classifiers
- Coupling vector machines to ASR systems
- Scaling relevance vector methods to "real" problems
- Extensions of this work
5
ASR Problem
- The front-end maintains information important for modeling in a reduced parameter set.
- The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
- The search engine uses knowledge sources and models to choose amongst competing hypotheses.
[Block diagram: Input Speech -> Acoustic Front-End -> Search -> Recognized Utterance, with the Statistical Acoustic Models p(A|W) and Language Model p(W) as knowledge sources; the acoustic models are the focus of this work.]
6
Acoustic Confusability
- Requires reasoning under uncertainty!
- Regions of overlap represent classification error.
- Reduce overlap by introducing acoustic and linguistic context.
[Figure: comparison of "aa" in "lOck" and "iy" in "bEAt" for SWB.]
7
Probabilistic Formulation
To deal with the uncertainty, we typically formulate speech as a probabilistic problem:
- Objective: minimize the word error rate by maximizing P(W|A)
- Approach: maximize P(A|W) during training
- Components:
  - P(A|W): acoustic model
  - P(W): language model
  - P(A): acoustic probability (ignored during maximization)
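The equation behind this formulation is the usual Bayes decomposition; the slide's equation image is not in the transcript, so it is reconstructed here in standard form:

    \hat{W} = \arg\max_{W} P(W \mid A)
            = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
            = \arg\max_{W} P(A \mid W)\, P(W)

Since P(A) does not depend on W, it drops out of the maximization, which is why it is "ignored" above.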
8
Acoustic Modeling - HMMs
- HMMs model temporal variation in the transition probabilities of the state machine.
- GMM emission densities are used to account for variations in speaker, accent, and pronunciation.
- Sharing model parameters is a common strategy to reduce complexity.
[Figure: left-to-right HMM with states s0-s4; word models for THREE, TWO, FIVE, EIGHT.]
9
Maximum Likelihood Training
- Data-driven modeling supervised only from a word-level transcription.
- Approach: maximum likelihood estimation. The EM algorithm is used to improve our estimates:
  - Guaranteed convergence to a local maximum
  - No guard against overfitting!
- Computationally efficient training algorithms (Forward-Backward) have been crucial.
- Decision trees are used to optimize parameter sharing, minimize system complexity, and integrate additional linguistic knowledge.
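As a concrete illustration of the likelihood computation underlying Forward-Backward training, here is a minimal NumPy/SciPy sketch of the log-domain forward pass for an HMM with Gaussian-mixture emissions. It is a generic sketch, not the system described in the talk; the model sizes and parameter layout (per-state tuples of mixture weights, means, and diagonal covariances) are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(x, weights, means, covs):
    """Log-likelihood of one observation under a diagonal-covariance GMM."""
    comps = [np.log(w) + multivariate_normal.logpdf(x, m, np.diag(c))
             for w, m, c in zip(weights, means, covs)]
    return np.logaddexp.reduce(comps)

def forward_loglik(obs, log_A, log_pi, emission_params):
    """Log-domain forward algorithm: returns log p(obs | model).

    obs             : (T, d) observation sequence
    log_A           : (S, S) log transition matrix
    log_pi          : (S,) log initial-state probabilities
    emission_params : list of (weights, means, covs) per state
    """
    T, S = len(obs), len(log_pi)
    log_alpha = np.full((T, S), -np.inf)
    for s in range(S):
        log_alpha[0, s] = log_pi[s] + gmm_loglik(obs[0], *emission_params[s])
    for t in range(1, T):
        for s in range(S):
            log_alpha[t, s] = (np.logaddexp.reduce(log_alpha[t - 1] + log_A[:, s])
                               + gmm_loglik(obs[t], *emission_params[s]))
    return np.logaddexp.reduce(log_alpha[-1])
```

The backward pass and the EM re-estimation formulas follow the same log-domain pattern.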
10
Drawbacks of Current Approach
- ML convergence does not translate to optimal classification.
- Error arises from incorrect modeling assumptions.
- Finding the optimal decision boundary requires only one parameter!
11
Drawbacks of Current Approach
- Data not separable by a hyperplane – a nonlinear classifier is needed.
- Gaussian MLE models tend toward the center of mass – overtraining leads to poor generalization.
12
Acoustic Modeling
Acoustic models must:
- Model the temporal progression of the speech
- Model the characteristics of the sub-word units
We would also like our models to:
- Optimally trade off discrimination and representation
- Incorporate Bayesian statistics (priors)
- Make efficient use of parameters (sparsity)
- Produce confidence measures of their predictions for higher-level decision processes
13
Paradigm Shift - Discriminative Modeling
- Discriminative training (maximum mutual information estimation).
- Essential idea: maximize the mutual-information objective – maximize the numerator (ML term), minimize the denominator (discriminative term); see the objective below.
- Discriminative modeling (e.g. ANN hybrids – Bourlard and Morgan).
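The MMIE objective the slide refers to (its equation image is not in the transcript) is, in its standard form:

    \mathcal{F}_{\mathrm{MMIE}}(\lambda) = \sum_{r} \log
        \frac{P_\lambda(A_r \mid W_r)\, P(W_r)}
             {\sum_{W} P_\lambda(A_r \mid W)\, P(W)}

Maximizing the numerator is the maximum-likelihood term; minimizing the denominator, a sum over competing hypotheses W, is the discriminative term.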
14
Research Focus
Our research: replace the Gaussian likelihood computation with a machine that incorporates notions of:
- Discrimination
- Bayesian statistics (prior information)
- Confidence
- Sparsity
all while maintaining computational efficiency.
15
ANN Hybrids
Shortcomings:
- Prone to overfitting: require cross-validation to determine when to stop training; need methods to automatically penalize overfitting.
- No substantial recognition improvements over HMM/GMM.
Architecture: the ANN provides flexible, discriminative classifiers for emission probabilities that avoid HMM independence assumptions (can use wider acoustic context). Trained using Viterbi iterative training (hard decision rule), or trained to learn Baum-Welch targets (soft decision rule).
[Figure: feed-forward network mapping an input feature vector o to posteriors P(c1|o) ... P(cn|o).]
16
Structural Optimization
- Structural optimization is often guided by an Occam's Razor approach: trading goodness of fit against model complexity.
- Examples: MDL, BIC, AIC, structural risk minimization, automatic relevance determination.
[Figure: error vs. model complexity – training-set error decreases with complexity while open-loop error has an optimum at intermediate complexity.]
17
Structural Risk Minimization
- The VC dimension is a measure of the complexity of the learning machine.
- A higher VC dimension gives a looser bound on the actual risk – thus penalizing a more complex model (Vapnik).
- Expected risk: not possible to estimate directly, since P(x,y) is unknown.
- Empirical risk: measured on the training set; related to the expected risk through the VC dimension, h.
- Approach: choose the machine that gives the least upper bound on the actual risk.
[Figure: bound on the expected risk = empirical risk + VC confidence, plotted against VC dimension h, with the optimum at intermediate h.]
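The bound the slide relies on is Vapnik's standard risk bound, reconstructed here since the equation images are not in the transcript: with probability 1 - \eta,

    R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) \;+\;
        \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\frac{\eta}{4}}{N}}

where R is the expected risk, R_emp the empirical risk, h the VC dimension, and N the number of training samples; the square-root term is the "VC confidence" in the figure above.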
18
Support Vector Machines
- Optimization for separable data: hyperplanes C0-C2 all achieve zero empirical risk; C0 generalizes optimally.
- The data points that define the boundary are called support vectors.
- Quadratic optimization of a Lagrangian functional minimizes the risk criterion (maximizes the margin). Only a small portion of the training points become support vectors.
[Figure: two classes separated by margin hyperplanes H1, H2 and candidate classifiers C1, C0, C2; C0 is the optimal (maximum-margin) classifier. The slide's hyperplane, constraint, and final-classifier equations are reconstructed below.]
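A reconstruction of the missing equations in the standard hard-margin form:

    \text{Hyperplane: } \mathbf{w} \cdot \mathbf{x} + b = 0
    \qquad
    \text{Constraints: } y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 \;\; \forall i

    \text{Objective: minimize } \tfrac{1}{2}\|\mathbf{w}\|^2
    \qquad
    \text{Final classifier: } f(\mathbf{x}) = \operatorname{sgn}\Big(\sum_{i \in SV} \alpha_i y_i\, \mathbf{x}_i \cdot \mathbf{x} + b\Big)

where the \alpha_i are the Lagrange multipliers, nonzero only for the support vectors.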
19
SVMs as Nonlinear Classifiers
- Data for practical applications are typically not separable using a hyperplane in the original input feature space.
- Transform the data to a higher-dimensional space where a hyperplane classifier is sufficient to model the decision surface.
- Kernels are used for this transformation; the final classifier replaces the inner product with a kernel evaluation (a sketch follows).
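The kernelized final classifier is f(x) = sgn(\sum_i \alpha_i y_i K(x_i, x) + b). A minimal runnable sketch using scikit-learn's SVC with an RBF kernel; the data here are synthetic and purely illustrative, and the C value is an assumption (the talk only specifies RBF kernels with gamma = 0.5 for its systems).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Synthetic, non-linearly-separable 2-D data (illustrative only).
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

# RBF-kernel SVM; gamma matches the value quoted later in the talk, C is a guess.
clf = SVC(kernel="rbf", gamma=0.5, C=10.0)
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```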
20
SVMs for Non-Separable Data
- No hyperplane could achieve zero empirical risk (in any dimension of space!).
- Recall the SRM principle: trade off empirical risk against model complexity.
- Relax the optimization constraint to allow for errors on the training set (slack variables; see below).
- A new parameter, C, must be estimated to optimally control the trade-off between training-set errors and model complexity.
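The relaxed (soft-margin) optimization, reconstructed in its standard form:

    \min_{\mathbf{w}, b, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i
    \quad \text{s.t.} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0

The slack variables \xi_i absorb training errors; C sets the trade-off between margin and training-set error.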
21
SVM Drawbacks
- Uses a binary (yes/no) decision rule.
- Generates a distance from the hyperplane, but this distance is often not a good measure of our "confidence" in the classification.
- Can produce a "probability" as a function of the distance (e.g. using sigmoid fits), but these are inadequate.
- The number of support vectors grows linearly with the size of the data set.
- Requires estimation of the trade-off parameter, C, via held-out sets.
22
Evidence Maximization
- Build a fully specified probabilistic model – incorporate prior information/beliefs as well as a notion of confidence in predictions.
- MacKay posed a special form of regularization in neural networks – sparsity.
- Evidence maximization: evaluate candidate models based on their "evidence", P(D|H_i).
- Structural optimization by maximizing the evidence across all candidate models!
- Steeped in Gaussian approximations.
23
Evidence Framework
- Evidence approximation: the likelihood of the data given the best-fit parameter set, multiplied by a penalty that measures how well our posterior model fits our prior assumptions.
- We can set the prior in favor of sparse, smooth models!
[Figure: prior P(w|H_i) and posterior P(w|D,H_i) distributions over the weights w.]
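The slide's equations are not in the transcript; the evidence approximation it refers to is, schematically, MacKay's standard form:

    P(D \mid H_i) = \int P(D \mid \mathbf{w}, H_i)\, P(\mathbf{w} \mid H_i)\, d\mathbf{w}
    \;\approx\; P(D \mid \mathbf{w}_{\mathrm{MP}}, H_i)\;
    \underbrace{P(\mathbf{w}_{\mathrm{MP}} \mid H_i)\, \Delta\mathbf{w}_{\text{posterior}}}_{\text{penalty (Occam factor)}}

where \Delta w_posterior is the width of the posterior in the figure above: the narrower the posterior relative to the prior, the larger the penalty.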
24
Relevance Vector Machines
- A kernel-based learning machine.
- Incorporates an automatic relevance determination (ARD) prior over each weight (MacKay).
- A flat (non-informative) prior over the hyperparameters completes the Bayesian specification.
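A reconstruction of the model the slide describes, in Tipping's standard notation (the slide's equation images are not in the transcript):

    y(\mathbf{x}; \mathbf{w}) = \sum_{i=1}^{N} w_i\, K(\mathbf{x}, \mathbf{x}_i) + w_0
    \qquad
    p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_i \mathcal{N}(w_i \mid 0, \alpha_i^{-1})

with a separate precision hyperparameter \alpha_i per weight; driving \alpha_i \to \infty prunes the corresponding basis function, which is the source of sparsity.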
25
- The goal in training becomes finding the hyperparameters that maximize the evidence (marginal likelihood) of the data.
- Estimation of the "sparsity" parameters is inherent in the optimization – no need for a held-out set!
- A closed-form solution to this maximization problem is not available; rather, we iteratively re-estimate the hyperparameters.
26
Laplace's Method
- Fix the hyperparameters and estimate w (e.g. gradient descent).
- Use the Hessian to approximate the covariance of a Gaussian posterior over the weights, centered at the mode w_MP.
- With w_MP and the inverse Hessian as the mean and covariance, respectively, of the Gaussian approximation, we find the new hyperparameters by maximizing the approximated evidence.
- The method is O(N^2) in memory and O(N^3) in time.
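A minimal NumPy sketch of one Laplace-approximation round for RVM classification, following the standard Tipping formulation. This is an illustrative re-derivation, not the presenter's code: step-size control, pruning of large-alpha weights, and convergence tests are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_rvm_round(Phi, t, alpha, n_newton=25):
    """One round of RVM-classification training via Laplace's method.

    Phi   : (N, M) design matrix (e.g. kernel evaluations)
    t     : (N,) binary targets in {0, 1}
    alpha : (M,) current weight precisions (the ARD hyperparameters)
    Returns (w_mp, Sigma, alpha_new).
    """
    N, M = Phi.shape
    A = np.diag(alpha)
    w = np.zeros(M)

    # 1) Find the posterior mode w_MP by Newton's method (IRLS).
    for _ in range(n_newton):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (t - y) - alpha * w        # gradient of the log posterior
        B = y * (1.0 - y)                         # Bernoulli variances
        H = Phi.T @ (Phi * B[:, None]) + A        # negative Hessian
        w = w + np.linalg.solve(H, grad)

    # 2) Gaussian (Laplace) approximation at the mode: covariance = inverse Hessian.
    y = sigmoid(Phi @ w)
    B = y * (1.0 - y)
    H = Phi.T @ (Phi * B[:, None]) + A
    Sigma = np.linalg.inv(H)                      # the O(N^3) step the slide mentions

    # 3) Re-estimate the hyperparameters (MacKay / Tipping update).
    gamma = 1.0 - alpha * np.diag(Sigma)          # "well-determinedness" of each weight
    alpha_new = gamma / (w ** 2 + 1e-12)
    return w, Sigma, alpha_new
```

Iterating this round until the alphas stabilize is the re-estimation loop described on the previous slide.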
27
RVMs Compared to SVMs
RVM:
- Data: class labels (0, 1)
- Goal: learn the posterior, P(t=1|x)
- Structural optimization: hyperprior distribution encourages sparsity
- Training: iterative – O(N^3)
SVM:
- Data: class labels (-1, 1)
- Goal: find the optimal decision surface under constraints
- Structural optimization: trade-off parameter that must be estimated
- Training: quadratic programming – O(N^2)
28
Simple Example
29
ML Comparison
30
SVM Comparison
31
SVM With Sigmoid Posterior Comparison
32
RVM Comparison
33
Experimental Progression
- Proof of concept on speech classification data
- Coupling classifiers to the ASR system
- Reduced-set tests on the Alphadigits task
- Algorithms for scaling up RVM classifiers
- Further tests on the Alphadigits task (still not the full training set though!)
- New work aiming at larger data sets and HMM decoupling
34
Vowel Classification
Deterding vowel data: 11 vowels spoken in "h*d" context; 10 log-area parameters; 528 train, 462 SI test.

Approach                     % Error   # Parameters
SVM: polynomial kernels      49%
K-nearest neighbor           44%
Gaussian node network        44%
SVM: RBF kernels             35%       83 SVs
Separable mixture models     30%
RVM: RBF kernels             30%       13 RVs
35
Coupling to ASR
- Data size: 30 million frames of data in the training set. Solution: segmental phone models.
- Source for segmental data: use the HMM system in a bootstrap procedure (could also build a segment-based decoder).
- Probabilistic decoder coupling: SVMs use a sigmoid-fit posterior; RVMs are naturally probabilistic.
[Figure: a k-frame phone segment ("hh aw aa r y uw") split into three regions of 0.3k, 0.4k, and 0.3k frames; the mean of each region forms the segmental representation.]
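A small sketch of the segmental feature construction described above. The 0.3/0.4/0.3 split and per-region means come from the slide; the concatenation order and the frame dimensionality used in the example are assumptions.

```python
import numpy as np

def segmental_features(frames, proportions=(0.3, 0.4, 0.3)):
    """Map a variable-length phone segment to a fixed-length vector.

    frames      : (k, d) array of frame-level features (e.g. mel-cepstra)
    proportions : fraction of frames per region (slide: 0.3 / 0.4 / 0.3)
    Returns the concatenation of the per-region mean vectors, shape (3*d,).
    """
    k = len(frames)
    # Region boundaries from the cumulative proportions.
    bounds = np.rint(np.cumsum((0.0,) + proportions) * k).astype(int)
    bounds[-1] = k                                   # guard against rounding
    means = [frames[a:b].mean(axis=0) for a, b in zip(bounds[:-1], bounds[1:])]
    return np.concatenate(means)

# Example: a 20-frame segment of 13-dimensional mel-cepstra.
seg = np.random.randn(20, 13)
print(segmental_features(seg).shape)                 # -> (39,)
```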
36
Coupling to ASR System
[Block diagram: mel-cepstral features feed HMM recognition, which produces an N-best list and segment information; a segmental converter turns these into segmental features for the hybrid decoder, which outputs the final hypothesis.]
37
Alphadigit Recognition
- OGI Alphadigits: continuous, telephone-bandwidth letters and numbers ("A19B4E").
- Reduced training set size for the RVM comparison: 2000 training segments per phone model. Could not, at this point, run larger sets efficiently.
- 3329 utterances using 10-best lists generated by the HMM decoder.
- SVM and RVM system architectures are nearly identical: RBF kernels with gamma = 0.5.
- The SVM requires the sigmoid posterior estimate to produce likelihoods – sigmoid parameters estimated from a large held-out set.
38
SVM Alphadigit Recognition

Transcription   Segmentation   SVM     HMM
N-best          Hypothesis     11.0%   11.9%
N-best + Ref    Reference      3.3%    6.3%

- The HMM system is cross-word state-tied triphones with 16-mixture Gaussian models.
- The SVM system has monophone models with segmental features.
- A system combination experiment yields another 1% reduction in error.
39
SVM/RVM Alphadigit Comparison
- RVMs yield a large reduction in the parameter count while attaining superior performance.
- Computational cost for RVMs is mainly in training, but is still prohibitive for larger sets.

Approach   Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM        16.4%        257                 0.5 hours       30 mins
RVM        16.2%        12                  30 days         1 min
40
Scaling Up
Central to RVM training is the inversion of an MxM Hessian matrix: an O(N^3) operation initially, since the full model starts with one basis function per training point. Solutions:
- Constructive approach: start with an empty model and iteratively add candidate parameters. M is typically much smaller than N.
- Divide-and-conquer approach: divide the complete problem into a set of sub-problems; iteratively refine the candidate parameter set according to the sub-problem solutions. M is user-defined.
41
Constructive Approach
- Tipping and Faul (MSR-Cambridge).
- Define the marginal likelihood as a function of a single hyperparameter: it has a unique maximum with respect to that hyperparameter, which can be found analytically.
- The result gives a set of rules for adding vectors to the model, removing vectors from the model, or updating parameters in the model.
42
Constructive Approach Algorithm

    prune all parameters
    while not converged:
        for each parameter:
            if parameter is pruned:
                checkAddRule
            else:
                checkPruneRule
                checkUpdateRule
        update model

- Begin with all weights set to zero and iteratively construct an optimal model without evaluating the full NxN inverse.
- Formulated for RVM regression – can have oscillatory behavior for classification.
- The rule subroutines require the full design matrix: an NxN storage requirement.
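For concreteness, a compact NumPy sketch of the sequential (constructive) marginal-likelihood updates for the regression case, following the published Tipping-Faul add/prune/update rules. This is an illustrative reimplementation with a fixed noise variance and a simplified basis-selection heuristic, not the presenter's code, and it keeps the full design matrix in memory exactly as the slide notes.

```python
import numpy as np

def fast_rvm_regression(Phi, t, noise_var=0.01, n_iters=200):
    """Sequential sparse Bayesian learning (constructive RVM) for regression.

    Phi : (N, M) design matrix, t : (N,) targets.
    Returns (active, alpha, mu): active basis indices, their precisions,
    and the posterior mean weights.
    """
    N, M = Phi.shape
    alpha = np.full(M, np.inf)                    # all basis functions start pruned
    active = []

    # Initialise with the single best-aligned basis function.
    i0 = int(np.argmax((Phi.T @ t) ** 2 / (Phi ** 2).sum(axis=0)))
    p = Phi[:, i0]
    alpha[i0] = (p @ p) / max((p @ t) ** 2 / (p @ p) - noise_var, 1e-12)
    active.append(i0)

    for _ in range(n_iters):
        Phi_a = Phi[:, active]
        # C = noise*I + Phi_a diag(1/alpha_a) Phi_a^T  (N x N; sketch only)
        C = noise_var * np.eye(N) + (Phi_a / alpha[active]) @ Phi_a.T
        Cinv_Phi = np.linalg.solve(C, Phi)
        S = np.einsum("nm,nm->m", Phi, Cinv_Phi)  # S_m = phi_m^T C^-1 phi_m
        Q = Phi.T @ np.linalg.solve(C, t)         # Q_m = phi_m^T C^-1 t

        # "Leave-this-basis-out" quantities s, q for bases already in the model.
        s, q = S.copy(), Q.copy()
        for m in active:
            denom = alpha[m] - S[m]
            s[m] = alpha[m] * S[m] / denom
            q[m] = alpha[m] * Q[m] / denom

        theta = q ** 2 - s
        m = int(np.argmax(np.abs(theta)))         # simplified selection heuristic
        if theta[m] > 0:                          # add or re-estimate (checkAdd/Update)
            alpha[m] = s[m] ** 2 / theta[m]
            if m not in active:
                active.append(m)
        elif m in active and len(active) > 1:     # delete (checkPruneRule)
            alpha[m] = np.inf
            active.remove(m)

    Phi_a = Phi[:, active]
    Sigma = np.linalg.inv(np.diag(alpha[active]) + Phi_a.T @ Phi_a / noise_var)
    mu = Sigma @ Phi_a.T @ t / noise_var
    return active, alpha, mu
```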
43
Iterative Reduction Algorithm
- O(M^3) in run-time and O(MxN) in memory, where M is a user-defined parameter.
- Assumes that if P(w_k = 0 | w_{I,J}, D) is 1 then P(w_k = 0 | w, D) is also 1! Optimality?
[Diagram: a candidate pool is split into subsets 0..J; each subset is trained at iteration I, the resulting relevance vectors are pooled, and training repeats at iteration I+1.]
44
Alphadigit Recognition
- Data increased to 10000 training vectors.
- The reduction method has been trained up to 100k vectors (on a toy task); this is not possible for the constructive method.

Approach             Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM                  15.5%        994                 3 hours         1.5 hours
RVM (constructive)   14.8%        72                  5 days          5 mins
RVM (reduction)      14.8%        74                  6 days          5 mins
45
Summary
- First to apply kernel machines as acoustic models.
- Comparison of two machines that apply structural optimization to learning: SVM and RVM.
- Performance exceeds that of the HMM, but with quite a bit of HMM interaction.
- Algorithms for increased data sizes are key.
46
Decoupling the HMM
- Still want to use segmental data (data size).
- Want the kernel machine acoustic model to determine an optimal segmentation, though.
- Need a new decoder: hypothesize each phone for each possible segment.
- Pruning is a huge issue; a stack decoder is beneficial.
- Status: in development.
47
Improved Iterative Algorithm
- Same principle of operation.
- One pass over the data – much faster!
- Status: equivalent performance on all benchmarks – running on Alphadigits now.
[Diagram: the candidate pool is split into subsets 0 and 1, each is trained once, and the resulting relevance vectors are pooled.]
48
Active Learning for RVMs
- Idea: given the current model, iteratively choose a subset of points from the full training set that will improve system performance.
- Problem #1: "performance" is typically defined as classifier error rate (e.g. boosting). What about the accuracy of the posterior estimate?
- Problem #2: for kernel machines, an added training point can either assist in bettering the model performance or become part of the model itself! How do we determine which points should be added?
- Look to work in Gaussian processes (Lawrence, Seeger, Herbrich, 2003).
49
Extensions
- Not ready for prime time as an acoustic model. How else might we use the same techniques for speech?
- Online speech/noise classification? Requires adaptation methods.
- Application of automatic relevance determination to model selection for HMMs?
50
Acknowledgments
- Collaborators: Aravind Ganapathiraju and Joe Picone at Mississippi State.
- Consultants: Michael Tipping (MSR-Cambridge) and Thorsten Joachims (now at Cornell).