A Sparse Modeling Approach to Speech Recognition Based on Relevance Vector Machines

Jon Hamaker and Joseph Picone
Institute for Signal and Information Processing, Mississippi State University

Aravind Ganapathiraju
Speech Scientist, Conversay Computing Corporation

This material is based upon work supported by the National Science Foundation under Grant No. IIS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

MOTIVATION

Acoustic confusability: requires reasoning under uncertainty!
Regions of overlap represent classification error.
Reduce overlap by introducing acoustic and linguistic context.

[Figure: comparison of the distributions of "aa" in "lOck" and "iy" in "bEAt" for SWB (Switchboard)]

ACOUSTIC MODELS

Acoustic models must:
–Model the temporal progression of the speech
–Model the characteristics of the sub-word units

We would also like our models to:
–Optimally trade off discrimination and representation
–Incorporate Bayesian statistics (priors)
–Make efficient use of parameters (sparsity)
–Produce confidence measures of their predictions for higher-level decision processes

SUPPORT VECTOR MACHINES

Maximizes the margin between classes to satisfy the structural risk minimization (SRM) principle.
Balances empirical risk and generalization.
Training is carried out via quadratic optimization.
Kernels provide the means for nonlinear classification.
Many of the Lagrange multipliers go to zero – this yields sparse models.
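The sparsity claim is easy to verify empirically. Below is a minimal sketch (our illustration, not code from the slides or from the ISIP system) using scikit-learn's SVC: after quadratic optimization, only training points with nonzero multipliers survive as support vectors.

```python
# Minimal sketch (illustrative only): an RBF-kernel SVM on toy two-class
# data. Most Lagrange multipliers come out exactly zero, so the decision
# function f(x) = sum_i alpha_i y_i K(x, x_i) + b depends only on the
# remaining support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (200, 2)), rng.normal(1.0, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print("training vectors:", len(X))
print("support vectors :", int(clf.n_support_.sum()))  # typically far fewer than N
```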

DRAWBACKS OF SVMS

Uses a binary decision rule:
–Can generate a distance, but on unseen data this measure can be misleading
–Can produce a "probability" using sigmoid fits, etc., but these are inadequate

The number of support vectors grows linearly with the size of the data set.

Requires the estimation of trade-off parameters via held-out sets.

RELEVANCE VECTOR MACHINES

A kernel-based learning machine:

$$ y(x; w) = w_0 + \sum_{i=1}^{N} w_i K(x, x_i) $$

Incorporates an automatic relevance determination (ARD) prior over each weight (MacKay):

$$ p(w \mid \alpha) = \prod_{i=0}^{N} \mathcal{N}(w_i \mid 0, \alpha_i^{-1}) $$

A flat (non-informative) prior over the hyperparameters $\alpha$ completes the Bayesian specification.
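To make the ARD mechanism concrete, here is a minimal sketch. It assumes scikit-learn's ARDRegression as a stand-in for the RVM, applied to a kernel design matrix; it is not the implementation behind these slides. Each weight gets its own precision $\alpha_i$, and weights whose precision diverges are pruned.

```python
# Minimal sketch (stand-in, not the authors' code): ARD over the weights of
# a kernel expansion. Most precisions alpha_i grow without bound, their
# weights are driven to zero, and the few surviving training points act as
# "relevance vectors".
import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, (100, 1))
t = np.sinc(X).ravel() + rng.normal(0.0, 0.1, 100)

Phi = rbf_kernel(X, X, gamma=0.5)  # one basis function per training point
ard = ARDRegression(threshold_lambda=1e4).fit(Phi, t)
print("relevance vectors:", int(np.sum(ard.coef_ != 0)), "of", len(X))
```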

The goal in training becomes finding the hyperparameters that maximize the marginal likelihood:

$$ \{\hat{\alpha}, \hat{\sigma}^2\} = \arg\max_{\alpha, \sigma^2} \int p(t \mid w, \sigma^2)\, p(w \mid \alpha)\, dw $$

Estimation of the "sparsity" parameters $\alpha$ is inherent in the optimization – no need for a held-out set!

A closed-form solution to this maximization problem is not available. Rather, we iteratively re-estimate $\alpha$ and $\sigma^2$.
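For the regression case (following Tipping's formulation; this step is not shown on the slide itself), the integral above is Gaussian and evaluates in closed form; it is only the subsequent maximization over the hyperparameters that lacks a closed-form solution:

$$ p(t \mid \alpha, \sigma^2) = \mathcal{N}(t \mid 0, C), \qquad C = \sigma^2 I + \Phi A^{-1} \Phi^{T}, \qquad A = \operatorname{diag}(\alpha_0, \ldots, \alpha_N) $$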

LAPLACE'S METHOD

Fix $\alpha$ and estimate $w$ (e.g. by gradient descent) to find the most probable weights $w_{MP}$.

Use the Hessian to approximate the covariance of a Gaussian posterior of the weights centered at $w_{MP}$:

$$ \Sigma = \left( \Phi^{T} B \Phi + A \right)^{-1}, \qquad A = \operatorname{diag}(\alpha_i) $$

With $w_{MP}$ and $\Sigma$ as the mean and covariance, respectively, of the Gaussian approximation, we find $\alpha^{new}$ by finding

$$ \alpha_i^{new} = \frac{\gamma_i}{w_{MP,i}^2}, \qquad \gamma_i = 1 - \alpha_i \Sigma_{ii} $$

The method is O(N²) in memory and O(N³) in time.
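A minimal sketch of one such iteration for the classification case, assuming a logistic link and a precomputed kernel design matrix Phi (function and variable names are ours, not the authors'):

```python
# Minimal sketch (our paraphrase of Tipping 2001, not the ISIP code) of one
# Laplace + ARD update for RVM classification.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_ard_step(Phi, t, w, alpha, newton_iters=25):
    A = np.diag(alpha)
    for _ in range(newton_iters):              # inner loop: find the mode w_MP
        y = sigmoid(Phi @ w)
        g = Phi.T @ (t - y) - alpha * w        # gradient of the log posterior
        B = y * (1.0 - y)                      # logistic "noise" terms
        H = Phi.T @ (Phi * B[:, None]) + A     # negative Hessian, O(N^2) memory
        w = w + np.linalg.solve(H, g)          # Newton step, O(N^3) time
    Sigma = np.linalg.inv(H)                   # Gaussian posterior covariance
    gamma = 1.0 - alpha * np.diag(Sigma)       # how well-determined each weight is
    alpha_new = gamma / (w ** 2 + 1e-12)       # ARD re-estimate; huge alpha => prune
    return w, alpha_new
```

An outer loop alternates this update with pruning of weights whose $\alpha_i$ exceeds a threshold; the matrix solve is the O(N³) bottleneck the next slide addresses.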

CONSTRUCTIVE TRAINING

Central to this method is the inversion of an M×M Hessian matrix: an O(N³) operation initially, since M ≈ N before any weights are pruned.
–Initial experiments could use only 2-3 thousand training vectors.

Tipping and Faul have defined a constructive approach:
–Decompose the marginal likelihood into the contribution of a single basis function, $\ell(\alpha_i)$, expressed through a "sparsity" factor $s_i$ and a "quality" factor $q_i$.
–$\ell(\alpha_i)$ has a unique maximum with respect to $\alpha_i$:

$$ \alpha_i = \frac{s_i^2}{q_i^2 - s_i^2} \;\;\text{if}\;\; q_i^2 > s_i^2, \qquad \alpha_i = \infty \;\;\text{otherwise} $$

–The result gives a set of rules for adding vectors to the model, removing vectors from the model, or updating parameters already in the model.
–Begin with all weights set to zero and iteratively construct an optimal model without ever evaluating the full N×N matrix.
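The resulting decision rule for a single candidate basis function can be sketched as follows (a paraphrase of Tipping and Faul's rules with our own names; the bookkeeping that maintains $s_i$ and $q_i$ against the current model is omitted):

```python
# Minimal sketch (our paraphrase, not the authors' implementation) of the
# add/delete/re-estimate rule for one candidate basis function i.
import numpy as np

def update_basis(i, s, q, in_model, alpha):
    theta = q[i] ** 2 - s[i] ** 2
    if theta > 0 and not in_model[i]:      # relevant but absent: add it
        in_model[i] = True
        alpha[i] = s[i] ** 2 / theta
    elif theta > 0 and in_model[i]:        # relevant and present: re-estimate
        alpha[i] = s[i] ** 2 / theta
    elif theta <= 0 and in_model[i]:       # irrelevant but present: delete
        in_model[i] = False
        alpha[i] = np.inf
    return in_model, alpha
```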

STATIC CLASSIFICATION

Deterding Vowel Data: 11 vowels spoken in "h*d" context.

Approach                   Error Rate   # Parameters
K-Nearest Neighbor         44%
Gaussian Node Network      44%
SVM: Polynomial Kernels    49%
SVM: RBF Kernels           35%          83 SVs
Separable Mixture Models   30%
RVM: RBF Kernels           30%          13 RVs

FROM STATIC CLASSIFICATION TO RECOGNITION

[Block diagram: mel-cepstral features feed an HMM recognition pass, which produces an N-best list and segment information; a segmental converter maps these to segmental features; a hybrid decoder then rescores the N-best list to produce the final hypothesis.]
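In rough outline, the hybrid pass works as sketched below. This is our own illustration with hypothetical helper names (hmm, segmental_converter, rvm_score), not ISIP's actual code: the HMM proposes hypotheses and segmentations, and the kernel machine rescores each hypothesis at the segment level.

```python
# Minimal sketch (hypothetical interfaces) of hybrid N-best rescoring.
def hybrid_decode(features, hmm, segmental_converter, rvm_score, n_best=10):
    hypotheses, segments = hmm.recognize(features, n_best=n_best)
    best, best_score = None, float("-inf")
    for hyp, segs in zip(hypotheses, segments):
        seg_feats = segmental_converter(features, segs)   # fixed-length vectors
        score = sum(rvm_score(f, phone) for f, phone in zip(seg_feats, hyp))
        if score > best_score:
            best, best_score = hyp, score
    return best
```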

ALPHADIGIT RECOGNITION

OGI Alphadigits: continuous, telephone-bandwidth letters and numbers.

Reduced training set size for comparison: a limited number of training vectors per phone model.
–Results hold for sets of smaller size as well.
–Cannot yet run larger sets efficiently.

Test utterances were rescored using 10-best lists generated by the HMM decoder.

The SVM and RVM system architectures are nearly identical: RBF kernels with gamma = 0.5.
–The SVM requires the sigmoid posterior estimate to produce likelihoods.

ALPHADIGIT RECOGNITION

RVMs yield a large reduction in the parameter count while attaining superior performance.

For RVMs the computational cost lies mainly in training, but it is still prohibitive for larger sets.

SVM performance on the full training set is 11.0%.

Approach   Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM        15.5%        994                 3 hours         1.5 hours
RVM        14.8%        72                  5 days          5 mins

CONCLUSIONS

Application of sparse Bayesian methods to speech recognition.
–Uses automatic relevance determination to eliminate irrelevant input vectors. Applications in maximum likelihood feature extraction?

State-of-the-art performance in extremely sparse models.
–Uses an order of magnitude fewer parameters than SVMs: decreased evaluation time.
–Requires several orders of magnitude longer to train: we need more efficient training routines that can handle continuous speech corpora.

CURRENT WORK

Frame-level classification: HMMs with RVM emission distributions, RVM(o_t), trained by iterative parameter estimation (E-step accumulation feeding an M-step of RVM training).

Convergence properties and efficient training methods are critical.

A "chunking" approach is in development (see the sketch below):
–Apply the algorithm to small subsets of the basis functions
–Combine results from each subset to reach a full solution
–Optimality?
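One plausible reading of the chunking idea, as a minimal sketch (our own illustration under the assumption that train_rvm returns the indices of the surviving relevance vectors; this is not the authors' implementation, and as the slide notes, optimality is an open question):

```python
# Minimal sketch (hypothetical train_rvm contract): run RVM training on
# subsets of the basis functions, pool the survivors, then train once more
# on the pooled set to approximate the full solution.
import numpy as np

def chunked_rvm_train(Phi, t, train_rvm, chunk_size=500):
    n_basis = Phi.shape[1]
    survivors = []
    for start in range(0, n_basis, chunk_size):
        idx = np.arange(start, min(start + chunk_size, n_basis))
        kept = train_rvm(Phi[:, idx], t)       # relevance vectors of this chunk
        survivors.extend(idx[kept])
    survivors = np.array(survivors)
    final = train_rvm(Phi[:, survivors], t)    # combine: train on pooled survivors
    return survivors[final]
```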

REFERENCES

M. E. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," Journal of Machine Learning Research, vol. 1, pp. 211-244, June 2001.

A. Faul and M. Tipping, "Analysis of Sparse Bayesian Learning," in T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, MIT Press, 2002.

M. Tipping and A. Faul, "Fast Marginal Likelihood Maximization for Sparse Bayesian Models," Artificial Intelligence and Statistics '03, preprint.

D. J. C. MacKay, "Probable Networks and Plausible Predictions - A Review of Practical Bayesian Methods for Supervised Neural Networks," Network: Computation in Neural Systems, vol. 6, pp. 469-505, 1995.

A. Ganapathiraju, Support Vector Machines for Speech Recognition, Ph.D. Dissertation, Mississippi State University, Mississippi State, Mississippi, USA, 2002.

J. Hamaker, Sparse Bayesian Methods for Continuous Speech Recognition, Ph.D. Dissertation (preprint), Mississippi State University, Mississippi State, Mississippi, USA, 2003.