Applications of Risk Minimization to Speech Recognition
Joseph Picone
Institute for Signal and Information Processing, Department of Electrical and Computer Engineering, Mississippi State University
Contact Information: Box 9571, Mississippi State University, Mississippi State, Mississippi; Tel:; Fax:; URL:
IBM – SIGNAL PROCESSING
Acknowledgement: Supported by NSF under Grant No. EIA

INTRODUCTION: ABSTRACT AND BIOGRAPHY
ABSTRACT: Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. In this presentation, we review our attempts to apply notions of risk minimization to pattern recognition problems such as speech recognition. New approaches based on probabilistic Bayesian learning are shown to provide an order-of-magnitude reduction in complexity over comparable approaches based on HMMs and Support Vector Machines.
BIOGRAPHY: Joseph Picone is currently a Professor in the Department of Electrical and Computer Engineering at Mississippi State University, where he also directs the Institute for Signal and Information Processing. For the past 15 years he has been promoting open source speech technology. He was previously employed by Texas Instruments and AT&T Bell Laboratories. Dr. Picone received his Ph.D. in Electrical Engineering from the Illinois Institute of Technology. He is a Senior Member of the IEEE and a registered Professional Engineer.

INTRODUCTION: GENERALIZATION AND RISK
[Figure: three two-class data sets. In the first, the optimal decision surface is a line; in the second, it changes abruptly; in the third, it is still a line.]
How much can we trust isolated data points?
Can we integrate prior knowledge about the data, our confidence, or our willingness to take risk?

INTRODUCTION: ACOUSTIC CONFUSABILITY
Regions of overlap represent classification error.
Overlap can be reduced by introducing acoustic and linguistic context.
[Figure: comparison of "aa" in "lOck" and "iy" in "bEAt" for conversational speech.]

INTRODUCTION: PROBABILISTIC FRAMEWORK (the equations on this slide appeared as images; see the sketch below)
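The slide's equations did not survive transcription. As a hedged reconstruction, the standard probabilistic framework for speech recognition that the rest of the talk assumes is:

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{p(A \mid W)\, P(W)}{p(A)}
        = \arg\max_{W} \; p(A \mid W)\, P(W),
```

where A is the sequence of acoustic observations, p(A|W) is the acoustic model (here an HMM with Gaussian emission densities), and P(W) is the language model.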

INTRODUCTION: ML CONVERGENCE IS NOT OPTIMAL
Maximum likelihood convergence does not translate to optimal classification if the a priori assumptions about the data are not correct. Finding the optimal decision boundary requires only one parameter.

INTRODUCTION: POOR GENERALIZATION WITH GMM MLE
Data is often not separable by a hyperplane – a nonlinear classifier is needed.
Gaussian MLE models tend toward the center of mass – overtraining leads to poor generalization.
Three problems: controlling generalization, direct discriminative training, and sparsity.

RISK MINIMIZATION: DISCRIMINATIVE TRAINING
There are several popular discriminative training approaches (e.g., maximum mutual information estimation).
Essential idea: maximize the numerator (the ML term) while minimizing the denominator (the discriminative term) of the objective, which appeared as an equation on the slide (see the sketch below).
Previously developed for neural networks and hybrid systems, and eventually applied to HMM-based speech recognition systems.
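The objective itself was an image on the slide. As a hedged reconstruction, the standard MMIE criterion, which matches the numerator/denominator description above, is:

```latex
F_{\mathrm{MMIE}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_{\lambda}(O_r \mid W_r)\, P(W_r)}
         {\sum_{W} p_{\lambda}(O_r \mid W)\, P(W)},
```

where the numerator scores each training utterance O_r against its correct transcription W_r (the ML term) and the denominator sums over all competing hypotheses (the discriminative term).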

RISK MINIMIZATION: STRUCTURAL OPTIMIZATION
Structural optimization is often guided by an Occam's Razor approach: trading goodness of fit against model complexity.
– Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination (see the sketch below)
[Figure: error versus model complexity; training-set error decreases monotonically while open-loop error reaches an optimum at intermediate complexity.]
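As context for two of the criteria named above (an addition, not transcribed from the slide), BIC and AIC both penalize the maximized log likelihood by a term in the number of free parameters k, with N the number of training samples:

```latex
\mathrm{BIC} = -2 \ln \hat{L} + k \ln N, \qquad
\mathrm{AIC} = -2 \ln \hat{L} + 2k.
```

The model minimizing the chosen criterion is selected; MDL leads to a penalty of essentially the same form as BIC.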

RISK MINIMIZATION: STRUCTURAL RISK MINIMIZATION
The VC dimension is a measure of the complexity of the learning machine.
A higher VC dimension gives a looser bound on the actual risk, thus penalizing a more complex model (Vapnik).
Expected risk: not possible to estimate directly, since P(x,y) is unknown.
Empirical risk: related to the expected risk through the VC dimension, h (see the bound sketched below).
Approach: choose the machine that gives the least upper bound on the actual risk.
[Figure: the bound on the expected risk as the sum of the empirical risk and the VC confidence, plotted against VC dimension; the expected risk has an optimum at intermediate h.]
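The expressions themselves were images on the slide. The standard Vapnik bound they presumably show is, with probability 1 − η over N training samples:

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\frac{\eta}{4}}{N}},
```

where R(α) is the expected risk under the unknown P(x,y), R_emp(α) is the empirical (training-set) risk, and the square-root term is the VC confidence.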

RISK MINIMIZATION: SUPPORT VECTOR MACHINES
Optimization for separable data (the hyperplane, constraint, and final-classifier equations appeared on the slide; see the sketch below).
Quadratic optimization of a Lagrange functional minimizes the risk criterion (maximizes the margin). Only a small portion of the training points become support vectors.
[Figure: two linearly separable classes with candidate hyperplanes C0, C1, C2 and margin hyperplanes H1, H2 along the normal vector w.]
Hyperplanes C0 through C2 all achieve zero empirical risk, but C0 generalizes optimally.
The data points that define the boundary are called support vectors.
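A hedged reconstruction of the missing equations for the separable case, in standard linear-SVM notation (not copied from the slide):

```latex
\text{Hyperplane: } \mathbf{w}^{\top}\mathbf{x} + b = 0, \qquad
\text{Constraints: } y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1,\; i = 1,\dots,N,
\qquad
\min_{\mathbf{w},b}\; \tfrac{1}{2}\,\lVert\mathbf{w}\rVert^{2}
\;\;\Longrightarrow\;\;
f(\mathbf{x}) = \operatorname{sgn}\!\Big(\sum_{i \in \mathrm{SV}} \alpha_i\, y_i\, \mathbf{x}_i^{\top}\mathbf{x} + b\Big),
```

where the α_i are the Lagrange multipliers of the dual problem, only the support vectors (SV) have α_i > 0, and the margin is 2/||w||.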

RISK MINIMIZATION: SVMS FOR NON-SEPARABLE DATA
No hyperplane can achieve zero empirical risk (in any dimension!).
Recall the SRM principle: balance empirical risk against model complexity.
Relax the optimization constraints to allow for errors on the training set (see the soft-margin sketch below).
A new parameter, C, must be estimated to optimally control the trade-off between training-set errors and model complexity.
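The relaxed constraints were shown as an equation on the slide; the standard soft-margin formulation (a reconstruction, not a transcription) introduces slack variables ξ_i:

```latex
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\;
  \tfrac{1}{2}\,\lVert\mathbf{w}\rVert^{2} + C \sum_{i=1}^{N} \xi_i
\quad \text{s.t.} \quad
  y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,
```

so that C directly weights training-set errors against the margin (model-complexity) term.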

RISK MINIMIZATION: DRAWBACKS OF SVMS
Uses a binary (yes/no) decision rule:
– Generates a distance from the hyperplane, but this distance is often not a good measure of our "confidence" in the classification
– Can produce a "probability" as a function of the distance (e.g., using sigmoid fits; see below), but these are inadequate
The number of support vectors grows linearly with the size of the data set.
Requires estimation of the trade-off parameter, C, via held-out sets.
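For reference (an addition; the slide names the technique but not the formula), the usual sigmoid fit maps the SVM output f(x) to a posterior estimate:

```latex
P(y = 1 \mid \mathbf{x}) \approx \frac{1}{1 + \exp\big(A\, f(\mathbf{x}) + B\big)},
```

with A and B fit on held-out data (Platt scaling); the criticism on the slide is that this is a post-hoc fix rather than a genuinely probabilistic model.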

RELEVANCE VECTOR MACHINES: EVIDENCE MAXIMIZATION
Build a fully specified probabilistic model – incorporate prior information/beliefs as well as a notion of confidence in predictions.
MacKay posed a special form of regularization in neural networks – sparsity.
Evidence maximization: evaluate candidate models based on their "evidence", P(D|H_i).
Structural optimization is performed by maximizing the evidence across all candidate models.
The method is steeped in Gaussian approximations.

RELEVANCE VECTOR MACHINES: EVIDENCE FRAMEWORK
The evidence includes a penalty that measures how well our posterior model fits our prior assumptions; we can set the prior in favor of sparse, smooth models.
Evidence approximation: the likelihood of the data given the best-fit parameter set, weighted by how much of the prior the posterior occupies (see the sketch below).
[Figure: the prior P(w|H_i) and the posterior P(w|D,H_i) over a weight w; the ratio of their widths gives the Occam penalty.]
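A hedged reconstruction of the evidence approximation in MacKay's framework (not transcribed from the slide):

```latex
P(D \mid H_i) = \int P(D \mid \mathbf{w}, H_i)\, P(\mathbf{w} \mid H_i)\, d\mathbf{w}
  \;\approx\;
  \underbrace{P(D \mid \mathbf{w}_{\mathrm{MP}}, H_i)}_{\text{best-fit likelihood}}
  \;\times\;
  \underbrace{P(\mathbf{w}_{\mathrm{MP}} \mid H_i)\, \Delta w}_{\text{Occam factor}},
```

where w_MP is the most probable parameter set and Δw is the width of the posterior; models that must fine-tune their parameters to fit the data pay a large Occam penalty.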

RELEVANCE VECTOR MACHINES: AUTOMATIC RELEVANCE DETERMINATION
The RVM is a kernel-based learning machine (see the model sketched below).
It incorporates an automatic relevance determination (ARD) prior over each weight (MacKay).
A flat (non-informative) prior over the hyperparameters α completes the Bayesian specification.
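The model and prior were equations on the slide; in standard RVM notation (a reconstruction) they are:

```latex
y(\mathbf{x}; \mathbf{w}) = \sum_{i=1}^{N} w_i\, K(\mathbf{x}, \mathbf{x}_i) + w_0,
\qquad
p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathcal{N}\!\big(w_i \mid 0,\; \alpha_i^{-1}\big),
```

with one hyperparameter α_i per weight; for classification, P(t = 1 | x) = σ(y(x; w)) with σ the logistic sigmoid. As α_i → ∞ the corresponding weight is pruned, which is the source of sparsity.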

RELEVANCE VECTOR MACHINES: ITERATIVE REESTIMATION
The goal in training becomes finding the hyperparameters that maximize the evidence (the objective appeared as an equation on the slide; see the sketch below).
Estimation of the "sparsity" parameters is inherent in the optimization – no need for a held-out set!
A closed-form solution to this maximization problem is not available, so the hyperparameters are reestimated iteratively.
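A hedged reconstruction of the missing expressions (Tipping's standard formulation, not transcribed from the slide): training seeks

```latex
\boldsymbol{\alpha}^{*} = \arg\max_{\boldsymbol{\alpha}}\; p(\mathbf{t} \mid \boldsymbol{\alpha})
  = \arg\max_{\boldsymbol{\alpha}} \int p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w},
\qquad
\alpha_i^{\mathrm{new}} = \frac{\gamma_i}{\mu_i^{2}}, \quad
\gamma_i = 1 - \alpha_i\, \Sigma_{ii},
```

where μ and Σ are the mean and covariance of the (approximate) weight posterior and γ_i measures how well weight i is determined by the data.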

RELEVANCE VECTOR MACHINES: LAPLACE'S METHOD
Fix α and estimate w (e.g., by gradient descent or Newton iterations).
Use the Hessian to approximate the covariance of a Gaussian posterior over the weights, centered at the mode w_MP.
With μ = w_MP and Σ (the inverse Hessian) as the mean and covariance, respectively, of the Gaussian approximation, α is then found from the reestimation formulas above.
The method is O(N^2) in memory and O(N^3) in time. A sketch of the resulting training loop follows.
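A minimal Python sketch of this training loop (an illustration assuming an RBF kernel, fixed iteration counts, and a simple pruning threshold; not the authors' ISIP implementation):

```python
# Sketch of RVM classification training via Laplace's method, following
# Tipping's formulation. Kernel choice, iteration counts, and the pruning
# threshold are assumptions made only for illustration.
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rvm_train(X, t, n_outer=50, n_irls=25, prune=1e6):
    """X: (N, d) training vectors; t: (N,) class labels in {0, 1}."""
    N = X.shape[0]
    Phi = np.hstack([np.ones((N, 1)), rbf_kernel(X, X)])     # bias + kernel columns
    alpha = np.ones(Phi.shape[1])                            # one ARD hyperparameter per weight
    w = np.zeros(Phi.shape[1])
    for _ in range(n_outer):
        # Inner loop: locate the posterior mode w_MP for fixed alpha (Newton / IRLS).
        for _ in range(n_irls):
            y = sigmoid(Phi @ w)
            g = Phi.T @ (t - y) - alpha * w                  # gradient of the log posterior
            B = y * (1.0 - y)                                # logistic curvature terms
            H = Phi.T @ (Phi * B[:, None]) + np.diag(alpha)  # negative Hessian
            w = w + np.linalg.solve(H, g)
        # Laplace approximation: posterior ~ N(w_MP, Sigma) with Sigma = H^{-1}.
        Sigma = np.linalg.inv(H)
        # Re-estimate alpha_i = gamma_i / mu_i^2 with gamma_i = 1 - alpha_i * Sigma_ii.
        gamma = np.clip(1.0 - alpha * np.diag(Sigma), 1e-12, None)
        alpha = np.minimum(gamma / (w ** 2 + 1e-12), prune)  # cap: huge alpha => weight pruned
    relevant = alpha < prune                                 # surviving "relevance vectors"
    return w, relevant

def rvm_predict(Xnew, Xtrain, w):
    Phi = np.hstack([np.ones((Xnew.shape[0], 1)), rbf_kernel(Xnew, Xtrain)])
    return sigmoid(Phi @ w)                                  # P(t = 1 | x)
```

The N x N kernel matrix accounts for the O(N^2) memory cost, and the repeated linear solves and the inversion account for the O(N^3) time cost noted above.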

RELEVANCE VECTOR MACHINES: COMPARISON TO SVMS
RVM:
– Data: class labels (0, 1)
– Goal: learn the posterior, P(t=1|x)
– Structural optimization: hyperprior distribution encourages sparsity
– Training: iterative, O(N^3)
SVM:
– Data: class labels (-1, 1)
– Goal: find the optimal decision surface under constraints
– Structural optimization: trade-off parameter that must be estimated
– Training: quadratic programming, O(N^2)

EXPERIMENTAL RESULTS: DETERDING VOWEL DATA
Deterding Vowel Data: 11 vowels spoken in an "h*d" context; 10 log-area parameters; 528 training and 462 speaker-independent test vectors.

Approach                    % Error   # Parameters
SVM: Polynomial Kernels     49%
K-Nearest Neighbor          44%
Gaussian Node Network       44%
SVM: RBF Kernels            35%       83 SVs
Separable Mixture Models    30%
RVM: RBF Kernels            30%       13 RVs

EXPERIMENTAL RESULTS: INTEGRATION WITH SPEECH RECOGNITION
Data size:
– 30 million frames of data in the training set
– Solution: segmental phone models
Source for segmental data:
– Solution: use the HMM system in a bootstrap procedure
– Could also build a segment-based decoder
Probabilistic decoder coupling:
– SVMs: sigmoid-fit posterior
– RVMs: naturally probabilistic
[Figure: a segment of k frames ("how are you": hh aw aa r y uw) is divided into three regions of 0.3*k, 0.4*k, and 0.3*k frames; the mean of each region forms the segmental feature vector.]
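The figure shows a variable-length phone segment being reduced to a fixed-length feature vector by averaging three proportional regions. A minimal Python sketch of that computation (an assumption based on the 0.3/0.4/0.3 split in the figure, not the exact ISIP converter):

```python
# Sketch of the 3-region segmental feature computation suggested by the figure.
import numpy as np

def segmental_features(frames, proportions=(0.3, 0.4, 0.3)):
    """frames: (k, d) array of frame-level features (e.g., mel-cepstra) for one
    phone segment, with k at least a few frames. Returns the concatenation of
    the per-region means: a fixed-length vector of size 3 * d."""
    k = frames.shape[0]
    edges = np.round(np.cumsum((0.0,) + proportions) * k).astype(int)  # 0, 0.3k, 0.7k, k
    means = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        hi = max(hi, lo + 1)                 # guard against empty regions on short segments
        means.append(frames[lo:hi].mean(axis=0))
    return np.concatenate(means)

# Example: a 17-frame segment of 13-dimensional mel-cepstra -> a 39-dimensional vector.
segment = np.random.randn(17, 13)
print(segmental_features(segment).shape)     # (39,)
```

The resulting fixed-length vectors are what the SVM/RVM classifiers consume in the hybrid system described on the next slide.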

EXPERIMENTAL RESULTS: HYBRID DECODER
[Diagram: features (mel-cepstra) feed HMM recognition, which produces an N-best list and segment information; a segmental converter turns these into segmental features; the hybrid decoder combines the N-best list and segmental features to produce the final hypothesis.]

EXPERIMENTAL RESULTS: SVM ALPHADIGIT RECOGNITION
The HMM system uses cross-word state-tied triphones with 16-mixture Gaussian models.
The SVM system uses monophone models with segmental features.
A system combination experiment yields another 1% reduction in error.

Transcription   Segmentation   SVM     HMM
N-best          Hypothesis     11.0%   11.9%
N-best+Ref      Reference      3.3%    6.3%

EXPERIMENTAL RESULTS: SVM/RVM ALPHADIGIT COMPARISON
RVMs yield a large reduction in the parameter count while attaining superior performance.
Computational cost lies mainly in training for RVMs, but training is still prohibitive for larger sets.

Approach   Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM        16.4%                            hours           30 mins
RVM        16.2%        12                  30 days         1 min

SUMMARY: PRACTICAL RISK MINIMIZATION?
Reduction of complexity at the same level of performance is interesting:
– Results hold across tasks
– RVMs have been trained on 100,000 vectors
– Results suggest integrated training is critical
Risk minimization provides a family of solutions:
– Is there a better solution than minimum risk?
– What is the impact on complexity and robustness?
Applications to other problems?
– Speech/non-speech classification?
– Speaker adaptation?
– Language modeling?

APPENDIX: SCALING RVMS TO LARGE DATA SETS
Central to RVM training is the inversion of an M x M Hessian matrix: initially an O(N^3) operation.
Solutions:
– Constructive approach: start with an empty model and iteratively add candidate parameters. M is typically much smaller than N.
– Divide-and-conquer approach: divide the complete problem into a set of sub-problems, then iteratively refine the candidate parameter set according to the sub-problem solutions. M is user-defined.

APPENDIX: PRELIMINARY RESULTS

Approach             Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM                  15.5%        994                 3 hours         1.5 hours
RVM (Constructive)   14.8%        72                  5 days          5 mins
RVM (Reduction)      14.8%        74                  6 days          5 mins

Data increased to training vectors. The Reduction method has been trained on up to 100k vectors (on a toy task); this is not possible for the Constructive method.

SUMMARY: ACKNOWLEDGEMENTS
Principal investigators: Aravind Ganapathiraju (Conversay) and Jon Hamaker (Microsoft), as part of their Ph.D. studies at Mississippi State.
Consultants: Michael Tipping (MSR-Cambridge) and Thorsten Joachims (Cornell).
Motivation: serious work began after discussions with V.N. Vapnik at the CLSP Summer Workshop.

SUMMARY: RELEVANT SOFTWARE RESOURCES
Pattern Recognition Applet: compare popular algorithms on standard or custom data sets.
Speech Recognition Toolkits: compare SVMs and RVMs to standard approaches using a state-of-the-art ASR toolkit.
Fun Stuff: have you seen our commercial on the Home Shopping Channel?
Foundation Classes: generic C++ implementations of many popular statistical modeling approaches.

SUMMARY: BRIEF BIBLIOGRAPHY

Applications to Speech Recognition:
1. J. Hamaker and J. Picone, "Advances in Speech Recognition Using Sparse Bayesian Methods," submitted to the IEEE Transactions on Speech and Audio Processing, January 2003.
2. A. Ganapathiraju, J. Hamaker and J. Picone, "Applications of Risk Minimization to Speech Recognition," submitted to the IEEE Transactions on Signal Processing, July 2003.
3. J. Hamaker, J. Picone, and A. Ganapathiraju, "A Sparse Modeling Approach to Speech Recognition Based on Relevance Vector Machines," Proceedings of the International Conference on Spoken Language Processing, vol. 2, Denver, Colorado, USA, September 2002.
4. J. Hamaker, Sparse Bayesian Methods for Continuous Speech Recognition, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, December 2003.
5. A. Ganapathiraju, Support Vector Machines for Speech Recognition, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, January 2002.

Influential work:
6. M. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," Journal of Machine Learning Research, vol. 1, June.
7. D. J. C. MacKay, "Probable Networks and Plausible Predictions: A Review of Practical Bayesian Methods for Supervised Neural Networks," Network: Computation in Neural Systems, vol. 6.
8. D. J. C. MacKay, Bayesian Methods for Adaptive Models, Ph.D. thesis, California Institute of Technology, Pasadena, California, USA.
9. E. T. Jaynes, "Bayesian Methods: General Background," in Maximum Entropy and Bayesian Methods in Applied Statistics, J. H. Justice, ed., pp. 1-25, Cambridge University Press, Cambridge, UK.
10. V. N. Vapnik, Statistical Learning Theory, John Wiley, New York, NY, USA.
11. V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, NY, USA.
12. C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," AT&T Bell Laboratories, November 1999.