A Survey of Large Margin Hidden Markov Model
Xinwei Li, Hui Jiang, York University
Presented by: Fang-Hui Chu



2 Reference Papers
–Xinwei Li, M.S. thesis, Sep. 2005, "Large Margin HMMs for SR"
–Xinwei Li, ICASSP 2005, "Large Margin HMMs for SR"
–Chaojun Liu, ICASSP 2005, "Discriminative Training of CDHMMs for Maximum Relative Separation Margin"
–Xinwei Li, ASRU 2005, "A Constrained Joint Optimization Method for LME"
–Hui Jiang, SAP 2006, "Large Margin HMMs for SR"
–Jinyu Li, ICSLP 2006, "Soft Margin Estimation of HMM Parameters"

3 Outline
–Large Margin HMMs
–Analysis of Margin in CDHMM
–Optimization methods for Large Margin HMM estimation
–Soft Margin Estimation for HMM

4 Large Margin HMMs for ASR
In ASR, given a speech utterance X, a speech recognizer chooses the word Ŵ as output based on the plug-in MAP decision rule:

Ŵ = arg max_{W∈Ω} p(W) p(X|λ_W) = arg max_{W∈Ω} F(X|λ_W)

where F(X|λ_W) = log[ p(W) p(X|λ_W) ] is the discriminant function and Ω denotes the set of all possible words.

For a speech utterance X_i with true word identity W_i, the multiclass separation margin for X_i is defined as

d(X_i) = F(X_i|λ_{W_i}) − max_{W∈Ω, W≠W_i} F(X_i|λ_W)
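As a minimal sketch (assuming the per-word discriminant scores F(X|λ_W) have already been computed, e.g. as word-HMM log-likelihoods plus log priors; the numbers below are illustrative), the separation margin could be computed as:

```python
def separation_margin(scores, true_word):
    """Multiclass separation margin: d(X) = F(X|true word) - best competitor score.

    scores: mapping from each candidate word W to its discriminant value F(X|lam_W)
    true_word: the reference transcription W_i
    """
    competitor = max(v for w, v in scores.items() if w != true_word)
    return scores[true_word] - competitor

# A positive margin means the utterance is correctly classified, and its
# size measures how far it sits from the decision boundary.
d = separation_margin({"yes": -120.5, "no": -131.0, "maybe": -140.2}, "yes")
# d = 10.5
```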

5 Large Margin HMMs for ASR
According to statistical learning theory [Vapnik], the generalization error rate of a classifier on new test sets is theoretically bounded by a quantity related to its margin. Motivated by the large margin principle, even for those utterances in the training set which all have positive margin, we may still want to maximize the minimum margin, in order to build an HMM-based large margin classifier for ASR.

6 Large Margin HMMs for ASR
Given a set of training data D = {X_1, X_2, …, X_T}, we usually know the true word identities of all utterances in D, denoted as L = {W_1, W_2, …, W_T}. First, from all utterances in D, we identify a subset of utterances S as

S = {X_i | X_i ∈ D, 0 ≤ d(X_i) ≤ ε}

where ε > 0 is a preset positive number. We call S the support vector set; each utterance in S is called a support token, which has a relatively small positive margin among all utterances in the training set D.
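Given precomputed margins, selecting the support set is a simple filter; a sketch (margin values hypothetical):

```python
def support_tokens(margins, eps):
    """Indices of support tokens: utterances with 0 <= d(X_i) <= eps.
    Misclassified utterances (negative margin) are excluded by LME."""
    return [i for i, d in enumerate(margins) if 0.0 <= d <= eps]

idx = support_tokens([10.5, 0.3, -2.0, 1.8], eps=2.0)
# idx = [1, 3]: only the small-positive-margin utterances are kept
```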

7 Large Margin HMMs for ASR
This idea leads to estimating the HMM models Λ based on the criterion of maximizing the minimum margin of all support tokens, which is named large margin estimation (LME) of HMMs:

Λ̃ = arg max_Λ min_{X_i∈S} d(X_i)

The HMM models Λ̃ estimated in this way are called large margin HMMs.

8 Analysis of Margin in CDHMM
Adopting the Viterbi method to approximate the summation over all state sequences with the single optimal Viterbi path s*, the discriminant function can be expressed as

F(X|λ_W) ≈ log p(W) + log p(X, s*|λ_W)

9 Analysis of Margin in CDHMM
Here we consider estimating only the mean vectors. In this case, the discriminant function can be represented as a summation of quadratic terms related to the mean values of the CDHMMs.

10 Analysis of Margin in CDHMM
As a result, the decision margin can be represented in a standard diagonal quadratic form. Thus, for each feature vector x_it, we can divide all of its dimensions into two parts, and each feature dimension contributes to the decision margin separately.
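With diagonal-covariance Gaussians and fixed variances (only the means are estimated), the per-dimension contribution can be sketched as follows. This is an illustration for one frame and one pair of Gaussians (the true model's vs. the best competitor's along the Viterbi path); constant terms such as mixture weights and transition probabilities are omitted, and all numbers are made up:

```python
import numpy as np

def per_dim_margin(x, mu_true, var_true, mu_comp, var_comp):
    """Per-dimension contribution to the decision margin for one frame x
    under two diagonal Gaussians (true model vs. best competitor)."""
    ll_true = -0.5 * ((x - mu_true) ** 2 / var_true + np.log(2 * np.pi * var_true))
    ll_comp = -0.5 * ((x - mu_comp) ** 2 / var_comp + np.log(2 * np.pi * var_comp))
    return ll_true - ll_comp   # one margin contribution per feature dimension

x = np.array([0.2, -1.0])
contrib = per_dim_margin(x, np.zeros(2), np.ones(2), np.ones(2), np.ones(2))
total = contrib.sum()   # the frame's margin is the sum over dimensions
```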

11 Analysis of Margin in CDHMM
After some mathematical manipulation, the margin decomposes into a linear function and a quadratic function of the model means.

12–14 Analysis of Margin in CDHMM
(Slides 12–14 present the detailed derivation, shown only as equations in the original slides.)

15 Optimization methods for LM HMM estimation
–An iterative localized optimization method
–A constrained joint optimization method
–A semidefinite programming method

16 Iterative localized optimization
In order to keep increasing the margin while keeping the margins positive for all samples, both models must be moved together:
–If we keep one of the models fixed, the other model cannot be moved too far under the constraint that all samples must have positive margin
–Otherwise the margin for some tokens will become negative
Instead of optimizing the parameters of all models at the same time, only one selected model is adjusted in each optimization step. The process then iterates to update another model until the optimal margin is achieved.
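As a toy illustration of this alternating scheme (two 1-D unit-variance Gaussian classes with made-up data; this is a gradient-step sketch, not the paper's actual update rule), one can repeatedly find the minimum-margin token and nudge only the mean of the model it belongs to:

```python
import numpy as np

def margins(X, y, mu):
    """d_i = log N(x_i; mu_{y_i}, 1) - log N(x_i; mu_{1-y_i}, 1) for two 1-D classes."""
    return np.array([-0.5 * (x - mu[c]) ** 2 + 0.5 * (x - mu[1 - c]) ** 2
                     for x, c in zip(X, y)])

def iterative_localized_update(X, y, mu, steps=200, lr=0.01):
    mu = np.array(mu, dtype=float)
    for _ in range(steps):
        d = margins(X, y, mu)
        i = int(np.argmin(d))        # support token with the minimum margin
        c = y[i]                     # update only the model this token belongs to
        mu[c] += lr * (X[i] - mu[c]) # gradient of d_i w.r.t. mu_c is (x_i - mu_c)
    return mu

X = np.array([-1.2, -0.8, 0.9, 1.1])
y = [0, 0, 1, 1]
mu = iterative_localized_update(X, y, mu=[-0.5, 0.5])
# the minimum margin over the training tokens increases during the iterations
```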

17 Iterative localized optimization
How do we select the target model in each step?
–The model should be relevant to the support token with the minimum margin
The minimax optimization can then be re-formulated with respect to the selected model only.

18 Iterative localized optimization
The non-differentiable min/max operations in the objective are approximated by a summation of exponential functions, yielding a smooth objective.
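This kind of smoothing can be illustrated with a log-sum-exp surrogate for the max (a common differentiable approximation; the sharpness parameter η below is illustrative, and the min can be handled analogously as −(1/η) log Σ exp(−η v_k)):

```python
import numpy as np

def smooth_max(values, eta=10.0):
    """Differentiable surrogate for max: (1/eta) * log(sum_k exp(eta * v_k)).
    Upper-bounds the true max and tightens as eta grows."""
    v = np.asarray(values, dtype=float)
    m = v.max()                     # shift by the max for numerical stability
    return m + np.log(np.exp(eta * (v - m)).sum()) / eta

approx = smooth_max([1.0, 3.0, 2.5], eta=50.0)  # close to max = 3.0
```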

19 Iterative localized optimization

20 Constrained Joint optimization
Some constraints are introduced to make the optimization problem bounded. In this way, the optimization can be performed jointly with respect to all model parameters.

21 Constrained Joint optimization
Two types of constraints are introduced: one to bound the margin contribution from the linear part, and one to bound the margin contribution from the quadratic part.

22 Constrained Joint optimization
Large margin estimation is then reformulated as a constrained minimax optimization problem.

23 Constrained Joint optimization The constrained minimization problem can be transformed into an unconstrained minimization problem
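One generic way to carry out such a transformation is a penalty method; the sketch below illustrates the general technique on a hypothetical one-dimensional problem, not the paper's exact construction:

```python
def penalized_objective(objective, constraints, x, rho=100.0):
    """Turn 'min f(x) s.t. g_j(x) <= 0' into the unconstrained problem
    'min f(x) + rho * sum_j max(0, g_j(x))**2' (quadratic penalty)."""
    penalty = sum(max(0.0, g(x)) ** 2 for g in constraints)
    return objective(x) + rho * penalty

# hypothetical example: minimize x^2 subject to x >= 1 (written as 1 - x <= 0)
f = lambda x: x * x
g = [lambda x: 1.0 - x]
val = penalized_objective(f, g, 1.0)   # feasible point: penalty vanishes, val = 1.0
```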

24 Constrained Joint optimization

25 Soft Margin estimation
–Model separation measure and frame selection
–SME objective function and sample selection

26 Soft Margin estimation
Differences between SME and LME:
–LME neglects the misclassified samples; consequently, LME often needs a very good preliminary estimate from the training set
–SME works on all the training data, both the correctly classified and the misclassified samples
–However, SME must first choose a margin ρ heuristically
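SME's use of all samples can be illustrated with a hinge-style loss: any sample whose margin falls below the preset ρ, including misclassified ones with d < 0, incurs a penalty, while samples beyond ρ contribute nothing. A minimal sketch (illustrative, not the exact SME objective from the thesis):

```python
def soft_margin_loss(margins, rho):
    """Average hinge penalty max(0, rho - d_i). Both misclassified samples
    (d_i < 0) and correctly classified ones inside the margin (0 <= d_i < rho)
    contribute, unlike LME, which drops misclassified samples."""
    return sum(max(0.0, rho - d) for d in margins) / len(margins)

loss = soft_margin_loss([10.5, 0.3, -2.0, 1.8], rho=2.0)
# per-sample penalties: 0, 1.7, 4.0, 0.2 -> average 1.475
```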