Extended Baum-Welch Algorithm
Presented by Shih-Hung Liu, 2006/01/21

Slide 2: References
- A generalization of the Baum algorithm to rational objective functions, [Gopalakrishnan et al.], IEEE ICASSP 1989
- An inequality for rational functions with applications to some statistical estimation problems, [Gopalakrishnan et al.], IEEE Transactions on Information Theory, 1991
- HMMs, MMIE, and the Speech Recognition Problem, [Normandin 1991], PhD dissertation
- Function maximization, [Povey 2004], PhD thesis, Chapter 4.5

Slide 3: Outline
- Introduction
- Extended Baum-Welch algorithm [Gopalakrishnan et al.]
- EBW from discrete to continuous [Normandin]
- EBW for discrete HMMs [Povey]
- Example of function optimization [Gopalakrishnan et al.]
- Conclusion

Slide 4: Introduction
The well-known Baum-Eagon inequality provides an effective iterative scheme for finding a local maximum of homogeneous polynomials with nonnegative coefficients over a domain of probability values. However, we are interested in maximizing a general rational function. We therefore extend the Baum-Eagon inequality to rational functions.

Slide 5: Extended Baum-Welch algorithm (1/6) [Gopalakrishnan 1989]
Let P(x) be an arbitrary homogeneous polynomial with nonnegative coefficients of degree d in the variables x = {x_ij}. Assuming that this polynomial is defined over a domain D of probability values (each row of x sums to one), they show how to construct a transformation T: D -> D with the following property:
Property A: P(T(x)) > P(x) for any x in D, unless T(x) = x.

Slide 6: Extended Baum-Welch algorithm (2/6) [Gopalakrishnan 1989]
Now let R(x) be a ratio of two polynomials in the variables x, defined over the same domain D of probability values. We are looking for a growth transformation T such that R(T(x)) > R(x) for any x in D, unless T(x) = x.
Reduction of the rational case to the polynomial case: the problem of finding a growth transformation for a rational function is reduced to that of finding one for a specially formed polynomial, which in turn reduces to a non-homogeneous polynomial with nonnegative coefficients; the Baum-Eagon inequality is then extended to non-homogeneous polynomials with nonnegative coefficients.

Slide 7: Extended Baum-Welch algorithm (3/6) [Gopalakrishnan 1989]
Step 1:
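The slide's own equations are not in the transcript; a sketch of the standard reduction, in notation I am assuming (R = N/D with polynomials N, D and D > 0 on the domain):

$$Q_{x_0}(x) \;=\; N(x) \;-\; R(x_0)\,D(x).$$

Then $Q_{x_0}(x_0) = 0$, and any point $x$ with $Q_{x_0}(x) > 0$ satisfies $N(x) > R(x_0)\,D(x)$, i.e. $R(x) > R(x_0)$. A growth transformation for the polynomial $Q_{x_0}$ applied at $x_0$ is therefore also a growth transformation for the rational function $R$ at $x_0$.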

Slide 8: Extended Baum-Welch algorithm (4/6) [Gopalakrishnan 1989]
Step 2:
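These equations are also missing; my hedged understanding of this step is that the Step 1 polynomial may have negative coefficients, so one adds a large constant written as a polynomial that is identically constant on the probability domain, for example

$$\tilde{Q}_{x_0}(x) \;=\; Q_{x_0}(x) \;+\; C\Big(1 + \sum_{i,j} x_{ij}\Big)^{d},$$

whose expansion contains every monomial of degree at most $d$ with a positive coefficient. For $C$ large enough, all coefficients of $\tilde{Q}_{x_0}$ are nonnegative, and since the added term is constant on the domain (each row sums to one), the two polynomials have the same growth transformations there.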

Slide 9: Extended Baum-Welch algorithm (5/6) [Gopalakrishnan 1989]
Step 3: finding a growth transformation for a polynomial with nonnegative coefficients can be reduced to the same problem for a homogeneous polynomial with nonnegative coefficients.
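Concretely (standard argument, my notation): on the probability domain each row sum $\sum_j x_{ij}$ equals 1, so a monomial of degree $e < d$ can be multiplied by $\big(\sum_j x_{ij}\big)^{d-e}$ for some row $i$ without changing its value on the domain. Applying this to every monomial yields a homogeneous polynomial of degree $d$ with nonnegative coefficients that agrees with the original polynomial everywhere on the domain, so the Baum-Eagon inequality for homogeneous polynomials can be applied.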

Slide 10: Extended Baum-Welch algorithm (6/6) [Gopalakrishnan 1989]
Baum-Eagon inequality:
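The inequality itself did not survive the transcript; its standard statement, for a homogeneous polynomial $P$ with nonnegative coefficients over rows that sum to one, is

$$P(\bar{x}) \;\ge\; P(x), \qquad \bar{x}_{ij} \;=\; \frac{x_{ij}\,\dfrac{\partial P}{\partial x_{ij}}(x)}{\displaystyle\sum_{j'} x_{ij'}\,\dfrac{\partial P}{\partial x_{ij'}}(x)},$$

with equality only when $\bar{x} = x$; that is, the mapping $x \mapsto \bar{x}$ is a growth transformation for $P$.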

Slide 11: EBW for CDHMM, from discrete to continuous (1/3) [Normandin 1991]
Discrete case for emission probability update.
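The update equation is an image in the original slide; the discrete EBW reestimation formula usually quoted for this case (my notation: $\gamma^{num}_j(k)$ and $\gamma^{den}_j(k)$ are numerator and denominator occupancy counts for symbol $k$ in state $j$) is

$$\hat{b}_j(k) \;=\; \frac{\gamma^{num}_{j}(k) - \gamma^{den}_{j}(k) + D\,b_j(k)}{\displaystyle\sum_{k'}\big(\gamma^{num}_{j}(k') - \gamma^{den}_{j}(k')\big) + D},$$

where $D$ is a positive constant chosen large enough to keep all updated probabilities positive.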

Slide 12: EBW for CDHMM, from discrete to continuous (2/3) [Normandin 1991]
The continuous emission density is approximated by a discrete distribution over M subintervals I_k of equal width, so that the discrete EBW update can be applied to the resulting subinterval probabilities.

Slide 13: EBW for CDHMM, from discrete to continuous (3/3) [Normandin 1991]
Letting the subinterval width go to zero yields the EBW updates for the continuous-density (Gaussian) parameters.
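The equations for this limit are likewise missing from the transcript; the mean and variance updates commonly attributed to this derivation (my notation: $\theta(O)$ and $\theta(O^2)$ are occupancy-weighted sums of observations and squared observations, $\gamma$ the total occupancies, $D$ a smoothing constant) are

$$\hat{\mu}_{jm} \;=\; \frac{\big(\theta^{num}_{jm}(O) - \theta^{den}_{jm}(O)\big) + D\,\mu_{jm}}{\big(\gamma^{num}_{jm} - \gamma^{den}_{jm}\big) + D},$$

$$\hat{\sigma}^2_{jm} \;=\; \frac{\big(\theta^{num}_{jm}(O^2) - \theta^{den}_{jm}(O^2)\big) + D\,\big(\sigma^2_{jm} + \mu^2_{jm}\big)}{\big(\gamma^{num}_{jm} - \gamma^{den}_{jm}\big) + D} \;-\; \hat{\mu}^2_{jm}.$$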

Slide 14: EBW for discrete HMMs (1/6) [Povey 2004]
The Baum-Eagon inequality is formulated for the case where the variables form a matrix whose rows satisfy a sum-to-one constraint, and we are maximizing a sum of polynomial terms in these variables with nonnegative coefficients.
For ML training, we can find an auxiliary function and optimize it. Finding the maximum of the auxiliary function (e.g. using a Lagrange multiplier) leads to the following update, which is a growth transformation for the polynomial:
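The equation itself is not in the transcript; the familiar Baum-Welch ML reestimate it refers to, in my notation where $\gamma_{ij}$ is the expected count for parameter $x_{ij}$ obtained from forward-backward, is

$$\hat{x}_{ij} \;=\; \frac{\gamma_{ij}}{\sum_{j'} \gamma_{ij'}},$$

which coincides with the Baum-Eagon growth transformation applied to the likelihood polynomial.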

Slide 15: EBW for discrete HMMs (2/6) [Povey 2004]
The Baum-Welch update is an update procedure for HMMs which uses this growth transformation together with an algorithm known as the forward-backward algorithm for finding the relevant differentials efficiently.
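To make that sentence concrete, here is a minimal Python sketch (my own toy code, not from the slides or from [Povey 2004]) of the forward-backward occupancy computation for a discrete HMM; these occupancies are the expected counts that the Baum-Welch and EBW updates consume.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """A: NxN transition probs, B: NxK emission probs, pi: length-N initial probs,
    obs: list of T observed symbol indices. Returns TxN state occupancies."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # forward initialization
    for t in range(1, T):                        # forward recursion
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0                            # backward initialization
    for t in range(T - 2, -1, -1):               # backward recursion
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)    # per-frame state occupancies
    return gamma

# toy 2-state, 3-symbol HMM
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward_backward(A, B, pi, [0, 1, 2, 2]))
```

For long utterances one would scale or work in the log domain; this sketch omits that for brevity.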

Slide 16: EBW for discrete HMMs (3/6) [Povey 2004]
An update rule as convenient and provably correct as the Baum-Welch update is not available for discriminative training of HMMs, which is a harder optimization problem.
The Extended Baum-Welch update equation as originally derived is applicable to rational functions of parameters which are subject to sum-to-one constraints. The MMI objective function for discrete-probability HMMs is an example of such a function.
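For concreteness, the MMI objective in question is (standard form; $O_r$ is the r-th training utterance, $w_r$ its transcription, $M_w$ the HMM for word sequence $w$, and $P(w)$ the language model probability):

$$F_{\mathrm{MMI}}(\lambda) \;=\; \sum_{r} \log \frac{p_\lambda(O_r \mid M_{w_r})\,P(w_r)}{\displaystyle\sum_{w} p_\lambda(O_r \mid M_{w})\,P(w)}.$$

For discrete-probability HMMs each likelihood is a polynomial in the emission and transition probabilities, so the exponentiated objective is a rational function of parameters subject to sum-to-one constraints.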

Slide 17: EBW for discrete HMMs (4/6) [Povey 2004]
Two essential points are used to derive the EBW update for MMI:
1. Instead of maximizing a ratio $f(\lambda) = p_{num}(\lambda)/p_{den}(\lambda)$ for positive $p_{num}$ and $p_{den}$, we can instead maximize $p_{num}(\lambda) - f(\lambda')\,p_{den}(\lambda)$, where $\lambda'$ are the parameter values from the previous iteration; increasing this expression will cause $f(\lambda)$ to increase, because it is a strong-sense auxiliary function for $f(\lambda)$ around $\lambda'$.
2. If some terms in the resulting polynomial are negative, we can add to the expression a constant C times a further polynomial which is constrained to be a constant (e.g. a polynomial of the row sums, which is constant because each row sums to one), so as to ensure that no term in the final expression has a negative coefficient.

Slide 18: EBW for discrete HMMs (5/6) [Povey 2004]
By applying these two ideas:
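The resulting update is missing from the transcript; the standard EBW update for sum-to-one parameters that this derivation leads to, in my notation, is

$$\hat{x}_{ij} \;=\; \frac{x_{ij}\left(\dfrac{\partial F_{\mathrm{MMI}}}{\partial x_{ij}} + C\right)}{\displaystyle\sum_{j'} x_{ij'}\left(\dfrac{\partial F_{\mathrm{MMI}}}{\partial x_{ij'}} + C\right)},$$

where the partial derivatives are evaluated at the current parameters and $C$ is chosen large enough to keep every numerator positive. Writing the derivatives in terms of numerator and denominator occupancies recovers the count-form update quoted earlier for the discrete case.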

Slide 19: EBW equivalent smooth function (6/6) [Povey 2004]

Slide 20: Example of function optimization [Gopalakrishnan et al.]

Slide 21: Example (continued)
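The numerical example on these two slides did not survive as text; as a stand-in, here is a small, hedged Python sketch (my own toy functions N and D, not the example used by Gopalakrishnan et al.) showing how the EBW growth transformation with smoothing constant C iteratively increases a rational function over a probability vector.

```python
import numpy as np

# Toy sketch only: maximize f(x) = N(x) / D(x) over a probability vector x
# with the EBW growth transformation
#     x_i <- x_i * (df/dx_i + C) / sum_j x_j * (df/dx_j + C),
# where C is a positive constant large enough to keep every term positive.

def N(x):   # numerator polynomial (nonnegative coefficients)
    return x[0] * x[1] + 0.5 * x[0] ** 2

def D(x):   # denominator polynomial (positive on the simplex)
    return x[0] ** 2 + x[1] ** 2

def f(x):
    return N(x) / D(x)

def grad_f(x, eps=1e-6):
    # numerical partial derivatives, treating the x_i as free variables
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

x = np.array([0.9, 0.1])   # starting point on the simplex
C = 10.0                   # smoothing constant: larger C means smaller, safer steps
for it in range(20):
    num = x * (grad_f(x) + C)
    x = num / num.sum()    # renormalize so x stays on the simplex
    print(it, x.round(4), round(f(x), 6))
```

With a sufficiently large C the objective increases at every iteration; a smaller C gives bigger steps but the growth guarantee may be lost.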

Slide 22: Conclusion
- Presented an algorithm for the maximization of certain rational functions defined over domains of probability values.
- This algorithm is very useful in practical situations for training HMM parameters.

Slide 23: MPE: Final Auxiliary Function
- weak-sense auxiliary function
- strong-sense auxiliary function
- smoothing function involved in the weak-sense auxiliary function
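Since these terms are only named on the slide, here are the definitions as I understand them from [Povey 2004] (my symbols): $g(\lambda, \lambda')$ is a strong-sense auxiliary function for an objective $f(\lambda)$ around $\lambda'$ if

$$g(\lambda, \lambda') - g(\lambda', \lambda') \;\le\; f(\lambda) - f(\lambda'),$$

so that any increase in $g$ guarantees at least as large an increase in $f$. It is only a weak-sense auxiliary function if it merely has the same gradient as $f$ at $\lambda = \lambda'$,

$$\left.\frac{\partial g(\lambda,\lambda')}{\partial \lambda}\right|_{\lambda=\lambda'} \;=\; \left.\frac{\partial f(\lambda)}{\partial \lambda}\right|_{\lambda=\lambda'},$$

which guarantees a local optimum if the iteration converges, but not an increase at every step. A smoothing function is a term whose gradient vanishes at $\lambda'$, so it can be added without destroying the weak-sense property.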

Slide 24: EBW derived from auxiliary function

Slide 25: EBW derived from auxiliary function (continued)
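The equations on these two slides are not in the transcript; a hedged sketch of the idea as I understand it from [Povey 2004], in my own notation: for MMI one maximizes

$$\mathcal{G}(\lambda, \lambda') \;=\; \sum_{j,m}\Big(Q^{num}_{jm}(\lambda,\lambda') - Q^{den}_{jm}(\lambda,\lambda')\Big) \;+\; \sum_{j,m} D_{jm}\,\mathcal{S}_{jm}(\lambda,\lambda'),$$

where $Q^{num}_{jm}$ and $Q^{den}_{jm}$ are the usual EM auxiliary functions built from numerator and denominator occupancies for Gaussian $(j,m)$, and $\mathcal{S}_{jm}$ is a smoothing function whose gradient vanishes at the current parameters $\lambda'$. The smoothing term leaves the weak-sense auxiliary property intact, and setting the derivative of $\mathcal{G}$ to zero yields the EBW mean and variance updates quoted earlier, with $D_{jm}$ playing the role of the smoothing constant D.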