
1 Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology Mark Hasegawa-Johnson jhasegaw@uiuc.edu University of Illinois at Urbana-Champaign, USA

2 Lecture 9. Learning in Bayesian Networks
Learning via Global Optimization of a Criterion
Maximum-likelihood learning
– The Expectation-Maximization algorithm
– Solution for discrete variables using Lagrange multipliers
– General solution for continuous variables
– Example: Gaussian PDF
– Example: Mixture Gaussian
– Example: Bourlard-Morgan NN-DBN Hybrid
– Example: BDFK NN-DBN Hybrid
Discriminative learning criteria
– Maximum Mutual Information
– Minimum Classification Error

3 What is Learning? Imagine that you are a student who needs to learn how to propagate belief in a junction tree.
Level 1 Learning (Rule-Based): I tell you the algorithm. You memorize it.
Level 2 Learning (Category Formation): You observe examples (FHMM). You memorize them. From the examples, you build a cognitive model of each of the steps (moralization, triangulation, cliques, sum-product).
Level 3 Learning (Performance): You try a few problems. When you fail, you optimize your understanding of all components of the cognitive model in order to minimize the probability of future failures.

4 What is Machine Learning?
Level 1 Learning (Rule-Based): Programmer tells the computer how to behave. This is not usually called “machine learning.”
Level 2 Learning (Category Formation): The program is given a numerical model of each category (e.g., a PDF, or a geometric model). Parameters of the numerical model are adjusted in order to represent the category.
Level 3 Learning (Performance): All parameters in a complex system are simultaneously adjusted in order to optimize a global performance metric.

5 Learning Criteria

6 Optimization Methods

7 Maximum Likelihood Learning in a Dynamic Bayesian Network
Given: a particular model structure.
Given: a set of training examples for that model, (b_m, o_m), 1 ≤ m ≤ M.
Estimate all model parameters (p(b|a), p(c|a), …) in order to maximize Σ_m log p(b_m, o_m | Λ).
Recognition is nested within training: at each step of the training algorithm, we need to compute p(b_m, o_m, a_m, …, q_m) for every training token, using the sum-product algorithm.
(Figure: example Bayesian network with nodes a, b, c, d, e, f, n, o, q.)
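
As a reference for the equation slides that follow, the criterion above can be written out in full; here Λ denotes the complete set of conditional probability tables, and the hidden-variable names follow the network in the figure. This is a restatement for readability, not a quotation of the slide.

```latex
\hat{\Lambda} \;=\; \arg\max_{\Lambda} \sum_{m=1}^{M} \log p(b_m, o_m \mid \Lambda),
\qquad
p(b_m, o_m \mid \Lambda) \;=\; \sum_{a,c,d,e,f,n,q} p(a, b_m, c, d, e, f, n, o_m, q \mid \Lambda)
```

The inner sum over the hidden variables is exactly the marginalization that the sum-product (junction-tree) algorithm computes, which is why recognition is nested inside training.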

8 Baum’s Theorem (Baum and Eagon, Bull. Am. Math. Soc., 1967)
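
This slide is equation-only in the transcript. Per the summary slide at the end of the lecture, the property being invoked is the standard EM guarantee, stated here as a paraphrase (with Q the auxiliary function defined on the next slide), not as the slide's own wording.

```latex
Q(\Lambda, \Lambda') \;\ge\; Q(\Lambda', \Lambda')
\quad\Longrightarrow\quad
p(O \mid \Lambda) \;\ge\; p(O \mid \Lambda')
```

In words: any parameter update that does not decrease the expected log-likelihood cannot decrease the likelihood itself, so repeated E/M steps monotonically increase the likelihood.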

9 Expectation Maximization (EM)
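
For readers following without the slide images, the EM auxiliary function for a DBN with observed variables (b_m, o_m) and hidden variables h is conventionally written as below; this is the textbook form, assumed here rather than copied from the slide.

```latex
Q(\Lambda, \Lambda') \;=\; \sum_{m=1}^{M} \sum_{h} p(h \mid b_m, o_m, \Lambda') \,\log p(b_m, o_m, h \mid \Lambda)
```

E-step: compute the posteriors p(h | b_m, o_m, Λ') with the sum-product algorithm. M-step: choose Λ to maximize Q.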

10 EM for a Discrete-Variable Bayesian Network
(Figure: example Bayesian network with nodes a, b, c, d, e, f, n, o, q.)

11 (Figure: example Bayesian network with nodes a, b, c, d, e, f, n, o, q.)

12 Solution: Lagrangian Method
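
The constrained maximization of Q (subject to Σ_j p(v = j | u = i) = 1 for every parent value i) gives the familiar expected-count ratio for any parent–child pair (u, v) in the network; a standard result, written here in generic notation rather than the slide's.

```latex
\hat{p}(v = j \mid u = i) \;=\;
\frac{\sum_{m=1}^{M} p(u = i,\, v = j \mid b_m, o_m, \Lambda')}
     {\sum_{j'} \sum_{m=1}^{M} p(u = i,\, v = j' \mid b_m, o_m, \Lambda')}
```

If v is one of the observed variables in token m, its posterior degenerates to an indicator, and the numerator simply counts the observed co-occurrences.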

13 The EM Algorithm for a Large Training Corpus
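
A minimal sketch of the corpus-level loop for the discrete case, assuming a hypothetical helper posterior_pair_marginals() that wraps the sum-product computation over one training token; the function names and data layout are illustrative, not from the lecture.

```python
import numpy as np

def em_step(cpt, corpus, posterior_pair_marginals):
    """One EM iteration for a single conditional probability table p(v | u).

    cpt:    array of shape (n_u, n_v), current estimate of p(v | u)
    corpus: iterable of observed training tokens
    posterior_pair_marginals(cpt, token): returns an (n_u, n_v) array of
        p(u = i, v = j | token), e.g. computed with the sum-product algorithm.
    """
    counts = np.zeros_like(cpt)          # expected co-occurrence counts
    for token in corpus:                 # E-step: accumulate over the whole corpus
        counts += posterior_pair_marginals(cpt, token)
    # M-step: renormalize each row so that sum_j p(v = j | u = i) = 1
    return counts / counts.sum(axis=1, keepdims=True)
```

In practice one table of expected counts is accumulated per conditional PDF in the network, all in the same pass over the corpus.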

14 EM for Continuous Observations (Liporace, IEEE Trans. Inf. Th., 1982)

15 Solution: Lagrangian Method

16 Example: Gaussian (Liporace, IEEE Trans. Inf. Th., 1982)
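
For a Gaussian observation PDF attached to hidden state q, the Liporace-style re-estimation takes the usual posterior-weighted form; the notation γ_m(q) = p(q | b_m, o_m, Λ') is assumed here, and the formulas are the standard ones rather than a transcript of the slide.

```latex
\hat{\mu}_q = \frac{\sum_m \gamma_m(q)\, o_m}{\sum_m \gamma_m(q)},
\qquad
\hat{\Sigma}_q = \frac{\sum_m \gamma_m(q)\,\bigl(o_m - \hat{\mu}_q\bigr)\bigl(o_m - \hat{\mu}_q\bigr)^{\!\top}}{\sum_m \gamma_m(q)}
```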

17 Example: Mixture Gaussian (Juang, Levinson, and Sondhi, IEEE Trans. Inf. Th., 1986)
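
A compact sketch of one EM iteration for a mixture-Gaussian observation density, using SciPy's multivariate normal; this is the generic GMM update, not necessarily the exact formulation of Juang, Levinson, and Sondhi.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em_step(X, weights, means, covs):
    """One EM iteration for a Gaussian mixture.  X: (N, D) data matrix."""
    N, K = X.shape[0], len(weights)
    resp = np.zeros((N, K))
    for k in range(K):                       # E-step: component responsibilities
        resp[:, k] = weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
    resp /= resp.sum(axis=1, keepdims=True)

    Nk = resp.sum(axis=0)                    # effective counts per component
    weights = Nk / N                         # M-step: mixture weights
    means = (resp.T @ X) / Nk[:, None]       # posterior-weighted means
    covs = []
    for k in range(K):                       # posterior-weighted covariances
        diff = X - means[k]
        covs.append((resp[:, k, None] * diff).T @ diff / Nk[k])
    return weights, means, covs
```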

18 Example: Bourlard-Morgan Hybrid (Morgan and Bourlard, IEEE Sign. Proc. Magazine 1995)
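
The key identity behind the Bourlard-Morgan hybrid is the conversion of the network's posterior outputs into scaled likelihoods by dividing by the class priors; the notation below is the conventional one and is assumed rather than quoted from the slide.

```latex
p(o_t \mid q) \;\propto\; \frac{p(q \mid o_t)}{p(q)}
```

Here p(q | o_t) is the neural network output for state q and p(q) is the prior estimated from the training data, which is the subject of the next slide ("Pseudo-Priors and Training Priors").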

19 Pseudo-Priors and Training Priors

20 Training the Hybrid Model Using the EM Algorithm

21 The Solution: Q Back-Propagation
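
This slide is equation-only in the transcript. The idea named in the title, as usually described for NN-DBN hybrids, is to train the network by gradient ascent on the EM auxiliary function Q, pushing its gradient back through the network outputs by the chain rule; the notation below (network outputs y_t, weights θ) is an assumption for illustration.

```latex
\frac{\partial Q}{\partial \theta} \;=\; \sum_{t} \frac{\partial Q}{\partial y_t}\,\frac{\partial y_t}{\partial \theta}
```

The second factor is ordinary error backpropagation; only the error signal ∂Q/∂y_t at the output layer changes.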

22 Merging the EM and Gradient Ascent Loops

23 Example: BDFK Hybrid (Bengio, De Mori, Flammia, and Kompe, Spe. Comm. 1992)

24 The Q Function for a BDFK Hybrid

25 The EM Algorithm for a BDFK Hybrid

26 Discriminative Learning Criteria

27 Maximum Mutual Information
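
As a reference for the untranscribed equation slides that follow, the MMI criterion for a transcribed corpus is conventionally written as below, with W_m the reference transcription of utterance O_m; this is the standard form from the literature, assumed here.

```latex
F_{\mathrm{MMI}}(\Lambda) \;=\; \sum_{m=1}^{M} \log
\frac{p(O_m \mid W_m, \Lambda)\, P(W_m)}
     {\sum_{W} p(O_m \mid W, \Lambda)\, P(W)}
```

Maximizing F_MMI raises the probability of the correct transcription relative to all competing word sequences W, rather than in isolation.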

28

29

30

31 An EM-Like Algorithm for MMI
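
The usual "EM-like" update here is extended Baum-Welch re-estimation, in which numerator (correct-transcription) and denominator (competitor) expected counts are combined with a smoothing constant D chosen large enough to keep every probability nonnegative; the form below is the standard one from the MMI literature, not a transcription of the slide.

```latex
\hat{p}(b \mid a) \;=\;
\frac{\gamma^{\mathrm{num}}(a,b) - \gamma^{\mathrm{den}}(a,b) + D\, p(b \mid a)}
     {\sum_{b'} \bigl[\gamma^{\mathrm{num}}(a,b') - \gamma^{\mathrm{den}}(a,b') + D\, p(b' \mid a)\bigr]}
```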

32

33 MMI for Databases with Different Kinds of Transcription
If every word’s start and end times are labeled, then W_T is the true word label, and W* is the label of the false word (or words!) with maximum modeled probability.
If the start and end times of individual words are not known, then W_T is the true word sequence. W* may be computed as the best path (or paths) through a word lattice or N-best list.
(Schlüter, Macherey, Müller, and Ney, Spe. Comm. 2001)

34 Minimum Classification Error (McDermott and Katagiri, Comput. Speech Lang. 1994)
Define empirical risk as “the number of word tokens for which the wrong HMM has higher log-likelihood than the right HMM.”
This risk definition has two nonlinearities:
– The zero-one loss function, u(x). Replace it with a differentiable loss function ℓ(x), typically a sigmoid.
– The max. Replace it with a “softmax” function, log(exp(a) + exp(b) + exp(c)).
Differentiate the result; train all HMM parameters using error backpropagation.
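
A small sketch of the smoothed risk described above for a single token, with per-class log-likelihoods as inputs; the variable names and the sigmoid steepness alpha are illustrative choices, not values from McDermott and Katagiri.

```python
import numpy as np

def mce_loss(scores, correct, alpha=1.0):
    """Smoothed minimum-classification-error loss for one token.

    scores:  1-D array of per-class log-likelihoods g_k
    correct: index of the correct class
    alpha:   steepness of the sigmoid that replaces the 0/1 loss
    """
    competitors = np.delete(scores, correct)
    # "softmax" (log-sum-exp) replaces the hard max over competing classes
    soft_max = np.log(np.sum(np.exp(competitors)))
    # misclassification measure: positive when a wrong class outscores the right one
    d = soft_max - scores[correct]
    # differentiable surrogate for the 0/1 loss; its gradient drives backpropagation
    return 1.0 / (1.0 + np.exp(-alpha * d))
```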

35 Summary
What is Machine Learning?
– Choose an optimality criterion.
– Find an algorithm that will adjust model parameters to optimize the criterion.
Maximum Likelihood
– Baum’s theorem: maximizing E[log p] (the auxiliary function) also increases p.
– Apply directly to discrete, Gaussian, and mixture-Gaussian models.
– Nest within error backpropagation for the Bourlard-Morgan and BDFK hybrids.
Discriminative Criteria
– Maximum Mutual Information (MMI)
– Minimum Classification Error (MCE)

