Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson
University of Illinois at Urbana-Champaign, USA
Lecture 9. Learning in Bayesian Networks
Learning via Global Optimization of a Criterion
Maximum-likelihood learning
– The Expectation-Maximization algorithm
– Solution for discrete variables using Lagrangian multipliers
– General solution for continuous variables
– Example: Gaussian PDF
– Example: Mixture Gaussian
– Example: Bourlard-Morgan NN-DBN Hybrid
– Example: BDFK NN-DBN Hybrid
Discriminative learning criteria
– Maximum Mutual Information
– Minimum Classification Error
What is Learning?
Imagine that you are a student who needs to learn how to propagate belief in a junction tree.
Level 1 Learning (Rule-Based): I tell you the algorithm. You memorize it.
Level 2 Learning (Category Formation): You observe examples (FHMM). You memorize them. From the examples, you build a cognitive model of each of the steps (moralization, triangulation, cliques, sum-product).
Level 3 Learning (Performance): You try a few problems. When you fail, you optimize your understanding of all components of the cognitive model in order to minimize the probability of future failures.
What is Machine Learning?
Level 1 Learning (Rule-Based): Programmer tells the computer how to behave. This is not usually called “machine learning.”
Level 2 Learning (Category Formation): The program is given a numerical model of each category (e.g., a PDF, or a geometric model). Parameters of the numerical model are adjusted in order to represent the category.
Level 3 Learning (Performance): All parameters in a complex system are simultaneously adjusted in order to optimize a global performance metric.
Learning Criteria
Optimization Methods
Maximum Likelihood Learning in a Dynamic Bayesian Network
Given: a particular model structure.
Given: a set of training examples for that model, (b_m, o_m), 1 ≤ m ≤ M.
Estimate all model parameters Θ = (p(b|a), p(c|a), …) in order to maximize Σ_m log p(b_m, o_m | Θ).
Recognition is nested within training: at each step of the training algorithm, we need to compute p(b_m, o_m, a_m, …, q_m) for every training token, using the sum-product algorithm.
[Figure: example dynamic Bayesian network with nodes a, b, c, d, e, f, n, o, q.]
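Written out as a worked statement (using Θ for the full parameter set, a notational choice of mine rather than the slide's), the criterion and the marginal it requires are:

```latex
\hat{\Theta} = \arg\max_{\Theta} \sum_{m=1}^{M} \log p(b_m, o_m \mid \Theta),
\qquad
p(b_m, o_m \mid \Theta) = \sum_{a,c,d,e,f,n,q} p(a, b_m, c, d, e, f, n, o_m, q \mid \Theta).
```

The inner sum over the hidden variables is exactly the marginal that the sum-product (junction tree) algorithm computes, which is why recognition is nested inside training.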
Baum’s Theorem (Baum and Eagon, Bull. Am. Math. Soc., 1967)
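For reference, here is the form in which this result is usually quoted in the EM/HMM literature (notation mine; the lecture's own statement may differ in detail). Define the auxiliary function

```latex
Q(\Theta', \Theta) \;=\; E\!\big[\, \log p(O, H \mid \Theta') \;\big|\; O, \Theta \,\big]
\;=\; \sum_{H} p(H \mid O, \Theta)\, \log p(O, H \mid \Theta'),
```

where O are the observed variables and H the hidden variables. The result, as used in EM, is that any Θ' with Q(Θ', Θ) ≥ Q(Θ, Θ) also satisfies p(O | Θ') ≥ p(O | Θ); repeatedly maximizing Q therefore never decreases the likelihood.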
Expectation Maximization (EM)
EM for a Discrete-Variable Bayesian Network
[Figure: the same example network, with nodes a, b, c, d, e, f, n, o, q.]
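For a discrete-variable network like the one sketched above, the Q function decomposes over the conditional probability tables. A sketch in my own notation, with v ranging over the variables of the network and pa(v) its parents:

```latex
Q(\Theta', \Theta) \;=\; \sum_{m=1}^{M} \sum_{v} \sum_{x_v,\, x_{\mathrm{pa}(v)}}
\gamma_m\!\big(x_v, x_{\mathrm{pa}(v)}\big)\, \log p'\!\big(x_v \mid x_{\mathrm{pa}(v)}\big),
\qquad
\gamma_m\!\big(x_v, x_{\mathrm{pa}(v)}\big) = p\big(x_v, x_{\mathrm{pa}(v)} \mid b_m, o_m, \Theta\big),
```

where the posteriors γ_m are exactly the clique marginals delivered by the sum-product algorithm in the E-step.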
Solution: Lagrangian Method
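A sketch of the constrained maximization the title refers to, in the notation introduced above: maximize Q subject to Σ_b p'(b|a) = 1 for each parent value a.

```latex
\mathcal{L} = \sum_{m} \sum_{a,b} \gamma_m(a,b)\,\log p'(b \mid a)
\;+\; \sum_{a} \lambda_a \Big( 1 - \sum_{b} p'(b \mid a) \Big),
\qquad
\frac{\partial \mathcal{L}}{\partial p'(b \mid a)} = 0
\;\Rightarrow\;
p'(b \mid a) = \frac{\sum_m \gamma_m(a,b)}{\sum_m \sum_{b'} \gamma_m(a,b')}.
```

Each re-estimated table entry is therefore a normalized expected count.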
The EM Algorithm for a Large Training Corpus
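A minimal sketch of how one EM iteration is usually organized in code for a large corpus (the function names and data layout are placeholders of mine, not the lecture's): accumulate expected counts token by token, then normalize once per iteration.

```python
from collections import defaultdict

def em_iteration(cpts, corpus, posteriors):
    """One EM iteration for a discrete-variable Bayesian network.

    cpts: dict mapping (child, parent_config, child_value) -> probability
    corpus: list of evidence dicts, one per training token
    posteriors(cpts, evidence): returns a dict with the same keys as cpts,
        giving p(child_value, parent_config | evidence) for that token --
        in practice this is where the sum-product algorithm runs.
    """
    # E-step: accumulate expected counts over all M training tokens.
    counts = defaultdict(float)
    for evidence in corpus:
        for key, gamma in posteriors(cpts, evidence).items():
            counts[key] += gamma

    # M-step: normalize expected counts within each (child, parent_config).
    totals = defaultdict(float)
    for (child, parents, value), c in counts.items():
        totals[(child, parents)] += c
    return {
        (child, parents, value): c / totals[(child, parents)]
        for (child, parents, value), c in counts.items()
        if totals[(child, parents)] > 0.0
    }
```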
EM for Continuous Observations (Liporace, IEEE Trans. Inf. Th., 1982)
Solution: Lagrangian Method
Example: Gaussian (Liporace, IEEE Trans. Inf. Th., 1982)
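The closed-form M-step for a Gaussian observation density p(x | q) = N(x; μ_q, Σ_q), written in my own notation with γ_{m,t}(q) the state posterior from the E-step:

```latex
\hat{\mu}_q = \frac{\sum_{m,t} \gamma_{m,t}(q)\, x_{m,t}}{\sum_{m,t} \gamma_{m,t}(q)},
\qquad
\hat{\Sigma}_q = \frac{\sum_{m,t} \gamma_{m,t}(q)\,
  (x_{m,t} - \hat{\mu}_q)(x_{m,t} - \hat{\mu}_q)^{\mathsf T}}{\sum_{m,t} \gamma_{m,t}(q)} .
```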
Example: Mixture Gaussian (Juang, Levinson, and Sondhi, IEEE Trans. Inf. Th., 1986)
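A minimal runnable sketch of one EM iteration for a diagonal-covariance Gaussian mixture (names and the variance floor are mine; in the HMM/DBN setting each frame's responsibility would additionally be weighted by its state posterior γ_t(q)):

```python
import numpy as np

def gmm_em_step(x, weights, means, variances):
    """One EM iteration for a diagonal-covariance Gaussian mixture.

    x:         (T, D) data frames
    weights:   (K,)   mixture weights c_k
    means:     (K, D) component means
    variances: (K, D) diagonal covariances
    Returns updated (weights, means, variances).
    """
    T, D = x.shape
    K = weights.shape[0]

    # E-step: responsibilities resp[t, k] = p(component k | x_t)
    log_resp = np.zeros((T, K))
    for k in range(K):
        diff2 = (x - means[k]) ** 2 / variances[k]
        log_resp[:, k] = (np.log(weights[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * variances[k]))
                          - 0.5 * diff2.sum(axis=1))
    log_resp -= log_resp.max(axis=1, keepdims=True)   # numerical stability
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: weighted mixture weights, means, and variances
    nk = resp.sum(axis=0)                              # (K,)
    new_weights = nk / T
    new_means = (resp.T @ x) / nk[:, None]
    new_vars = (resp.T @ (x ** 2)) / nk[:, None] - new_means ** 2
    return new_weights, new_means, np.maximum(new_vars, 1e-6)
```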
Example: Bourlard-Morgan Hybrid (Morgan and Bourlard, IEEE Sign. Proc. Magazine 1995)
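The core of the scaled-likelihood hybrid is to divide the network's state posteriors by the class priors before using them as observation scores. A minimal sketch (function name and flooring constant are mine):

```python
import numpy as np

def scaled_likelihoods(posteriors, priors, floor=1e-8):
    """Convert NN state posteriors p(q | x_t) into scaled likelihoods
    p(x_t | q) / p(x_t) = p(q | x_t) / p(q), used as HMM observation scores.

    posteriors: (T, Q) network outputs per frame (rows sum to 1)
    priors:     (Q,)   state priors, e.g. relative frequencies in training data
    """
    return posteriors / np.maximum(priors, floor)
```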
Pseudo-Priors and Training Priors
Training the Hybrid Model Using the EM Algorithm
The Solution: Q Back-Propagation
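One way to read this step, assuming the scaled-likelihood substitution p(x_t | q) ∝ y_q(t)/P(q) from the previous slides (my reconstruction, not the lecture's own derivation): the gradient of the EM auxiliary function with respect to the network outputs is

```latex
Q = \sum_{t} \sum_{q} \gamma_t(q) \big( \log y_q(t) - \log P(q) \big) + \text{const}
\quad\Rightarrow\quad
\frac{\partial Q}{\partial y_q(t)} = \frac{\gamma_t(q)}{y_q(t)},
```

which can then be pushed through the network weights by ordinary error back-propagation.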
Merging the EM and Gradient Ascent Loops
Example: BDFK Hybrid (Bengio, De Mori, Flammia, and Kompe, Spe. Comm. 1992)
The Q Function for a BDFK Hybrid
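Assuming the usual reading of the BDFK hybrid, in which the network output y_t = f(x_t) serves as the observation vector of a Gaussian-density HMM (a sketch in my notation), the auxiliary function and its gradient with respect to the network output are

```latex
Q = \sum_{t} \sum_{q} \gamma_t(q)\, \log \mathcal{N}\!\big(y_t;\, \mu_q, \Sigma_q\big),
\qquad
\frac{\partial Q}{\partial y_t} = \sum_{q} \gamma_t(q)\, \Sigma_q^{-1} \big(\mu_q - y_t\big),
```

so the Gaussian parameters (μ_q, Σ_q) get their usual closed-form updates, while the gradient above is back-propagated into the network weights.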
The EM Algorithm for a BDFK Hybrid
Discriminative Learning Criteria
Maximum Mutual Information
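Written out (notation mine), the MMI criterion maximizes the empirical mutual information between the word transcription W and the observations O:

```latex
F_{\mathrm{MMI}}(\Theta) = \sum_{m=1}^{M}
\log \frac{p(O_m \mid W_m, \Theta)\, P(W_m)}
         {\sum_{W'} p(O_m \mid W', \Theta)\, P(W')},
```

where the numerator scores the correct transcription and the denominator sums over competing hypotheses (in practice a word lattice or N-best list, as discussed two slides below).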
An EM-Like Algorithm for MMI
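The title presumably refers to Extended Baum-Welch re-estimation; one common form of its update for a discrete probability table (again in my own notation, with "num" counts collected from the correct transcription and "den" counts from the competing hypotheses) is

```latex
\hat{p}(b \mid a) =
\frac{\gamma^{\mathrm{num}}(a,b) - \gamma^{\mathrm{den}}(a,b) + D\, p(b \mid a)}
     {\sum_{b'} \big( \gamma^{\mathrm{num}}(a,b') - \gamma^{\mathrm{den}}(a,b') \big) + D},
```

where D is a smoothing constant chosen large enough to keep every updated probability positive.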
MMI for Databases with Different Kinds of Transcription
If every word’s start and end times are labeled, then WT is the true word label, and W* is the label of the false word (or words!) with maximum modeled probability.
If the start and end times of individual words are not known, then WT is the true word sequence, and W* may be computed as the best path (or paths) through a word lattice or N-best list.
(Schlüter, Macherey, Müller, and Ney, Spe. Comm. 2001)
Minimum Classification Error (McDermott and Katagiri, Comput. Speech Lang. 1994)
Define empirical risk as “the number of word tokens for which the wrong HMM has higher log-likelihood than the right HMM.”
This risk definition has two nonlinearities:
– Zero-one loss function, u(x). Replace it with a smooth, differentiable loss function ℓ(x).
– Max over competing models. Replace it with a “softmax” function, log(exp(a)+exp(b)+exp(c)).
Differentiate the result; train all HMM parameters using error back-propagation, as sketched below.
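A minimal sketch of the two smoothing steps just described (the sigmoid choice for the differentiable loss, the function name, and the temperature parameters are mine, not necessarily the lecture's):

```python
import numpy as np

def mce_loss(log_like_correct, log_like_competitors, eta=1.0, gamma=1.0):
    """Smoothed minimum-classification-error loss for one token.

    log_like_correct:     scalar, log-likelihood of the correct model
    log_like_competitors: 1-D array, log-likelihoods of the competing models
    eta:   temperature of the softmax that replaces the hard max
    gamma: slope of the sigmoid that replaces the 0/1 step loss u(x)
    """
    g = np.asarray(log_like_competitors, dtype=float)

    # "softmax" replacement for max_j g_j: (1/eta) * log sum_j exp(eta * g_j)
    m = np.max(eta * g)
    soft_max = (m + np.log(np.sum(np.exp(eta * g - m)))) / eta

    # misclassification measure: positive when a wrong model wins
    d = soft_max - log_like_correct

    # differentiable 0/1 loss: sigmoid instead of the step function
    return 1.0 / (1.0 + np.exp(-gamma * d))
```

Because every step is differentiable, the gradient of this loss with respect to the HMM (or DBN) parameters can be computed by the chain rule and used for error back-propagation.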
Summary
What is Machine Learning?
– choose an optimality criterion
– find an algorithm that will adjust model parameters to optimize the criterion
Maximum Likelihood
– Baum’s theorem: increasing E[log p] (the auxiliary Q function) is guaranteed to increase p
– Apply directly to discrete, Gaussian, and mixture-Gaussian models
– Nest within error back-propagation for the Bourlard-Morgan and BDFK hybrids
Discriminative Criteria
– Maximum Mutual Information (MMI)
– Minimum Classification Error (MCE)