
1 Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology Mark Hasegawa-Johnson jhasegaw@uiuc.edu University of Illinois at Urbana-Champaign, USA

2 Lecture 9. Learning in Bayesian Networks
Learning via Global Optimization of a Criterion
Maximum-likelihood learning
– The Expectation-Maximization algorithm
– Solution for discrete variables using Lagrange multipliers
– General solution for continuous variables
– Example: Gaussian PDF
– Example: Mixture Gaussian
– Example: Bourlard-Morgan NN-DBN Hybrid
– Example: BDFK NN-DBN Hybrid
Discriminative learning criteria
– Maximum Mutual Information
– Minimum Classification Error

3 What is Learning? Imagine that you are a student who needs to learn how to propagate belief in a junction tree.
Level 1 Learning (Rule-Based): I tell you the algorithm. You memorize it.
Level 2 Learning (Category Formation): You observe examples (FHMM). You memorize them. From the examples, you build a cognitive model of each of the steps (moralization, triangulation, cliques, sum-product).
Level 3 Learning (Performance): You try a few problems. When you fail, you optimize your understanding of all components of the cognitive model in order to minimize the probability of future failures.

4 What is Machine Learning?
Level 1 Learning (Rule-Based): Programmer tells the computer how to behave. This is not usually called “machine learning.”
Level 2 Learning (Category Formation): The program is given a numerical model of each category (e.g., a PDF, or a geometric model). Parameters of the numerical model are adjusted in order to represent the category.
Level 3 Learning (Performance): All parameters in a complex system are simultaneously adjusted in order to optimize a global performance metric.

5 Learning Criteria

6 Optimization Methods

7 Maximum Likelihood Learning in a Dynamic Bayesian Network
Given: a particular model structure.
Given: a set of training examples for that model, (b_m, o_m), 1 ≤ m ≤ M.
Estimate all model parameters (p(b|a), p(c|a), …) in order to maximize Σ_m log p(b_m, o_m | Λ).
Recognition is nested within training: at each step of the training algorithm, we need to compute p(b_m, o_m, a_m, …, q_m) for every training token, using the sum-product algorithm.
(Figure: example Bayesian network with nodes a, b, c, d, e, f, n, o, q.)
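
As a reference for the equation slides that follow, the criterion above can be written out in full; here Λ denotes the complete set of conditional probability tables, and the hidden-variable names follow the network in the figure. This is a restatement for readability, not a quotation of the slide.

```latex
\hat{\Lambda} \;=\; \arg\max_{\Lambda} \sum_{m=1}^{M} \log p(b_m, o_m \mid \Lambda),
\qquad
p(b_m, o_m \mid \Lambda) \;=\; \sum_{a,c,d,e,f,n,q} p(a, b_m, c, d, e, f, n, o_m, q \mid \Lambda)
```

The inner sum over the hidden variables is exactly the marginalization that the sum-product (junction-tree) algorithm computes, which is why recognition is nested inside training.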

8 Baum’s Theorem (Baum and Eagon, Bull. Am. Math. Soc., 1967)
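
This slide is equation-only in the transcript. Per the summary slide at the end of the lecture, the property being invoked is the standard EM guarantee, stated here as a paraphrase (with Q the auxiliary function defined on the next slide), not as the slide's own wording.

```latex
Q(\Lambda, \Lambda') \;\ge\; Q(\Lambda', \Lambda')
\quad\Longrightarrow\quad
p(O \mid \Lambda) \;\ge\; p(O \mid \Lambda')
```

In words: any parameter update that does not decrease the expected log-likelihood cannot decrease the likelihood itself, so repeated E/M steps monotonically increase the likelihood.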

9 Expectation Maximization (EM)
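
For readers following without the slide images, the EM auxiliary function for a DBN with observed variables (b_m, o_m) and hidden variables h is conventionally written as below; this is the textbook form, assumed here rather than copied from the slide.

```latex
Q(\Lambda, \Lambda') \;=\; \sum_{m=1}^{M} \sum_{h} p(h \mid b_m, o_m, \Lambda') \,\log p(b_m, o_m, h \mid \Lambda)
```

E-step: compute the posteriors p(h | b_m, o_m, Λ') with the sum-product algorithm. M-step: choose Λ to maximize Q.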

10 EM for a Discrete-Variable Bayesian Network
(Figure: example Bayesian network with nodes a, b, c, d, e, f, n, o, q.)

11 (Figure: example Bayesian network with nodes a, b, c, d, e, f, n, o, q.)

12 Solution: Lagrangian Method
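
The constrained maximization of Q (subject to Σ_j p(v = j | u = i) = 1 for every parent value i) gives the familiar expected-count ratio for any parent–child pair (u, v) in the network; a standard result, written here in generic notation rather than the slide's.

```latex
\hat{p}(v = j \mid u = i) \;=\;
\frac{\sum_{m=1}^{M} p(u = i,\, v = j \mid b_m, o_m, \Lambda')}
     {\sum_{j'} \sum_{m=1}^{M} p(u = i,\, v = j' \mid b_m, o_m, \Lambda')}
```

If v is one of the observed variables in token m, its posterior degenerates to an indicator, and the numerator simply counts the observed co-occurrences.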

13 The EM Algorithm for a Large Training Corpus
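
A minimal sketch of the corpus-level loop for the discrete case, assuming a hypothetical helper posterior_pair_marginals() that wraps the sum-product computation over one training token; the function names and data layout are illustrative, not from the lecture.

```python
import numpy as np

def em_step(cpt, corpus, posterior_pair_marginals):
    """One EM iteration for a single conditional probability table p(v | u).

    cpt:    array of shape (n_u, n_v), current estimate of p(v | u)
    corpus: iterable of observed training tokens
    posterior_pair_marginals(cpt, token): returns an (n_u, n_v) array of
        p(u = i, v = j | token), e.g. computed with the sum-product algorithm.
    """
    counts = np.zeros_like(cpt)          # expected co-occurrence counts
    for token in corpus:                 # E-step: accumulate over the whole corpus
        counts += posterior_pair_marginals(cpt, token)
    # M-step: renormalize each row so that sum_j p(v = j | u = i) = 1
    return counts / counts.sum(axis=1, keepdims=True)
```

In practice one table of expected counts is accumulated per conditional PDF in the network, all in the same pass over the corpus.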

14 EM for Continuous Observations (Liporace, IEEE Trans. Inf. Th., 1982)

15 Solution: Lagrangian Method

16 Example: Gaussian (Liporace, IEEE Trans. Inf. Th., 1982)
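
For a Gaussian observation PDF attached to hidden state q, the Liporace-style re-estimation takes the usual posterior-weighted form; the notation γ_m(q) = p(q | b_m, o_m, Λ') is assumed here, and the formulas are the standard ones rather than a transcript of the slide.

```latex
\hat{\mu}_q = \frac{\sum_m \gamma_m(q)\, o_m}{\sum_m \gamma_m(q)},
\qquad
\hat{\Sigma}_q = \frac{\sum_m \gamma_m(q)\,\bigl(o_m - \hat{\mu}_q\bigr)\bigl(o_m - \hat{\mu}_q\bigr)^{\!\top}}{\sum_m \gamma_m(q)}
```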

17 Example: Mixture Gaussian (Juang, Levinson, and Sondhi, IEEE Trans. Inf. Th., 1986)
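
A compact sketch of one EM iteration for a mixture-Gaussian observation density, using SciPy's multivariate normal; this is the generic GMM update, not necessarily the exact formulation of Juang, Levinson, and Sondhi.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em_step(X, weights, means, covs):
    """One EM iteration for a Gaussian mixture.  X: (N, D) data matrix."""
    N, K = X.shape[0], len(weights)
    resp = np.zeros((N, K))
    for k in range(K):                       # E-step: component responsibilities
        resp[:, k] = weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
    resp /= resp.sum(axis=1, keepdims=True)

    Nk = resp.sum(axis=0)                    # effective counts per component
    weights = Nk / N                         # M-step: mixture weights
    means = (resp.T @ X) / Nk[:, None]       # posterior-weighted means
    covs = []
    for k in range(K):                       # posterior-weighted covariances
        diff = X - means[k]
        covs.append((resp[:, k, None] * diff).T @ diff / Nk[k])
    return weights, means, covs
```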

18 Example: Bourlard-Morgan Hybrid (Morgan and Bourlard, IEEE Sign. Proc. Magazine 1995)
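
The key identity behind the Bourlard-Morgan hybrid is the conversion of the network's posterior outputs into scaled likelihoods by dividing by the class priors; the notation below is the conventional one and is assumed rather than quoted from the slide.

```latex
p(o_t \mid q) \;\propto\; \frac{p(q \mid o_t)}{p(q)}
```

Here p(q | o_t) is the neural network output for state q and p(q) is the prior estimated from the training data, which is the subject of the next slide ("Pseudo-Priors and Training Priors").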

19 Pseudo-Priors and Training Priors

20 Training the Hybrid Model Using the EM Algorithm

21 The Solution: Q Back-Propagation
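
This slide is equation-only in the transcript. The idea named in the title, as usually described for NN-DBN hybrids, is to train the network by gradient ascent on the EM auxiliary function Q, pushing its gradient back through the network outputs by the chain rule; the notation below (network outputs y_t, weights θ) is an assumption for illustration.

```latex
\frac{\partial Q}{\partial \theta} \;=\; \sum_{t} \frac{\partial Q}{\partial y_t}\,\frac{\partial y_t}{\partial \theta}
```

The second factor is ordinary error backpropagation; only the error signal ∂Q/∂y_t at the output layer changes.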

22 Merging the EM and Gradient Ascent Loops

23 Example: BDFK Hybrid (Bengio, De Mori, Flammia, and Kompe, Spe. Comm. 1992)

24 The Q Function for a BDFK Hybrid

25 The EM Algorithm for a BDFK Hybrid

26 Discriminative Learning Criteria

27 Maximum Mutual Information
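
As a reference for the untranscribed equation slides that follow, the MMI criterion for a transcribed corpus is conventionally written as below, with W_m the reference transcription of utterance O_m; this is the standard form from the literature, assumed here.

```latex
F_{\mathrm{MMI}}(\Lambda) \;=\; \sum_{m=1}^{M} \log
\frac{p(O_m \mid W_m, \Lambda)\, P(W_m)}
     {\sum_{W} p(O_m \mid W, \Lambda)\, P(W)}
```

Maximizing F_MMI raises the probability of the correct transcription relative to all competing word sequences W, rather than in isolation.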

28

29

30

31 An EM-Like Algorithm for MMI
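
The usual "EM-like" update here is extended Baum-Welch re-estimation, in which numerator (correct-transcription) and denominator (competitor) expected counts are combined with a smoothing constant D chosen large enough to keep every probability nonnegative; the form below is the standard one from the MMI literature, not a transcription of the slide.

```latex
\hat{p}(b \mid a) \;=\;
\frac{\gamma^{\mathrm{num}}(a,b) - \gamma^{\mathrm{den}}(a,b) + D\, p(b \mid a)}
     {\sum_{b'} \bigl[\gamma^{\mathrm{num}}(a,b') - \gamma^{\mathrm{den}}(a,b') + D\, p(b' \mid a)\bigr]}
```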

32

33 MMI for Databases with Different Kinds of Transcription
If every word’s start and end times are labeled, then W_T is the true word label, and W* is the label of the false word (or words!) with maximum modeled probability.
If the start and end times of individual words are not known, then W_T is the true word sequence. W* may be computed as the best path (or paths) through a word lattice or N-best list.
(Schlüter, Macherey, Müller, and Ney, Spe. Comm. 2001)

34 Minimum Classification Error (McDermott and Katagiri, Comput. Speech Lang. 1994)
Define empirical risk as “the number of word tokens for which the wrong HMM has higher log-likelihood than the right HMM.”
This risk definition has two nonlinearities:
– The zero-one loss function, u(x). Replace it with a differentiable loss function ℓ(x), typically a sigmoid.
– The max. Replace it with a “softmax” function, log(exp(a) + exp(b) + exp(c)).
Differentiate the result; train all HMM parameters using error backpropagation.
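
A small sketch of the smoothed risk described above for a single token, with per-class log-likelihoods as inputs; the variable names and the sigmoid steepness alpha are illustrative choices, not values from McDermott and Katagiri.

```python
import numpy as np

def mce_loss(scores, correct, alpha=1.0):
    """Smoothed minimum-classification-error loss for one token.

    scores:  1-D array of per-class log-likelihoods g_k
    correct: index of the correct class
    alpha:   steepness of the sigmoid that replaces the 0/1 loss
    """
    competitors = np.delete(scores, correct)
    # "softmax" (log-sum-exp) replaces the hard max over competing classes
    soft_max = np.log(np.sum(np.exp(competitors)))
    # misclassification measure: positive when a wrong class outscores the right one
    d = soft_max - scores[correct]
    # differentiable surrogate for the 0/1 loss; its gradient drives backpropagation
    return 1.0 / (1.0 + np.exp(-alpha * d))
```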

35 Summary
What is Machine Learning?
– Choose an optimality criterion.
– Find an algorithm that will adjust model parameters to optimize the criterion.
Maximum Likelihood
– Baum’s theorem: maximizing E[log p] (the auxiliary function) also increases p.
– Apply directly to discrete, Gaussian, and mixture-Gaussian models.
– Nest within error backpropagation for the Bourlard-Morgan and BDFK hybrids.
Discriminative Criteria
– Maximum Mutual Information (MMI)
– Minimum Classification Error (MCE)

