1 CS 552/652 Speech Recognition with Hidden Markov Models, Winter 2011
Oregon Health & Science University, Center for Spoken Language Understanding
John-Paul Hosom
Lecture 4, January 12: Hidden Markov Models, Vector Quantization

2 Review: Markov Models
Example 4: Marbles in Jars (lazy person)
(assume unlimited number of marbles)
[Figure: three jars of marbles modeled as a Markov chain with states S1, S2, S3 and transition probabilities on the arcs between them.]

3 Review: Markov Models
Example 4: Marbles in Jars (con't)
S1 = event 1 = black
S2 = event 2 = white
S3 = event 3 = grey
A = {a_ij} = (transition matrix from the figure on the previous slide)
π_1 = 0.33, π_2 = 0.33, π_3 = 0.33
What is the probability of {grey, white, white, black, black, grey}?
Obs. = {g, w, w, b, b, g}
S = {S3, S2, S2, S1, S1, S3}
time = {1, 2, 3, 4, 5, 6}
P = P[S3] · P[S2|S3] · P[S2|S2] · P[S1|S2] · P[S1|S1] · P[S3|S1]
  = 0.33 · 0.3 · 0.6 · 0.2 · 0.6 · 0.1
  ≈ 7.1×10⁻⁴

4 What is a Hidden Markov Model?
Hidden Markov Model:
- more than one event is associated with each state.
- all events have some probability of being emitted at each state.
- given a sequence of observations, we can't determine the state sequence exactly; we can only compute the probabilities of different state sequences given the observation sequence.
Doubly stochastic (probabilities of both emitting events and transitioning between states); the exact state sequence is "hidden."

5 What is a Hidden Markov Model?
Elements of a Hidden Markov Model:
- clock: t = {1, 2, 3, … T}
- N states: Q = {1, 2, 3, … N}
- M events: E = {e_1, e_2, e_3, …, e_M}
- initial probabilities: π_j = P[q_1 = j], 1 ≤ j ≤ N
- transition probabilities: a_ij = P[q_t = j | q_t-1 = i], 1 ≤ i, j ≤ N
- observation probabilities: b_j(k) = P[o_t = e_k | q_t = j], 1 ≤ k ≤ M
  b_j(o_t) = P[o_t = e_k | q_t = j], 1 ≤ k ≤ M
A = matrix of a_ij values, B = set of observation probabilities, π = vector of π_j values.
Entire model: λ = (A, B, π)

6 What is a Hidden Markov Model?
Notes:
- an HMM still generates observations, each state is still discrete, and observations can still come from a finite set (discrete HMMs).
- the number of items in the set of events does not have to be the same as the number of states.
- when in state S, there is probability p(e_1) of generating event 1, probability p(e_2) of generating event 2, etc.
[Figure: two-state HMM with states S1 and S2; emission probabilities p_S1(black) = 0.3, p_S1(white) = 0.7, p_S2(black) = 0.6, p_S2(white) = 0.4; transition probabilities shown on the arcs.]

7 What is a Hidden Markov Model?
Example 1: Marbles in Jars (lazy person)
(assume unlimited number of marbles)
[Figure: three jars as HMM states S1, S2, S3 with transition probabilities on the arcs.]
State 1 (Jar 1): p(b) = 0.8, p(w) = 0.1, p(g) = 0.1
State 2 (Jar 2): p(b) = 0.2, p(w) = 0.5, p(g) = 0.3
State 3 (Jar 3): p(b) = 0.1, p(w) = 0.2, p(g) = 0.7
π_1 = 0.33, π_2 = 0.33, π_3 = 0.33
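Putting the elements from the previous slides together, here is a minimal Python sketch (not from the course) of how λ = (A, B, π) for this example could be stored. The priors and emission probabilities are the ones given above, but the transition matrix A is filled with illustrative placeholder values only, since the figure's transition probabilities are not reproduced in this transcript.

```python
# lambda = (A, B, pi) for Example 1 (marbles in jars).
# pi and B are taken from the slide above; the values in A are ILLUSTRATIVE
# placeholders (each row sums to 1), not the values from the original figure.

states = ["S1", "S2", "S3"]          # the three jars
events = ["b", "w", "g"]             # black, white, grey marbles

pi = {"S1": 0.33, "S2": 0.33, "S3": 0.33}

A = {"S1": {"S1": 0.6, "S2": 0.3, "S3": 0.1},
     "S2": {"S1": 0.2, "S2": 0.6, "S3": 0.2},
     "S3": {"S1": 0.1, "S2": 0.3, "S3": 0.6}}

B = {"S1": {"b": 0.8, "w": 0.1, "g": 0.1},
     "S2": {"b": 0.2, "w": 0.5, "g": 0.3},
     "S3": {"b": 0.1, "w": 0.2, "g": 0.7}}

model = (A, B, pi)                   # the "entire model" lambda
```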

8 What is a Hidden Markov Model?
Example 1: Marbles in Jars (lazy person)
(assume unlimited number of marbles)
With the following observation: {g, w, w, b, b, g}
What is the probability of this observation, given state sequence {S3 S2 S2 S1 S1 S3} and the model?
P = b_3(g) · b_2(w) · b_2(w) · b_1(b) · b_1(b) · b_3(g)
  = 0.7 · 0.5 · 0.5 · 0.8 · 0.8 · 0.7
  = 0.0784

9 What is a Hidden Markov Model?
Example 1: Marbles in Jars (lazy person)
(assume unlimited number of marbles)
With the same observation: {g, w, w, b, b, g}
What is the probability of this observation, given state sequence {S1 S1 S3 S2 S3 S1} and the model?
P = b_1(g) · b_1(w) · b_3(w) · b_2(b) · b_3(b) · b_1(g)
  = 0.1 · 0.1 · 0.2 · 0.2 · 0.1 · 0.1
  = 4.0×10⁻⁶
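As a quick check of the two calculations above, here is a small Python sketch (not part of the original slides) that multiplies out b_qt(o_t) along a given state sequence, using the emission probabilities from Example 1:

```python
# Probability of an observation sequence given a known state sequence:
# P(O | q, lambda) = product over t of b_{q_t}(o_t).
# Emission probabilities are the ones from the Example 1 slide.

B = {
    "S1": {"b": 0.8, "w": 0.1, "g": 0.1},
    "S2": {"b": 0.2, "w": 0.5, "g": 0.3},
    "S3": {"b": 0.1, "w": 0.2, "g": 0.7},
}

def prob_obs_given_path(obs, path):
    p = 1.0
    for o, q in zip(obs, path):
        p *= B[q][o]
    return p

obs = ["g", "w", "w", "b", "b", "g"]
print(prob_obs_given_path(obs, ["S3", "S2", "S2", "S1", "S1", "S3"]))  # 0.0784
print(prob_obs_given_path(obs, ["S1", "S1", "S3", "S2", "S3", "S1"]))  # 4.0e-06
```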

10 What is a Hidden Markov Model?
Some math… With an observation sequence O = (o_1 o_2 … o_T), state sequence q = (q_1 q_2 … q_T), and model λ:
The probability of O, given state sequence q and the model, is:
  P(O | q, λ) = ∏ (t = 1 to T) P(o_t | q_t, λ)
assuming independence between observations. This expands to:
  P(O | q, λ) = b_q1(o_1) · b_q2(o_2) · … · b_qT(o_T)
The probability of the state sequence q can be written:
  P(q | λ) = π_q1 · a_q1q2 · a_q2q3 · … · a_q(T-1)qT

11 What is a Hidden Markov Model?
The probability of both O and q occurring simultaneously is:
  P(O, q | λ) = P(O | q, λ) · P(q | λ)
which can be expanded to:
  P(O, q | λ) = π_q1 b_q1(o_1) · a_q1q2 b_q2(o_2) · … · a_q(T-1)qT b_qT(o_T)
Independence between a_ij and b_j(o_t) is NOT assumed; this is just the multiplication rule: P(A ∩ B) = P(A | B) · P(B)

12 What is a Hidden Markov Model?
There is a direct correspondence between a Hidden Markov Model (HMM) and a Weighted Finite State Transducer (WFST). In an HMM, the (generated) observations can be thought of as inputs, we can (and will) generate outputs based on state names, and there are probabilities of transitioning between states. In a WFST, there are the same inputs, outputs, and transition weights (or probabilities).
[Figure: the two-state HMM from slide 6 (p_S1(black) = 0.3, p_S1(white) = 0.7, p_S2(black) = 0.6, p_S2(white) = 0.4) drawn as a WFST; each arc is labeled input:output/weight, where the weight is an emission probability times a transition probability: black:S1/0.3×0.8, white:S1/0.7×0.8, black:S1/0.3×0.2, white:S1/0.7×0.2, black:S2/0.6×0.9, white:S2/0.4×0.9, black:S2/0.6×0.1, white:S2/0.4×0.1.]

13 What is a Hidden Markov Model?
In the HMM case, we can compute the probability of generating the observations, and the state sequence corresponding to the observations can be computed. In the WFST case, we can compute the cumulative weight (total probability) when we map from the (input) observations to the (output) state names. For the WFST, the states (0, 1, 2) are independent of the output; for an HMM, the state names (S1, S2) map to the output in ways that we'll look at later. We'll talk about WFSTs in more detail later in the course, but for now, be aware that any HMM for speech recognition can be transformed into an equivalent WFST, and vice versa.

14 What is a Hidden Markov Model?
Example 2: Weather and Atmospheric Pressure
[Figure: three-state HMM with states H (high), M (medium), and L (low) atmospheric pressure; each state emits the observations sun, cloud, or rain with its own probabilities (e.g. b_H(sun) = 0.8, b_M(sun) = 0.3, b_M(cloud) = 0.4, b_L(cloud) = 0.3, b_L(rain) = 0.6); initial probabilities π_H = 0.4, π_M = 0.2, π_L = 0.4; transition probabilities shown on the arcs.]

15 What is a Hidden Markov Model?
Example 2: Weather and Atmospheric Pressure
If the weather observation is O = {sun, sun, cloud, rain, cloud, sun}, what is the probability of O, given the model and the sequence {H, M, M, L, L, M}?
P = b_H(sun) · b_M(sun) · b_M(cloud) · b_L(rain) · b_L(cloud) · b_M(sun)
  = 0.8 · 0.3 · 0.4 · 0.6 · 0.3 · 0.3
  = 5.2×10⁻³

16 What is a Hidden Markov Model?
Example 2: Weather and Atmospheric Pressure
What is the probability of O = {sun, sun, cloud, rain, cloud, sun} and the sequence {H, M, M, L, L, M}, given the model?
P = π_H · b_H(s) · a_HM · b_M(s) · a_MM · b_M(c) · a_ML · b_L(r) · a_LL · b_L(c) · a_LM · b_M(s)
  = 0.4 · 0.8 · 0.3 · 0.3 · 0.2 · 0.4 · 0.5 · 0.6 · 0.3 · 0.3 · 0.6 · 0.3
  = 1.12×10⁻⁵
What is the probability of O = {sun, sun, cloud, rain, cloud, sun} and the sequence {H, H, M, L, M, H}, given the model?
P = π_H · b_H(s) · a_HH · b_H(s) · a_HM · b_M(c) · a_ML · b_L(r) · a_LM · b_M(c) · a_MH · b_H(s)
  = 0.4 · 0.8 · 0.6 · 0.8 · 0.3 · 0.4 · 0.5 · 0.6 · 0.6 · 0.4 · 0.3 · 0.6
  = 2.39×10⁻⁴
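As a sanity check on the first calculation above, here is a small Python sketch (not from the slides); it includes only the π, a, and b values that appear in that first factorization:

```python
# P(O, q | lambda) = pi_{q1} * b_{q1}(o_1) * a_{q1,q2} * b_{q2}(o_2) * ...
# Only the parameters appearing in the first factorization above are included.

pi = {"H": 0.4}
A = {("H", "M"): 0.3, ("M", "M"): 0.2, ("M", "L"): 0.5,
     ("L", "L"): 0.3, ("L", "M"): 0.6}
B = {"H": {"sun": 0.8},
     "M": {"sun": 0.3, "cloud": 0.4},
     "L": {"rain": 0.6, "cloud": 0.3}}

def joint_prob(obs, path):
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[(path[t - 1], path[t])] * B[path[t]][obs[t]]
    return p

O = ["sun", "sun", "cloud", "rain", "cloud", "sun"]
print(joint_prob(O, ["H", "M", "M", "L", "L", "M"]))  # ~1.12e-05
```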

17 What is a Hidden Markov Model?
Notes about HMMs:
- must know all possible states in advance
- must know possible state connections in advance
- cannot recognize things outside of the model
- must have some estimate of state emission probabilities and state transition probabilities
- make several assumptions (usually so the math is easier)
- if we can find the best state sequence through an HMM for a given observation, we can compare multiple HMMs for recognition (next week)

18 Log-Domain Mathematics
When multiplying many numbers together, we run the risk of underflow errors… one solution is to transform everything into the log domain:

  linear domain        log domain
  x = e^y              y = log(x)
  x · y                x + y
  x + y                logAdd(x, y)

logAdd(a, b) computes the log-domain sum of a and b when both a and b are already in the log domain. In the linear domain:
  logAdd(a, b) = log(e^a + e^b)

19 Log-Domain Mathematics
Log-domain mathematics avoids underflow and allows (expensive) multiplications to be transformed into (cheap) additions. It is typically used in HMMs, because there are a large number of multiplications… O(F), where F is the number of frames. If F is moderately large (e.g. 5 seconds of speech = 500 frames), even a large per-frame probability (e.g. 0.9) yields a vanishingly small result: 0.9^500 ≈ 1.3×10⁻²³.
For the examples in class, we'll stick with the linear domain, but in class projects you'll want to use log-domain math.
Major point: logAdd(x, y) is NOT the same as log(x×y) = log(x) + log(y)
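A common way to implement logAdd is the "log-sum-exp" trick, which factors out the larger argument so the exponentials cannot overflow or underflow. The following is a sketch of that standard approach, not code from the course:

```python
import math

LOG_ZERO = -1.0e30   # stand-in for log(0)

def log_add(x, y):
    """Log-domain sum: returns log(e^x + e^y) without leaving the log domain."""
    if x == LOG_ZERO:
        return y
    if y == LOG_ZERO:
        return x
    if x < y:
        x, y = y, x                           # make x the larger value
    return x + math.log1p(math.exp(y - x))    # log(e^x + e^y) = x + log(1 + e^(y-x))

# log-domain sum vs. log-domain product:
a, b = math.log(0.004), math.log(0.002)
print(log_add(a, b))   # log(0.006)    -- the sum
print(a + b)           # log(0.000008) -- the product, since log(x*y) = log(x) + log(y)
```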

20 Log-Domain Mathematics
Things to be careful of when working in the log domain:
1. When accumulating probabilities over time, normally you would set an initial value to 1, then multiply several times:

   totalProb = 1.0;
   for (t = 0; t < maxTime; t++) {
       totalProb *= localProb[t];
   }

   When working in the log domain, not only does multiplication become addition, but the initial value should be set to log(1), which is 0. And log(0) can be set to some very negative constant.
2. Working in the log domain is only useful when dealing with probabilities (because probabilities are never negative). When dealing with features, it may be necessary to compute feature values in the linear domain. Probabilities can then be computed in the linear domain and converted, or computed in the log domain directly (depending on how the probabilities are computed).
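For comparison, a log-domain version of the same accumulation might look like the following sketch (Python, with illustrative values); the initial value is log(1) = 0 and each step adds a log-probability instead of multiplying:

```python
import math

LOG_ZERO = -1.0e30            # stand-in for log(0)

# local_probs would come from the model; these values are illustrative only
local_probs = [0.9, 0.5, 0.0, 0.7]

total_log_prob = 0.0          # log(1) = 0
for p in local_probs:
    log_p = math.log(p) if p > 0.0 else LOG_ZERO
    total_log_prob += log_p   # multiplication in linear domain = addition in log domain
```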

21 HMM Topologies
There are a number of common topologies for HMMs:
[Figure: Ergodic (fully-connected) — three states S1, S2, S3 with transitions between every pair of states; π_1 = 0.4, π_2 = 0.2, π_3 = 0.4. Bakis (left-to-right) — four states S1–S4 with only self-loops and forward transitions; π_1 = 1.0, π_2 = 0.0, π_3 = 0.0, π_4 = 0.0. Transition probabilities shown on the arcs.]

22 HMM Topologies
Many varieties are possible:
[Figure: a six-state topology with states S1–S6 and selected transitions (e.g. weights 0.4 and 0.2 on some arcs); π_1 = 0.5, π_2 = 0.0, π_3 = 0.0, π_4 = 0.5, π_5 = 0.0, π_6 = 0.0.]
The topology is defined by the state transition matrix (if an element of this matrix is zero, there is no transition between those two states), e.g. for a four-state left-to-right model:

  A = | a11  a12  a13   0  |
      |  0   a22  a23  a24 |
      |  0    0   a33  a34 |
      |  0    0    0   a44 |
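As an illustration (not from the slides), a left-to-right topology can be encoded directly as a transition matrix whose zero entries forbid transitions; the nonzero probability values below are made up:

```python
import numpy as np

# A 4-state left-to-right (Bakis-style) transition matrix.
# Zero entries mean "no transition between those two states".
# The nonzero values are illustrative; each row must sum to 1.
A = np.array([
    [0.5, 0.3, 0.2, 0.0],
    [0.0, 0.6, 0.3, 0.1],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 1.0],
])

assert np.allclose(A.sum(axis=1), 1.0)   # rows are valid probability distributions
print(A[2, 0])                           # 0.0 -> no transition from state 3 back to state 1
```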

23 HMM Topologies
The topology must be specified in advance by the system designer. A common choice in speech is to have one HMM per phoneme and three states per phoneme; the phoneme-level HMMs can then be connected to form word-level HMMs.
[Figure: three-state phoneme HMMs (states A1 A2 A3, B1 B2 B3, A1 A2 A3, T1 T2 T3) connected left-to-right to form a word-level HMM; π_1 = 1.0, π_2 = 0.0, π_3 = 0.0.]

24 Vector Quantization
Vector Quantization (VQ) is a method of automatically partitioning a feature space into different clusters based on training data. Given a test point (vector) from the feature space, we can determine the cluster that this point should be associated with. A "codebook" lists the central location of each cluster and gives each cluster a name (usually a numerical index). This can be used for data reduction (mapping a large number of feature points to a much smaller number of clusters), or for probability estimation. It requires data to train on, a distance measure, and test data.

25 Vector Quantization
Required distance measure:
  d(v_i, v_j) = d_ij, where d_ij = 0 if v_i = v_j, and d_ij > 0 otherwise
It should also have the symmetry and triangle-inequality properties. Euclidean distance in log-spectral or log-cepstral space is often used.
Vector quantization for pattern classification:
[Figure: example of a feature space partitioned into VQ clusters for pattern classification.]

26 Vector Quantization
How to "train" a VQ system (generate a codebook): K-means clustering
1. Initialization: choose M data points (vectors) from the L training vectors (typically M = 2^B) as initial code words… chosen at random or by maximum distance.
2. Search: for each training vector, find the closest code word and assign that training vector to the code word's cluster.
3. Centroid Update: for each code word's cluster (the group of data points associated with that code word), compute the centroid. The new code word is the centroid.
4. Repeat Steps (2)–(3) until the average distance falls below a threshold (or there is no change).
The final codebook contains the identity and location of each code word.
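A compact sketch of this K-means codebook training loop is shown below (Python with Euclidean distance; not code from the course, and the initialization here simply takes the first M training vectors):

```python
import numpy as np

def train_codebook(train, M, n_iter=20):
    """K-means codebook training: `train` is an (L, D) array of feature
    vectors, M is the number of code words. Returns an (M, D) codebook."""
    codebook = train[:M].copy()                  # 1. initialization (first M vectors)
    for _ in range(n_iter):
        # 2. search: assign each training vector to the nearest code word
        dists = np.linalg.norm(train[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        # 3. centroid update: each code word becomes the centroid of its cluster
        new_codebook = codebook.copy()
        for m in range(M):
            members = train[nearest == m]
            if len(members) > 0:
                new_codebook[m] = members.mean(axis=0)
        # 4. stop early if nothing changed
        if np.allclose(new_codebook, codebook):
            break
        codebook = new_codebook
    return codebook

# toy usage: 2-D points, 4 code words
rng = np.random.default_rng(0)
data = rng.random((100, 2)) * 10
print(train_codebook(data, M=4))
```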

27 Vector Quantization Example
Given the following (integer) data points, create a codebook of 4 clusters, with initial code word values at (2,2), (4,6), (6,5), and (8,8).
[Figure: scatter plot of the integer-valued data points with the four initial code words marked.]

28 Vector Quantization Example
Compute the centroid of each code word's cluster, re-compute the nearest neighbors, then re-compute the centroids.

29 Vector Quantization Example
Once there is no more change, the feature space is partitioned into 4 regions. Any input feature can be classified as belonging to one of the 4 regions. The entire codebook is specified by the 4 centroid points.
[Figure: the final partition of the feature space; each region is a Voronoi cell around its centroid.]

30 Vector Quantization
How to Increase the Number of Clusters? The Binary Split Algorithm:
1. Design a 1-vector codebook (no iteration).
2. Double the codebook size by splitting each code word y_n according to the rule:
     y_n⁺ = y_n (1 + ε)
     y_n⁻ = y_n (1 − ε)
   where 1 ≤ n ≤ M, and ε is a splitting parameter (0.01 ≤ ε ≤ 0.05).
3. Use the K-means algorithm to get the best set of centroids.
4. Repeat (2)–(3) until the desired codebook size is obtained.
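Here is a brief sketch (assuming the y_n(1 ± ε) split rule above and a simple K-means refinement step; not course code) of how the codebook could be grown by binary splitting:

```python
import numpy as np

def kmeans_refine(data, codebook, n_iter=20):
    """Standard K-means refinement of an existing codebook (Euclidean distance)."""
    for _ in range(n_iter):
        nearest = np.linalg.norm(
            data[:, None, :] - codebook[None, :, :], axis=2).argmin(axis=1)
        for m in range(len(codebook)):
            members = data[nearest == m]
            if len(members) > 0:
                codebook[m] = members.mean(axis=0)
    return codebook

def binary_split_codebook(data, target_size, eps=0.01):
    """Grow a codebook by repeated binary splitting (LBG-style).
    Assumes target_size is a power of 2."""
    codebook = data.mean(axis=0, keepdims=True)          # 1. one-vector codebook
    while len(codebook) < target_size:
        # 2. split each code word y_n into y_n(1 + eps) and y_n(1 - eps)
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        # 3. refine with K-means
        codebook = kmeans_refine(data, codebook)
    return codebook

rng = np.random.default_rng(0)
data = rng.random((200, 2)) * 10
print(binary_split_codebook(data, target_size=4))
```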

31 Vector Quantization

32 Vector Quantization
Given a set of data points, create a codebook with 2 code words:
1. Create a codebook with one code word, y_n.
2. Create 2 code words from the original code word:
     y_n⁺ = y_n (1 + ε), y_n⁻ = y_n (1 − ε)
3. Use K-means: assign all data points to the new code words.
4. Compute new centroids.
5. Repeat (3) and (4) until stable.

33 Vector Quantization
Notes:
- If we keep training-data information (the number of data points per code word), VQ can be used to construct "discrete" HMM observation probabilities.
- Classification and probability estimation using VQ is fast… just a table lookup.
- No assumptions are made about a Normal (or any other) probability distribution of the training data.
- Quantization error may occur if a sample is near a codebook boundary.

34 Vector Quantization
Vector quantization used in a "discrete" HMM:
- Given an input vector, determine the discrete centroid with the best match.
- The probability depends on the relative number of training samples in that region:

  b_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)

[Figure: training samples for state j plotted against feature value 1 and feature value 2, partitioned into codebook regions.]
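A minimal sketch (not from the course) of how this probability estimate could be computed: quantize each training vector for state j to its nearest code word, count, and normalize.

```python
import numpy as np

def observation_probs(codebook, state_vectors):
    """Estimate discrete HMM observation probabilities b_j(k) for one state j:
    b_j(k) = (# of state-j training vectors mapping to code word k) /
             (# of state-j training vectors)."""
    dists = np.linalg.norm(
        state_vectors[:, None, :] - codebook[None, :, :], axis=2)
    indices = dists.argmin(axis=1)                      # VQ: nearest code word index
    counts = np.bincount(indices, minlength=len(codebook))
    return counts / len(state_vectors)

# toy usage with a 4-word codebook and made-up state-j training vectors
codebook = np.array([[2.0, 2.0], [4.0, 6.0], [6.0, 5.0], [8.0, 8.0]])
state_j = np.array([[2.1, 1.9], [1.8, 2.2], [7.9, 8.1], [6.2, 5.1]])
print(observation_probs(codebook, state_j))             # [0.5, 0.0, 0.25, 0.25]
```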

35 Vector Quantization
- Other states have their own data, and their own VQ partition.
- It is important that all states have the same number of code words.
For HMMs, compute the probability that observation o_t is generated by each state j. Here, there are two states, red and blue:
  b_blue(o_t) = 14/56 = 1/4 = 0.25
  b_red(o_t) = 8/56 = 1/7 ≈ 0.14

36 Vector Quantization
A number of issues need to be addressed in practice:
- what happens if a single cluster gets a small number of points, but other clusters could still be reliably split?
- how are the initial points selected?
- how is ε determined?
- other clustering techniques (pairwise nearest neighbor, Lloyd algorithm, etc.)
- splitting a tree using "balanced growing" (all nodes split at the same time) or "unbalanced growing" (split one node at a time)
- tree-pruning algorithms…
- different splitting algorithms…