Angelo Dalli Department of Intelligent Computing Systems Hidden Markov Models
Overview Definition Simple Example 3 Basic Problems Forward Algorithm Viterbi Algorithm Baum - Welch Algorithm Example Application Conclusion
Definition Markov Model H = {A, B, N, P} , where…
Elements
Set of N states {S1,…,SN}
M distinct observation symbols per state, V = {v1,…,vM}: our “finite grammar”. We assume discrete symbols here, but they could be continuous
State transition probability matrix A
Observation probability distribution B = {bj(k)}, where bj(k) = P(Ot = vk | qt = Sj), 1 ≤ k ≤ M, 1 ≤ j ≤ N, and Ot, qt represent the observation and state at time t respectively. Again, this could be a continuous pdf modeled by something like Gaussian mixtures
Initial state distribution P = {pi}, where pi = P(q1 = Si), 1 ≤ i ≤ N
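Matrix of state transition probabilities $A = \{a_{ij}\}$, where $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$, $1 \le i, j \le N$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$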
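A transition matrix has to be row-stochastic: every entry is a probability and every row sums to 1, because from any state the chain must go somewhere. A minimal sketch of that check follows; the function name `is_row_stochastic`, the tolerance and the example matrices are illustrative assumptions, not part of the original slides.

```python
def is_row_stochastic(matrix, tol=1e-9):
    """Return True if every row holds non-negative entries that sum to 1."""
    return all(
        all(x >= 0 for x in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )

# Hypothetical 2-state transition matrix: row i holds a_i1 and a_i2.
A = [[0.7, 0.3],
     [0.4, 0.6]]
# The first row of this one sums to 1.1, so it is not a valid matrix.
not_A = [[0.5, 0.6],
         [0.4, 0.6]]

print(is_row_stochastic(A), is_row_stochastic(not_A))
```

The same check applies to each row of B and to the initial distribution P treated as a one-row matrix.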
Markov Chain with 5 states
Observable vs. Hidden Observable: output state is completely determined at each instance of time For example, if output at time t is state itself: 2 state heads/tails coin toss model Hidden: states must be inferred from observations In other words, observation is probabilistic function of state
Simple Example: Urn and Ball N urns sitting in a room Each one has M distinct colored balls Magic genie selects an urn at random, based on some probability distribution Genie selects ball randomly from this urn, tells us the color and puts it back She/he then moves on to next urn based on second prob distribution, and repeats process
Obvious Markov Model here: Each urn is a state Genie’s initial selection is based on initial state probability, P Probability of selecting a certain color determined by observation probability matrix, B The likelihood of the “next” urn is determined by the matrix of transition probabilities, A. At end we have observation sequence, for example O = {red, blue, green, red, green, magenta}
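The genie's procedure can be simulated directly. The sketch below assumes a made-up model with 2 urns and 3 colors; the probabilities, the color names and the function `sample_hmm` are illustrative only. The returned `states` list is exactly what is hidden from an observer who is only told the colors.

```python
import random

def sample_hmm(A, B, P, T, rng):
    """Draw T (state, observation) pairs from a discrete HMM."""
    states, obs = [], []
    # Initial urn is chosen from the initial state distribution P.
    q = rng.choices(range(len(P)), weights=P)[0]
    for _ in range(T):
        # Ball color is drawn from row q of B (and "put back").
        o = rng.choices(range(len(B[q])), weights=B[q])[0]
        states.append(q)
        obs.append(o)
        # Next urn is drawn from row q of A.
        q = rng.choices(range(len(A[q])), weights=A[q])[0]
    return states, obs

colors = ["red", "blue", "green"]          # assumed color set
A = [[0.7, 0.3], [0.4, 0.6]]               # urn-to-urn transitions
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]     # color probabilities per urn
P = [0.6, 0.4]                             # initial urn probabilities

states, obs = sample_hmm(A, B, P, 6, random.Random(0))
print([colors[o] for o in obs])   # what we are told
print(states)                     # where the genie actually was
```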
Where’s genie? If Genie location is known at each time instant t, then model is observed Otherwise, this is a hidden model, and we can only infer state at time t, given our string of observations and known probabilities
Three Basic Problems for HMMs Problem 1: Given observation sequence O = O1O2…OT and Markov Model H = {A,B,P}, how do we (efficiently) compute P(O | H)? Given several model choices, this can be used to determine the most appropriate one
Three Basic Problems for HMMs Problem 2: Given observation sequence O = O1O2…OT and Markov Model H = {A,B,P}, find the optimal state sequence q = q1…qT. An optimality criterion needs to be chosen; the interest is in finding the “correct” state sequence
Three Basic Problems for HMMs Problem 3: Given observation sequence O = O1O2…OT, estimate parameters for model H = {A,B,P} that maximize P(O | H). The observation sequence is used here to train the model, adapting it to best fit the observed phenomenon
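Problem 1: compute P(O | H) Straightforward (bad) Solution
For a given state sequence q = {q1,…,qT}, the probability of the observations is $P(O \mid q, H) = b_{q_1}(O_1)\, b_{q_2}(O_2) \cdots b_{q_T}(O_T)$
The probability of sequence q occurring is $P(q \mid H) = p_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$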
Bad solution continued
Joint probability of O and q is the product of the two: P(O,q | H) = P(O | q,H) P(q | H)
Probability of O is the sum of P(O,q | H) over the set Q of all possible state sequences: $P(O \mid H) = \sum_{q \in Q} P(O \mid q, H)\, P(q \mid H)$
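No Good Computation for this direct method is $O(2T \cdot N^T)$: there are $N^T$ state sequences, each needing about 2T multiplications. Not reasonable even for small values of N and T (N = 5, T = 100 already gives on the order of $10^{72}$ operations). Need to find an efficient way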
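To make the cost concrete, here is a minimal sketch of the direct method that enumerates every state sequence. The model values and the name `brute_force_likelihood` are assumptions chosen for illustration; the loop over `product(...)` visits all N^T sequences, which is why this only works for toy sizes.

```python
from itertools import product

def brute_force_likelihood(A, B, P, O):
    """Sum P(O, q | H) over every possible state sequence q."""
    N, T = len(P), len(O)
    total = 0.0
    for q in product(range(N), repeat=T):      # N**T sequences
        p_q = P[q[0]]                          # P(q | H)
        for t in range(1, T):
            p_q *= A[q[t - 1]][q[t]]
        p_o_given_q = 1.0                      # P(O | q, H)
        for t in range(T):
            p_o_given_q *= B[q[t]][O[t]]
        total += p_o_given_q * p_q
    return total

# Hypothetical 2-state, 3-symbol model.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
P = [0.6, 0.4]

print(brute_force_likelihood(A, B, P, [0, 1, 2]))
```

A useful sanity check is that the likelihoods of all possible observation sequences of a fixed length add up to 1.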
Problem 1 : Compute P(O | H) Efficient Solution The forward algorithm
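The Forward Algorithm
Let $f_t(i) = P(O_1 \ldots O_t,\ q_t = S_i \mid H)$
Initialization: $f_1(i) = p_i\, b_i(O_1)$, $1 \le i \le N$
Induction: $f_{t+1}(j) = \left[\sum_{i=1}^{N} f_t(i)\, a_{ij}\right] b_j(O_{t+1})$, $1 \le t \le T-1$, $1 \le j \le N$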
Forward Algorithm Finally: $P(O \mid H) = \sum_{i=1}^{N} f_T(i)$ Requires $O(N^2 T)$ calculations, much less than the direct method
Problem 2: Given O, H, find “optimal” q Of course, depends on optimality criterion Several likely candidates: Maximize number of correct individual states Does not consider transitions -> may lead to illegal sequences Maximize number of correct duples, triples, etc. Find single best state sequence i.e. maximize P(q | O,H) This is most common criterion, and it is solved via the Viterbi algorithm
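Prob 2 solution: Viterbi Algorithm
Define: $\delta_t(i) = \max_{q_1,\ldots,q_{t-1}} P(q_1 \ldots q_{t-1},\ q_t = S_i,\ O_1 \ldots O_t \mid H)$, the highest probability of a single path at time t ending in state Si
Inductively speaking: $\delta_{t+1}(j) = \left[\max_{i} \delta_t(i)\, a_{ij}\right] b_j(O_{t+1})$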
Viterbi Algorithm
Need to keep track of the argument which maximizes our delta function for each time t and state i. We use array $r_t(i)$
Initialization: $\delta_1(i) = p_i\, b_i(O_1)$, $r_1(i) = 0$, $1 \le i \le N$
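Recursion: $\delta_t(j) = \max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{ij}\right] b_j(O_t)$, $r_t(j) = \arg\max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{ij}\right]$, $2 \le t \le T$, $1 \le j \le N$
At end, we have final probability and the end state: $P^* = \max_{i} \delta_T(i)$, $q_T^* = \arg\max_{i} \delta_T(i)$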
Backtrack to get entire path: $q_t^* = r_{t+1}(q_{t+1}^*)$, t = T-1, T-2,…, 1
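Initialization, recursion, termination and backtracking can be put together as below. This is a sketch under the same assumptions as before (discrete symbols given as indices, a made-up 2-state model, no scaling or log probabilities); states are numbered from 0 in the code.

```python
def viterbi(A, B, P, O):
    """Return (probability of best path, best state sequence)."""
    N, T = len(P), len(O)
    # Initialization: delta_1(i) = p_i * b_i(O_1); r_1 is unused.
    delta = [P[i] * B[i][O[0]] for i in range(N)]
    r = [[0] * N]
    # Recursion: keep the best predecessor of each state j in r[t][j].
    for t in range(1, T):
        new_delta, back = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][O[t]])
        delta = new_delta
        r.append(back)
    # Termination: best final state.
    q_last = max(range(N), key=lambda i: delta[i])
    # Backtracking: follow the stored arguments from t = T-1 down to 1.
    path = [q_last]
    for t in range(T - 1, 0, -1):
        path.append(r[t][path[-1]])
    path.reverse()
    return delta[q_last], path

# Hypothetical 2-state, 3-symbol model.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
P = [0.6, 0.4]

best_prob, best_path = viterbi(A, B, P, [0, 1, 2])
print(best_prob, best_path)
```

For this toy model and the observation indices 0, 1, 2, the best path stays in the first state for two steps and then moves to the second.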
Problem 3: Given O, estimate parameters for H to maximize P(O|H) No known way to analytically maximize P(O | H), or to solve for optimal parameters Can locally maximize P(O | H) with Baum - Welch Algorithm
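Solution to 3: Baum - Welch Algorithm Quite lengthy and beyond our time frame. In short, it iteratively re-estimates A, B and P so that P(O | H) never decreases, converging to a local maximum. Baum - Welch is an instance of the EM algorithm; gradient techniques are another way to attack problem 3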
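For readers who want to see the shape of the procedure, here is a compact sketch of one Baum - Welch re-estimation step for a single observation sequence. It is not taken from the slides: the backward variable `b`, the state posteriors `gamma`, the function names and the toy model are all assumptions introduced here, and the code omits scaling, so it is only usable on short sequences with strictly positive parameters.

```python
def forward_backward(A, B, P, O):
    """Return forward table f, backward table b, and P(O | H)."""
    N, T = len(P), len(O)
    f = [[0.0] * N for _ in range(T)]
    b = [[1.0] * N for _ in range(T)]          # b at the last step is 1
    for i in range(N):
        f[0][i] = P[i] * B[i][O[0]]
    for t in range(1, T):
        for j in range(N):
            f[t][j] = sum(f[t - 1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            b[t][i] = sum(A[i][j] * B[j][O[t + 1]] * b[t + 1][j] for j in range(N))
    return f, b, sum(f[T - 1])

def baum_welch_step(A, B, P, O):
    """One re-estimation of (A, B, P) from a single sequence O."""
    N, M, T = len(P), len(B[0]), len(O)
    f, b, like = forward_backward(A, B, P, O)
    # gamma[t][i]: probability of being in state i at time t, given O.
    gamma = [[f[t][i] * b[t][i] / like for i in range(N)] for t in range(T)]
    new_P = gamma[0][:]
    new_A = []
    for i in range(N):
        visits = sum(gamma[t][i] for t in range(T - 1))
        row = []
        for j in range(N):
            # Expected number of i -> j transitions.
            moves = sum(f[t][i] * A[i][j] * B[j][O[t + 1]] * b[t + 1][j]
                        for t in range(T - 1)) / like
            row.append(moves / visits)
        new_A.append(row)
    new_B = []
    for j in range(N):
        visits = sum(gamma[t][j] for t in range(T))
        new_B.append([sum(gamma[t][j] for t in range(T) if O[t] == k) / visits
                      for k in range(M)])
    return new_A, new_B, new_P

# Hypothetical starting model and training sequence.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
P = [0.6, 0.4]
O = [0, 1, 2, 2, 1, 0, 0, 2]

before = forward_backward(A, B, P, O)[2]
new_A, new_B, new_P = baum_welch_step(A, B, P, O)
after = forward_backward(new_A, new_B, new_P, O)[2]
print(before, after)
```

Repeating the step until the likelihood stops improving gives a local maximum; which one is reached depends on the starting model.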
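Ergodic vs. Left-to-Right Ergodic model: every state can be reached from every other state in a finite number of steps. Left-to-Right Model: the state index never decreases over time, i.e. aij = 0 for j < i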
Variations on HMM
Null transition: transition between states that produces no output. For ex: to model alternate word pronunciations
Tied parameters: set up an equivalence relation between parameters. For ex: between observation probabilities of 2 states which have the same B. Reduces size of model, and makes prob 3 easier
State duration density: inherent prob of staying in state Si for exactly d iterations is $(a_{ii})^{d-1}(1 - a_{ii})$. This may not be appropriate for physical signals, and so an explicit state duration probability density is introduced
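The inherent duration density is a geometric distribution, which is easy to verify and explains why it can be a poor fit for physical signals. The short derivation below uses the notation $\mathrm{dur}_i(d)$, introduced here only for illustration, for the probability of exactly d consecutive steps in state Si.

```latex
\begin{align*}
\mathrm{dur}_i(d) &= (a_{ii})^{d-1}\,(1 - a_{ii}), \qquad d = 1, 2, \ldots \\
\sum_{d=1}^{\infty} \mathrm{dur}_i(d) &= (1 - a_{ii}) \sum_{d=1}^{\infty} (a_{ii})^{d-1} = 1 \\
\bar{d}_i = \sum_{d=1}^{\infty} d \,\mathrm{dur}_i(d) &= \frac{1}{1 - a_{ii}}
\end{align*}
```

For example, a self-transition probability of 0.9 gives an expected stay of 10 steps, yet the single most likely duration is always d = 1 because the density decreases monotonically in d.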
Issues with HMM implementation
Scaling: the forward variables are products of many small terms, which underflow machine precision for long sequences, so we scale them at each time step
Multiple observation sequences: in a left-to-right model, only a small number of observations is available for each state, so several sequences are required for parameter estimation (prob 3)
Initial estimate: random or uniform initial estimates are usually fine for P and A, but B is sensitive to the initial estimate. Again, this is an issue for problem 3
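One common way to handle the scaling issue is to normalize the forward variables at every time step and accumulate the logarithms of the normalizers, which yields log P(O | H) without underflow. The sketch below is one such variant, not necessarily the scheme the slides had in mind; the function name and the toy model are assumptions, and it presumes every observation has non-zero probability under the model.

```python
import math

def forward_log_likelihood(A, B, P, O):
    """Return log P(O | H) using per-step scaling of the forward variables."""
    N = len(P)
    f = [P[i] * B[i][O[0]] for i in range(N)]
    log_like = 0.0
    for t in range(len(O)):
        if t > 0:
            f = [sum(f[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
        c = sum(f)                 # scaling factor for this time step
        f = [x / c for x in f]     # scaled variables now sum to 1
        log_like += math.log(c)    # product of factors equals P(O | H)
    return log_like

# Hypothetical 2-state, 3-symbol model.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
P = [0.6, 0.4]

long_O = [0, 1, 2] * 2000          # 6000 observations
print(forward_log_likelihood(A, B, P, long_O))
```

An unscaled forward pass on the same 6000-symbol sequence would underflow to 0.0 in double precision, while the scaled version returns a finite log-likelihood.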
Issues with HMM implementation
Insufficient training data: for ex, not enough occurrences of different events in a given state. Possible solution: reduce the model to a subset for which more data exists, and linearly interpolate between the model parameters, with interpolation weightings a function of the amount of training data. Alternately, could impose some lower bound on individual observation probabilities
Model choice: ergodic vs. LTR (or other), continuous vs. discrete observation densities, number of states, etc.
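The lower-bound idea can be sketched as follows: clamp each observation probability to a small floor, then renormalize so the row is still a distribution. The function name and the floor value are assumptions for illustration; note that after renormalizing, the smallest entries end up very slightly below the floor itself.

```python
def floor_and_renormalize(row, eps):
    """Clamp probabilities to at least eps, then rescale to sum to 1."""
    floored = [max(p, eps) for p in row]
    total = sum(floored)
    return [p / total for p in floored]

# Hypothetical row of B in which the first symbol was never seen in training.
b_row = [0.0, 0.5, 0.5]
smoothed = floor_and_renormalize(b_row, 0.01)
print(smoothed)
```

Without the floor, a single unseen symbol in a test sequence would force P(O | H) to zero for every path through that state.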
Markov Processes Used in Composition Xenakis Tenney Hiller Chadabe (performance) Charles Ames Student of Hiller Many others since
Example Application
Isolated word recognition (Rabiner)
Each word v is modeled as a distinct HMM Hv
Training set of k occurrences per word, O1,…,Ok, each of which is an observation sequence
Need to:
Estimate parameters for each Hv that maximize P(O1,…,Ok | Hv) (i.e. prob 3)
Extract features O = (O1,…,OT) from the unknown word
Calculate P(O | Hv) for all v (prob 1), and choose the v which maximizes it
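The recognition step is an argmax over per-word likelihoods. The sketch below assumes two already-trained toy models, labeled "yes" and "no" purely for illustration, over a 2-symbol codebook; a practical system would compare log-likelihoods from a scaled forward pass instead of raw probabilities.

```python
def forward(A, B, P, O):
    """Return P(O | H) with the (unscaled) forward algorithm."""
    N = len(P)
    f = [P[i] * B[i][O[0]] for i in range(N)]
    for o in O[1:]:
        f = [sum(f[i] * A[i][j] for i in range(N)) * B[j][o]
             for j in range(N)]
    return sum(f)

def recognize(models, O):
    """Return the word whose model assigns O the highest likelihood."""
    return max(models, key=lambda v: forward(*models[v], O))

# Hypothetical left-to-right word models, each stored as (A, B, P).
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "yes": (left_to_right, [[0.9, 0.1], [0.8, 0.2]], [1.0, 0.0]),
    "no":  (left_to_right, [[0.1, 0.9], [0.2, 0.8]], [1.0, 0.0]),
}

print(recognize(models, [0, 0, 0]))
print(recognize(models, [1, 1, 0]))
```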
Make Observation Feature extraction: at each frame, cepstral coefficients and their derivatives are taken Vector Quantization: observed frame is mapped to possible observation (codebook entry) via nearest neighbor Assuming discrete observation probability Codebook entries estimated by segmenting training data, and taking centroid of all frame vectors for each segment. A la k-means clustering
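Nearest-neighbor quantization and the centroid update can be sketched in a few lines. The 2-dimensional vectors and the 3-entry codebook are made up for illustration (real cepstral feature vectors have more dimensions), and squared Euclidean distance is an assumed choice of distortion measure.

```python
def quantize(frame, codebook):
    """Return the index of the codebook entry nearest to the frame."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))

def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical codebook of three 2-dimensional entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]

print(quantize([0.9, 1.2], codebook))
print(quantize([3.0, 0.5], codebook))
print(centroid([[0.0, 0.0], [2.0, 4.0]]))
```

The index returned by `quantize` is the discrete observation symbol fed to the HMM for that frame.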
Choice of Model and Parameters
Left-to-Right model is more appropriate. Thus we have P(q1 = S1) = 1
Choice of states, two ideas: let a state correspond to a phoneme, or let a state correspond to an analysis frame
Update model parameters: segment the training data into states based on the current model using the Viterbi algorithm (prob 2), then update the A, B probabilities based on the observed data
Ex: bj(Ok) = number of observed vectors nearest to Ok in state j, divided by the total number of observed vectors in state j
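The count-based update of B can be sketched as below, assuming the Viterbi segmentation has already assigned a state to every frame and each frame has been quantized to a codebook index. The function name and the uniform fallback for states that received no frames are assumptions, not something the slides specify.

```python
def reestimate_B(states, obs, N, M):
    """Estimate b_j(k) as (count of symbol k in state j) / (frames in state j)."""
    counts = [[0] * M for _ in range(N)]
    for q, o in zip(states, obs):
        counts[q][o] += 1
    B = []
    for row in counts:
        total = sum(row)
        # Assumed fallback: uniform distribution for a state with no frames.
        B.append([c / total if total else 1.0 / M for c in row])
    return B

# Hypothetical segmentation: frames 0-2 in state 0, frames 3-4 in state 1.
states = [0, 0, 0, 1, 1]
obs = [0, 0, 1, 2, 2]

print(reestimate_B(states, obs, N=2, M=3))
```

The zero entries this produces for unseen symbols are exactly the situation that the lower bound on observation probabilities is meant to avoid.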
State Duration Density If phoneme segmentation used, it may be advantageous to determine a state duration density Variable state length for each phoneme Pyramid of death
Conclusion
Advantages
Has contributed quite a bit to speech recognition
With the algorithms we have described, computation is reasonable
Complex processes can be modeled with low-dimensional data
Works well for time-varying classification; other examples: gesture recognition, formant tracking
Limitations
Assumption that successive observations are independent
First order assumption: probability of the state at time t depends only on the state at time t-1
Need to be “tailor made” for specific application
Needs lots of training data, in order to see all observations