Published by Charles Merrill. Modified over 9 years ago.
1
Hidden Markov Models
Angelo Dalli, Department of Intelligent Computing Systems
2
Overview
- Definition
- Simple Example
- 3 Basic Problems
- Forward Algorithm
- Viterbi Algorithm
- Baum-Welch Algorithm
- Example Application
- Conclusion
3
Definition
Markov Model H = {A, B, N, P}, where…
4
Elements
- Set of N states {S_1,…,S_N}
- M distinct observation symbols V = {v_1,…,v_M} per state
  - Our “finite grammar”
  - We assume discrete symbols here, but they could be continuous
- State transition probability matrix A
- Observation probability distribution B
  - $B = \{ b_j(k) \mid b_j(k) = P(O_t = v_k \mid q_t = S_j),\ 1 \le k \le M,\ 1 \le j \le N \}$, where O_t and q_t represent the observation and the state at time t respectively
  - Again, could be a continuous pdf modeled by something like Gaussian mixtures
- Initial state distribution $P = \{ p_i \mid p_i = P(q_1 = S_i),\ 1 \le i \le N \}$
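As a minimal sketch of these elements, assuming discrete observations and plain Python lists, a model can be held as three tables. The two-state, three-symbol sizes, the numbers, the variable name `pi` (for the initial distribution P), and the helper `validate_hmm` are all illustrative assumptions, not part of the slides.

```python
import math

# Hypothetical toy model: N = 2 states, M = 3 observation symbols.
pi = [0.6, 0.4]                  # initial state distribution P: pi[i] = P(q_1 = S_i)
A = [[0.7, 0.3],                 # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],            # B[j][k] = P(O_t = v_k | q_t = S_j)
     [0.1, 0.3, 0.6]]

def validate_hmm(pi, A, B):
    """Return True if the shapes agree and every row is a probability distribution."""
    n = len(pi)
    shapes_ok = (len(A) == n and len(B) == n
                 and all(len(row) == n for row in A)
                 and all(len(row) == len(B[0]) for row in B))
    rows = [pi] + A + B
    sums_ok = all(math.isclose(sum(row), 1.0) for row in rows)
    nonneg = all(p >= 0 for row in rows for p in row)
    return shapes_ok and sums_ok and nonneg

print(validate_hmm(pi, A, B))
```

Each row of A and B is a distribution in its own right, which is why the check runs row by row.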
5
Matrix of state transition probabilities
$A = \{ a_{ij} \}$, where $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$, $1 \le i, j \le N$
6
Markov Chain with 5 states
7
Observable vs. Hidden
- Observable: output state is completely determined at each instant of time
  - For example, if the output at time t is the state itself: 2-state heads/tails coin toss model
- Hidden: states must be inferred from observations
  - In other words, the observation is a probabilistic function of the state
8
Simple Example: Urn and Ball
- N urns sitting in a room
- Each one has M distinct colored balls
- Magic genie selects an urn at random, based on some probability distribution
- Genie selects a ball randomly from this urn, tells us the color and puts it back
- She/he then moves on to the next urn based on a second probability distribution, and repeats the process
9
Obvious Markov Model here:
- Each urn is a state
- Genie’s initial selection is based on the initial state probability, P
- Probability of selecting a certain color is determined by the observation probability matrix, B
- The likelihood of the “next” urn is determined by the matrix of transition probabilities, A
- At the end we have an observation sequence, for example O = {red, blue, green, red, green, magenta}
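The genie's procedure can be sketched as a small sampler. This is only an illustration: the two urns, three colours, probabilities, and the function name `sample_hmm` are assumptions made for the example.

```python
import random

def sample_hmm(pi, A, B, symbols, T, rng):
    """Draw T observations: pick an urn (state), draw a ball (symbol), move on."""
    states, observations = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]            # first urn, drawn from P
    for _ in range(T):
        k = rng.choices(range(len(symbols)), weights=B[state])[0]  # ball colour, drawn from B
        states.append(state)
        observations.append(symbols[k])
        state = rng.choices(range(len(pi)), weights=A[state])[0]  # next urn, drawn from A
    return states, observations

# Hypothetical two-urn, three-colour example.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
colours = ["red", "blue", "green"]

states, obs = sample_hmm(pi, A, B, colours, T=6, rng=random.Random(0))
print(obs)     # the colours the genie announces
print(states)  # the urn indices, which we only see if the model is observed
```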
10
Where’s genie?
- If the genie's location is known at each time instant t, then the model is observed
- Otherwise, this is a hidden model, and we can only infer the state at time t, given our string of observations and the known probabilities
11
Three Basic Problems for HMM’s
Problem 1: Given observation sequence O = O_1 O_2 … O_T and Markov Model H = {A, B, P}, how do we (efficiently) compute P(O | H)?
- Given several model choices, this can be used to determine the most appropriate one
12
Three Basic Problems for HMM’s
Problem 2: Given observation sequence O = O_1 O_2 … O_T and Markov Model H = {A, B, P}, find the optimal state sequence q = q_1 … q_T
- An optimality criterion needs to be chosen
- The interest is in finding the “correct” state sequence
13
Three Basic Problems for HMM’s
Problem 3: Given observation sequence O = O_1 O_2 … O_T, estimate the parameters of Model H = {A, B, P} that maximize P(O | H)
- The observation sequence is used here to train the model, adapting it to best fit the observed phenomenon
14
Problem 1: compute P(O | H)
Straightforward (bad) solution
- For a given state sequence q = {q_1,…,q_T} we have $P(O \mid q, H) = \prod_{t=1}^{T} b_{q_t}(O_t)$
- The probability of sequence q occurring is $P(q \mid H) = p_{q_1}\, a_{q_1 q_2} \cdots a_{q_{T-1} q_T}$
15
Bad solution continued
- Joint probability of O and q is the product of the two: P(O, q | H) = P(O | q, H) P(q | H)
- Probability of O is P(O, q | H) summed over the set Q of all possible state sequences: $P(O \mid H) = \sum_{q \in Q} P(O \mid q, H)\, P(q \mid H)$
16
No Good
- Computation for this direct method is $O(2T \cdot N^T)$
- Not reasonable even for small values of N and T
- Need to find an efficient way
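To make the cost of the direct method concrete, here is a minimal sketch that enumerates every state sequence. The toy model and the name `likelihood_brute_force` are assumptions for illustration; observations are given as symbol indices.

```python
from itertools import product

def likelihood_brute_force(pi, A, B, O):
    """Sum P(O, q | H) over all N**T state sequences q."""
    N, T = len(pi), len(O)
    total = 0.0
    for q in product(range(N), repeat=T):         # N**T sequences
        p = pi[q[0]] * B[q[0]][O[0]]
        for t in range(1, T):                      # about 2T multiplications per sequence
            p *= A[q[t - 1]][q[t]] * B[q[t]][O[t]]
        total += p
    return total

# Hypothetical toy model: 2 states, 3 symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

print(likelihood_brute_force(pi, A, B, [0, 1, 2]))  # roughly 0.03628
```

With N = 2 and T = 3 the loop visits only 8 sequences, but the count grows as N**T, so even N = 5 and T = 100 is out of reach.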
17
Problem 1: compute P(O | H)
Efficient solution: the forward algorithm
18
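The Forward Algorithm
- Let $f_t(i) = P(O_1 … O_t,\ q_t = S_i \mid H)$
- Initialization: $f_1(i) = p_i\, b_i(O_1)$, $1 \le i \le N$
- Induction: $f_{t+1}(j) = \left[ \sum_{i=1}^{N} f_t(i)\, a_{ij} \right] b_j(O_{t+1})$, $1 \le t \le T-1$, $1 \le j \le N$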
19
Forward Algorithm
- Finally: $P(O \mid H) = \sum_{i=1}^{N} f_T(i)$
- Requires $O(N^2 T)$ calculations
- Much less than the direct method
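The initialization, induction, and termination steps translate almost line for line into code. The following is a minimal sketch, assuming discrete observations given as symbol indices; the toy numbers and the name `forward` are illustrative.

```python
def forward(pi, A, B, O):
    """Return P(O | H) using the forward variables f_t(i)."""
    N = len(pi)
    # Initialization: f_1(i) = p_i * b_i(O_1)
    f = [pi[i] * B[i][O[0]] for i in range(N)]
    # Induction: f_{t+1}(j) = (sum_i f_t(i) * a_ij) * b_j(O_{t+1})
    for o in O[1:]:
        f = [sum(f[i] * A[i][j] for i in range(N)) * B[j][o] for j in range(N)]
    # Termination: P(O | H) = sum_i f_T(i)
    return sum(f)

# Hypothetical toy model: 2 states, 3 symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

print(forward(pi, A, B, [0, 1, 2]))  # roughly 0.03628
```

Each time step costs N * N multiplications, which is where the N^2 T figure comes from.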
20
Problem 2: Given O, H, find “optimal” q
- Of course, this depends on the optimality criterion
- Several likely candidates:
  - Maximize the number of correct individual states
    - Does not consider transitions, so it may lead to illegal sequences
  - Maximize the number of correct duples, triples, etc.
  - Find the single best state sequence, i.e. maximize P(q | O, H)
    - This is the most common criterion, and it is solved via the Viterbi algorithm
21
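Prob 2 solution: Viterbi Algorithm
- Define: $\delta_t(i) = \max_{q_1,…,q_{t-1}} P(q_1 … q_{t-1},\ q_t = S_i,\ O_1 … O_t \mid H)$, the highest probability of a single path at time t ending in state S_i
- Inductively speaking: $\delta_{t+1}(j) = \left[ \max_i \delta_t(i)\, a_{ij} \right] b_j(O_{t+1})$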
22
Viterbi Algorithm
- Need to keep track of the argument which maximizes our delta function for each time t and state i
- We use array r_t(i)
- Now: Initialization: $\delta_1(i) = p_i\, b_i(O_1)$, $r_1(i) = 0$, $1 \le i \le N$
23
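- Recursion: $\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\, b_j(O_t)$, $r_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]$, $2 \le t \le T$, $1 \le j \le N$
- At the end, we have the final probability and the end state: $P^* = \max_i \delta_T(i)$, $q_T^* = \arg\max_i \delta_T(i)$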
24
Backtrack to get entire path:
$q_t^* = r_{t+1}(q_{t+1}^*)$, t = T-1, T-2,…, 1
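Initialization, recursion, termination, and backtracking can be sketched together as follows. This is a minimal illustration with 0-based indices and observations given as symbol indices; the toy model and the name `viterbi` are assumptions.

```python
def viterbi(pi, A, B, O):
    """Return (best state path, its probability) for observation indices O."""
    N, T = len(pi), len(O)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]   # initialization
    r = [[0] * N]                                     # r[t][j]: best predecessor of state j at time t
    for t in range(1, T):                             # recursion
        prev, delta, back = delta, [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: prev[i] * A[i][j])
            back.append(best_i)
            delta.append(prev[best_i] * A[best_i][j] * B[j][O[t]])
        r.append(back)
    last = max(range(N), key=lambda i: delta[i])      # end state
    path = [last]
    for t in range(T - 1, 0, -1):                     # backtrack
        path.append(r[t][path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical toy model: 2 states, 3 symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

print(viterbi(pi, A, B, [0, 1, 2]))  # path [0, 0, 1], probability roughly 0.01512
```

The only change from the forward recursion is that the sum over predecessor states becomes a max, plus the bookkeeping needed to recover the path.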
25
Problem 3: Given O, estimate parameters for H to maximize P(O|H)
- No known way to analytically maximize P(O | H), or to solve for optimal parameters
- Can locally maximize P(O | H) with the Baum-Welch Algorithm
26
Solution to 3: Baum-Welch Algorithm
- Quite lengthy and beyond our time frame
- Suffice to say, it works
- Baum-Welch is an instance of expectation-maximization (EM); other solutions to problem 3, such as gradient techniques, are also used
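For readers who want to see the shape of the procedure anyway, here is a minimal sketch of one Baum-Welch re-estimation pass for a single discrete observation sequence. It is a simplified illustration under stated assumptions: no scaling, no smoothing, T of at least 2, and every state reachable. The function name and the toy numbers are hypothetical.

```python
def baum_welch_step(pi, A, B, O):
    """One re-estimation pass; returns (new_pi, new_A, new_B, P(O | old model))."""
    N, M, T = len(pi), len(B[0]), len(O)
    # Forward variables f[t][i] = P(O_1..O_t, q_t = S_i | H).
    f = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        f.append([sum(f[t - 1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                  for j in range(N)])
    # Backward variables b[t][i] = P(O_{t+1}..O_T | q_t = S_i, H).
    b = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        b[t] = [sum(A[i][j] * B[j][O[t + 1]] * b[t + 1][j] for j in range(N))
                for i in range(N)]
    prob = sum(f[T - 1])
    # gamma[t][i]: probability of being in S_i at time t, given O.
    gamma = [[f[t][i] * b[t][i] / prob for i in range(N)] for t in range(T)]
    # xi[t][i][j]: probability of the transition S_i -> S_j at time t, given O.
    xi = [[[f[t][i] * A[i][j] * B[j][O[t + 1]] * b[t + 1][j] / prob
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if O[t] == k) /
              sum(gamma[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return new_pi, new_A, new_B, prob

# Hypothetical starting guess and training sequence.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
O = [0, 1, 2, 2, 1, 0, 0, 2]

likelihoods = []
for _ in range(5):
    pi, A, B, prob = baum_welch_step(pi, A, B, O)
    likelihoods.append(prob)
print(likelihoods)  # P(O | H) before each update; should not decrease
```

The result depends on the starting guess, since the procedure only climbs to a local maximum of P(O | H).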
27
Ergodic vs. Left-to-Right
Ergodic model: Left-to-Right Model:
28
Variations on HMM
- Null transition
  - Transition between states that produces no output
  - For example: to model alternate word pronunciations
- Tied parameters
  - Set up an equivalence relation between parameters
  - For example: between the observation probabilities of 2 states which have the same B
  - Reduces size of model, and makes prob 3 easier
- State duration density
  - Inherent probability of staying in state S_i for d iterations is $(a_{ii})^{d-1} (1 - a_{ii})$
  - May not be appropriate for physical signals, and so an explicit state duration probability density is introduced
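The inherent duration density is a geometric distribution, which a few lines of code make visible. This is a small sketch; the function names and the value 0.7 are illustrative assumptions.

```python
def duration_pmf(a_ii, d):
    """P(stay in S_i for exactly d steps) = a_ii**(d - 1) * (1 - a_ii)."""
    return a_ii ** (d - 1) * (1.0 - a_ii)

def expected_duration(a_ii):
    """Mean of the geometric duration distribution."""
    return 1.0 / (1.0 - a_ii)

a_ii = 0.7  # hypothetical self-transition probability
print([round(duration_pmf(a_ii, d), 4) for d in range(1, 6)])
print(expected_duration(a_ii))
```

The probabilities shrink with every extra step, so the most likely duration is always a single step, which is one reason the implicit density can be a poor fit for physical signals.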
29
Issues with HMM implementation
- Scaling
  - Product of very small terms: the machine may not be precise enough, so we scale
- Multiple observation sequences
  - In a left-to-right model, only a small number of observations is available for each state, requiring several sequences for parameter estimation (prob 3)
- Initial estimate
  - Normal distributions are fine for P and A, but B is sensitive to the initial estimate
  - Again, this is an issue for problem 3
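One common way to handle the scaling issue is to renormalize the forward variables at every time step and accumulate the logarithms of the scaling factors. The sketch below assumes that approach; the function name and toy model are illustrative.

```python
import math

def log_likelihood_scaled(pi, A, B, O):
    """Return log P(O | H), rescaling the forward variables at each step."""
    N = len(pi)
    f = [pi[i] * B[i][O[0]] for i in range(N)]
    log_p = 0.0
    for t in range(len(O)):
        if t > 0:
            f = [sum(f[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
        c = sum(f)                 # scaling factor for this time step
        f = [x / c for x in f]     # rescaled forward variables sum to 1
        log_p += math.log(c)       # log P(O | H) is the sum of the log factors
    return log_p

# Hypothetical toy model: 2 states, 3 symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

print(log_likelihood_scaled(pi, A, B, [0, 1, 2]))         # log of roughly 0.03628
print(log_likelihood_scaled(pi, A, B, [0, 1, 2] * 2000))  # stays finite for T = 6000
```

Without the rescaling, the product for the 6000-step sequence would underflow to zero in double precision.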
30
Issues with HMM implementation
- Insufficient training data
  - For example: not enough occurrences of different events in a given state
  - Possible solution: reduce the model to a subset for which more data exists, and linearly interpolate between model parameters
    - Interpolation weightings are a function of the amount of training data
  - Alternately, could impose some lower bound on individual observation probabilities
- Model choice
  - Ergodic vs. LTR (or other), continuous vs. discrete observation densities, number of states, etc.
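The lower-bound idea can be sketched as a floor followed by renormalization. The threshold `eps`, the function name, and the example matrix are assumptions for illustration; note that renormalizing can leave a floored entry marginally below `eps`.

```python
def floor_observation_probs(B, eps=1e-3):
    """Raise every observation probability to at least eps, then renormalize each row."""
    floored = [[max(p, eps) for p in row] for row in B]
    return [[p / sum(row) for p in row] for row in floored]

# Hypothetical estimate in which some symbols were never seen in a state.
B = [[0.9, 0.1, 0.0],
     [0.0, 0.0, 1.0]]

for row in floor_observation_probs(B):
    print(row)
```

This keeps an unseen symbol from forcing P(O | H) to exactly zero at recognition time.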
31
Markov Processes Used in Composition
- Xenakis
- Tenney
- Hiller
- Chadabe (performance)
- Charles Ames
  - Student of Hiller
- Many others since
32
Example Application
Isolated word recognition (Rabiner)
- Each word v is modeled as a distinct HMM H_v
- Training set of k occurrences per word, O^1,…,O^k
  - Each of which is an observation sequence
- Need to:
  - Estimate parameters for each H_v that maximize P(O^1,…,O^k | H_v) (i.e. prob 3)
  - Extract features O = (O_1,…,O_T) from the unknown word
  - Calculate P(O | H_v) for all v (prob 1), and find the v which maximizes it
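The recognition step, scoring the unknown word against every word model and taking the best, can be sketched as below. The two word models, their parameters, and the names `forward` and `recognize` are made up for illustration; a real system would use trained models and scaled or log probabilities.

```python
def forward(pi, A, B, O):
    """P(O | H) via the forward algorithm (problem 1)."""
    N = len(pi)
    f = [pi[i] * B[i][O[0]] for i in range(N)]
    for o in O[1:]:
        f = [sum(f[i] * A[i][j] for i in range(N)) * B[j][o] for j in range(N)]
    return sum(f)

def recognize(models, O):
    """Return the word v whose model H_v maximizes P(O | H_v)."""
    return max(models, key=lambda v: forward(*models[v], O))

# Hypothetical left-to-right models over a 3-entry codebook.
left_to_right = [[0.6, 0.4], [0.0, 1.0]]
models = {
    "yes": ([1.0, 0.0], left_to_right, [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
    "no":  ([1.0, 0.0], left_to_right, [[0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]),
}

print(recognize(models, [0, 0, 1]))  # yes
print(recognize(models, [2, 2, 2]))  # no
```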
33
Make Observation
- Feature extraction: at each frame, cepstral coefficients and their derivatives are taken
- Vector quantization: the observed frame is mapped to a possible observation (codebook entry) via nearest neighbor
  - Assuming discrete observation probability
- Codebook entries are estimated by segmenting the training data, and taking the centroid of all frame vectors for each segment
  - A la k-means clustering
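The nearest-neighbor mapping from frames to codebook entries can be sketched as follows. Squared Euclidean distance is an assumption here (the slides only say "nearest neighbor"), and the two-dimensional vectors stand in for real cepstral feature vectors.

```python
def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))
            for x in frames]

# Hypothetical 2-D codebook and frames.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.3)]

print(quantize(frames, codebook))  # [0, 1, 2]
```

The resulting index sequence is the discrete observation sequence O that the word models consume.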
34
Choice of Model and Parameters
- Left-to-Right model is more appropriate
  - Thus we have P(q_1 = S_1) = 1
- Choice of states, two ideas:
  - Let a state correspond to a phoneme
  - Let a state correspond to an analysis frame
- Update model parameters:
  - Segment the training data into states based on the current model, using the Viterbi algorithm (prob 2)
  - Update the A, B probabilities based on the observed data
  - For example: b_j(O_k) = number of observed vectors nearest to O_k in state j, divided by the total number of observed vectors in state j
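Once the Viterbi segmentation has assigned a state to every frame, the update reduces to counting. The sketch below assumes a single aligned sequence of state indices and codebook indices; the function name and the example alignment are hypothetical.

```python
def reestimate_from_alignment(states, symbols, N, M):
    """Estimate A and B by counting over one state-aligned observation sequence."""
    trans = [[0] * N for _ in range(N)]
    emit = [[0] * M for _ in range(N)]
    for t, (s, k) in enumerate(zip(states, symbols)):
        emit[s][k] += 1                      # symbol k was observed in state s
        if t + 1 < len(states):
            trans[s][states[t + 1]] += 1     # transition from s to the next state
    def normalize(row):
        total = sum(row)
        return [c / total if total else 0.0 for c in row]
    return [normalize(r) for r in trans], [normalize(r) for r in emit]

# Hypothetical Viterbi alignment: state index and codebook index per frame.
states = [0, 0, 0, 1, 1]
symbols = [0, 0, 1, 2, 2]

A, B = reestimate_from_alignment(states, symbols, N=2, M=3)
print(A)  # [[2/3, 1/3], [0.0, 1.0]]
print(B)  # [[2/3, 1/3, 0.0], [0.0, 0.0, 1.0]]
```

Alternating this counting step with a fresh Viterbi segmentation gives the iterative update the slide describes.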
35
State Duration Density
- If phoneme segmentation is used, it may be advantageous to determine a state duration density
  - Variable state length for each phoneme
- Pyramid of death
36
Conclusion
Advantages
- Has contributed quite a bit to speech recognition
- With the algorithms we have described, computation is reasonable
- Complex processes can be modeled with low-dimensional data
- Works well for time-varying classification
  - Other examples: gesture recognition, formant tracking
Limitations
- Assumption that successive observations are independent
- First-order assumption: the probability of the state at time t depends only on the state at time t-1
- Needs to be “tailor made” for the specific application
- Needs lots of training data, in order to see all observations