Hidden Markov Models (with slides from Lise Getoor, Sebastian Thrun, William Cohen, and Yair Weiss)
Outline
- Markov Models
- Hidden Markov Models
- The Main Problems in HMM Context
- Implementation Issues
- Applications of HMMs
Weather: A Markov Model
[Figure: a three-state Markov chain over Sunny, Rainy, and Snowy, with transition arrows labeled 80%, 15%, 5%, 60%, 2%, 38%, 20%, 75%, 5%]
Ingredients of a Markov Model
- States: S = {s_1, ..., s_N}
- State transition probabilities: a_ij, the probability of moving from state j to state i
- Initial state distribution: π_i = P(q_1 = s_i)
[Figure: the three-state weather Markov chain from the previous slide]
Ingredients of Our Markov Model
- States: {Sunny, Rainy, Snowy}
- State transition probabilities: the percentages on the arrows of the weather diagram
- Initial state distribution: π, the probability of each kind of weather on the first day
[Figure: the three-state weather Markov chain]
Probability of a Sequence of States
- Given: a state sequence q_1, q_2, ..., q_T (e.g., Sunny, Sunny, Rainy, ...)
- What is the probability of this sequence of states?
  P(q_1, ..., q_T) = P(q_1) * P(q_2 | q_1) * ... * P(q_T | q_{T-1})
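To make the chain-rule computation concrete, here is a minimal Python/NumPy sketch; the transition matrix, the initial distribution, and the state encoding (0 = Sunny, 1 = Rainy, 2 = Snowy) are illustrative placeholders rather than the exact values from the diagram, and the code uses the row convention A[i, j] = P(q_{t+1} = j | q_t = i).

```python
import numpy as np

# Illustrative three-state weather model (Sunny, Rainy, Snowy).
# A[i, j] = P(next state = j | current state = i)  (row-stochastic convention)
A = np.array([[0.80, 0.15, 0.05],
              [0.38, 0.60, 0.02],
              [0.20, 0.05, 0.75]])
pi = np.array([1.0, 0.0, 0.0])   # assumption: we start on a sunny day

def state_sequence_prob(states, A, pi):
    """P(q_1, ..., q_T) = pi[q_1] * prod_t A[q_t, q_{t+1}]."""
    p = pi[states[0]]
    for s_prev, s_next in zip(states, states[1:]):
        p *= A[s_prev, s_next]
    return p

# P(Sunny, Sunny, Rainy, Rainy, Snowy) with states encoded as 0, 1, 2
print(state_sequence_prob([0, 0, 1, 1, 2], A, pi))
```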
Outline
- Markov Models
- Hidden Markov Models
- The Main Problems in HMM Context
- Implementation Issues
- Applications of HMMs
Hidden Markov Models
[Figure: the same weather Markov chain (Sunny, Rainy, Snowy), now marked NOT OBSERVABLE; each hidden state additionally emits one of three observation symbols, with emission probabilities 60%/10%/30%, 65%/5%/30%, and 50%/0%/50%]
Ingredients of an HMM
- States: S = {s_1, ..., s_N}
- State transition probabilities: a_ij, the probability of moving from state j to state i
- Initial state distribution: π_i = P(q_1 = s_i)
- Observations: a discrete output alphabet {1, ..., M}
- Observation probabilities: b_jk, the probability of emitting output k in state j
Ingredients of Our HMM
- States: {Sunny, Rainy, Snowy} (hidden)
- Observations: the three output symbols emitted by each weather state in the diagram
- State transition probabilities: the transition percentages of the weather chain
- Initial state distribution: π
- Observation probabilities: the emission percentages in the diagram
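As a concrete data structure, a discrete HMM is fully specified by the triple (pi, A, B). The sketch below uses illustrative numbers: the emission rows reuse the percentages visible in the diagram, but their assignment to particular states and symbols is an assumption.

```python
import numpy as np

# A complete discrete HMM is (pi, A, B); the numbers below are illustrative only.
states = ["Sunny", "Rainy", "Snowy"]
observations = ["o1", "o2", "o3"]            # hypothetical observation alphabet

pi = np.array([1/3, 1/3, 1/3])               # initial state distribution
A = np.array([[0.80, 0.15, 0.05],            # A[i, j] = P(q_{t+1} = j | q_t = i)
              [0.38, 0.60, 0.02],
              [0.20, 0.05, 0.75]])
B = np.array([[0.60, 0.10, 0.30],            # B[j, k] = P(o_t = k | q_t = j)
              [0.65, 0.05, 0.30],
              [0.50, 0.00, 0.50]])

# every row of A and B must be a probability distribution
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```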
Three Basic Problems
- Evaluation (aka likelihood): compute P(O | HMM)
- Decoding (aka inference): given an observed output sequence O,
  - compute the most likely state at each time period
  - compute the most likely state sequence q* = argmax_q P(q | O, HMM)
- Training (aka learning): find HMM* = argmax_HMM P(O | HMM)
Probability of an Output Sequence
- Given: an HMM and an observation sequence O = o_1, ..., o_T
- What is the probability of this output sequence?
  P(O) = sum over all state sequences q of P(O, q) -- naively, an exponential number of terms
The Forward Algorithm
[Figure: a trellis with states S1, S2, S3 and outputs O1, O2, O3 unrolled over time; alpha_t(i) = P(o_1..o_t, q_t = s_i) is computed column by column]
The Forward Algorithm (cont.)
[Figure: the same trellis, highlighting one step of the recursion]
alpha_{t+1}(j) = [ sum over i of alpha_t(i) * a(i -> j) ] * b_j(O_{t+1})
-- first get to state i, then move to state j, then emit output O_{t+1}
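A sketch of the forward recursion in Python/NumPy, under the same row conventions as the earlier sketch (A[i, j] = P(q_{t+1} = j | q_t = i), B[j, k] = P(o_t = k | q_t = j)); alpha[t, i] stands for alpha_t(i), and the example observation sequence is made up.

```python
import numpy as np

def forward(obs, pi, A, B):
    """alpha[t, i] = P(o_1..o_t, q_t = i); the total likelihood is alpha[-1].sum()."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # start in state i and emit the first output
    for t in range(1, T):
        # first get to some state i at time t-1, then move to state j, then emit obs[t]
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

# With the (pi, A, B) arrays from the earlier sketch:
#   alpha = forward([0, 2, 1, 0], pi, A, B)
#   print("P(O) =", alpha[-1].sum())
```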
Exercise
What is the probability of observing AB?
a. starting in state s_1
b. with the initial state chosen at random
[Figure: a two-state HMM over s_1 and s_2, with emission probabilities for A and B and transition probabilities shown on the diagram, together with the two partial computations]
The Backward Algorithm
[Figure: the same trellis, now processed from right to left; beta_t(i) = P(o_{t+1}..o_T | q_t = s_i)]
P(O) = sum over i of P(q_1 = s_i) * P(emit O_1 in state i) * beta_1(i)
The Forward-Backward Algorithm
[Figure: the trellis again, combining the two passes]
P(O) = sum over i of alpha_t(i) * beta_t(i), for any t
=> from this identity you can derive the formulas for both the forward and the backward algorithm
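A matching sketch of the backward pass and the smoothed state posteriors, assuming the same conventions and the forward() function from the sketch above; the identity on the slide, P(O) = sum_i alpha_t(i) * beta_t(i) for any t, appears as a consistency check.

```python
import numpy as np

def backward(obs, A, B):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = i)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))                           # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # move from i to j, emit obs[t+1], then continue with beta[t+1, j]
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def state_posteriors(alpha, beta):
    """gamma[t, i] = P(q_t = i | O) -- the per-time-step smoothed posteriors."""
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Consistency check from the slide: for every t, P(O) = sum_i alpha[t, i] * beta[t, i]
#   np.allclose((alpha * beta).sum(axis=1), alpha[-1].sum())
```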
Finding the Best State Sequence
We would like to find the most likely path (and not just the most likely state at each time slice). The Viterbi algorithm is an efficient dynamic-programming method for finding this most probable explanation (MPE), q* = argmax_q P(q | O, HMM); back-pointers recorded during the recursion let us reconstruct the path.
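A minimal Viterbi sketch under the same conventions as the earlier code; in practice the products would be replaced by sums of logarithms (see the implementation notes later), but plain probabilities keep the recursion readable.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence q* = argmax_q P(q | O) for a discrete HMM."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))           # delta[t, j] = best score of any path ending in j at time t
    psi = np.zeros((T, N), dtype=int)  # back-pointers for path reconstruction
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # scores[i, j]: come from i, move to j
        psi[t] = scores.argmax(axis=0)               # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # reconstruct the path by following the back-pointers from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```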
Hidden Markov Models
[Figure: the hidden weather HMM again, as a reminder that the state sequence is NOT OBSERVABLE]
Learning the Model with EM
Problem: find the HMM that makes the data most likely
- E-Step: compute the expected state occupancies and transitions for the given, fixed model
- M-Step: compute a new model under these expectations (this is now a Markov model estimation problem)
E-Step
- Calculate the state and transition posteriors, gamma_t(i) = P(q_t = i | O) and xi_t(i, j) = P(q_t = i, q_{t+1} = j | O), using the forward-backward algorithm, for the fixed current model
The M-Step: generate a new model λ = (π, a, b) from the expected counts
Understanding the EM Algorithm
- The best way to understand the EM algorithm:
  - start with the M-step and understand what quantities it needs
  - then look at the E-step and see how it computes those quantities with the help of the forward-backward algorithm
Summary (Learning)
- Given an observation sequence O
- Guess an initial model
- Iterate:
  - calculate the expected time spent in state S_i at time t (and the expected transitions from S_i at time t to S_j at time t+1) using the forward-backward algorithm
  - find the new model by (expected) frequency counts, as in the sketch below
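A sketch of one Baum-Welch (EM) iteration for a single discrete observation sequence, assuming the forward() and backward() functions from the earlier sketches and the same array conventions; a real implementation would use scaling (next slide) and accumulate counts over many training sequences.

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One EM (Baum-Welch) iteration for a single discrete observation sequence.
    Assumes forward() and backward() as sketched above."""
    T, N = len(obs), len(pi)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = alpha[-1].sum()

    # E-step: expected state occupancies and transitions
    gamma = alpha * beta / likelihood                       # gamma[t, i] = P(q_t = i | O)
    xi = np.zeros((T - 1, N, N))                            # xi[t, i, j] = P(q_t = i, q_{t+1} = j | O)
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A *
                 B[:, obs[t + 1]] * beta[t + 1]) / likelihood

    # M-step: re-estimate parameters by expected frequency counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, likelihood
```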
Implementing HMM Algorithms
- Quantities get very small for long sequences
- Taking logarithms helps for:
  - the Viterbi algorithm
  - computing the alphas and betas
  - but it is not helpful in computing the gammas
- A normalization method can help with these problems (see the sketch below)
  - see the note by ChengXiang Zhai
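A sketch of the normalization (scaling) approach for the forward pass: alpha is renormalized at every time step and the log-likelihood is accumulated from the normalizers, which keeps the numbers well-scaled on long sequences (conventions as in the earlier sketches).

```python
import numpy as np

def forward_scaled(obs, pi, A, B):
    """Scaled forward pass: alpha_hat[t] is the normalized forward vector at time t,
    and the log-likelihood is accumulated from the per-step normalizers."""
    T, N = len(obs), len(pi)
    alpha_hat = np.zeros((T, N))
    log_likelihood = 0.0
    for t in range(T):
        a = pi * B[:, obs[0]] if t == 0 else (alpha_hat[t - 1] @ A) * B[:, obs[t]]
        c = a.sum()                       # normalizer c_t; P(O) = prod_t c_t
        alpha_hat[t] = a / c              # alpha_hat[t] sums to 1
        log_likelihood += np.log(c)       # accumulate log P(O) without underflow
    return alpha_hat, log_likelihood
```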
Problems with HMMs
- Zero probabilities
  - training sequence: AAABBBAAA
  - test sequence: AAABBBCAAA (the unseen symbol C gets probability zero)
- Finding the "right" number of states and the right structure
- Numerical instabilities
Outline
- Markov Models
- Hidden Markov Models
- The Main Problems in HMM Context
- Implementation Issues
- Applications of HMMs
Three Problems
- What bird is this? -> time series classification
- How will the song continue? -> time series prediction
- Is this bird abnormal? -> outlier detection
Time Series Classification
- Train one HMM for each bird (class)
- Given a time series O, calculate P(O | HMM_i) for each class and pick the class with the highest likelihood
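A sketch of that classification rule, assuming one trained (pi, A, B) triple per class and the forward_scaled() function from the implementation sketch; the dictionary layout and names are hypothetical.

```python
def classify(obs, models):
    """Pick the class whose HMM assigns the observed sequence the highest likelihood.
    `models` maps a class label (e.g. a bird species) to its (pi, A, B) parameters;
    forward_scaled() is the scaled forward pass sketched above."""
    scores = {label: forward_scaled(obs, pi, A, B)[1]      # log P(O | model)
              for label, (pi, A, B) in models.items()}
    return max(scores, key=scores.get)
```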
Outlier Detection
- Train an HMM on normal sequences
- Given a time series O, calculate its probability P(O | HMM)
- If the probability is abnormally low, raise a flag (a high probability means the series looks normal)
Time Series Prediction
- Train an HMM
- Given a time series O, calculate the distribution over the final state (via the forward variables alpha) and 'hallucinate' new states and observations according to a and b
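A sketch of that 'hallucination' step: filter the observed series to get a distribution over the final hidden state, then sample future states and observations from a and b. It assumes the forward_scaled() function above, whose normalized final row is exactly P(q_T | o_1..o_T); the function name and signature are illustrative.

```python
import numpy as np

def predict_continuation(obs, pi, A, B, horizon, seed=None):
    """Sample `horizon` future observations by rolling the HMM forward
    from the filtered distribution over the final hidden state."""
    rng = np.random.default_rng(seed)
    alpha_hat, _ = forward_scaled(obs, pi, A, B)             # from the earlier sketch
    state = rng.choice(len(pi), p=alpha_hat[-1])             # draw q_T ~ P(q_T | O)
    future = []
    for _ in range(horizon):
        state = rng.choice(A.shape[1], p=A[state])           # next state ~ a
        future.append(int(rng.choice(B.shape[1], p=B[state])))  # next observation ~ b
    return future
```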
Typical HMM in Speech Recognition
- 20-dimensional frequency space, clustered using EM
- Use Bayes rule + Viterbi for classification
- Linear HMM representing one phoneme
[Rabiner 86] + everyone else
Typical HMM in Robotics [Blake/Isard 98, Fox/Dellaert et al 99]
IE with Hidden Markov Models
- Given a sequence of observations: "Yesterday Pedro Domingos spoke this example sentence."
- and a trained HMM with states such as person name, location name, and background,
- find the most likely state sequence (Viterbi).
- Any words said to be generated by the designated "person name" state are extracted as a person name: Pedro Domingos
HMM for Segmentation
- Simplest model: one state per entity type
What is a "symbol"?
- Cohen => "Cohen", "cohen", "Xxxxx", "Xx", ... ?
- 4601 => "4601", "9999", "9+", "number", ... ?
- Datamold: choose the best abstraction level using a holdout set
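A rough sketch of how such abstraction levels might be generated for a token; the regular expressions and the resulting feature list are assumptions for illustration, not the actual Datamold feature hierarchy.

```python
import re

def abstractions(token):
    """Candidate abstraction levels for a token, from most specific to most generic.
    A rough sketch; real systems use richer, hand-designed hierarchies."""
    shape = re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", re.sub(r"[0-9]", "9", token)))
    short_shape = re.sub(r"(.)\1+", r"\1", shape)      # collapse runs: "Xxxxx" -> "Xx"
    generic = "number" if token.isdigit() else "word"
    return [token, token.lower(), shape, short_shape, generic]

print(abstractions("Cohen"))   # ['Cohen', 'cohen', 'Xxxxx', 'Xx', 'word']
print(abstractions("4601"))    # ['4601', '4601', '9999', '9', 'number']
```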
HMM Example: "Nymble" [Bikel et al. 1998], [BBN "IdentiFinder"]
- Task: named entity extraction
- States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence
- Transition probabilities: P(s_t | s_t-1, o_t-1), backing off to P(s_t | s_t-1), then P(s_t)
- Observation probabilities: P(o_t | s_t, s_t-1) or P(o_t | s_t, o_t-1), backing off to P(o_t | s_t), then P(o_t)
- Train on ~500k words of newswire text
- Results (F1): mixed-case English 93%, upper-case English 91%, mixed-case Spanish 90%
- Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]
Passage Selection (e.g., for IR)
[Figure: given a query and a collection, a document is split into relevant passages and background passages]
How is a relevant passage different from a background passage in terms of language modeling?
HMMs: Main Lessons
- HMMs: generative probabilistic models of time series (with hidden state)
- Forward-Backward: algorithm for computing probabilities over the hidden states
- Learning models: EM, which iterates estimation of the hidden state and model fitting
- Extremely practical; among the best known methods in speech, computer vision, robotics, ...
- Numerous extensions exist (continuous observations and states, factorial HMMs, controllable HMMs = POMDPs, ...)