CSC 594 Topics in AI – Natural Language Processing
Spring 2018
11. Hidden Markov Model
(Some slides adapted from Jurafsky & Martin, and Raymond Mooney at UT Austin)
Hidden Markov Models
A Hidden Markov Model (HMM) is a sequence model. A sequence model, or sequence classifier, is a model that assigns a label or class to each unit in a sequence, thus mapping a sequence of observations to a sequence of labels. An HMM is a probabilistic sequence model: given a sequence of units (e.g. words, letters, morphemes, sentences), it computes a probability distribution over possible sequences of labels and chooses the best label sequence. This is a kind of generative model.
Sequence Labeling in NLP
Many NLP tasks use sequence labeling:
- Speech recognition
- Part-of-speech tagging
- Named Entity Recognition
- Information Extraction
But not all NLP tasks are sequence labeling tasks, only those in which information about earlier items in the sequence affects or helps the labeling of the next one.
Definitions
A weighted finite-state automaton adds probabilities to the arcs:
- The probabilities on the arcs leaving any state must sum to one.
A Markov chain is a special case of a weighted finite-state automaton in which the input sequence uniquely determines which states the automaton will go through.
- Markov chains can't represent inherently ambiguous problems.
- They are useful for assigning probabilities to unambiguous sequences.
Markov Chain for Weather
Markov Chain: "First-order observable Markov Model"
- A set of states Q = q1, q2, …, qN; the state at time t is qt.
- Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann. Each aij represents the probability of transitioning from state i to state j, and the probabilities leaving any state sum to one. The set of these is the transition probability matrix A.
- A special initial probability vector π.
- Markov assumption: the current state depends only on the previous state.
Markov Chain for Weather
What is the probability of 4 consecutive warm days?
- The sequence is warm-warm-warm-warm, so the state sequence is 3-3-3-3.
- P(3, 3, 3, 3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
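To make the computation concrete, here is a minimal Python sketch of scoring a state sequence under a first-order Markov chain. Only the two numbers used on this slide (π_warm = 0.2 and a_warm,warm = 0.6) are taken from the example; the dictionary layout and names are illustrative, not from the original slides.

```python
# Minimal sketch: probability of a state sequence under a first-order Markov chain.
# Only pi_warm = 0.2 and a_warm_warm = 0.6 come from the slide's example;
# the data-structure layout is an assumption for illustration.

pi = {"warm": 0.2}                   # probability of starting in "warm"
A = {("warm", "warm"): 0.6}          # transition probability warm -> warm

def sequence_probability(states, pi, A):
    """P(q1, ..., qT) = pi[q1] * product over t of A[q_{t-1}, q_t]."""
    prob = pi[states[0]]
    for prev, curr in zip(states, states[1:]):
        prob *= A[(prev, curr)]
    return prob

print(sequence_probability(["warm"] * 4, pi, A))   # 0.2 * 0.6**3 = 0.0432
```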
Hidden Markov Model (HMM)
Often, however, we want to know what produced the observed sequence, i.e. the hidden sequence behind it. For example:
- Inferring the words (hidden) from the acoustic signal (observed) in speech recognition
- Assigning part-of-speech tags (hidden) to a sentence, a sequence of words (POS tagging)
- Assigning named entity categories (hidden) to a sentence, a sequence of words (Named Entity Recognition)
Two Kinds of Probabilities
- State transition probabilities, P(ti | ti-1): state-to-state transition probabilities
- Observation/emission likelihoods, P(wi | ti): probabilities of observing various values in a given state
What's the state sequence for the observed sequence "1 2 3"?
Three Problems
Given this framework, there are three problems that we can pose to an HMM:
1. Given an observation sequence and a model, what is the probability of that sequence?
2. Given an observation sequence and a model, what is the most likely state sequence?
3. Given an observation sequence, find the best model parameters for a partially specified model.
Problem 1: Observation Likelihood
The probability of a sequence given a model.
- Used in model development: how do I know whether some change I made to the model is making things better?
- Also used in classification tasks: word spotting in ASR, language identification, speaker identification, author identification, etc.
  - Train one HMM model per class.
  - Given an observation, pass it to each model and compute P(seq | model).
Problem 2: Decoding
The most probable state sequence given a model and an observation sequence.
- Typically used in tagging problems, where the tags correspond to hidden states.
- As we'll see, almost any problem can be cast as a sequence labeling problem.
Problem 3: Learning
Infer the best model parameters, given a partial model and an observation sequence.
- That is, fill in the A and B tables with the right numbers: the numbers that make the observation sequence most likely.
- This is how the probabilities are learned.
Solutions
- Problem 1: Forward
- Problem 2: Viterbi
- Problem 3: Forward-Backward, an instance of EM (Expectation-Maximization)
Problem 2: Decoding
We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest.
- The hat ^ means "our estimate of the best one".
- argmax_x f(x) means "the x such that f(x) is maximized".
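Written out as a formula (standard Jurafsky & Martin notation; the slide's own equation is not reproduced here):

```latex
\[
\hat{t}_1^n \;=\; \arg\max_{t_1^n} \, P(t_1^n \mid w_1^n)
\]
```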
Getting to HMMs
This equation should give us the best tag sequence, but how do we make it operational? How do we compute this value?
- Intuition of Bayesian inference: use Bayes' rule to transform the equation into a set of probabilities that are easier to compute (and give the right answer).
Using Bayes Rule
Know this.
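The Bayes-rule rewrite the slide is pointing at, in its standard form; the denominator P(w_1^n) can be dropped because it is the same for every candidate tag sequence:

```latex
\begin{align*}
\hat{t}_1^n &= \arg\max_{t_1^n} P(t_1^n \mid w_1^n)
             = \arg\max_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)} \\
            &= \arg\max_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)
\end{align*}
```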
Likelihood and Prior
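The slide's equations are not reproduced, so here is the usual textbook form of the two simplifying assumptions behind "likelihood and prior": each word depends only on its own tag, and the tag prior is approximated by a bigram model.

```latex
\begin{align*}
P(w_1^n \mid t_1^n) &\approx \prod_{i=1}^{n} P(w_i \mid t_i)
  && \text{(likelihood: each word depends only on its tag)} \\
P(t_1^n)            &\approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})
  && \text{(prior: tag bigram model)} \\
\hat{t}_1^n         &\approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\end{align*}
```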
Definition of HMM
- States: Q = q1, q2, …, qN
- Observations: O = o1, o2, …, oT; each observation is a symbol drawn from a vocabulary V = {v1, v2, …, vV}
- Transition probabilities: transition probability matrix A = {aij}
- Observation likelihoods: output probability matrix B = {bi(k)}
- A special initial probability vector π
HMMs for Ice Cream
- You are a climatologist in the year 2799 studying global warming.
- You can't find any records of the weather in Baltimore for the summer of 2007.
- But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer.
- Your job: figure out how hot it was each day.
Eisner Task
Given the ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, …
Produce the hidden weather sequence: H, C, H, H, H, C, C, …
HMM for Ice Cream
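Since the slide's figure is not reproduced, here is a sketch of the ice-cream HMM parameters in Python. Only the values that appear in the worked CHC example below (P(C|Start) = 0.2, P(1|C) = 0.5, P(H|C) = 0.4, P(3|H) = 0.4, P(C|H) = 0.3) come from these slides; the remaining entries follow the commonly used Eisner / Jurafsky & Martin version of the example and should be treated as assumptions.

```python
import numpy as np

states = ["H", "C"]          # hidden states: Hot, Cold
vocab = [1, 2, 3]            # observations: number of ice creams eaten

# Initial, transition, and emission probabilities. Values not used in the
# worked example below are assumed, since the original figure is not shown here.
pi = np.array([0.8, 0.2])                 # P(H|Start), P(C|Start)
A = np.array([[0.7, 0.3],                 # P(H|H), P(C|H)
              [0.4, 0.6]])                # P(H|C), P(C|C)
B = np.array([[0.2, 0.4, 0.4],            # P(1|H), P(2|H), P(3|H)
              [0.5, 0.4, 0.1]])           # P(1|C), P(2|C), P(3|C)
```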
Ice Cream HMM
Let's just take 1 3 1 as the observation sequence.
- How many underlying state (hot/cold) sequences are there? Eight: HHH, HHC, HCH, HCC, CCC, CCH, CHC, CHH
- How do you pick the right one? argmax over state sequences of P(state sequence | 1 3 1)
Ice Cream HMM
Let's just do one state sequence: CHC, with Cold as the initial state.
- Cold as the initial state: P(Cold | Start) = 0.2
- Observing a 1 on a cold day: P(1 | Cold) = 0.5
- Hot as the next state: P(Hot | Cold) = 0.4
- Observing a 3 on a hot day: P(3 | Hot) = 0.4
- Cold as the next state: P(Cold | Hot) = 0.3
- Observing a 1 on a cold day: P(1 | Cold) = 0.5
Multiplying these together: 0.2 × 0.5 × 0.4 × 0.4 × 0.3 × 0.5 = 0.0024
But there are other state sequences that could generate 1 3 1, for example HHC.
Viterbi Algorithm
- Create an array with columns corresponding to observations and rows corresponding to possible hidden states.
- Recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state sj.
- Also record "backpointers" that subsequently allow backtracing the most probable state sequence.
The Viterbi Algorithm
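The slide's pseudocode is a figure, so here is a minimal runnable sketch of Viterbi decoding instead, using the same (π, A, B) layout as the ice-cream parameters above (those values are assumptions where the slides do not state them). It is an illustration, not the textbook's exact pseudocode.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state sequence for a sequence of observation indices.

    pi: (N,) initial probabilities; A: (N, N) transitions with A[i, j] = P(j | i);
    B: (N, V) emissions with B[i, k] = P(observation k | state i).
    Returns (best_path, best_prob).
    """
    N, T = len(pi), len(obs)
    v = np.zeros((N, T))                    # v[s, t]: best prob of a path ending in s at t
    back = np.zeros((N, T), dtype=int)      # backpointers

    v[:, 0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for s in range(N):
            scores = v[:, t - 1] * A[:, s] * B[s, obs[t]]
            back[s, t] = np.argmax(scores)
            v[s, t] = scores[back[s, t]]

    # Backtrace from the best final state.
    path = [int(np.argmax(v[:, T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1], float(v[:, T - 1].max())

# Usage with the (assumed) ice-cream parameters; states 0=Hot, 1=Cold,
# observation indices 0..2 stand for eating 1..3 ice creams.
pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
path, prob = viterbi([0, 2, 0], pi, A, B)   # observation sequence 1 3 1
print(path, prob)
```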
Viterbi Example: Ice Cream
Viterbi Example
Viterbi Backpointers
(Trellis figure: states s0, s1, s2, …, sN, sF over time steps t1, t2, t3, …, tT-1, tT, with a backpointer stored at each cell.)
Viterbi Backtrace
(Trellis figure: states s0, s1, s2, …, sN, sF over time steps t1 … tT; the backpointers are followed from the final state back to the start.)
Most likely sequence: s0 sN s1 s2 … s2 sF
Problem 1: Forward
Given an observation sequence, return the probability of the sequence given the model.
- In a plain Markov model, the states and the observations are identical, so the probability of an observation sequence is just the probability of the corresponding state path.
- But not in an HMM: remember that any number of state sequences might be responsible for any given observation sequence.
Forward
- Efficiently computes the probability of an observed sequence given a model, P(sequence | model).
- Nearly identical to Viterbi: replace the MAX with a SUM.
The Forward Algorithm
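As the previous slide says, the forward algorithm is the same recurrence as Viterbi with the max replaced by a sum. A minimal sketch under the same assumed (π, A, B) layout, again an illustration rather than the textbook's pseudocode:

```python
import numpy as np

def forward(obs, pi, A, B):
    """P(observations | model): sum over all hidden-state paths, in O(N^2 * T) time."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((N, T))              # alpha[s, t] = P(o_1..o_t, q_t = s)
    alpha[:, 0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for s in range(N):
            # Same recurrence as Viterbi, but summing instead of maximizing.
            alpha[s, t] = np.sum(alpha[:, t - 1] * A[:, s]) * B[s, obs[t]]
    return float(alpha[:, T - 1].sum())

# Usage with the (assumed) ice-cream parameters: P(1 3 1 | model).
pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
print(forward([0, 2, 0], pi, A, B))
```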
Visualizing Forward
Problem 3
Infer the best model parameters, given a skeletal model and an observation sequence.
- That is, fill in the A and B tables with the right numbers: the numbers that make the observation sequence most likely.
- Useful for getting an HMM without having to hire annotators.
Solutions
- Problem 1: Forward
- Problem 2: Viterbi
- Problem 3: EM, i.e. the Forward-Backward (Baum-Welch) algorithm
Forward-Backward
- Baum-Welch = Forward-Backward algorithm (Baum 1972)
- It is a special case of the EM (Expectation-Maximization) algorithm.
- The algorithm lets us train the transition probabilities A = {aij} and the emission probabilities B = {bi(ot)} of the HMM.
Intuition for re-estimation of aij
We will estimate âij via this intuition:
- Numerator intuition: assume we had some estimate of the probability that a given transition i→j was taken at time t in an observation sequence.
- If we knew this probability for each time t, we could sum over all t to get the expected number of transitions from i to j.
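The quantity behind this intuition is conventionally written ξt(i, j): the probability of being in state i at time t and in state j at time t+1, given the observation sequence and the model. In terms of the forward and backward probabilities α and β (standard Baum-Welch notation; the slide's own figure is not reproduced):

```latex
\[
\xi_t(i, j) \;=\; P(q_t = i,\; q_{t+1} = j \mid O, \lambda)
           \;=\; \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}
\]
```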
Re-estimating aij
- The expected number of transitions from state i to state j is the sum over all t of ξt(i, j).
- The total expected number of transitions out of state i is the sum, over all states k, of the expected number of transitions from i to k.
- Dividing the two gives the final formula for the re-estimated aij, shown below.
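With ξ defined as above, the re-estimation formula takes its standard form:

```latex
\[
\hat{a}_{ij} \;=\;
\frac{\displaystyle\sum_{t=1}^{T-1} \xi_t(i, j)}
     {\displaystyle\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}
\]
```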
The Forward-Backward Algorithm
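The backward probability β used above is defined symmetrically to the forward probability α (standard definition; the slide's figure is not reproduced):

```latex
\[
\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda),
\qquad
\beta_T(i) = 1,
\qquad
\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)
\]
```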
Sketch of the Baum-Welch (EM) Algorithm for Training HMMs
- Assume an HMM with N states.
- Randomly set its parameters λ = (A, B), making sure they represent legal distributions.
- Until convergence (i.e. λ no longer changes):
  - E step: use the forward/backward procedure to determine the probability of various possible state sequences for generating the training data.
  - M step: use these probability estimates to re-estimate values for all of the parameters λ.
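Below is a compact Python sketch of this loop for a single observation sequence, under the same matrix layout assumed earlier. It illustrates the E and M steps only, not the textbook's pseudocode: it uses unscaled α and β (so it will underflow on long sequences) and adds no smoothing.

```python
import numpy as np

def baum_welch(obs, N, V, n_iter=20, seed=0):
    """Sketch of Baum-Welch for one observation sequence of indices in range(V)."""
    rng = np.random.default_rng(seed)
    # Random legal distributions for pi, A, B.
    pi = rng.random(N); pi /= pi.sum()
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, V)); B /= B.sum(axis=1, keepdims=True)
    T = len(obs)

    for _ in range(n_iter):
        # E step: forward (alpha) and backward (beta) probabilities.
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[T - 1].sum()

        gamma = alpha * beta / likelihood            # P(q_t = i | O, lambda)
        xi = np.zeros((T - 1, N, N))                 # P(q_t = i, q_{t+1} = j | O, lambda)
        for t in range(T - 1):
            xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / likelihood

        # M step: re-estimate parameters from the expected counts.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(V):
            B[:, k] = gamma[[t for t in range(T) if obs[t] == k]].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```

In practice one works with scaled α and β (or in log space) and accumulates expected counts over many training sequences before each M step.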
EM Properties
- Each iteration changes the parameters in a way that is guaranteed to increase the likelihood of the data, P(O | λ).
- Anytime algorithm: can stop at any time prior to convergence to get an approximate solution.
- Converges to a local maximum.
Summary: Forward-Backward Algorithm
1. Initialize λ = (A, B)
2. Compute α and β using the observations
3. Estimate new λ' = (A', B')
4. Replace λ with λ'
5. If not converged, go to step 2