CSCE 771 Natural Language Processing
Lecture 8: Training HMMs: Learning the Model, the Forward-Backward Algorithm
Readings: Chapter 6
February 11, 2013
Overview
Last time: Hidden Markov Models revisited (Chapter 6); the three problems: likelihood, decoding, and training (learning the model); NLTK book chapter 5 on tagging; NLP videos on YouTube, Coursera, etc.
Today: computational complexity of Problems 1 and 2: the straightforward approach sums over all possible state sequences, O(N^T); dynamic programming (the forward algorithm) brings this down to O(N^2 T). Problem 3, learning the model: the backward computation and the Forward-Backward algorithm.
Ferguson’s 3 Fundamental Problems
Computing likelihood: given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
The decoding problem: given an HMM λ = (A, B) and an observation sequence O = o1, o2, … oT, find the most probable sequence of states Q = q1, q2, … qT.
Learning the model (HMM): given an observation sequence and the set of possible states in the HMM, learn the parameters A and B.
Problem 1: Computing Likelihood
Computing likelihood: given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Example (ice-cream task), for one particular hidden state sequence:
P(O | Q) = P(3 1 3 | H H C) = P(3 | H) × P(1 | H) × P(3 | C)
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.]
Likelihood Computation
P(O, Q) = P(O | Q) P(Q) = Πi P(oi | qi) × Πi P(qi | qi-1)
So P(3 1 3, hot hot cold) = P(hot | start) P(hot | hot) P(cold | hot) × P(3 | hot) P(1 | hot) P(3 | cold)
In general, P(O) = ΣQ P(O, Q) = ΣQ P(O | Q) P(Q), which sums over all possible sequences of states.
Likelihood Computation Performance
The exponential sum over all sequences of states is O(N^T).
Dynamic programming to the rescue: the forward algorithm runs in O(N^2 T).
Compute a trellis of values αt(j) = P(o1 … ot, qt = j | λ) for each state j and each time t:
αt(j) = Σi αt-1(i) aij bj(ot), where
αt-1(i) is the previous forward probability,
aij (from A) is the state transition probability, and
bj(ot) (from B) is the state observation likelihood of observation ot given the current state j.
Forward Trellis (structure like the Viterbi trellis). [Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.]
Forward Algorithm. [Pseudocode figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.]
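A minimal sketch of the forward recursion in Python (with NumPy). The initial, transition, and emission probabilities below are illustrative stand-ins for the ice-cream HMM, not necessarily the textbook's exact table, and the function name forward is just a local helper.

```python
# A sketch of the forward algorithm for a 2-state "ice cream" HMM.
# NOTE: pi, A, B below are illustrative values, not necessarily the textbook's table.
import numpy as np

pi = np.array([0.8, 0.2])              # initial probabilities: [Hot, Cold] (assumed)
A = np.array([[0.7, 0.3],              # A[i, j] = P(q_{t+1} = j | q_t = i)
              [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4],         # B[j, k] = P(o_t = k+1 ice creams | q_t = j)
              [0.5, 0.4, 0.1]])

def forward(obs, pi, A, B):
    """Fill the trellis alpha[t, j] = P(o_1 .. o_t, q_t = j | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization
    for t in range(1, T):
        # recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

obs = [2, 0, 2]                                    # observations "3 1 3" as 0-based indices
alpha = forward(obs, pi, A, B)
print("P(O | lambda) =", alpha[-1].sum())          # termination: sum over final states
```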
Problem 2: The Decoding Problem
The decoding problem: given an HMM λ = (A, B) and an observation sequence O = o1, o2, … oT, find the most probable sequence of states Q = q1, q2, … qT.
Viterbi revisited: matching tags to words; matching spectral features (voice) to phonemes; matching Hot/Cold states to the number of ice creams eaten.
Viterbi, the man
Andrew James Viterbi, Ph.D. (born March 9, 1935, in Bergamo, Italy) is an Italian-American electrical engineer and businessman (BS and MS from MIT, PhD from the University of Southern California). In 1967 he invented the Viterbi algorithm, which he used for decoding convolutionally encoded data; it is now used widely in cellular phones for error-correcting codes, as well as in speech recognition and DNA analysis. He also helped develop Code Division Multiple Access (CDMA) wireless technology, and USC's Viterbi School of Engineering is named for him after his $52 million donation to the school.
Viterbi Backtrace Pointers. [Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.]
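For comparison with the forward pass, here is a sketch of Viterbi decoding with backpointers, using the same assumed ice-cream parameters as above; a max over predecessors replaces the sum of the forward recursion.

```python
# A sketch of Viterbi decoding with backpointers; parameters are assumed, for illustration.
import numpy as np

pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])

def viterbi(obs, pi, A, B):
    """Return (best state sequence, its probability) for the observations."""
    T, N = len(obs), len(pi)
    v = np.zeros((T, N))                   # v[t, j] = best path probability ending in j at t
    bp = np.zeros((T, N), dtype=int)       # backpointers
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A     # scores[i, j] = v_{t-1}(i) * a_ij
        bp[t] = scores.argmax(axis=0)      # best predecessor i for each state j
        v[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(v[-1].argmax())]           # best final state
    for t in range(T - 1, 0, -1):          # follow backpointers from T-1 down to 1
        path.append(int(bp[t, path[-1]]))
    return list(reversed(path)), v[-1].max()

states, prob = viterbi([2, 0, 2], pi, A, B)
print(states, prob)                        # decoded state indices (0 = Hot, 1 = Cold)
```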
Problem 3: Learning the Model (HMM)
Learning the model (HMM): given an observation sequence and the set of possible states in the HMM, learn the parameters A and B.
Markov chain case: if the states are directly observed (an ordinary Markov chain rather than a hidden one), the maximum likelihood estimate of each transition probability is just a normalized count:
aij = C(i → j) / Σq C(i → q).
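A small sketch of that counting estimate, assuming a fully observed state sequence; the sequence below is made up for illustration.

```python
# Sketch: maximum likelihood estimate of transition probabilities from an
# observed state sequence (Markov chain case). The sequence is made up.
from collections import Counter, defaultdict

seq = ["hot", "hot", "cold", "cold", "hot", "cold", "cold", "cold", "hot", "hot"]

bigrams = Counter(zip(seq, seq[1:]))       # C(i -> j)
totals = Counter(seq[:-1])                 # sum_q C(i -> q)
a = defaultdict(dict)
for (i, j), c in bigrams.items():
    a[i][j] = c / totals[i]                # a_ij = C(i -> j) / sum_q C(i -> q)

print(dict(a))
```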
Forward-Backward Algorithm
The Forward-Backward algorithm is sometimes called the Baum-Welch algorithm. Two ideas:
Iteratively estimate counts/probabilities.
Compute a probability using the forward algorithm, then distribute that probability mass over the different paths that contributed to it.
Forward-Backward Algorithm
Forward probabilities: αt(j) = P(o1 … ot, qt = j | λ).
Backward probabilities: βt(i) = P(ot+1 … oT | qt = i, λ).
Recall Bayes' rule: P(X | Y) = P(Y | X) P(X) / P(Y).
Figure 6.13: Computation of the backward probabilities βt(i). [From Jurafsky & Martin, Speech and Language Processing, 2nd ed.]
Algorithm for Backward Probabilities βt(i)
Initialization: βT(i) = 1, for 1 ≤ i ≤ N.
Recursion: βt(i) = Σj aij bj(ot+1) βt+1(j), for 1 ≤ i ≤ N and 1 ≤ t < T.
Termination: P(O | λ) = Σj πj bj(o1) β1(j), where πj are the initial state probabilities.
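A matching sketch of the backward pass, again with the assumed ice-cream parameters; the last line checks that the backward termination gives the same P(O | λ) as the forward pass.

```python
# A sketch of the backward pass; same assumed ice-cream parameters as before.
import numpy as np

pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])

def backward(obs, A, B):
    """Fill beta[t, i] = P(o_{t+1} .. o_T | q_t = i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                  # initialization
    for t in range(T - 2, -1, -1):
        # recursion: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

obs = [2, 0, 2]
beta = backward(obs, A, B)
# termination: P(O | lambda) = sum_j pi_j * b_j(o_1) * beta_1(j)
print("P(O | lambda) =", (pi * B[:, obs[0]] * beta[0]).sum())
```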
Figure 6.14: Computation of the joint probability of being in state i at time t and state j at time t+1. [From Jurafsky & Martin, Speech and Language Processing, 2nd ed.]
Estimating aij
âij = (expected number of transitions from state i to state j) / (expected number of transitions from state i).
To get these expected counts, define ξt(i, j) = P(qt = i, qt+1 = j | O, λ), the probability of being in state i at time t and state j at time t+1, given the observations and the model. (ξ is the Greek letter xi: http://www.greek-language.com/alphabet/)
Calculating ξt(i, j) from not-quite ξt(i, j)
The quantity we can read directly off the trellises is
not-quite-ξt(i, j) = P(qt = i, qt+1 = j, O | λ) = αt(i) aij bj(ot+1) βt+1(j).
Using P(X | Y, Z) = P(X, Y | Z) / P(Y | Z), we divide not-quite-ξt(i, j) by P(O | λ).
Now P(O | λ) is
the forward probability of the entire sequence, Σj αT(j), or equivalently
the backward probability of the entire sequence, Σj πj bj(o1) β1(j)
(more generally, P(O | λ) = Σj αt(j) βt(j) for any t).
Thus yielding
ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / P(O | λ).
Finally, for aij:
âij = Σt=1..T-1 ξt(i, j) / Σt=1..T-1 Σk ξt(i, k),
the expected number of transitions from i to j divided by the expected number of transitions out of i.
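A sketch of the ξ computation and the resulting re-estimate of A, with the same assumed parameters as the earlier snippets; the forward and backward passes are inlined so the snippet stands alone.

```python
# Sketch: compute xi_t(i, j) from alpha, beta, A, B and re-estimate A.
# Parameters and observations are illustrative, not from the textbook.
import numpy as np

pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
obs = [2, 0, 2, 1, 0]

T, N = len(obs), len(pi)
alpha = np.zeros((T, N))
beta = np.ones((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
p_obs = alpha[-1].sum()                                   # P(O | lambda)

# xi[t, i, j] = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O | lambda)
xi = np.zeros((T - 1, N, N))
for t in range(T - 1):
    xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / p_obs

# a_hat_ij = expected transitions i -> j / expected transitions out of i
a_hat = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
print(a_hat)                                              # each row sums to 1
```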
Probability γt(j) of being in state j at time t
We also need to estimate γt(j) = P(qt = j | O, λ).
Again using Bayes and moving O into the joint probability, we obtain
γt(j) = P(qt = j, O | λ) / P(O | λ) = αt(j) βt(j) / P(O | λ).
Figure 6.15: Computation of γt(j), the probability of being in state j at time t. [From Jurafsky & Martin, Speech and Language Processing, 2nd ed.]
Finally (Eq 6.43), the re-estimate of the observation likelihoods:
b̂j(vk) = (Σ over t such that ot = vk of γt(j)) / (Σt=1..T γt(j)),
the expected number of times we are in state j and observe symbol vk, divided by the expected number of times we are in state j.
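And a sketch of γ and the re-estimate of B, under the same illustrative parameters as the earlier snippets.

```python
# Sketch: compute gamma_t(j) and re-estimate B; illustrative parameters, inlined passes.
import numpy as np

pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
obs = [2, 0, 2, 1, 0]

T, N, V = len(obs), len(pi), B.shape[1]
alpha = np.zeros((T, N))
beta = np.ones((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

# gamma[t, j] = alpha_t(j) * beta_t(j) / P(O | lambda)
gamma = alpha * beta / alpha[-1].sum()

# b_hat[j, k] = expected count of (state j, symbol v_k) / expected count of state j
b_hat = np.zeros((N, V))
for k in range(V):
    at_k = np.array([o == k for o in obs])                # timesteps where o_t = v_k
    b_hat[:, k] = gamma[at_k].sum(axis=0) / gamma.sum(axis=0)
print(b_hat)                                              # each row sums to 1
```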
Forward-Backward Algorithm Description
0. Initialize A = (aij) and B = (bj(vk)).
Loop:
an Estimation step, in which ξt(i, j) and γt(j) are estimated from the current A and B, and
a Maximization step, in which new estimates of A and B are computed from those expected counts,
until convergence of ??? (in practice, until P(O | λ) stops improving).
Return A, B.
Figure 6.16: The Forward-Backward algorithm. [From Jurafsky & Martin, Speech and Language Processing, 2nd ed.]
Information Theory Introduction (4.10)
Entropy is a measure of the information in a message. Define a random variable X over whatever we are predicting (words, characters, …), with distribution p(x); then the entropy of X is
H(X) = −Σx p(x) log2 p(x).
Horse Race Example of Entropy
8 horses: H1, H2, …, H8. We want to send messages saying which horse to bet on, using as few bits as possible. We could use the fixed-length bit sequences H1 = 000, H2 = 001, …, H8 = 111: three bits per bet.
But now suppose our bets over the day are modeled by a random variable B with the following distribution:
Horse 1: probability 1/2, log2(1/2) = -1
Horse 2: probability 1/4, log2(1/4) = -2
Horse 3: probability 1/8, log2(1/8) = -3
Horse 4: probability 1/16, log2(1/16) = -4
Horse 5: probability 1/64, log2(1/64) = -6
Horses 6, 7, 8: probability 1/64 each, log2(1/64) = -6
Horse Race Example of Entropy (cont.)
Then the entropy is
H(B) = −[1/2 log2(1/2) + 1/4 log2(1/4) + 1/8 log2(1/8) + 1/16 log2(1/16) + 4 × (1/64) log2(1/64)] = 2 bits.
A variable-length prefix code matched to the distribution:
Horse 1: 1/2, encoding 0
Horse 2: 1/4, encoding 10
Horse 3: 1/8, encoding 110
Horse 4: 1/16, encoding 1110
Horse 5: 1/64, encoding 11110
Horse 6: 1/64, encoding 111110
Horse 7: 1/64, encoding 1111110
Horse 8: 1/64, encoding 11111110
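A quick numeric check of these entropy values in Python; the probabilities are the ones from the slide, and the last line previews the equal-probability case on the next slide.

```python
# Quick check of the horse-race entropy values (probabilities from the slide).
import math

def entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

bets = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(entropy(bets))          # 2.0 bits per bet
print(entropy([1/8] * 8))     # 3.0 bits when all horses are equally likely (next slide)
```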
What if the horses are equally likely?
Then each horse has probability 1/8, and H(B) = −Σ (1/8) log2(1/8) = 3 bits, so the fixed three-bit encoding is the best we can do.
Entropy of Sequences
For a sequence of words W = w1 … wn from a language L, we look at the per-word entropy rate (1/n) H(w1, …, wn); the entropy of the language is the limit of this rate as n → ∞.
Jason (ice cream) Eisner
http://www.cs.jhu.edu/~jason/papers/# (14 papers in 2012)
http://videolectures.net/hltss2010_eisner_plm/video/2/
Speed of implementations of Baum-Welch?
The Baum-Welch algorithm for hidden Markov models: a speed comparison between Octave / Python / R / Scilab / MATLAB / C / C++.
http://perso.telecom-paristech.fr/~garivier/code/index.php