1 CSCE 771 Natural Language Processing
Lecture 8, Training HMMs: Learning the Model and the Forward-Backward Algorithm. Topics: Overview. Readings: Chapter 6. February 11, 2013

2 Overview
Last Time: Hidden Markov Models revisited (Chapter 6); the three problems: likelihood, decoding, and training (learning the model); NLTK book, Chapter 5 (tagging); videos on NLP on YouTube, Coursera, etc.
Today: Computational complexity of Problems 1 and 2: the straightforward sum over all possible state sequences is O(N^T); dynamic programming (the forward algorithm) brings this down to O(T·N^2). Problem 3, learning the model: the backward computation and the Forward-Backward algorithm.

3 Ferguson’s 3 Fundamental Problems
Computing Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
The Decoding Problem: Given an HMM λ = (A, B) and an observation sequence O = o1, o2, …, oT, find the most probable sequence of states Q = q1, q2, …, qT.
Learning the Model (HMM): Given an observation sequence and the set of possible states in the HMM, learn the parameters A and B.

4 Problem 1: Computing Likelihood
Computing Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Example: P(O | Q) = P(3 1 3 | H H C) = P(3 | H) · P(1 | H) · P(3 | C)
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

5 Likelihood Computation
P(O, Q) = P(O | Q) P(Q) = Π_{i=1..T} P(oi | qi) · Π_{i=1..T} P(qi | qi-1)
So P(3 1 3, hot hot cold) = P(hot | start) P(hot | hot) P(cold | hot) · P(3 | hot) P(1 | hot) P(3 | cold)
In general, P(O) = Σ_Q P(O, Q) = Σ_Q P(O | Q) P(Q), which sums over all possible sequences of states Q.
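To make the brute-force computation concrete, here is a minimal Python sketch that enumerates every state sequence and sums P(O, Q); the transition and emission numbers are illustrative placeholders, not necessarily the values used in the textbook figures.

from itertools import product

states = ["hot", "cold"]
# Illustrative parameters (not necessarily the textbook's values)
pi = {"hot": 0.8, "cold": 0.2}                       # initial probabilities
A = {"hot": {"hot": 0.7, "cold": 0.3},               # transition probabilities a_ij
     "cold": {"hot": 0.4, "cold": 0.6}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4},                # emission probabilities b_j(o)
     "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

O = [3, 1, 3]                                        # observed numbers of ice creams eaten

# Sum P(O, Q) = P(O | Q) P(Q) over all N^T state sequences Q
likelihood = 0.0
for Q in product(states, repeat=len(O)):
    p = pi[Q[0]] * B[Q[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
    likelihood += p
print(likelihood)   # P(O | lambda), computed in exponential time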

6 Likelihood Computation Performance
Exponential: the direct computation sums over all N^T sequences of states. Dynamic programming to the rescue: the forward algorithm runs in O(N^2·T). Compute the trellis entry αt(j) for each state j and each time t from:
αt-1(i), the previous forward probabilities;
aij, the state transition probabilities (A); and
bj(ot), the state observation likelihood of observation ot given the current state j (B).
A sketch of this recursion follows below.
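A sketch of the forward recursion αt(j) = Σi αt-1(i) aij bj(ot), reusing the illustrative states, pi, A, B defined in the brute-force sketch above; both computations should give the same P(O | λ).

def forward(O, states, pi, A, B):
    """Forward algorithm: returns the trellis of alpha_t(j) and P(O | lambda)."""
    alpha = [{j: pi[j] * B[j][O[0]] for j in states}]          # initialization
    for t in range(1, len(O)):
        alpha.append({j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][O[t]]
                      for j in states})                        # recursion, O(N^2 * T) overall
    return alpha, sum(alpha[-1][j] for j in states)            # termination

alpha, likelihood = forward([3, 1, 3], states, pi, A, B)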

7 Forward Trellis – (Structure like Viterbi)
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

8 Forward Algorithm
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

9 Problem 2: The Decoding Problem
The Decoding Problem: Given an HMM λ = (A, B) and an observation sequence O = o1, o2, …, oT, find the most probable sequence of states Q = q1, q2, …, qT. Viterbi revisited: matching tags to words; matching spectral features (voice) to phonemes; matching Hot/Cold to the number of ice creams eaten. A decoding sketch follows below.
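A sketch of Viterbi decoding with backtrace pointers, again assuming the illustrative states, pi, A, B from the forward sketch; it returns the most probable state (weather) sequence for the ice-cream observations.

def viterbi(O, states, pi, A, B):
    """Viterbi algorithm: most probable state sequence Q for observations O."""
    v = [{j: pi[j] * B[j][O[0]] for j in states}]      # best path probability ending in state j
    backptr = [{}]
    for t in range(1, len(O)):
        v.append({})
        backptr.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            backptr[t][j] = best_i                     # remember where the best path came from
            v[t][j] = v[t - 1][best_i] * A[best_i][j] * B[j][O[t]]
    # Trace back from the best final state
    q = max(states, key=lambda j: v[-1][j])
    path = [q]
    for t in range(len(O) - 1, 0, -1):
        q = backptr[t][q]
        path.append(q)
    return list(reversed(path)), max(v[-1].values())

path, best_prob = viterbi([3, 1, 3], states, pi, A, B)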

10 Viterbi – the man Andrew James Viterbi, Ph.D. (born March 9, 1935 in Bergamo, Italy) is an Italian-American electrical engineer and businessman (BS and MS from MIT, PhD from the University of Southern California). In 1967 he invented the Viterbi algorithm, which he used for decoding convolutionally encoded data; it is now used widely in cellular phones for error-correcting codes, as well as for speech recognition and DNA analysis. He helped develop Code Division Multiple Access (CDMA) wireless technology, and USC's Viterbi School of Engineering is named for him after his $52 million donation to the school.

11 Viterbi Backtrace Pointers
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

12 Problem 3: Learning the Model (HMM)
Learning the Model (HMM): Given an observation sequence and the set of possible states in the HMM, learn the parameters A and B. We warm up with the simpler Markov chain case, where the state sequence itself is observed.

13 Markov Chain case
For a (visible) Markov chain the states are observed directly, so the maximum likelihood estimate of the transition probabilities is just normalized counts: âij = Count(i → j) / Σ_k Count(i → k). A counting sketch follows below.
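A minimal counting sketch of this maximum likelihood estimate; the training sequences below are made-up examples, not real data.

from collections import Counter, defaultdict

# Hypothetical training data: fully observed state sequences
sequences = [["hot", "hot", "cold", "cold", "hot"],
             ["cold", "cold", "hot", "hot", "hot"]]

transitions = defaultdict(Counter)
for seq in sequences:
    for prev, curr in zip(seq, seq[1:]):
        transitions[prev][curr] += 1       # Count(i -> j)

# Maximum likelihood estimate: a_ij = Count(i -> j) / sum_k Count(i -> k)
A_hat = {i: {j: c / sum(counts.values()) for j, c in counts.items()}
         for i, counts in transitions.items()}
print(A_hat)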

14 Forward-Backward Algorithm
The Forward-Backward algorithm is sometimes called the Baum-Welch algorithm. Two ideas: (1) iteratively re-estimate the counts/probabilities; (2) compute the probability of the observations using the forward algorithm, then distribute that probability mass over the different paths that contributed to it.

15 Forward-Backward Algorithm
The Forward-Backward algorithm is sometimes called the Baum-Welch algorithm.
Forward probabilities: αt(j) = P(o1, o2, …, ot, qt = j | λ)
Backward probabilities: βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ)
Recall Bayes: P(X | Y) = P(Y | X) P(X) / P(Y)

16 Figure 6.13 Computation of backward Probabilities βt(i)
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

17 Algorithm for backward Probabilities βt(i)
Initialization: βT(i) = 1, for 1 ≤ i ≤ N.
Recursion: βt(i) = Σ_{j=1..N} aij bj(ot+1) βt+1(j), for t = T−1 down to 1.
Termination: P(O | λ) = Σ_{j=1..N} πj bj(o1) β1(j), where πj is the initial probability of state j.
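A sketch of the backward recursion, mirroring the forward sketch above and assuming the same illustrative states, pi, A, B.

def backward(O, states, A, B):
    """Backward algorithm: returns the trellis of beta_t(i)."""
    T = len(O)
    beta = [{i: 1.0 for i in states}]                          # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta.insert(0, {i: sum(A[i][j] * B[j][O[t + 1]] * beta[0][j] for j in states)
                        for i in states})                      # recursion over t = T-1 .. 1
    return beta

beta = backward([3, 1, 3], states, A, B)
# Sanity check: sum_j pi[j] * B[j][O[0]] * beta[0][j] should equal P(O | lambda) from forward()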

18 Fig 6.14 Joint probability of qt = i and qt+1 = j
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

19 Estimating aij
âij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

20 Calculating ξt(i, j) from not-quite ξt(i, j)
The joint probability not-quite-ξt(i, j) = P(qt = i, qt+1 = j, O | λ) = αt(i) aij bj(ot+1) βt+1(j). Using P(X | Y, Z) = P(X, Y | Z) / P(Y | Z), we divide not-quite ξt(i, j) by P(O | λ) to obtain ξt(i, j) = P(qt = i, qt+1 = j | O, λ).

21 Now P(O | λ) is
the forward probability of the entire sequence, Σj αT(j), or equivalently the backward probability of the entire sequence, Σj πj bj(o1) β1(j); more generally, P(O | λ) = Σj αt(j) βt(j) for any t. Thus yielding
ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / P(O | λ)

22 Finally for aij
âij = Σ_{t=1..T−1} ξt(i, j) / Σ_{t=1..T−1} Σ_{k=1..N} ξt(i, k)

23 Probability γt(j) of being at state j at time t
So we need to be able to estimate γt(j) = P(qt = j | O, λ). Again using Bayes and moving O into the joint probability, we obtain
γt(j) = P(qt = j, O | λ) / P(O | λ) = αt(j) βt(j) / P(O | λ)
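Putting the last few slides together, a sketch of these E-step quantities in matrix form; it assumes alpha and beta are available as T × N numpy arrays (for example, the trellises from the forward/backward sketches converted to arrays) and that B_obs[t, j] holds bj(ot).

import numpy as np

def e_step(alpha, beta, A, B_obs):
    """Compute xi[t, i, j] = P(qt = i, qt+1 = j | O, lambda) and gamma[t, j] = P(qt = j | O, lambda).

    alpha, beta : (T, N) forward and backward trellises
    A           : (N, N) transition probabilities a_ij
    B_obs       : (T, N) emission likelihoods, B_obs[t, j] = b_j(o_t)
    """
    T, N = alpha.shape
    P_O = alpha[-1].sum()                                   # P(O | lambda)
    xi = np.empty((T - 1, N, N))
    for t in range(T - 1):
        # not-quite-xi_t(i, j) = alpha_t(i) * a_ij * b_j(o_t+1) * beta_t+1(j)
        xi[t] = alpha[t][:, None] * A * B_obs[t + 1][None, :] * beta[t + 1][None, :]
    xi /= P_O                                               # divide by P(O | lambda)
    gamma = alpha * beta / P_O                              # gamma_t(j) = alpha_t(j) beta_t(j) / P(O | lambda)
    return xi, gamma

# Re-estimated transitions: a_ij = sum_t xi_t(i, j) / sum_t sum_k xi_t(i, k)
# A_hat = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]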

24 Fig 6.15 Computation of γt(j) probability of being at state j at time t
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

26 Finally (Eq 6.43)
b̂j(vk) = Σ_{t=1..T, such that ot = vk} γt(j) / Σ_{t=1..T} γt(j)
i.e., the expected number of times we are in state j observing symbol vk, divided by the expected number of times we are in state j.

27 Forward-Backward Algorithm Description
Then the forward-backward algorithm has:
0. An initialization of A = (aij) and B = (bj(ot)).
1. A loop with an Estimation step, in which ξt(i, j) and γt(j) are estimated from the current A and B, and a Maximization step, in which new estimates of A and B are computed from those expected counts, repeated until convergence (for example, until log P(O | λ) stops improving).
2. Return A, B.
A compact sketch of the whole loop follows below.
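A compact, self-contained sketch of the whole loop in matrix form with numpy; the random initialization, the single-sequence setting, and the log-likelihood convergence test are assumptions chosen for illustration, not the only possible choices.

import numpy as np

def baum_welch(O, N, M, n_iter=100, tol=1e-6, seed=0):
    """Learn HMM parameters A (N x N), B (N x M), pi (N) from a sequence O of symbol indices 0..M-1."""
    O = np.asarray(O)
    T = len(O)
    rng = np.random.default_rng(seed)
    A = rng.random((N, N))
    A /= A.sum(axis=1, keepdims=True)           # random row-stochastic transition matrix
    B = rng.random((N, M))
    B /= B.sum(axis=1, keepdims=True)           # random row-stochastic emission matrix
    pi = np.full(N, 1.0 / N)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: forward and backward trellises under the current model
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        alpha[0] = pi * B[:, O[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
        P_O = alpha[-1].sum()                   # P(O | lambda)
        gamma = alpha * beta / P_O              # gamma[t, j]
        xi = np.array([alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :]
                       for t in range(T - 1)]) / P_O    # xi[t, i, j]
        # M-step: re-estimate pi, A, B from the expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[O == k].sum(axis=0) / gamma.sum(axis=0)
        ll = np.log(P_O)
        if ll - prev_ll < tol:                  # assumed convergence test: change in log-likelihood
            break
        prev_ll = ll
    return A, B, pi

# Hypothetical usage: ice-cream counts 1/2/3 mapped to symbol indices 0/1/2, two hidden states
A_hat, B_hat, pi_hat = baum_welch([2, 0, 2, 1, 0, 0, 2], N=2, M=3)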

28 Fig 6.16 Forward-Backward Algorithm
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

29 Information Theory Introduction (4.10)
Entropy is a measure of the information in a message. Define a random variable X over whatever we are predicting (words, characters, …); then the entropy of X is given by
H(X) = −Σ_x p(x) log2 p(x)

30 Horse Race Example of Entropy
8 horses: H1, H2, …, H8. We want to send messages saying which horse to bet on in each race, using as few bits as possible. We could use the bit sequences H1 = 000, H2 = 001, …, H8 = 111: three bits per bet.

31 But now given a random variable B
Assume our bets over the day are modelled by a random variable B, following the distribution:
Horse    Probability that we bet on it    log2(prob)
Horse 1  1/2                              log2(1/2)  = -1
Horse 2  1/4                              log2(1/4)  = -2
Horse 3  1/8                              log2(1/8)  = -3
Horse 4  1/16                             log2(1/16) = -4
Horse 5  1/64                             log2(1/64) = -6
Horse 6  1/64                             log2(1/64) = -6
Horse 7  1/64                             log2(1/64) = -6
Horse 8  1/64                             log2(1/64) = -6

32 Horse Race Example of Entropy(cont.)
Then the entropy is H(B) = 1/2·1 + 1/4·2 + 1/8·3 + 1/16·4 + 4·(1/64)·6 = 2 bits, so on average a variable-length code does much better than 3 bits per bet, for example:
Horse    Probability    Encoding bit string
Horse 1  1/2            0
Horse 2  1/4            10
Horse 3  1/8            110
Horse 4  1/16           1110
Horse 5  1/64           11110
Horse 6  1/64           111110
Horse 7  1/64           1111110
Horse 8  1/64           1111111
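A quick check of this arithmetic (the probabilities are the betting distribution from the table above):

import math

p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]    # betting distribution over the 8 horses
H = -sum(prob * math.log2(prob) for prob in p)
print(H)   # 2.0 bits, versus 3 bits for the fixed-length encoding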

33 What if horses are equally likely?
If each horse has probability 1/8, then H(B) = −Σ_{i=1..8} (1/8) log2(1/8) = 3 bits, and the fixed 3-bit encoding is already optimal.

34 Entropy of Sequences
For a sequence of words W = w1, w2, …, wn from a language L, we consider the per-word entropy (entropy rate) (1/n) H(w1, w2, …, wn), and define the entropy of the language as the limit of this per-word entropy as n → ∞.

35 Jason (ice cream) Eisner, whose ice-cream/weather example the HMM slides use
14 papers in

36 Speed of implementations of BW?
The Baum-Welch algorithm for hidden Markov Models: speed comparison between octave / python / R / scilab / matlab / C / C++

