1 CSCE 771 Natural Language Processing
Lecture 8, Training HMMs: Learning the Model and the Forward-Backward Algorithm. Topics: Overview. Readings: Chapter 6. February 11, 2013

2 Overview
Last Time: Hidden Markov Models revisited (Chapter 6); the three problems: likelihood, decoding, and training (learning the model); NLTK book, Chapter 5 (tagging); videos on NLP on YouTube, Coursera, etc.
Today: Computational complexity of Problems 1 and 2: the straightforward sum over all possible state sequences is O(N^T); dynamic programming (the forward algorithm) brings this down to O(T·N^2). Problem 3, learning the model: the backward computation and the Forward-Backward algorithm.

3 Ferguson’s 3 Fundamental Problems
Computing Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
The Decoding Problem: Given an HMM λ = (A, B) and an observation sequence O = o1, o2, …, oT, find the most probable sequence of states Q = q1, q2, …, qT.
Learning the Model (HMM): Given an observation sequence and the set of possible states in the HMM, learn the parameters A and B.

4 Problem 1: Computing Likelihood
Computing Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Example: P(O | Q) = P(3 1 3 | H H C) = P(3 | H) · P(1 | H) · P(3 | C)
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

5 Likelihood Computation
P(O, Q) = P(O | Q) P(Q) = Π_{i=1..T} P(oi | qi) · Π_{i=1..T} P(qi | qi-1)
So P(3 1 3, hot hot cold) = P(hot | start) P(hot | hot) P(cold | hot) · P(3 | hot) P(1 | hot) P(3 | cold)
In general, P(O) = Σ_Q P(O, Q) = Σ_Q P(O | Q) P(Q), which sums over all possible sequences of states Q.
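To make the brute-force computation concrete, here is a minimal Python sketch that enumerates every state sequence and sums P(O, Q); the transition and emission numbers are illustrative placeholders, not necessarily the values used in the textbook figures.

from itertools import product

states = ["hot", "cold"]
# Illustrative parameters (not necessarily the textbook's values)
pi = {"hot": 0.8, "cold": 0.2}                       # initial probabilities
A = {"hot": {"hot": 0.7, "cold": 0.3},               # transition probabilities a_ij
     "cold": {"hot": 0.4, "cold": 0.6}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4},                # emission probabilities b_j(o)
     "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

O = [3, 1, 3]                                        # observed numbers of ice creams eaten

# Sum P(O, Q) = P(O | Q) P(Q) over all N^T state sequences Q
likelihood = 0.0
for Q in product(states, repeat=len(O)):
    p = pi[Q[0]] * B[Q[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
    likelihood += p
print(likelihood)   # P(O | lambda), computed in exponential time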

6 Likelihood Computation Performance
Exponential: the direct computation sums over all N^T sequences of states. Dynamic programming to the rescue: the forward algorithm runs in O(N^2·T). Compute the trellis entry αt(j) for each state j and each time t from:
αt-1(i), the previous forward probabilities;
aij, the state transition probabilities (A); and
bj(ot), the state observation likelihood of observation ot given the current state j (B).
A sketch of this recursion follows below.
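A sketch of the forward recursion αt(j) = Σi αt-1(i) aij bj(ot), reusing the illustrative states, pi, A, B defined in the brute-force sketch above; both computations should give the same P(O | λ).

def forward(O, states, pi, A, B):
    """Forward algorithm: returns the trellis of alpha_t(j) and P(O | lambda)."""
    alpha = [{j: pi[j] * B[j][O[0]] for j in states}]          # initialization
    for t in range(1, len(O)):
        alpha.append({j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][O[t]]
                      for j in states})                        # recursion, O(N^2 * T) overall
    return alpha, sum(alpha[-1][j] for j in states)            # termination

alpha, likelihood = forward([3, 1, 3], states, pi, A, B)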

7 Forward Trellis – (Structure like Viterbi)
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

8 Forward Algorithm
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

9 Problem 2: The Decoding Problem
The Decoding Problem: Given an HMM λ = (A, B) and an observation sequence O = o1, o2, …, oT, find the most probable sequence of states Q = q1, q2, …, qT. Viterbi revisited: matching tags to words; matching spectral features (voice) to phonemes; matching Hot/Cold to the number of ice creams eaten. A decoding sketch follows below.
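A sketch of Viterbi decoding with backtrace pointers, again assuming the illustrative states, pi, A, B from the forward sketch; it returns the most probable state (weather) sequence for the ice-cream observations.

def viterbi(O, states, pi, A, B):
    """Viterbi algorithm: most probable state sequence Q for observations O."""
    v = [{j: pi[j] * B[j][O[0]] for j in states}]      # best path probability ending in state j
    backptr = [{}]
    for t in range(1, len(O)):
        v.append({})
        backptr.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            backptr[t][j] = best_i                     # remember where the best path came from
            v[t][j] = v[t - 1][best_i] * A[best_i][j] * B[j][O[t]]
    # Trace back from the best final state
    q = max(states, key=lambda j: v[-1][j])
    path = [q]
    for t in range(len(O) - 1, 0, -1):
        q = backptr[t][q]
        path.append(q)
    return list(reversed(path)), max(v[-1].values())

path, best_prob = viterbi([3, 1, 3], states, pi, A, B)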

10 Viterbi – the man Andrew James Viterbi, Ph.D. (born March 9, 1935 in Bergamo, Italy) is an Italian-American electrical engineer and businessman (BS and MS from MIT, PhD from the University of Southern California). In 1967 he invented the Viterbi algorithm, which he used for decoding convolutionally encoded data; it is now used widely in cellular phones for error-correcting codes, as well as for speech recognition and DNA analysis. He helped develop Code Division Multiple Access (CDMA) wireless technology, and USC's Viterbi School of Engineering is named for him after his $52 million donation to the school.

11 Viterbi Backtrace Pointers
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

12 Problem 3: Learning the Model (HMM)
Learning the Model (HMM): Given an observation sequence and the set of possible states in the HMM, learn the parameters A and B. We warm up with the simpler Markov chain case, where the state sequence itself is observed.

13 Markov Chain case
For a (visible) Markov chain the states are observed directly, so the maximum likelihood estimate of the transition probabilities is just normalized counts: âij = Count(i → j) / Σ_k Count(i → k). A counting sketch follows below.
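A minimal counting sketch of this maximum likelihood estimate; the training sequences below are made-up examples, not real data.

from collections import Counter, defaultdict

# Hypothetical training data: fully observed state sequences
sequences = [["hot", "hot", "cold", "cold", "hot"],
             ["cold", "cold", "hot", "hot", "hot"]]

transitions = defaultdict(Counter)
for seq in sequences:
    for prev, curr in zip(seq, seq[1:]):
        transitions[prev][curr] += 1       # Count(i -> j)

# Maximum likelihood estimate: a_ij = Count(i -> j) / sum_k Count(i -> k)
A_hat = {i: {j: c / sum(counts.values()) for j, c in counts.items()}
         for i, counts in transitions.items()}
print(A_hat)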

14 Forward-Backward Algorithm
The Forward-Backward algorithm is sometimes called the Baum-Welch algorithm. Two ideas: (1) iteratively re-estimate the counts/probabilities; (2) compute the probability of the observations using the forward algorithm, then distribute that probability mass over the different paths that contributed to it.

15 Forward-Backward Algorithm
The Forward-Backward algorithm is sometimes called the Baum-Welch algorithm.
Forward probabilities: αt(j) = P(o1, o2, …, ot, qt = j | λ)
Backward probabilities: βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ)
Recall Bayes: P(X | Y) = P(Y | X) P(X) / P(Y)

16 Figure 6.13 Computation of backward Probabilities βt(i)
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

17 Algorithm for backward Probabilities βt(i)
Initialization: βT(i) = 1, for 1 ≤ i ≤ N.
Recursion: βt(i) = Σ_{j=1..N} aij bj(ot+1) βt+1(j), for t = T−1 down to 1.
Termination: P(O | λ) = Σ_{j=1..N} πj bj(o1) β1(j), where πj is the initial probability of state j.
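A sketch of the backward recursion, mirroring the forward sketch above and assuming the same illustrative states, pi, A, B.

def backward(O, states, A, B):
    """Backward algorithm: returns the trellis of beta_t(i)."""
    T = len(O)
    beta = [{i: 1.0 for i in states}]                          # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta.insert(0, {i: sum(A[i][j] * B[j][O[t + 1]] * beta[0][j] for j in states)
                        for i in states})                      # recursion over t = T-1 .. 1
    return beta

beta = backward([3, 1, 3], states, A, B)
# Sanity check: sum_j pi[j] * B[j][O[0]] * beta[0][j] should equal P(O | lambda) from forward()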

18 Fig 6.14 Joint probability of qt = i and qt+1 = j
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

19 Estimating aij
âij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

20 Calculating ξt(i, j) from not-quite ξt(i, j)
The joint probability not-quite-ξt(i, j) = P(qt = i, qt+1 = j, O | λ) = αt(i) aij bj(ot+1) βt+1(j). Using P(X | Y, Z) = P(X, Y | Z) / P(Y | Z), we divide not-quite ξt(i, j) by P(O | λ) to obtain ξt(i, j) = P(qt = i, qt+1 = j | O, λ).

21 Now P(O | λ) is
the forward probability of the entire sequence, Σj αT(j), or equivalently the backward probability of the entire sequence, Σj πj bj(o1) β1(j); more generally, P(O | λ) = Σj αt(j) βt(j) for any t. Thus yielding
ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / P(O | λ)

22 Finally for aij
âij = Σ_{t=1..T−1} ξt(i, j) / Σ_{t=1..T−1} Σ_{k=1..N} ξt(i, k)

23 Probability γt(j) of being at state j at time t
So we need to be able to estimate γt(j) = P(qt = j | O, λ). Again using Bayes and moving O into the joint probability, we obtain
γt(j) = P(qt = j, O | λ) / P(O | λ) = αt(j) βt(j) / P(O | λ)
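Putting the last few slides together, a sketch of these E-step quantities in matrix form; it assumes alpha and beta are available as T × N numpy arrays (for example, the trellises from the forward/backward sketches converted to arrays) and that B_obs[t, j] holds bj(ot).

import numpy as np

def e_step(alpha, beta, A, B_obs):
    """Compute xi[t, i, j] = P(qt = i, qt+1 = j | O, lambda) and gamma[t, j] = P(qt = j | O, lambda).

    alpha, beta : (T, N) forward and backward trellises
    A           : (N, N) transition probabilities a_ij
    B_obs       : (T, N) emission likelihoods, B_obs[t, j] = b_j(o_t)
    """
    T, N = alpha.shape
    P_O = alpha[-1].sum()                                   # P(O | lambda)
    xi = np.empty((T - 1, N, N))
    for t in range(T - 1):
        # not-quite-xi_t(i, j) = alpha_t(i) * a_ij * b_j(o_t+1) * beta_t+1(j)
        xi[t] = alpha[t][:, None] * A * B_obs[t + 1][None, :] * beta[t + 1][None, :]
    xi /= P_O                                               # divide by P(O | lambda)
    gamma = alpha * beta / P_O                              # gamma_t(j) = alpha_t(j) beta_t(j) / P(O | lambda)
    return xi, gamma

# Re-estimated transitions: a_ij = sum_t xi_t(i, j) / sum_t sum_k xi_t(i, k)
# A_hat = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]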

24 Fig 6.15 Computation of γt(j) probability of being at state j at time t
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

26 Finally (Eq 6.43)
b̂j(vk) = Σ_{t=1..T, such that ot = vk} γt(j) / Σ_{t=1..T} γt(j)
i.e., the expected number of times we are in state j observing symbol vk, divided by the expected number of times we are in state j.

27 Forward-Backward Algorithm Description
Then the forward-backward algorithm has:
0. An initialization of A = (aij) and B = (bj(ot)).
1. A loop with an Estimation step, in which ξt(i, j) and γt(j) are estimated from the current A and B, and a Maximization step, in which new estimates of A and B are computed from those expected counts, repeated until convergence (for example, until log P(O | λ) stops improving).
2. Return A, B.
A compact sketch of the whole loop follows below.
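A compact, self-contained sketch of the whole loop in matrix form with numpy; the random initialization, the single-sequence setting, and the log-likelihood convergence test are assumptions chosen for illustration, not the only possible choices.

import numpy as np

def baum_welch(O, N, M, n_iter=100, tol=1e-6, seed=0):
    """Learn HMM parameters A (N x N), B (N x M), pi (N) from a sequence O of symbol indices 0..M-1."""
    O = np.asarray(O)
    T = len(O)
    rng = np.random.default_rng(seed)
    A = rng.random((N, N))
    A /= A.sum(axis=1, keepdims=True)           # random row-stochastic transition matrix
    B = rng.random((N, M))
    B /= B.sum(axis=1, keepdims=True)           # random row-stochastic emission matrix
    pi = np.full(N, 1.0 / N)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: forward and backward trellises under the current model
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        alpha[0] = pi * B[:, O[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
        P_O = alpha[-1].sum()                   # P(O | lambda)
        gamma = alpha * beta / P_O              # gamma[t, j]
        xi = np.array([alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :]
                       for t in range(T - 1)]) / P_O    # xi[t, i, j]
        # M-step: re-estimate pi, A, B from the expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[O == k].sum(axis=0) / gamma.sum(axis=0)
        ll = np.log(P_O)
        if ll - prev_ll < tol:                  # assumed convergence test: change in log-likelihood
            break
        prev_ll = ll
    return A, B, pi

# Hypothetical usage: ice-cream counts 1/2/3 mapped to symbol indices 0/1/2, two hidden states
A_hat, B_hat, pi_hat = baum_welch([2, 0, 2, 1, 0, 0, 2], N=2, M=3)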

28 Fig 6.16 Forward-Backward Algorithm
[Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed., © 2009 Pearson Education]

29 Information Theory Introduction (4.10)
Entropy is a measure of the information in a message. Define a random variable X over whatever we are predicting (words, characters, …); then the entropy of X is given by
H(X) = −Σ_x p(x) log2 p(x)

30 Horse Race Example of Entropy
8 horses: H1, H2, …, H8. We want to send messages saying which horse to bet on in each race, using as few bits as possible. We could use the bit sequences H1 = 000, H2 = 001, …, H8 = 111: three bits per bet.

31 But now given a random variable B
Assume our bets over the day are modelled by a random variable B, following the distribution:
Horse    Probability that we bet on it    log2(prob)
Horse 1  1/2                              log2(1/2)  = -1
Horse 2  1/4                              log2(1/4)  = -2
Horse 3  1/8                              log2(1/8)  = -3
Horse 4  1/16                             log2(1/16) = -4
Horse 5  1/64                             log2(1/64) = -6
Horse 6  1/64                             log2(1/64) = -6
Horse 7  1/64                             log2(1/64) = -6
Horse 8  1/64                             log2(1/64) = -6

32 Horse Race Example of Entropy(cont.)
Then the entropy is H(B) = 1/2·1 + 1/4·2 + 1/8·3 + 1/16·4 + 4·(1/64)·6 = 2 bits, so on average a variable-length code does much better than 3 bits per bet, for example:
Horse    Probability    Encoding bit string
Horse 1  1/2            0
Horse 2  1/4            10
Horse 3  1/8            110
Horse 4  1/16           1110
Horse 5  1/64           11110
Horse 6  1/64           111110
Horse 7  1/64           1111110
Horse 8  1/64           1111111
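A quick check of this arithmetic (the probabilities are the betting distribution from the table above):

import math

p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]    # betting distribution over the 8 horses
H = -sum(prob * math.log2(prob) for prob in p)
print(H)   # 2.0 bits, versus 3 bits for the fixed-length encoding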

33 What if horses are equally likely?
If each horse has probability 1/8, then H(B) = −Σ_{i=1..8} (1/8) log2(1/8) = 3 bits, and the fixed 3-bit encoding is already optimal.

34 Entropy of Sequences
For a sequence of words W = w1, w2, …, wn from a language L, we consider the per-word entropy (entropy rate) (1/n) H(w1, w2, …, wn), and define the entropy of the language as the limit of this per-word entropy as n → ∞.

35 Jason (ice cream) Eisner, whose ice-cream/weather example the HMM slides use
14 papers in

36 Speed of implementations of BW?
The Baum-Welch algorithm for hidden Markov Models: speed comparison between octave / python / R / scilab / matlab / C / C++

