795M Winter 2000 (08/12/2015): Hidden Markov Models. Chris Brew, The Ohio State University.

1 Hidden Markov Models. Chris Brew, The Ohio State University.

2 Introduction. Dynamic programming. Markov models as effective tools for language modelling. How to solve three classic problems: calculate the probability of a corpus given a model; guess the sequence of states passed through; adapt the model to the corpus. Generalization of word-confetti.

3 Edit Distance. You have a text editor that can insert a character, delete a character, or substitute one character for another. The edit distance between two sequences $x_1 \ldots x_n$ and $y_1 \ldots y_m$ is the smallest number of these elementary operations that will transform $x_1 \ldots x_n$ into $y_1 \ldots y_m$.

4 Algorithm for edit distance. Fill up a rectangular array of intermediate results, starting at the bottom left and working up to the top right. This is time efficient because it avoids backtracking. It can be made space efficient because not all the entries in the array are relevant to the best path.
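
One common way to save space, which may or may not be what the slide has in mind, is to keep only the previous row of the array, since each cell depends only on its left, lower and lower-left neighbours. A minimal sketch, assuming unit costs by default; the name sdist_two_rows and its parameters are illustrative and do not come from the slides:

```python
def sdist_two_rows(string1, string2, del_cost=1.0, ins_cost=1.0, subst_cost=1.0):
    """Edit distance that stores two rows instead of the whole array."""
    # Border row: turning the empty prefix of string1 into each prefix of string2.
    prev = [j * ins_cost for j in range(len(string2) + 1)]
    for c1 in string1:
        # Border cell of the new row: one more deletion from string1.
        curr = [prev[0] + del_cost]
        for j, c2 in enumerate(string2):
            subst = 0.0 if c1 == c2 else subst_cost
            curr.append(min(prev[j] + subst,         # match or substitute
                            curr[j] + ins_cost,      # insert c2
                            prev[j + 1] + del_cost)) # delete c1
        prev = curr
    return prev[-1]


distance = sdist_two_rows("SHOPS", "CHEAP")
```

The memory used is proportional to the length of string2 rather than to the product of the two lengths, at the price of no longer being able to trace back the best path.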

5 Initialization
    S
    P
    O
    H
    S
       0
          C  H  E  A  P

6 Initialization
    def sdist(string1, string2):
        delCost = 1.0
        insCost = 1.0
        substCost = 1.0
        m = len(string1)
        n = len(string2)
        d[0][0] = 0.0
        …
This code is not a complete program; it needs imports and so on.

7 The borders
    S  5
    P  4
    O  3
    H  2
    S  1
       0  1  2  3  4  5
          C  H  E  A  P

8 The borders
We fill in the first row, adding entries with indices (1,0) through (m,0):
    …
    for i in range(m):
        d[i+1,0] = d[i,0] + delCost
    …
We fill in the first column, adding entries with indices (0,1) through (0,n):
    …
    for j in range(n):
        d[0,j+1] = d[0,j] + insCost
    …

9 Recursion
    S  5  5  4  4  4  4
    P  4  4  3  3  3  3
    O  3  3  2  2  3  4
    H  2  2  1  2  3  4
    S  1  1  2  3  4  5
       0  1  2  3  4  5
          C  H  E  A  P

10 Recursion
    for i in range(m):
        for j in range(n):
            if string1[i] == string2[j]:
                subst = 0
            else:
                subst = substCost
            d[i+1,j+1] = min(d[i,j] + subst,
                             d[i+1,j] + insCost,
                             d[i,j+1] + delCost)
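
The fragments from the Initialization, borders and Recursion slides can be assembled into one runnable function. This is a sketch rather than the author's program: the slides never show how the table d is created, so a dict keyed by (i, j) is assumed here, and the snake_case parameter names are illustrative.

```python
def sdist(string1, string2, del_cost=1.0, ins_cost=1.0, subst_cost=1.0):
    """Edit distance between string1 and string2 by dynamic programming."""
    m, n = len(string1), len(string2)
    d = {(0, 0): 0.0}  # assumed table layout: dict keyed by (i, j)
    # Borders: delete every character of string1, or insert every one of string2.
    for i in range(m):
        d[i + 1, 0] = d[i, 0] + del_cost
    for j in range(n):
        d[0, j + 1] = d[0, j] + ins_cost
    # Recursion: each cell is the cheapest of substitute, insert, delete.
    for i in range(m):
        for j in range(n):
            subst = 0.0 if string1[i] == string2[j] else subst_cost
            d[i + 1, j + 1] = min(d[i, j] + subst,
                                  d[i + 1, j] + ins_cost,
                                  d[i, j + 1] + del_cost)
    return d[m, n]


distance = sdist("SHOPS", "CHEAP")
```

The strings SHOPS and CHEAP are a reading of the axis labels on the grid slides; the value in the top right cell of the Recursion grid is the distance the function returns for that pair.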

11 Wrapup. At the end, the total distance is in the cell at (m,n). This version charges nothing for matching a letter against itself, and one penalty point for matching it against anything else. It would be easy to vary this if we thought, for example, that confusing some letter pairs was less bad than confusing others.
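
Varying the substitution charge could look like the following sketch, in which the fixed cost is replaced by a function of the two letters. The cheap letter pairs and the 0.5 charge are invented purely for illustration, and weighted_sdist and soft_cost are hypothetical names.

```python
def weighted_sdist(string1, string2, subst_cost, del_cost=1.0, ins_cost=1.0):
    """Edit distance where subst_cost(a, b) decides the price of each pairing."""
    m, n = len(string1), len(string2)
    d = {(0, 0): 0.0}
    for i in range(m):
        d[i + 1, 0] = d[i, 0] + del_cost
    for j in range(n):
        d[0, j + 1] = d[0, j] + ins_cost
    for i in range(m):
        for j in range(n):
            d[i + 1, j + 1] = min(d[i, j] + subst_cost(string1[i], string2[j]),
                                  d[i + 1, j] + ins_cost,
                                  d[i, j + 1] + del_cost)
    return d[m, n]


# Hypothetical letter pairs that we choose to treat as easily confused.
CHEAP_PAIRS = {frozenset("sz"), frozenset("ck")}


def soft_cost(a, b):
    """0 for identical letters, 0.5 for a cheap pair, 1 otherwise."""
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in CHEAP_PAIRS else 1.0


half_point = weighted_sdist("size", "sise", soft_cost)
full_point = weighted_sdist("cat", "bat", soft_cost)
```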

12 Dynamic Programming. Like many other algorithms, DP is efficient because it systematically records intermediate results. There are exponentially many paths through the matrix, but only a polynomial amount of effort is needed to fill it out. If you're clever, there is no need to fill all the cells.

13 Topics. The noisy channel model. Markov models. Hidden Markov models. What is part-of-speech tagging? Three problems solved: probability estimation (problem 1), Viterbi algorithm (problem 2), Forward-Backward algorithm (problem 3).

14 The noisy channel model. [Figure: noisy channel diagram with the labels "Words + Parts-of-speech", "Noisy Channel", "Words only" and "Incomplete information".]

15 Markov Models. States and transitions (with probabilities). [Figure: state diagram with the states "the", "dogs" and "bit".]

16 Matrix form of Markov models
Transition matrix (A):
            The    Dogs   Bit
    The     0.01   0.46   0.53
    Dogs    0.05   0.15   0.80
    Bit     0.77   0.32   0.01
Start with initial probabilities p(0):
    The     0.7
    Dogs    0.2
    Bit     0.1

17 Using Markov models. Choose the initial state from p(0); say it was "the". Choose a transition from the "the" row of A. If we choose "dogs", that has probability 0.46. But we can get to "dogs" from other places too: p(1)["dogs"] = p(0)["the"]*0.46 + p(0)["dogs"]*0.15 + p(0)["bit"]*0.32. After n time steps, $p(n) = (A^T)^n\, p(0)$; the transpose appears because the rows of A are indexed by the state being left.

18 Using Markov models II. If we want the whole of p(1), we can get it efficiently by multiplying the transposed matrix $A^T$ by the vector p(0). We can do the same to get p(2) from p(1). After n time steps, $p(n) = (A^T)^n\, p(0)$. Best path and string probability are also not hard.
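
The update from p(0) to p(1), its repetition for n steps, and the probability of a string can be sketched in a few lines. The numbers are copied from the matrix slide as transcribed; note that the Bit row there sums to 1.10, so the resulting vectors are not exactly normalized. The function names are illustrative.

```python
STATES = ["the", "dogs", "bit"]
A = {
    "the":  {"the": 0.01, "dogs": 0.46, "bit": 0.53},
    "dogs": {"the": 0.05, "dogs": 0.15, "bit": 0.80},
    "bit":  {"the": 0.77, "dogs": 0.32, "bit": 0.01},  # sums to 1.10 as transcribed
}
P0 = {"the": 0.7, "dogs": 0.2, "bit": 0.1}


def step(p):
    """One time step: p'[j] = sum over i of p[i] * A[i][j]."""
    return {j: sum(p[i] * A[i][j] for i in STATES) for j in STATES}


def distribution_after(n, p=P0):
    """Apply step n times, which is the same as multiplying by A n times."""
    for _ in range(n):
        p = step(p)
    return p


def string_probability(words):
    """Probability of one particular state (word) sequence."""
    prob = P0[words[0]]
    for prev, nxt in zip(words, words[1:]):
        prob *= A[prev][nxt]
    return prob


p1 = step(P0)                                       # p1["dogs"] is about 0.384
p_string = string_probability(["the", "dogs", "bit"])  # 0.7 * 0.46 * 0.80
```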

19 Hidden Markov Models. Now you don't know the state sequence. [Figure: state diagram with the hidden states det, n and vb; the emitted words shown are these, a, the, dogs, bit, cats and chased.]

20 Matrix form of HMMs
Transition matrix (A):
            DET    N      VB
    DET     0.01   0.89   0.10
    N       0.30   0.20   0.50
    VB      0.67   0.23   0.10
Emission matrix (B):
            Dogs   Bit    The    …
    DET     0.0    0.0    1.0
    N       0.2    0.1    0.0
    VB      0.1    0.6    0.0
Start with initial probabilities p(0):
    DET     0.7
    N       0.2
    VB      0.1

21 Using Hidden Markov models. Generation: draw from p(0); choose a transition from the relevant row of A; choose an emission from the relevant row of B. After n time steps, $p(n) = (A^T)^n\, p(0)$. Easy, because the state stays known. If one wanted, one could generate all possible strings, annotating each with its probability.
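
Generation can be sketched as follows, using the transition matrix and initial probabilities from the matrix slide. The emission matrix on that slide has elided columns ("…"), so this sketch lumps the missing probability mass of each row into a made-up "<other>" token; that token, the seed handling and the function names are assumptions.

```python
import random

STATES = ["DET", "N", "VB"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {"DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
     "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
     "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10}}
# "<other>" is a placeholder for the columns the slide leaves out.
B = {"DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0, "<other>": 0.0},
     "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0, "<other>": 0.7},
     "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0, "<other>": 0.3}}


def draw(dist, rng):
    """Sample one key of dist with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(n, seed=0):
    """Return n (state, word) pairs; the state is known at every step."""
    rng = random.Random(seed)
    state = draw(P0, rng)               # draw from p(0)
    sequence = []
    for _ in range(n):
        sequence.append((state, draw(B[state], rng)))  # emission from row of B
        state = draw(A[state], rng)                    # transition from row of A
    return sequence


sample = generate(6, seed=3)
```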

22 State sequences. All you see is the output: "The bit dogs …". But you can't tell which of the state sequences DET N VB …, DET VB N …, DET N N …, DET VB VB … produced it. Each of these has a different probability. You don't know which state you are in.

23 The three problems. Probability estimation: given a sequence of observations O and a model M, find P(O|M). Best path estimation: given a sequence of observations O and a model M, find a sequence of states I which maximizes P(O,I|M).
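
For a very short sentence both problems can be solved by brute force, which makes a useful reference for the efficient algorithms. The sketch below enumerates every state sequence for "the dogs bit" under the model from the matrix slide; only the three emission columns shown on the slide are used, and the function names are illustrative.

```python
from itertools import product

STATES = ["DET", "N", "VB"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {"DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
     "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
     "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10}}
B = {"DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
     "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
     "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0}}


def joint(obs, path):
    """P(O, I | M) for one state sequence I (state-emission HMM)."""
    prob = P0[path[0]]
    for t, (state, word) in enumerate(zip(path, obs)):
        prob *= B[state][word]              # emit the word from this state
        if t + 1 < len(path):
            prob *= A[state][path[t + 1]]   # move on to the next state
    return prob


def brute_force(obs):
    """Return P(O | M), the best joint probability, and the best path."""
    scored = [(joint(obs, path), path)
              for path in product(STATES, repeat=len(obs))]
    total = sum(p for p, _ in scored)       # problem 1: marginalize out I
    best_prob, best_path = max(scored)      # problem 2: maximize over I
    return total, best_prob, best_path


total, best_prob, best_path = brute_force(["the", "dogs", "bit"])
```

With three states and three words there are only 27 paths, but the count grows as the number of states raised to the sentence length.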

24 The third problem. Training: adjust the model parameters so that P(O|M) is as large as possible for the given O. This is a hard problem because there are so many adjustable parameters which could vary.

25 Probability estimation. Easy in principle: form the joint probability of state sequences and observations, P(O,I|M), and marginalize out I. But this involves a sum over exponentially many paths. The efficient algorithm uses the idea that the probability of a state at time t+1 is easy to get from knowledge of all states at time t.

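26 Probability estimation: getting the next time step. [Figure: trellis. States DET, VB and N at time t carry forward probabilities $\alpha_{DET}(t)$, $\alpha_{VB}(t)$ and $\alpha_{N}(t)$; each emits "bit" with probability $b_{det}(bit)$, $b_{vb}(bit)$, $b_{n}(bit)$ and moves to state i with probability $a_{det,i}$, $a_{vb,i}$, $a_{n,i}$, giving $\alpha_i(t+1)$; state i then emits "dogs" with probability $b_i(dogs)$.]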
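
The step pictured on this slide can be sketched as a forward pass. The convention assumed here is the state-emission one described later in the deck: the forward value for state i at time t covers the words before t but not the word emitted at t, so the last emission is multiplied in at the end. Model numbers are from the matrix slide; function names are illustrative.

```python
STATES = ["DET", "N", "VB"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {"DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
     "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
     "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10}}
B = {"DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
     "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
     "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0}}


def forward(obs):
    """alpha[t][i]: probability of the words before t and of being in i at t."""
    alpha = [dict(P0)]
    for word in obs[:-1]:
        prev = alpha[-1]
        # Sum over every state j we could have come from: emit there, then move.
        alpha.append({i: sum(prev[j] * B[j][word] * A[j][i] for j in STATES)
                      for i in STATES})
    return alpha


def sequence_probability(obs):
    """P(O | M): finish by emitting the last word from the last state."""
    alpha = forward(obs)
    return sum(alpha[-1][i] * B[i][obs[-1]] for i in STATES)


prob = sequence_probability(["the", "dogs", "bit"])
```

Each time step costs a number of multiplications proportional to the square of the number of states, so the whole pass is linear in the sentence length.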

27 Event 1. Arrive in state j at time step t (big event).

28 Event 2. Generate word k from state j.

29 Event 3. Transition from state j to state i.

30 Event 4. Continue from j to the end of the string (big event).

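31 Best path. [Figure: the same trellis as for probability estimation: states DET, VB and N at time t emit "bit" ($b_{det}(bit)$, $b_{vb}(bit)$, $b_{n}(bit)$) and move to state i ($a_{det,i}$, $a_{vb,i}$, $a_{n,i}$), which emits "dogs" ($b_i(dogs)$).] Maximize, not sum.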
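
Replacing the sum of the forward step by a maximum, and remembering which predecessor won, gives the Viterbi algorithm. A minimal sketch under the same state-emission convention and the same model numbers as the matrix slide; the back-pointer bookkeeping and names are illustrative choices.

```python
STATES = ["DET", "N", "VB"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {"DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
     "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
     "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10}}
B = {"DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
     "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
     "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0}}


def viterbi(obs):
    """Return the most probable state sequence and its joint probability."""
    delta = dict(P0)   # best score of any path arriving in each state
    back = []          # back[t][i]: best predecessor of state i at time t+1
    for word in obs[:-1]:
        pointers, nxt = {}, {}
        for i in STATES:
            best = max(STATES, key=lambda j: delta[j] * B[j][word] * A[j][i])
            pointers[i] = best
            nxt[i] = delta[best] * B[best][word] * A[best][i]
        back.append(pointers)
        delta = nxt
    # Finish by emitting the last word, then follow the pointers backwards.
    last = max(STATES, key=lambda i: delta[i] * B[i][obs[-1]])
    prob = delta[last] * B[last][obs[-1]]
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, prob


path, prob = viterbi(["the", "dogs", "bit"])
```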

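32 Backward probabilities. Counterpart of forward probs. [Figure: the forward trellis extended one step to the right. States DET, VB and N at time t emit "bit" ($b_{det}(bit)$, $b_{vb}(bit)$, $b_{n}(bit)$) and move to state i ($a_{det,i}$, $a_{vb,i}$, $a_{n,i}$); state i at time t+1 emits "dogs" ($b_i(dogs)$) and moves to DET, VB and N at time t+2 ($a_{i,det}$, $a_{i,vb}$, $a_{i,n}$), which emit "cat" ($b_{det}(cat)$, $b_{vb}(cat)$, $b_{n}(cat)$). The nodes carry forward probabilities on the left and backward probabilities on the right.]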

33 Forward and Backward. Note that our notation is not quite the same as that in M&S p334. Ours is a state-emission HMM, theirs is an arc-emission HMM. See the note on p338 for more details. We assume that $\alpha_i(t)$ includes the probability of generating words up to but not including the one in the state just reached. $\beta_i(t)$ therefore starts by generating this word.

34 State probabilities. $\alpha_i(t)\,\beta_i(t)$ is p(in state i at time t, all words). The sum over all states k of $\alpha_k(t)\,\beta_k(t)$ is p(sentence). p(in state i at time t) is $\alpha_i(t)\,\beta_i(t) / \sum_k \alpha_k(t)\,\beta_k(t)$. p(in state i) is the average over all time ticks of p(in state i at time t).
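
These state probabilities can be sketched by combining a forward and a backward pass. The code assumes the convention stated in the deck: the forward value stops just before the word at time t and the backward value starts by generating it, so their product covers every word exactly once. Model numbers come from the matrix slide and the function names are illustrative.

```python
STATES = ["DET", "N", "VB"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {"DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
     "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
     "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10}}
B = {"DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
     "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
     "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0}}


def forward(obs):
    """alpha[t][i]: words before t generated, now in state i."""
    alpha = [dict(P0)]
    for word in obs[:-1]:
        prev = alpha[-1]
        alpha.append({i: sum(prev[j] * B[j][word] * A[j][i] for j in STATES)
                      for i in STATES})
    return alpha


def backward(obs):
    """beta[t][i]: words from t to the end generated, starting in state i."""
    beta = [{i: B[i][obs[-1]] for i in STATES}]
    for word in reversed(obs[:-1]):
        nxt = beta[0]
        beta.insert(0, {i: B[i][word] * sum(A[i][j] * nxt[j] for j in STATES)
                        for i in STATES})
    return beta


def state_posteriors(obs):
    """gamma[t][i] = p(in state i at time t | all words)."""
    alpha, beta = forward(obs), backward(obs)
    gamma = []
    for t in range(len(obs)):
        sentence = sum(alpha[t][k] * beta[t][k] for k in STATES)  # p(sentence)
        gamma.append({i: alpha[t][i] * beta[t][i] / sentence for i in STATES})
    return gamma


gamma = state_posteriors(["the", "dogs", "bit"])
# p(in state i): average of the per-tick posteriors.
occupancy = {i: sum(g[i] for g in gamma) / len(gamma) for i in STATES}
```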

35 Training. Uses forward and backward probabilities. Starts from an initial guess. Improves the initial guess using data. Stops at a (locally) best model. Specialization of the EM algorithm.

36 Factorizing the path. Consider p(in state i at time t and in state j at time t+1 | Model, Observations). We could see this as the product of two things: get to i while generating words up to t; get from t to the end of the corpus while generating the remaining words.

37 Factorizing the path 2. Consider p(in state i at time t and in state j at time t+1 | Model, Observations). We could see this as four things: get to i while generating words up to t; generate a word from i; make the correct transition from i to j; get from t+1 to the end of the corpus while generating the remaining words. The merit of this is that we can use the current model for the inside bit.

38 Factorizing the path 3. Consider p(in state i at time t and in state j at time t+1 | Model, Observations). We could see this as four things: get to i while generating words up to t; repeat ad lib: generate a word from the current state, and make a transition that generates the word that we saw; get from t+k to the end of the corpus while generating the remaining words. If we wanted, the model for the inside bit could be a bit more complicated than we assumed above. Research topic.

39 Expected transition counts 2. We have these things already: forward prob $\alpha_i(t)$; transition prob $a_{ij}$; emission prob $b_j(\text{word})$; backward prob $\beta_j(t+1)$.

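40 Expected transition counts 3. [Figure: the forward-backward trellis with one transition picked out: state i, with forward probability $\alpha_i(t)$, emits "bit" ($b_i(bit)$) and takes the transition $a_{i,j}$; state j emits "dogs" ($b_j(dogs)$). The remaining arcs are labelled $a_{det,i}$, $a_{vb,i}$, $a_{n,i}$ on the way in and $a_{i,det}$, $a_{i,vb}$, $a_{i,n}$ on the way out, with emissions $b_{det}(cat)$, $b_{vb}(cat)$, $b_{n}(cat)$ at the final step.]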

41 Estimated transition probabilities. $\alpha_i(t)\,a_{ij}\,b_j(\text{word})\,\beta_j(t+1)$ is count(in state i at time t, in state j at time t+1, all words). p(in state i at time t, in state j at time t+1) is $\alpha_i(t)\,a_{ij}\,b_j(\text{word})\,\beta_j(t+1) / \sum_k \alpha_i(t)\,a_{ik}\,b_k(\text{word})\,\beta_k(t+1)$. Sum over all time ticks to get expected transition counts. Derive new probabilities from these counts.
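
A sketch of the expected transition counts follows. Two assumptions are made and should be checked against the slide's intent. First, because the backward value is taken to start by generating its own word, the emission factor placed between the forward value and the transition is that of the state being left at time t. Second, the counts are normalized by p(sentence), which is the usual Baum-Welch choice. Model numbers are from the matrix slide; names are illustrative.

```python
STATES = ["DET", "N", "VB"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {"DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
     "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
     "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10}}
B = {"DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
     "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
     "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0}}


def forward(obs):
    alpha = [dict(P0)]
    for word in obs[:-1]:
        prev = alpha[-1]
        alpha.append({i: sum(prev[j] * B[j][word] * A[j][i] for j in STATES)
                      for i in STATES})
    return alpha


def backward(obs):
    beta = [{i: B[i][obs[-1]] for i in STATES}]
    for word in reversed(obs[:-1]):
        nxt = beta[0]
        beta.insert(0, {i: B[i][word] * sum(A[i][j] * nxt[j] for j in STATES)
                        for i in STATES})
    return beta


def expected_transitions(obs):
    """counts[i][j]: expected number of i -> j transitions given the words."""
    alpha, beta = forward(obs), backward(obs)
    sentence = sum(alpha[0][k] * beta[0][k] for k in STATES)
    counts = {i: {j: 0.0 for j in STATES} for i in STATES}
    for t in range(len(obs) - 1):           # sum over all time ticks
        for i in STATES:
            for j in STATES:
                counts[i][j] += (alpha[t][i] * B[i][obs[t]] * A[i][j]
                                 * beta[t + 1][j]) / sentence
    return counts


def reestimate_transitions(counts):
    """New transition rows: each row of counts scaled to sum to one."""
    new_A = {}
    for i in STATES:
        total = sum(counts[i].values())
        # Keep the old row if state i was never expected to be left.
        new_A[i] = ({j: counts[i][j] / total for j in STATES}
                    if total > 0 else dict(A[i]))
    return new_A


counts = expected_transitions(["the", "dogs", "bit"])
new_A = reestimate_transitions(counts)
```

For a sentence of three words the counts add up to two, one expected transition per gap between words.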

42 Estimated emission probabilities. Calculate the expected number of times in state j at places where a particular word happened. Divide by the expected number of times in state j over all time ticks; the result is the new emission probability.
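
The emission update can be sketched from the per-tick state posteriors. As in the other sketches, the forward value is assumed to stop just before the current word and the backward value to start with it; only the emission columns shown on the matrix slide are re-estimated, and the function names are illustrative.

```python
STATES = ["DET", "N", "VB"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {"DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
     "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
     "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10}}
B = {"DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
     "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
     "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0}}


def state_posteriors(obs):
    """gamma[t][j] = p(in state j at time t | all words)."""
    alpha = [dict(P0)]
    for word in obs[:-1]:
        prev = alpha[-1]
        alpha.append({i: sum(prev[j] * B[j][word] * A[j][i] for j in STATES)
                      for i in STATES})
    beta = [{i: B[i][obs[-1]] for i in STATES}]
    for word in reversed(obs[:-1]):
        nxt = beta[0]
        beta.insert(0, {i: B[i][word] * sum(A[i][j] * nxt[j] for j in STATES)
                        for i in STATES})
    sentence = sum(alpha[0][k] * beta[0][k] for k in STATES)
    return [{j: alpha[t][j] * beta[t][j] / sentence for j in STATES}
            for t in range(len(obs))]


def reestimate_emissions(obs):
    """New b_j(word): expected time in j at that word / expected time in j."""
    gamma = state_posteriors(obs)
    new_B = {}
    for j in STATES:
        in_state = sum(g[j] for g in gamma)
        if in_state > 0:
            new_B[j] = {w: sum(g[j] for g, o in zip(gamma, obs) if o == w)
                           / in_state
                        for w in B[j]}
        else:
            new_B[j] = dict(B[j])   # state never visited: keep the old row
    return new_B


new_B = reestimate_emissions(["the", "dogs", "bit"])
```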

43 Re-estimation (for everybody). Recall that we guessed the initial parameters. Replace the initial parameters with new ones derived as above. These will be better than the originals because: the data ensures that we only consider paths which can generate the words that we did see in the corpus; paths which fit the data well get taken frequently, bad paths infrequently.
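
Putting the pieces together, a few rounds of re-estimation on a one-sentence corpus might look like the sketch below. It starts from the numbers on the matrix slide as the initial guess, normalizes expected counts by p(sentence) in the usual Baum-Welch manner, and records p(sentence) before each update so that the improvement can be inspected. The single training sentence, the number of rounds and all function names are assumptions made for illustration.

```python
def forward_backward(obs, p0, A, B):
    """Forward and backward tables under the state-emission convention."""
    S = list(p0)
    alpha = [dict(p0)]
    for w in obs[:-1]:
        alpha.append({i: sum(alpha[-1][j] * B[j][w] * A[j][i] for j in S)
                      for i in S})
    beta = [{i: B[i][obs[-1]] for i in S}]
    for w in reversed(obs[:-1]):
        beta.insert(0, {i: B[i][w] * sum(A[i][j] * beta[0][j] for j in S)
                        for i in S})
    return alpha, beta


def em_step(obs, p0, A, B):
    """One re-estimation round; also returns p(sentence) under the old model."""
    S = list(p0)
    alpha, beta = forward_backward(obs, p0, A, B)
    sentence = sum(alpha[0][i] * beta[0][i] for i in S)
    gamma = [{i: alpha[t][i] * beta[t][i] / sentence for i in S}
             for t in range(len(obs))]
    new_A, new_B = {}, {}
    for i in S:
        # Expected transition counts out of state i, summed over time ticks.
        xi = {j: sum(alpha[t][i] * B[i][obs[t]] * A[i][j] * beta[t + 1][j]
                     for t in range(len(obs) - 1)) / sentence
              for j in S}
        out = sum(xi.values())
        new_A[i] = {j: xi[j] / out for j in S} if out > 0 else dict(A[i])
        # Expected time spent in state i, overall and per word.
        occ = sum(g[i] for g in gamma)
        new_B[i] = ({w: sum(g[i] for g, o in zip(gamma, obs) if o == w) / occ
                     for w in B[i]} if occ > 0 else dict(B[i]))
    return gamma[0], new_A, new_B, sentence


p0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {"DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
     "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
     "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10}}
B = {"DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
     "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
     "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0}}

likelihoods = []
for _ in range(5):
    p0, A, B, sentence = em_step(["the", "dogs", "bit"], p0, A, B)
    likelihoods.append(sentence)
```

With a single short sentence the model quickly overfits, so this only illustrates the mechanics; a real corpus would pool the expected counts over many sentences before dividing.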

44 Re-estimation (the details). Baum et al. show that this will always converge to a local maximum. It is an instance of Dempster, Laird and Rubin's EM algorithm. For a modern review of EM see: ftp://ftp.cs.utoronto.ca/pub/radford/emk.pdf

45 Summary. Three problems solved. Simple model based on finite-state technology. Sensitive to a limited range of context information. Re-estimation as an instance of the EM algorithm.

46 Where to get more information. Maryland implementation in C. My implementation in Python. Matlab code by Zoubin Ghahramani. Manning and Schütze ch 9. Charniak chapters 3 and 4. http://www.georgetown.edu/cball/ling361/tagging_overview.html

