Hidden Markov Models
Chris Brew, The Ohio State University
795M, Winter 2000

Introduction
Dynamic programming
Markov models as effective tools for language modelling
How to solve three classic problems:
  Calculate the probability of a corpus given a model
  Guess the sequence of states passed through
  Adapt the model to the corpus
Generalization of word-confetti

Edit Distance
You have a text and three elementary operations that can be applied to it:
  Insert a character
  Delete a character
  Substitute one character for another
The edit distance between two sequences x_1 ... x_n and y_1 ... y_m is the smallest number of elementary operations that will transform x_1 ... x_n into y_1 ... y_m.

Algorithm for edit distance
Fill up a rectangular array of intermediate results, starting at the bottom left and working up to the top right.
This is time efficient, because it avoids backtracking.
It can be made space efficient, because not all the entries in the array are relevant to the best path.

Initialization
(Diagram: the empty distance matrix for comparing CHEAP with PHOSS; the characters of CHEAP label one axis, P, H, O, S, S label the other, and the corner cell is initialized to 0.)

Initialization

    import numpy as np

    def sdist(string1, string2):
        delCost = 1.0
        insCost = 1.0
        substCost = 1.0
        m = len(string1)
        n = len(string2)
        d = np.zeros((m + 1, n + 1))   # assumed here: a numpy array holds the table d
        d[0, 0] = 0.0
        ...

This code is not a complete program; the remaining pieces appear on the following slides.

The borders
(Diagram: the same CHEAP vs. PHOSS matrix with its first row and first column filled in.)

The borders
We fill in the first row, adding entries with indices (1,0) through (m,0):

    for i in range(m):
        d[i + 1, 0] = d[i, 0] + delCost

We fill in the first column, adding entries with indices (0,1) through (0,n):

    for j in range(n):
        d[0, j + 1] = d[0, j] + insCost

Recursion
(Diagram: the CHEAP vs. PHOSS matrix again, now with the interior cells being filled in.)

Recursion

    for i in range(m):
        for j in range(n):
            if string1[i] == string2[j]:
                subst = 0
            else:
                subst = substCost
            d[i + 1, j + 1] = min(d[i, j] + subst,
                                  d[i + 1, j] + insCost,
                                  d[i, j + 1] + delCost)

Wrapup
At the end, the total distance is in the cell at (m,n).
This version says that there is no charge for matching a letter against itself, but that it costs one penalty point to match against anything else.
It would be easy to vary this if we thought, for example, that it was less bad to confuse some letter pairs than to confuse others.
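
Pulling the fragments from the last few slides together gives one runnable routine. This assembly is not on the slides themselves; the numpy table and the default unit costs are assumptions based on the snippets above.

    import numpy as np

    def sdist(string1, string2, delCost=1.0, insCost=1.0, substCost=1.0):
        """Edit distance between string1 and string2."""
        m, n = len(string1), len(string2)
        d = np.zeros((m + 1, n + 1))              # table of intermediate results
        for i in range(m):                        # first border: deletions only
            d[i + 1, 0] = d[i, 0] + delCost
        for j in range(n):                        # second border: insertions only
            d[0, j + 1] = d[0, j] + insCost
        for i in range(m):                        # interior cells
            for j in range(n):
                subst = 0.0 if string1[i] == string2[j] else substCost
                d[i + 1, j + 1] = min(d[i, j] + subst,
                                      d[i + 1, j] + insCost,
                                      d[i, j + 1] + delCost)
        return d[m, n]                            # total distance sits at (m, n)

    print(sdist("CHEAP", "PHOSS"))                # 4.0: four substitutions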

Dynamic Programming
Like many other algorithms, DP is efficient because it systematically records intermediate results.
There are actually exponentially many paths through the matrix, but only a polynomial amount of effort is needed to fill it out.
If you're clever, there is no need to fill all the cells.

Topics
The noisy channel model
Markov models
Hidden Markov models
What is part-of-speech tagging?
Three problems solved:
  Probability estimation (problem 1)
  Viterbi algorithm (problem 2)
  Forward-Backward algorithm (problem 3)

The noisy channel model
(Diagram: the complete signal, words + parts of speech, passes through a noisy channel and comes out as incomplete information, words only.)

Markov Models
States and transitions (with probabilities).
(Diagram: a three-state model with states "the", "dogs" and "bit" and probability-labelled transitions between them.)

Matrix form of Markov models
Transition matrix A: rows and columns are indexed by the states The, Dogs, Bit.
Start with initial probabilities p(0):
  The   0.7
  Dogs  0.2
  Bit   0.1

Using Markov models
Choose an initial state from p(0). Say it was "the".
Choose a transition from the "the" row of A. If we choose "dogs", that has probability 0.46.
But we can get to "dogs" from other places too:
  p(1)["dogs"] = p(0)["the"]*0.46 + p(0)["dogs"]*0.15 + p(0)["bit"]*0.32
After n time steps, p(n) = A^n p(0).

Using Markov models II
If we want the whole of p(1) we can do it efficiently by multiplying the matrix A by the vector p(0).
We can do the same to get p(2) from p(1).
After n time steps, p(n) = A^n p(0).
Best path and string probability are also not hard.
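
A quick numeric check of this update, not from the slides. It uses numpy with the row-vector convention, so one step is p(0) A rather than A p(0), which matches the "choose a transition from the row of A" description. Only the initial probabilities and the three transition probabilities into "dogs" come from the slides; the other entries of A are invented so that each row sums to one.

    import numpy as np

    states = ["the", "dogs", "bit"]
    p0 = np.array([0.7, 0.2, 0.1])                 # initial probabilities from the slide

    # Rows are "from" states, columns are "to" states.  Only the "dogs"
    # column (0.46, 0.15, 0.32) is from the slides; the rest is made up.
    A = np.array([[0.30, 0.46, 0.24],
                  [0.55, 0.15, 0.30],
                  [0.38, 0.32, 0.30]])

    p1 = p0 @ A                                    # one time step
    print(dict(zip(states, np.round(p1, 3))))      # p(1)["dogs"] == 0.384

    n = 10
    pn = p0 @ np.linalg.matrix_power(A, n)         # n time steps
    print(pn, pn.sum())                            # still a probability distribution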

Hidden Markov Models
Now you don't know the state sequence.
(Diagram: a three-state model with hidden states det, vb and n; each state emits words such as "these", "a", "the", "dogs", "bit", "cats" and "chased".)

Matrix form of HMMs
Transition matrix A: rows and columns are indexed by the states DET, N, VB.
Emission matrix B: rows are the states DET, N, VB; columns are the words Dogs, Bit, The, ...
Start with initial probabilities p(0):
  Det  0.7
  N    0.2
  VB   0.1

Using Hidden Markov models
Generation:
  Draw from p(0)
  Choose a transition from the relevant row of A
  Choose an emission from the relevant row of B
After n time steps, p(n) = A^n p(0).
Easy because the state stays known.
If one wanted, one could generate all possible strings, annotating with probability.
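
A sketch of the generation procedure in Python, not from the slides. The matrices A and B are invented for illustration (only their shapes and the initial probabilities match the slides), and the bookkeeping of emitting from the current state before transitioning is an assumption.

    import numpy as np

    rng = np.random.default_rng(0)

    states = ["DET", "N", "VB"]
    words = ["the", "dogs", "bit"]

    p0 = np.array([0.7, 0.2, 0.1])                  # initial probabilities from the slide

    # Invented numbers: rows of A are "from" states, rows of B are emitting states.
    A = np.array([[0.05, 0.80, 0.15],
                  [0.10, 0.30, 0.60],
                  [0.70, 0.25, 0.05]])
    B = np.array([[0.90, 0.05, 0.05],
                  [0.05, 0.70, 0.25],
                  [0.05, 0.35, 0.60]])

    def generate(T):
        """Sample a tag sequence and a word sequence of length T."""
        s = rng.choice(len(states), p=p0)           # draw the initial state from p(0)
        tags, text = [], []
        for _ in range(T):
            tags.append(states[s])
            text.append(words[rng.choice(len(words), p=B[s])])  # emission from row s of B
            s = rng.choice(len(states), p=A[s])     # transition from row s of A
        return tags, text

    print(generate(5))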

State sequences
All you see is the output: "The bit dogs ..."
But you can't tell which of:
  DET N VB ...
  DET VB N ...
  DET N N ...
  DET VB VB ...
Each of these has different probabilities.
Don't know which state you are in.

The three problems
Probability estimation: given a sequence of observations O and a model M, find P(O|M).
Best path estimation: given a sequence of observations O and a model M, find a sequence of states I which maximizes P(O,I|M).

The third problem
Training: adjust the model parameters so that P(O|M) is as large as possible for the given O.
Hard problem, because there are so many adjustable parameters which could vary.

Probability estimation
Easy in principle: form the joint probability of state sequences and observations, P(O,I|M), then marginalize out I.
But this involves a sum over exponentially many paths.
The efficient algorithm uses the idea that the probability of a state at time t+1 is easy to get from knowledge of all states at time t.

Probability estimation: getting the next time step
(Diagram: a trellis slice for the input "... bit dogs ...". The forward probabilities α_DET(t), α_VB(t) and α_N(t), the emission probabilities b_det(bit), b_vb(bit), b_n(bit) and the transition probabilities a_det,i, a_vb,i, a_n,i combine to give α_i(t+1); b_i(dogs) is the emission from the state just reached.)
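
A sketch of the forward pass, not from the slides. It uses the common convention in which α_i(t) includes the emission at time t; the slides use a slightly different state-emission bookkeeping (noted on the "Forward and Backward" slide below), but the resulting sentence probability is the same. The toy parameters are the invented ones from the generation sketch above.

    import numpy as np

    def forward(p0, A, B, obs):
        """alpha[t, i] = p(o_1 .. o_t, state at time t = i); also return p(O | M)."""
        T, N = len(obs), len(p0)
        alpha = np.zeros((T, N))
        alpha[0] = p0 * B[:, obs[0]]                      # initial state and first emission
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # sum over predecessors, then emit
        return alpha, alpha[-1].sum()

    p0 = np.array([0.7, 0.2, 0.1])
    A = np.array([[0.05, 0.80, 0.15], [0.10, 0.30, 0.60], [0.70, 0.25, 0.05]])
    B = np.array([[0.90, 0.05, 0.05], [0.05, 0.70, 0.25], [0.05, 0.35, 0.60]])

    obs = [0, 1, 2]                                       # "the dogs bit"
    alpha, prob = forward(p0, A, B, obs)
    print(prob)                                           # p("the dogs bit" | model)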

Event 1
Arrive in state j at time step t. (big event)

Event 2
Generate word k from state j.

Event 3
Transition from state j to state i.

Event 4
Continue from j to the end of the string. (big event)

Best path
(Diagram: the same trellis slice as before, with α_DET(t), α_VB(t), α_N(t), the emissions b_det(bit), b_vb(bit), b_n(bit), b_i(dogs) and the transitions a_det,i, a_vb,i, a_n,i.)
Maximize, not sum.
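
A sketch of the Viterbi recursion, not from the slides: the same pattern as the forward pass, but with a maximum over predecessor states instead of a sum, plus back-pointers so the best state sequence can be read off. The toy parameters from the earlier sketches are assumed again.

    import numpy as np

    def viterbi(p0, A, B, obs):
        """Most probable state sequence for obs, and its probability."""
        T, N = len(obs), len(p0)
        delta = np.zeros((T, N))
        back = np.zeros((T, N), dtype=int)
        delta[0] = p0 * B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A          # scores[i, j]: best path into i, then i -> j
            back[t] = scores.argmax(axis=0)             # remember the best predecessor of each j
            delta[t] = scores.max(axis=0) * B[:, obs[t]]
        path = [int(delta[-1].argmax())]                # best final state
        for t in range(T - 1, 0, -1):                   # follow the back-pointers
            path.append(int(back[t, path[-1]]))
        return list(reversed(path)), delta[-1].max()

    p0 = np.array([0.7, 0.2, 0.1])
    A = np.array([[0.05, 0.80, 0.15], [0.10, 0.30, 0.60], [0.70, 0.25, 0.05]])
    B = np.array([[0.90, 0.05, 0.05], [0.05, 0.70, 0.25], [0.05, 0.35, 0.60]])

    states = ["DET", "N", "VB"]
    path, p = viterbi(p0, A, B, [0, 1, 2])              # "the dogs bit"
    print([states[s] for s in path], p)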

Backward probabilities
Counterpart of the forward probabilities.
(Diagram: a trellis slice for "... bit dogs cat ...": α_DET(t), α_VB(t), α_N(t) feed α_i(t+1) through the emissions b_det(bit), b_vb(bit), b_n(bit) and the transitions a_det,i, a_vb,i, a_n,i, while β_i(t+1) is built from β_DET(t+2), β_VB(t+2), β_N(t+2) through the transitions a_i,det, a_i,vb, a_i,n and the emissions b_det(cat), b_vb(cat), b_n(cat); b_i(dogs) is the emission from state i.)
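
A sketch of the backward pass, not from the slides, using the same textbook convention as the forward sketch above: β_i(t) is the probability of the words after time t given that the model is in state i at time t, so β is 1 at the last time step and is filled in from right to left. Combined with the forward sketch, (alpha[t] * beta[t]).sum() is the same for every t and equals p(O | M), which is the identity used on the next slides.

    import numpy as np

    def backward(A, B, obs):
        """beta[t, i] = p(o_{t+1} .. o_T | state at time t = i)."""
        T, N = len(obs), A.shape[0]
        beta = np.ones((T, N))                              # nothing left to generate at the end
        for t in range(T - 2, -1, -1):
            # transition out of i, emit the next word, then continue to the end of the string
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        return beta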

Forward and Backward
Note that our notation is not quite the same as that in M&S p. 334. Ours is a state-emission HMM, theirs is an arc-emission HMM. See the note on p. 338 for more details.
We assume that α_i(t) includes the probability of generating words up to but not including the one in the state just reached.
β_i(t) therefore starts by generating this word.

State probabilities
α_i(t) β_i(t) is p(in state i at time t, all words).
The sum over all states k of α_k(t) β_k(t) is p(sentence).
p(in state i at time t) is α_i(t) β_i(t) / (Σ_k α_k(t) β_k(t)).
p(in state i) is the average over all time ticks of p(in state i at time t).
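
The same computation in a few lines of numpy, not from the slides, assuming alpha and beta arrays of shape (T, number of states) as produced by the forward and backward sketches above:

    import numpy as np

    def state_posteriors(alpha, beta):
        """gamma[t, i] = p(in state i at time t | all words)."""
        gamma = alpha * beta
        return gamma / gamma.sum(axis=1, keepdims=True)     # normalise each time tick

Averaging the rows of gamma over t then gives p(in state i).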

Training
Uses forward and backward probabilities.
Starts from an initial guess.
Improves the initial guess using data.
Stops at a (locally) best model.
A specialization of the EM algorithm.

Factorizing the path
Consider p(in state i at time t and in state j at time t+1 | Model, Observations).
We could see this as two things:
  Get to i while generating words up to t
  Get from t to the end of the corpus while generating the remaining words

Factorizing the path 2
Consider p(in state i at time t and in state j at time t+1 | Model, Observations).
We could see this as four things:
  Get to i while generating words up to t
  Generate a word from i
  Make the correct transition from i to j
  Get from t+1 to the end of the corpus while generating the remaining words
The merit of this is that we can use the current model for the inside bit.

Factorizing the path 3
Consider p(in state i at time t and in state j at time t+1 | Model, Observations).
We could see this as:
  Get to i while generating words up to t
  Repeat ad lib:
    Generate a word from the current state
    Make a transition that generates the word that we saw
  Get from t+k to the end of the corpus while generating the remaining words
If we wanted, the model for the inside bit could be a bit more complicated than we assumed above. Research topic.

Expected transition counts 2
We have these things already:
  Forward prob: α_i(t)
  Transition prob: a_ij
  Emission prob: b_j(word)
  Backward prob: β_j(t+1)

Expected transition counts 3
(Diagram: a trellis fragment for the transition from state i to state j while reading "... bit dogs ...": the forward probability α_i(t), the emission b_i(bit), the transition a_i,j, the emission b_j(dogs) and the backward probability β(t+1), surrounded by the neighbouring transitions and emissions.)

Estimated transition probabilities
α_i(t) a_ij b_j(word) β_j(t+1) is count(in state i at time t, in state j at time t+1, all words).
p(in state i at time t, in state j at time t+1) is
  α_i(t) a_ij b_j(word) β_j(t+1) / (Σ_k α_i(t) a_ik b_k(word) β_k(t+1)).
Sum over all time ticks to get expected transition counts.
Derive new probabilities from these counts.

Estimated emission probabilities
Calculate the expected number of times in state j at the places where a particular word happened.
Divide by the expected number of times in state j, summed over time ticks; this gives the new emission probability.
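
One full re-estimation step in numpy, not from the slides. It uses the standard textbook (Rabiner-style) indexing, which differs slightly from the slides' state-emission bookkeeping, and it repeats the forward and backward passes so that the snippet is self-contained; all parameter values fed to it are assumptions.

    import numpy as np

    def forward(p0, A, B, obs):
        T, N = len(obs), len(p0)
        alpha = np.zeros((T, N))
        alpha[0] = p0 * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        return alpha

    def backward(A, B, obs):
        T, N = len(obs), A.shape[0]
        beta = np.ones((T, N))
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        return beta

    def baum_welch_step(p0, A, B, obs):
        """Expected counts under the current model, then new probabilities."""
        obs = np.asarray(obs)
        alpha, beta = forward(p0, A, B, obs), backward(A, B, obs)
        prob = alpha[-1].sum()                              # p(O | current model)
        gamma = alpha * beta / prob                         # p(in state i at time t | O)
        # xi[t, i, j] = p(in state i at t and in state j at t+1 | O)
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / prob
        new_p0 = gamma[0]
        new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        new_B = np.zeros_like(B)
        for k in range(B.shape[1]):                         # expected emission counts per word
            new_B[:, k] = gamma[obs == k].sum(axis=0)
        new_B /= gamma.sum(axis=0)[:, None]
        return new_p0, new_A, new_B, prob

Iterating baum_welch_step and feeding the new parameters back in gives the re-estimation loop described on the next slides; p(O | model) never decreases from one iteration to the next.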

Re-estimation (for everybody)
Recall that we guessed the initial parameters.
Replace the initial parameters with new ones derived as above.
These will be better than the originals because:
  The data ensures that we only consider paths which can generate the words that we did see in the corpus
  Paths which fit the data well get taken frequently, bad paths infrequently

Re-estimation (the details)
Baum et al. show that this will always converge to a local maximum.
An instance of Dempster, Laird and Rubin's EM algorithm.
For a modern review of EM see: ftp://ftp.cs.utoronto.ca/pub/radford/emk.pdf

Summary
Three problems solved.
A simple model based on finite-state technology.
Sensitive to a limited range of context information.
Re-estimation as an instance of the EM algorithm.

Where to get more information
Maryland implementation in C
My implementation in Python
Matlab code by Zoubin Ghahramani
Manning and Schütze, ch. 9
Charniak, chapters 3 and 4
_overview.html