Hidden Markov Models (HMMs) Chapter 3 (Duda et al.) – Section 3.10 (Warning: this section has lots of typos) CS479/679 Pattern Recognition Spring 2013 – Dr. George Bebis

Sequential vs Temporal Patterns Sequential patterns: – The order of data points is irrelevant. Temporal patterns: – The order of data points is important (i.e., time series). – Data can be represented by a number of states. – States at time t are influenced directly by states in previous time steps (i.e., correlated).

Hidden Markov Models (HMMs) HMMs are appropriate for problems that have an inherent temporality. – Speech recognition – Gesture recognition – Human activity recognition

First-Order Markov Models Represented by a graph where every node corresponds to a state ω i. The graph can be fully-connected with self-loops.

First-Order Markov Models (cont’d) Links between nodes ω_i and ω_j are associated with a transition probability: P(ω(t+1)=ω_j / ω(t)=ω_i) = a_ij, which is the probability of going to state ω_j at time t+1 given that the state at time t was ω_i (first-order model).

First-Order Markov Models (cont’d) Markov models are fully described by their transition probabilities a_ij. The following constraints should be satisfied: a_ij ≥ 0 for all i, j, and Σ_j a_ij = 1 for every state ω_i.

Example: Weather Prediction Model Assume three weather states: – ω_1: Precipitation (rain, snow, hail, etc.) – ω_2: Cloudy – ω_3: Sunny The transition matrix A = [a_ij] over these three states is given in the slide figure.

Computing the probability P(ω^T) of a sequence of states ω^T Given a sequence of states ω^T = (ω(1), ω(2), ..., ω(T)), the probability that the model generated ω^T is equal to the product of the corresponding transition probabilities: P(ω^T) = Π_{t=1..T} P(ω(t) / ω(t−1)), where P(ω(1) / ω(0)) = P(ω(1)) is the prior probability of the first state.
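A minimal Python sketch of this product; the function name and the list-of-lists representation are illustrative, not from the slides:

```python
def markov_sequence_prob(prior, A, state_seq):
    """P(omega^T) for a first-order Markov model.

    prior[i]  : P(omega(1) = state i)
    A[i][j]   : a_ij = P(omega(t+1) = j / omega(t) = i)
    state_seq : state indices omega(1), ..., omega(T)
    """
    p = prior[state_seq[0]]                      # P(omega(1))
    for prev, cur in zip(state_seq, state_seq[1:]):
        p *= A[prev][cur]                        # multiply by a_{prev, cur}
    return p
```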

Example: Weather Prediction Model (cont’d) What is the probability that the weather for eight consecutive days is “sunny-sunny-sunny-rainy-rainy-sunny-cloudy-sunny”? ω^8 = (ω_3, ω_3, ω_3, ω_1, ω_1, ω_3, ω_2, ω_3) and P(ω^8) = P(ω_3) P(ω_3/ω_3) P(ω_3/ω_3) P(ω_1/ω_3) P(ω_1/ω_1) P(ω_3/ω_1) P(ω_2/ω_3) P(ω_3/ω_2) = 1.536 x 10^-4
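The transcript does not show the numeric transition values (they appear only in the slide figure). The numbers below are the classic Rabiner weather-example values; they do reproduce the 1.536 x 10^-4 result, but treat them as an assumption. Using markov_sequence_prob from the sketch above:

```python
# States: 0 = rainy (omega_1), 1 = cloudy (omega_2), 2 = sunny (omega_3)
A = [[0.4, 0.3, 0.3],    # rainy  -> rainy, cloudy, sunny
     [0.2, 0.6, 0.2],    # cloudy -> rainy, cloudy, sunny
     [0.1, 0.1, 0.8]]    # sunny  -> rainy, cloudy, sunny
prior = [0.0, 0.0, 1.0]  # assume day 1 is known to be sunny

seq = [2, 2, 2, 0, 0, 2, 1, 2]   # sunny-sunny-sunny-rainy-rainy-sunny-cloudy-sunny
print(markov_sequence_prob(prior, A, seq))   # ~1.536e-04
```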

Limitations of Markov models In Markov models, each state is uniquely associated with an observable event. Once an observation is made, the state of the system is trivially retrieved. Such systems are not of practical use for most applications.

Hidden States and Observations Assume that each state can generate a number of outputs (i.e., observations) according to some probability distribution. Each observation can potentially be generated at any state. State sequence is not directly observable (i.e., hidden) but can be approximated from observation sequence.

First-order HMMs Augment Markov model such that when it is in state ω(t) it also emits some symbol v(t) (visible state) among a set of possible symbols. We have access to the visible states v(t) only, while ω(t) are unobservable.

Example: Weather Prediction Model (cont’d) Observations: v_1: temperature, v_2: humidity, etc.

Observation Probabilities When the model is in state ω_j at time t, the probability of emitting a visible state v_k at that time is denoted as P(v(t)=v_k / ω(t)=ω_j) = b_jk, where Σ_k b_jk = 1 for every state ω_j (observation probabilities). For every sequence of hidden states, there is an associated sequence of visible states: ω^T = (ω(1), ω(2), ..., ω(T)) → V^T = (v(1), v(2), ..., v(T))

Absorbing State ω_0 Given a state sequence and its corresponding observation sequence, ω^T = (ω(1), ω(2), ..., ω(T)) → V^T = (v(1), v(2), ..., v(T)), we assume that ω(T) = ω_0 is some absorbing state, which uniquely emits the symbol v(T) = v_0. Once the system enters the absorbing state, it cannot escape from it.

HMM Formalism An HMM is defined by {Ω, V, π, A, B}: – Ω = {ω_1, ..., ω_n} are the possible states – V = {v_1, ..., v_m} are the possible observations – π = {π_i} are the prior state probabilities – A = {a_ij} are the state transition probabilities – B = {b_jk} are the observation (emission) probabilities
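A compact container for these five ingredients; this is a hypothetical helper, not an API used in the course:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HMM:
    states: List[str]       # Omega = {omega_1, ..., omega_n}
    symbols: List[str]      # V = {v_1, ..., v_m}
    pi: List[float]         # pi_i : prior probability of starting in state i
    A: List[List[float]]    # a_ij : P(omega(t+1) = j / omega(t) = i)
    B: List[List[float]]    # b_jk : P(v(t) = v_k / omega(t) = omega_j)

    def check(self, tol=1e-9):
        """Sanity check: pi and every row of A and B must sum to 1."""
        assert abs(sum(self.pi) - 1.0) < tol
        assert all(abs(sum(row) - 1.0) < tol for row in self.A)
        assert all(abs(sum(row) - 1.0) < tol for row in self.B)
```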

Some Terminology Causal: the probabilities depend only upon previous states. Ergodic: given some starting state, every one of the states has a non-zero probability of occurring. A “left-right” HMM, by contrast, only allows transitions from a state to itself or to later states (no going back).

Coin toss example You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening on the other side. On the other side of the barrier is another person who is performing a coin (or multiple coin) toss experiment. The other person will tell you only the result of the experiment, not how he obtained that result, e.g., V^T = HHTHTTHH...T = (v(1), v(2), ..., v(T))

Coin toss example (cont’d) Problem: derive an HMM model to explain the observed sequence of heads and tails. – The coins represent the hidden states since we do not know which coin was tossed each time. – The outcome of each toss represents an observation. – A “likely” sequence of coins (state sequence) may be inferred from the observations. – The state sequence might not be unique in general.

Coin toss example: 1-fair coin model There are 2 states, each associated with either heads (state 1) or tails (state 2). The observation sequence uniquely defines the states (i.e., states are not hidden). observation probabilities

Coin toss example: 2-fair coins model There are 2 states, each associated with a coin; a third coin is used to decide which of the fair coins to flip. Neither state is uniquely associated with either heads or tails. observation probabilities

Coin toss example: 2-biased coins model There are 2 states, each associated with a biased coin; a third coin is used to decide which of the biased coins to flip. Neither state is uniquely associated with either heads or tails. observation probabilities
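A short generative sketch of the two-biased-coins model. The bias values (0.75 / 0.25) and the fair selector coin are made-up illustrative numbers, not taken from the slides:

```python
import random

def sample_hmm(pi, A, B, symbols, T, seed=0):
    """Generate T visible symbols from an HMM; the state path stays hidden."""
    rng = random.Random(seed)
    state = rng.choices(range(len(pi)), weights=pi)[0]
    out = []
    for _ in range(T):
        out.append(rng.choices(symbols, weights=B[state])[0])    # emit a symbol
        state = rng.choices(range(len(A)), weights=A[state])[0]  # move to the next state
    return "".join(out)

pi = [0.5, 0.5]                      # start with either coin
A  = [[0.5, 0.5], [0.5, 0.5]]        # a fair third coin picks the next coin to flip
B  = [[0.75, 0.25], [0.25, 0.75]]    # coin 1 favours heads, coin 2 favours tails
print(sample_hmm(pi, A, B, ["H", "T"], T=20))  # e.g. 'HHTHTT...'; which coin produced each flip stays hidden
```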

Coin toss example: 3-biased coins model There are 3 states, each associated with a biased coin; we decide which coin to flip in some way (e.g., using other coins). No state is uniquely associated with either heads or tails. observation probabilities

Which model is best? Since the states are not observable, the best we can do is to select the model θ that best explains the observations: max_θ P(V^T / θ). Longer observation sequences typically make it easier to select the best model.

Classification Using HMMs Given an observation sequence V^T and a set of possible models θ, choose the model with the highest posterior probability P(θ / V^T). Bayes rule: P(θ / V^T) = P(V^T / θ) P(θ) / P(V^T); since P(V^T) is the same for all models, it suffices to compare P(V^T / θ) P(θ).
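In code, this is just an argmax over models once the evaluation problem (next slides) gives us P(V^T / θ); the function names here are hypothetical:

```python
def classify(VT, models, priors, likelihood_fn):
    """Pick the model theta with the largest posterior P(theta / V^T).

    likelihood_fn(VT, theta) should return P(V^T / theta), e.g. via the
    forward algorithm.  P(V^T) is the same for every model, so comparing
    P(V^T / theta) * P(theta) is sufficient.
    """
    scores = [likelihood_fn(VT, theta) * p for theta, p in zip(models, priors)]
    return max(range(len(models)), key=lambda i: scores[i])   # index of the best model
```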

Three basic HMM problems Evaluation – Determine the probability P(V T ) that a particular sequence of visible states V T was generated by a given model (i.e., Forward/Backward algorithm). Decoding – Given a sequence of visible states V T, determine the most likely sequence of hidden states ω T that led to those observations (i.e., using Viterbi algorithm). Learning – Given a set of visible observations, determine a ij and b jk (i.e., using EM algorithm - Baum-Welch algorithm).

Evaluation The probability that a model produces V^T can be computed using the theorem of total probability: P(V^T) = Σ_{r=1..r_max} P(V^T / ω_r^T) P(ω_r^T), where ω_r^T = (ω(1), ω(2), ..., ω(T)) is a possible state sequence and r_max is the maximum number of state sequences. For a model with c states ω_1, ω_2, ..., ω_c, r_max = c^T.

Evaluation (cont’d) We can rewrite each term as follows: P(V^T / ω_r^T) = Π_{t=1..T} P(v(t) / ω(t)) and P(ω_r^T) = Π_{t=1..T} P(ω(t) / ω(t−1)). Combining the two equations we have: P(V^T) = Σ_{r=1..r_max} Π_{t=1..T} P(v(t) / ω(t)) P(ω(t) / ω(t−1)).

Evaluation (cont’d) Given a_ij and b_jk, it is straightforward to compute P(V^T). What is the computational complexity? O(T r_max) = O(T c^T)
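A brute-force evaluation that literally enumerates all c^T state sequences, just to make the O(T c^T) cost concrete; an illustrative sketch, not how evaluation is done in practice:

```python
from itertools import product

def evaluate_brute_force(pi, A, B, obs):
    """P(V^T) by summing over every possible hidden-state sequence."""
    c, T = len(pi), len(obs)
    total = 0.0
    for path in product(range(c), repeat=T):          # c^T candidate sequences
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, T):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total
```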

Recursive computation of P(V^T) (HMM Forward) (trellis figure: each state ω_i at time t is connected to each state ω_j at time t+1, and the observed symbols v(1), ..., v(T) are emitted along the way)

Recursive computation of P(V^T) (HMM Forward) (cont’d) Define α_j(t) = P(v(1), ..., v(t), ω(t)=ω_j), the probability of having generated the first t observations and being in state ω_j at time t. Marginalizing over the state at time t−1 gives the recursion: α_j(t) = b_j,v(t) Σ_i α_i(t−1) a_ij

Recursive computation of P(V^T) (HMM Forward) (cont’d) (trellis figure; the computation terminates in the absorbing state ω_0)

HMM Forward algorithm: initialize t = 0, a_ij, b_jk, the visible sequence V^T, and α_j(0); for t = 1 to T do: for j = 1 to c do: α_j(t) = b_j,v(t) Σ_i α_i(t−1) a_ij; until t = T; return P(V^T) = α_0(T) (i.e., the entry that corresponds to state ω(T) = ω_0, reached when t = T and j = 0). What is the computational complexity in this case? O(T c^2)
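The same recursion in Python. For simplicity this sketch indexes states 0..c-1, uses a prior vector instead of the absorbing-state convention, and returns P(V^T) as the sum of the final alphas; it is an O(T c^2) illustration, not the slide's exact pseudocode:

```python
def forward(pi, A, B, obs):
    """alpha[t][j] = P(first t+1 observations, state j at step t).  Returns (alpha, P(V^T))."""
    c, T = len(pi), len(obs)
    alpha = [[0.0] * c for _ in range(T)]
    for j in range(c):                                # initialization (t = 0)
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):                             # recursion
        for j in range(c):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(c))
    return alpha, sum(alpha[T - 1])                   # termination
```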

Example (the example’s transition probabilities a_ij and observation probabilities b_jk over states ω_0, ω_1, ω_2, ω_3 are given in the slide figure)

Example (cont’d) Starting from the known initial state, compute α_j(1); similarly for t = 2, 3, 4. Finally, P(V^T) is read off for V^T = v_1 v_3 v_2 v_0.

Recursive computation of P(V^T) (HMM backward) Define β_i(t) = P(v(t+1), ..., v(T) / ω(t)=ω_i), the probability that the model will generate the rest of the observation sequence given that it is in state ω_i at time t (trellis figure, now traversed backwards in time).

Recursive computation of P(V^T) (HMM backward) (cont’d) Marginalizing over the state at time t+1 gives the recursion: β_i(t) = Σ_j a_ij b_j,v(t+1) β_j(t+1)

Recursive computation of P(V T ) (HMM backward) (cont’d)
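A matching backward pass, under the same simplifications as the forward sketch above; the termination line shows that it recovers the same P(V^T):

```python
def backward(pi, A, B, obs):
    """beta[t][i] = P(observations after step t / state i at step t).  Returns (beta, P(V^T))."""
    c, T = len(pi), len(obs)
    beta = [[0.0] * c for _ in range(T)]
    for i in range(c):                                # initialization at the final step
        beta[T - 1][i] = 1.0
    for t in range(T - 2, -1, -1):                    # recursion, backwards in time
        for i in range(c):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(c))
    p_VT = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(c))   # same value as forward()
    return beta, p_VT
```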

Decoding Find the most probable sequence of hidden states. Use an optimality criterion - different optimality criteria lead to different solutions. Algorithm 1: choose the states ω(t) which are individually most likely.

Decoding – Algorithm 1

Decoding (cont’d) Algorithm 2: at each time step t, find the state that has the highest probability α i (t) (i.e., use forward algorithm with minor changes).

Decoding – Algorithm 2

Decoding – Algorithm 2 (cont’d)

There is no guarantee that the path is a valid one: the path might imply a transition that is not allowed by the model. Example: a step from ω_3 to ω_2 in the decoded path is not allowed if a_32 = 0.

Decoding (cont’d) Algorithm 3: find the single best sequence ω^T by maximizing P(ω^T / V^T). This is the most widely used approach, known as the Viterbi algorithm.

Decoding – Algorithm 3 Maximize P(ω^T / V^T); since P(V^T) does not depend on the state sequence, this is equivalent to maximizing the joint probability P(ω^T, V^T).

Decoding – Algorithm 3 (cont’d) Recursion (similar to the Forward Algorithm, except that it uses maximization over previous states instead of summation): δ_j(t) = b_j,v(t) max_i [δ_i(t−1) a_ij], keeping a back-pointer to the maximizing state so the best path can be recovered at the end.
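A Viterbi sketch under the same simplified conventions as the forward/backward code above (prior vector, no absorbing state); delta plays the role of alpha with the sum replaced by a max, and psi stores the back-pointers:

```python
def viterbi(pi, A, B, obs):
    """Most likely hidden-state sequence for obs."""
    c, T = len(pi), len(obs)
    delta = [[0.0] * c for _ in range(T)]   # best path probability ending in state j at step t
    psi   = [[0]   * c for _ in range(T)]   # back-pointer to the best previous state
    for j in range(c):
        delta[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):
        for j in range(c):
            best_i = max(range(c), key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best_i
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
    path = [max(range(c), key=lambda j: delta[T - 1][j])]   # best final state
    for t in range(T - 1, 0, -1):                           # backtrack along the pointers
        path.append(psi[t][path[-1]])
    return list(reversed(path))
```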

Learning Determine the transition and emission probabilities a_ij and b_jk from a set of training examples (i.e., observation sequences V_1^T, V_2^T, ..., V_n^T). There is no known way to find the ML solution analytically. – It would be easy if we knew the hidden states. – Hidden variable problem → use the EM algorithm!

Learning (cont’d) EM algorithm – Update a_ij and b_jk iteratively to better explain the observed training sequences V = {V_1^T, V_2^T, ..., V_n^T}. Expectation step: compute p(ω^T / V^T, θ_t). Maximization step: θ_{t+1} = argmax_θ E[log p(ω^T, V^T / θ) / V^T, θ_t]

Learning (cont’d) Updating transition/emission probabilities:

Learning (cont’d) Expectation step: define γ_ij(t), the probability of transitioning from ω_i to ω_j at step t given V^T: γ_ij(t) = P(ω(t−1)=ω_i, ω(t)=ω_j / V^T, θ)

Learning (cont’d) In terms of the forward and backward variables, γ_ij(t) = α_i(t−1) a_ij b_j,v(t) β_j(t) / P(V^T / θ) (the trellis figure highlights the transition from ω_i at time t−1 to ω_j at time t).

Learning (cont’d) Maximization step: re-estimate the transition probabilities as â_ij = Σ_{t=1..T} γ_ij(t) / Σ_{t=1..T} Σ_k γ_ik(t), i.e., the expected number of transitions from ω_i to ω_j divided by the expected total number of transitions out of ω_i.

Learning (cont’d) Maximization step: re-estimate the emission probabilities as b̂_jk = Σ_{t: v(t)=v_k} γ_j(t) / Σ_{t=1..T} γ_j(t), where γ_j(t) = Σ_i γ_ij(t), i.e., the expected number of times symbol v_k is emitted from ω_j divided by the expected number of visits to ω_j.
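Putting the expectation and maximization steps together for a single training sequence; this sketch reuses the forward() and backward() functions above, omits the absorbing state, and does no numerical scaling, so it only shows the shape of one Baum-Welch iteration:

```python
def baum_welch_step(pi, A, B, obs):
    """One EM iteration: returns re-estimated (pi, A, B) for one observation sequence."""
    c, m, T = len(pi), len(B[0]), len(obs)
    alpha, p_VT = forward(pi, A, B, obs)
    beta, _ = backward(pi, A, B, obs)
    # gamma[t][i] = P(state i at step t / V^T);  xi[t][i][j] = P(state i at t, state j at t+1 / V^T)
    gamma = [[alpha[t][i] * beta[t][i] / p_VT for i in range(c)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_VT
            for j in range(c)] for i in range(c)] for t in range(T - 1)]
    new_pi = gamma[0][:]                                        # expected start-state occupancy
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))            # expected transitions out of i
              for j in range(c)] for i in range(c)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))                # visits to j that emitted v_k
              for k in range(m)] for j in range(c)]
    return new_pi, new_A, new_B
```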

Practical Problems How do we decide on the number of states and the structure of the model? – Use domain knowledge; otherwise this is a very hard problem! What about the size of the observation sequence? – It should be sufficiently long to guarantee that all state transitions appear a sufficient number of times. – A large amount of training data is necessary to learn the HMM parameters.