Hidden Markov Models Part 2: Algorithms

Hidden Markov Models Part 2: Algorithms CSE 6363 – Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Hidden Markov Model
An HMM consists of:
A set of states $s_1, \dots, s_K$.
An initial state probability function $\pi_k = p(z_1 = s_k)$.
A state transition matrix $A$, of size $K \times K$, where $A_{k,j} = p(z_n = s_j \mid z_{n-1} = s_k)$.
Observation probability functions, also called emission probabilities, defined as $\varphi_k(x) = p(x_n = x \mid z_n = s_k)$.
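To make the notation concrete, here is a minimal sketch of how these parameters might be stored for a discrete-observation HMM, assuming NumPy; the variable names pi, A, phi and the sizes K, V are illustrative choices, not part of the slides.

```python
import numpy as np

# Illustrative parameter layout for a discrete-observation HMM:
#   pi[k]     = p(z_1 = s_k)                  -- initial state probabilities, shape (K,)
#   A[k, j]   = p(z_n = s_j | z_{n-1} = s_k)  -- transition matrix, shape (K, K)
#   phi[k, v] = p(x_n = v | z_n = s_k)        -- emission probabilities, shape (K, V)
K, V = 3, 4                                    # 3 states, 4 observation symbols (arbitrary)
rng = np.random.default_rng(0)

pi = rng.random(K); pi /= pi.sum()             # each distribution is normalized to sum to 1
A = rng.random((K, K)); A /= A.sum(axis=1, keepdims=True)
phi = rng.random((K, V)); phi /= phi.sum(axis=1, keepdims=True)
```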

The Basic HMM Problems
We will next see algorithms for the four problems that we usually want to solve when using HMMs.
Problem 1: learn an HMM using training data. This involves:
Learning the initial state probabilities $\pi_k$.
Learning the state transition matrix $A$.
Learning the observation probability functions $\varphi_k(x)$.
Problems 2, 3, 4 assume we have already learned an HMM.
Problem 2: given a sequence of observations $X = (x_1, \dots, x_N)$, compute $p(X \mid \text{HMM})$: the probability of $X$ given the model.
Problem 3: given an observation sequence $X = (x_1, \dots, x_N)$, find the hidden state sequence $Z = (z_1, \dots, z_N)$ maximizing $p(Z \mid X, \text{HMM})$.
Problem 4: given an observation sequence $X = (x_1, \dots, x_N)$, compute, for any $n, k$, the probability $p(z_n = s_k \mid X, \text{HMM})$.

The Basic HMM Problems
Problem 1: learn an HMM using training data.
Problem 2: given a sequence of observations $X = (x_1, \dots, x_N)$, compute $p(X \mid \text{HMM})$: the probability of $X$ given the model.
Problem 3: given an observation sequence $X = (x_1, \dots, x_N)$, find the hidden state sequence $Z = (z_1, \dots, z_N)$ maximizing $p(Z \mid X, \text{HMM})$.
Problem 4: given an observation sequence $X = (x_1, \dots, x_N)$, compute, for any $n, k$, the probability $p(z_n = s_k \mid X, \text{HMM})$.
We will first look at algorithms for problems 2, 3, 4. Last, we will look at the standard algorithm for learning an HMM using training data.

Probability of Observations
Problem 2: given an HMM model $\theta$, and a sequence of observations $X = (x_1, \dots, x_N)$, compute $p(X \mid \theta)$: the probability of $X$ given the model.
Inputs:
$\theta = (\pi, A, \varphi_k)$: a trained HMM, with probability functions $\pi$, $A$, $\varphi_k$ specified.
A sequence of observations $X = (x_1, \dots, x_N)$.
Output: $p(X \mid \theta)$.

Probability of Observations
Problem 2: given a sequence of observations $X = (x_1, \dots, x_N)$, compute $p(X \mid \theta)$.
Why do we care about this problem? What would be an example application?

Probability of Observations
Problem 2: given a sequence of observations $X = (x_1, \dots, x_N)$, compute $p(X \mid \theta)$.
Why do we care about this problem? We need to compute $p(X \mid \theta)$ to classify $X$.
Suppose we have multiple models $\theta_1, \dots, \theta_C$. For example, we can have one model for digit 2, one model for digit 3, one model for digit 4, and so on.
A Bayesian classifier classifies $X$ by finding the $\theta_c$ that maximizes $p(\theta_c \mid X)$.
Using Bayes rule, we must maximize $\frac{p(X \mid \theta_c)\, p(\theta_c)}{p(X)}$.
Therefore, we need to compute $p(X \mid \theta_c)$ for each $c$.
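As a small illustration of this classification rule, the hypothetical sketch below picks the class whose model gives the largest $p(X \mid \theta_c)\, p(\theta_c)$, working in log space; the per-model log-likelihoods are assumed to come from the forward algorithm described later, and all names are illustrative.

```python
import numpy as np

def classify(log_likelihoods, log_priors):
    """Return the index c maximizing log p(X | theta_c) + log p(theta_c).

    p(X) is the same for every class, so it can be ignored when comparing classes.
    """
    scores = np.asarray(log_likelihoods) + np.asarray(log_priors)
    return int(np.argmax(scores))
```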

The Sum Rule
The sum rule, which we saw earlier in the course, states that:
$p(X) = \sum_{y \in \mathbb{Y}} p(X, Y = y)$
We can use the sum rule to compute $p(X \mid \theta)$ as follows:
$p(X \mid \theta) = \sum_Z p(X, Z \mid \theta)$
According to this formula, we can compute $p(X \mid \theta)$ by summing $p(X, Z \mid \theta)$ over all possible state sequences $Z$.

The Sum Rule
$p(X \mid \theta) = \sum_Z p(X, Z \mid \theta)$
According to this formula, we can compute $p(X \mid \theta)$ by summing $p(X, Z \mid \theta)$ over all possible state sequences $Z$.
An HMM with parameters $\theta$ specifies a joint distribution function $p(X, Z \mid \theta)$ as:
$p(X, Z \mid \theta) = p(z_1 \mid \theta) \prod_{n=2}^{N} p(z_n \mid z_{n-1}, \theta) \prod_{n=1}^{N} p(x_n \mid z_n, \theta)$
Therefore:
$p(X \mid \theta) = \sum_Z \left[ p(z_1 \mid \theta) \prod_{n=2}^{N} p(z_n \mid z_{n-1}, \theta) \prod_{n=1}^{N} p(x_n \mid z_n, \theta) \right]$

The Sum Rule
$p(X \mid \theta) = \sum_Z \left[ p(z_1 \mid \theta) \prod_{n=2}^{N} p(z_n \mid z_{n-1}, \theta) \prod_{n=1}^{N} p(x_n \mid z_n, \theta) \right]$
According to this formula, we can compute $p(X \mid \theta)$ by summing $p(X, Z \mid \theta)$ over all possible state sequences $Z$.
What would be the time complexity of doing this computation directly?
It would be linear in the number of all possible state sequences $Z$, which is typically exponential in $N$ (the length of $X$).
Luckily, there is a polynomial-time algorithm for computing $p(X \mid \theta)$ that uses dynamic programming.

Dynamic Programming for $p(X \mid \theta)$
$p(X \mid \theta) = \sum_Z \left[ p(z_1 \mid \theta) \prod_{n=2}^{N} p(z_n \mid z_{n-1}, \theta) \prod_{n=1}^{N} p(x_n \mid z_n, \theta) \right]$
To compute $p(X \mid \theta)$, we use a dynamic programming algorithm that is called the forward algorithm.
We define a 2D array of problems, of size $N \times K$.
$N$: length of observation sequence $X$.
$K$: number of states in the HMM.
Problem $(n, k)$: compute $\alpha[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta)$.

The Forward Algorithm - Initialization
$p(X \mid \theta) = \sum_Z \left[ p(z_1 \mid \theta) \prod_{n=2}^{N} p(z_n \mid z_{n-1}, \theta) \prod_{n=1}^{N} p(x_n \mid z_n, \theta) \right]$
Problem $(n, k)$: compute $\alpha[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta)$.
Problem $(1, k)$: ???

The Forward Algorithm - Initialization
$p(X \mid \theta) = \sum_Z \left[ p(z_1 \mid \theta) \prod_{n=2}^{N} p(z_n \mid z_{n-1}, \theta) \prod_{n=1}^{N} p(x_n \mid z_n, \theta) \right]$
Problem $(n, k)$: compute $\alpha[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta)$.
Problem $(1, k)$: compute $\alpha[1, k] = p(x_1, z_1 = s_k \mid \theta)$.
Solution: ???

The Forward Algorithm - Initialization
$p(X \mid \theta) = \sum_Z \left[ p(z_1 \mid \theta) \prod_{n=2}^{N} p(z_n \mid z_{n-1}, \theta) \prod_{n=1}^{N} p(x_n \mid z_n, \theta) \right]$
Problem $(n, k)$: compute $\alpha[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta)$.
Problem $(1, k)$: compute $\alpha[1, k] = p(x_1, z_1 = s_k \mid \theta)$.
Solution: $\alpha[1, k] = p(x_1, z_1 = s_k \mid \theta) = \pi_k\, \varphi_k(x_1)$.

The Forward Algorithm - Main Loop
Problem $(n, k)$: compute $\alpha[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta)$.
Solution:

The Forward Algorithm - Main Loop
Problem $(n, k)$: compute $\alpha[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta)$.
Solution:
$p(x_1, \dots, x_n, z_n = s_k \mid \theta) = \sum_{j=1}^{K} p(x_1, \dots, x_n, z_{n-1} = s_j, z_n = s_k \mid \theta)$
$= \varphi_k(x_n) \sum_{j=1}^{K} p(x_1, \dots, x_{n-1}, z_{n-1} = s_j, z_n = s_k \mid \theta)$
$= \varphi_k(x_n) \sum_{j=1}^{K} p(x_1, \dots, x_{n-1}, z_{n-1} = s_j \mid \theta)\, A_{j,k}$

The Forward Algorithm - Main Loop
Problem $(n, k)$: compute $\alpha[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta)$.
Solution:
$p(x_1, \dots, x_n, z_n = s_k \mid \theta)$

The Forward Algorithm - Main Loop
Problem $(n, k)$: compute $\alpha[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta)$.
Solution:
$p(x_1, \dots, x_n, z_n = s_k \mid \theta) = \varphi_k(x_n) \sum_{j=1}^{K} \alpha[n-1, j]\, A_{j,k}$
Thus, $\alpha[n, k] = \varphi_k(x_n) \sum_{j=1}^{K} \alpha[n-1, j]\, A_{j,k}$.
$\alpha[n, k]$ is easy to compute once all $\alpha[n-1, j]$ values have been computed.

The Forward Algorithm
Our aim was to compute $p(X \mid \theta)$.
Using the dynamic programming algorithm we just described, we can compute $\alpha[N, k] = p(X, z_N = s_k \mid \theta)$.
How can we use those $\alpha[N, k]$ values to compute $p(X \mid \theta)$?
We use the sum rule: $p(X \mid \theta) = \sum_{k=1}^{K} \alpha[N, k]$.
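Below is a minimal sketch of the forward algorithm for a discrete-observation HMM, assuming NumPy and the parameter layout used earlier (pi of shape (K,), A of shape (K, K) with A[j, k] = p(z_n = s_k | z_{n-1} = s_j), phi of shape (K, V)); the function name and 0-based indexing are illustrative. In practice, scaled or log probabilities would be used to avoid underflow on long sequences.

```python
import numpy as np

def forward(pi, A, phi, x):
    """Forward algorithm: returns alpha (N x K) and p(X | theta) for a symbol sequence x."""
    N, K = len(x), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * phi[:, x[0]]                      # alpha[1, k] = pi_k * phi_k(x_1)
    for n in range(1, N):
        # alpha[n, k] = phi_k(x_n) * sum_j alpha[n-1, j] * A[j, k]
        alpha[n] = phi[:, x[n]] * (alpha[n - 1] @ A)
    return alpha, alpha[-1].sum()                     # p(X | theta) = sum_k alpha[N, k]
```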

Problem 3: Find the Best $Z$
Inputs: a trained HMM $\theta$, and an observation sequence $X = (x_1, \dots, x_N)$.
Output: the state sequence $Z = (z_1, \dots, z_N)$ maximizing $p(Z \mid \theta, X)$.
Why do we care?
Sometimes, we want to use HMMs for classification. In those cases, we do not really care about maximizing $p(Z \mid \theta, X)$; we just want to find the HMM parameters $\theta$ that maximize $p(\theta \mid X)$.
Sometimes, we want to use HMMs to figure out the most likely values of the hidden states given the observations.
In the tree-ring example, our goal was to figure out the average temperature for each year. Those average temperatures were the hidden states. Solving problem 3 provides the most likely sequence of average temperatures.

Problem 3: Find the Best $Z$
Input: an observation sequence $X = (x_1, \dots, x_N)$.
Output: the state sequence $Z = (z_1, \dots, z_N)$ maximizing $p(Z \mid \theta, X)$.
We want to maximize $p(Z \mid \theta, X)$. Using the definition of conditional probabilities:
$p(Z \mid \theta, X) = \frac{p(X, Z \mid \theta)}{p(X \mid \theta)}$
Since $X$ is known, $p(X \mid \theta)$ will be the same over all possible state sequences $Z$.
Therefore, to find the $Z$ that maximizes $p(Z \mid \theta, X)$, it suffices to find the $Z$ that maximizes $p(Z, X \mid \theta)$.

Problem 3: Find the Best $Z$
Input: an observation sequence $X = (x_1, \dots, x_N)$.
Output: the state sequence $Z = (z_1, \dots, z_N)$ maximizing $p(Z \mid \theta, X)$, which is the same as maximizing $p(Z, X \mid \theta)$.
Solution: (again) dynamic programming.
Problem $(n, k)$: compute $G[n, k]$ and $H[n, k]$, where:
$G[n, k] = \underset{z_1, \dots, z_{n-1}}{\operatorname{argmax}}\; p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = s_k \mid \theta)$
Note that:
We optimize over all possible values $z_1, \dots, z_{n-1}$. However, $z_n$ is constrained to be equal to $s_k$.
Values $x_1, \dots, x_n$ are known, and thus they are fixed.

Problem 3: Find the Best $Z$
Input: an observation sequence $X = (x_1, \dots, x_N)$.
Output: the state sequence $Z = (z_1, \dots, z_N)$ maximizing $p(Z \mid \theta, X)$, which is the same as maximizing $p(Z, X \mid \theta)$.
Solution: (again) dynamic programming.
Problem $(n, k)$: compute $G[n, k]$ and $H[n, k]$, where:
$H[n, k] = p(x_1, \dots, x_n, G[n, k] \mid \theta)$
$H[n, k]$ is the joint probability of:
the first $n$ observations $x_1, \dots, x_n$,
the sequence we stored in $G[n, k]$.

Problem 3: Find the Best $Z$
Input: an observation sequence $X = (x_1, \dots, x_N)$.
Output: the state sequence $Z = (z_1, \dots, z_N)$ maximizing $p(Z \mid \theta, X)$, which is the same as maximizing $p(Z, X \mid \theta)$.
Solution: (again) dynamic programming.
Problem $(n, k)$: compute $G[n, k]$ and $H[n, k]$, where:
$H[n, k] = p(x_1, \dots, x_n, G[n, k] \mid \theta)$
$H[n, k]$ is the joint probability of:
the first $n$ observations $x_1, \dots, x_n$,
the sequence we stored in $G[n, k]$.
$G[n, k]$ stores a sequence of state values.
$H[n, k]$ stores a number (a probability).

The Viterbi Algorithm
Problem $(n, k)$: compute $G[n, k]$ and $H[n, k]$, where:
$G[n, k] = \underset{z_1, \dots, z_{n-1}}{\operatorname{argmax}}\; p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = s_k \mid \theta)$
$H[n, k] = p(x_1, \dots, x_n, G[n, k] \mid \theta)$
We use a dynamic programming algorithm to compute $G[n, k]$ and $H[n, k]$ for all $n, k$ such that $1 \le n \le N$, $1 \le k \le K$.

The Viterbi Algorithm - Initialization
Problem $(1, k)$: compute $G[1, k]$ and $H[1, k]$, where:
$G[1, k] = \underset{z_1, \dots, z_{n-1}}{\operatorname{argmax}}\; p(x_1, z_1 = s_k \mid \theta)$
$H[1, k] = p(x_1, z_1 = s_k \mid \theta)$
What is $G[1, k]$? It is the empty sequence. We must maximize over the previous states $z_1, \dots, z_{n-1}$, but $n = 1$, so there are no previous states.
What is $H[1, k]$? $H[1, k] = p(x_1, z_1 = s_k \mid \theta) = \pi_k\, \varphi_k(x_1)$.

The Viterbi Algorithm – Main Loop
Problem $(n, k)$: compute $G[n, k]$ and $H[n, k]$, where:
$G[n, k] = \underset{z_1, \dots, z_{n-1}}{\operatorname{argmax}}\; p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = s_k \mid \theta)$
$H[n, k] = p(x_1, \dots, x_n, G[n, k] \mid \theta)$
How do we find $G[n, k]$? The last element of $G[n, k]$ is some $s_j$.
Therefore, $G[n, k] = G^*[n, k] \otimes s_j$ for some sequence $G^*[n, k]$ and some $j$, where $\otimes$ denotes appending $s_j$ to the end of the sequence.

The Viterbi Algorithm – Main Loop
Problem $(n, k)$: compute $G[n, k]$ and $H[n, k]$, where:
$G[n, k] = \underset{z_1, \dots, z_{n-1}}{\operatorname{argmax}}\; p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = s_k \mid \theta)$
$H[n, k] = p(x_1, \dots, x_n, G[n, k] \mid \theta)$
$G[n, k] = G^*[n, k] \otimes s_j$ for some $G^*[n, k]$ and some $j$.
We can prove that $G^*[n, k] = G[n-1, j]$, for that $j$.
The proof is very similar to the proof that DTW finds the optimal alignment.

The Viterbi Algorithm – Main Loop
Problem $(n, k)$: compute $G[n, k]$ and $H[n, k]$, where:
$G[n, k] = \underset{z_1, \dots, z_{n-1}}{\operatorname{argmax}}\; p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = s_k \mid \theta)$
$H[n, k] = p(x_1, \dots, x_n, G[n, k] \mid \theta)$
We know that $G[n, k] = G[n-1, j] \otimes s_j$ for some $j$. Let's define $j^*$ as:
$j^* = \underset{j = 1, \dots, K}{\operatorname{argmax}}\; H[n-1, j]\, A_{j,k}\, \varphi_k(x_n)$
Then: $G[n, k] = G[n-1, j^*] \otimes s_{j^*}$, and $H[n, k] = H[n-1, j^*]\, A_{j^*,k}\, \varphi_k(x_n)$.

The Viterbi Algorithm – Output
Our goal is to find the $Z$ maximizing $p(Z \mid \theta, X)$, which is the same as finding the $Z$ maximizing $p(Z, X \mid \theta)$.
The Viterbi algorithm computes $G[n, k]$ and $H[n, k]$, where:
$G[n, k] = \underset{z_1, \dots, z_{n-1}}{\operatorname{argmax}}\; p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = s_k \mid \theta)$
$H[n, k] = p(x_1, \dots, x_n, G[n, k] \mid \theta)$
Let's define $k^*$ as:
$k^* = \underset{k = 1, \dots, K}{\operatorname{argmax}}\; H[N, k]$
The output is then the state sequence $Z = G[N, k^*] \otimes s_{k^*}$.
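Here is a minimal Viterbi sketch under the same assumptions as the forward sketch above; instead of storing whole sequences $G[n,k]$, it stores one back-pointer per cell and reconstructs $Z$ at the end, which is a common space-saving implementation choice, not something prescribed by the slides.

```python
import numpy as np

def viterbi(pi, A, phi, x):
    """Return the most likely state sequence Z and its joint probability p(Z, X | theta)."""
    N, K = len(x), len(pi)
    H = np.zeros((N, K))                        # H[n, k], as defined above
    back = np.zeros((N, K), dtype=int)          # back[n, k] = the j* that leads to state k
    H[0] = pi * phi[:, x[0]]                    # H[1, k] = pi_k * phi_k(x_1)
    for n in range(1, N):
        scores = H[n - 1][:, None] * A          # scores[j, k] = H[n-1, j] * A[j, k]
        back[n] = scores.argmax(axis=0)         # j* for each destination state k
        H[n] = phi[:, x[n]] * scores.max(axis=0)
    z = [int(H[-1].argmax())]                   # k* = argmax_k H[N, k]
    for n in range(N - 1, 0, -1):               # follow the back-pointers to recover Z
        z.append(int(back[n, z[-1]]))
    return list(reversed(z)), float(H[-1].max())
```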

State Probabilities at Specific Times
Problem 4:
Inputs: a trained HMM $\theta$, and an observation sequence $X = (x_1, \dots, x_N)$.
Output: an array $\gamma$, of size $N \times K$, where $\gamma[n, k] = p(z_n = s_k \mid X, \theta)$.
In words: given an observation sequence $X$, we want to compute, for any moment in time $n$ and any state $s_k$, the probability that the hidden state at that moment is $s_k$.

State Probabilities at Specific Times
Problem 4:
Inputs: a trained HMM $\theta$, and an observation sequence $X = (x_1, \dots, x_N)$.
Output: an array $\gamma$, of size $N \times K$, where $\gamma[n, k] = p(z_n = s_k \mid X, \theta)$.
We have seen, in our solution for problem 2, that the forward algorithm computes an array $\alpha$, of size $N \times K$, where $\alpha[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta)$.
We can also define another array $\beta$, of size $N \times K$, where $\beta[n, k] = p(x_{n+1}, \dots, x_N \mid \theta, z_n = s_k)$.
In words, $\beta[n, k]$ is the probability of all observations after moment $n$, given that the state at moment $n$ is $s_k$.

State Probabilities at Specific Times
Problem 4:
Inputs: a trained HMM $\theta$, and an observation sequence $X = (x_1, \dots, x_N)$.
Output: an array $\gamma$, of size $N \times K$, where $\gamma[n, k] = p(z_n = s_k \mid X, \theta)$.
The forward algorithm computes an array $\alpha$, of size $N \times K$, where $\alpha[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta)$.
We can also define another array $\beta$, of size $N \times K$, where $\beta[n, k] = p(x_{n+1}, \dots, x_N \mid \theta, z_n = s_k)$.
Then:
$\alpha[n, k] \cdot \beta[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta) \cdot p(x_{n+1}, \dots, x_N \mid \theta, z_n = s_k) = p(X, z_n = s_k \mid \theta)$
Therefore:
$\gamma[n, k] = p(z_n = s_k \mid X, \theta) = \frac{\alpha[n, k] \cdot \beta[n, k]}{p(X \mid \theta)}$

State Probabilities at Specific Times
The forward algorithm computes an array $\alpha$, of size $N \times K$, where $\alpha[n, k] = p(x_1, \dots, x_n, z_n = s_k \mid \theta)$.
We define another array $\beta$, of size $N \times K$, where $\beta[n, k] = p(x_{n+1}, \dots, x_N \mid \theta, z_n = s_k)$.
Then, $\gamma[n, k] = p(z_n = s_k \mid X, \theta) = \frac{\alpha[n, k] \cdot \beta[n, k]}{p(X \mid \theta)}$
In the above equation:
$\alpha[n, k]$ and $p(X \mid \theta)$ are computed by the forward algorithm.
We need to compute $\beta[n, k]$.
We compute $\beta[n, k]$ in a way very similar to the forward algorithm, but going backwards this time.
The resulting combination of the forward and backward algorithms is called (not surprisingly) the forward-backward algorithm.

The Backward Algorithm
We want to compute the values of an array $\beta$, of size $N \times K$, where $\beta[n, k] = p(x_{n+1}, \dots, x_N \mid \theta, z_n = s_k)$.
Again, we will use dynamic programming. Problem $(n, k)$ is to compute value $\beta[n, k]$. However, this time we start from the end of the observations.
First, compute $\beta[N, k] = p(\{\} \mid \theta, z_N = s_k)$. In this case, the observation sequence $x_{n+1}, \dots, x_N$ is empty, because $n = N$.
What is the probability of an empty sequence?

Backward Algorithm - Initialization
We want to compute the values of an array $\beta$, of size $N \times K$, where $\beta[n, k] = p(x_{n+1}, \dots, x_N \mid \theta, z_n = s_k)$.
Again, we will use dynamic programming. Problem $(n, k)$ is to compute value $\beta[n, k]$. However, this time we start from the end of the observations.
First, compute $\beta[N, k] = p(\{\} \mid \theta, z_N = s_k)$. In this case, the observation sequence $x_{n+1}, \dots, x_N$ is empty, because $n = N$.
What is the probability of an empty sequence? $\beta[N, k] = 1$.

Backward Algorithm - Initialization
Next: compute $\beta[N-1, k]$.
$\beta[N-1, k] = p(x_N \mid \theta, z_{N-1} = s_k)$
$= \sum_{j=1}^{K} p(x_N, z_N = s_j \mid \theta, z_{N-1} = s_k)$
$= \sum_{j=1}^{K} p(x_N \mid z_N = s_j)\, p(z_N = s_j \mid \theta, z_{N-1} = s_k)$

Backward Algorithm - Initialization
Next: compute $\beta[N-1, k]$.
$\beta[N-1, k] = p(x_N \mid \theta, z_{N-1} = s_k)$
$= \sum_{j=1}^{K} p(x_N \mid z_N = s_j)\, p(z_N = s_j \mid \theta, z_{N-1} = s_k)$
$= \sum_{j=1}^{K} \varphi_j(x_N)\, A_{k,j}$

Backward Algorithm – Main Loop
Next: compute $\beta[n, k]$, for $n < N-1$.
$\beta[n, k] = p(x_{n+1}, \dots, x_N \mid \theta, z_n = s_k)$
$= \sum_{j=1}^{K} p(x_{n+1}, \dots, x_N, z_{n+1} = s_j \mid \theta, z_n = s_k)$
$= \sum_{j=1}^{K} p(x_{n+1} \mid z_{n+1} = s_j)\, p(x_{n+2}, \dots, x_N, z_{n+1} = s_j \mid \theta, z_n = s_k)$

Backward Algorithm – Main Loop
Next: compute $\beta[n, k]$, for $n < N-1$.
$\beta[n, k] = p(x_{n+1}, \dots, x_N \mid \theta, z_n = s_k)$
$= \sum_{j=1}^{K} p(x_{n+1} \mid z_{n+1} = s_j)\, p(x_{n+2}, \dots, x_N, z_{n+1} = s_j \mid \theta, z_n = s_k)$
$= \sum_{j=1}^{K} \varphi_j(x_{n+1})\, p(x_{n+2}, \dots, x_N \mid \theta, z_{n+1} = s_j)\, A_{k,j}$
We will take a closer look at the last step.

Backward Algorithm – Main Loop
$\sum_{j=1}^{K} p(x_{n+1} \mid z_{n+1} = s_j)\, p(x_{n+2}, \dots, x_N, z_{n+1} = s_j \mid \theta, z_n = s_k) = \sum_{j=1}^{K} \varphi_j(x_{n+1})\, p(x_{n+2}, \dots, x_N \mid \theta, z_{n+1} = s_j)\, A_{k,j}$
$p(x_{n+1} \mid z_{n+1} = s_j) = \varphi_j(x_{n+1})$, by definition of $\varphi_j$.

Backward Algorithm – Main Loop
$\sum_{j=1}^{K} p(x_{n+1} \mid z_{n+1} = s_j)\, p(x_{n+2}, \dots, x_N, z_{n+1} = s_j \mid \theta, z_n = s_k) = \sum_{j=1}^{K} \varphi_j(x_{n+1})\, p(x_{n+2}, \dots, x_N \mid \theta, z_{n+1} = s_j)\, A_{k,j}$
$p(x_{n+2}, \dots, x_N, z_{n+1} = s_j \mid \theta, z_n = s_k) = p(x_{n+2}, \dots, x_N \mid \theta, z_n = s_k, z_{n+1} = s_j)\, p(z_{n+1} = s_j \mid z_n = s_k)$
by application of the chain rule: $P(A, B) = P(A \mid B)\, P(B)$, which can also be written as $P(A, B \mid C) = P(A \mid B, C)\, P(B \mid C)$.

Backward Algorithm – Main Loop
$\sum_{j=1}^{K} p(x_{n+1} \mid z_{n+1} = s_j)\, p(x_{n+2}, \dots, x_N, z_{n+1} = s_j \mid \theta, z_n = s_k) = \sum_{j=1}^{K} \varphi_j(x_{n+1})\, p(x_{n+2}, \dots, x_N \mid \theta, z_{n+1} = s_j)\, A_{k,j}$
$p(x_{n+2}, \dots, x_N, z_{n+1} = s_j \mid \theta, z_n = s_k)$
$= p(x_{n+2}, \dots, x_N \mid \theta, z_n = s_k, z_{n+1} = s_j)\, p(z_{n+1} = s_j \mid z_n = s_k)$
$= p(x_{n+2}, \dots, x_N \mid \theta, z_n = s_k, z_{n+1} = s_j)\, A_{k,j}$ (by definition of $A_{k,j}$)

Backward Algorithm – Main Loop
$\sum_{j=1}^{K} p(x_{n+1} \mid z_{n+1} = s_j)\, p(x_{n+2}, \dots, x_N, z_{n+1} = s_j \mid \theta, z_n = s_k) = \sum_{j=1}^{K} \varphi_j(x_{n+1})\, p(x_{n+2}, \dots, x_N \mid \theta, z_{n+1} = s_j)\, A_{k,j}$
$p(x_{n+2}, \dots, x_N, z_{n+1} = s_j \mid \theta, z_n = s_k)$
$= p(x_{n+2}, \dots, x_N \mid \theta, z_n = s_k, z_{n+1} = s_j)\, p(z_{n+1} = s_j \mid z_n = s_k)$
$= p(x_{n+2}, \dots, x_N \mid \theta, z_n = s_k, z_{n+1} = s_j)\, A_{k,j}$
$= p(x_{n+2}, \dots, x_N \mid \theta, z_{n+1} = s_j)\, A_{k,j}$
($x_{n+2}, \dots, x_N$ are conditionally independent of $z_n$ given $z_{n+1}$)

Backward Algorithm – Main Loop
In the previous slides, we showed that
$\beta[n, k] = p(x_{n+1}, \dots, x_N \mid \theta, z_n = s_k) = \sum_{j=1}^{K} \varphi_j(x_{n+1})\, p(x_{n+2}, \dots, x_N \mid \theta, z_{n+1} = s_j)\, A_{k,j}$
Note that $p(x_{n+2}, \dots, x_N \mid \theta, z_{n+1} = s_j)$ is $\beta[n+1, j]$.
Consequently, we obtain the recurrence:
$\beta[n, k] = \sum_{j=1}^{K} \varphi_j(x_{n+1})\, \beta[n+1, j]\, A_{k,j}$

Backward Algorithm – Main Loop
$\beta[n, k] = \sum_{j=1}^{K} \varphi_j(x_{n+1})\, \beta[n+1, j]\, A_{k,j}$
Thus, we can easily compute all values $\beta[n, k]$ using this recurrence.
The key thing is to proceed in decreasing order of $n$, since values $\beta[n, *]$ depend on values $\beta[n+1, *]$.
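A minimal backward-algorithm sketch under the same assumptions as the forward sketch (NumPy, discrete observations, A[k, j] = p(z_{n+1} = s_j | z_n = s_k)); the function name is illustrative.

```python
import numpy as np

def backward(A, phi, x):
    """Return beta (N x K), where beta[n, k] = p(x_{n+1}, ..., x_N | theta, z_n = s_k)."""
    N, K = len(x), A.shape[0]
    beta = np.zeros((N, K))
    beta[-1] = 1.0                                    # beta[N, k] = 1 (empty suffix)
    for n in range(N - 2, -1, -1):                    # decreasing order of n
        # beta[n, k] = sum_j phi_j(x_{n+1}) * beta[n+1, j] * A[k, j]
        beta[n] = A @ (phi[:, x[n + 1]] * beta[n + 1])
    return beta
```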

The Forward-Backward Algorithm
Inputs:
A trained HMM $\theta$.
An observation sequence $X = (x_1, \dots, x_N)$.
Output: an array $\gamma$, of size $N \times K$, where $\gamma[n, k] = p(z_n = s_k \mid X, \theta)$.
Algorithm:
Use the forward algorithm to compute $\alpha[n, k]$.
Use the backward algorithm to compute $\beta[n, k]$.
For $n = 1$ to $N$, for $k = 1$ to $K$:
$\gamma[n, k] = p(z_n = s_k \mid X, \theta) = \frac{\alpha[n, k] \cdot \beta[n, k]}{p(X \mid \theta)}$
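Combining the two passes, here is a sketch of the forward-backward computation of $\gamma$, reusing the forward() and backward() sketches above (so it is self-contained only together with them); as with the forward pass, a scaled or log-space version would be preferable for long sequences.

```python
def state_posteriors(pi, A, phi, x):
    """Return gamma (N x K), where gamma[n, k] = p(z_n = s_k | X, theta)."""
    alpha, p_X = forward(pi, A, phi, x)     # forward pass: alpha and p(X | theta)
    beta = backward(A, phi, x)              # backward pass: beta
    return alpha * beta / p_X               # gamma[n, k] = alpha[n, k] * beta[n, k] / p(X | theta)
```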

Problem 1: Training an HMM
Goal: estimate parameters $\theta = (\pi_k, A_{k,j}, \varphi_k)$.
The training data can consist of multiple observation sequences $X_1, X_2, X_3, \dots, X_M$.
We denote the length of training sequence $X_m$ as $N_m$.
We denote the elements of each observation sequence $X_m$ as $X_m = (x_{m,1}, x_{m,2}, \dots, x_{m,N_m})$.
Before we start training, we need to decide on:
The number of states.
The transitions that we will allow.

Problem 1: Training an HMM
While we are given $X_1, X_2, X_3, \dots, X_M$, we are not given the corresponding hidden state sequences $Z_1, Z_2, Z_3, \dots, Z_M$.
We denote the elements of each hidden state sequence $Z_m$ as $Z_m = (z_{m,1}, z_{m,2}, \dots, z_{m,N_m})$.
The training algorithm is called the Baum-Welch algorithm.
It is an Expectation-Maximization algorithm, similar to the algorithm we saw for learning Gaussian mixtures.

Expectation-Maximization
When we wanted to learn a mixture of Gaussians, we had the following problem:
If we knew the probability of each object belonging to each Gaussian, we could estimate the parameters of each Gaussian.
If we knew the parameters of each Gaussian, we could estimate the probability of each object belonging to each Gaussian.
However, we know neither of these pieces of information.
The EM algorithm resolved this problem using:
An initialization of the Gaussian parameters to some random or non-random values.
A main loop where:
The current values of the Gaussian parameters are used to estimate new weights of membership of every training object to every Gaussian.
The current estimated membership weights are used to estimate new parameters (mean and covariance matrix) for each Gaussian.

Expectation-Maximization
When we want to learn an HMM model using observation sequences as training data, we have the following problem:
If we knew, for each observation sequence, the probabilities of the hidden state values, we could estimate the parameters $\theta$.
If we knew the parameters $\theta$, we could estimate the probabilities of the hidden state values.
However, we know neither of these pieces of information.
The Baum-Welch algorithm resolves this problem using EM:
At initialization, parameters $\theta$ are given (mostly) random values.
In the main loop, these two steps are performed repeatedly:
The current values of parameters $\theta$ are used to estimate new probabilities for the hidden state values.
The current probabilities for the hidden state values are used to estimate new values for parameters $\theta$.

Baum-Welch: Initialization
As we said before, before we start training, we need to decide on:
The number of states.
The transitions that we will allow.
When we start training, we initialize $\theta = (\pi_k, A_{k,j}, \varphi_k)$ to random values, with these exceptions:
For any $s_k$ that we do not want to allow to ever be an initial state, we set the corresponding $\pi_k$ to 0. The rest of the training will keep these $\pi_k$ values always equal to 0.
For any $s_k, s_j$ such that we do not want to allow transitions from $s_k$ to $s_j$, we set the corresponding $A_{k,j}$ to 0. The rest of the training will keep these $A_{k,j}$ values always equal to 0.

Baum-Welch: Initialization
When we start training, we initialize $\theta = (\pi_k, A_{k,j}, \varphi_k)$ to random values, with these exceptions:
For any $s_k$ that we do not want to allow to ever be an initial state, we set the corresponding $\pi_k$ to 0. The rest of the training will keep these $\pi_k$ values always equal to 0.
For any $s_k, s_j$ such that we do not want to allow transitions from $s_k$ to $s_j$, we set the corresponding $A_{k,j}$ to 0. The rest of the training will keep these $A_{k,j}$ values always equal to 0.
These initial choices constrain the topology of the resulting HMM model.
For example, they can force the model to be fully connected, to be a forward model, or to be some other variation.

Baum-Welch: Expectation Step
Before we start the expectation step, we have some current values for the parameters $\theta = (\pi_k, A_{k,j}, \varphi_k)$.
The goal of the expectation step is to use those values to estimate two arrays:
$\gamma[m, n, k] = p(z_{m,n} = s_k \mid \theta, X_m)$
This is the same $\gamma[n, k]$ that we computed earlier, using the forward-backward algorithm. Here we just need to make the array three-dimensional, because we have multiple observation sequences $X_m$. We can compute $\gamma[m, *, *]$ by running the forward-backward algorithm with $\theta$ and $X_m$ as inputs.
$\xi[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta, X_m)$

Baum-Welch: Expectation Step
To facilitate our computations, we will also extend arrays $\alpha$ and $\beta$ to three dimensions, in the same way that we extended array $\gamma$:
$\alpha[m, n, k] = p(x_{m,1}, \dots, x_{m,n}, z_{m,n} = s_k \mid \theta)$
$\beta[m, n, k] = p(x_{m,n+1}, \dots, x_{m,N_m} \mid \theta, z_{m,n} = s_k)$
Where needed, values $\alpha[m, *, *]$, $\beta[m, *, *]$, $\gamma[m, *, *]$ can be computed by running the forward-backward algorithm with inputs $\theta$ and $X_m$.

Computing the ξ Values
Using Bayes rule:
$\xi[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta, X_m) = \frac{p(X_m \mid \theta, z_{m,n-1} = s_j, z_{m,n} = s_k)\, p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta)}{p(X_m \mid \theta)}$
$p(X_m \mid \theta)$ can be computed using the forward algorithm.
We will see how to simplify and compute:
$p(X_m \mid \theta, z_{m,n-1} = s_j, z_{m,n} = s_k)\, p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta)$

Computing the ξ Values
$\xi[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta, X_m) = \frac{p(X_m \mid \theta, z_{m,n-1} = s_j, z_{m,n} = s_k)\, p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta)}{p(X_m \mid \theta)}$
$p(X_m \mid \theta, z_{m,n-1} = s_j, z_{m,n} = s_k) = p(x_{m,1}, \dots, x_{m,n-1} \mid \theta, z_{m,n-1} = s_j)\, p(x_{m,n}, \dots, x_{m,N_m} \mid \theta, z_{m,n} = s_k)$
Why?
Earlier observations $x_{m,1}, \dots, x_{m,n-1}$ are conditionally independent of state $z_{m,n}$ given state $z_{m,n-1}$.
Later observations $x_{m,n}, \dots, x_{m,N_m}$ are conditionally independent of state $z_{m,n-1}$ given state $z_{m,n}$.

Computing the ξ Values
$\xi[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta, X_m) = \frac{p(X_m \mid \theta, z_{m,n-1} = s_j, z_{m,n} = s_k)\, p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta)}{p(X_m \mid \theta)}$
$p(X_m \mid \theta, z_{m,n-1} = s_j, z_{m,n} = s_k) = p(x_{m,1}, \dots, x_{m,n-1} \mid \theta, z_{m,n-1} = s_j)\, p(x_{m,n}, \dots, x_{m,N_m} \mid \theta, z_{m,n} = s_k)$
$p(x_{m,n}, \dots, x_{m,N_m} \mid \theta, z_{m,n} = s_k) = p(x_{m,n} \mid \theta, z_{m,n} = s_k)\, p(x_{m,n+1}, \dots, x_{m,N_m} \mid \theta, z_{m,n} = s_k) = \varphi_k(x_{m,n})\, \beta[m, n, k]$

Computing the ξ Values
$\xi[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta, X_m) = \frac{p(X_m \mid \theta, z_{m,n-1} = s_j, z_{m,n} = s_k)\, p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta)}{p(X_m \mid \theta)}$
$p(X_m \mid \theta, z_{m,n-1} = s_j, z_{m,n} = s_k) = p(x_{m,1}, \dots, x_{m,n-1} \mid \theta, z_{m,n-1} = s_j)\, \varphi_k(x_{m,n})\, \beta[m, n, k]$
$= \frac{p(x_{m,1}, \dots, x_{m,n-1}, z_{m,n-1} = s_j \mid \theta)}{p(z_{m,n-1} = s_j \mid \theta)}\, \varphi_k(x_{m,n})\, \beta[m, n, k] = \frac{\alpha[m, n-1, j]}{\gamma[m, n-1, j]}\, \varphi_k(x_{m,n})\, \beta[m, n, k]$

Computing the ξ Values
$\xi[m, n, j, k] = \frac{\alpha[m, n-1, j]}{\gamma[m, n-1, j]}\, \varphi_k(x_{m,n})\, \beta[m, n, k]\, \frac{p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta)}{p(X_m \mid \theta)}$
We can further process $p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta)$:
$p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta) = p(z_{m,n} = s_k \mid \theta, z_{m,n-1} = s_j)\, p(z_{m,n-1} = s_j \mid \theta) = A_{j,k}\, \gamma[m, n-1, j]$

Computing the ξ Values
So, we get:
$\xi[m, n, j, k] = \frac{\alpha[m, n-1, j]}{\gamma[m, n-1, j]}\, \varphi_k(x_{m,n})\, \beta[m, n, k]\, \frac{A_{j,k}\, \gamma[m, n-1, j]}{p(X_m \mid \theta)} = \frac{\alpha[m, n-1, j]\, \varphi_k(x_{m,n})\, \beta[m, n, k]\, A_{j,k}}{p(X_m \mid \theta)}$

Computing the ξ Values
Based on the previous slides, the final formula for ξ is:
$\xi[m, n, j, k] = \frac{\alpha[m, n-1, j]\, \varphi_k(x_{m,n})\, \beta[m, n, k]\, A_{j,k}}{p(X_m \mid \theta)}$
This formula can be computed using the parameters $\theta$ and the algorithms we have covered:
$\alpha[m, n-1, j]$ is computed with the forward algorithm.
$\beta[m, n, k]$ is computed by the backward algorithm.
$p(X_m \mid \theta)$ is computed with the forward algorithm.
$\varphi_k$ and $A_{j,k}$ are specified by $\theta$.

Baum-Welch: Summary of E Step
Inputs:
Training observation sequences $X_1, X_2, X_3, \dots, X_M$.
Current values of parameters $\theta = (\pi_k, A_{k,j}, \varphi_k)$.
Algorithm:
For $m = 1$ to $M$:
Run the forward-backward algorithm to compute:
$\alpha[m, n, k]$ for $n$ and $k$ such that $1 \le n \le N_m$, $1 \le k \le K$.
$\beta[m, n, k]$ for $n$ and $k$ such that $1 \le n \le N_m$, $1 \le k \le K$.
$\gamma[m, n, k]$ for $n$ and $k$ such that $1 \le n \le N_m$, $1 \le k \le K$.
$p(X_m \mid \theta)$.
For $n, j, k$ such that $2 \le n \le N_m$, $1 \le j, k \le K$:
$\xi[m, n, j, k] = \frac{\alpha[m, n-1, j]\, \varphi_k(x_{m,n})\, \beta[m, n, k]\, A_{j,k}}{p(X_m \mid \theta)}$
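Here is a sketch of the ξ computation for a single training sequence, under the same assumptions as the earlier sketches; alpha, beta and p_X are the outputs of the forward and backward passes for that sequence, and the function name is illustrative.

```python
import numpy as np

def compute_xi(A, phi, x, alpha, beta, p_X):
    """Return xi (N x K x K); xi[n, j, k] is filled for n >= 1 (i.e. the slides' n >= 2)."""
    N, K = len(x), A.shape[0]
    xi = np.zeros((N, K, K))
    for n in range(1, N):
        # xi[n, j, k] = alpha[n-1, j] * A[j, k] * phi_k(x_n) * beta[n, k] / p(X | theta)
        xi[n] = alpha[n - 1][:, None] * A * (phi[:, x[n]] * beta[n])[None, :] / p_X
    return xi
```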

Baum-Welch: Maximization Step
Before we start the maximization step, we have some current values for:
$\gamma[m, n, k] = p(z_{m,n} = s_k \mid \theta, X_m)$
$\xi[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k \mid \theta, X_m)$
The goal of the maximization step is to use those values to estimate new values for $\theta = (\pi_k, A_{k,j}, \varphi_k)$. In other words, we want to estimate new values for:
Initial state probabilities $\pi_k$.
State transition probabilities $A_{k,j}$.
Observation probabilities $\varphi_k$.

Baum-Welch: Maximization Step
Initial state probabilities $\pi_k$ are computed as:
$\pi_k = \frac{\sum_{m=1}^{M} \gamma[m, 1, k]}{\sum_{m=1}^{M} \sum_{j=1}^{K} \gamma[m, 1, j]}$
In words, we compute for each $k$ the ratio of these two quantities:
The sum of probabilities, over all observation sequences $X_m$, that state $s_k$ was the initial state for $X_m$.
The sum of probabilities, over all observation sequences $X_m$ and all states $s_j$, that state $s_j$ was the initial state for $X_m$.

Baum-Welch: Maximization Step
State transition probabilities $A_{k,j}$ are computed as:
$A_{k,j} = \frac{\sum_{m=1}^{M} \sum_{n=2}^{N_m} \xi[m, n, k, j]}{\sum_{m=1}^{M} \sum_{n=2}^{N_m} \sum_{i=1}^{K} \xi[m, n, k, i]}$
In words, we compute, for each $k, j$, the ratio of these two quantities:
The sum of probabilities, over all state transitions of all observation sequences $X_m$, that state $s_k$ was followed by state $s_j$.
The sum of probabilities, over all state transitions of all observation sequences $X_m$ and all states $s_i$, that state $s_k$ was followed by state $s_i$.

Baum-Welch: Maximization Step
Computing the observation probabilities $\varphi_k$ depends on how we model those probabilities. For example, we could choose:
Discrete probabilities.
Histograms.
Gaussians.
Mixtures of Gaussians.
...
We will show how to compute distributions $\varphi_k$ for the cases of discrete probabilities and Gaussians.

Baum-Welch: Maximization Step
Computing the observation probabilities $\varphi_k$ depends on how we model those probabilities.
In all cases, the key idea is that we treat observation $x_{m,n}$ as partially assigned to distribution $\varphi_k$:
$\gamma[m, n, k] = p(z_{m,n} = s_k \mid \theta, X_m)$ is the weight of the assignment of $x_{m,n}$ to $\varphi_k$.
In other words: we don't know what hidden state $x_{m,n}$ corresponds to, but we have computed, for each state $s_k$, the probability $\gamma[m, n, k]$ that $x_{m,n}$ corresponds to $s_k$.
Thus, $x_{m,n}$ influences our estimate of distribution $\varphi_k$ with weight proportional to $\gamma[m, n, k]$.

Baum-Welch: Maximization Step
Suppose that the observations are discrete, and come from a finite set $\boldsymbol{Y} = \{y_1, \dots, y_R\}$. In that case, each $x_{m,n}$ is an element of $\boldsymbol{Y}$.
Define an auxiliary function $\operatorname{Eq}(x, y)$ as:
$\operatorname{Eq}(x, y) = 1$ if $x = y$, and $0$ if $x \ne y$.
Then, $\varphi_k(y_r) = p(x_{m,n} = y_r \mid z_{m,n} = s_k)$, and it can be computed as:
$\varphi_k(y_r) = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]\, \operatorname{Eq}(x_{m,n}, y_r)}{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]}$

Baum-Welch: Maximization Step
If observations come from a finite set $\boldsymbol{Y} = \{y_1, \dots, y_R\}$:
$\varphi_k(y_r) = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]\, \operatorname{Eq}(x_{m,n}, y_r)}{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]}$
The above formula can be seen as a weighted average of the times $x_{m,n}$ is equal to $y_r$, where the weight of each $x_{m,n}$ is the probability that $x_{m,n}$ corresponds to hidden state $s_k$.

Baum-Welch: Maximization Step
Suppose that observations are vectors in $\mathbb{R}^D$, and that we model $\varphi_k$ as a Gaussian distribution. In that case, for each $\varphi_k$ we need to estimate a mean $\mu_k$ and a covariance matrix $S_k$.
$\mu_k = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]\, x_{m,n}}{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]}$
Thus, $\mu_k$ is a weighted average of the values $x_{m,n}$, where the weight of each $x_{m,n}$ is the probability that $x_{m,n}$ corresponds to hidden state $s_k$.

Baum-Welch: Maximization Step
Suppose that observations are vectors in $\mathbb{R}^D$, and that we model $\varphi_k$ as a Gaussian distribution. In that case, for each $\varphi_k$ we need to estimate a mean $\mu_k$ and a covariance matrix $S_k$.
$\mu_k = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]\, x_{m,n}}{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]}$
$S_k = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]\, (x_{m,n} - \mu_k)(x_{m,n} - \mu_k)^T}{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]}$

Baum-Welch: Summary of M Step
$\pi_k = \frac{\sum_{m=1}^{M} \gamma[m, 1, k]}{\sum_{m=1}^{M} \sum_{j=1}^{K} \gamma[m, 1, j]}$
$A_{k,j} = \frac{\sum_{m=1}^{M} \sum_{n=2}^{N_m} \xi[m, n, k, j]}{\sum_{m=1}^{M} \sum_{n=2}^{N_m} \sum_{i=1}^{K} \xi[m, n, k, i]}$
For discrete observations:
$\varphi_k(y_r) = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]\, \operatorname{Eq}(x_{m,n}, y_r)}{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]}$
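A sketch of these M-step updates for the discrete-observation case, assuming per-sequence arrays gammas[m] (shape (N_m, K)) and xis[m] (shape (N_m, K, K)) from the E-step, and observation sequences xs[m] whose symbols are integers in {0, ..., V-1}; the names and shapes are illustrative.

```python
import numpy as np

def m_step(gammas, xis, xs, K, V):
    """Re-estimate (pi, A, phi) from the E-step quantities, for discrete observations."""
    pi = sum(g[0] for g in gammas)                 # numerator: sum_m gamma[m, 1, k]
    pi = pi / pi.sum()                             # denominator: sum over all states

    A = sum(xi[1:].sum(axis=0) for xi in xis)      # sum over m and over n >= 2 of xi[m, n, k, j]
    A = A / A.sum(axis=1, keepdims=True)           # normalize each row k over j

    phi = np.zeros((K, V))
    for g, x in zip(gammas, xs):
        for n, symbol in enumerate(x):
            phi[:, symbol] += g[n]                 # weighted count of this symbol under each state
    phi = phi / phi.sum(axis=1, keepdims=True)
    return pi, A, phi
```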

Baum-Welch: Summary of M Step
$\pi_k = \frac{\sum_{m=1}^{M} \gamma[m, 1, k]}{\sum_{m=1}^{M} \sum_{j=1}^{K} \gamma[m, 1, j]}$
$A_{k,j} = \frac{\sum_{m=1}^{M} \sum_{n=2}^{N_m} \xi[m, n, k, j]}{\sum_{m=1}^{M} \sum_{n=2}^{N_m} \sum_{i=1}^{K} \xi[m, n, k, i]}$
For Gaussian observation distributions: $\varphi_k$ is a Gaussian with mean $\mu_k$ and covariance matrix $S_k$, where:
$\mu_k = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]\, x_{m,n}}{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]}$
$S_k = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]\, (x_{m,n} - \mu_k)(x_{m,n} - \mu_k)^T}{\sum_{m=1}^{M} \sum_{n=1}^{N_m} \gamma[m, n, k]}$
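For the Gaussian case, a corresponding sketch of the mean and covariance updates, assuming real-valued observation sequences xs[m] of shape (N_m, D) and the same per-sequence gammas[m]; again the names are illustrative.

```python
import numpy as np

def m_step_gaussian(gammas, xs, K, D):
    """Re-estimate the Gaussian emission parameters (mu_k, S_k) for each state k."""
    weights = np.zeros(K)                          # sum_m sum_n gamma[m, n, k]
    mu = np.zeros((K, D))
    for g, x in zip(gammas, xs):
        weights += g.sum(axis=0)
        mu += g.T @ x                              # sum_n gamma[m, n, k] * x_{m,n}
    mu /= weights[:, None]

    S = np.zeros((K, D, D))
    for g, x in zip(gammas, xs):
        for k in range(K):
            d = x - mu[k]                          # deviations from mu_k, shape (N_m, D)
            S[k] += (g[:, k:k + 1] * d).T @ d      # sum_n gamma[m, n, k] * d_n d_n^T
    S /= weights[:, None, None]
    return mu, S
```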

Baum-Welch Summary
Initialization:
Initialize $\theta = (\pi_k, A_{k,j}, \varphi_k)$ with random values, but set to 0 the values $\pi_k$ and $A_{k,j}$ needed to specify the network topology.
Main loop:
E-step:
Compute all values $p(X_m \mid \theta)$, $\alpha[m, n, k]$, $\beta[m, n, k]$, $\gamma[m, n, k]$ using the forward-backward algorithm.
Compute all values $\xi[m, n, j, k] = \frac{\alpha[m, n-1, j]\, \varphi_k(x_{m,n})\, \beta[m, n, k]\, A_{j,k}}{p(X_m \mid \theta)}$.
M-step:
Update $\theta = (\pi_k, A_{k,j}, \varphi_k)$, using the values $\gamma[m, n, k]$ and $\xi[m, n, j, k]$ computed at the E-step.
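Tying the pieces together, here is a sketch of the whole Baum-Welch loop for discrete observations, reusing the forward(), backward(), compute_xi() and m_step() sketches above; the fixed iteration count and the fully random initialization (with no topology constraints) are illustrative simplifications, not part of the slides.

```python
import numpy as np

def baum_welch(xs, K, V, n_iter=50, seed=0):
    """Train a discrete-observation HMM on sequences xs with symbols in {0, ..., V-1}."""
    rng = np.random.default_rng(seed)
    pi = rng.random(K); pi /= pi.sum()             # random initialization of theta
    A = rng.random((K, K)); A /= A.sum(axis=1, keepdims=True)
    phi = rng.random((K, V)); phi /= phi.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        gammas, xis = [], []
        for x in xs:                               # E-step, one sequence at a time
            alpha, p_X = forward(pi, A, phi, x)
            beta = backward(A, phi, x)
            gammas.append(alpha * beta / p_X)      # gamma[m, n, k]
            xis.append(compute_xi(A, phi, x, alpha, beta, p_X))
        pi, A, phi = m_step(gammas, xis, xs, K, V)  # M-step
    return pi, A, phi
```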

Hidden Markov Models - Recap
HMMs are widely used to model temporal sequences.
In HMMs, an observation sequence corresponds to a sequence of hidden states.
An HMM is defined by parameters $\theta = (\pi_k, A_{k,j}, \varphi_k)$, and defines a joint probability $p(X, Z \mid \theta)$.
An HMM can be used as a generative model, to produce synthetic samples from distribution $p(X, Z \mid \theta)$.
HMMs can be used for Bayesian classification:
One HMM $\theta_c$ per class.
Find the class $c$ that maximizes $p(\theta_c \mid X)$.

Hidden Markov Models - Recap
HMMs can be used for Bayesian classification:
One HMM $\theta_c$ per class.
Find the class $c$ that maximizes $p(\theta_c \mid X)$, using probabilities $p(X \mid \theta_c)$ calculated by the forward algorithm.
HMMs can be used to find the most likely hidden state sequence $Z$ for an observation sequence $X$. This is done using the Viterbi algorithm.
HMMs can be used to find the most likely hidden state $z_n$ for a specific value $x_n$ of an observation sequence $X$. This is done using the forward-backward algorithm.
HMMs are trained using the Baum-Welch algorithm, which is an Expectation-Maximization algorithm.