Hidden Markov Models. Achim Tresch, MPI for Plant Breeding Research & University of Cologne.

Recap: Probabilistic Clustering. [Figure: hidden variables ("states") H_1, ..., H_N indicating the cluster which generated the corresponding observations X_1, ..., X_N.] Parameters: cluster frequencies, emission distributions.

The Hidden Markov Model. [Figure: as before, hidden variables ("states") H_1, ..., H_N with observations X_1, ..., X_N.] The hidden states become dependent; they form a Markov chain. Parameters: cluster frequencies (to be replaced by the Markov chain parameters), emission distributions.

Markov Chain. [Figure: chain of hidden variables H_1, H_2, ..., H_N.] Goal: factorize the joint probability into products of "smaller" terms (that depend merely on a few variables). The chain-rule factorization P(H_1, ..., H_N) = P(H_1) Π_j P(H_j | H_1, ..., H_{j-1}) always holds, but it is not useful, since the last term (j = N) still contains all variables. Markov assumption ("memoryless process"): P(H_j | H_1, ..., H_{j-1}) = P(H_j | H_{j-1}).

Markov Chain. Under the Markov assumption, the joint distribution factorizes into a product of an initial term and pairwise transition terms. Homogeneity assumption: the transition probabilities do not depend on the position j. Parameters of a homogeneous Markov chain: initial state distribution, transition probabilities.
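Written out, with π denoting the initial state distribution (a symbol introduced here for convenience) and A = (a_rs) the transition probabilities as used later in these slides:

```latex
% Chain rule (always holds):
\[ P(H_1,\dots,H_N) = P(H_1)\prod_{j=2}^{N} P(H_j \mid H_1,\dots,H_{j-1}) \]
% Markov + homogeneity assumptions:
\[ P(H_1,\dots,H_N) = \pi_{H_1}\prod_{j=2}^{N} a_{H_{j-1}H_j},
   \qquad \pi_r = P(H_1 = r), \quad a_{rs} = P(H_j = s \mid H_{j-1} = r) \]
```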

The Hidden Markov Model. [Figure: Markov chain of hidden variables ("states") H_1, ..., H_N, each indicating the cluster which generated the corresponding observation X_1, ..., X_N.] Parameters of the Hidden Markov Model: initial state distribution, transition probabilities, emission distributions.
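In the same notation, writing the emission distributions as P(X_j | H_j), the joint distribution of the HMM takes the standard form:

```latex
\[ P(H_1,\dots,H_N,\,X_1,\dots,X_N)
   = \underbrace{\pi_{H_1}\prod_{j=2}^{N} a_{H_{j-1}H_j}}_{\text{Markov chain on } H}
     \cdot \underbrace{\prod_{j=1}^{N} P(X_j \mid H_j)}_{\text{emissions}} \]
```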

HMM Parameter Estimation. We will introduce the Baum-Welch algorithm, an Expectation-Maximization (EM) algorithm for the HMM, i.e., we iteratively maximize the lower-bound function Q(Θ; Θ_old), where Θ denotes the parameters, H the hidden state variables, and X the observations. We will focus on the learning of the transition probabilities A = (a_rs). The learning of the initial distribution is easier, and the learning of the emission distributions leads to exactly the same formulas as for clustering.
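In standard EM form, Q is the expected complete-data log-likelihood under the posterior of the hidden states given the old parameters:

```latex
\[ Q(\Theta;\Theta^{\mathrm{old}})
   = \mathbb{E}_{H \mid X,\,\Theta^{\mathrm{old}}}\!\left[\log P(X,H \mid \Theta)\right]
   = \sum_{H} P(H \mid X,\Theta^{\mathrm{old}})\,\log P(X,H \mid \Theta) \]
```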

HMM Parameter Estimation. Omitting A-independent terms of Q leads to a weighted sum of the log transition probabilities, where the weights are posterior probabilities of pairs of consecutive hidden states (the summation over each H_j runs over all states 1, ..., K). These coefficients need to be calculated efficiently (see later).

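Assuming, consistently with the later slides, that ζ_j(r, s) denotes the pairwise posterior of consecutive hidden states under Θ_old, the A-dependent part of Q reads:

```latex
\[ \zeta_j(r,s) := P(H_j = r,\,H_{j+1} = s \mid X,\Theta^{\mathrm{old}}),
   \qquad
   Q_A(\Theta;\Theta^{\mathrm{old}})
   = \sum_{j=1}^{N-1}\sum_{r=1}^{K}\sum_{s=1}^{K} \zeta_j(r,s)\,\log a_{rs} \]
```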

HMM Parameter Estimation. We have to maximize Q(Θ; Θ_old) with respect to A = (a_rs) under the constraints Σ_s a_rs = 1 for every state r. How do we maximize a function under additional constraints? Reformulate the side conditions as zeros of functions g_r: the task is now to maximize Q under the constraints g_r(A) = Σ_s a_rs - 1 = 0.

Side Note: Method of Lagrange Multipliers. Let x* be a maximum of f(x) under the constraint g(x) = 0. The gradient of g at x* must be perpendicular to the g(x) = 0 hypersurface; otherwise, it would have a component parallel to it, and moving along that component would not leave g(x) equal to 0. The gradient of f at x* must also be perpendicular to the hypersurface; otherwise, f(x) would be further maximized by moving along the gradient's component parallel to the hypersurface. Therefore, the two gradients must be parallel to each other, or, in other words, ∇f(x*) = λ ∇g(x*) for some λ. [Figure: the constraint curve g(x_1, x_2) = 0 in the (x_1, x_2)-plane.]
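A tiny worked example for illustration (not from the slides): maximize f(x, y) = xy under the constraint g(x, y) = x + y - 1 = 0.

```latex
\[ \nabla f = \lambda\,\nabla g
   \;\Rightarrow\; (y,\,x) = \lambda\,(1,\,1)
   \;\Rightarrow\; x = y = \lambda,
   \qquad
   x + y = 1 \;\Rightarrow\; \lambda = \tfrac12,
   \quad\text{hence } x = y = \tfrac12 . \]
```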

HMM Parameter Estimation. We have to maximize Q under the constraints Σ_s a_rs = 1. Introducing Lagrange multipliers λ_r and taking partial derivatives with respect to a_rs, then setting the derivatives to zero and solving for λ_r, leads to the update formula for the transition probabilities sketched below.

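A sketch of this derivation, starting from the A-dependent part of Q given above; it yields the familiar Baum-Welch update for the transition probabilities:

```latex
\[ \mathcal{L}(A,\lambda)
   = \sum_{j,r,s} \zeta_j(r,s)\,\log a_{rs}
     + \sum_{r} \lambda_r\Bigl(1-\sum_{s} a_{rs}\Bigr),
   \qquad
   \frac{\partial \mathcal{L}}{\partial a_{rs}}
   = \frac{\sum_j \zeta_j(r,s)}{a_{rs}} - \lambda_r = 0
   \;\Rightarrow\; a_{rs} = \frac{\sum_j \zeta_j(r,s)}{\lambda_r} . \]
% The constraint sum_s a_rs = 1 determines lambda_r:
\[ \lambda_r = \sum_{s'}\sum_{j} \zeta_j(r,s')
   \;\Rightarrow\;
   a_{rs} = \frac{\sum_{j}\zeta_j(r,s)}{\sum_{s'}\sum_{j}\zeta_j(r,s')} . \]
```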

Forward-Backward Algorithm. Still to do: calculate the marginal posterior probabilities. We first calculate the univariate marginal posterior of H_j. It can be written as a (normalized) product of forward probabilities and backward probabilities; a small proof is required (exercise). Thus, the marginal posteriors are available once the forward and backward probabilities are known.
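In the standard notation, the forward, backward and posterior quantities are:

```latex
\[ \alpha_j(r) := P(X_1,\dots,X_j,\;H_j = r),
   \qquad
   \beta_j(r) := P(X_{j+1},\dots,X_N \mid H_j = r), \]
\[ \gamma_j(r) := P(H_j = r \mid X)
   = \frac{\alpha_j(r)\,\beta_j(r)}{P(X)},
   \qquad
   P(X) = \sum_{r=1}^{K} \alpha_N(r) . \]
```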

Forward-Backward Algorithm. The forward and backward probabilities can be calculated recursively (forward-backward algorithm); a sketch of the recursions in code follows below.
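A minimal Python sketch of these recursions (not the lecture's own code): it assumes a discrete emission alphabet with emission matrix B[r, m] = P(X_j = m | H_j = r) and rescales at every position to avoid numerical underflow, so that the scaled α and β multiply directly to the posteriors γ.

```python
import numpy as np

def forward_backward(pi, A, B, x):
    """Scaled forward-backward recursions for a discrete-emission HMM.

    pi : (K,)   initial state distribution
    A  : (K, K) transition matrix, A[r, s] = a_rs
    B  : (K, M) emission matrix, B[r, m] = P(X_j = m | H_j = r)
    x  : (N,)   observed symbol indices
    Returns scaled alpha, beta, the per-position scaling factors,
    and the log-likelihood log P(x).
    """
    N, K = len(x), len(pi)
    alpha, beta, scale = np.zeros((N, K)), np.zeros((N, K)), np.zeros(N)

    # Forward pass: alpha_j(r) = P(x_1..x_j, H_j = r), rescaled at each position.
    alpha[0] = pi * B[:, x[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for j in range(1, N):
        alpha[j] = (alpha[j - 1] @ A) * B[:, x[j]]
        scale[j] = alpha[j].sum()
        alpha[j] /= scale[j]

    # Backward pass: beta_j(r) = P(x_{j+1}..x_N | H_j = r), same scaling.
    beta[N - 1] = 1.0
    for j in range(N - 2, -1, -1):
        beta[j] = A @ (B[:, x[j + 1]] * beta[j + 1]) / scale[j + 1]

    # With this scaling, gamma = alpha * beta (rows sum to 1),
    # and log P(x) is the sum of the log scaling factors.
    return alpha, beta, scale, np.log(scale).sum()
```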

Forward-Backward Algorithm. From α and β, we derive ζ:

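In the standard form, using the forward and backward probabilities defined above:

```latex
\[ \zeta_j(r,s) = P(H_j = r,\,H_{j+1} = s \mid X)
   = \frac{\alpha_j(r)\;a_{rs}\;P(X_{j+1}\mid H_{j+1}=s)\;\beta_{j+1}(s)}{P(X)} . \]
```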

Baum-Welch Algorithm. 1. Start with some initial parameter guess Θ_old. 2. Calculate α, β and γ, ζ (forward-backward algorithm). 3. Update the parameters Θ, in particular the transition probabilities A; the emission updates depend on the chosen emission distribution (see mixture clustering). 4. Set Θ_old = Θ and iterate steps 2 and 3 until convergence.
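A minimal Python sketch of one such iteration (a hypothetical helper built on the forward_backward() sketch above; the emission update mirrors mixture clustering and is omitted here):

```python
import numpy as np

def baum_welch_step(pi, A, B, x):
    """One Baum-Welch (EM) update of pi and A for a discrete-emission HMM."""
    N, K = len(x), len(pi)
    alpha, beta, scale, loglik = forward_backward(pi, A, B, x)

    # gamma_j(r) = P(H_j = r | X): with the scaling used above this is
    # simply the elementwise product of the scaled alpha and beta.
    gamma = alpha * beta

    # zeta_j(r, s) = P(H_j = r, H_{j+1} = s | X).
    zeta = np.zeros((N - 1, K, K))
    for j in range(N - 1):
        zeta[j] = (alpha[j][:, None] * A
                   * (B[:, x[j + 1]] * beta[j + 1])[None, :]) / scale[j + 1]

    # M-step: pi_r = gamma_1(r), a_rs = sum_j zeta_j(r,s) / sum_{s',j} zeta_j(r,s').
    pi_new = gamma[0]
    A_new = zeta.sum(axis=0)
    A_new /= A_new.sum(axis=1, keepdims=True)
    return pi_new, A_new, loglik
```

Iterating this step (together with an emission update) until the log-likelihood stops increasing implements the Baum-Welch procedure described on the slide.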

HMM Example. [Figure: observation tracks, e.g. genome-wide ChIP-chip occupancy, plotted along the genomic position.]

HMM Example. [Figure: five hidden states (state 1 to state 5) with their typical occupancy vector(s), the transition matrix, and the Viterbi path along the genome.]

Viterbi Decoding. State annotation = Viterbi path (maximum likelihood path), i.e. the hidden state path that maximizes the likelihood function Pr(X, H; Θ), where Θ denotes the HMM parameters.

Viterbi Decoding Viterbi decoding searches for the most probable hidden state path. It maximizes the joint posterior of H.

Viterbi Decoding. Let v_j(r) be the probability of the most probable path ending in state H_j = r with observations X_1, ..., X_j. [Figure: trellis with positions 1, ..., j-1, j, ..., N as columns and states 1, ..., K as rows.]

Viterbi Decoding. We find v_j(r) iteratively by dynamic programming, starting with j = 1 and ending with j = N. Suppose we have found v_{j-1}(s) for all states s. Then v_j(r) = max_s v_{j-1}(s) · a_sr · P(X_j | H_j = r). [Figure: trellis as before.]

Viterbi Decoding. If we keep track of the previous state of the maximum-probability path, we can reconstruct the maximum likelihood path. Along with v_j(r), define the backtrack variable B_j(r) as the state s that attains the maximum in the recursion for v_j(r). Then we find the maximum likelihood path by backtracking from the best final state; a sketch in code follows below. [Figure: trellis as before.]
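A minimal Python sketch of the recursion and the backtracking (same pi, A, B conventions as the forward-backward sketch above; working in log space avoids numerical underflow on long sequences):

```python
import numpy as np

def viterbi(pi, A, B, x):
    """Most probable hidden state path of a discrete-emission HMM."""
    N, K = len(x), len(pi)
    logA, logB = np.log(A), np.log(B)
    v = np.zeros((N, K))                 # v[j, r]: log prob. of best path ending in r
    back = np.zeros((N, K), dtype=int)   # back[j, r]: previous state on that path

    v[0] = np.log(pi) + logB[:, x[0]]
    for j in range(1, N):
        cand = v[j - 1][:, None] + logA  # cand[s, r] = v_{j-1}(s) + log a_sr
        back[j] = cand.argmax(axis=0)
        v[j] = cand.max(axis=0) + logB[:, x[j]]

    # Backtracking from the best final state.
    path = np.zeros(N, dtype=int)
    path[-1] = v[-1].argmax()
    for j in range(N - 1, 0, -1):
        path[j - 1] = back[j, path[j]]
    return path
```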

Posterior Decoding. Posterior decoding searches for the most probable hidden state at each individual position, i.e. it maximizes the marginal posterior for each H_j separately. The marginal posteriors γ_j(r) were already calculated in the forward-backward algorithm (applied with the known parameter set Θ). Hence the posterior decoding at position j is the state r maximizing γ_j(r).

Efficient Marginalization: Factor Graphs. A factor graph consists of: 1. an undirected, bipartite graph, with variable nodes representing random variables and factor nodes representing local functions; 2. a set of local functions, one for each factor node, where each local function is a function of its neighbouring variables. The neighbours ne(X) of a node X are the nodes that are directly adjacent to X. By definition, a factor graph encodes the product of its local functions, where X_1, ..., X_n is the set of all variable nodes. [Figure: variable nodes A, B, C, D connected by factor nodes f_AB, f_BC, f_BD.]
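In formulas, and spelled out for the example graph above:

```latex
\[ g(X_1,\dots,X_n) = \prod_{k} f_k\bigl(\mathrm{ne}(f_k)\bigr),
   \qquad
   g(A,B,C,D) = f_{AB}(A,B)\,f_{BC}(B,C)\,f_{BD}(B,D) . \]
```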

Factor Graphs. Factor graphs are very general tools, e.g. Bayesian networks can be written as factor graphs. [Figure: a Bayesian network over the variables A, B, C, together with possible representations as a factor graph, e.g. with a single factor node f_ABC or with separate factor nodes f_A, f_AB, f_ABC.]

The Sum-Product Algorithm. "Marginalisation" in factor graphs that are trees. Let X = {X_1, ..., X_n}; we want to calculate the marginal of a variable X_j of the encoded function. [Figure: a tree factor graph around X_j with variable nodes A, C, D and factor nodes f_AC, f_CX_j, f_X_jD, illustrating the neighbourhoods ne(X_j) and ne(f_CX_j) and the message from a factor f_k to X_j.]
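The standard sum-product message updates (cf. Kschischang, Frey, Loeliger 2001, cited later in these slides) are:

```latex
% Factor-to-variable and variable-to-factor messages:
\[ \mu_{f \to X}(x)
   = \sum_{\mathrm{ne}(f)\setminus\{X\}} f\bigl(\mathrm{ne}(f)\bigr)
     \prod_{Y \in \mathrm{ne}(f)\setminus\{X\}} \mu_{Y \to f}(y),
   \qquad
   \mu_{X \to f}(x)
   = \prod_{f' \in \mathrm{ne}(X)\setminus\{f\}} \mu_{f' \to X}(x) . \]
% The marginal of X_j is proportional to the product of its incoming messages:
\[ P(X_j = x) \;\propto\; \prod_{f \in \mathrm{ne}(X_j)} \mu_{f \to X_j}(x) . \]
```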

The Sum-Product algorithm. Objective: calculate the "marginals" (in our case, the marginal distributions of the individual variables X_j).

The Sum-Product algorithm

Example. [Figure: message passing on a small tree factor graph, Step 1 (initialization) and Step 2.]

The Sum-Product algorithm

The Sum-Product algorithm. Theorem: if the factor graph is a tree, the order in which messages are passed is irrelevant for the result. For graphs containing loops, the result depends on the message passing scheme; messages are usually passed until convergence, but convergence is not guaranteed. Therefore it is desirable to construct factor graphs that are trees.

HMM Generalizations

Example: Phenotyping in Time-Lapse Imaging. Mitocheck database (Neumann, Ellenberg et al., Nature): >20,000 time-lapse movies of RNAi knock-downs (histone-GFP tagged HeLa cells). Pipeline: generation of time-lapse movies, cell identification and tracking, feature extraction (CellProfiler; Carpenter et al., Genome Biol 2006).

Example: Phenotyping in Time-Lapse Imaging. [Figure: raw data (image frames over time) and the corresponding phenotype classes.]

The Tree Hidden Factor Graph Model. [Figure: a cell lineage tree in which each cell carries an observed feature vector (f_1, f_2, ..., f_n), with a question mark indicating the hidden structure to be modelled.]

The Tree Hidden Factor Graph Model. [Figure: observed feature vector (f_1, f_2, ..., f_n) per hidden node.] The models differ in the structure of their hidden variables and in their model parameters (besides the emission distributions): (mixture) clustering uses an empty graph; the Hidden Markov Model uses a line graph with transition probabilities; the treeHFM uses a tree / forest with higher-order transitions. (CellCognition; Gerlich et al., Nat. Meth.)

HMMs in Factor Graph Notation. X: hidden (cell) states X_1, X_2, X_3; D: observations (phenotypes) D_1, D_2, D_3. Transition factors Γ_{X_1X_2}, Γ_{X_2X_3} connect consecutive hidden states, and emission factors Ψ_{X_1}(D_1), Ψ_{X_2}(D_2), Ψ_{X_3}(D_3) connect each hidden state to its observation. Marginal likelihood: see the sketch below.
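Assuming the factors Γ and Ψ collect the transition and emission terms of the HMM and P(X_1) is the initial state distribution, the marginal likelihood reads:

```latex
\[ P(D_1,D_2,D_3)
   = \sum_{X_1,X_2,X_3} P(X_1)\,\Psi_{X_1}(D_1)\,
     \Gamma_{X_1X_2}\,\Psi_{X_2}(D_2)\,
     \Gamma_{X_2X_3}\,\Psi_{X_3}(D_3) . \]
```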

HMMs in Factor Graph Notation. [Figure: factor graph representation of the HMM with variable nodes X_1, X_2, X_3 and D_1, D_2, D_3, factor nodes G_1, G_2, G_3 and F_1, F_2, and an initial factor P(X_1).] (Kschischang, Frey, Loeliger, IEEE Trans. Information Theory 2001)

Genealogies in Factor Graph Notation. [Figure: a cell genealogy with hidden states X_1, X_2, X_3, X_4 and observations D_1, D_2, D_3, D_4.]

[Figure: factor graph of the genealogy with variable nodes X_1, ..., X_4 and D_1, ..., D_4 and factor nodes G_1, ..., G_4, F_1, F_2.] Factor nodes F_j model transition probabilities. We need an extended transition parameter set.

Expectation-Maximization in the HFM. The EM algorithm iteratively maximizes the lower-bound function Q. For HMMs, Q can be expressed in terms of forward and backward probabilities; for HFMs, it can be expressed in terms of messages using factor graphs. In the spirit of HMMs, we are then able to derive an analytic (and fast) update formula.

HFM parameters: transition probabilities, comprising sequential transition probabilities and division probabilities.

Summary Statistics. [Figure, panels A-D: Viterbi trees; transition graph; cell cycle time distribution (frequency over relative cell cycle time); morphology (PCA plot of classes).]