Dynamic Programming & Hidden Markov Models. Alan Yuille, Dept. of Statistics, UCLA.


Goal of this Talk. This talk introduces one of the major algorithms: dynamic programming (DP). It then describes how DP can be used in conjunction with EM for learning.

Dynamic Programming. Dynamic programming exploits the graphical structure of the probability distribution and can be applied to any structure without closed loops. Consider the two-headed coin example given in Tom Griffiths' talk (Monday).

Probabilistic Grammars. By the Markov condition the joint distribution factorizes, so we can exploit the graphical structure to compute marginals efficiently. The structure means that the sum over x2 drops out: we need only sum over x1 and x3, which takes four operations instead of eight.
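The slide's equations are not preserved in this transcript. As a hedged reconstruction, assuming three binary variables with x2 and x3 conditionally independent given x1 (an assumption suggested by the wording, since the original figure is unavailable), the computation looks like:

```latex
% Assumed factorization (the slide's original figure is not available):
% P(x_1, x_2, x_3) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1).
\begin{align}
P(x_3) &= \sum_{x_1} \sum_{x_2} P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1) \\
       &= \sum_{x_1} P(x_1)\,P(x_3 \mid x_1)\,\underbrace{\sum_{x_2} P(x_2 \mid x_1)}_{=\,1} \\
       &= \sum_{x_1} P(x_1)\,P(x_3 \mid x_1).
\end{align}
```

The sum over x2 drops out because the conditional probabilities of x2 sum to one; evaluating the remaining sum for the two values of x3 and the two values of x1 costs four operations, versus eight terms in the naive sum over all joint configurations.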

Dynamic Programming Intuition. Suppose you wish to travel from Los Angeles to Boston by car. To determine the cost of going via Chicago, you only need to calculate the shortest cost from Los Angeles to Chicago and then, independently, the shortest cost from Chicago to Boston. Decomposing the route in this way gives an efficient algorithm that is polynomial in the number of nodes and feasible to compute.

Dynamic Programming Diamond. Compute the shortest cost from A to B.
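The diamond figure itself is not reproduced in the transcript. As a minimal sketch, assuming a hypothetical diamond A -> {C, D} -> B with made-up edge costs, dynamic programming computes the shortest cost from A to B by first finding the best cost to reach each intermediate node:

```python
# Minimal DP sketch for a diamond-shaped graph (edge costs are illustrative assumptions).
# Nodes: A (start), C and D (middle layer), B (end).
edge_cost = {
    ("A", "C"): 2, ("A", "D"): 5,
    ("C", "B"): 4, ("D", "B"): 1,
}

# Forward pass: best cost from A to each node, processed layer by layer.
best = {"A": 0}
for mid in ("C", "D"):
    best[mid] = best["A"] + edge_cost[("A", mid)]
best["B"] = min(best[mid] + edge_cost[(mid, "B")] for mid in ("C", "D"))

print(best["B"])  # shortest A-to-B cost: min(2 + 4, 5 + 1) = 6
```

The key point is that each intermediate node stores only its best cost from A, so paths never need to be enumerated explicitly.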

Application to a 1-dim chain. Consider a distribution defined on a 1-dim chain. Important property: directed and undirected graphs are equivalent (for a 1-dim chain): P(A,B) = P(A|B) P(B) or P(A,B) = P(B|A) P(A). For these simple graphs with two nodes you cannot distinguish causation from correlation without intervention (Wu's lecture Friday). For this lecture we will treat a simple one-dimensional chain and cover directed and undirected models simultaneously. (Translating between directed and undirected is generally possible for graphs without closed loops, but has subtleties.)

Probability distribution on 1-D chain
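The equations on this slide did not survive the transcript. A standard form, consistent with the surrounding slides, writes the undirected 1-D chain distribution with pairwise potentials (the notation below is an assumption, not necessarily the slide's):

```latex
% Undirected 1-D chain with pairwise potentials psi_i and normalization constant Z:
P(x_1, \ldots, x_N) \;=\; \frac{1}{Z} \prod_{i=1}^{N-1} \psi_i(x_i, x_{i+1}),
\qquad
Z \;=\; \sum_{x_1, \ldots, x_N} \prod_{i=1}^{N-1} \psi_i(x_i, x_{i+1}).
```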

1-D Chain.

1-Dim Chain: (proof by induction).
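The recursion proved by induction is also missing from the transcript. A hedged reconstruction of the standard DP (forward) recursion for the chain distribution above is:

```latex
% Forward messages m_i(x_{i+1}) accumulate the sums over x_1, ..., x_i:
m_1(x_2) \;=\; \sum_{x_1} \psi_1(x_1, x_2),
\qquad
m_i(x_{i+1}) \;=\; \sum_{x_i} \psi_i(x_i, x_{i+1})\, m_{i-1}(x_i),
\quad i = 2, \ldots, N-1.
% Induction step: if m_{i-1} correctly summarizes the sums over x_1, ..., x_{i-1},
% then m_i correctly summarizes the sums over x_1, ..., x_i, so
% Z = \sum_{x_N} m_{N-1}(x_N) is computed in time linear in N.
```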

1-Dim Chain. We can also use DP to compute other properties, e.g. to convert the distribution from undirected form to directed form.
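The conversion formulas are likewise absent from the transcript; a standard statement of the result (again with assumed notation) is that the marginals computed by DP give the directed form:

```latex
% Directed (causal chain) form obtained from the undirected chain via its marginals:
P(x_1, \ldots, x_N) \;=\; P(x_1) \prod_{i=1}^{N-1} P(x_{i+1} \mid x_i),
\qquad
P(x_{i+1} \mid x_i) \;=\; \frac{P(x_i, x_{i+1})}{P(x_i)}.
% The marginals P(x_i) and P(x_i, x_{i+1}) are themselves computed with the
% forward messages from the previous slide (plus the analogous backward pass).
```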

1-Dim Chain

Special Case: 1-D Ising Spin Model
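No formula survives for this slide either. A common way to write the 1-D Ising special case, assuming spins s_i in {-1, +1} and a coupling constant J (these symbols are my assumption, not necessarily the slide's), is:

```latex
% 1-D Ising chain: the pairwise potentials become psi_i(s_i, s_{i+1}) = exp(J s_i s_{i+1}),
% so the chain distribution is
P(s_1, \ldots, s_N) \;=\; \frac{1}{Z} \exp\!\Big( J \sum_{i=1}^{N-1} s_i\, s_{i+1} \Big),
% and Z can be computed by the same forward recursion (the transfer-matrix method).
```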

Dynamic Programming Summary. Dynamic programming can be applied to perform inference on all graphical models defined on trees. The key insight is that, for trees, we can define an order on the nodes (not necessarily unique) and process the nodes in sequence, never needing to return to a node that has already been processed.

Extensions of Dynamic Programming. What can you do if you have a graph with closed loops? There are a variety of advanced ways to exploit the graphical structure and obtain efficient exact algorithms. Prof. Adnan Darwiche (CS, UCLA) is an expert on this topic; there will be an introduction to his SamIam code. You can also use approximate methods such as belief propagation (BP).

Junction Trees. It is also possible to take a probability distribution defined on a graph with closed loops and reformulate it as a distribution on a new set of nodes without closed loops (Lauritzen and Spiegelhalter, 1990). This leads to a variety of algorithms generally known as junction trees. It is not a universal solution, because the resulting new graphs may have too many nodes to be practical. Google "junction trees" to find good tutorials.

Graph Conversion. Convert the graph by a set of transformations.

Triangles & Augmented Variables. From triangles to ordered triangles. Original variables: loops. Augmented variables: no loops.

Summary of Dynamic Programming. Dynamic programming can be used to efficiently compute properties of a distribution for graphs defined on trees. Directed graphs on trees can be reformulated as undirected graphs on trees, and vice versa. DP can be extended to graphs with closed loops by restructuring the graphs (junction trees). Determining efficient inference algorithms that exploit the graphical structure of these models is an active research area. Other topics: the relationship between DP and reinforcement learning (week 2), DP and A*, and DP and pruning.

HMMs: Learning and Inference. So far we have considered inference only, which assumes that the model is known. How can we learn the model? For 1-D models this uses DP and EM, as sketched below.
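The slides do not spell the algorithm out at this point, but the standard combination of DP and EM for HMMs is Baum-Welch; a sketch of its updates (standard textbook form, not copied from the slides) is:

```latex
% E-step: run forward-backward (DP) with the current parameters to get the posteriors
%   gamma_t(i) = P(z_t = i | x_{1:T}),   xi_t(i,j) = P(z_t = i, z_{t+1} = j | x_{1:T}).
% M-step: re-estimate the transition, emission, and initial probabilities:
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},
\qquad
\hat{b}_i(k) = \frac{\sum_{t:\, x_t = k} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)},
\qquad
\hat{\pi}_i = \gamma_1(i).
```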

A simple HMM for Coin Tossing. Two coins, one biased and the other fair, are switched occasionally. The observable (0 or 1) is whether the coin shows heads or tails. The hidden state (A or B) is which coin is being used. There are unknown transition probabilities between the hidden states A and B, and unknown probabilities for the observations conditioned on the hidden states. The learning task is to estimate these probabilities from a sequence of observations.
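As a minimal sketch of the DP part (the forward algorithm) for this two-coin model, with made-up parameter values standing in for the unknown probabilities that EM would estimate:

```python
import numpy as np

# Hypothetical parameters for the two-coin HMM (states: A = fair, B = biased).
# In practice these are unknown and would be re-estimated by EM (Baum-Welch).
pi = np.array([0.5, 0.5])                  # initial state probabilities
A = np.array([[0.9, 0.1],                  # P(next state | current state)
              [0.2, 0.8]])
B = np.array([[0.5, 0.5],                  # P(observation | state): rows = states,
              [0.9, 0.1]])                 # columns = observation 0 (heads), 1 (tails)

def forward(obs):
    """Forward algorithm: P(obs_1..T), summed over all hidden state paths, via DP."""
    alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i * b_i(obs_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)
    return alpha.sum()

print(forward([0, 0, 1, 0, 0]))            # likelihood of a short heads/tails sequence
```

A full Baum-Welch implementation would add a backward pass and the re-estimation step from the previous slide.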

HMM for Speech

HMM Summary. HMMs define a class of Markov models with hidden variables, used for speech recognition and many other applications. Tasks involving HMMs include learning, inference, and model selection. These can often be performed by algorithms based on EM and DP.