An introduction to Graphical Models – Michael Jordan

Graphical Models (II)
Dong Wang and Yang Feng

References:
An Introduction to Graphical Models – Michael Jordan
CMU class 10708 – Eric Xing
Probabilistic Graphical Models: Principles and Techniques – Daphne Koller

Outline
Review of what we learned
Start from simple models
EM framework
Approximate inference

What are graphical models?
Graphical models are probabilistic models represented in graph form.
The graph captures (1) global structure and (2) local dependencies.
Statistical independences can be read directly from the graph.
The joint probability can be read directly from the graph.
Inference can be conducted graphically.
Directed graphs (Bayesian networks) focus on explicit dependencies; undirected graphs (Markov random fields) focus on implicit ones. The two families are not equivalent, but in many cases one can be converted into the other.
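
To make "the joint probability can be read directly" concrete, here are the two standard factorizations; the notation (parents pa(x_i), cliques C, potentials ψ_C, normalizer Z) is generic textbook notation rather than anything specific to these slides.

    % Directed graph (Bayesian network): the joint factorizes over parents
    p(x_1, \dots, x_N) = \prod_{i=1}^{N} p\big(x_i \mid \mathrm{pa}(x_i)\big)

    % Undirected graph (Markov random field): the joint factorizes over cliques,
    % with a global normalizer (partition function) Z
    p(x_1, \dots, x_N) = \frac{1}{Z} \prod_{C} \psi_C(x_C), \qquad
    Z = \sum_{x} \prod_{C} \psi_C(x_C)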

Graphical models and neural networks
They are different ways of balancing knowledge and experience.
Some models belong to both families, e.g., stochastic neural networks and RBMs.
Graphical models are mostly used generatively, while neural networks are mostly used discriminatively.

Supervised learning or unsupervised learning?
Graphical models can be either supervised or unsupervised.
Targets are treated no differently from explanatory variables.
The model is descriptive (generative), but it describes all variables jointly.

Some examples
Density estimation: generative and descriptive.
Regression: generative, linear or non-linear, parametric or non-parametric.
Classification: generative or discriminative.

Inference in graphical models
Assume the model is given. Given the values of some variables, what is the distribution of the other variables?
This can be a marginal distribution P(V) or a conditional distribution P(H|V).
The junction tree algorithm is one of the general exact inference approaches, but it can be very expensive (e.g., on the order of N^{2T} for some models).
For complex models we need approximate inference, including sampling and variational inference.
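
A minimal Python sketch (with made-up variable names and probabilities) of what "inference" means here: for a small discrete model specified by its factorized joint, compute a marginal P(V) or a conditional P(H|V) by brute-force summation. Real systems use junction tree or approximate methods precisely because this enumeration blows up exponentially.

    import itertools

    # A toy 3-variable chain A -> B -> C, each variable binary.
    # Joint p(a, b, c) = p(a) * p(b|a) * p(c|b); the numbers are illustrative.
    p_a = {0: 0.6, 1: 0.4}
    p_b_given_a = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key: (b, a)
    p_c_given_b = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key: (c, b)

    def joint(a, b, c):
        return p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]

    # Marginal P(C): sum out the variables we are not interested in (A and B).
    p_c = {c: sum(joint(a, b, c) for a, b in itertools.product([0, 1], repeat=2))
           for c in [0, 1]}

    # Conditional P(B | C=1): sum out A, then renormalize.
    unnorm = {b: sum(joint(a, b, 1) for a in [0, 1]) for b in [0, 1]}
    z = sum(unnorm.values())
    p_b_given_c1 = {b: v / z for b, v in unnorm.items()}

    print(p_c, p_b_given_c1)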

Learning graphical models
Given a set of training examples for a set of observed variables, find the best model (e.g., the most likely one).

Outline
Review of what we learned
Start from simple models
EM framework
  Inference
  Maximization

We start from some simple models
Gaussian mixture models (GMM)
Probabilistic PCA
Probabilistic linear discriminant analysis (PLDA)
Hidden Markov models (HMM)
Restricted Boltzmann machines (RBM)
Conditional random fields (CRF)
Latent Dirichlet Allocation (LDA)

Gaussian mixture models
Many variables are distributed with multiple modes; we need a model that can describe such data.
A GMM is an unsupervised model: it requires no labels.
It is one of the most popular models for describing statistical data.
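
For reference, the GMM density with K components, written in the φ, μ, Σ notation used on the following slides:

    p(x) = \sum_{k=1}^{K} \varphi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
    \qquad \varphi_k \ge 0, \quad \sum_{k=1}^{K} \varphi_k = 1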

GMM learning
The main difficulty: hidden (latent) variables!
Solution: make the hidden variables more explicit, by estimating their probabilities.
This gives an EM algorithm:
E step: compute the posteriors of the unknown (hidden) variables.
M step: estimate the model parameters using those posteriors.
Iterate until convergence.
Demo: http://www.juergenwiki.de/work/wiki/doku.php?id=public:gaussian_mixture_models_em_algorithm_-_demo
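
A minimal NumPy sketch of the EM loop just described, for a 1-D GMM; the initialization and fixed iteration count are simplifying assumptions, not part of the original slides.

    import numpy as np

    def em_gmm_1d(x, k, n_iter=100, seed=0):
        """EM for a 1-D Gaussian mixture: returns weights, means, variances."""
        rng = np.random.default_rng(seed)
        n = len(x)
        w = np.full(k, 1.0 / k)                      # mixture weights (phi)
        mu = rng.choice(x, size=k, replace=False)    # initialize means from the data
        var = np.full(k, np.var(x))                  # shared initial variance
        for _ in range(n_iter):
            # E step: posterior responsibility of each component for each point
            dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
            resp = w * dens
            resp /= resp.sum(axis=1, keepdims=True)
            # M step: closed-form updates for weights, means, variances
            nk = resp.sum(axis=0)
            w = nk / n
            mu = (resp * x[:, None]).sum(axis=0) / nk
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        return w, mu, var

    # Example: two well-separated clusters
    data = np.concatenate([np.random.normal(-2, 1, 500), np.random.normal(3, 0.5, 500)])
    print(em_gmm_1d(data, k=2))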

GMM is a graphical model
Inference in a GMM is a simple special case of the junction tree algorithm.
[Plate diagram: mixture weights φ point to the hidden component indicator z; z and the parameters μ, Σ point to the observation x; z and x sit inside a plate of size N.]

Hidden Markov model
A basic temporal model, built on the Markov assumption and the conditional independence assumption (see the factorization below).
Inference: given the observations Y, what are the hidden states X? (forward-backward/Baum-Welch, or Viterbi for the best path)
Parameter estimation: given Y, which parameters maximize P(Y)? (Baum-Welch)
Again, the difficulty resides in the hidden variables (the states), and EM applies again.
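
Written out, the two assumptions give the standard joint factorization (hidden states x_{1:T}, observations y_{1:T}):

    p(x_{1:T}, y_{1:T}) = p(x_1)\, p(y_1 \mid x_1)
    \prod_{t=2}^{T} p(x_t \mid x_{t-1})\, p(y_t \mid x_t)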

It is a graphical model…
Unfold the state graph over time to obtain a graphical model.
Inference is done by forward-backward (Baum-Welch), which is a special case of the junction tree algorithm; Viterbi is its max-product variant (a sketch of the forward recursion follows).
Training is a straightforward application of EM.
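
A small sketch of the forward (alpha) recursion that forward-backward is built on, assuming discrete states and tabular parameters; the per-step scaling avoids numerical underflow on long sequences. All parameter values below are illustrative.

    import numpy as np

    def forward(pi, A, B, obs):
        """Scaled forward pass for a discrete HMM.
        pi: (S,) initial state probs; A: (S, S) transitions, A[i, j] = p(j | i);
        B: (S, V) emissions, B[s, o] = p(o | s); obs: list of observation indices.
        Returns per-step scaled alphas and the log-likelihood log p(obs)."""
        alphas, log_lik = [], 0.0
        alpha = pi * B[:, obs[0]]
        for t, o in enumerate(obs):
            if t > 0:
                alpha = (alpha @ A) * B[:, o]   # predict through transitions, then weight by emission
            c = alpha.sum()                      # scaling factor = p(y_t | y_{1:t-1})
            log_lik += np.log(c)
            alpha = alpha / c
            alphas.append(alpha)
        return np.array(alphas), log_lik

    # Tiny 2-state example with made-up parameters
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(forward(pi, A, B, obs=[0, 1, 0, 0]))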

Restricted Boltzmann Machine
An RBM is a two-layer random field with no connections within a layer (only between the visible and hidden layers).
It is an energy-based model, and can be viewed as a bi-directional stochastic neural network.
Inference is easy: given one layer, the units of the other layer are conditionally independent.
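
For completeness, the energy and conditionals of a binary RBM with visible units v, hidden units h, weights W and biases a, b (standard notation, assumed rather than taken from the slide); σ is the logistic sigmoid:

    E(v, h) = -a^\top v - b^\top h - v^\top W h, \qquad
    p(v, h) = \frac{1}{Z} e^{-E(v, h)}

    p(h_j = 1 \mid v) = \sigma\big(b_j + (v^\top W)_j\big), \qquad
    p(v_i = 1 \mid h) = \sigma\big(a_i + (W h)_i\big)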

RBM training
Parameter optimization is conducted by gradient descent.
The gradient contains two terms: one driven by the empirical data (empirical evidence) and one driven by the model distribution (model assumption).

Contrastive divergence
The goal is to make the energy low on the training data and high elsewhere.
MCMC is used to draw 'negative' examples whose energy is pushed up; these samples stand in for what the true model would generate (a CD-1 sketch follows).
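
A minimal sketch of one CD-1 update for a binary RBM, matching the description above: a single Gibbs step from the data produces the 'negative' sample whose statistics are subtracted from the data statistics. Names, shapes and the learning rate are illustrative assumptions.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, a, b, lr=0.01, rng=None):
        """One CD-1 step for a binary RBM on a batch v0 of shape (batch, n_visible)."""
        if rng is None:
            rng = np.random.default_rng(0)
        # Positive phase: hidden activations driven by the data
        ph0 = sigmoid(v0 @ W + b)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step to get a "fantasy" reconstruction
        pv1 = sigmoid(h0 @ W.T + a)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + b)
        # Contrastive divergence gradient: data statistics minus model statistics
        batch = v0.shape[0]
        W += lr * (v0.T @ ph0 - v1.T @ ph1) / batch
        a += lr * (v0 - v1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
        return W, a, b

    # Example call with random data and parameters (shapes only; values are illustrative)
    rng = np.random.default_rng(1)
    v0 = (rng.random((16, 6)) < 0.5).astype(float)
    W, a, b = 0.01 * rng.standard_normal((6, 4)), np.zeros(6), np.zeros(4)
    W, a, b = cd1_update(v0, W, a, b)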

Outline
Review of what we learned
Start from simple models
EM framework
Approximate inference

It is time to think about this more
We have seen diverse types of graphical models, and each seems to have its own particular training method.
We may have memorized numerous names: k-means, EM, CD, Viterbi, dynamic programming, junction tree, Baum-Welch, MCMC, SGD... What are they, really?

Come back to the difficulties
Some variables are random and hidden (latent).
Some dependencies are complex.
Some computations are intractable.

Possible solutions
Some variables are random and hidden (latent): use their posterior probabilities instead of exact values, and use the expectation as the objective!
Some dependencies are complex: use simpler relations to approximate them.
Some computations are intractable: resort to numerical optimization.

A framework you must remember: the E-M algorithm
Expectation (junction tree, Baum-Welch, MCMC, variational..., mostly inference).
Maximization (SGD, Newton, conjugate gradient, Hessian-free, L-BFGS).

Revisit GMM
Expectation: compute the posteriors P(C_i | X) (inference), then form the expected complete-data log-likelihood ∑_i P(C_i | X) log p(X, C_i | μ_i, Σ_i, φ_i).
Maximization: closed-form solutions for μ_i, Σ_i, φ_i.
[Same plate diagram as before: φ, hidden z, observation x, parameters μ, Σ, plate of size N.]

Revisit HMM
Expectation: compute P(S_t | O) for each t using forward-backward (Baum-Welch), then form the expected complete-data log-likelihood ∑_S P(S | O) log p(O, S | A, b, π).
Maximization: closed-form solutions for A, b, π.

Revisit RBM
Expectation: compute P(h | v) and form the expectation.
Maximization: no closed form, because of the partition function Z, so use gradient descent.
That is still not enough: the gradient itself is computationally intractable, so we resort to sampling.

But things may be more complex
So far, all the inference has been simple.
In complex graphical models, inference (posterior computation) can be intractable; we will discuss approximate inference methods a bit later.
For now, let us ask: will the EM procedure converge to what we want?

See more details about EM
The log-likelihood decomposes into a lower bound plus a KL term, with KL ≥ 0.
If all variables were observable, we would maximize the complete-data log-likelihood directly; with latent variables, the lower bound is an EXPECTATION, and we want to maximize it (see the decomposition below).

Maximize the expectation
L(q, θ) is a lower bound of the log-likelihood L(θ).
The two touch at the current θ when q(Z) = P(Z | X, θ).
Maximizing L(q, θ) is easier because it has a simple form.
The procedure converges to a local maximum of the likelihood.
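
The decomposition that the two slides above refer to is the standard one; writing it out makes the lower-bound argument explicit (data X, latent variables Z, parameters θ, variational distribution q):

    \ln p(X \mid \theta)
    = \underbrace{\sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}}_{\mathcal{L}(q, \theta)}
    + \underbrace{\sum_{Z} q(Z) \ln \frac{q(Z)}{p(Z \mid X, \theta)}}_{\mathrm{KL}(q \,\|\, p) \;\ge\; 0}

The E step sets q(Z) = p(Z | X, θ_old), making KL = 0 so the bound touches ln p(X | θ); the M step then maximizes L(q, θ) over θ.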

Outline
Review of what we learned
Start from simple models
EM framework
Approximate inference

Intractable graphical models
Exact inference is tractable for some graphical models: chain-like graphs and tree-like graphs (for a chain with T steps and N states, message passing costs O(TN²)).
Many graphical models do not admit tractable exact inference: high dimensionality, and complex forms of the posterior probabilities.

Two families of approximate inference
Sampling approach: use samples to represent posteriors or marginals.
Variational approach: use simpler functions to approximate posteriors or marginals.

Sampling approach
A graphical model is 'generative', so it can generate samples. Given a set of samples, we can compute statistics parametrically or non-parametrically:
marginals, by ignoring the variables we are not interested in;
conditionals, by grouping the samples according to the values of the conditioning variables.
Directed graphical models can sample from parents to children (ancestral sampling, sketched below); undirected graphical models are not as easy.
Even when sampling is easy, this naive approach is usually highly inefficient because many samples are wasted.
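
A minimal sketch of ancestral ("parents to children") sampling for a directed model, reusing the toy A → B → C chain and the made-up probabilities from the inference example earlier; marginals and conditionals then become simple empirical counts over the draws.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_chain(n):
        """Ancestral sampling from the toy chain A -> B -> C (binary variables)."""
        draws = []
        for _ in range(n):
            a = rng.random() < 0.4                      # p(A=1) = 0.4
            b = rng.random() < (0.8 if a else 0.3)      # p(B=1 | A)
            c = rng.random() < (0.5 if b else 0.1)      # p(C=1 | B)
            draws.append((int(a), int(b), int(c)))
        return np.array(draws)

    samples = sample_chain(10000)
    print("P(C=1) ~", samples[:, 2].mean())                         # marginal: ignore A and B
    print("P(B=1 | C=1) ~", samples[samples[:, 2] == 1, 1].mean())  # conditional: group by C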

Markov Chain Monte Carlo (MCMC)
Design a Markov chain and let it converge to the target distribution; then everything is simple.
A Markov chain satisfies the Markov property; we then define the transition probabilities of the chain and the marginal distribution at each step.
A distribution A is invariant with respect to a Markov chain F if, after each step of F, the marginal distribution is still A.

Metropolis-Hastings
A chain has p(z) as an invariant distribution if it is reversible with respect to p(z).
To make the chain converge to p(z) regardless of the initial state, it should also be ergodic.
It can be shown that a homogeneous Markov chain (one that does not change over time) is ergodic subject only to weak restrictions on the invariant distribution and the transition probabilities.
It can be proven that designing a simple proposal q(z | z_{t-1}) together with an appropriate acceptance/rejection criterion involving p(z) leads to a reversible, ergodic chain with target p(z). This is the Metropolis-Hastings algorithm.
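
A minimal sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal, targeting an arbitrary unnormalized density; the specific target below is an illustrative choice, not taken from the slides.

    import numpy as np

    def metropolis_hastings(log_p, x0, n_samples, step=0.5, seed=0):
        """Random-walk Metropolis: log_p is the log of an unnormalized target density."""
        rng = np.random.default_rng(seed)
        x, lp = x0, log_p(x0)
        samples = []
        for _ in range(n_samples):
            x_new = x + step * rng.standard_normal()      # symmetric proposal q(x'|x)
            lp_new = log_p(x_new)
            # Accept with probability min(1, p(x')/p(x)); proposal terms cancel by symmetry
            if np.log(rng.random()) < lp_new - lp:
                x, lp = x_new, lp_new
            samples.append(x)
        return np.array(samples)

    # Example target: unnormalized mixture of two unit-variance Gaussians at -2 and +2
    log_target = lambda x: np.log(np.exp(-0.5 * (x + 2) ** 2) + np.exp(-0.5 * (x - 2) ** 2))
    draws = metropolis_hastings(log_target, x0=0.0, n_samples=5000)
    print(draws.mean(), draws.std())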

Gibbs sampling
A special case of the Metropolis-Hastings algorithm: sample one particular variable at a time.
It corresponds to the proposal q(z_t | z_{t-1}) = p(z_k | z_{\k}), i.e., the conditional of z_k given all the other variables, and the acceptance rate is exactly 1.
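
A sketch of Gibbs sampling for a case where the full conditionals are known in closed form: a standard bivariate Gaussian with correlation ρ, where each conditional p(z_k | z_{\k}) is itself Gaussian. The choice of target is an assumption made for clarity, not from the slides.

    import numpy as np

    def gibbs_bivariate_gaussian(rho, n_samples, seed=0):
        """Gibbs sampling from a standard bivariate Gaussian with correlation rho:
        each full conditional is N(rho * other, 1 - rho**2)."""
        rng = np.random.default_rng(seed)
        x, y = 0.0, 0.0
        samples = []
        for _ in range(n_samples):
            x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))   # sample x | y
            y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))   # sample y | x
            samples.append((x, y))
        return np.array(samples)

    draws = gibbs_bivariate_gaussian(rho=0.9, n_samples=5000)
    print(np.corrcoef(draws.T)[0, 1])   # should be close to 0.9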

Determining the conditional
At each step we need to know which variables the target variable depends on.
The set of variables a target variable depends on, given which it is independent of everything else, is called its Markov blanket.

Markov blanket
[Figure omitted: in a directed graph, the Markov blanket of a node consists of its parents, its children, and its children's other parents; in an undirected graph, it is simply the node's neighbors.]

Some problems of Gibbs sampling
It can take a long time to converge, and it is not simple to tell whether it has converged.
Successive samples are dependent; we can thin the chain by keeping only every M-th sample.

How sampling is used in inference and EM
Bayesian prediction.
Maximum a posteriori estimation in the Laplace approximation.
Computing the expectation in EM.

Variational approach
Design a simple probability distribution to approximate the true posterior.

Factorized distributions
The variational functions can take any form, but it is better to keep them as general as possible.
Factorized distributions introduce only a weak assumption: they restrict the dependence structure but do not fix the distribution family.

Optimize with respect to each factor

Variational result
The optimal factor is an expectation over the remaining factors of q (see the formula below).
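
Concretely, for a fully factorized q(Z) = ∏_j q_j(Z_j), the standard mean-field result is that the optimal j-th factor is obtained by taking the expectation of the log joint under all the other factors:

    \ln q_j^{*}(Z_j) = \mathbb{E}_{i \neq j}\big[\ln p(X, Z)\big] + \text{const}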

A simple example
Applying the factorized approximation to a Gaussian: each optimal factor turns out to be Gaussian itself.

Some other examples
Variational mixtures of Gaussians, LDA and HLDA.

Pros and cons
The variational approach is generally faster than sampling, but it still involves an iterative procedure.
It requires a lot of hand design, which is hard most of the time.
We described a simpler variational-style approach that uses a deep neural network to map the variables to a space where the distribution is simple; however, there is not much work on how to infer the variables in that new space, so it is mainly used for generation right now.

Wrap up
A graphical model is a structured model that encodes rich knowledge, and it is a basic framework for complex inference.
Many of the models we use every day are graphical models.
A small set of graphical models admit exact inference, with algorithms such as junction tree message passing.
Most graphical models require approximate inference, particularly sampling and variational methods.
No matter how the inference is conducted, EM is a general framework for model training.