2010 Winter School on Machine Learning and Vision. Sponsored by the Canadian Institute for Advanced Research and Microsoft Research India, with additional support from the Indian Institute of Science, Bangalore, and the University of Toronto, Canada.

Outline
1. Approximate inference: mean field and variational methods
2. Learning generative models of images
3. Learning epitomes of images

Part A Approximate inference: Mean field and variational methods

Line processes for binary images (Geman and Geman 1984). [Figure: the line-process potential function over local binary patterns, with example patterns of high and low value; under P, line-like patterns are probable.]

Use tablet to derive variational inference method
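For reference (the derivation itself is done on the tablet), here is a hedged sketch of the standard mean-field update for a generic pairwise binary MRF; the energy form, the symbols w_ij and b_i, and the 0/1 coding are assumptions and may differ from the exact line-process model used in the lecture.

```latex
% Assumed model (not necessarily the exact line-process energy from the slide):
%   P(x) \propto \exp\big( \sum_i b_i x_i + \sum_{(i,j)} w_{ij} x_i x_j \big),
%   with x_i \in \{0,1\}.
% Fully factorized (mean-field) approximation:
%   Q(x) = \prod_i q_i^{x_i} (1 - q_i)^{1 - x_i}.
% Minimizing the free energy F(Q) = E_Q[-\log \tilde{P}(x)] - H(Q)
% coordinate-wise in q_i gives the fixed-point update
\[
  q_i \leftarrow \sigma\Big( b_i + \sum_{j \in \mathcal{N}(i)} w_{ij}\, q_j \Big),
  \qquad \sigma(a) = \frac{1}{1 + e^{-a}},
\]
% and sweeping this update over all sites decreases F, yielding
% q_i \approx P(x_i = 1 \mid \text{data}).
```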

Denoising images using line process models

Part B Learning Generative Models of Images Brendan Frey University of Toronto and Canadian Institute for Advanced Research

Generative models
Generative models are trained to explain many different aspects of the input image
–Using an objective function like log P(image), a generative model benefits by accounting for all pixels in the image
Contrast to discriminative models trained in a supervised fashion (eg, object recognition)
–Using an objective function like log P(class|image), a discriminative model benefits by accounting for pixel features that distinguish between classes

Rejecting a common criticism of generative models
Critics of generative models often point to supposedly well-defined tasks (eg, object recognition) and claim that discriminatively trained classifiers (eg, SVMs) perform better
My experience is that such critics usually study artificially simplistic tasks that are mostly unrelated to real-world tasks
–Hand-written digit recognition
–Identifying cows in PASCAL or Caltech256 images
–Classifying gene function using microarray data

Accepting a common criticism of generative models Generative models do well at automatically answering diverse and even unexpected queries, but on specific supervised learning tasks, discriminative methods (eg, SVMs) usually perform better

What constitutes an image?
–Uniform 2-D array of color pixels
–Uniform 2-D array of grey-scale pixels
–Non-uniform images (eg, retinal images, compressed sampling images)
–Features extracted from the image (eg, SIFT features)
–Subsets of image pixels selected by the model (must be careful to represent the universe)
–…

What constitutes a generative model?

Learning Bayesian Networks: Exact and approximate methods

Maximum likelihood learning when all variables are visible (complete data)
Suppose we observe N IID training cases v^(1) … v^(N)
Let θ be the parameters of a model P(v|θ)
Maximum likelihood estimate of θ:
θ_ML = argmax_θ ∏_n P(v^(n)|θ) = argmax_θ log( ∏_n P(v^(n)|θ) ) = argmax_θ ∑_n log P(v^(n)|θ)

Complete data in Bayes nets
All variables are observed, so P(v|θ) = ∏_i P(v_i|pa_i, θ_i), where pa_i = parents of v_i and θ_i parameterizes P(v_i|pa_i)
Since argmax(·) = argmax log(·),
θ_i^ML = argmax_{θ_i} ∑_n log P(v^(n)|θ) = argmax_{θ_i} ∑_n ∑_i log P(v_i^(n)|pa_i^(n), θ_i) = argmax_{θ_i} ∑_n log P(v_i^(n)|pa_i^(n), θ_i)
Each child-parent module can be learned separately
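As a minimal sketch of this decoupling (discrete variables and table CPDs assumed; the function and variable names are illustrative, not from the lecture), each child-parent conditional can be estimated independently by counting:

```python
from collections import Counter

def fit_cpt(data, child, parents):
    """ML estimate of P(child | parents) from fully observed (complete) data.

    data    : list of dicts mapping variable name -> observed value
    child   : name of the child variable
    parents : list of parent variable names (may be empty)
    Returns {parent_config: {child_value: probability}}.
    """
    joint, parent_counts = Counter(), Counter()
    for record in data:
        pa = tuple(record[p] for p in parents)
        joint[(pa, record[child])] += 1
        parent_counts[pa] += 1
    cpt = {}
    for (pa, val), count in joint.items():
        cpt.setdefault(pa, {})[val] = count / parent_counts[pa]
    return cpt

# Each child-parent module is learned separately, e.g. for a two-node net A -> B:
data = [{"A": 0, "B": 1}, {"A": 0, "B": 0}, {"A": 1, "B": 1}]
prior_A = fit_cpt(data, "A", [])        # P(A)
cpt_B = fit_cpt(data, "B", ["A"])       # P(B | A)
```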

Example: Learning a mixture of Gaussians from labeled data
Recall: for cluster k, the probability density of x is N(x; μ_k, Σ_k)
The probability of cluster k is p(z_k = 1) = π_k
Complete data: each training case is a (z_n, x_n) pair; let N_k be the number of cases in class k
ML estimation: π_k = N_k/N, μ_k = (1/N_k) ∑_{n: z_nk=1} x_n, Σ_k = (1/N_k) ∑_{n: z_nk=1} (x_n − μ_k)(x_n − μ_k)ᵀ
That is, just learn one Gaussian for each class of data
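A minimal NumPy sketch of these complete-data estimates (names and shapes are illustrative):

```python
import numpy as np

def fit_labeled_mog(x, z):
    """ML fit of a mixture of Gaussians from labeled (complete) data.

    x : (N, D) data points
    z : (N,) integer class labels in {0, ..., K-1}; every class is assumed to occur
    Returns mixing proportions pi (K,), means mu (K, D), covariances Sigma (K, D, D).
    """
    N, D = x.shape
    K = z.max() + 1
    pi = np.zeros(K)
    mu = np.zeros((K, D))
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        xk = x[z == k]                      # all cases observed in class k
        Nk = len(xk)
        pi[k] = Nk / N                      # pi_k = N_k / N
        mu[k] = xk.mean(axis=0)             # mu_k = mean of class-k points
        diff = xk - mu[k]
        Sigma[k] = diff.T @ diff / Nk       # ML (biased) covariance of class k
    return pi, mu, Sigma
```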

Example: Learning from complete data, a continuous child with continuous parents
Estimation becomes a regression-type problem
Eg, linear Gaussian model: P(v_i|pa_i, θ_i) = N(v_i; w_i0 + ∑_{n: v_n ∈ pa_i} w_in v_n, C_i), ie the mean is a linear function of the parents
Estimation: linear regression

Learning fully-observed MRFs
It turns out we can NOT directly estimate each potential using only observations of its variables
P(v|θ) = ∏_i ϕ(v_{C_i}|θ_i) / ( ∑_v ∏_i ϕ(v_{C_i}|θ_i) )
Problem: the partition function (the denominator)

Learning Bayesian networks when there is missing data

Example: Mixture of K unit-variance Gaussians
P(x) = ∑_k π_k α exp(−(x−μ_k)²/2), where α = (2π)^(−1/2)
The log-likelihood to be maximized is log( ∑_k π_k α exp(−(x−μ_k)²/2) )
The parameters {π_k, μ_k} that maximize this do not have a simple, closed-form solution
One approach: use a nonlinear optimizer
This approach is intractable if the number of components is too large
A different approach…
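As a sketch, the objective handed to such a nonlinear optimizer can at least be evaluated stably in the log domain (NumPy only; 1-D, unit-variance components as on the slide):

```python
import numpy as np

def mixture_loglik(x, pi, mu):
    """Log-likelihood of N 1-D points under a mixture of K unit-variance Gaussians.

    x : (N,) data, pi : (K,) mixing proportions, mu : (K,) component means.
    """
    log_alpha = -0.5 * np.log(2 * np.pi)                    # alpha = (2*pi)^(-1/2)
    # log[ pi_k * alpha * exp(-(x_n - mu_k)^2 / 2) ] for every (n, k) pair
    log_comp = np.log(pi)[None, :] + log_alpha - 0.5 * (x[:, None] - mu[None, :]) ** 2
    # log sum_k ... computed without underflow, then summed over the data
    return np.logaddexp.reduce(log_comp, axis=1).sum()

# A generic optimizer could maximize this w.r.t. (pi, mu), but there is no closed
# form; the EM algorithm on the following slides is the standard alternative.
```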

The expectation maximization (EM) algorithm (Dempster, Laird and Rubin 1977)
Learning was more straightforward when the data was complete
Can we use probabilistic inference (compute P(h|v, θ)) to fill in the missing data and then use the learning rules for complete data?
YES: This is called the EM algorithm

Expectation maximization (EM) algorithm: ensemble completion
Initialize θ (randomly or cleverly)
E-Step: Compute Q^(n)(h) = P(h|v^(n), θ) for hidden variables h, given visible variables v
M-Step: Holding Q^(n)(h) constant, maximize ∑_n ∑_h Q^(n)(h) log P(v^(n), h|θ) wrt θ
Repeat E and M steps until convergence
Each iteration increases log P(v|θ) = ∑_n log( ∑_h P(v^(n), h|θ) )

EM in Bayesian networks
Recall P(v,h|θ) = ∏_i P(x_i|pa_i, θ_i), where x = (v,h)
Then, maximizing ∑_n ∑_h Q^(n)(h) log P(v^(n), h|θ) wrt θ_i becomes equivalent to maximizing, for each x_i,
∑_n ∑_{x_i, pa_i} Q^(n)(x_i, pa_i) log P(x_i|pa_i, θ_i)
where Q^(n) puts zero probability on any value of x_k that disagrees with an observed value x_k*
GIVEN the Q-distributions, the conditional P-distributions can be updated independently

EM in Bayesian networks
E-Step: Compute Q^(n)(x_i, pa_i) = P(x_i, pa_i|v^(n), θ) for each variable x_i
M-Step: For each x_i, maximize ∑_n ∑_{x_i, pa_i} Q^(n)(x_i, pa_i) log P(x_i|pa_i, θ_i) wrt θ_i

EM for a mixture of Gaussians
Initialization: pick the π's, μ's and Σ's randomly but validly
E Step: for each training case, we need q(z) = p(z|x) = p(x|z)p(z) / ( ∑_z p(x|z)p(z) ). Defining γ(z_nk) = q(z_nk = 1), we need to actually compute the responsibilities γ(z_nk) = π_k N(x_n; μ_k, Σ_k) / ∑_j π_j N(x_n; μ_j, Σ_j)
M Step: do it in the log-domain! Recall: for labeled data, γ(z_nk) = z_nk
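A compact NumPy sketch of these E and M steps (diagonal covariances, log-domain responsibilities; initialization and variable names are illustrative, not the exact code used for the movies below):

```python
import numpy as np

def em_mog(x, K, n_iter=50, seed=0):
    """EM for a mixture of K Gaussians with diagonal covariances.

    x : (N, D) data. Returns (pi, mu, var, gamma) after n_iter iterations.
    """
    rng = np.random.default_rng(seed)
    N, D = x.shape
    pi = np.full(K, 1.0 / K)
    mu = x[rng.choice(N, K, replace=False)]           # init means from the data
    var = np.tile(x.var(axis=0) + 1e-6, (K, 1))       # init diagonal variances
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk) = q(z_nk = 1), in the log domain
        log_prob = (-0.5 * ((x[:, None, :] - mu[None]) ** 2 / var[None]
                            + np.log(2 * np.pi * var[None]))).sum(axis=2)
        log_rho = log_prob + np.log(pi)[None, :]
        gamma = np.exp(log_rho - np.logaddexp.reduce(log_rho, axis=1, keepdims=True))
        # M step: responsibility-weighted versions of the complete-data estimates
        Nk = gamma.sum(axis=0)                        # effective class counts
        pi = Nk / N
        mu = gamma.T @ x / Nk[:, None]
        var = gamma.T @ (x ** 2) / Nk[:, None] - mu ** 2
        var = np.maximum(var, 1e-6)                   # guard against collapse
    return pi, mu, var, gamma
```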

EM for mixture of Gaussians: E step
[Figure: graphical model c → z, with π_1 = 0.5, π_2 = 0.5, and images from the data set.]

[Animation frames: for each image z from the data set, the posterior P(c|z) over the classes c = 1 and c = 2 is computed under the current model (π_1 = 0.5, π_2 = 0.5).]

EM for mixture of Gaussians: M step
Set μ_1 to the average of z·P(c=1|z)
Set μ_2 to the average of z·P(c=2|z)
[Figure: graphical model c → z, with π_1 = 0.5, π_2 = 0.5.]

Set Σ_1 (diagonal) to the average of diag((z−μ_1)(z−μ_1)ᵀ)·P(c=1|z)
Set Σ_2 (diagonal) to the average of diag((z−μ_2)(z−μ_2)ᵀ)·P(c=2|z)

… after iterating to convergence: π_1 = 0.6, π_2 = 0.4. [Figure: graphical model c → z with the learned class means and variances.]

Why does EM work?

Gibbs free energy
Somehow, we need to move the log() function in the expression log( ∑_h P(h,v) ) inside the summation to obtain log P(h,v), which simplifies
We can do this using Jensen's inequality:
log ∑_h P(h,v) = log ∑_h Q(h) [P(h,v)/Q(h)] ≥ ∑_h Q(h) log [P(h,v)/Q(h)]
Free energy: F = − ∑_h Q(h) log [P(h,v)/Q(h)]

Properties of the free energy
F ≥ − log P(v)
The minimum of F wrt Q, attained at Q(h) = P(h|v), gives F = − log P(v)
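Both properties follow from a standard decomposition of the free energy (a sketch, using the same Q and P as above):

```latex
\begin{align*}
F(Q) &= -\sum_h Q(h)\,\log\frac{P(h,v)}{Q(h)}
      = -\sum_h Q(h)\,\log\frac{P(h\mid v)\,P(v)}{Q(h)} \\
     &= -\log P(v) \;+\; \sum_h Q(h)\,\log\frac{Q(h)}{P(h\mid v)}
      = -\log P(v) \;+\; \mathrm{KL}\!\left(Q(h)\,\big\|\,P(h\mid v)\right).
\end{align*}
% Since KL >= 0, we get F >= -log P(v), with equality exactly when Q(h) = P(h|v).
```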

Proof that EM maximizes log P(v) (Neal and Hinton 1993)
E-Step: By setting Q(h) = P(h|v), we make the bound tight, so that F = − log P(v)
M-Step: By maximizing ∑_h Q(h) log P(h,v) wrt the parameters of P, we are minimizing F wrt the parameters of P
Since − log P_new(v) ≤ F_new ≤ F_old = − log P_old(v), we have log P_new(v) ≥ log P_old(v)

Generalized EM
M-Step: Instead of minimizing F wrt P, just decrease F wrt P
E-Step: Instead of minimizing F wrt Q (ie, by setting Q(h) = P(h|v)), just decrease F wrt Q
Approximations
–Variational techniques (which decrease F wrt Q)
–Loopy belief propagation (note the phrase "loopy")
–Markov chain Monte Carlo (stochastic …)

Summary of learning Bayesian networks
Observed variables decouple learning in different conditional PDFs
In contrast, hidden variables couple learning in different conditional PDFs
Learning models with hidden variables entails iteratively filling in hidden variables using exact or approximate probabilistic inference, and updating every child-parent conditional PDF

Back to… Learning Generative Models of Images Brendan Frey University of Toronto and Canadian Institute for Advanced Research

What constitutes an image?
–Uniform 2-D array of color pixels
–Uniform 2-D array of grey-scale pixels
–Non-uniform images (eg, retinal images, compressed sampling images)
–Features extracted from the image (eg, SIFT features)
–Subsets of image pixels selected by the model (must be careful to represent the universe)
–…

Experiment: Fitting a mixture of Gaussians to pixel vectors extracted from complicated images. [Figure: learned models with 1, 2, 3, and 4 classes.]

Why didn't it work?
Is there a bug in the software?
–I don't think so, because the log-likelihood monotonically increases and the software works properly for toy data generated from a mixture of Gaussians
Is there a mistake in our mathematical derivation?
–The EM algorithm for a mixture of Gaussians has been studied by many people – I think the math is ok

Why didn't it work?
Are we missing some important hidden variables?
YES: The location of each object

Transformed mixtures of Gaussians (TMG) (Frey and Jojic)
[Figure: graphical model with class c, latent image z, shift T and observed image x; P(c) = π_c, P(T) is a prior over shifts, and P(x|z,T) = N(x; Tz, Ψ) with diagonal Ψ; example shown with π_1 = 0.6, π_2 = 0.4, class means, diagonal variances diag(Φ) and noise map diag(Ψ).]

EM for TMG
[Figure: graphical model with c, T, z and x]
E step: Compute Q(T) = P(T|x), Q(c) = P(c|x), Q(c,z) = P(z,c|x) and Q(T,z) = P(z,T|x) for each x in the data
M step: Set
–π_c to the average of Q(c)
–the prior over T to the average of Q(T)
–μ_c to the average mean of z under Q(z|c)
–Φ_c to the average variance of z under Q(z|c)
–Ψ to the average variance of x − Tz under Q(T,z)
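A hedged toy sketch of the discrete part of this E step, for 1-D circular shifts and diagonal covariances only (so Tμ_c is just a roll of μ_c and the marginal covariance stays diagonal); the Gaussian posteriors over z then follow from standard linear-Gaussian formulas. This is an illustration of the computation, not the released tmgEM.m. The arguments pi, mu, phi, psi mirror π_c, μ_c, Φ_c, Ψ above.

```python
import numpy as np

def tmg_e_step(x, pi, mu, phi, psi):
    """Posterior over (class c, shift T) for one observation x in a toy TMG.

    x   : (D,) observed vector (1-D circular shifts assumed for simplicity)
    pi  : (K,) class priors;  mu : (K, D) class means
    phi : (K, D) diagonal class variances;  psi : (D,) diagonal noise variance
    Returns q with q[c, t] = P(c, T = shift-by-t | x), assuming a uniform P(T).
    """
    K, D = mu.shape
    log_q = np.empty((K, D))
    for c in range(K):
        for t in range(D):
            m = np.roll(mu[c], t)             # T mu_c: shifting the class mean
            v = np.roll(phi[c], t) + psi      # diag(T Phi_c T' + Psi)
            # log P(c) + log P(T) + log N(x; T mu_c, T Phi_c T' + Psi)
            log_q[c, t] = (np.log(pi[c]) - np.log(D)
                           - 0.5 * np.sum((x - m) ** 2 / v + np.log(2 * np.pi * v)))
    log_q -= np.logaddexp.reduce(log_q.ravel())   # normalize over (c, T)
    return np.exp(log_q)
```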

Experiment: Fitting transformed mixtures of Gaussians to complicated images. [Figure: learned models with 1, 2, 3, and 4 classes, starting from random initialization.]

Let's peek into the Bayes net (different movie). [Video panels: input x, E[z|x], P(c|x), argmax_c P(c|x), argmax_T P(T|x), and E[Tz|x].]

tmgEM.m is available on the web

Accounting for multiple objects in the same image

How can we compose an image that includes multiple objects?
In TMG, the foreground and background were distinguished by the noise map diag(Ψ)
When there are multiple objects, we need a way to assign each pixel to one object
[Figure: TMG example with π_1 = 0.6, π_2 = 0.4 and the learned noise map diag(Ψ).]

Layered 2.5-D representations Adelson and Anandan (1990) described image patches as a composition of 2-D layers: Image = mask x foreground picture + (1-mask) x background picture
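The compositing rule is a one-liner; a minimal sketch (all arrays the same shape, mask values in [0,1], noise optional and only illustrative):

```python
import numpy as np

def composite(mask, foreground, background, noise_std=0.0, rng=None):
    """Image = mask * foreground + (1 - mask) * background (+ optional noise)."""
    if rng is None:
        rng = np.random.default_rng(0)
    image = mask * foreground + (1.0 - mask) * background
    if noise_std > 0:
        image = image + noise_std * rng.standard_normal(image.shape)
    return image
```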

Example. [Figure: an image composed as mask × foreground + (1 − mask) × background, plus noise.]

A generative model for layered vision (Jojic and Frey 2001, Frey, Kannan and Jojic, 2003)

Movies

A generative model for layered vision (Jojic and Frey 2001, Frey, Kannan and Jojic, 2003)

Random variables
–Appearance and transparency of layer l
–Class of layer l
–Contribution that layer l makes to the input
–Transformation of layer l
–Contribution of layer l, including transformations
–Subspace coordinate of layer l
–Image
[Symbols for each variable are shown on the slide.]

Probability model: the image model, the model of the hidden variables, and the joint pdf. [Equations shown on the slide.]

Efficient probabilistic reasoning and learning: approximate the posterior using a factorized q-distribution. [Factorization shown on the slide.]

Optimizing q
Minimize the variational free energy F
Algorithm:
–Initialize variational parameters
–Select a variational parameter or a model parameter, and adjust it so as to minimize F
–Repeat until convergence
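This is plain coordinate descent on F; a generic skeleton (the callables and names are placeholders, not the actual updates from the paper, which appear on the next slides):

```python
def coordinate_descent_free_energy(free_energy, params, updates, max_sweeps=100, tol=1e-6):
    """Generic coordinate descent on a variational free energy F.

    free_energy : callable(params) -> float, the bound F to be minimized
    params      : dict holding variational and model parameters
    updates     : dict mapping a parameter name to a callable(params) that returns
                  a value decreasing F w.r.t. that parameter, others held fixed
    """
    f_old = free_energy(params)
    for _ in range(max_sweeps):
        for name, update in updates.items():
            params[name] = update(params)        # decrease F along one coordinate
        f_new = free_energy(params)
        if f_old - f_new < tol:                  # F decreases monotonically; stop
            break
        f_old = f_new
    return params
```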

Inference updates: the Q(c) update, introducing auxiliary variables, and rewriting the image model. [Update equations shown on the slide.]

Inference updates (continued): the T update, the s update and the m update. [Update equations shown on the slide.]

Learning updates: the image variances and the other parameters. [Update equations shown on the slide.]

Movies

Inferring leg motion

Accounting for local image features using epitomes
A good way to model local image features is to factorize them (cf Bruno's talk)
A simpler method is to cluster them (Freeman and Pasztor, 1999)
A generative model based on clustered image patches needs a way to account for how image patches are coordinated

Learning the epitome of an image (Jojic, Kannan and Frey, ICCV 2003)

Movies

An EM-type learning algorithm

Gaussian likelihood model. [Equation figure: the likelihood of patch k is a product of Gaussians whose means and variances are read from the epitome (the parameters) through the map from epitome to image, over the set of pixel indices in patch k.]

EM for the Gaussian likelihood model. [E-step and M-step update equations shown on the slide.]
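A toy, unvectorized sketch of one such EM sweep for the epitome mean (square patches with p ≤ E, wrap-around epitome coordinates, uniform prior over mappings, variance update omitted; names are illustrative and this is far slower than a real implementation):

```python
import numpy as np

def epitome_em_sweep(patches, e_mean, e_var):
    """One EM sweep updating the mean of a Gaussian epitome (toy version).

    patches : (N, p, p) image patches, with p <= E
    e_mean, e_var : (E, E) epitome mean and variance (the parameters)
    Returns the updated epitome mean (variance left unchanged for brevity).
    """
    N, p, _ = patches.shape
    E = e_mean.shape[0]
    num = np.zeros((E, E))      # responsibility-weighted sums of patch pixels
    den = np.zeros((E, E))      # total responsibility landing on each epitome pixel
    for patch in patches:
        # E step: posterior over which epitome window (i, j) the patch maps to
        log_q = np.empty((E, E))
        for i in range(E):
            for j in range(E):
                m = np.roll(np.roll(e_mean, -i, 0), -j, 1)[:p, :p]   # epitome window
                v = np.roll(np.roll(e_var,  -i, 0), -j, 1)[:p, :p]
                log_q[i, j] = -0.5 * np.sum((patch - m) ** 2 / v + np.log(2 * np.pi * v))
        q = np.exp(log_q - np.logaddexp.reduce(log_q.ravel()))
        # M step accumulation: each epitome pixel averages the patch pixels
        # mapped onto it, weighted by the posterior q
        for i in range(E):
            for j in range(E):
                rows = np.arange(i, i + p) % E
                cols = np.arange(j, j + p) % E
                num[np.ix_(rows, cols)] += q[i, j] * patch
                den[np.ix_(rows, cols)] += q[i, j]
    return num / np.maximum(den, 1e-12)          # new epitome mean
```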

Examples

[Figure: the learned epitome mean and variance.]

Why epitomes are interesting & useful
–Generative model of multi-sized patches
–Invariant to transformations (eg, affine)
–Organizes and compresses patch data
–Proximity of patches in the epitome indicates the probability that the patches belong to the same part
–Incomplete observations are stitched into a single model
–Can model patterns at a wide range of scales
–Searching an epitome is much faster than searching an equivalent library of patches

Applications
–Data compression
–Data summarization and user interface
–Denoising
–Parts-based image modeling
–Segmentation
–…

Application: Image editing

Application: Denoising
[Figure: original image; noisy image (SNR = 13 dB); denoising using a mixture of Gaussians (SNR = 18.4 dB); denoising using an epitome (SNR = 19.2 dB); Wiener filter: 16.1 dB.]
Comparison: 80x80 epitome vs a mixture of 1000 Gaussians
–Epitome learning is 10 times faster per iteration of EM
–The epitome achieves a higher SNR and a better image

Using epitomes to model images with multiple parts
[Figure: graphical model with an appearance epitome and a shape epitome, sprites S_1 and S_2, and mask M; the image is X = M*S_1 + (1−M)*S_2 + noise.]

Application: Segmenting parts (Anitha Kannan)
[Figure: epitome-based graphical model with epitomes e_m and e_s, sprites S_1 and S_2, mask M and image x.]

Joint distribution and inference: exact inference is intractable, so we minimize the free energy using a variational method. [Joint distribution and the variational approximation to the posterior shown on the slide.]

Parameterizing the approximation to the posterior: the q-distribution uses Dirac delta functions, with a single estimated value shared by all patches containing pixel i. Idea: we should fit the relaxed model under the constraint that the generative patches agree in the overlapping regions.

The bound simplifies. [Equations: the posterior and the bound.] For this q function, the bound simplifies into a sum of quadratic terms that can be efficiently optimized by generalized EM.

Variational inference: iterate the update equations shown on the slide.

Application: Segmenting parts

Another example

A tough synthetic example: a gray-level image with the same mean and variance in the two regions.

Video epitomes (Cheung, Frey, Jojic ICCV 2005)

Examples of video epitomes: a temporally compressed epitome and a spatially compressed epitome. [Videos shown on the slide.]

Current work
–Replace sprites and masks with hierarchical patch-based models of local image and shape features
–Learn models from millions of images, taken from videos

Summary
Generative models are trained to explain many different aspects of the input image
–Using an objective function like log P(image), a generative model benefits by accounting for all pixels in the image
Contrast to discriminative models trained in a supervised fashion (eg, object recognition)
–Using an objective function like log P(class|image), a discriminative model benefits by accounting for pixel features that distinguish between classes

Rejecting a common criticism of generative models
Critics of generative models often point to well-defined tasks (eg, object recognition) and claim that discriminatively trained classifiers (eg, SVMs) perform better
I would argue that these tasks are often not representative of real-world problems
–Hand-written digit recognition
–Identifying cows in PASCAL or Caltech256 images
–Classifying gene function using microarray data

Accepting a common criticism of generative models On specific supervised learning tasks, discriminative methods (eg, SVMs) usually perform better

Current work
–Replace sprites and masks with hierarchical patch-based models of local image and shape features
–Learn models from millions of images, taken from videos