
1 A Bayesian Hierarchical Model for Learning Natural Scene Categories. L. Fei-Fei and P. Perona. CVPR 2005. Discovering objects and their location in images. J. Sivic, B. Russell, A. Efros, A. Zisserman and B. Freeman. ICCV 2005. Tomasz Malisiewicz (tomasz@cmu.edu), Advanced Machine Perception, February 2006.

2 Graphical Models: Recent Trend in Machine Learning. Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

3 Outline Goals of both vision papers Techniques from statistical text modeling - pLSA vs LDA Scene Classification via LDA Object Discovery via pLSA

4 Goal: Learn and Recognize Natural Scene Categories. Classify a scene without first extracting objects. Other techniques we know of: global frequency (Oliva and Torralba); texton histograms (Renninger, Malik et al.).

5 Goal: Discover Object Categories. Discover what objects are present in a collection of images in an unsupervised way. Find those same objects in novel images. Determine which local image features correspond to which objects, i.e. segment the image.

6 Enter the world of Statistical Text Modeling. D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, January 2003. Bag-of-words approaches: the order of words in a document can be neglected. Graphical Model Fun.

7 Bag-of-words: a document is treated as an unordered collection of words, and a corpus (collection of documents) is summarized in a term-document matrix.
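As a concrete toy example of such a term-document matrix, here is a minimal Python sketch (NumPy assumed); the `term_document_matrix` helper and the two toy documents are illustrative, not taken from either paper.

```python
import numpy as np

def term_document_matrix(documents, vocabulary):
    """Count matrix N where N[i, j] = number of times word j occurs in document i."""
    index = {w: j for j, w in enumerate(vocabulary)}
    N = np.zeros((len(documents), len(vocabulary)), dtype=int)
    for i, doc in enumerate(documents):
        for word in doc.lower().split():
            if word in index:
                N[i, index[word]] += 1
    return N

docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = sorted(set(" ".join(docs).split()))
print(term_document_matrix(docs, vocab))
```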

8 Object Bag of ‘words’

9

10 1990: Latent Semantic Analysis (LSA). Goal: map high-dimensional count vectors to a lower-dimensional representation to reveal semantic relations between words. The lower-dimensional space is called the latent semantic space; Dim(latent space) = K.

11 1990: Latent Semantic Analysis (LSA). D = {d_1, …, d_N}: N documents. W = {w_1, …, w_M}: M words. N_ij = #(d_i, w_j): the N×M co-occurrence term-document matrix. SVD factors it as [N×M] = [N×K] × [K×K] × [K×M], i.e. documents-to-topics, topic strengths, and topics-to-words.

12 What did we just do? Singular Value Decomposition: [N×M] = [N×K] × [K×K] × [K×M] (documents-to-topics, topic strengths, topics-to-words).

13 LSA summary. Apply SVD to the term-document matrix. Approximate N by setting all but the largest K singular values to zero. This produces the optimal rank-K approximation to N in the L2-matrix (Frobenius) norm sense.
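A minimal sketch of this LSA procedure using NumPy's SVD; the function and variable names are mine, and the matrix orientation follows the slides (documents as rows, words as columns).

```python
import numpy as np

def lsa(N, K):
    """Rank-K LSA approximation of an (n_docs x n_words) term-document matrix."""
    U, s, Vt = np.linalg.svd(N.astype(float), full_matrices=False)
    # Keep only the K largest singular values: the optimal rank-K
    # approximation of N in the Frobenius / L2-matrix norm sense.
    U_k, s_k, Vt_k = U[:, :K], s[:K], Vt[:K, :]
    N_k = U_k @ np.diag(s_k) @ Vt_k           # reconstructed low-rank matrix
    doc_coords = U_k * s_k                    # documents in the latent space
    word_coords = (np.diag(s_k) @ Vt_k).T     # words in the latent space
    return N_k, doc_coords, word_coords
```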

14 LSA and Polysemy Polysemy: the ambiguity of an individual word or phrase that can be used (in different contexts) to express two or more different meanings Under the LSA model, the coordinates of a word in latent space can be written as a linear superposition of the coordinates of the documents that contain the word According to this superposition principle, LSA is unable to capture multiple senses of a word

15 Problems with LSA. LSA does not define a properly normalized probability distribution. There is no obvious interpretation of the directions in the latent space. Statistically, the use of the L2 norm in LSA corresponds to a Gaussian noise assumption, which is hard to justify in the context of count variables. And there is the polysemy problem.

16 pLSA to the rescue: Probabilistic Latent Semantic Analysis. pLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model.

17 pLSA to the rescue: decomposition into probabilities! The observed word distribution per document decomposes as P(w_j|d_i) = Σ_k P(w_j|z_k) P(z_k|d_i): word distributions per topic, weighted by the topic distribution per document. Slide credit: Josef Sivic.

18 Learning the pLSA parameters: maximize the likelihood of the data using EM; equivalently, minimize the KL divergence between the empirical distribution (the observed counts n(w_i, d_j) of word i in document j) and the model. Unlike LSA, pLSA does not minimize any type of 'squared deviation'; the parameters are estimated in a probabilistically sound way. Slide credit: Josef Sivic.

19 EM for pLSA (training on a corpus). E-step: compute posterior probabilities for the latent variables, P(z_k|d_i, w_j) ∝ P(z_k|d_i) P(w_j|z_k). M-step: maximize the expected complete-data log-likelihood, re-estimating P(w_j|z_k) and P(z_k|d_i) from the expected counts.
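Written out as code, the two steps look roughly as follows; this is a toy NumPy sketch assuming a dense (n_docs x n_words) count matrix, not the authors' implementation.

```python
import numpy as np

def plsa_em(N, K, n_iter=100, seed=0):
    """Fit pLSA to an (n_docs x n_words) count matrix N with K topics via EM."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = N.shape
    p_z_d = rng.random((n_docs, K))          # P(z|d), rows normalized below
    p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((K, n_words))         # P(w|z), rows normalized below
    p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z | d, w) proportional to P(z|d) * P(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # (docs, K, words)
        post /= post.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from the expected counts
        expected = N[:, None, :] * post
        p_w_z = expected.sum(0)
        p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = expected.sum(2)
        p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```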

20 Graphical View of pLSA. pLSA is a generative model: select a document d_i with probability P(d_i); pick a latent class z_k with probability P(z_k|d_i); generate word w_j with probability P(w_j|z_k). In the graphical model, d and w are observed variables, z is latent, and the plates denote replication.

21 How does pLSA deal with previously unseen documents? The "folding-in" heuristic: first train on the corpus to obtain P(w_j|z_k); then re-run the same EM training algorithm on the new document with D = {d_unseen}, but keep P(w_j|z_k) fixed (don't re-estimate it), so only P(z_k|d_unseen) is updated.
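A sketch of the folding-in step under the same conventions as the pLSA sketch above: the trained P(w|z) is held fixed and only the topic mixture of the unseen document is re-estimated. The `fold_in` helper is hypothetical.

```python
import numpy as np

def fold_in(n_new, p_w_z, n_iter=50, seed=0):
    """Estimate P(z|d_unseen) for a new count vector n_new, keeping P(w|z) fixed."""
    rng = np.random.default_rng(seed)
    K = p_w_z.shape[0]
    p_z_d = rng.random(K)
    p_z_d /= p_z_d.sum()
    for _ in range(n_iter):
        # E-step: P(z | d_unseen, w) proportional to P(z|d_unseen) * P(w|z)
        post = p_z_d[:, None] * p_w_z                 # (K, n_words)
        post /= post.sum(0, keepdims=True) + 1e-12
        # M-step: update only P(z|d_unseen); the topic-word distributions stay fixed
        p_z_d = (post * n_new[None, :]).sum(1)
        p_z_d /= p_z_d.sum() + 1e-12
    return p_z_d
```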

22 Problems with pLSA Not a well-defined generative model of documents; d is a dummy index into the list of documents in the training set (as many values as documents) No natural way to assign probability to a previously unseen document Number of parameters to be estimated grows with size of training set

23 LDA to the rescue. Latent Dirichlet Allocation treats the topic mixture weights as a K-parameter hidden random variable and places a Dirichlet prior on the multinomial mixing weights. The Dirichlet distribution is conjugate to the multinomial distribution (the most natural prior to choose: the posterior distribution is also a Dirichlet!). (Figure: pLSA vs. LDA graphical models.)
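To make the generative story concrete, here is a toy sketch of sampling a corpus from LDA; alpha (the Dirichlet parameter) and beta (the topic-word distributions) are assumed given, whereas in practice they are estimated.

```python
import numpy as np

def sample_lda_corpus(alpha, beta, n_docs, doc_len, seed=0):
    """Sample a toy corpus from the LDA generative model.
    alpha: Dirichlet parameter of length K; beta: (K x V) topic-word matrix."""
    rng = np.random.default_rng(seed)
    K, V = beta.shape
    corpus = []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha)                 # per-document topic proportions
        words = []
        for _ in range(doc_len):
            z = rng.choice(K, p=theta)               # topic assignment for this word
            words.append(rng.choice(V, p=beta[z]))   # word drawn from that topic
        corpus.append(words)
    return corpus
```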

24 Corpus-level parameters in LDA. Alpha and beta are corpus-level parameters, sampled once in the process of generating the corpus (they sit outside of the plates!). Alpha and beta must be estimated before we can find the topic mixing proportions of a previously unseen document.

25 Getting rid of plates. (Figure: the LDA graphical model with the plate over z_1…z_N and w_1…w_N unrolled explicitly, alongside the plated version.) Thanks to Jonathan Huang for the un-plated LDA graphic.

26 Inference in LDA. Inference = estimation of the document-level parameters. The posterior is intractable to compute exactly, so we must employ approximate inference.

27 Approximate Inference in LDA. Variational methods: use Jensen's inequality to obtain a lower bound on the log likelihood, indexed by a set of variational parameters. The optimal (document-specific) variational parameters are obtained by minimizing the KL divergence between the variational distribution and the true posterior. Variational methods are one way of doing this; Gibbs sampling (MCMC) is another.
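Since the slide names Gibbs sampling as the alternative, here is a sketch of a collapsed Gibbs sampler for LDA; the hyperparameters `alpha` and `eta` and the iteration count are illustrative assumptions, not settings from either paper.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))                  # document-topic counts
    n_kw = np.zeros((K, V))                          # topic-word counts
    n_k = np.zeros(K)                                # total words per topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_dk[d, z[d][i]] += 1; n_kw[z[d][i], w] += 1; n_k[z[d][i]] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # P(z_i = k | rest) proportional to (n_dk + alpha)(n_kw + eta)/(n_k + V*eta)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    theta = (n_dk + alpha) / (n_dk + alpha).sum(1, keepdims=True)   # P(z|d) estimate
    beta = (n_kw + eta) / (n_kw + eta).sum(1, keepdims=True)        # P(w|z) estimate
    return theta, beta
```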

28 Look at some P(w|z) produced by LDA. Show some pLSA and LDA results applied to text: an LDA project by Tomasz Malisiewicz and Jonathan Huang. Search for the word 'drive'.

29 pLSA and LDA applied to Images. How can one apply these techniques to images?

30 Hierarchical Bayesian text models: Probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2001). (Figures: the pLSA graphical model with nodes d, z, w and the LDA graphical model with nodes c, π, z, w.)

31 Hierarchical Bayesian text models: Probabilistic Latent Semantic Analysis (pLSA) applied to images ("face" example). Sivic et al. ICCV 2005. (Figure: pLSA graphical model with nodes d, z, w.)

32 Hierarchical Bayesian text models: Latent Dirichlet Allocation (LDA) applied to images ("beach" example). Fei-Fei et al. CVPR 2005. (Figure: LDA graphical model with nodes c, π, z, w.)

33 A Bayesian Hierarchical Model for Learning Natural Scene Categories

34 Flow Chart: Quick Overview

35 How to Generate an Image? Choose a scene (mountain, beach, …). Given the scene, generate an intermediate probability vector over 'themes'. For each word: determine the current theme from the mixture of themes, then draw a codeword from that theme.

36 Choose a category label c ~ p(c|η), where η is the (multinomial) prior over scene categories. Choose π ~ p(π|c, θ): π is a multinomial distribution over themes, and θ is a C×K matrix (#categories × #themes) whose row θ_c is the K-dimensional Dirichlet parameter conditioned on the category c. For each of the N patches: choose a theme z_n ~ Mult(π), then choose a patch x_n ~ p(x_n|z_n, β), where β is a matrix of size K×T (#themes × #codewords).
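A toy sketch of this generative process under the notation assumed above (eta as the category prior, theta as the C x K Dirichlet parameters, beta as the K x T theme-to-codeword distributions); it is an illustration, not the authors' code.

```python
import numpy as np

def generate_scene_image(eta, theta, beta, n_patches, seed=0):
    """Sample the codewords of one synthetic image from the scene model."""
    rng = np.random.default_rng(seed)
    c = rng.choice(len(eta), p=eta)        # choose a category label c ~ p(c|eta)
    pi = rng.dirichlet(theta[c])           # theme mixing proportions pi ~ Dir(theta_c)
    codewords = []
    for _ in range(n_patches):
        z = rng.choice(len(pi), p=pi)      # choose a theme z_n ~ Mult(pi)
        codewords.append(rng.choice(beta.shape[1], p=beta[z]))  # codeword x_n ~ Mult(beta_z)
    return c, codewords
```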

37 How to Generate an Image?

38 Inference: how to make a decision on a novel image. Integrate over the latent variables (the theme proportions π and theme assignments z) to get the likelihood of the image under each category, then pick the most probable category. The integral is intractable, so approximate variational inference is used (not easy, but Gibbs sampling is supposed to be easier).

39 Codebook: 174 local image patches (codewords). Detection: evenly sampled grid, random sampling, saliency detector, or Lowe's DoG detector. Representation: normalized 11×11 gray values or 128-dim SIFT.
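A minimal sketch of building such a codebook by k-means clustering of patch descriptors (scikit-learn assumed); the cluster count of 174 mirrors the slide, the rest is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_codewords=174, seed=0):
    """Cluster local patch descriptors (e.g. 128-dim SIFT) into a visual codebook."""
    kmeans = KMeans(n_clusters=n_codewords, n_init=10, random_state=seed)
    kmeans.fit(descriptors)             # descriptors: (n_patches, dim) array
    return kmeans.cluster_centers_      # one codeword per cluster center
```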

40 Results: average performance 64% (confusion matrix); 100 training examples and 50 test examples. Rank statistic test: the probability that a test scene correctly belongs to one of the top N most probable categories.
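One way to compute the rank statistic test, as a hedged sketch: `scores` is assumed to be an (images x categories) matrix of posterior category probabilities for the test set.

```python
import numpy as np

def rank_statistic(scores, true_labels, top_n):
    """Fraction of test scenes whose true category is among the top-N most probable."""
    order = np.argsort(-scores, axis=1)[:, :top_n]
    return np.mean([true_labels[i] in order[i] for i in range(len(true_labels))])
```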

41 Results: The Distributions Theme distribution Codeword distribution

42 The peak at 174

43 Summary of detection and representation choices SIFT outperforms pixel gray values Sliding grid, which creates the largest number of patches, does best

44 Discovering objects and their location in images

45 Visual Words: vector-quantized SIFT descriptors computed on regions. Regions come from elliptical shape adaptation around interest points, and from the maximally stable regions of Matas et al. Both are elliptical regions at twice their detected scale.

46 Building a Vocabulary …

47 Vector quantization: K-means clustering of ~300K regions to get about 1K clusters for each of the Shape Adapted and Maximally Stable region types. Slide credit: Josef Sivic.
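A small sketch of the quantization step itself: assign each descriptor to its nearest cluster center and accumulate a bag-of-visual-words histogram. Brute-force distances are used here for clarity; a real system would use a faster nearest-neighbour search.

```python
import numpy as np

def quantize(descriptors, codebook):
    """Map descriptors to their nearest codewords and return a bag-of-words histogram."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(1)                                # index of nearest codeword
    hist = np.bincount(words, minlength=len(codebook))  # visual-word counts for the image
    return words, hist
```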

48 pLSA Training Sanity Check: Remember what quantities must be estimated?

49 Results #1: Topic Discovery. This is just the training stage: obtain P(z_k|d_j) for each image, then classify the image as containing object k according to the max of P(z_k|d_j) over k. Four object categories plus background.

50 Results #1: Topic Discovery

51 Results #2: Classifying New Images. Object categories are learned on a corpus, then those categories are found in a new image. Remember the index d in the graphical model; anybody remember how this is done?

52 How does pLSA deal with previously unseen documents? The "folding-in" heuristic: first train on the corpus to obtain P(w_j|z_k); then re-run the same EM training algorithm on the new document with D = {d_unseen}, but keep P(w_j|z_k) fixed (don't re-estimate it), so only P(z_k|d_unseen) is updated.

53 Results #2: Classifying New Images Train on one set and test on another

54 Results #3: Segmentation. Localization and segmentation of the object: for a word occurrence in a particular document we can examine the probability of different topics, and find words with P(z_k|d_j, w_i) > 0.8.
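A sketch of this thresholding rule, computing P(z_k | d_j, w_i) from a fitted pLSA model (conventions as in the earlier pLSA sketch); the function name is mine.

```python
import numpy as np

def segment_words(p_z_dj, p_w_z, topic_k, threshold=0.8):
    """Indices of visual words assigned to topic k with P(z_k | d_j, w_i) > threshold.
    p_z_dj: length-K topic distribution of image j; p_w_z: (K, n_words)."""
    post = p_z_dj[:, None] * p_w_z                 # P(z|d_j) * P(w|z)
    post /= post.sum(0, keepdims=True) + 1e-12     # normalize over topics -> P(z|d_j, w)
    return np.where(post[topic_k] > threshold)[0]
```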

55 Results #3: Segmentation Note: words shown are not the most probable words for a topic, but instead they are words that have a high probability of occurring in a topic AND high probability of occurring in the image

56 Results #3: Segmentation and Doublets. Two-class image dataset consisting of half the faces (218 images) and backgrounds (217 images). A 4-topic pLSA model is learned for all training faces and training backgrounds with 3 fixed background topics, i.e. one (face) topic is learned in addition to the three fixed background topics. A doublet vocabulary is then formed from the top 100 visual words of the face topic. A second 4-topic pLSA model is then learned for the combined vocabulary of singlets and doublets with the background topics fixed.

57 Doublets. Face segmentation scores: singletons 0.49, doublets 0.61. Efros: didn't work as much as you'd think.

58 Conclusions. Showed how both papers use bag-of-words approaches. We're now ready to become experts on generative models like pLSA and LDA. Graphical Model Fun! (Carlos Guestrin teaches Graphical Models.)

59 Are you really into Graphical Models? Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

60 References. A Bayesian Hierarchical Model for Learning Natural Scene Categories, L. Fei-Fei and P. Perona, CVPR 2005. Describing Visual Scenes using Transformed Dirichlet Processes, E. Sudderth, A. Torralba, W. Freeman, and A. Willsky, NIPS 2005. Discovering objects and their location in images, J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman, ICCV 2005. Latent Dirichlet Allocation, D. Blei, A. Ng, and M. Jordan, JMLR 2003. Unsupervised Learning by Probabilistic Latent Semantic Analysis, T. Hofmann.

