Discovering Objects and their Location in Images

Josef Sivic (1), Bryan C. Russell (2), Alexei A. Efros (3), Andrew Zisserman (1) and William T. Freeman (2)
(1) Oxford University   (2) MIT   (3) Carnegie Mellon University

Goal: discover visual object categories and their segmentation given a collection of unlabelled images.

Introduction

Approach:
1) Represent an image as a collection of visual words.
2) Apply topic discovery models from statistical text analysis.

Overview: find visual words -> form histograms -> discover topics.

Image representation
- Detect affine covariant regions [Mikolajczyk and Schmid '02, Schaffalitzky and Zisserman '02, Matas et al. '02, Lowe '99, Sivic and Zisserman '03].
- Represent each region by a SIFT descriptor.
- Build a visual vocabulary by k-means clustering (K ~ 1,000) and assign each region to the nearest cluster centre.
- Represent each image as a histogram of "visual words" (a code sketch of this pipeline follows the method sections below).
- Examples of visual words: five samples from a 'motorbike' visual word and five samples from an 'airplane' visual word.
- Visual polysemy: a single visual word occurring on different (but locally similar) parts of different object categories.
- Visual synonyms: two different visual words representing a similar part of an object (e.g. the wheel of a motorbike).

The topic discovery models
- Probabilistic Latent Semantic Analysis (pLSA) [Hofmann '99]. In the pLSA graphical model, w denotes visual words, d denotes documents (images) and z denotes topics ('objects'); P(z|d) and P(w|z) are multinomial distributions. Find topic vectors P(w|z) common to all documents and mixture coefficients P(z|d) specific to each document; fit the model by maximizing the likelihood of the data using EM (see the sketch below).
- Latent Dirichlet Allocation (LDA) [Blei et al. '03]. In the LDA graphical model, the multinomial weights over topics are treated as random variables; the model is fitted using Gibbs sampling [Griffiths and Steyvers '04]. Results are shown only for pLSA; LDA had very similar performance.

Segmentation
- For a given word w_i in document d_j, examine the posterior probability over topics; visual words are colour coded according to the topic with the highest probability.
- Example motorbike, airplane and face segmentations.

Improving localization using doublets
- Form a new vocabulary from pairs of locally co-occurring regions (doublet formation; doublet examples I and II).
- Doublet segmentation vs. singlet segmentation.
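The image-representation steps above can be summarised in code. The sketch below is a minimal illustration, not the authors' implementation: it uses OpenCV's SIFT detector as a stand-in for the affine covariant region detectors cited above, scikit-learn's KMeans for the roughly 1,000-word vocabulary, and hypothetical function names.

```python
# Minimal bag-of-visual-words sketch (illustrative, not the authors' pipeline).
# SIFT here stands in for the affine covariant detectors cited on the poster.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_paths):
    """Detect regions and compute one 128-d SIFT descriptor per region."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None else np.zeros((0, 128), np.float32))
    return per_image

def build_vocabulary(per_image_desc, n_words=1000, seed=0):
    """k-means clustering of all descriptors gives the visual word vocabulary."""
    all_desc = np.vstack(per_image_desc)
    return KMeans(n_clusters=n_words, n_init=4, random_state=seed).fit(all_desc)

def word_histogram(desc, vocabulary):
    """Assign each region to its nearest cluster centre and count occurrences."""
    n_words = vocabulary.cluster_centers_.shape[0]
    if len(desc) == 0:
        return np.zeros(n_words)
    words = vocabulary.predict(desc)
    return np.bincount(words, minlength=n_words).astype(float)
```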
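The pLSA model explains each image's word histogram as a mixture, P(w|d) = sum_z P(w|z) P(z|d). Below is a toy dense NumPy sketch of the EM fitting, of classification by the highest P(z|d), and of the per-word topic posterior used for segmentation; the function names and the dense array layout are assumptions made for illustration, not the authors' code.

```python
# Toy pLSA fitted by EM on a document-word count matrix (a sketch, not the
# authors' code).  counts[d, w] is the count of visual word w in image d.
import numpy as np

def plsa_em(counts, n_topics, n_iter=100, seed=0, eps=1e-12):
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics));  p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z | w, d) for every (d, w) pair
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (d, z, w)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate both multinomials from expected counts
        expected = counts[:, None, :] * post                 # shape (d, z, w)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_w_z, p_z_d

def classify(p_z_d):
    """Assign each image to the topic with the highest P(z|d)."""
    return p_z_d.argmax(axis=1)

def word_topic_posterior(p_w_z, p_z_d, d, w):
    """Posterior over topics for word w in image d, used for segmentation."""
    post = p_z_d[d] * p_w_z[:, w]
    return post / post.sum()
```

For the Caltech experiment described below, this fitting would be run with K = 5, 6 or 7 topics over the word histograms of all images.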
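The poster defines doublets only as pairs of locally co-occurring regions; one plausible pairing rule, used purely for illustration, is to join each region's visual word with the word of its spatially nearest neighbouring region.

```python
# Illustrative doublet formation: pair each region's visual word with the
# word of its nearest neighbouring region (an assumed reading of "locally
# co-occurring"; the exact pairing rule is not given on the poster).
import numpy as np

def form_doublets(positions, words):
    """positions: (N, 2) region centres, words: (N,) visual word indices.
    Returns a list of unordered (word_a, word_b) doublet labels."""
    doublets = []
    for i in range(len(words)):
        # squared distances to every other region centre
        d2 = np.sum((positions - positions[i]) ** 2, axis=1)
        d2[i] = np.inf                      # exclude the region itself
        j = int(np.argmin(d2))              # nearest neighbouring region
        doublets.append(tuple(sorted((int(words[i]), int(words[j])))))
    return doublets
```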
Experiment I: Caltech dataset
- Image classification over four object categories: faces, motorbikes, airplanes and cars (rear), 3,190 images in total, plus 900 background images.
- pLSA model fitting: learn K = 5, 6, 7 topics and assign each image to the topic with the highest P(z|d).
- Results: confusion tables for the K = 5, 6, 7 learned topics (faces, motorbikes, airplanes, cars, background I, background II, background III).
- The background is better modelled by multiple topics; pre-learning background topics on a separate background dataset improves results.
- Performance on novel images is comparable with the weakly supervised method of [Fergus et al. '03].

Experiment II: MIT dataset
- Example images with multiple objects; learn 10 topics.
- 4 of the 10 learned topics, shown by the 5 most probable images for each topic: "Buildings", "Trees / Grass", "Bookshelves", "Computers".
- Singlet segmentation vs. all detected visual words.

Experiment III: Application to image retrieval
- Learn topic vectors on the Caltech database and represent a new query image in terms of the learned topic vectors.
- Retrieval within the Caltech database: retrieved images using raw visual word histograms vs. retrieved images using the pLSA 'object' coefficients P(z|d).
- Retrieval in the movie Pretty Woman (6,641 keyframes): each keyframe is represented using the topic vectors learned on the Caltech database; for a given query image, pLSA retrieval is compared against raw word histograms on a precision-recall plot.
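Representing a new query image in terms of the learned topic vectors amounts to holding P(w|z) fixed and estimating only P(z|d) for the query, then ranking database images (or movie keyframes) by the similarity of their topic vectors. The sketch below is one plausible reading of that step; the fold-in iteration, the cosine ranking and the function names are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch of retrieval with topic vectors: fold a new image into a fixed
# pLSA model (P(w|z) held constant), then rank database images by the
# similarity of their P(z|d) vectors.  Illustrative assumptions throughout.
import numpy as np

def fold_in(query_counts, p_w_z, n_iter=50, seed=0, eps=1e-12):
    """Estimate P(z|d) for one new word histogram, keeping P(w|z) fixed."""
    rng = np.random.default_rng(seed)
    p_z = rng.random(p_w_z.shape[0]); p_z /= p_z.sum()
    for _ in range(n_iter):
        joint = p_z[:, None] * p_w_z                        # shape (z, w)
        post = joint / (joint.sum(axis=0, keepdims=True) + eps)
        p_z = (query_counts[None, :] * post).sum(axis=1)
        p_z /= p_z.sum() + eps
    return p_z

def rank_by_topic_vector(query_p_z, database_p_z_d):
    """Rank database images by cosine similarity of their P(z|d) vectors."""
    q = query_p_z / (np.linalg.norm(query_p_z) + 1e-12)
    db = database_p_z_d / (np.linalg.norm(database_p_z_d, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(db @ q))        # most similar images first
```

The same two functions would serve both the within-Caltech retrieval and the Pretty Woman keyframe retrieval compared on the precision-recall plot.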