Published by Kelley Hawkins. Modified over 8 years ago.
1
CVC, June 4, 2012
Image categorization using Fisher kernels of non-iid image models
Gokberk Cinbis, Jakob Verbeek and Cordelia Schmid
LEAR team, INRIA, Grenoble, France
To appear at CVPR, June 2012
2
Can you guess what is behind the masked area?
Obviously yes, since image regions are far from i.i.d.
Yet state-of-the-art image representations implicitly assume i.i.d. data
3
My goals for this talk
Show that current image representations make iid assumptions, and that this is undesirable
Present models that avoid such strong assumptions
Show that the Fisher vectors of such models
► naturally incorporate discounting effects that are usually added in an ad hoc manner, explaining why these have been found successful
► lead to state-of-the-art image categorization performance
4
Fisher vector representation in a nutshell
Proposed by Jaakkola & Haussler, NIPS '99
Use gradient signal of probabilistic model as data representation
► Motivated by the need to represent variably sized objects in a single vector space, such as sequences, sets, trees, graphs, …
Used as feature vector for supervised methods such as classifiers
Learn a (generative) probabilistic model from training data (offline)
For new object x, compute gradient of data log-likelihood
Normalization with the inverse Fisher information matrix F ensures whitening of the data and invariance to re-parametrization of the same probabilistic model
The normalized gradient is the Fisher vector
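The formulas on this slide did not survive extraction. A standard way to write the construction is sketched below. The symbols are chosen here for illustration and are not taken from the slide: θ for the model parameters, g(x) for the gradient, K for the Fisher kernel, φ(x) for the Fisher vector.

```latex
\begin{align}
g(x) &= \nabla_\theta \log p(x \mid \theta), \\
F &= \mathbb{E}_{x \sim p(x \mid \theta)}\!\left[ g(x)\, g(x)^\top \right], \\
K(x, y) &= g(x)^\top F^{-1} g(y), \\
\phi(x) &= F^{-1/2}\, g(x), \qquad K(x, y) = \phi(x)^\top \phi(y).
\end{align}
```

Because the kernel is a dot product of the vectors φ, a linear classifier on φ(x) is equivalent to a kernel classifier with K.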
5
State-of-the-art image representations that make the iid assumption
Bag-of-words histograms (BoW)
► Multinomial model
► Visual word indices are drawn from this multinomial
► Gradient of log-likelihood of indices in an image
Fisher vectors for Mixture of Gaussians (MoG)
► Gaussian over feature space per visual word
► Local (SIFT) descriptors drawn from MoG
► Gradient of log-likelihood of descriptors in image
6
BoW image representation is FV of model with iid assumption
Bag-of-words (BoW) image representation
► Extract local image descriptors
► Quantize into set of “visual word” indices using k-means
► Summarize image content by visual word frequency histogram
Interpretation in terms of Fisher vector framework
► Visual word indices are iid draws from “universal” multinomial
► Gradient of log-likelihood of indices in an image
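To make the Fisher vector interpretation concrete, here is a minimal sketch of the gradient for an iid multinomial model. It assumes a softmax parametrization of the multinomial (an assumption of this sketch, not stated on the slide) and leaves out the Fisher information normalization. The function name and the example counts are hypothetical.

```python
import math

def bow_gradient(counts, logits):
    """Gradient of sum_k n_k * log(pi_k) w.r.t. the logits, with pi = softmax(logits)."""
    m = max(logits)
    exps = [math.exp(g - m) for g in logits]   # shift by max for numerical stability
    z = sum(exps)
    pi = [e / z for e in exps]
    total = sum(counts)
    # d/d logit_k = n_k - N * pi_k: the histogram minus its expectation under the model
    return [n - total * p for n, p in zip(counts, pi)]

counts = [5, 0, 3, 2]              # hypothetical visual word counts for one image
logits = [0.0, 0.0, 0.0, 0.0]      # uniform "universal" multinomial
grad = bow_gradient(counts, logits)
doubled = bow_gradient([2 * n for n in counts], logits)
```

Under this parametrization the gradient is the count histogram up to a shift, so doubling every count doubles the gradient: the representation is linear in the counts.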
7
What's wrong with iid image representations?
Linear classification with BoW histograms:
► Each occurrence of a visual word index leads to the same score increment
► Fisher vector over MoG: similar linear score change as in BoW model
► Classification score proportional to object size!
Retrieval
► Distances of the form d(x,y) = f( |x-y| ) do not discount small changes in large values: | 150 - 160 | = 10 = | 2 - 12 |
► Dot product scoring is linear given the query image, just like the linear classifier case
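The size effect can be checked with a few lines. This is only an illustration: the weights and counts are made up, and the bias term of the classifier is ignored.

```python
def linear_score(weights, histogram):
    """Score of a linear classifier without bias: dot product of weights and counts."""
    return sum(w * h for w, h in zip(weights, histogram))

weights = [0.8, -0.3, 0.1]     # hypothetical classifier weights
small_object = [4, 1, 2]       # hypothetical visual word counts
large_object = [8, 2, 4]       # same content covering twice as many patches

score_small = linear_score(weights, small_object)
score_large = linear_score(weights, large_object)
```

The second score is exactly twice the first, although both histograms describe the same visual content.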
8
Common “trick” to boost performance of iid image representations
Discounting of small changes in large values, limiting the influence of burstiness
► Chi-square distance between vectors
► Hellinger distance: element-wise square-rooting
State-of-the-art in combination with MoG Fisher vectors
[Figure: panels labeled L2, Hellinger, Chi-square]
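A small sketch of the discounting effect, using the per-bin terms of the distances up to constant factors (normalization conventions vary, so the exact scaling here is an assumption). The function names are illustrative.

```python
import math

def abs_diff(x, y):
    """Per-bin term of a distance that depends only on |x - y|."""
    return abs(x - y)

def hellinger_term(x, y):
    """Per-bin term after element-wise square-rooting."""
    return (math.sqrt(x) - math.sqrt(y)) ** 2

def chi_square_term(x, y):
    """Per-bin chi-square term."""
    return (x - y) ** 2 / (x + y)

large_values = (150, 160)
small_values = (2, 12)

plain = (abs_diff(*large_values), abs_diff(*small_values))
hellinger = (hellinger_term(*large_values), hellinger_term(*small_values))
chi_square = (chi_square_term(*large_values), chi_square_term(*small_values))
```

The plain difference is 10 in both cases, while the Hellinger and chi-square terms are much smaller for the pair of large values than for the pair of small values.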
9
But how about Fisher vectors of non-iid models?
Standard BoW: single universal multinomial governs all images
► Sample patches iid from the universal multinomial model
Compound Dirichlet–multinomial model (a.k.a. multivariate Pólya distribution) assumes there is a latent multinomial per image
► First, sample a multinomial image model from Dirichlet prior
► Then, sample each word iid from multinomial image model
► New hyper-parameter alpha
► Latent multinomial generates full dependency across patches in an image
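The two-step generative process can be sketched with the standard library. The Dirichlet draw uses normalized Gamma variates. The vocabulary size, alpha value, and number of patches are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a multinomial parameter vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_image_words(alpha, n_patches, rng):
    """Sample the latent per-image multinomial, then visual word indices iid from it."""
    theta = sample_dirichlet(alpha, rng)
    words = rng.choices(range(len(alpha)), weights=theta, k=n_patches)
    counts = [0] * len(alpha)
    for w in words:
        counts[w] += 1
    return theta, counts

rng = random.Random(0)
alpha = [0.5] * 10                 # hypothetical hyper-parameter, 10 visual words
theta, counts = sample_image_words(alpha, 200, rng)
```

With a small alpha, each sampled image tends to concentrate its counts on a few visual words, which is the burstiness that a single universal multinomial cannot produce.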
10
Latent multinomial generates full dependency across patches
After we observe many patches of road, sky, bike, …, we infer that the multinomial is likely to assign high likelihood to such patches
Therefore, we expect to see even more such patches in the rest of the image
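This dependency shows up in the posterior predictive distribution of the Dirichlet–multinomial model, which for the next word is (n_k + alpha_k) / (N + sum of alpha). A minimal sketch with made-up counts:

```python
def predictive_probs(counts, alpha):
    """Probability of each visual word for the next patch, given the counts seen so far."""
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]

alpha = [1.0, 1.0, 1.0]                       # hypothetical hyper-parameters
before = predictive_probs([0, 0, 0], alpha)   # nothing observed yet
after = predictive_probs([9, 0, 0], alpha)    # nine patches of the first word observed
```

After nine observations of the first word its predictive probability rises from 1/3 to 10/12, whereas under an iid model it would stay unchanged.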
11
But how about Fisher vectors of non-iid models?
BoW: single universal multinomial governs all images
► Sample patches iid from the model
Compound Dirichlet–multinomial model (a.k.a. multivariate Pólya distribution) assumes there is a latent multinomial per image
► Sample a multinomial from Dirichlet prior
► Sample each word iid from multinomial
► New hyper-parameter alpha
► Latent multinomial generates full dependency across patches in an image
► Compute gradient of log-likelihood w.r.t. hyper-parameter
12
Gradient: transformations on counts
Gradient of Pólya distribution given by di-gamma function of count + constant
► Small alpha → very sparse Dirichlet prior → monotone concave, like sqrt
► Large alpha → highly concentrated Dirichlet → linear, like BoW histogram
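A sketch of this gradient using only the standard library. It assumes the gradient is taken directly with respect to alpha (the talk may use a different parametrization) and omits the Fisher information normalization. The `digamma` helper is a simple approximation written for this example, and the counts are made up.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x upward with the recurrence, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def polya_log_likelihood(counts, alpha):
    """Log-likelihood of a word sequence with the given counts under the Polya model."""
    a0, n0 = sum(alpha), sum(counts)
    ll = math.lgamma(a0) - math.lgamma(n0 + a0)
    for n, a in zip(counts, alpha):
        ll += math.lgamma(n + a) - math.lgamma(a)
    return ll

def polya_gradient(counts, alpha):
    """Gradient of the log-likelihood w.r.t. each alpha_k: digamma of count plus constants."""
    a0, n0 = sum(alpha), sum(counts)
    shared = digamma(a0) - digamma(n0 + a0)
    return [shared + digamma(n + a) - digamma(a) for n, a in zip(counts, alpha)]

counts = [0, 1, 5, 50]                                   # hypothetical word counts
grad_small_alpha = polya_gradient(counts, [0.1] * 4)     # strongly concave in the counts
grad_large_alpha = polya_gradient(counts, [100.0] * 4)   # close to linear in the counts
```

Each component depends on its count only through digamma(n_k + alpha_k), which grows roughly like a logarithm for small alpha and roughly linearly, over a moderate range of counts, for large alpha.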
13
Fisher vector image representations for Mixture of Gaussians model
Fisher vectors for Mixture of Gaussians (MoG) [Perronnin & Dance, CVPR'07]
► Gaussian over feature space per visual word
► Local (SIFT) descriptors are iid draws from “universal” MoG
► State-of-the-art representation for image categorization (+sqrt transform)
Gradient of log-likelihood of descriptors in image
► High-dimensional image descriptor: K(2D+1)
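The dimensionality K(2D+1) comes from K mixing weights plus K·D means and K·D variances. As an illustration of one of these blocks, here is a sketch of the gradient with respect to the means for a one-dimensional MoG (D = 1), without the Fisher information normalization. All names and numbers are hypothetical.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mog_log_likelihood(xs, weights, means, variances):
    """Log-likelihood of iid descriptors xs under the mixture."""
    return sum(
        math.log(sum(w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in xs
    )

def mog_mean_gradient(xs, weights, means, variances):
    """Gradient w.r.t. each mean: sum over descriptors of responsibility * (x - mean) / var."""
    grad = [0.0] * len(means)
    for x in xs:
        joint = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        for k, (j, m, v) in enumerate(zip(joint, means, variances)):
            grad[k] += (j / total) * (x - m) / v
    return grad

xs = [-1.2, 0.3, 0.9, 2.5]          # hypothetical 1-D descriptors
weights = [0.4, 0.6]
means = [0.0, 2.0]
variances = [1.0, 0.5]
grad = mog_mean_gradient(xs, weights, means, variances)
```

Every descriptor adds its own term to the sum, so this gradient also grows linearly with the number of descriptors, as in the BoW case.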
14
Latent mixture of Gaussians (MoG) model
To remove the iid assumption we proceed as before:
► Treat image-specific MoG model as latent variable
► Put priors on mixing weights, variances, and means
Generative process per image
► Sample MoG parameters from prior distributions
► Sample descriptors iid from image-specific MoG
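The prior formulas on this slide were lost in extraction. The sketch below illustrates the per-image generative process under assumed conjugate-style priors: a Dirichlet on the mixing weights, independent Gaussians on the means, and Gamma priors on the precisions (inverse variances). These choices, the one-dimensional descriptors, and all parameter values are assumptions of this example, not taken from the slide.

```python
import math
import random

def sample_latent_mog_image(n_desc, alpha, mean_prior, precision_prior, rng):
    """Sample image-specific MoG parameters from assumed priors, then 1-D descriptors iid."""
    k_range = range(len(alpha))
    # Mixing weights ~ Dirichlet(alpha), via normalized Gamma draws
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    # Precisions ~ Gamma(shape, rate); gammavariate takes a scale, hence 1 / rate
    shape, rate = precision_prior
    precisions = [rng.gammavariate(shape, 1.0 / rate) for _ in k_range]
    # Means ~ Normal(m0, s0) with one (m0, s0) pair per component
    means = [rng.gauss(m0, s0) for m0, s0 in mean_prior]
    # Descriptors are iid draws from the image-specific mixture
    components = rng.choices(k_range, weights=weights, k=n_desc)
    descriptors = [rng.gauss(means[k], 1.0 / math.sqrt(precisions[k])) for k in components]
    return weights, means, precisions, descriptors

rng = random.Random(0)
weights, means, precisions, descriptors = sample_latent_mog_image(
    n_desc=100,
    alpha=[1.0, 1.0, 1.0],                            # hypothetical Dirichlet parameters
    mean_prior=[(-2.0, 0.5), (0.0, 0.5), (2.0, 0.5)], # hypothetical (mean, std) per component
    precision_prior=(5.0, 5.0),                       # hypothetical Gamma shape and rate
    rng=rng,
)
```

Descriptors within one image share the sampled parameters, so they are dependent once the parameters are integrated out, just as in the latent BoW model.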
15
Latent mixture of Gaussians model
For this model, computation of the likelihood and its gradient is intractable
Learning is done using a variational EM algorithm
► based on optimizing a variational free-energy bound on the log-likelihood
By constraining the distribution q to have a certain independence structure, tractable learning algorithms can be obtained
We propose using the gradient of the bound as an approximate Fisher vector
► In general, if the bound is tight, then the exact Fisher vector is recovered
► Generates similar discounting effects as observed for latent BoW model
E.g., for mixing weights the same di-gamma function, now applied to soft-counts
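The bound referred to here can be written in its standard form. The notation is chosen for this note and is not from the slide: θ for the latent image-specific parameters, η for the hyper-parameters, q for the variational distribution, and \mathcal{L} for the bound.

```latex
\begin{align}
\mathcal{L}(q, \eta) &= \mathbb{E}_{q(\theta)}\!\left[ \log p(x, \theta \mid \eta) \right] + H(q) \\
  &= \log p(x \mid \eta) - \mathrm{KL}\!\left( q(\theta) \,\|\, p(\theta \mid x, \eta) \right)
  \;\le\; \log p(x \mid \eta), \\
\nabla_\eta \log p(x \mid \eta) &\approx \nabla_\eta \mathcal{L}(q, \eta).
\end{align}
```

When q equals the true posterior the KL term vanishes, the bound is tight, and the gradient of the bound with q held fixed equals the gradient of the log-likelihood.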
16
Experimental evaluation on PASCAL VOC'07 benchmark
17
Experimental evaluation on image categorization task
Data set: PASCAL VOC 2007
► Images labeled for presence of 20 object categories: airplane, bicycle, boat, bus, car, cat, cow, dog, horse, motorbike, person, …
► 5000 images to train models, and 5000 images used for evaluation
Performance measured in mean Average Precision over the 20 classes
SIFT descriptors computed over dense multi-scale grid, PCA to 80 dim
To incorporate spatial layout, image representations are computed over
► Complete image, 4 quadrants, 3 horizontal bands
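For a sense of scale, the K(2D+1) dimensionality of the MoG Fisher vector can be combined with this setup. The sketch assumes the per-region vectors are concatenated, and K = 256 is a hypothetical vocabulary size (the experiments vary it).

```python
def fisher_vector_dim(n_components, descriptor_dim):
    """K mixing weights plus K*D means and K*D variances: K(2D+1)."""
    return n_components * (2 * descriptor_dim + 1)

N_REGIONS = 1 + 4 + 3      # complete image, 4 quadrants, 3 horizontal bands
D = 80                     # SIFT descriptors after PCA
K = 256                    # hypothetical number of Gaussians

per_region = fisher_vector_dim(K, D)
full = N_REGIONS * per_region
```

With these numbers a single region gives 41,216 dimensions and the eight regions together give 329,728.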
18
Evaluation: bag-of-words models
Comparing linear classifiers based on
► BoW histogram, sqrt of BoW histogram, latent BoW model Fisher vector
► Varying vocabulary size, and use of spatial pyramid (SPM)
Latent BoW model and sqrt transform lead to comparable improvement
19
Evaluation: latent mixture of Gaussians model
Comparing linear classifiers based on
► Fisher vector of MoG model, sqrt of MoG FV, latent MoG model FV
► Varying vocabulary size, and use of spatial pyramid (SPM)
Latent MoG model and sqrt transform lead to comparable improvement
State-of-the-art performance without including ad hoc transformations
SPM beaten by features?!
20
Conclusions
We propose non-iid models for image patches
► Treating parameters of conventional models as latent variables
► Use gradient with respect to hyper-parameters instead
► Corresponding Fisher vectors naturally incorporate discounting effects that were previously applied in an ad hoc manner (sqrt, chi-square)
Our models explain why such transformations have proven successful, since they correspond to more realistic models that do not make iid assumptions
We have shown that the variational free-energy bound can be used to successfully approximate Fisher vectors of intractable models
The same principle was also applied to topic/aspect models (PLSA, LDA), which also leads to improved performance (in paper, not in presentation)
We believe that the recipe “generative model + FV = image representation” can be used to obtain better image representations by thinking about better models for, e.g., spatial layout and co-occurrence among visual words