
1 NICTA SML Seminar, May 26, 2011
Modeling spatial layout for image classification
Jakob Verbeek 1, joint work with Josip Krapac 1 & Frédéric Jurie 2
1: LEAR team, INRIA, Grenoble, France
2: GREYC lab, University of Caen, Caen, France

2 Layout of this presentation
Introduction
► Image classification
► Bag-of-word representation
► Modeling layout with spatial pyramids
Fisher kernels to encode spatial layout
► Bag-of-word models for appearance
► Modeling both appearance and layout with the Fisher kernel
Experimental results
► 15-Scenes data set
► Pascal VOC '07
Conclusion

3 Image classification
Goal: predict relevant class labels for an image
► Possibly multiple labels are relevant
► Each label can be binary (car / no car) or multi-valued (winter, spring, ...)
Applications include:
► Keyword-based search on non-annotated images
► Image-based search for images with similar high-level content
► ...
Examples: Tree: yes (/no); Cat: no (/yes); Season: summer (/winter/spring/autumn)

4 Bag-of-word image representation [Sivic et al., ICCV '03; Csurka et al., '04]
Inspired by text classification
► Ignores the structure of the text
► Classifiers perform remarkably well using simple word counts
Applying BoW to images:
► Basic elements are small image patches (e.g. sampled from a multi-scale grid)
► Compute a (somewhat invariant) feature vector from each patch (e.g. SIFT)
► Assign each feature vector to a "visual word" in a "visual dictionary"
► Compute the visual word count histogram
Obtaining visual words:
► Cluster a large set of patches off-line, using k-means or a mixture of Gaussians (MoG)
► Assign new patches to the pre-trained clusters
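The BoW steps above can be sketched in a few lines. This is a minimal illustration, assuming patch descriptors have already been extracted (toy 2-D vectors stand in for SIFT) and the vocabulary centroids come from an off-line clustering step; all names here are illustrative:

```python
import numpy as np

def assign_words(descriptors, centroids):
    """Assign each patch descriptor to its nearest centroid (visual word)."""
    # Squared Euclidean distance from every descriptor to every centroid.
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def bow_histogram(descriptors, centroids):
    """Bag-of-words count histogram over the K visual words."""
    words = assign_words(descriptors, centroids)
    return np.bincount(words, minlength=len(centroids))

# Toy example: 2-D "descriptors" around a 3-word vocabulary.
rng = np.random.default_rng(0)
vocab = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
patches = rng.normal(vocab[rng.integers(0, 3, 50)], 0.5)
hist = bow_histogram(patches, vocab)  # counts per visual word, summing to 50
```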

5 Examples of patches assigned to several clusters

6 Encoding image layout using spatial pyramids [Lazebnik et al., CVPR '06]
► The standard bag-of-word representation ignores all structure
► Capture layout by counting each visual word in different spatial cells
► Images are more similar when they have more matches in smaller cells
► Concatenate the C histograms of K words into one vector of length KC
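The pyramid construction can be sketched as follows; a hypothetical helper assuming hard word assignments and patch coordinates normalized to [0, 1), with a two-level pyramid (whole image plus a 2x2 grid, so C = 5 cells):

```python
import numpy as np

def spatial_pyramid(words, locations, K, levels=2):
    """Concatenate per-cell word histograms over a spatial pyramid.

    words: (N,) visual-word index per patch
    locations: (N, 2) patch coordinates, normalized to [0, 1)
    Returns a vector of length K * C, with C the total number of cells.
    """
    hists = []
    for level in range(levels):
        g = 2 ** level                                   # g x g grid at this level
        cell = np.floor(locations * g).astype(int).clip(0, g - 1)
        cell_idx = cell[:, 0] * g + cell[:, 1]           # flat cell index per patch
        for c in range(g * g):
            hists.append(np.bincount(words[cell_idx == c], minlength=K))
    return np.concatenate(hists)

words = np.array([0, 1, 1, 2])
locs = np.array([[0.1, 0.1], [0.9, 0.9], [0.6, 0.2], [0.2, 0.8]])
v = spatial_pyramid(words, locs, K=3)   # (1 + 4) cells x 3 words -> length 15
```

Each patch is counted once per level, which is why the final vector grows linearly in C.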

7 Spatial pyramid representation trade-off
Need many spatial cells for a fine modeling of layout
► Given only the counts, patches of a visual word can appear anywhere within a spatial cell
Histograms become increasingly sparse
► The number of patches sampled from an image is often 1k–10k
► The visual vocabulary is often of size 100–10k
► Patches assigned to a visual word tend to appear close together in the image
► Result: many cells will be empty; in practice few cells are used (C < 20)
The C counts per visual word are generally not used optimally. Can layout be coded more effectively?

8 Fisher kernel framework [Jaakkola & Haussler, NIPS '99]
Idea: use generative probabilistic models as feature extractors for discriminative models
► Train a generative model p(x; θ) using maximum likelihood estimation
► Use the trained model to represent data with the Fisher score vector g(x) = ∇_θ log p(x; θ)
► This maps an object x (sets, sequences, ...) to a vector of the same size as θ
Define the Fisher kernel as K(x, y) = g(x)ᵀ I⁻¹ g(y)
► where I = E[g(x) g(x)ᵀ] is the Fisher information matrix
► K is invariant to re-parametrization
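As a toy instance of the framework (a sketch, not the image model used later), consider a 1-D Gaussian with fixed sigma and parameter θ = μ, where the score and the Fisher information have closed forms:

```python
# For a 1-D Gaussian with fixed sigma and parameter theta = mu:
#   log p(x) = -0.5 * log(2*pi*sigma^2) - (x - mu)^2 / (2*sigma^2)
#   score:  g(x) = d/dmu log p(x) = (x - mu) / sigma^2
#   Fisher information:  I = E[g(x)^2] = 1 / sigma^2
def fisher_score(x, mu, sigma):
    return (x - mu) / sigma ** 2

def fisher_kernel(x, y, mu, sigma):
    """K(x, y) = g(x) * I^{-1} * g(y), scalar since theta is 1-D."""
    inv_I = sigma ** 2
    return fisher_score(x, mu, sigma) * inv_I * fisher_score(y, mu, sigma)

k = fisher_kernel(2.0, 3.0, mu=1.0, sigma=2.0)
```

Note how the I⁻¹ factor makes the kernel value independent of how the parameter is scaled, the re-parametrization invariance mentioned above.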

9 Applying the Fisher kernel framework to BoW models
Each patch extracted from the image is characterized by the discrete visual word w it is mapped to and its location l in the image plane: f = (w, l)
Define a generative model using:
► A multinomial distribution over visual words w, π_k = exp(α_k) / Σ_j exp(α_j); the softmax formulation avoids constraints on the parameters
► A Gaussian density on the location l given w
The gradient of the average log-likelihood for the multinomial is n_k / N − π_k
► The difference between the word's frequency in the image and its learned frequency
► A "shifted" bag-of-word histogram
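The "shifted histogram" gradient can be checked numerically. A small sketch assuming hard word counts and the softmax parametrization above (function names are illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())        # shift for numerical stability
    return e / e.sum()

def bow_fisher_gradient(word_counts, alpha):
    """Gradient of the average log-likelihood of a softmax multinomial.

    d/d alpha_k of (1/N) sum_n log pi_{w_n}  =  n_k / N - pi_k :
    the image's word frequencies minus the learned frequencies,
    i.e. a 'shifted' bag-of-words histogram.
    """
    freq = word_counts / word_counts.sum()
    return freq - softmax(alpha)

# Uniform learned model (alpha = 0 gives pi_k = 1/K) on 4 words.
grad = bow_fisher_gradient(np.array([2.0, 1.0, 1.0, 0.0]), np.zeros(4))
```

The entries sum to zero, as they must for a gradient with respect to softmax parameters.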

10 Applying the Fisher kernel framework to BoW models (continued)
For a single patch, the gradient for the complete model is obtained with:
► q_nk: indicator of the visual word index
► l_nk = l_n − m_k: patch location relative to the word's spatial mean
This codes 0th, 1st, and 2nd order statistics of the patches assigned to each visual word
► Same size as using 5 cells in a spatial pyramid (whole image + 4 quadrants)
► All entries carry information
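The per-word statistics can be sketched as follows. This is a simplified illustration with hard assignments and without the Fisher-information normalization; the function and variable names are assumptions, not the authors' code:

```python
import numpy as np

def spatial_fisher_stats(words, locations, pi, means):
    """Per-word 0th/1st/2nd order location statistics (unnormalized).

    words: (N,) visual-word index per patch (hard assignment q_nk)
    locations: (N, d) patch locations; means: (K, d) spatial means m_k
    With l_nk = l_n - m_k, each word contributes 1 + 2d numbers, so the
    full vector has K * (1 + 2d) entries: 5K for d = 2, i.e. the size
    of a 5-cell spatial pyramid.
    """
    N, d = locations.shape
    parts = []
    for k in range(len(pi)):
        lk = locations[words == k] - means[k]   # l_nk for word k's patches
        s0 = len(lk) / N - pi[k]                # 0th order: shifted count
        s1 = lk.sum(axis=0) / N                 # 1st order moment
        s2 = (lk ** 2).sum(axis=0) / N          # 2nd order moment
        parts.append(np.concatenate(([s0], s1, s2)))
    return np.concatenate(parts)

words = np.array([0, 0, 1])
locs = np.array([[0.2, 0.2], [0.4, 0.2], [0.8, 0.8]])
v = spatial_fisher_stats(words, locs,
                         pi=np.array([0.5, 0.5]),
                         means=np.array([[0.3, 0.2], [0.8, 0.8]]))
```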

11 Cartoon comparison with spatial pyramid

12 Extension 1: Using a spatial mixture model
► A single Gaussian location model just extracts 1st and 2nd order moments
► To encode the spatial layout more finely, define a mixture of Gaussians (MoG) density over locations
This codes 0th, 1st, and 2nd order statistics of the patches assigned to each visual word
► As before, q_nk is the indicator of the visual word index
► A spatial soft-assign distributes each patch over the mixture components
► Moments are accumulated per patch in each spatial region
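The spatial soft-assign can be sketched as the usual MoG responsibilities computed over patch locations; this assumes diagonal covariances and is only an illustrative helper:

```python
import numpy as np

def spatial_soft_assign(locations, weights, means, variances):
    """Responsibility r_nc of each spatial Gaussian c for each patch n.

    locations: (N, d); weights: (C,); means: (C, d); variances: (C, d)
    diagonal covariances. Returns an (N, C) matrix whose rows sum to 1.
    """
    diff2 = (locations[:, None, :] - means[None, :, :]) ** 2
    # Diagonal-Gaussian log-densities, summed over dimensions.
    logp = -0.5 * (diff2 / variances + np.log(2 * np.pi * variances)).sum(-1)
    logp += np.log(weights)
    logp -= logp.max(axis=1, keepdims=True)        # numerical stability
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)

locs = np.array([[0.1, 0.1], [0.9, 0.9]])
r = spatial_soft_assign(locs,
                        weights=np.array([0.5, 0.5]),
                        means=np.array([[0.1, 0.1], [0.9, 0.9]]),
                        variances=np.full((2, 2), 0.05))
```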

13 Extension 2: Appearance Fisher score coding [Perronnin et al., CVPR '07]
The same observations made for the spatial layout also apply to the appearance space
► Recall: the appearance space is quantized with k-means
► The BoW histogram of an image only captures the count of patches in each cell
► There is no information on where in the cell the patch appearances lie
To encode appearance more precisely:
1) Use more cells: a sparser representation
2) Code more information per cell

14 Extension 2: Spatial-appearance Fisher score coding
Define a generative model over pairs of:
► Appearance feature vector x_n
► Location of the patch in the image plane l_n
The gradients are similar to before; only redefine q_nk = p(w = k | x_n, l_n)

15 Recap of the different image representations discussed
To encode spatial layout we considered:
► Spatial pyramid
► Fisher score using a MoG over location
To encode appearance we considered:
► BoW using k-means
► Fisher score using a MoG over appearance
What is the size of these image representations?
► K: number of appearance clusters
► D: number of appearance dimensions
► C: number of spatial cells
► d: number of spatial dimensions (2)
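Under a diagonal-covariance assumption, the sizes can be tallied as one mixing-weight gradient plus mean and variance gradients per modeled dimension and visual word. These formulas are a back-of-the-envelope sketch, not quoted from the slides:

```python
def representation_sizes(K, D, C, d=2):
    """Rough sizes of the representations discussed (diagonal covariances).

    Each Fisher vector stores, per visual word, one mixing-weight
    gradient plus a mean and a variance gradient per modeled dimension.
    """
    return {
        "BoW + spatial pyramid": K * C,                   # K counts per cell
        "BoW + spatial Fisher": K * (1 + 2 * d),          # location moments per word
        "MoG + spatial pyramid": C * K * (1 + 2 * D),     # appearance FV per cell
        "MoG + spatial Fisher": K * (1 + 2 * D + 2 * d),  # joint model, one "cell"
    }

sizes = representation_sizes(K=100, D=128, C=5)
```

With d = 2, the spatial Fisher vector for BoW has 5K entries, matching the earlier remark that it is the size of a 5-cell pyramid.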

16 Experimental evaluation
Model training procedure:
► Learn the appearance MoG on 500k SIFT vectors
► Use a shared "default" spatial MoG
► Training a location model per visual word does not help
Approximate normalization:
► There is no analytical form for the Fisher information matrix I of a MoG
► I can be estimated from a sample, but may be very large (200k x 200k ≈ 160 GB)
► Use a sample estimate of the diagonal instead, which whitens the representation
Classifier training:
► Kernelized multi-class logistic discriminant or binary SVM
► Intersection kernel on BoW+SPM representations
► Models with Fisher vectors use linear classifiers
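The diagonal normalization can be sketched as follows, assuming the Fisher scores of a training set are stacked in a matrix (an illustrative helper, with a small epsilon for numerical safety):

```python
import numpy as np

def whiten_fisher_vectors(G):
    """Normalize Fisher scores with a diagonal sample estimate of I.

    G: (num_images, dim) matrix of Fisher score vectors. The full
    Fisher information matrix E[g g^T] is dim x dim and far too large
    to store, so keep only its diagonal E[g_i^2] and divide each
    coordinate by its root, which whitens the representation.
    """
    diag_I = (G ** 2).mean(axis=0)
    return G / np.sqrt(diag_I + 1e-12)

rng = np.random.default_rng(1)
G = rng.normal(size=(200, 4)) * np.array([1.0, 2.0, 3.0, 4.0])
W = whiten_fisher_vectors(G)  # each coordinate now has unit mean square
```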

17 The 15-Scenes data set
► Vertical spatial structure: floor/ceiling, ground/sky
► Horizontal spatial structure: center/off-center

18 Scene classification accuracy
BoW appearance:
► SPM benefits from more cells
► SFV is better, and a single cell suffices
MoG for appearance:
► Better than BoW; few visual words are needed
► SFV with a single cell is as good as SPM with 5 cells

19 PASCAL Visual Object Classes (VOC) Challenge 2007
► Intra-class variations: position, scale, viewpoint, occlusion, deformation, "types"
► Spatial structure: "scene context"

20 PASCAL VOC Challenge 2007 – mean Average Precision
Similar conclusions:
► SPM benefits from more cells
► SFV is better for a given number of cells
► A single cell suffices for larger numbers of visual words
MoG for appearance:
► Better than BoW
► SFV is more compact than SPM (representation sizes in the figure: 129k, 27k, 40k, 10k)

21 PASCAL Visual Object Classes (VOC) Challenge 2007
► Spatial distribution of the patches assigned to the 5 most frequent visual words in the image
► Patches are more localized with large vocabularies: efficient for SFV, inefficient for SPM

22 Conclusions
We introduced the use of Fisher kernels to encode image layout
► Applies to the BoW appearance representation
► And to Fisher kernel appearance coding
An alternative to spatial pyramids:
► 1st and 2nd order spatial moments instead of rigid spatial cells
► BoW: similar representation size; can use a linear instead of an intersection kernel
► MoG: similar performance, but more compact (roughly 5-10x)
No additional overhead in training
► A "default" spatial model is as good as specifically trained ones
Future work:
► Using more advanced generative models
► Applications in video analysis
