Mixtures of Gaussians and Advanced Feature Encoding

Mixtures of Gaussians and Advanced Feature Encoding. Computer Vision, Ali Borji, UWM. Many slides from James Hays, Derek Hoiem, Florent Perronnin, and Hervé Jégou.

Why do good recognition systems go bad? E.g., why isn't our Bag of Words classifier at 90% instead of 70%?
- Training data: a huge issue, but not necessarily a variable you can manipulate.
- Learning method: probably not such a big issue, unless you're learning the representation (e.g., deep learning).
- Representation: are the local features themselves lossy? What about feature quantization? That's VERY lossy.

Standard k-means Bag of Words http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

Today's Class: more advanced quantization / encoding methods that represent the state of the art in image classification and image retrieval.
- Soft assignment (a.k.a. Kernel Codebook)
- VLAD
- Fisher Vector
- Mixtures of Gaussians

Motivation: Bag of Visual Words only counts the number of local descriptors assigned to each Voronoi region. Why not include other statistics? http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

We already looked at the Spatial Pyramid (figure: pyramid levels 0, 1, and 2). But today we're not talking about ways to preserve spatial information.

Motivation: Bag of Visual Words only counts the number of local descriptors assigned to each Voronoi region. Why not include other statistics? For instance: the mean of the local descriptors, and the (co)variance of the local descriptors. http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

Simple case: Soft Assignment. Called "kernel codebook encoding" by Chatfield et al. 2011. Cast a weighted vote into the most similar clusters. This is fast and easy to implement (try it for HW 5!), but it does have some downsides for image retrieval: the inverted file index becomes less sparse.
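To make this concrete, here is a minimal numpy sketch of soft-assignment encoding (our own illustration, not code from the slides; the Gaussian kernel bandwidth sigma and all variable names are assumptions):

```python
import numpy as np

def kernel_codebook_encode(descriptors, centroids, sigma=1.0):
    """Soft-assignment ("kernel codebook") encoding.

    descriptors: (T, D) local descriptors for one image.
    centroids:   (K, D) visual words, e.g. learned with k-means.
    Returns a K-dim histogram where each descriptor casts a weighted
    vote into every cluster instead of only its nearest one.
    """
    # Squared Euclidean distance from every descriptor to every centroid: (T, K)
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    d2 -= d2.min(axis=1, keepdims=True)          # stabilize the exponentials
    # Gaussian kernel weights, normalized so each descriptor votes with mass 1
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)
    # Accumulate the soft votes and L1-normalize the histogram
    hist = w.sum(axis=0)
    return hist / hist.sum()
```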

A first example: the VLAD. Given a codebook {μ1, …, μK}, e.g. learned with k-means, and a set of local descriptors {x}:
① assign: each descriptor x is assigned to its nearest centroid μi
② compute: the residual x − μi
③ accumulate: vi = Σ (x − μi) over all x assigned to cell i, then concatenate the vi's (v1, …, v5 in the figure, which uses K=5 centroids) and normalize
Intuition: you're encoding the updates you would make to each centroid if you were running the k-means algorithm.
Jégou, Douze, Schmid and Pérez, "Aggregating local descriptors into a compact image representation", CVPR'10.
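A minimal numpy sketch of the three steps above (an illustration under our own naming, not the authors' implementation):

```python
import numpy as np

def vlad_encode(descriptors, centroids):
    """VLAD: accumulate the residuals x - mu_i in each Voronoi cell.

    descriptors: (T, D) local descriptors; centroids: (K, D) codebook.
    Returns the K*D-dim concatenation of the v_i's, L2-normalized.
    """
    K, D = centroids.shape
    # (1) assign each descriptor to its nearest centroid
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nn = d2.argmin(axis=1)                       # (T,)
    # (2)+(3) sum the residuals x - mu_i per cell
    v = np.zeros((K, D))
    for i in range(K):
        cell = descriptors[nn == i]
        if len(cell) > 0:
            v[i] = (cell - centroids[i]).sum(axis=0)
    # concatenate and normalize
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```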

A first example: the VLAD. A graphical representation of K=16 centroids/clusters: blue is positive, red is negative, displayed in the SIFT layout. Jégou, Douze, Schmid and Pérez, "Aggregating local descriptors into a compact image representation", CVPR'10.

The Fisher vector: score function. Given a likelihood function u_λ with parameters λ, the score function of a given sample X is G_λ^X = ∇_λ log u_λ(X) → a fixed-length vector whose dimensionality depends only on the number of parameters. Intuition: the direction in which the parameters λ of the model should be modified to better fit the data.
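Written out (assuming, as is standard in this line of work, that the sample X = {x_1, …, x_T} is a set of T i.i.d. local descriptors, so the log-likelihood decomposes over patches):

```latex
G_{\lambda}^{X} = \nabla_{\lambda} \log u_{\lambda}(X)
                = \sum_{t=1}^{T} \nabla_{\lambda} \log u_{\lambda}(x_t)
```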

Aside: Mixture of Gaussians (GMM). For Fisher Vector image representations, the likelihood u_λ is a GMM. A GMM can be thought of as "soft" k-means: each component has a mixture weight (0.5, 0.4, and 0.05 in the figure), a mean, and a standard deviation along each direction (or a full covariance). But how do we fit the Gaussian mixture model for our feature encoding? With a procedure very similar to k-means.
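Concretely, with N components the GMM density is (standard definition, in the w, μ, Σ notation of the following slides):

```latex
u_{\lambda}(x) = \sum_{i=1}^{N} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{N} w_i = 1, \quad w_i \ge 0
```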

This looks like a soft version of k-means! The E-step computes soft (rather than hard) assignments of points to components, and the M-step re-estimates each component from its softly-assigned points.
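To make the analogy concrete, a minimal EM sketch for a diagonal-covariance GMM (our own illustration; the initialization and the variance floor are simplifications):

```python
import numpy as np

def em_gmm_diag(X, K, n_iter=100, seed=0):
    """EM for a GMM with diagonal covariances -- a "soft" k-means.

    X: (T, D) data. Returns mixture weights w (K,), means mu (K, D)
    and per-dimension variances var (K, D).
    """
    rng = np.random.default_rng(seed)
    T, D = X.shape
    mu = X[rng.choice(T, size=K, replace=False)]   # init means from the data
    var = np.tile(X.var(axis=0), (K, 1))           # init variances
    w = np.full(K, 1.0 / K)                        # init mixture weights
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. SOFT assignments of points to
        # components (k-means would take a hard argmax here)
        log_p = (np.log(w)[None, :]
                 - 0.5 * (np.log(2 * np.pi * var).sum(axis=1)[None, :]
                          + (((X[:, None, :] - mu[None, :, :]) ** 2)
                             / var[None, :, :]).sum(axis=2)))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)  # (T, K)
        # M-step: weighted versions of the k-means centroid update
        Nk = gamma.sum(axis=0)                     # soft counts per component
        w = Nk / T
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ X ** 2) / Nk[:, None] - mu ** 2
        var = np.maximum(var, 1e-6)               # floor to avoid collapse
    return w, mu, var
```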

The Fisher vector: relationship with the BOW. In the FV formulas, the gradient with respect to the mixture weights w gives, approximately, a soft BOW: the soft assignment of patch t to Gaussian i. The gradients with respect to μ and σ additionally include higher-order statistics (up to order 2). Let D = feature dim and N = # Gaussians: the BOW is N-dim, while the FV is 2DN-dim. Consequently the FV is much higher-dimensional than the BOW for a given visual vocabulary size, and much faster to compute than a BOW of the same feature dimension. Perronnin and Dance, "Fisher kernels on visual categories for image categorization", CVPR'07.
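As an illustration, a sketch of the FV statistics with respect to μ and σ, using the standard closed-form formulas for a diagonal GMM popularized by Perronnin and colleagues (variable names are ours; the parameters w, mu, var can come from the EM sketch above):

```python
import numpy as np

def fisher_vector(X, w, mu, var):
    """Gradients of the log-likelihood w.r.t. mu and sigma of a diagonal GMM.

    X: (T, D) local descriptors; w (N,), mu (N, D), var (N, D): GMM parameters.
    Returns a 2*D*N-dim vector (order-1 and order-2 statistics).
    """
    T, D = X.shape
    # Responsibilities gamma_t(i): soft assignment of patch t to Gaussian i
    log_p = (np.log(w)[None, :]
             - 0.5 * (np.log(2 * np.pi * var).sum(axis=1)[None, :]
                      + (((X[:, None, :] - mu[None, :, :]) ** 2)
                         / var[None, :, :]).sum(axis=2)))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                    # (T, N)
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    # Order-1 statistics: gradient w.r.t. the means
    G_mu = (gamma[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w))[:, None]
    # Order-2 statistics: gradient w.r.t. the standard deviations
    G_sigma = ((gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0)
               / (T * np.sqrt(2 * w))[:, None])
    return np.concatenate([G_mu.ravel(), G_sigma.ravel()])
```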

The Fisher vector: dimensionality reduction on local descriptors. Perform PCA on the local descriptors:
→ uncorrelated features are more consistent with the diagonal assumption on the covariance matrices in the GMM
→ the FK performs a whitening and enhances low-energy (possibly noisy) dimensions
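A minimal scikit-learn sketch of this step (the reduction from 128 to 64 dimensions is a common choice for SIFT, not a number from the slides; the random data is a stand-in for real descriptors):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real 128-D SIFT descriptors sampled from training images
descriptors = np.random.rand(10000, 128)
# Project to uncorrelated dimensions so the diagonal-covariance
# assumption of the GMM is less of a mismatch
pca = PCA(n_components=64).fit(descriptors)
reduced = pca.transform(descriptors)   # (10000, 64)
```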

The Fisher vector: normalization (variance stabilization).
→ Variance-stabilizing transforms of the form z ← sign(z)·|z|^α (with α = 0.5 by default) can be used on the FV (or the VLAD).
→ This reduces the impact of bursty visual elements.
Jégou, Douze, Schmid, "On the burstiness of visual elements", ICCV'09.
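A sketch of this transform (commonly called power normalization), followed by the L2 normalization these pipelines typically apply afterwards:

```python
import numpy as np

def power_l2_normalize(v, alpha=0.5):
    """Signed power (variance-stabilizing) normalization, then L2.

    alpha=0.5 is the default mentioned above; smaller alpha suppresses
    bursty components more aggressively.
    """
    v = np.sign(v) * np.abs(v) ** alpha
    return v / (np.linalg.norm(v) + 1e-12)
```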

Datasets for image retrieval. INRIA Holidays dataset:
- 1491 shots of personal holiday snapshots
- 500 queries, each associated with a small number of relevant results (1–11)
- 1 million distractor images (with some "false" positives)
Hervé Jégou, Matthijs Douze and Cordelia Schmid, "Hamming Embedding and Weak Geometric Consistency for large-scale image search", ECCV'08.

Examples. Retrieval example on Holidays (at 64 feature dimensions):
- second-order statistics are not essential for retrieval
- even for the same feature dim, the FV/VLAD can beat the BOW
- soft assignment + whitening of the FV helps when the number of Gaussians ↑
- after dim-reduction, however, the FV and VLAD perform similarly
From: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, "Aggregating local descriptors into compact codes", TPAMI'11.

Examples. Classification example on PASCAL VOC 2007:

Method                           Feature dim   mAP
VQ  (plain vanilla BOW)          25K           55.30
KCB (BOW with soft assignment)   –             56.26
LLC (BOW with sparse coding)     –             57.27
SV                               41K           58.16
FV  (Fisher vector)              132K          61.69

→ The FV outperforms the BOW-based techniques (VQ, KCB, LLC).
→ Including 2nd-order information is important for classification.
From: Chatfield, Lempitsky, Vedaldi and Zisserman, "The devil is in the details: an evaluation of recent feature encoding methods", BMVC'11.

Packages:
- The INRIA package: http://lear.inrialpes.fr/src/inria_fisher/
- The Oxford package: http://www.robots.ox.ac.uk/~vgg/research/encoding_eval/
- VLFeat does it too! http://www.vlfeat.org

Summary. We've looked at methods to better characterize the distribution of visual words in an image:
- Soft assignment (a.k.a. Kernel Codebook)
- VLAD
- Fisher Vector
Mixtures of Gaussians is conceptually a soft form of k-means which can better model the data distribution.