1
Segmentation Driven Object Detection with Fisher Vectors
Jakob Verbeek LEAR team, INRIA, Grenoble, France To appear at ICCV 2013, joint work with Gokberk Cinbis & Cordelia Schmid
2
Object detection Determine if and where in an image instances of an object category appear Training data: object instances given by bounding boxes Prediction: list of bounding boxes with detection confidence Example images with instances of the category “person”
3
Challenging factors Intra-class appearance variation
Deformable objects: e.g. animals Sub-categories: e.g. ferry vs yacht Scene composition Occlusions: e.g. tables and chairs Clutter: coincidental image content Imaging conditions viewpoint, scale, lighting conditions
4
Segmentation Driven Object Detection with Fisher Vectors
Fisher vector image representation State-of-the-art feature aggregation in video and image classification Recently used for object detection [Chen et al, CVPR'13], missing important non-linear normalizations of the FV Segmentation-based candidate windows [Van de Sande et al., ICCV'11] Feature extraction with approximate object masks Suppression of background clutter contained in windows Unsupervised and class-independent Experimental evaluation results
5
Fisher vector image representation
Using generative models as a feature extraction engine [Jaakkola, Haussler, NIPS 1999] Gradient of data log-likelihood as representation Maps arbitrary data types to finite dimensional vector space Gaussian mixture models for local image descriptors [Perronnin, Dance, CVPR 2007] State-of-the-art feature pooling for image/video classification/retrieval Offline: Train GMM on large collection of local features, e.g. SIFT Representation: gradient of log-likelihood of descriptors in given image High dimensionality: 2KD when skipping gradient w.r.t. mixing weights
$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \sigma_k)$
$\nabla_{\mu_k} \ln p(x_{1:N}) = \frac{1}{\pi_k} \sum_{n=1}^{N} p(k \mid x_n) \, \frac{x_n - \mu_k}{\sigma_k}$
$\nabla_{\sigma_k} \ln p(x_{1:N}) = \frac{1}{\pi_k} \sum_{n=1}^{N} p(k \mid x_n) \left[ \left( \frac{x_n - \mu_k}{\sigma_k} \right)^2 - 1 \right]$
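The gradients above can be sketched directly in code. This is a toy illustration, not the paper's implementation: the GMM parameters are random stand-ins (a real system would train the GMM offline on millions of SIFT descriptors), and the common per-component scaling factors from the Fisher information matrix are omitted here.

```python
# Toy Fisher vector: gradient of the descriptors' log-likelihood w.r.t.
# the means and standard deviations of a diagonal-covariance GMM.
import numpy as np

def fisher_vector(X, pi, mu, sigma):
    """X: N x D local descriptors; pi, mu, sigma: K-component GMM."""
    diff = X[:, None, :] - mu[None, :, :]                  # N x K x D
    # log p(k | x_n), up to a constant that cancels after normalization
    log_p = (-0.5 * np.sum((diff / sigma) ** 2, axis=2)
             - np.sum(np.log(sigma), axis=1) + np.log(pi))
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)                # N x K posteriors
    # gradient blocks: one D-dim vector per Gaussian, for mu and sigma
    g_mu = np.einsum('nk,nkd->kd', post, diff / sigma) / pi[:, None]
    g_sig = np.einsum('nk,nkd->kd', post, (diff / sigma) ** 2 - 1) / pi[:, None]
    return np.concatenate([g_mu.ravel(), g_sig.ravel()])   # 2KD dimensions

rng = np.random.default_rng(0)
K, D = 4, 8
fv = fisher_vector(rng.normal(size=(50, D)),
                   np.full(K, 1.0 / K),
                   rng.normal(size=(K, D)),
                   np.ones((K, D)))
print(fv.shape)  # (2*K*D,) = (64,)
```

With K=64 Gaussians and D=64-dimensional descriptors, as in the talk, the same code yields the 8192-dimensional representation discussed later.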
6
Illustration of gradient w.r.t. means of Gaussians
7
Normalization of the Fisher vector
Inverse Fisher information matrix F Renders the dot-product invariant to re-parametrization Linear projection, L analytically approximated as diagonal [Jaakkola, Haussler, NIPS 1999] Power-normalization Renders Fisher vector less sparse Corrects over-counting due to independence assumption on local features [Perronnin, Sanchez, Mensink, ECCV'10], [Cinbis, Verbeek, Schmid, CVPR '12] L2-normalization Makes representation invariant to number of local features Among other Lp norms the most effective with linear classifier [Sanchez, Perronnin, Mensink, Verbeek IJCV'13]
$k(x, y) = x^T F^{-1} y$, implemented as $\hat{x} = L x$ with $F^{-1} = L^T L$
$\hat{x} = \mathrm{sign}(x) \, |x|^{\rho}, \quad 0 < \rho < 1$
$\hat{x} = x / \sqrt{x^T x}$
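The power- and L2-normalization steps are simple enough to state as a few lines of code; a minimal sketch, with rho=0.5 as the commonly used exponent:

```python
import numpy as np

def normalize_fv(fv, rho=0.5):
    """Power-normalize then L2-normalize a Fisher vector."""
    fv = np.sign(fv) * np.abs(fv) ** rho      # power normalization, 0 < rho < 1
    norm = np.sqrt(fv @ fv)
    return fv / norm if norm > 0 else fv      # L2 normalization

v = normalize_fv(np.array([4.0, -1.0, 0.0, 9.0]))
print(v @ v)  # unit squared L2 norm after normalization
```

Both steps are applied after the (diagonal) Fisher information scaling, so a linear classifier on the result approximates the normalized Fisher kernel.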
8
Overview of this talk Fisher vector image representation
Segmentation-based candidate windows Feature extraction with approximate object masks Experimental evaluation results
9
A typical object detection system
Training a binary classifier that scores object hypotheses Positives given by manual annotation (hundreds to thousands) Negatives progressively sampled outside positive boxes Repeated access to negative windows to find the hard ones Store or re-extract feature vectors of these examples Applying the detector on a test image Evaluate classifier on a collection of windows Non-maximum suppression Detection speed proportional to number of considered windows
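The "progressively sampled negatives" loop above is the standard hard-negative mining scheme. The following is a toy sketch of that loop under simplifying assumptions: the "classifier" is just a difference of class means rather than the linear SVM used in the paper, and windows are random vectors rather than Fisher vectors.

```python
# Toy hard-negative mining: start from a random negative sample, then
# repeatedly add the highest-scoring (hardest) negatives and retrain.
import numpy as np

def fit(pos, neg):
    # stand-in for SVM training: difference of class means
    return np.mean(pos, axis=0) - np.mean(neg, axis=0)

def mine_hard_negatives(pos, neg_pool, rounds=3, k=5):
    rng = np.random.default_rng(0)
    negatives = neg_pool[rng.choice(len(neg_pool), k, replace=False)]
    w = fit(pos, negatives)
    for _ in range(rounds):
        scores = neg_pool @ w                           # score every negative window
        hard = neg_pool[np.argsort(scores)[::-1][:k]]   # highest-scoring = hardest
        negatives = np.vstack([negatives, hard])
        w = fit(pos, negatives)
    return w

rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, size=(20, 8))      # positive windows
neg_pool = rng.normal(0.0, 1.0, size=(200, 8))  # all negative windows
w = mine_hard_negatives(pos, neg_pool)
print(w.shape)
```

The expensive step in a real system is exactly the `neg_pool @ w` line: every candidate window's feature vector must be stored or re-extracted, which motivates the compression discussed on the next slide.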
10
Fisher vectors do not mix with a typical detection system
A very simple Fisher vector window descriptor Feature dimensionality D=64, GMM with K=64 Gaussians FV dimensionality = 2KD = 2^13 = 8192; at 4-byte floats, 32 KB of memory per window Number of possible windows in an image with N pixels is O(N^2) Using “just” 1 million windows (100x100 spatial grid, 10 scales, 10 aspect ratios) yields 32 GB per image Bottom line: infeasible to store, inefficient to re-extract, costly to score Two remedies to reduce the amount of data to store and process: Compress the Fisher vectors, decompress for learning or scoring: product quantization (32x, lossy) + Blosc compression (4x, lossless) [Jégou, Douze, Schmid, PAMI 2011] [Alted, Comp. Sc. & Eng., 2010] Restrict attention to a much smaller subset of windows
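The 32x figure for product quantization follows from replacing each group of floats with a single codebook index. Here is a toy sketch of PQ encoding and decoding; the sub-codebooks are random for illustration, whereas real ones come from k-means on training vectors, and the sizes (64-dim vector, 8 sub-quantizers of 256 centroids) are chosen only to make the arithmetic visible.

```python
# Toy product quantization: split a vector into m sub-blocks and store
# each block as the index of its nearest sub-codebook centroid.
import numpy as np

def pq_encode(x, codebooks):
    m, ksub, dsub = codebooks.shape
    blocks = x.reshape(m, dsub)
    return np.array([np.argmin(((cb - b) ** 2).sum(1))   # nearest centroid
                     for cb, b in zip(codebooks, blocks)], dtype=np.uint8)

def pq_decode(codes, codebooks):
    # lossy reconstruction: concatenate the selected centroids
    return np.concatenate([cb[c] for cb, c in zip(codebooks, codes)])

rng = np.random.default_rng(1)
d, m, ksub = 64, 8, 256
codebooks = rng.normal(size=(m, ksub, d // m))
x = rng.normal(size=d)
codes = pq_encode(x, codebooks)
x_hat = pq_decode(codes, codebooks)
print(codes.nbytes, x.astype(np.float32).nbytes)  # 8 bytes vs 256 bytes: 32x
```

Decoding is needed only transiently, when a stored window descriptor is scored or used in SVM training.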
11
Alternatives to exhaustive sliding window search
Branch-and-bound techniques Imposes requirements on type of classifiers / features [Lampert, Blaschko, Hofmann, PAMI 2009] Feature cascades Requires set of fast features in early stages [Viola & Jones, IJCV 2004] Coarse-to-fine search Requires compositionality of classifier score [Felzenszwalb, Girshick, McAllester, CVPR 2010] Data driven generic object hypotheses Consider boxes aligned with low-level image contours Does not impose constraints on classifiers / features [Alexe, Deselaers, Ferrari, CVPR 2010]
12
Segmentation to propose object hypotheses [Sande et al, ICCV 2011]
Object hypotheses encouraged to align with texture and color contours Segment image into super-pixels using low-level color cues Hierarchically group similar neighboring regions Each node in the tree generates a hypothesis from its bounding box “Never trust segmentation”: vary the scale and color parameters (8 trees) With around 1500 windows per image >95% of objects are captured PASCAL VOC 2007 evaluation: an object counts as captured if some box has intersection-over-union > 0.5 with it
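The grouping procedure can be sketched as follows. This is a heavily simplified stand-in for the method of Van de Sande et al.: superpixels are represented only by their bounding boxes, and the similarity that drives merging is a toy one (smallest union box) instead of the color, texture, and size similarities of the real algorithm.

```python
# Toy hierarchical grouping: repeatedly merge the most similar pair of
# neighboring regions; every merged region's bounding box is a proposal.
def group_regions(boxes, edges):
    """boxes: list of (x0, y0, x1, y1); edges: neighboring index pairs."""
    regions = dict(enumerate(boxes))
    edges = {frozenset(e) for e in edges}
    proposals = list(regions.values())
    def union(a, b):
        return (min(a[0], b[0]), min(a[1], b[1]),
                max(a[2], b[2]), max(a[3], b[3]))
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])
    while edges:
        # toy similarity: merge the pair whose union box is smallest
        e = min(edges, key=lambda f: area(union(*(regions[i] for i in f))))
        i, j = tuple(e)
        new = max(regions) + 1
        regions[new] = union(regions[i], regions[j])
        proposals.append(regions[new])
        # rewire the neighbors of i and j to the merged region
        edges = {frozenset({new if v in (i, j) else v for v in f})
                 for f in edges if f != e}
        edges = {f for f in edges if len(f) == 2}
        del regions[i], regions[j]
    return proposals

props = group_regions([(0, 0, 2, 2), (2, 0, 4, 2), (0, 2, 4, 4)],
                      [(0, 1), (1, 2), (0, 2)])
print(props[-1])  # root of the tree: the full extent (0, 0, 4, 4)
```

Running this with several segmentation maps (different scale and color parameters) and pooling the resulting boxes gives the roughly 1500 hypotheses per image quoted above.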
13
Overview of this talk Fisher vector image representation
Segmentation-based candidate windows Feature extraction with approximate object masks Experimental evaluation
14
Can we do more with the superpixels ?
Suppress background clutter by segmenting the object Does the “generating segment” isolate the object? No: generally large parts of the object are missing Missing object regions merge in with clutter too late in the hierarchy The hierarchy is good for bounding boxes, not for segmentation
15
Grouping non-straddling superpixels
Suppress background clutter by segmenting the object Background is likely to appear across the window boundary Suppress super-pixels that extend outside the bounding box Overlay the binary masks obtained from the different segmentation maps
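The straddling-suppression idea translates directly into code. A minimal sketch, assuming the superpixel segmentation is given as a label map; the real system overlays the masks from several such maps, while this shows a single one.

```python
# Keep only superpixels fully inside the candidate window; superpixels
# that straddle the window boundary are treated as background (mask = 0).
import numpy as np

def window_mask(seg, box):
    """seg: H x W array of superpixel labels; box: (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    inside = np.zeros_like(seg, dtype=bool)
    inside[y0:y1, x0:x1] = True
    # a superpixel straddles the boundary if it has pixels on both sides
    straddling = [lab for lab in np.unique(seg)
                  if inside[seg == lab].any() and (~inside)[seg == lab].any()]
    mask = inside & ~np.isin(seg, straddling)
    return mask[y0:y1, x0:x1]

seg = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 3, 3],
                [2, 2, 3, 3]])
m = window_mask(seg, (0, 0, 3, 2))  # covers all of superpixel 0, part of 1
print(m.astype(int))                # superpixel 1 straddles -> suppressed
```

Because the superpixels are a byproduct of the proposal stage, this mask costs essentially nothing extra per window.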
16
Masked Fisher vectors Weight local features by segmentation mask
The mask is obtained at negligible cost since the superpixels are already extracted
$\nabla_{\mu_k} \ln p(x_{1:N}) = \frac{1}{\pi_k} \sum_{n=1}^{N} m_n \, p(k \mid x_n) \, \frac{x_n - \mu_k}{\sigma_k}$
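In code, the mask simply enters the Fisher vector sums as a per-descriptor weight m_n; descriptors on suppressed background stop contributing. A toy illustration for the mean-gradient of a single Gaussian, with hand-picked numbers (the posteriors and GMM parameters are made up, not learned):

```python
import numpy as np

def masked_mean_gradient(X, m, pi_k, mu_k, sigma_k, post_k):
    """X: N x D descriptors; m: N mask weights; post_k: N posteriors p(k|x_n)."""
    return (m[:, None] * post_k[:, None] * (X - mu_k) / sigma_k).sum(0) / pi_k

X = np.array([[1.0, 0.0], [5.0, 5.0]])   # second descriptor is clutter
m = np.array([1.0, 0.0])                 # mask zeroes out the clutter
g = masked_mean_gradient(X, m, 0.5,
                         np.zeros(2), np.ones(2), np.ones(2))
print(g)  # only the first descriptor contributes: [2. 0.]
```

Soft weights (from overlaying several segmentation maps) work the same way, with m_n between 0 and 1.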
18
What happens on incorrect object hypotheses ?
Important, since more than 99% of object hypotheses are incorrect Cropped objects often largely suppressed: horse (a), car (c) Object features can dominate even if the box is wrong: bus (b), car (c) Concatenate masked and full window descriptors
19
Detection system overview
Segmentation-based image analysis yields candidate windows and masks
20
Detection system overview
Segmentation-based image analysis yields candidate windows and masks Local features: SIFT and color (RGB mean and variance on a 4x4 grid per patch) Aggregate features into FVs over window and mask Compute FVs on the full window and on a 4x4 grid over the window (SIFT only) Vocabulary size K=64, and both feature types reduced to D=64 by PCA The concatenated 8192-dimensional FVs amount to about 1.2 MB per window, compressed to ~9 KB
21
Detection system overview
Segmentation-based image analysis yields candidate windows and masks Local features: SIFT and color (RGB mean and variance on a 4x4 grid per patch) Aggregate features into FVs over window and mask Compute FVs on the full window and on a 4x4 grid over the window (SIFT only) Vocabulary size K=64, and both feature types reduced to D=64 by PCA Global features Compute Fisher vectors from all local descriptors in the full image No spatial grid used here Inter-category contextual re-scoring [Felzenszwalb et al., PAMI 2010] Use score and location of the max detection for all classes as features
$x = [\, x_{\text{window}} ;\; x_{\text{image}} \,]$
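The contextual re-scoring features can be sketched as follows, assuming (as in the re-scoring scheme of Felzenszwalb et al.) that each class contributes the score and box of its strongest detection in the image; the exact feature layout here is illustrative, not the paper's.

```python
# Toy inter-category context features: per class, the score and box of
# the strongest detection in the image, flattened into one vector that
# is appended to every window's descriptor before re-scoring.
import numpy as np

def context_features(detections, n_classes=20):
    """detections: {class_id: list of (score, x0, y0, x1, y1)}."""
    feats = np.zeros((n_classes, 5))
    for c, dets in detections.items():
        feats[c] = max(dets, key=lambda d: d[0])   # strongest detection
    return feats.ravel()

ctx = context_features({0: [(0.9, 1, 2, 3, 4), (0.3, 0, 0, 1, 1)]})
print(ctx.shape)  # 20 classes x 5 values = (100,)
```

A strong "person" detection, say, can thus raise or lower the re-scored confidence of a "bicycle" window in the same image.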
22
Overview of this talk Fisher vector image representation
Segmentation-based candidate windows Feature extraction with approximate object masks Experimental evaluation
23
Setup of evaluation Standard evaluation using PASCAL VOC data sets
2007 and 2010 editions: 20 classes, 5k and 10k train & test images resp. subset of train images of 2007 set used for development Only bounding box annotations used Viewpoint annotations ignored Single linear SVM classifier trained for each category Liblinear for training, decompressing descriptors on-the-fly as needed Detection speed: 20 min for 20 classes on 5k images on 35 cores (126 GB) About 0.4 sec for 1 class for 1 image on 1 core (25 MB) Similar to speed of fast DPM [Felzenszwalb, Girshick, McAllester, CVPR 2010] In both cases time for feature extraction excluded
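The PASCAL criterion used throughout this evaluation is intersection-over-union above 0.5 between a detection and a ground-truth box; for completeness, a minimal implementation:

```python
# Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1).
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150 = 0.333...
```

A detection with IoU > 0.5 against an unmatched ground-truth box counts as a true positive; further detections of the same object count as false positives, which is what the non-maximum suppression step guards against.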
24
Results on development set
Mask features outperform those of generating segment: +8.3% mAP Complementary with full window feature: +1.4% mAP
25
Results on VOC 2007 Mask feature leads to about 2% absolute increase in mAP Consistent improvement using various sets of base features S: SIFT C: Color W: window M: mask F: full image C: contextual re-scoring
26
Results on VOC 2007 Mask feature leads to about 2% absolute increase in mAP Consistent improvement using various sets of base features Both global features are effective Full image descriptor brings about 1% mAP Inter-category context brings about 2% mAP S: SIFT C: Color W: window M: mask F: full image C: contextual re-scoring
27
Comparison to state-of-the-art results on VOC 2007
Improvements over the current state of the art on 8 of the 20 classes In mAP both without (+3.7%) and with (+1.8%) inter-category context The results of Van de Sande et al. were obtained using the same object hypotheses; over those, improvements of 4.8% mAP without and 6.8% with inter-category context
28
Improved top detection per image when using mask
Full system used with (bottom) and without (top) mask features Mask suppresses cropped bus Mask handles poor alignment of the box w.r.t. the object Mask suppresses clutter
29
Deteriorated top detection per image when using mask
Background suppression seems to introduce bias towards large windows Multiple objects detected Too large detection windows
30
Comparison to state-of-the-art results on VOC 2010
Improvements over the current state of the art on 10 of the 20 classes
31
Comparison to state-of-the-art results on VOC 2010
Improvements over the current state of the art on 11 of the 20 classes In mAP both with (+1.6%) inter-category context and without (+1.7%)
32
Comparison to state-of-the-art results on VOC 2010
Comparison with deformable part-based models (DPM); differences > 5% AP DPM better for: bottle, chair, person Ours better for: aeroplane, bird, boat, cat, cow, table, dog, sheep, sofa, tv Deformation might not be the distinguishing feature favoring DPM More rigid HOG features (+ some deformation) vs locally orderless FV Multiple components? Head & shoulders vs full body, empty vs occupied chair?
33
Conclusion We presented a state of the art object detection system, novelties: Fisher vector coding to aggregate local features Approximate object masks to weight local features Leveraging of segmentation at two levels Object hypotheses generation Background suppression in features Excellent performance measured on PASCAL VOC 2007 and 2010 datasets Improves current state-of-the-art results Detection speed comparable to cascaded DPM implementation
34
Some more detections...
35
Some more masks...