1
A coarse-to-fine approach for fast deformable object detection
Marco Pedersoli, Andrea Vedaldi, Jordi Gonzàlez
2
Object detection [Fischler & Elschlager 1973; Felzenszwalb et al. 08; Vedaldi & Zisserman 2009; Zhu et al. 10; VOC 2010]

Addressing the computational bottleneck:
- branch-and-bound [Blaschko & Lampert 08, Lehmann et al. 09]
- cascades [Viola & Jones 01, Vedaldi et al. 09, Felzenszwalb et al. 10, Weiss & Taskar 10]
- jumping windows [Chum 07]
- sampling windows [Gualdi et al. 10]
- coarse-to-fine [Fleuret & Geman 01, Zhang et al. 07, Pedersoli et al. 10]
3
Analysis of the cost of pictorial structures
4
The cost of pictorial structures (L = number of part locations ~ number of pixels ~ millions)

Cost of inference:
- one part: L
- two parts: L²
- ...
- P parts: L^P

With a tree, using dynamic programming: PL². Polynomial, but still too slow in practice.

With a tree and quadratic springs, using the distance transform [Felzenszwalb and Huttenlocher 05]: PL. In principle, millions of times faster than dynamic programming!
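Since the distance transform is what brings the PL² dynamic program down to PL, a minimal 1D sketch of the lower-envelope algorithm of [Felzenszwalb and Huttenlocher 05] may help. It computes d[q] = min over p of f[p] + w·(q − p)² for all q in linear time; the quadratic weight w and the pure-Python style are illustrative assumptions.

```python
def distance_transform_1d(f, w=1.0):
    """Generalized distance transform: d[q] = min_p f[p] + w * (q - p)**2.

    Maintains the lower envelope of the parabolas rooted at each p
    [Felzenszwalb and Huttenlocher 05]: O(n) instead of O(n^2).
    """
    n = len(f)
    v = [0] * n                     # positions of the envelope parabolas
    z = [0.0] * (n + 1)             # boundaries between envelope segments
    z[0], z[1] = float('-inf'), float('inf')
    k = 0
    for q in range(1, n):
        # intersection of the parabola rooted at q with the last one kept
        s = ((f[q] + w * q * q) - (f[v[k]] + w * v[k] * v[k])) / (2.0 * w * (q - v[k]))
        while s <= z[k]:            # the new parabola hides the old one
            k -= 1
            s = ((f[q] + w * q * q) - (f[v[k]] + w * v[k] * v[k])) / (2.0 * w * (q - v[k]))
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, float('inf')
    d, k = [0.0] * n, 0
    for q in range(n):              # read the envelope left to right
        while z[k + 1] < q:
            k += 1
        d[q] = w * (q - v[k]) ** 2 + f[v[k]]
    return d
```

For a pictorial structure, one such pass per part and per image dimension yields the PL cost quoted above.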
5
A notable case: deformable part models [Felzenszwalb et al. 08]
- part locations are discrete, on a grid of stride δ
- deformations are bounded

number of possible part locations: L/δ² (instead of L)
cost of placing two parts: LC (instead of L²), where C = max. deformation size and C << L
total geometric cost: CPL/δ²
6
With deformable part models:
- finding the optimal part configuration is cheap
- the distance transform speed-up is limited

Standard analysis does not account for filtering (F = size of the filter):
- geometric cost: CPL/δ²
- filtering cost: FPL/δ²
- total cost: (F + C)PL/δ²

Typical example:
- filter size: F = 6 × 6 × 32 = 1152
- deformation size: C = 6 × 6 = 36

Filtering dominates finding the optimal part configuration! (See the snippet below.)
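To make the imbalance concrete, a back-of-the-envelope computation of the two cost terms for the typical example above (the numbers come from the slide; the script is only illustrative):

```python
# Cost terms of a deformable part model: (F + C) * P * L / delta**2.
F = 6 * 6 * 32          # HOG filter size: 6x6 cells, 32-dim cell descriptor
C = 6 * 6               # bounded deformation: 6x6 possible displacements
print(F / (F + C))      # ~0.97: filtering is about 97% of the total cost
print(F / C)            # 32.0: filtering costs 32x the geometric term
```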
7
Accelerating deformable part models
Deformable part model cost: (F + C)PL/δ². The key is reducing the filter evaluations.

Cascade of deformable parts [Felzenszwalb et al. 2010]
- detect parts sequentially
- stop when the confidence falls below a threshold

Coarse-to-fine localization [Pedersoli et al. 2010]
- multi-resolution search
- we extend this idea to deformable part models
8
Our contribution: Coarse-to-fine for deformable models
9
Our model
Multi-resolution deformable parts:
- each part is a HOG filter
- recursive arrangement: the resolution doubles at each level
- bounded deformation

Score of a configuration S(y):
- HOG filter score
- parent-child deformation score
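In DPM-style notation the score can be sketched as below. This is only a sketch: the symbols w_i, d_ij, δ_ij and the quadratic penalty follow [Felzenszwalb et al. 08] and are our assumptions, not necessarily the paper's exact parametrization; y_i is the location of part i and the factor 2 accounts for the doubling of resolution between a parent and its children.

```latex
S(y) = \sum_{i} \big\langle w_i,\, \phi_{\mathrm{HOG}}(x, y_i) \big\rangle
     \;-\; \sum_{(i,j)\,\in\,\text{parent-child edges}} d_{ij}\,\big\| y_j - (2\,y_i + \delta_{ij}) \big\|^2
```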
10
Coarse-to-Fine search
11
Quantify the saving (1D view: circle = part location; the 2D view is analogous)
Number of filter evaluations per level (2D): exact search L, 4L, 16L, ...; CTF L, L, L, ...
Overall speedup ≈ 4^R: an exponentially larger saving as the number of resolution levels R grows.
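The speed-up can be made precise with two geometric series, using the per-level costs detailed in the backup slides (F = root filter cost, C = deformation cost, C << F):

```latex
\text{exact: } \sum_{r=0}^{R-1} 4^r L\,(4^r F + C) \approx LF\,\frac{16^R-1}{15},
\qquad
\text{CTF: } \sum_{r=0}^{R-1} L\,(4^r F + C) \approx LF\,\frac{4^R-1}{3},
\qquad
\text{speed-up} = \frac{3\,(16^R-1)}{15\,(4^R-1)} \approx \frac{4^R}{5} \quad (= 13 \text{ for } R = 3).
```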
12
Lateral constraints
Geometry in deformable part models is cheap: we can afford additional constraints.

Lateral constraints:
- connect sibling parts

Inference (see the sketch below):
- use dynamic programming within each level
- open the cycle by conditioning on one node
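To make "open the cycle by conditioning one node" concrete, here is a toy exact-MAP routine for four sibling parts connected in a cycle 0-1-2-3-0. The data layout and the 4-part cycle are illustrative assumptions; conditioning on part 0 turns the cycle into a chain that dynamic programming solves, for a total cost of O(|states|³) instead of the O(|states|⁴) of brute force.

```python
def infer_cycle_by_conditioning(unary, pairwise, states):
    """Exact MAP over four sibling parts in a cycle 0-1-2-3-0 (toy sketch).

    unary[i][s]      : appearance score of part i in state s
    pairwise[i][s][t]: lateral score of edge (i, (i+1) % 4) for states (s, t)
    """
    best = (float('-inf'), None)
    for s0 in states:                  # condition on part 0: cycle -> chain
        # forward dynamic programming along the opened chain 1 -> 2 -> 3
        score = {s: unary[1][s] + pairwise[0][s0][s] for s in states}
        back = []
        for i in (2, 3):
            new, ptr = {}, {}
            for t in states:
                p = max(states, key=lambda s: score[s] + pairwise[i - 1][s][t])
                new[t] = score[p] + pairwise[i - 1][p][t] + unary[i][t]
                ptr[t] = p
            score, back = new, back + [ptr]
        # close the cycle with the edge (3, 0) back to the conditioned state
        s3 = max(states, key=lambda s: score[s] + pairwise[3][s][s0])
        total = unary[0][s0] + score[s3] + pairwise[3][s3][s0]
        if total > best[0]:
            s2 = back[1][s3]
            best = (total, (s0, back[0][s2], s2, s3))
    return best                        # (best score, (y0, y1, y2, y3))
```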
13
Why are lateral constraints useful?
They encourage consistent local deformations:
- without lateral constraints, siblings move independently; there is no way to make their motion coherent
- without lateral constraints, configurations y and y' have the same geometric cost; with lateral constraints, the coherent configuration y can be encouraged
14
Experiments
15
Effect of deformation size
INRIA pedestrian dataset; C = deformation size (HOG cells); AP = average precision (%); coarse-to-fine (CTF) inference.

C      3×3     5×5    7×7
AP     83.5    83.2   83.6
time   0.33s   2.0s   9.3s

Remarks:
- a large C slows down inference but does not improve precision
- a small C already allows substantial part deformation, due to the multiple resolutions
16
Effect of the lateral constraints
Exact vs. coarse-to-fine (CTF) inference.

Effect on the inference scores:
- CTF scores track the exact inference scores, with CTF ≤ exact
- the bound is tighter with lateral constraints

The effect is significant on training as well: the additional coherence avoids spurious solutions (example: learning the head model).

                        exact inference   CTF inference
tree                    83.0 AP           80.7 AP
tree + lateral conn.    83.4 AP           83.5 AP
17
Training speed
Structured latent SVM [Felzenszwalb et al. 08, Vedaldi et al. 09]:
- the deformations of the training objects are unknown
- they are estimated as latent variables

Algorithm (sketched in code below):
- Initialization: no negative examples, no deformations
- Outer loop:
  - Inner loop:
    - collect hard negative examples (CTF inference)
    - learn the model parameters (SGD)
  - estimate the deformations (CTF inference)

The training speed is dominated by the cost of inference!

                  training   testing
exact inference   ≈20h       2h (10s per image)
CTF inference     ≈2h        4m (0.33s per image)

> 10× speedup!
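A schematic of this training loop. This is a sketch only: `ctf_inference`, `mine_hard_negatives` and `sgd_step` are hypothetical callables standing in for the components named in the slide, not a real API.

```python
def train_latent_svm(model, positives, negative_images,
                     ctf_inference, mine_hard_negatives, sgd_step,
                     outer_iters=5, inner_iters=10):
    """Structured latent SVM training driven by CTF inference (sketch)."""
    latent = {x: None for x in positives}   # deformations start unknown
    hard_negatives = []                     # start with no negative examples
    for _ in range(outer_iters):
        for _ in range(inner_iters):
            # collect hard negative examples (CTF inference on negatives)
            hard_negatives += mine_hard_negatives(model, negative_images)
            # learn the model parameters (stochastic gradient descent)
            model = sgd_step(model, positives, latent, hard_negatives)
        # estimate the deformations as latent variables (CTF inference)
        latent = {x: ctf_inference(model, x) for x in positives}
    return model
```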
18
PASCAL VOC 2007
Evaluation on the detection of 20 different object categories; ~5,000 images for training, ~5,000 images for testing.

Remarks:
- very good for aeroplane, bicycle, boat, table, horse, motorbike, sheep
- less good for bottle, sofa, tv

Speed-accuracy trade-off:
- time is drastically reduced
- the hit on AP is small
19
Comparison to the cascade of parts
Cascade of parts [Felzenszwalb et al. 10]:
- tests parts sequentially, rejects when the score falls below a threshold
- saving at unpromising locations (content dependent)
- difficult to use in training (the thresholds must be learned)

Coarse-to-fine inference:
- saving is uniform (content independent)
- can be used during training
20
Coarse-to-fine cascade of parts
Cascade and CTF use orthogonal principles:
- easily combined
- the speed-ups multiply!

Example (see the snippet below):
- apply a threshold at the root
- plot AP vs. speed-up
- in some cases a 100× speed-up can be achieved

(pipeline: CTF step, cascade test "score > τ1? else reject", CTF step, cascade test "score > τ2? else reject", ...)
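The combination amounts to interleaving an absolute-score test between CTF resolutions; a minimal sketch (the dict-of-scores format and the per-level threshold tau are illustrative assumptions):

```python
def cascade_step(scores, tau):
    """Cascade pruning between CTF levels: keep only the hypotheses whose
    accumulated score exceeds the level threshold tau."""
    return {x: s for x, s in scores.items() if s > tau}
```

Applied after each resolution, this rejection is content dependent, while the CTF local-maxima pruning is content independent; this is why the two savings multiply rather than overlap.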
21
Summary
Analysis of deformable part models:
- filtering dominates the geometric configuration cost
- a speed-up requires reducing the filtering

Coarse-to-fine search for deformable models:
- lower resolutions can drive the search at higher resolutions
- lateral constraints add coherence to the search
- exponential saving, independent of the image content
- can be used for training too

Practical results:
- 10× speed-up on VOC and INRIA with minimal AP loss
- can be combined with the cascade of parts for a multiplied speed-up

Future:
- more complex models with rotation, foreshortening, ...
22
Thank you!
24
Coarse-to-Fine
Computational cost per level: L(F+C), L(4F+C), L(16F+C), ...
For resolution r: L(4^r F + C) ≈ 4^r LF, so each level costs 4× the previous one.
Total for R levels: LF(4^R − 1)/3
Speed-up: 13× (for R = 3)
25
Cost of a deformable template
Cost of matching one part: L
Cost of matching two parts: L²
26
Real cost of a deformable template
In modern detectors:
- few real locations L': quantization L' = L/(8 × 8)
- bounded deformation D: D = w × h << L'
- high filter dimension F: F = dim(filter) = w × h × d, therefore F >> D

New matching cost: PL'(F + D) without the distance transform ≈ PL'F with it.

The dominant cost of matching is filtering, not the cost of finding the part configurations!
27
Coarse-to-fine search in 1D
- D = minimum distance between two minima of f(x) (in images: overlapping objects)
- local neighborhoods N with |N| < D
- CtF search: for each neighborhood N, find the local minimum of f(x)
28
Coarse-to-Fine (example with C = 3)
Complete-search cost per level: L(F+C), 4L(4F+C), 16L(16F+C)
29
Computational cost: complete search vs. CtF inference (filters of size w × h)
- complete search at r = 0, 1, 2: whLFD, (4×4)whLFD, (16×16)whLFD
- coarse-to-fine at r = 0, 1, 2: whLFD, 4whLFD, 16whLFD

The computational cost is reduced by a further 4× at each resolution; with R = 3 this gives a constant speed-up of 13× (273/21, in units of whLFD).
30
The actual cost of matching (2/2)
Cost ∝ size of the filter F and deformation C:
- pictorial structure: PL² (or PL with the distance transform)
- deformable part model: (C + F)PL/δ²

Typical example:
- filter size: F = 6 × 6 × 32
- deformation size: C = 6 × 6

The dominant cost of matching is filtering, not the cost of finding the part configurations! The distance transform is not the answer anymore.
31
Our model (appearance and structure)
Recursive model:
- each part is a vector of weights associated with HOG features
- at each increase in resolution, each part is decomposed into 4 subparts

Deformation:
- bounded to a fixed number of HOG cells
- varies with resolution
32
Coarse-to-fine search in 1D
Consider searching along the x-axis only.

Lowest resolution (root):
- filter at all L locations
- propagate only the local maxima

Medium resolution:
- filter at only (L/3) × 3 = L locations (out of 2L)
- propagate only the local maxima

High resolution:
- filter at only (L/3) × 3 = L locations (out of 4L)

Number of filter evaluations: L + L + L instead of L + 2L + 4L (an exponential reduction). A code sketch follows.
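A minimal runnable sketch of this 1D search, assuming the resolution doubles per level and hypotheses expand to their children with a bounded deformation of ±spread positions; `filter_score` is a hypothetical stand-in for evaluating a HOG filter at one position.

```python
def ctf_search_1d(filter_score, levels, L, spread=1):
    """1D coarse-to-fine search: filter everywhere at the root, then only
    around hypotheses propagated from the coarser level.

    filter_score(r, x): score of the level-r filter at position x
    (level r has 2**r * L positions).  Returns the best position at the
    finest level and the number of filter evaluations performed.
    """
    evals = 0
    hyps = list(range(L))                   # root level: all L positions
    for r in range(levels):
        scores = {x: filter_score(r, x) for x in hyps}
        evals += len(hyps)
        if r == levels - 1:
            return max(hyps, key=scores.get), evals
        # propagate only the local maxima among the evaluated positions
        keep = [x for x in hyps
                if all(scores[x] >= scores.get(n, float('-inf'))
                       for n in (x - 1, x + 1))]
        # expand each maximum to its children at twice the resolution
        n_next = 2 ** (r + 1) * L
        hyps = sorted({c for x in keep
                       for c in range(2 * x - spread, 2 * x + spread + 1)
                       if 0 <= c < n_next})

# toy usage: a score peaked near 70% of each level's extent
best, evals = ctf_search_1d(lambda r, x: -abs(x - 0.7 * (2 ** r) * 32),
                            levels=3, L=32)
```

On this toy score the search performs 38 filter evaluations instead of the 32 + 64 + 128 = 224 of an exhaustive multi-resolution scan.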
33
Coarse-to-Fine on an image
Computational cost for resolution r:
- standard: 4^r L(4^r F + C) ≈ 16^r LF; total for R levels: LF(16^R − 1)/15
- CtF: L(4^r F + C) ≈ 4^r LF; total for R levels: LF(4^R − 1)/3

Each additional level multiplies the saving by 4×; total speed-up: 13× (for R = 3).
34
Comparison to the cascade of parts
Cascade of parts:
- prunes hypotheses based only on the global score: is the sum of the previously detected parts > t?
- does not consider any spatial information

Coarse-to-fine search:
- prunes hypotheses based on relative scores: the maximum over a set of spatially close hypotheses
- does not consider the global score

Cascade and coarse-to-fine use different cues, so the hypotheses pruned by one method are not pruned by the other and vice versa. The methods are orthogonal to each other, so combining the two should provide further advantages!
35
Coarse-to-Fine + Cascade
Simplified cascade on coarse-to-fine inference:
- a single threshold T when moving from one resolution to the next
- Speed-up = (#HOG evaluations, complete search) / (#HOG evaluations, CtF + cascade)

Plotting the speed vs. AP trade-off while varying T on VOC07 shows a 100× speed-up for certain classes.

(pipeline: CtF search, cascade test "score > T?", CtF search, cascade test "score > T?", ...)
36
Summary
The sweet spot of deformable part models:
- coarse-to-fine inference together with a hierarchical multi-resolution part-based model
- more than 10× constant speed-up, with no need to learn thresholds on validation data
- 10× speed-up in training when estimating the latent variables
- coarse-to-fine inference is orthogonal to the cascade: using both methods, 100× speed-up with little loss in performance

Future work:
- search over rotations, foreshortening, appearances
- faster HOG computation
- more complex structure: fully connected deformations
- more complex models: a 3D representation?!