OBJ CUT & PoseCut (CVPR 05, ECCV 06). University of Oxford. Philip Torr, M. Pawan Kumar, Pushmeet Kohli and Andrew Zisserman
Conclusion (preview): combining pose inference and segmentation is worth investigating (more tomorrow). Tracking = Detection; Detection = Segmentation; Tracking (pose estimation) = Segmentation.
Segmentation. To distinguish cow and horse? First, the segmentation problem.
Aim: given an image, segment the object (Category Model + Cow Image → Segmented Cow). The segmentation should (ideally) be: shaped like the object, e.g. cow-like; obtained efficiently, in an unsupervised manner; able to handle self-occlusion.
Challenges: intra-class shape variability; intra-class appearance variability; self-occlusion.
Motivation: Magic Wand. Current methods require user intervention: object and background seed pixels (Boykov and Jolly, ICCV 01), or a bounding box of the object (Rother et al., SIGGRAPH 04). Given object and background seed pixels on the cow image, we obtain the segmented image.
Motivation: the problem. Manually intensive, and the segmentation is not guaranteed to be 'object-like' (example: a non-object-like segmentation).
Our Method: combine object detection with segmentation (Borenstein and Ullman, ECCV '02; Leibe and Schiele, BMVC '03), and incorporate global shape priors in the MRF. Detection provides object localization and global shape priors, and automatically segments the object. Note: our method is completely generic, applicable to any object category model.
Outline Problem Formulation Form of Shape Prior Optimization Results
Problem. Labelling m over the set of pixels D; shape prior provided by parameter Θ. Energy: E(m,Θ) = ∑x [Φx(D|mx) + Φx(mx|Θ)] + ∑xy [Ψxy(mx,my) + Φ(D|mx,my)]. Unary terms: a likelihood based on colour, and a unary potential based on distance from Θ. Pairwise terms: a prior and a contrast term. Find the best labelling m* = arg min ∑i wi E(m,Θi), where wi is the weight for sample Θi.
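To make the energy concrete, here is a toy sketch on a 1-D chain of pixels. All potentials are made up for illustration; `unary[x][m]` stands for the combined colour-likelihood and distance-from-Θ terms at pixel x under label m, and each `(w_i, unary_i)` pair plays the role of one weighted shape sample Θi.

```python
def mrf_energy(labels, unary, pairwise):
    """E(m) = sum_x unary[x][m_x] + sum_xy pairwise(m_x, m_y) on a 1-D chain."""
    e = sum(unary[x][m] for x, m in enumerate(labels))
    e += sum(pairwise(labels[x], labels[x + 1]) for x in range(len(labels) - 1))
    return e

def weighted_energy(labels, samples, pairwise):
    """sum_i w_i E(m, Theta_i), with samples = [(w_i, unary_i), ...]."""
    return sum(w * mrf_energy(labels, u, pairwise) for w, u in samples)
```

The real system minimizes this weighted sum over labellings; the sketch only evaluates it for a given m.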
MRF. The probability of a labelling m (labels) over D (pixels) consists of: a likelihood, the unary potential Φx(D|mx) based on the colour of each pixel; and a prior Ψxy(mx,my) which favours the same label for neighbouring pixels (pairwise potentials).
Example (cow image). From the object and background seed pixels we build per-pixel likelihood ratios Φx(D|obj) and Φx(D|bkg) from colour, combined with the pairwise prior Ψxy(mx,my). [Figure: cow image, likelihood ratio (colour), prior.]
Contrast-Dependent MRF. The probability of a labelling in addition has a contrast term Φ(D|mx,my), which favours boundaries that lie on image edges.
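A common form of such a contrast term (a sketch, not the paper's exact constants: `gamma` and `sigma` here are illustrative) makes a label change cheap where the intensity difference across the edge is large:

```python
import math

def contrast_term(ix, iy, mx, my, gamma=1.0, sigma=5.0):
    """Pairwise cost that is only paid when the labels differ, and decays
    with the squared intensity difference, so cuts prefer image edges."""
    if mx == my:
        return 0.0
    return gamma * math.exp(-((ix - iy) ** 2) / (2.0 * sigma ** 2))
```

With this, a boundary between two similar pixels costs nearly `gamma`, while a boundary on a strong image edge costs almost nothing.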
Example (cow image). As before, but the pairwise term is now Ψxy(mx,my) + Φ(D|mx,my). [Figure: cow image, likelihood ratio (colour), prior + contrast.]
Our Model: the Object Category Specific MRF. The probability of a labelling in addition has unary potentials Φx(mx|Θ) which depend on the distance from Θ (the shape parameter).
Example (cow image). The shape prior Θ contributes a distance-from-Θ unary term, combined with the colour likelihood and the contrast-dependent pairwise term. [Figure: shape prior Θ, distance from Θ, likelihood + distance from Θ, prior + contrast.]
Outline. Problem formulation: E(m,Θ) = ∑x [Φx(D|mx) + Φx(mx|Θ)] + ∑xy [Ψxy(mx,my) + Φ(D|mx,my)]. Next: form of shape prior; optimization; results.
Detection BMVC 2004
Layered Pictorial Structures (LPS) Generative model Composition of parts + spatial layout Layer 2 Spatial Layout (Pairwise Configuration) Layer 1 Parts in Layer 2 can occlude parts in Layer 1
Layered Pictorial Structures (LPS) Cow Instance Layer 2 Transformations Θ1 P(Θ1) = 0.9 Layer 1
Layered Pictorial Structures (LPS) Cow Instance Layer 2 Transformations Θ2 P(Θ2) = 0.8 Layer 1
Layered Pictorial Structures (LPS) Unlikely Instance Layer 2 Transformations Θ3 P(Θ3) = 0.01 Layer 1
How to learn the LPS: from video, via motion segmentation; see Kumar, Torr and Zisserman, ICCV 2005.
LPS for Detection Learning Learnt automatically using a set of examples Detection Matches LPS to image using Loopy Belief Propagation Localizes object parts
Detection Like a proposal process.
Pictorial Structures (PS): Fischler and Elschlager, 1973. PS = 2D parts + configuration. Aim: learn pictorial structures in an unsupervised manner. Layered Pictorial Structures (LPS) = parts + configuration + relative depth: identify parts, learn the configuration, learn the relative depth of parts.
Pictorial Structures: affine warp of parts. Each part is a variable; states are image locations AND affine deformations.
Pictorial Structures. Each part is a variable; states are image locations. The MRF favours certain configurations.
Bayesian Formulation (MRF). D = image; Di = pixels ∈ pi, given li (PDF Projection Theorem); z = sufficient statistics. ψ(li,lj) = const if the configuration is valid, 0 otherwise (a Potts model).
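A minimal sketch of that pairwise term, with a toy notion of "valid configuration" (part j sits near an expected offset from part i; the offset and tolerance are invented for illustration, not learnt):

```python
def psi(li, lj, expected_offset, tol=2.0, const=1.0):
    """psi(l_i, l_j): const if part locations li, lj form a valid
    pairwise configuration (within tol of expected_offset), else 0."""
    dx = lj[0] - li[0] - expected_offset[0]
    dy = lj[1] - li[1] - expected_offset[1]
    return const if dx * dx + dy * dy <= tol * tol else 0.0
```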
Defining the likelihood We want a likelihood that can combine both the outline and the interior appearance of a part. Define features which will be sufficient statistics to discriminate foreground and background:
Features Outline: z1 Chamfer distance Interior: z2 Textons Model joint distribution of z1 z2 as a 2D Gaussian.
Chamfer Match Score. Outline (z1): minimum chamfer distance over multiple outline exemplars, dcham = (1/n) Σi min{ minj ||ui − vj||, τ }. [Figure: image, edge image, distance transform.]
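A direct sketch of the truncated chamfer score above. In practice the inner minimum is read off a precomputed distance transform of the edge image; this version computes it by brute force to stay self-contained.

```python
import math

def chamfer_distance(template_pts, edge_pts, tau=10.0):
    """d_cham = (1/n) * sum_i min{ min_j ||u_i - v_j||, tau }:
    mean distance from each template point u_i to its nearest image
    edge point v_j, truncated at tau for robustness to missing edges."""
    total = 0.0
    for u in template_pts:
        nearest = min(math.dist(u, v) for v in edge_pts)
        total += min(nearest, tau)
    return total / len(template_pts)
```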
Texton Match Score. Texture (z2): MRF classifier (Varma and Zisserman, CVPR '03), with multiple texture exemplars x of class t. Textons: 3 × 3 square neighbourhood, vector-quantised in texton space. Descriptor: histogram of the texton labelling, compared with the χ² distance.
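The χ² comparison between two texton histograms can be sketched as follows (the small `eps` guards against empty bins and is an implementation detail, not from the slides):

```python
def chi_square(h1, h2, eps=1e-12):
    """Chi-squared distance between two (normalised) histograms:
    0.5 * sum_k (h1_k - h2_k)^2 / (h1_k + h2_k)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))
```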
Bag of Words / Histogram of Textons. Having slagged off bags of words, I reveal we used one all along; no big deal. So this is like a spatially aware bag-of-words model, using a spatially flexible set of templates to work out our bag of words.
2. Fitting the Model. Efficient likelihood evaluation: cascades of classifiers. Solving the MRF: LBP (use the fast algorithm), GBP if LBP doesn't converge; could use semidefinite programming (2003); recent work shows a second-order cone programming method does best (CVPR 2006).
Efficient Detection of Parts: a cascade of classifiers. At the top level, use chamfer matching and the distance transform for efficient pre-filtering; at lower levels, use the full texture model for verification, with efficient nearest-neighbour speed-ups.
Cascade of Classifiers (for each part): Y. Amit and D. Geman '97; S. Baker and S. Nayar '95.
High Levels based on Outline (x,y)
Side note: chamfer matching is like a linear classifier on the distance-transform image (Felzenszwalb). A tree is a set of linear classifiers; a pictorial structure is a parameterized family of linear classifiers.
Low Levels on Texture. The top levels of the tree use the outline to eliminate patches of the image efficiently, using the chamfer distance on a precomputed distance map; the remaining candidates are evaluated using the full texture model.
Efficient Nearest Neighbour (Goldstein, Platt and Burges, MSR Tech Report, 2003). Convert the fixed-distance search to a rectangle search: bitvectorij(Rk) = 1 if Rk ∈ Ii in dimension j, 0 otherwise. To find the nearest neighbour of x: find the intervals containing x in all dimensions, AND the appropriate bitvectors, then run nearest-neighbour search on the pruned exemplars.
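The pruning step can be sketched as follows, using Python sets in place of hardware bitvectors (the AND of bitvectors becomes set intersection; the rectangle representation `rects[k] = [(lo, hi) per dimension]` is an assumption of this sketch):

```python
def prune_candidates(query, rects):
    """Per dimension, collect the exemplar rectangles whose interval
    contains the query coordinate, then intersect across dimensions.
    Only the surviving exemplars need an exact distance computation."""
    survivors = None
    for j, q in enumerate(query):
        bits = {k for k, r in enumerate(rects) if r[j][0] <= q <= r[j][1]}
        survivors = bits if survivors is None else survivors & bits
    return sorted(survivors)
```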
Recently: solve via integer programming relaxations. SDP formulation (Torr 2001, AI Stats); SOCP formulation (Kumar, Torr & Zisserman, this conference); LBP (Huttenlocher and many others).
Outline Problem Formulation Form of Shape Prior Optimization Results
Optimization. Given image D, find the best labelling m* = arg max p(m|D). Treat the LPS parameter Θ as a latent (hidden) variable in an EM framework: E-step, sample the distribution over Θ; M-step, obtain the labelling m.
E-Step. Given an initial labelling m', determine p(Θ|m',D). Problem: efficiently sampling from p(Θ|m',D). Solution: we develop an efficient sum-product loopy belief propagation (LBP) for matching the LPS, similar to the efficient max-product LBP for the MAP estimate (Felzenszwalb and Huttenlocher, CVPR '04).
Results Different samples localize different parts well. We cannot use only the MAP estimate of the LPS.
M-Step. Given samples from p(Θ|m',D), get the new labelling mnew. Each sample Θi provides: an object localization from which to learn RGB distributions of object and background, and a shape prior for segmentation. Problem: maximize the expected log-likelihood using all samples, and obtain the new labelling efficiently.
M-Step w1 = P(Θ1|m’,D) Cow Image Shape Θ1 RGB Histogram for Object RGB Histogram for Background
M-Step: w1 = P(Θ1|m',D). With shape Θ1, the best labelling m (labels) over D (pixels) is found efficiently using a single graph cut.
Segmentation using Graph Cuts. [Figure: s-t graph with Obj and Bkg terminals; terminal edges carry Φx(D|bkg) + Φx(bkg|Θ) and Φz(D|obj) + Φz(obj|Θ); neighbour edges carry Ψxy(mx,my) + Φ(D|mx,my).]
M-Step w2 = P(Θ2|m’,D) Cow Image Shape Θ2 RGB Histogram for Object RGB Histogram for Background
M-Step: w2 = P(Θ2|m',D). With shape Θ2, the best labelling m (labels) over D (pixels) is found efficiently using a single graph cut.
M-Step: combine the samples, w1·E(m,Θ1) + w2·E(m,Θ2) + …. The best labelling m* = arg min ∑i wi E(m,Θi) is found efficiently using a single graph cut.
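To make the arg min concrete, here is a brute-force version on a tiny binary chain. This is only a sketch: the paper finds the same minimiser exactly and efficiently with a single graph cut, since the weighted sum of the sample energies is again a graph-representable energy; enumeration is used here purely for illustration.

```python
from itertools import product

def best_labelling(n, weighted_unaries, pairwise):
    """Arg min over binary labellings of an n-pixel chain.
    weighted_unaries holds one (already weight-scaled) unary table
    per shape sample Theta_i; pairwise is shared across samples."""
    def total(m):
        e = sum(u[x][m[x]] for u in weighted_unaries for x in range(n))
        e += sum(pairwise(m[x], m[x + 1]) for x in range(n - 1))
        return e
    return min(product((0, 1), repeat=n), key=total)
```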
Outline Problem Formulation Form of Shape Prior Optimization Results
Results Using LPS Model for Cow Image Segmentation
Results Using LPS Model for Cow In the absence of a clear boundary between object and background Image Segmentation
Results Using LPS Model for Cow Image Segmentation
Results Using LPS Model for Cow Image Segmentation
Results Using LPS Model for Horse Image Segmentation
Results Using LPS Model for Horse Image Segmentation
Results Image Our Method Leibe and Schiele
Results: shape only (without the colour term Φx(D|mx)), appearance only (without the shape term Φx(mx|Θ)), and shape + appearance.
Face Detector and ObjCut
Do we really need accurate models? Segmentation boundary can be extracted from edges Rough 3D Shape-prior enough for region disambiguation
Energy of the Pose-specific MRF. The energy to be minimized has a unary term, a pairwise potential (Potts model), and a shape prior. But what should the value of Θ be?
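Written out in the same notation as the ObjCut energy earlier in the talk, the pose-specific energy has the form (a reconstruction from the listed terms; any weighting constants from the paper are omitted):

```latex
E(\mathbf{m},\Theta) \;=\; \sum_{x}\Big(\Phi_x(D \mid m_x) + \Phi_x(m_x \mid \Theta)\Big)
\;+\; \sum_{(x,y)}\Big(\Psi_{xy}(m_x, m_y) + \Phi(D \mid m_x, m_y)\Big)
```

Here Φx(mx|Θ) is the shape prior induced by the pose Θ, and the pairwise sum combines the Potts prior with the contrast term.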
The different terms of the MRF: the likelihood of being foreground given a foreground histogram; the likelihood of being foreground given all the terms; the shape prior model. [Figure: original image, Grimson-Stauffer segmentation, shape prior (distance transform), resulting graph-cuts segmentation.]
Can segment multiple views simultaneously.
Solve via gradient descent Comparable to level set methods Could use other approaches (e.g. Objcut) Need a graph cut per function evaluation
Formulating the Pose Inference Problem.
But… to compute the MAP of E(x) w.r.t. the pose means the unary terms change at EACH iteration, and the max-flow must be recomputed! However, Kohli and Torr showed how dynamic graph cuts can be used to efficiently find MAP solutions for MRFs that change minimally from one time instant to the next (ICCV 05).
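The core update can be sketched as follows (my reading of the dynamic graph cut idea, not code from the paper): fold the change in a node's terminal (unary) capacities into the residual graph left by the previous max-flow, reparameterising when a residual capacity would go negative, since adding a constant to both t-links of a node shifts the energy by a constant without changing the optimal cut.

```python
def update_terminal(res_src, res_snk, new_src, new_snk, old_src, old_snk):
    """Update a node's residual source/sink capacities when its unary
    terms change between iterations, so the previous flow is reused
    instead of recomputing max-flow from scratch."""
    rs = res_src + (new_src - old_src)
    rk = res_snk + (new_snk - old_snk)
    if rs < 0:            # reparameterise: shift both t-link capacities
        rk += -rs         # by the same constant; the optimal cut is
        rs = 0.0          # unchanged
    if rk < 0:
        rs += -rk
        rk = 0.0
    return rs, rk
```

After updating all changed capacities, max-flow is resumed on the residual graph, which converges quickly when few terms changed.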
Dynamic Graph Cuts. Solving problem PA to get solution SA is a computationally expensive operation. If problem PB is similar to PA, the differences between A and B give a simpler problem PB*, and solving that yields SB by a cheaper operation than solving PB from scratch. [Diagram: PA → SA (expensive); differences between A and B → PB* → SB (cheap).]
Dynamic Image Segmentation: reuse the flows from the previous image frame (consecutive frames). [Figure: segmentation obtained; flows in n-edges.]
Our Algorithm. Solve the first segmentation problem on graph Ga by maximum flow, giving the MAP solution and the residual graph Gr. Apply the difference between Ga and Gb to obtain the updated residual graph G', then solve the second segmentation problem on Gb starting from G'. [Diagram: Ga → max-flow → MAP solution + residual graph (Gr); difference between Ga and Gb → updated residual graph G' → MAP solution of Gb.]
Dynamic Graph Cuts vs Active Cuts: our method recycles flow; Active Cuts recycles cuts; both methods recycle search trees.
Experimental Analysis: running time of the dynamic algorithm on an MRF of 2×10⁵ latent variables connected in a 4-neighbourhood.
Segmentation Comparison Grimson-Stauffer Bathia04 Our method
Segmentation Get rid of the ids and not the ideas
Conclusion: combining pose inference and segmentation is worth investigating. Tracking = Detection; Detection = Segmentation; Tracking = Segmentation. Segmentation = SFM??