Learning Layered Motion Segmentations of Video
M. Pawan Kumar, Philip Torr, Andrew Zisserman
University of Oxford
Aim
Given a video, learn a model for the object (input: video; output: model). The model should (ideally):
- describe the object completely and accurately
- handle self-occlusion
- be learnt in an unsupervised manner
Motivation: Object Recognition and Segmentation
Current object recognition methods often learn the model manually: hand-labelling the positions of parts, or manually segmenting training images (Leibe and Schiele, DAGM '04; Borenstein and Ullman, ECCV '02).
Motivation
Problem: such 'supervised' methods are manually intensive and practically infeasible at scale.
Solution: use readily available data such as videos, and automatically learn models that can then be used to perform object recognition.
Challenges
- Articulation
- Self-occlusion
- Lighting: c(y) = diag(a) c(x) + b
- Motion blur: c(y) = ∫ c(y - m(t)) dt
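The two degradation models on this slide can be sketched directly. The following is an illustrative toy implementation; the function names and the discretisation of the blur integral as an average over sampled offsets are my own, not from the paper:

```python
import numpy as np

# Lighting model from the slide: c(y) = diag(a) c(x) + b,
# i.e. a per-channel gain a and offset b applied to an RGB colour.
def apply_lighting(c_x, a, b):
    return np.asarray(a) * np.asarray(c_x) + np.asarray(b)

# Motion blur from the slide: c(y) = ∫ c(y - m(t)) dt, approximated here
# by averaging the signal over integer offsets sampled along the path m(t).
def apply_motion_blur(signal, path):
    return np.mean([np.roll(signal, s, axis=0) for s in path], axis=0)
```

For example, a gain of 0.5 and offset of 0.1 maps white (1, 1, 1) to (0.6, 0.6, 0.6), and a two-sample blur path spreads a point's colour across its motion.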
Using a Generative Model
Parameters Θ:
- Segments (mattes + appearance)
- Layering
- Transformations T_t
- Lighting parameters a and b
- Motion parameters m, obtained using T_{t-1} and T_t
A latent image is generated per segment per frame.
Learning the Model
Given a video D, we need to learn all model parameters Θ: segments (mattes + appearance), layering, transformations, and the lighting and motion-blur parameters. We define the posterior Pr(Θ|D), which measures how well the generated frames match the observed data, and learn the 'best' model Θ by maximizing Pr(Θ|D).
Previous Work
Sprite-based approaches (Jojic and Frey, ICCV '01; Williams and Titsias, Neural Computation '04):
- restricted to translation and rotation
- greedy optimisation
- spatial continuity not considered
- motion blur and lighting not handled
Outline
- Model Description
- Learning the Model: Initial Estimate, Refining Mattes, Updating Appearance, Refining Transformations
- Results
Model Description: Layered Representation
- Mattes of segments are represented as binary masks.
- Appearance of a segment: an RGB value per point.
- Transformations T: translation, rotation and anisotropic scale factors.
Layering
Each segment p_i is assigned a layer number l_i. Non-overlapping segments can share a layer (l_i = l_j); when two segments overlap, their layer numbers differ (l_i > l_j or l_i < l_j), with the occluding segment in front.
Energy of the Model
Pr(Θ|D) ∝ Pr(D|Θ) Pr(Θ)
Energy Ψ(Θ) = -log(Pr(D|Θ) Pr(Θ))
Maximizing Pr(Θ|D) therefore implies minimizing Ψ(Θ) = Appearance + Boundary.
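As a minimal numerical illustration of the MAP principle on this slide (the numbers are toy values, not from the paper): maximizing the posterior is the same as minimizing the negative log of likelihood times prior.

```python
# Energy(Theta) = -log( Pr(D|Theta) Pr(Theta) )
#               = -log Pr(D|Theta) - log Pr(Theta)
def energy(log_likelihood, log_prior):
    return -(log_likelihood + log_prior)

# Two hypothetical models, each described by (log-likelihood, log-prior).
candidates = {"A": (-10.0, -1.0), "B": (-8.0, -2.5)}

# The MAP model is the one with the lowest energy.
best = min(candidates, key=lambda k: energy(*candidates[k]))
```

Here model B wins: its better likelihood (-8 vs. -10) outweighs its worse prior, giving energy 10.5 against A's 11.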
Appearance
The appearance term measures the consistency of observed and generated RGB values over the entire video sequence: the generated frames are compared with the observed frames, and the differences are summed into the appearance component.
Boundary
The boundary term gives preference to segmentations whose parts are separated by image edges in most frames: for neighbouring points x and y assigned to different segments, the penalty on the energy is high if the intensities of x and y are similar, and low if they are different.
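A common contrast-sensitive form of such a boundary penalty is sketched below; the exact functional form and the noise scale sigma are assumptions for illustration, not values from the slides:

```python
import math

# Penalty paid when neighbouring points x and y fall in different segments:
# high when their intensities are similar, low across a strong image edge.
def boundary_penalty(i_x, i_y, sigma=10.0):
    return math.exp(-((i_x - i_y) ** 2) / (2.0 * sigma ** 2))
```

With this form, cutting between two equal-intensity pixels costs the maximum penalty of 1, while cutting across a strong edge costs almost nothing.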
Our Approach
1) An initial estimate of Θ is obtained by dividing the scene into rigidly moving components.
2) Mattes are optimised using graph cuts.
3) Appearance parameters are updated.
4) Transformations, lighting and motion blur are re-estimated.
Outline
- Model Description
- Learning the Model: Initial Estimate, Refining Mattes, Updating Appearance, Refining Transformations
- Results
1. Initial Estimate
Frame n is divided into rectangular patches f_i (e.g. 3×3). The patches are tracked to reconstruct frame n+1.
Tracking Patches
An MRF is defined over the patches: each patch f_k of frame n is assigned a candidate transformation t_k into frame n+1, scored by a unary likelihood φ(t_k) (e.g. φ(t_k) = 0.6, 0.9 or 0.7 for different candidate transformations).
Tracking Patches
The pairwise term encourages consistent motion of neighbouring patches: ψ(t_j, t_k) = d1_jk if t_j and t_k define a rigid motion of the pair, and ψ(t_j, t_k) = d2_jk otherwise.
Tracking Patches
Pr(t) ∝ ∏ φ(t_i) ∏ ψ(t_i, t_j); inference using belief propagation.
- Time complexity: speed-up using distance transforms (Felzenszwalb and Huttenlocher, NIPS 2004).
- Memory requirements: coarse-to-fine strategy (Vogiatzis et al., BMVC 2004), with multiple coarse labels chosen instead of only the best one.
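The distance-transform speed-up can be illustrated for a linear pairwise cost: the min-sum message m(t') = min_t (h(t) + c·|t - t'|) over L labels is computed in O(L) with one forward and one backward pass instead of O(L²). This toy sketch is my own; Felzenszwalb and Huttenlocher give the general truncated and quadratic cases.

```python
# O(L) message update for a linear pairwise cost c * |t - t'|.
def dt_message(h, c=1.0):
    m = list(h)
    for i in range(1, len(m)):            # forward pass
        m[i] = min(m[i], m[i - 1] + c)
    for i in range(len(m) - 2, -1, -1):   # backward pass
        m[i] = min(m[i], m[i + 1] + c)
    return m

# O(L^2) reference implementation, for checking the result.
def brute_force(h, c=1.0):
    L = len(h)
    return [min(h[t] + c * abs(t - tp) for t in range(L)) for tp in range(L)]
```

The two passes propagate each label's cost outward at slope c, which is exactly the lower envelope the brute-force minimisation computes.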
Coarse-to-fine Strategy
1. Group similar labels into one representative label: φ(T_i) = max_j φ(t_j).
2. Solve the 'coarser' MRF using belief propagation, with ψ(T_i, T_j) = max_{k,l} ψ(t_k, t_l).
3. Choose the m best representative labels per site.
4. Expand the chosen labels to obtain a 'smaller' MRF.
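A sketch of the label-pruning idea, written in cost (negative-log) terms so that the slide's max over likelihoods φ becomes a min over costs; the grouping of labels into contiguous runs is an assumption for illustration:

```python
# Group labels into representatives, score each group by its best member,
# keep the m best groups, and expand them back into a smaller label set.
def prune_labels(costs, group_size, m):
    groups = [list(range(i, min(i + group_size, len(costs))))
              for i in range(0, len(costs), group_size)]
    best_groups = sorted(groups, key=lambda g: min(costs[t] for t in g))[:m]
    return sorted(t for g in best_groups for t in g)
```

Keeping m representatives per site, instead of only the best one, is what lets the coarse level prune labels without committing to a single (possibly wrong) coarse choice.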
Tracking Patches
Initial Estimate Cluster rigidly moving points to obtain components Frame n Frame n+1 Components
Initial Estimate Cluster components based on appearance (cross-correlation) Smallest member of a cluster is a segment Components Segments
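The cross-correlation used for clustering components can be sketched as normalised cross-correlation between two appearance patches; this is a standard formulation assumed here for illustration, not taken verbatim from the slides:

```python
import numpy as np

# Normalised cross-correlation of two equally sized patches: +1 for
# identical appearance up to gain and offset, -1 for inverted appearance.
def ncc(a, b):
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because the patches are mean-centred and normalised, a component and a brighter copy of it (e.g. the same limb under different lighting) still correlate perfectly and fall into the same cluster.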
The initial estimate does not describe the object completely, and the layering is not determined. We refine the estimate by minimizing Ψ(Θ): surrounding points are re-labelled using consistency of motion and consistency of texture. The form of Ψ(Θ) suggests using graph cuts.
Graph Cuts
Consider the case of two segments, p_h and p_t. A graph is constructed with one node x_i per point: terminal edges W(x_i, p_h) and W(x_i, p_t) encode the appearance component, and edges W(x_j, x_k) between neighbouring nodes encode the boundary component. The form of the energy function determines whether it can be minimized exactly by a minimum cut.
Graph Cuts
The energy is of the form Σ D(f_x) + Σ V(f_x, f_y). V is called regular if V(0,0) + V(1,1) ≤ V(0,1) + V(1,0); for the layered pictorial structures (LPS) model, V is regular.
Theorem: if V is regular, then the minimum cut minimizes the energy (Kolmogorov and Zabih, PAMI '04).
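The regularity condition is easy to check mechanically; the two 2×2 tables below are examples chosen for illustration:

```python
# V is regular (graph-representable) iff V(0,0) + V(1,1) <= V(0,1) + V(1,0).
def is_regular(V):
    return V[0][0] + V[1][1] <= V[0][1] + V[1][0]

potts = [[0, 1], [1, 0]]   # penalises disagreement: regular
anti  = [[1, 0], [0, 1]]   # rewards disagreement: not regular
```

Boundary terms like the one above penalise neighbouring points that disagree, so they satisfy regularity; a term that rewarded disagreement could not be minimized exactly by a single cut.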
Multi-way Graph Cuts
Each cut assigns labels p_i and ¬p_i to the points in the binary matte of segment p_i, so the number of cuts equals the number of parts. Ideally, all cuts should be found simultaneously, but this is an NP-hard problem; instead the α,β-swap and α-expansion algorithms are used.
α,β-swap
One pair of parts is considered at a time, with all other parts kept fixed. Points belonging to one part of the pair can be re-labelled as the other part.
α-expansion
Graph cuts are found iteratively: a cut corresponding to one part is considered at a time, with all other parts kept fixed.
Theorem: α-expansion finds a strong local minimum.
Outline Model Description Learning the Model Results Initial Estimate Refining Mattes Updating appearance Refining Transformation Results
2. Refining Mattes Consider one segment at a time (along with its neighbouring segments) Segment to be refined Neighbouring Segment
2. Refining Mattes
α,β-swap moves are applied between the segment being refined and each of its neighbouring segments, followed by α-expansion moves for that segment. This is iterated over segments until the energy cannot be minimized further.
[Figure sequence: refined mattes after iterations 1 through 9, shown for Frame 1 and Frame 30.]
Outline
- Model Description
- Learning the Model: Initial Estimate, Refining Mattes, Updating Appearance, Refining Transformations
- Results
3. Updating Appearance
The appearance of a point is the mean of the RGB values of all visible points it projects onto.
4. Refining Transformations
Transformations around the initial estimate are explored; the transformation resulting in the least SSD is chosen.
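Both update steps have simple closed forms. The sketch below uses hypothetical names and data layouts (one observation per frame for a single model point, and a dictionary mapping candidate transformations to their generated frames); it is an illustration of the two rules, not the paper's implementation:

```python
import numpy as np

# Appearance update: mean RGB over the frames in which the point is visible.
def update_appearance(observations, visible):
    obs = np.asarray(observations, dtype=float)
    vis = np.asarray(visible, dtype=bool)
    return obs[vis].mean(axis=0)

# Transformation update: among candidates near the current estimate, keep the
# one whose generated frame has the least SSD against the observed frame.
def refine_transformation(candidates, generated, observed):
    ssd = {t: float(((generated[t] - observed) ** 2).sum()) for t in candidates}
    return min(ssd, key=ssd.get)
```

Restricting the occluded frames via the visibility mask is what lets the layering feed back into the appearance estimate.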
Outline
- Model Description
- Learning the Model: Initial Estimate, Refining Mattes, Updating Appearance, Refining Transformations
- Results
Results
Results – Complex Motion
Results – Poor Quality Video
Applications
The learnt model is used for several applications: motion segmentation, object recognition, and object-category-specific segmentation.
Object Recognition Matching the model to still images Multiple shape exemplars and texture examples Extending Pictorial Structures for Object Recognition – BMVC ‘04
Class-Specific Segmentation Global shape prior for graph cut based segmentation OBJ CUT – CVPR ‘05
Conclusions and Future Work
We have presented a method for the unsupervised learning of a generative model from video, and demonstrated applications to object recognition and segmentation. The method needs to be extended to handle multiple visual aspects of the object.