Learning Layered Motion Segmentations of Video
M. Pawan Kumar, Philip Torr, Andrew Zisserman
University of Oxford
Aim
Given a video, learn a model for the object (input: video; output: model). The model should (ideally):
- describe the object completely and accurately
- handle self-occlusion
- be learnt in an unsupervised manner
Motivation: Object Recognition and Segmentation
Current object recognition methods often learn the model manually: hand-labelling the positions of parts, or manually segmenting training images (Leibe and Schiele, DAGM '04; Borenstein and Ullman, ECCV '02).
Motivation
Problem: such 'supervised' methods are manually intensive and practically infeasible at scale.
Solution: use readily available data such as videos, and automatically learn models that can then be used to perform object recognition.
Challenges
- Articulation
- Self-occlusion
- Lighting: c(y) = diag(a) c(x) + b
- Motion blur: c(y) = ∫ c(y - m(t)) dt
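The two degradation models on this slide can be sketched directly. The following is an illustrative toy implementation; the function names and the discretisation of the blur integral as an average over sampled offsets are my own, not from the paper:

```python
import numpy as np

# Lighting model from the slide: c(y) = diag(a) c(x) + b,
# i.e. a per-channel gain a and offset b applied to an RGB colour.
def apply_lighting(c_x, a, b):
    return np.asarray(a) * np.asarray(c_x) + np.asarray(b)

# Motion blur from the slide: c(y) = ∫ c(y - m(t)) dt, approximated here
# by averaging the signal over integer offsets sampled along the path m(t).
def apply_motion_blur(signal, path):
    return np.mean([np.roll(signal, s, axis=0) for s in path], axis=0)
```

For example, a gain of 0.5 and offset of 0.1 maps white (1, 1, 1) to (0.6, 0.6, 0.6), and a two-sample blur path spreads a point's colour across its motion.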
Using a Generative Model
Parameters Θ:
- Segments (mattes + appearance)
- Layering
- Transformations T_t
- Lighting parameters a and b
- Motion parameters m, obtained using T_{t-1} and T_t
A latent image is generated per segment per frame.
Learning the Model
Given a video D, we need to learn all model parameters Θ: segments (mattes + appearance), layering, transformations, and the lighting and motion-blur parameters. We define the posterior Pr(Θ|D), which measures how well the generated frames match the observed data, and learn the 'best' model Θ by maximizing Pr(Θ|D).
Previous Work
Sprite-based approaches (Jojic and Frey, ICCV '01; Williams and Titsias, Neural Computation '04):
- restricted to translation and rotation
- greedy optimisation
- spatial continuity not considered
- motion blur and lighting not handled
Outline
- Model Description
- Learning the Model: Initial Estimate, Refining Mattes, Updating Appearance, Refining Transformations
- Results
Model Description: Layered Representation
- Mattes of segments are represented as binary masks.
- Appearance of a segment: an RGB value per point.
- Transformations T: translation, rotation and anisotropic scale factors.
Layering
Each segment p_i is assigned a layer number l_i. Non-overlapping segments can share a layer (l_i = l_j); when two segments overlap, their layer numbers differ (l_i > l_j or l_i < l_j), with the occluding segment in front.
Energy of the Model
Pr(Θ|D) ∝ Pr(D|Θ) Pr(Θ)
Energy Ψ(Θ) = -log(Pr(D|Θ) Pr(Θ))
Maximizing Pr(Θ|D) therefore implies minimizing Ψ(Θ) = Appearance + Boundary.
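As a minimal numerical illustration of the MAP principle on this slide (the numbers are toy values, not from the paper): maximizing the posterior is the same as minimizing the negative log of likelihood times prior.

```python
# Energy(Theta) = -log( Pr(D|Theta) Pr(Theta) )
#               = -log Pr(D|Theta) - log Pr(Theta)
def energy(log_likelihood, log_prior):
    return -(log_likelihood + log_prior)

# Two hypothetical models, each described by (log-likelihood, log-prior).
candidates = {"A": (-10.0, -1.0), "B": (-8.0, -2.5)}

# The MAP model is the one with the lowest energy.
best = min(candidates, key=lambda k: energy(*candidates[k]))
```

Here model B wins: its better likelihood (-8 vs. -10) outweighs its worse prior, giving energy 10.5 against A's 11.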
Appearance
The appearance term measures the consistency of observed and generated RGB values over the entire video sequence: the generated frames are compared with the observed frames, and the differences are summed into the appearance component.
Boundary
The boundary term gives preference to segmentations whose parts are separated by image edges in most frames: for neighbouring points x and y assigned to different segments, the penalty on the energy is high if the intensities of x and y are similar, and low if they are different.
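A common contrast-sensitive form of such a boundary penalty is sketched below; the exact functional form and the noise scale sigma are assumptions for illustration, not values from the slides:

```python
import math

# Penalty paid when neighbouring points x and y fall in different segments:
# high when their intensities are similar, low across a strong image edge.
def boundary_penalty(i_x, i_y, sigma=10.0):
    return math.exp(-((i_x - i_y) ** 2) / (2.0 * sigma ** 2))
```

With this form, cutting between two equal-intensity pixels costs the maximum penalty of 1, while cutting across a strong edge costs almost nothing.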
Our Approach
1) An initial estimate of Θ is obtained by dividing the scene into rigidly moving components.
2) Mattes are optimised using graph cuts.
3) Appearance parameters are updated.
4) Transformations, lighting and motion blur are re-estimated.
Outline
- Model Description
- Learning the Model: Initial Estimate, Refining Mattes, Updating Appearance, Refining Transformations
- Results
1. Initial Estimate
Frame n is divided into rectangular patches f_i (e.g. 3×3). The patches are tracked to reconstruct frame n+1.
Tracking Patches
An MRF is defined over the patches: each patch f_k of frame n is assigned a candidate transformation t_k into frame n+1, scored by a unary likelihood φ(t_k) (e.g. φ(t_k) = 0.6, 0.9 or 0.7 for different candidate transformations).
Tracking Patches
The pairwise term encourages consistent motion of neighbouring patches: ψ(t_j, t_k) = d1_jk if t_j and t_k define a rigid motion of the pair, and ψ(t_j, t_k) = d2_jk otherwise.
Tracking Patches
Pr(t) ∝ ∏ φ(t_i) ∏ ψ(t_i, t_j); inference using belief propagation.
- Time complexity: speed-up using distance transforms (Felzenszwalb and Huttenlocher, NIPS 2004).
- Memory requirements: coarse-to-fine strategy (Vogiatzis et al., BMVC 2004), with multiple coarse labels chosen instead of only the best one.
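The distance-transform speed-up can be illustrated for a linear pairwise cost: the min-sum message m(t') = min_t (h(t) + c·|t - t'|) over L labels is computed in O(L) with one forward and one backward pass instead of O(L²). This toy sketch is my own; Felzenszwalb and Huttenlocher give the general truncated and quadratic cases.

```python
# O(L) message update for a linear pairwise cost c * |t - t'|.
def dt_message(h, c=1.0):
    m = list(h)
    for i in range(1, len(m)):            # forward pass
        m[i] = min(m[i], m[i - 1] + c)
    for i in range(len(m) - 2, -1, -1):   # backward pass
        m[i] = min(m[i], m[i + 1] + c)
    return m

# O(L^2) reference implementation, for checking the result.
def brute_force(h, c=1.0):
    L = len(h)
    return [min(h[t] + c * abs(t - tp) for t in range(L)) for tp in range(L)]
```

The two passes propagate each label's cost outward at slope c, which is exactly the lower envelope the brute-force minimisation computes.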
Coarse-to-fine Strategy
1. Group similar labels into one representative label: φ(T_i) = max_j φ(t_j).
2. Solve the 'coarser' MRF using belief propagation, with ψ(T_i, T_j) = max_{k,l} ψ(t_k, t_l).
3. Choose the m best representative labels per site.
4. Expand the chosen labels to obtain a 'smaller' MRF.
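A sketch of the label-pruning idea, written in cost (negative-log) terms so that the slide's max over likelihoods φ becomes a min over costs; the grouping of labels into contiguous runs is an assumption for illustration:

```python
# Group labels into representatives, score each group by its best member,
# keep the m best groups, and expand them back into a smaller label set.
def prune_labels(costs, group_size, m):
    groups = [list(range(i, min(i + group_size, len(costs))))
              for i in range(0, len(costs), group_size)]
    best_groups = sorted(groups, key=lambda g: min(costs[t] for t in g))[:m]
    return sorted(t for g in best_groups for t in g)
```

Keeping m representatives per site, instead of only the best one, is what lets the coarse level prune labels without committing to a single (possibly wrong) coarse choice.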
Tracking Patches
Initial Estimate Cluster rigidly moving points to obtain components Frame n Frame n+1 Components
Initial Estimate Cluster components based on appearance (cross-correlation) Smallest member of a cluster is a segment Components Segments
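The cross-correlation used for clustering components can be sketched as normalised cross-correlation between two appearance patches; this is a standard formulation assumed here for illustration, not taken verbatim from the slides:

```python
import numpy as np

# Normalised cross-correlation of two equally sized patches: +1 for
# identical appearance up to gain and offset, -1 for inverted appearance.
def ncc(a, b):
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because the patches are mean-centred and normalised, a component and a brighter copy of it (e.g. the same limb under different lighting) still correlate perfectly and fall into the same cluster.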
The initial estimate does not describe the object completely, and the layering is not determined. We refine the estimate by minimizing Ψ(Θ): surrounding points are re-labelled using consistency of motion and consistency of texture. The form of Ψ(Θ) suggests using graph cuts.
Graph Cuts
Consider the case of two segments, p_h and p_t. A graph is constructed with one node x_i per point: terminal edges W(x_i, p_h) and W(x_i, p_t) encode the appearance component, and edges W(x_j, x_k) between neighbouring nodes encode the boundary component. The form of the energy function determines whether it can be minimized exactly by a minimum cut.
Graph Cuts
The energy is of the form Σ D(f_x) + Σ V(f_x, f_y). V is called regular if V(0,0) + V(1,1) ≤ V(0,1) + V(1,0); for the layered pictorial structures (LPS) model, V is regular.
Theorem: if V is regular, then the minimum cut minimizes the energy (Kolmogorov and Zabih, PAMI '04).
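The regularity condition is easy to check mechanically; the two 2×2 tables below are examples chosen for illustration:

```python
# V is regular (graph-representable) iff V(0,0) + V(1,1) <= V(0,1) + V(1,0).
def is_regular(V):
    return V[0][0] + V[1][1] <= V[0][1] + V[1][0]

potts = [[0, 1], [1, 0]]   # penalises disagreement: regular
anti  = [[1, 0], [0, 1]]   # rewards disagreement: not regular
```

Boundary terms like the one above penalise neighbouring points that disagree, so they satisfy regularity; a term that rewarded disagreement could not be minimized exactly by a single cut.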
Multi-way Graph Cuts
Each cut assigns labels p_i and ¬p_i to the points in the binary matte of segment p_i, so the number of cuts equals the number of parts. Ideally, all cuts should be found simultaneously, but this is an NP-hard problem; instead the α,β-swap and α-expansion algorithms are used.
α,β-swap
One pair of parts is considered at a time, with all other parts kept fixed. Points belonging to one part of the pair can be re-labelled as the other part.
α-expansion
Graph cuts are found iteratively: a cut corresponding to one part is considered at a time, with all other parts kept fixed.
Theorem: α-expansion finds a strong local minimum.
Outline Model Description Learning the Model Results Initial Estimate Refining Mattes Updating appearance Refining Transformation Results
2. Refining Mattes Consider one segment at a time (along with its neighbouring segments) Segment to be refined Neighbouring Segment
2. Refining Mattes
α,β-swap moves are applied between the segment being refined and each of its neighbouring segments, followed by α-expansion moves for that segment. This is iterated over segments until the energy cannot be minimized further.
[Figure sequence: refined mattes after iterations 1 through 9, shown for Frame 1 and Frame 30.]
Outline
- Model Description
- Learning the Model: Initial Estimate, Refining Mattes, Updating Appearance, Refining Transformations
- Results
3. Updating Appearance
The appearance of a point is the mean of the RGB values of all visible points it projects onto.
4. Refining Transformations
Transformations around the initial estimate are explored; the transformation resulting in the least SSD is chosen.
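Both update steps have simple closed forms. The sketch below uses hypothetical names and data layouts (one observation per frame for a single model point, and a dictionary mapping candidate transformations to their generated frames); it is an illustration of the two rules, not the paper's implementation:

```python
import numpy as np

# Appearance update: mean RGB over the frames in which the point is visible.
def update_appearance(observations, visible):
    obs = np.asarray(observations, dtype=float)
    vis = np.asarray(visible, dtype=bool)
    return obs[vis].mean(axis=0)

# Transformation update: among candidates near the current estimate, keep the
# one whose generated frame has the least SSD against the observed frame.
def refine_transformation(candidates, generated, observed):
    ssd = {t: float(((generated[t] - observed) ** 2).sum()) for t in candidates}
    return min(ssd, key=ssd.get)
```

Restricting the occluded frames via the visibility mask is what lets the layering feed back into the appearance estimate.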
Outline
- Model Description
- Learning the Model: Initial Estimate, Refining Mattes, Updating Appearance, Refining Transformations
- Results
Results
Results – Complex Motion
Results – Poor Quality Video
Applications
The learnt model is used for several applications: motion segmentation, object recognition, and object-category-specific segmentation.
Object Recognition Matching the model to still images Multiple shape exemplars and texture examples Extending Pictorial Structures for Object Recognition – BMVC ‘04
Class-Specific Segmentation Global shape prior for graph cut based segmentation OBJ CUT – CVPR ‘05
Conclusions and Future Work
We have presented a method for the unsupervised learning of a generative model from video, and demonstrated applications to object recognition and segmentation. The method needs to be extended to handle multiple visual aspects of the object.