2D Human Pose Estimation in TV Shows Vittorio Ferrari Manuel Marin Andrew Zisserman Dagstuhl Seminar July 2008.

Slides:



Advertisements
Similar presentations
POSE–CUT Simultaneous Segmentation and 3D Pose Estimation of Humans using Dynamic Graph Cuts Mathieu Bray Pushmeet Kohli Philip H.S. Torr Department of.
Advertisements

1 Hierarchical Part-Based Human Body Pose Estimation * Ramanan Navaratnam * Arasanathan Thayananthan Prof. Phil Torr * Prof. Roberto Cipolla * University.
O BJ C UT M. Pawan Kumar Philip Torr Andrew Zisserman UNIVERSITY OF OXFORD.
Pose Estimation and Segmentation of People in 3D Movies Karteek Alahari, Guillaume Seguin, Josef Sivic, Ivan Laptev Inria, Ecole Normale Superieure ICCV.
Human Identity Recognition in Aerial Images Omar Oreifej Ramin Mehran Mubarak Shah CVPR 2010, June Computer Vision Lab of UCF.
Ľubor Ladický1 Phil Torr2 Andrew Zisserman1
Shape Sharing for Object Segmentation
Recovering Human Body Configurations: Combining Segmentation and Recognition Greg Mori, Xiaofeng Ren, and Jitentendra Malik (UC Berkeley) Alexei A. Efros.
- Recovering Human Body Configurations: Combining Segmentation and Recognition (CVPR’04) Greg Mori, Xiaofeng Ren, Alexei A. Efros and Jitendra Malik -
Large Scale Visual Recognition Challenge (ILSVRC) 2013: Detection spotlights.
Learning to estimate human pose with data driven belief propagation Gang Hua, Ming-Hsuan Yang, Ying Wu CVPR 05.
Articulated Pose Estimation with Flexible Mixtures of Parts
Bangpeng Yao Li Fei-Fei Computer Science Department, Stanford University, USA.
Structural Human Action Recognition from Still Images Moin Nabi Computer Vision Lab. ©IPM - Oct
Yuanlu Xu Advisor: Prof. Liang Lin Person Re-identification by Matching Compositional Template with Cluster Sampling.
2.Our Framework 2.1. Enforcing Temporal Consistency by Post Processing  Human Detection from Yang and Ramanan [1] Articulated Pose Estimation using Flexible.
LOCUS (Learning Object Classes with Unsupervised Segmentation) A variational approach to learning model- based segmentation. John Winn Microsoft Research.
Forward-Backward Correlation for Template-Based Tracking Xiao Wang ECE Dept. Clemson University.
Global spatial layout: spatial pyramid matching Spatial weighting the features Beyond bags of features: Adding spatial information.
Object-centric spatial pooling for image classification Olga Russakovsky, Yuanqing Lin, Kai Yu, Li Fei-Fei ECCV 2012.
Enhancing Exemplar SVMs using Part Level Transfer Regularization 1.
Contour Based Approaches for Visual Object Recognition Jamie Shotton University of Cambridge Joint work with Roberto Cipolla, Andrew Blake.
Student: Yao-Sheng Wang Advisor: Prof. Sheng-Jyh Wang ARTICULATED HUMAN DETECTION 1 Department of Electronics Engineering National Chiao Tung University.
Cascaded Models for Articulated Pose Estimation
Modeling Pixel Process with Scale Invariant Local Patterns for Background Subtraction in Complex Scenes (CVPR’10) Shengcai Liao, Guoying Zhao, Vili Kellokumpu,
Beyond Actions: Discriminative Models for Contextual Group Activities Tian Lan School of Computing Science Simon Fraser University August 12, 2010 M.Sc.
1. Introduction Humanising GrabCut: Learning to segment humans using the Kinect Varun Gulshan, Victor Lempitksy and Andrew Zisserman Dept. of Engineering.
TextonBoost : Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation J. Shotton*, J. Winn†, C. Rother†, and A.
The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects By John Winn & Jamie Shotton CVPR 2006 presented by Tomasz.
An Iterative Optimization Approach for Unified Image Segmentation and Matting Hello everyone, my name is Jue Wang, I’m glad to be here to present our paper.
Spatial Pyramid Pooling in Deep Convolutional
What, Where & How Many? Combining Object Detectors and CRFs
DVMM Lab, Columbia UniversityVideo Event Recognition Video Event Recognition: Multilevel Pyramid Matching Dong Xu and Shih-Fu Chang Digital Video and Multimedia.
Generic object detection with deformable part-based models
A Scale and Rotation Invariant Approach to Tracking Human Body Part Regions in Videos Yihang BoHao Jiang Institute of Automation, CAS Boston College.
Action Recognition Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem 04/21/11.
Shape-Based Human Detection and Segmentation via Hierarchical Part- Template Matching Zhe Lin, Member, IEEE Larry S. Davis, Fellow, IEEE IEEE TRANSACTIONS.
Building Face Dataset Shijin Kong. Building Face Dataset Ramanan et al, ICCV 2007, Leveraging Archival Video for Building Face DatasetsLeveraging Archival.
A General Framework for Tracking Multiple People from a Moving Camera
Pedestrian Detection and Localization
MURI: Integrated Fusion, Performance Prediction, and Sensor Management for Automatic Target Exploitation 1 Dynamic Sensor Resource Management for ATE MURI.
Deformable Part Models (DPM) Felzenswalb, Girshick, McAllester & Ramanan (2010) Slides drawn from a tutorial By R. Girshick AP 12% 27% 36% 45% 49% 2005.
Template matching and object recognition. CS8690 Computer Vision University of Missouri at Columbia Matching by relations Idea: –find bits, then say object.
Tracking People by Learning Their Appearance Deva Ramanan David A. Forsuth Andrew Zisserman.
Learning the Appearance and Motion of People in Video Hedvig Sidenbladh, KTH Michael Black, Brown University.
MSRI workshop, January 2005 Object Recognition Collected databases of objects on uniform background (no occlusions, no clutter) Mostly focus on viewpoint.
Geodesic Saliency Using Background Priors
Action and Gait Recognition From Recovered 3-D Human Joints IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS— PART B: CYBERNETICS, VOL. 40, NO. 4, AUGUST.
CVPR 2006 New York City Spatial Random Partition for Common Visual Pattern Discovery Junsong Yuan and Ying Wu EECS Dept. Northwestern Univ.
CVPR2013 Poster Detecting and Naming Actors in Movies using Generative Appearance Models.
O BJ C UT M. Pawan Kumar Philip Torr Andrew Zisserman UNIVERSITY OF OXFORD.
Discussion of Pictorial Structures Pedro Felzenszwalb Daniel Huttenlocher Sicily Workshop September, 2006.
Inference in generative models of images and video John Winn MSR Cambridge May 2004.
Pictorial Structures and Distance Transforms Computer Vision CS 543 / ECE 549 University of Illinois Ian Endres 03/31/11.
Recognition Using Visual Phrases
Object Recognition by Integrating Multiple Image Segmentations Caroline Pantofaru, Cordelia Schmid, Martial Hebert ECCV 2008 E.
Tracking Hands with Distance Transforms Dave Bargeron Noah Snavely.
Learning Image Statistics for Bayesian Tracking Hedvig Sidenbladh KTH, Sweden Michael Black Brown University, RI, USA
1 Bilinear Classifiers for Visual Recognition Computational Vision Lab. University of California Irvine To be presented in NIPS 2009 Hamed Pirsiavash Deva.
Recent developments in object detection
Karel Lebeda, Simon Hadfield, Richard Bowden
Object detection with deformable part-based models
Nonparametric Semantic Segmentation
Dynamical Statistical Shape Priors for Level Set Based Tracking
A Tutorial on HOG Human Detection
“The Truth About Cats And Dogs”
Object Detection + Deep Learning
Outline Background Motivation Proposed Model Experimental Results
Liyuan Li, Jerry Kah Eng Hoe, Xinguo Yu, Li Dong, and Xinqi Chu
Human-object interaction
Presentation transcript:

2D Human Pose Estimation in TV Shows Vittorio Ferrari Manuel Marin Andrew Zisserman Dagstuhl Seminar July 2008

Goal: detect people and estimate 2D pose in images/video Pose spatial configuration of body parts Estimation localize body parts in Desired scope TV shows and feature films focus on upper-body Body parts fixed aspect-ratio rectangles for head, torso, upper and lower arms Ramanan and Forsyth, Mori et al., Sigal and Black, Felzenszwalb and Huttenlocher, Bergtholdt et al.

The search space of pictorial structures is large Search space - 4P dim (a scale factor per part) - 3P+1 dims (a single scale factor) - P = 6 for upper-body - 720x405 image = 10^45 configs ! Kinematic constraints - reduce space to valid body configs (10^28) - efficiency by model independencies (10^12) Body parts fixed aspect-ratio rectangles for head, torso, upper and lower arms = 4 parameters each Ramanan and Forsyth, Mori et al., Sigal and Black, Felzenszwalb and Huttenlocher, Bergtholdt et al.

The challenge of unconstrained imagery varying illumination and low contrast; moving camera and background; multiple people; scale changes; extensive clutter; any clothing

The challenge of unconstrained imagery Extremely difficult when knowing nothing about appearance/pose/location

Approach overview Human detection Foreground highlighting reduce search space estimate pose new additions Ramanan NIPS 2006 every frame independently multiple frames jointly Image parsing Learn appearance models over multiple frames Spatio-temporal parsing

Single frame

Search space reduction by human detection + fixes scale of body parts + sets bounds on x,y locations - little info about pose (arms) Idea get approximate location and scale with a detector generic over pose and appearance Building an upper-body detector - based on Dalal and Triggs CVPR train = 96 frames X 12 perturbations (no Buffy) + fast Benefits for pose estimation detectedenlarged + detects also back views Train Test

Search space reduction by human detection Upper body detection and temporal association code available!

Search space reduction by foreground highlighting + further reduce clutter + conservative (no loss 95.5% times) Idea exploit knowledge about structure of search area to initialize Grabcut (Rother et al. 2004) Initialization - learn fg/bg models from regions where person likely present/absent - clamp central strip to fg - don’t clamp bg (arms can be anywhere) Benefits for pose estimation + needs no knowledge of background + allows for moving background initialization output

Search space reduction by foreground highlighting + further reduce clutter + conservative (no loss 95.5% times) Initialization - learn fg/bg models from regions where person likely present/absent - clamp central strip to fg - don’t clamp bg (arms can be anywhere) Benefits for pose estimation + needs no knowledge of background + allows for moving background Idea exploit knowledge about structure of search area to initialize Grabcut (Rother et al. 2004)

Pose estimation by image parsing + much more robust + much faster (10x-100x) Goal estimate posterior of part configuration Algorithm 1. inference with = edges 2. learn appearance models of body parts and background 3. inference with = edges + appearance Advantages of space reduction = image evidence (given edge/app models) = spatial prior (kinematic constraints) Ramanan 06

Pose estimation by image parsing appearance models edge parse edge + app parse + much more robust + much faster (10x-100x) Goal estimate posterior of part configuration Algorithm 1. inference with = edges 2. learn appearance models of body parts and background 3. inference with = edges + appearance Advantages of space reduction = image evidence (given edge/app models) = spatial prior (kinematic constraints) Ramanan 06

Pose estimation by image parsing appearance models edge parse edge + app parse + much more robust + much faster (10x-100x) Goal estimate posterior of part configuration Algorithm 1. inference with = edges 2. learn appearance models of body parts and background 3. inference with = edges + appearance Advantages of space reduction no foreground highlighting = image evidence (given edge/app models) = spatial prior (kinematic constraints) Ramanan 06

Failure of direct pose estimation Ramanan NIPS 2006 unaided Ramanan 06

Multiple frames

Transferring appearance models across frames Idea refine parsing of difficult frames, based on appearance models from confident ones (exploit continuity of appearance) lowest-entropy frames higher-entropy frame Advantages + improve parse in difficult frames + better than Ramanan CVPR 2005: richer and more robust models need no specific pose at any point in video Algorithm 1. select frames with low entropy of P(L|I) 2. integrate their appearance models by min KL-divergence 3. re-parse every frame using integrated appearance models

lowest-entropy frames higher-entropy frame Idea refine parsing of difficult frames, based on appearance models from confident ones (exploit continuity of appearance) Transferring appearance models across frames Algorithm 1. select frames with low entropy of P(L|I) 2. integrate their appearance models by min KL-divergence 3. re-parse every frame using integrated appearance models Advantages + improve parse in difficult frames + better than Ramanan CVPR 2005: richer and more robust models need no specific pose at any point in video

Idea refine parsing of difficult frames, based on appearance models from confident ones (exploit continuity of appearance) Transferring appearance models across frames parse Algorithm 1. select frames with low entropy of P(L|I) 2. integrate their appearance models by min KL-divergence 3. re-parse every frame using integrated appearance models Advantages + improve parse in difficult frames + better than Ramanan CVPR 2005: richer and more robust models need no specific pose at any point in video

Idea refine parsing of difficult frames, based on appearance models from confident ones (exploit continuity of appearance) Transferring appearance models across frames re-parse Algorithm 1. select frames with low entropy of P(L|I) 2. integrate their appearance models by min KL-divergence 3. re-parse every frame using integrated appearance models Advantages + improve parse in difficult frames + better than Ramanan CVPR 2005: richer and more robust models need no specific pose at any point in video

Idea extend kinematic tree with edges preferring non-overlapping left/right arms Model - add repulsive edges - inference with Loopy Belief Propagation Advantage + less double-counting The repulsive model = kinematic constraints = repulsive prior

Idea extend kinematic tree with edges preferring non-overlapping left/right arms Model - add repulsive edges - inference with Loopy Belief Propagation Advantage + less double-counting The repulsive model = kinematic constraints = repulsive prior

Spatio-temporal parsing Spatio-temporal parsing Transferring appearance models exploits continuity of appearance Image parsing Learn appearance models over multiple frames After repulsive model Now exploit continuity of position time

Spatio-temporal parsing Spatio-temporal parsing Idea extend model to a spatio-temporal block time Model - add temporal edges - inference with Loopy Belief Propagation = image evidence (given edge+app models) = spatial prior (kinematic constraints) = temporal prior (continuity constraints)

Spatio-temporal parsing Spatio-temporal parsing Idea extend model to a spatio-temporal block time Model - add temporal edges - inference with Loopy Belief Propagation = image evidence (given edge+app models) = spatial prior (kinematic constraints) = temporal prior (continuity constraints) = repulsive prior (anti double-counting)

Spatio-temporal parsing Spatio-temporal parsing Integrated spatio-temporal parsing reduces uncertainty Image parsing Learn appearance models over multiple frames Spatio-temporal parsing spatio-temporal no spatio-temporal

Full-body pose estimation in easier conditions Weizmann action dataset (Blank et at. ICCV 05)

Evaluation dataset - >70000 frames over 4 episodes of Buffy the Vampire Slayer (>1000 shots) - uncontrolled and extremely challenging low contrast, scale changes, moving camera and background, extensive clutter, any clothing, any pose - figure overlays = final output, after spatio-temporal inference

Example estimated poses

Quantitative evaluation Ground-truth 69 shots x 4 = 276 frames x 6 = 1656 body parts (sticks) Upper-body detector fires on 243 frames (88%, 1458 body parts) Data available ! ~ vgg/data/stickmen/index.html

Quantitative evaluation Pose estimation accuracy 62.6% with repulsive model ( with spatio-temporal parsing) 59.4% with appearance transfer 57.9% with foreground highlighting (best single-frame) 41.2% with detection 9.6% Ramanan NIPS 2006 unaided Conclusions : + both reduction techniques improve results += small improvement by appearance transfer … + … method good also on static images + repulsive model brings us further = spatio-temporal parsing mostly aesthetic impact

Application: foreground/background segmentation Idea use localized body parts to initialize a spatio-temporal extension to Grabcut Initialization - partition shot into blocks - learn fg/bg models from eroded/dilated body part segmentations - clamp eroded to fg Spatio-temporal Grabcut - add edges over time (same edge model) - share fg/bg models over whole block

Application: foreground/background segmentation Applications - remove the actor before location recognition - get silhouette for action recognition Benefit + spatially and temporally continuous figure/ground segmentation

Video !