1 Learning the Appearance and Motion of People in Video Hedvig Sidenbladh, KTH hedvig@nada.kth.se, www.nada.kth.se/~hedvig/ Michael Black, Brown University black@cs.brown.edu, www.cs.brown.edu/people/black/

2 Collaborators David Fleet, Xerox PARC fleet@parc.xerox.com Dirk Ormoneit, Stanford University ormoneit@stat.stanford.edu Jan-Olof Eklundh, KTH joe@nada.kth.se

3 Goal Tracking and reconstruction of human motion in 3D –Articulated 3D model –Monocular sequence –Pinhole camera model –Unknown, cluttered environment

4 Why is it Important? Human-machine interaction –Robots –Intelligent rooms Video search Animation, motion capture Surveillance

5 Why is it Hard? Structure is unobservable - inference from visible parts People appear in many ways - how to find a model that fits them all?

6 Why is it Hard? People move fast and non-linearly 3D to 2D projection ambiguities Large occlusion Similar appearance of different limbs Large search space (Figure: an extreme case)

7 Bayesian Inference Exploit cues in the images. Learn likelihood models: p(image cue | model). Build models of human form and motion. Learn priors over model parameters: p(model). Represent the posterior distribution: p(model | cue) ∝ p(cue | model) p(model)
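
As a toy illustration of this formulation (not the authors' implementation), the unnormalized posterior over a set of sampled candidate poses is simply likelihood times prior, then normalized. The 1-D pose parameter and the Gaussian forms below are placeholder assumptions.

```python
import numpy as np

# Minimal sketch of Bayesian inference over sampled candidate "models" (poses).
# The 1-D pose, Gaussian prior and Gaussian likelihood are illustrative only.
rng = np.random.default_rng(0)

poses = rng.uniform(-2.0, 2.0, size=1000)            # candidate model parameters
prior = np.exp(-0.5 * (poses / 1.0) ** 2)            # p(model): prefer small values
observed_cue = 0.7                                   # a hypothetical image measurement
likelihood = np.exp(-0.5 * ((observed_cue - poses) / 0.3) ** 2)  # p(cue | model)

posterior = likelihood * prior                       # p(model | cue), up to a constant
posterior /= posterior.sum()                         # normalize over the samples

print("posterior mean pose:", np.sum(posterior * poses))
```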

8 Human Model Limbs = truncated cones in 3D Pose determined by a vector of pose parameters φ

9 State of the Art: Bregler and Malik '98 Brightness constancy cue –Insensitive to appearance Full-body tracking required multiple cameras Single hypothesis

10 Brightness Constancy I(x, t+1) = I(x + u, t) + η Image motion of the foreground as a function of the 3D motion of the body. Problem: no fixed model of appearance (drift).
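
A minimal sketch of the brightness-constancy residual η for a hypothesized displacement u, assuming integer pixel shifts on a synthetic image pair; the real system maps 3-D body motion to an image flow field and handles sub-pixel warping.

```python
import numpy as np

def brightness_constancy_residual(img_t, img_t1, u, mask):
    """eta = I(x, t+1) - I(x + u, t), evaluated over the foreground mask.

    u = (dy, dx) maps a pixel at time t+1 back to its location at time t,
    following the slide's convention I(x, t+1) = I(x + u, t) + eta.
    Integer shifts only, for simplicity.
    """
    dy, dx = u
    shifted = np.roll(img_t, (-dy, -dx), axis=(0, 1))   # shifted[x] = I(x + u, t)
    return (img_t1 - shifted)[mask]

# Synthetic example: a bright square moves down-right by (1, 2) pixels,
# so the correct backward displacement is u = (-1, -2).
frame_t  = np.zeros((32, 32)); frame_t[10:15, 10:15] = 1.0
frame_t1 = np.zeros((32, 32)); frame_t1[11:16, 12:17] = 1.0
mask = frame_t1 > 0
for u in [(-1, -2), (0, 0)]:
    err = np.abs(brightness_constancy_residual(frame_t, frame_t1, u, mask)).sum()
    print(f"u = {u}: summed |eta| = {err:.1f}")
```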

11 State of the Art: Cham and Rehg '99 I(x, t) = I(x + u, 0) + η Single camera, multiple hypotheses 2D templates (no drift, but view dependent)

12 Multiple Hypotheses Posterior distribution over model parameters often multi-modal (due to ambiguities) Represent whole distribution: –sampled representation –each sample is a pose –predict over time using a particle filtering approach

13 State of the Art: Deutscher, North, Bascle, & Blake '00 Multiple hypotheses Multiple cameras Simplified clothing, lighting and background

14 State of the Art: Sidenbladh, Black, & Fleet '00 Multiple hypotheses Monocular Brightness constancy Activity-specific prior Under significant changes in view and depth, template-based methods will fail

15 How to Address the Problems –Need a constraining likelihood model that is also invariant to variations in human appearance –Need a good model of how people move –Need an effective way to explore the model space (very high dimensional) Bayesian formulation: p(model | cue) ∝ p(cue | model) p(model)

16 What do people look like? What do non-people look like? Changing background Low-contrast limb boundaries Occlusion Varying shadows Deforming clothing

17 Edge Detection? Probabilistic model? Under/over-segmentation, thresholds, …

18 Key Idea #1 Use the 3D model to predict the location of limb boundaries in the scene. Compute various filter responses steered to the predicted orientation of the limb. Compute likelihood of filter responses using a statistical model learned from examples.
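
A rough sketch of the steered filter response idea: compute image derivatives and project them onto the normal of the predicted limb boundary. The simple central-difference derivatives below stand in for the multi-scale steerable filters actually used; function and variable names are illustrative.

```python
import numpy as np

def steered_edge_response(image, theta):
    """Edge response steered to a predicted boundary orientation theta (radians).

    Projects the image gradient onto the boundary normal, approximating the
    derivative across the predicted limb edge. Central differences stand in
    for the paper's multi-scale filters.
    """
    gy, gx = np.gradient(image.astype(float))
    normal = (np.cos(theta + np.pi / 2), np.sin(theta + np.pi / 2))
    return gx * normal[0] + gy * normal[1]

# Example: a vertical bright stripe. Steering to the stripe's orientation
# (theta = pi/2) gives a strong response; steering to a horizontal edge does not.
img = np.zeros((64, 64)); img[:, 28:36] = 1.0
resp_aligned = steered_edge_response(img, np.pi / 2)
resp_misaligned = steered_edge_response(img, 0.0)
print("max |response| aligned:   ", np.abs(resp_aligned).max())
print("max |response| misaligned:", np.abs(resp_misaligned).max())
```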

19 Key Idea #2 “Explain” the entire image: p(image | foreground, background) Foreground: the person. Background: generic, unknown.

20 Key Idea #2 p(image | foreground, background) ∝ p(foreground part of image | foreground) / p(foreground part of image | background) Do not look in parts of the image considered background.
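
A small sketch of this likelihood-ratio idea: learned distributions of a filter response on limbs vs. background are turned into a log likelihood ratio, summed only over pixels hypothesized as foreground. The Gaussian-shaped "learned" distributions here are stand-ins for the empirical histograms in the paper.

```python
import numpy as np

# Stand-in "learned" distributions of a filter response, discretized into bins.
# In the paper these come from hand-labelled limb and background pixels.
bins = np.linspace(-1.0, 1.0, 33)
centers = 0.5 * (bins[:-1] + bins[1:])
p_on  = np.exp(-0.5 * ((centers - 0.4) / 0.2) ** 2)   # response on limb edges
p_off = np.exp(-0.5 * (centers / 0.3) ** 2)           # response on background
p_on  /= p_on.sum()
p_off /= p_off.sum()

def log_likelihood_ratio(responses):
    """Sum of log p(response | foreground) / p(response | background)
    over the pixels hypothesized to lie on the foreground."""
    idx = np.clip(np.digitize(responses, bins) - 1, 0, len(centers) - 1)
    return np.sum(np.log(p_on[idx] + 1e-12) - np.log(p_off[idx] + 1e-12))

rng = np.random.default_rng(1)
limb_like = rng.normal(0.4, 0.2, size=200)   # responses along a correct limb hypothesis
bg_like   = rng.normal(0.0, 0.3, size=200)   # responses along a wrong hypothesis
print("good pose log-ratio:", log_likelihood_ratio(limb_like))
print("bad pose log-ratio: ", log_likelihood_ratio(bg_like))
```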

21 Training Data (figure panels: points on limbs, points on background)

22 Edge Distributions Edge response steered to the model edge (similar to Konishi et al., CVPR '99)

23 Edge Likelihood Ratio (figure panels: edge response, likelihood ratio)

24 Ridge Distributions Ridge response steered to the limb orientation Ridge response appears only at certain image scales!

25 Ridge Likelihood Ratio (figure panels: ridge response, likelihood ratio)

26 Motion Training Data Motion response = I(x, t+1) − I(x + u, t), i.e. the temporal brightness change given the model of motion: the noise term in the brightness constancy assumption.

27 Motion distributions Different underlying motion models

28 Foreground and Background Likelihood (figure panels: temporal change and likelihood, for background and for limb)

29 Likelihood Formulation Independence assumptions: –Cues: p(image | model) = p(cue1 | model) p(cue2 | model) –Spatial: p(image | model) = ∏_{x ∈ image} p(image(x) | model) –Scales: p(image | model) = ∏_{σ = 1,…} p(image(σ) | model) Combines cues and scales! A simplification: in reality there are dependencies.
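
A schematic of these independence assumptions: in log space the likelihood becomes a sum over cues, pixels and scales. The cue names, array shapes and the per-pixel log-density functions below are placeholders, not the learned models.

```python
import numpy as np

def log_likelihood(cue_responses, log_p_fg, log_p_bg):
    """Combine cues, pixels and scales under the independence assumptions.

    cue_responses : dict mapping cue name -> array (n_scales, n_pixels) of
                    filter responses on hypothesized foreground pixels.
    log_p_fg/bg   : functions giving log p(response | fg) and log p(response | bg).

    The product over cues, pixels x and scales sigma becomes a sum of logs.
    """
    total = 0.0
    for cue, resp in cue_responses.items():          # product over cues
        for scale_resp in resp:                      # product over scales sigma
            total += np.sum(log_p_fg(scale_resp)     # product over pixels x,
                            - log_p_bg(scale_resp))  # as a fg/bg log ratio
    return total

# Placeholder Gaussian log-densities standing in for the learned histograms.
log_fg = lambda r: -0.5 * ((r - 0.4) / 0.2) ** 2
log_bg = lambda r: -0.5 * (r / 0.3) ** 2

rng = np.random.default_rng(2)
cues = {"edge":  rng.normal(0.4, 0.2, size=(3, 100)),   # 3 scales, 100 pixels
        "ridge": rng.normal(0.4, 0.2, size=(3, 100))}
print("combined log-likelihood ratio:", log_likelihood(cues, log_fg, log_bg))
```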

30 Likelihood (terms over foreground pixels and background pixels)

31 Step One Discussed… –Need a constraining likelihood model that is also invariant to variations in human appearance –Need a good model of how people move Bayesian formulation: p(model | cue) ∝ p(cue | model) p(model)

32 Models of Human Dynamics Models of dynamics are used to propagate the sampled distribution in time. Constant velocity model: –All DOF in the model parameter space φ are independent –Angles are assumed to change with constant speed –Speed and position changes are randomly sampled from a normal distribution
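
A minimal sketch of such a constant-velocity temporal prior: each DOF keeps its velocity, and independent Gaussian noise is added to both the angles and the velocities. The noise scales and the number of DOF are illustrative assumptions.

```python
import numpy as np

def constant_velocity_step(angles, velocities, rng,
                           sigma_angle=0.02, sigma_vel=0.01):
    """Propagate one pose sample one time step forward.

    Each DOF is treated independently: angles change with (roughly) constant
    speed, and both angles and speeds receive Gaussian noise. The standard
    deviations here are illustrative values, not the trained ones.
    """
    new_velocities = velocities + rng.normal(0.0, sigma_vel, size=velocities.shape)
    new_angles = angles + new_velocities + rng.normal(0.0, sigma_angle, size=angles.shape)
    return new_angles, new_velocities

rng = np.random.default_rng(3)
angles = np.zeros(25)            # e.g. joint angles plus global pose DOF
velocities = np.full(25, 0.05)   # per-frame angular velocities
angles, velocities = constant_velocity_step(angles, velocities, rng)
print("first few propagated angles:", np.round(angles[:5], 3))
```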

33 Models of Human Dynamics Action-specific model - Walking –Training data: 3D motion capture data –From the training set, learn the mean cycle and common modes of deviation (PCA) (figure panels: mean cycle, small noise, large noise)
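
A sketch of the action-specific prior: learn a mean walking cycle and its principal modes of deviation from joint-angle curves, then generate new cycles as the mean plus noisy PCA coefficients ("small" vs. "large" noise). The synthetic "mocap" data, array sizes and noise scales are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)

# Placeholder "mocap" data: 50 walking cycles, each a 30-joint-angle curve
# sampled at 40 phases of the gait cycle, flattened to one row per cycle.
n_cycles, n_phases, n_joints = 50, 40, 30
phase = np.linspace(0, 2 * np.pi, n_phases)
base = np.sin(phase)[:, None] * np.linspace(0.2, 1.0, n_joints)[None, :]
cycles = base[None] + 0.05 * rng.normal(size=(n_cycles, n_phases, n_joints))
X = cycles.reshape(n_cycles, -1)

# Learn the mean cycle and the common modes of deviation via PCA (SVD).
mean_cycle = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean_cycle, full_matrices=False)
n_modes = 5
modes = Vt[:n_modes]                       # principal deviation modes
stddevs = S[:n_modes] / np.sqrt(n_cycles)  # spread along each mode

def sample_cycle(noise_scale=1.0):
    """Mean cycle plus noisy PCA coefficients ('small' or 'large' noise)."""
    coeffs = rng.normal(0.0, noise_scale * stddevs)
    return (mean_cycle + coeffs @ modes).reshape(n_phases, n_joints)

print("small-noise cycle shape:", sample_cycle(0.5).shape)
print("large-noise cycle shape:", sample_cycle(2.0).shape)
```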

34 Step Two Also Discussed… –Need a constraining likelihood model that is also invariant to variations in human appearance –Need a good model of how people move –Need an effective way to explore the model space (very high dimensional) Bayesian formulation: p(model | cue) ∝ p(cue | model) p(model)

35 Particle Filter (diagram: posterior → sample → temporal dynamics → sample → likelihood → normalize → posterior) Problem: expensive representation of the posterior! Approaches to the problem: lower the number of samples (Deutscher et al., CVPR '00); represent the space in other ways (Choo and Fleet, ICCV '01)
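
A generic particle-filter loop in the spirit of the slide's diagram: sample from the posterior, propagate through the temporal dynamics, weight by the likelihood, normalize. The 1-D state, the random-walk dynamics and the Gaussian likelihood are placeholder assumptions, not the body-tracking model.

```python
import numpy as np

rng = np.random.default_rng(5)
n_particles = 1500

def dynamics(states):
    """Temporal prior: a random-walk placeholder for the learned dynamics."""
    return states + rng.normal(0.0, 0.1, size=states.shape)

def likelihood(states, observation):
    """Placeholder image likelihood p(cue | model)."""
    return np.exp(-0.5 * ((observation - states) / 0.2) ** 2)

# Initial sampled posterior (1-D state for illustration).
particles = rng.normal(0.0, 1.0, size=n_particles)
weights = np.full(n_particles, 1.0 / n_particles)

for t, obs in enumerate([0.1, 0.3, 0.5]):            # fake observations
    # 1. sample: resample particles according to their weights
    idx = rng.choice(n_particles, size=n_particles, p=weights)
    particles = particles[idx]
    # 2. propagate through the temporal dynamics
    particles = dynamics(particles)
    # 3. weight by the likelihood and normalize
    weights = likelihood(particles, obs)
    weights /= weights.sum()
    print(f"t={t}: posterior mean = {np.sum(weights * particles):.3f}")
```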

36 Tracking an Arm Moving camera, constant velocity model 1500 samples ~2 min/frame

37 Self Occlusion Constant velocity model 1500 samples ~2 min/frame

38 Walking Person Walking model 2500 samples ~10 min/frame The number of samples was reduced from 15,000 to 2,500 by using the learned likelihood

39 Ongoing and Future Work Learned dynamics Correlation across scale Estimate background motion Statistical models of color and texture Automatic initialization

40 Lessons Learned Probabilistic (Bayesian) framework allows –Integration of information in a principled way –Modeling of priors Particle filtering allows –Multi-modal distributions –Tracking with ambiguities and non-linear models Learning image statistics and combining cues improves robustness and reduces computation

41 Conclusions Generic, learned, model of appearance –Combines multiple cues –Exploits work on image statistics Use the 3D model to predict features Model of foreground and background –Exploits the ratio between foreground and background likelihood –Improves tracking

42 Other Related Work J. Sullivan, A. Blake, M. Isard, and J. MacCormick. Object localization by Bayesian correlation. ICCV '99. J. Sullivan, A. Blake, and J. Rittscher. Statistical foreground modelling for object localisation. ECCV '00. J. Rittscher, J. Kato, S. Joga, and A. Blake. A probabilistic background model for tracking. ECCV '00. S. Wachter and H. Nagel. Tracking of persons in monocular image sequences. CVIU, 74(3), 1999.

