Stochastic Tracking of Humans Michael J. Black Department of Computer Science Brown University
Collaborators Hedvig Sidenbladh Royal Institute of Technology (KTH), Sweden Dirk Ormoneit and Trevor Hastie Dept. of Statistics, Stanford University David Fleet Xerox PARC and Queen’s University Allan Jepson University of Toronto
Goal: 3D Human Motion * 3D articulated model * Perspective projection * Monocular sequence * Unknown, cluttered, environment * Infer 3D human motion from 2D image motion.
Overview * Why is 3D human motion important? * Why is recovering it hard? * A Bayesian approach * generative model * robust likelihood function * temporal prior model (learning) * stochastic search (particle filtering) * Where are we going? * Recent advances & state of the art. * What remains to be done?
Why is it Important? Applications Human-Computer Interaction Surveillance Motion capture (games and animation) Video search/annotation Work practice analysis. Social display of puzzlement * detect moving regions * estimate motion * model articulated objects * model temporal patterns of activity * interpret the motion
Why is it Hard? The appearance of people can vary dramatically. Bones and joints are unobservable (muscle, skin, clothing hide the underlying structure). (inference)
Why is it hard? People can appear in arbitrary poses. They can deform in complex ways. Occlusion results in ambiguities and multiple interpretations.
Why is it hard? Geometrically under-constrained.
Other Problems * non-linear dynamics of limbs * similarity of appearance of different limbs (matching ambiguities) * image noise * outliers Our models are approximations. Image changes that are not modeled (e.g. clothing deformation) will be outliers.
Common Assumptions * Multiple Cameras (additional constraints, occlusion) * Color Images (locate face and hands) * Known Background (background subtraction to locate person) * Batch process an entire sequence. * Known Initialization (to be avoided)
Requirements 1. Represent uncertainty and multiple hypotheses. 2. Model non-linear dynamics of the body. 3. Exploit image cues in a robust fashion. 4. Integrate information over time. 5. Combine multiple image cues.
Bayesian Inference Build models of human form and motion. Learn priors over model parameters: p(model). Exploit cues in the images. Model robust likelihoods: p(image cue | model). Represent the posterior distribution: p(model | cue) ∝ p(cue | model) p(model)
Problems A simple articulated human model may have 30+ parameters (e.g. joint angles; 60+ with velocities). Models of human action are non-linear and likelihood models will be multi-modal. Key challenges (common to other domains): representation, learning, and search in high dimensional spaces.
Bayesian Formulation Represent a distribution over 3D poses. * define generative model of image appearance * multi-modal posterior over model parameters - sampled representation - particle filtering approach. * focus on image motion as a cue (adding edges,…)
Generative Model: Shape * 3D Articulated Body Model * pinhole camera * parameter vector =
Generative Model: Motion t-1 t Projection of image texture onto the 3D model Projection of model appearance into image coordinates
Appearance Model Could be many things * template (Cham & Rehg ‘99) * eigen-model (Sidenbladh, et al ‘00) * texture model * filter responses (edges, ridges, …) * learned over time Simple probabilistic model: Markov assumption
Noise Model Generative model: Mixture of Gaussian and uniform outlier distribution: Function of surface orientation
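A minimal sketch of the mixture noise model named on this slide, assuming a zero-mean Gaussian for inliers and a uniform density over 256 gray levels for outliers. The function name, `sigma`, and `outlier_prob` are illustrative values, not taken from the talk. The slide notes a dependence on surface orientation; here `sigma` is simply held constant, which is a simplification.

```python
import math


def robust_pixel_likelihood(residual, sigma=10.0, outlier_prob=0.1, gray_levels=256):
    """Likelihood of a brightness residual under a Gaussian + uniform mixture.

    All parameter values are assumptions chosen for illustration.
    """
    # Inlier term: zero-mean Gaussian density evaluated at the residual.
    gaussian = math.exp(-0.5 * (residual / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)
    # Outlier term: uniform density over the possible gray levels.
    uniform = 1.0 / gray_levels
    return (1.0 - outlier_prob) * gaussian + outlier_prob * uniform
```

Because of the uniform term the likelihood never drops to zero, so a single unmodeled pixel change (e.g. clothing deformation) cannot veto an otherwise good hypothesis.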
Generative Model: Temporal * general smooth motion or, * action-specific motion (walking) First order Markov assumption on angles, , and angular velocity, V: Explore two models of human motion
Bayesian Formulation Posterior over shape, velocity, and appearance given an image sequence. Likelihood of observing the image given the shape and appearance parameters Temporal model Posterior from previous time instant
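The four labelled terms on this slide correspond to the standard recursive Bayesian filtering equation. The version below is a sketch under assumed notation: \phi_t for shape (joint angles), V_t for velocity, A_t for appearance, I_t for the image at time t, \vec{I}_t for the image sequence up to t, and \kappa for a normalizing constant. The exact symbols used in the talk are not recoverable from the text.

```latex
% s_t = (\phi_t, V_t, A_t) is the assumed state vector.
p(s_t \mid \vec{I}_t)
  = \kappa \,
    \underbrace{p(I_t \mid \phi_t, A_t)}_{\text{likelihood}}
    \int
    \underbrace{p(s_t \mid s_{t-1})}_{\text{temporal model}} \,
    \underbrace{p(s_{t-1} \mid \vec{I}_{t-1})}_{\text{previous posterior}}
    \, ds_{t-1}
```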
Robust Likelihood For n random pixels from limb j compute: where
Temporal Model: Smooth Motion * individual angles and velocities assumed independent
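A minimal sketch of a smooth-motion prior with independent angles and velocities, assuming Gaussian perturbations: each angle advances by its velocity plus noise, and each velocity does a small random walk. The function name and both noise scales are hypothetical; the talk's actual densities are not given in the text.

```python
import random


def predict_smooth(angles, velocities, sigma_angle=0.05, sigma_velocity=0.02, rng=random):
    """Draw one sample from an assumed constant-velocity temporal prior.

    Each joint is treated independently, as stated on the slide.
    """
    # Velocity: previous velocity plus Gaussian noise.
    new_velocities = [v + rng.gauss(0.0, sigma_velocity) for v in velocities]
    # Angle: previous angle advanced by previous velocity, plus Gaussian noise.
    new_angles = [a + v + rng.gauss(0.0, sigma_angle) for a, v in zip(angles, velocities)]
    return new_angles, new_velocities
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.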
What does the posterior look like? Shoulder: 3dof Elbow: 1dof Elbow bends
Particle Filtering * large literature (Gordon et al ‘93, Isard & Blake ‘96,…) * non-Gaussian posterior approximated by N discrete samples * explicitly represent the ambiguities * exploit stochastic sampling for tracking
Particle Filter [Diagram: Posterior → sample → Temporal dynamics → sample → Likelihood → normalize → Posterior]
Arm Tracking: Smooth motion prior Particle filter * represents ambiguity * propagates information over time Display: expected value of joint angles.
Full-Body Tracking * parameter space too large * constrain posterior to valid 3D human motions. * learn generative models automatically from training data. 3D motion-capture data (joint angles vs. time; from M. Gleicher): * segment into “movemes” (Bregler) * train probabilistic model.
Learning Temporal Models * Motion capture data is noisy, data is missing, activities are performed differently. * For cyclic motion (important but special class): 1. Detect cycles and segment 2. Account for missing data 3. Preserve continuity of cycles 4. Statistical model of variation * Approaches should generalize to non-cyclic motion. (Dirk Ormoneit & Trevor Hastie)
Detecting Cycles Automatically detect length of cycles, Automatically segment and align cycles.
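The slide does not say how cycle length is detected. One common stand-in technique is to pick the lag that maximizes the autocorrelation of a joint-angle signal; the sketch below does that and is not claimed to be the method used in this work. The function name and arguments are hypothetical.

```python
import math


def estimate_cycle_length(signal, min_lag=2, max_lag=None):
    """Return the lag (in samples) with the highest mean autocorrelation.

    Autocorrelation peak picking is an assumed substitute for the
    unspecified cycle detector in the talk.
    """
    n = len(signal)
    if max_lag is None:
        max_lag = n // 2
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    best_lag, best_score = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Average product over the overlapping part of the signal and its shifted copy.
        overlap = n - lag
        score = sum(centered[i] * centered[i + lag] for i in range(overlap)) / overlap
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Multiples of the true period score about equally well, so `max_lag` should be kept below two expected cycles.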
Modeling Cyclic Motion Automatically align 3D data with a reference curve represented using periodically constrained regression splines.
Modeling Cyclic Motion * Iterative SVD method (from gene expression work) * computes SVD in Fourier domain * construct a rank-q approximation and take inverse Fourier transform * impute missing data from the approximation * repeat until convergence. * Segment into cycles, compute mean curve and represent variation by performing PCA on data. * SVD must enforce periodicity and cope with missing data.
Action-Specific Model The joint angles at time t are a linear combination of the basis motions evaluated at phase [Figure panels: Mean curve; Basis curves]
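A minimal sketch of evaluating such an action-specific model, assuming the pose is the mean curve plus a weighted sum of basis curves, all evaluated at the same phase in [0, 1). The toy curves and all names here are made up for illustration; the learned curves come from motion-capture data.

```python
import math


def joint_angles(phase, mean_curve, basis_curves, coefficients):
    """Mean curve plus a linear combination of basis curves at one phase."""
    angles = list(mean_curve(phase))
    for c, basis in zip(coefficients, basis_curves):
        for i, b in enumerate(basis(phase)):
            angles[i] += c * b
    return angles


def toy_mean(phase):
    # Hypothetical two-joint mean walking cycle.
    return [math.sin(2.0 * math.pi * phase), 0.5 * math.cos(2.0 * math.pi * phase)]


def toy_basis(phase):
    # Hypothetical single mode of variation around the mean.
    return [math.cos(2.0 * math.pi * phase), math.sin(4.0 * math.pi * phase)]


pose = joint_angles(0.25, toy_mean, [toy_basis], [0.3])
```

With all coefficients set to zero the model reproduces the mean walker; larger coefficients give more extreme variations.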
Temporal Model: Walking Parameters of the generative model are now Probabilistic model for
Learned Walking Model * mean walker
Learned Walking Model * sample with small
Learned Walking Model * sample with moderate
Learned Walking Model * sample with very large
Stochastic 3D Tracking Stochastic 3D tracking (manual initialization) Use motion information to update and track distribution over time
Stochastic 3D Tracking * significant changes in view and depth. * template-based methods will fail.
No likelihood * how strong is the walking prior? (or is our likelihood doing anything?)
Issues * Large parameter space * approx samples * sparsely represented * not real time * Flow-based models can drift * Requires initialization
Lessons Learned * Probabilistic (Bayesian) framework allows - integration of information over time - modeling of priors - explicit generative image model * Particle filtering allows - multi-modal distributions - tracking with ambiguities and non-linear models * Weak image cues necessitate strong priors and many samples.
Work to be done * better appearance model - other cues (Color, edges, appearance,…) * automatic initialization using 2D models * learn more general models of motion * better occlusion model (new) * model of the background motion (new) * better representations of the posterior (Fleet&Chou) * better sampling methods (Fleet&Ormoneit) * adapt shape of limbs
Very preliminary work…
The Statistics of People in Images and Video How do people appear in natural scenes? Want a general model. Edge Filters Ridge Filters
Statistics of Images Ruderman. Lee, Mumford, Huang. Portilla and Simoncelli. Olshausen & Field. Zhu, Wu, & Mumford. … Learning P_on and P_off for edge detection and road following: Geman and Jedynak; Konishi, Yuille, and Coughlan
Example Training Images
Distribution of Filter Responses
Ratios for different limbs
Local Contrast Normalization
Likelihood Foreground pixels Background pixels
Benefits Generic model of appearance. Principled way to choose filters. Model of foreground and background is incorporated into the tracking framework: it exploits the ratio between foreground and background likelihoods and improves tracking. The same has been done for ridges and motion.
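A minimal sketch of the foreground/background ratio idea, assuming filter responses have already been quantized into histogram bins and that `p_on` and `p_off` are learned histograms for responses on and off the person. The names and the `eps` guard are illustrative assumptions.

```python
import math


def log_likelihood_ratio(bin_indices, p_on, p_off, eps=1e-9):
    """Sum of log(p_on / p_off) over sampled pixels' quantized filter responses.

    Positive values favor foreground (person); negative values favor background.
    """
    return sum(math.log((p_on[b] + eps) / (p_off[b] + eps)) for b in bin_indices)
```

Working with the ratio means only pixels under the projected model need to be evaluated, since background terms for the rest of the image are the same for every hypothesis.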
Outlook 5 years: - Relatively reliable people tracking in monocular video. - Path is pretty clear. … solve the vision problem. Next step: Beyond person-centric - people interacting with object/world Beyond that: Recognizing action - goals, intentions,... … solve the AI problem.
Some Related Work * Bregler & Malik: image motion, single hypothesis, full-body required multiple cameras, scaled ortho. * Ju, Black, Yacoob: cardboard person model, image motion, 2D. * Deutscher et al.: Condensation, edge cues, background subtraction. * Cham & Rehg: known templates, 2D (SPM), particle filter. * Wachter & Nagel: nicely combines motion and edges, single hypothesis (Kalman filter). * Leventon & Freeman: assumes 2D tracking, probabilistic formulation, learned temporal model (full body, monocular, articulated)
Conclusions Bayesian formulation for tracking 3D human figures using monocular image information. * Generative model of image appearance. * Non-linear model represents ambiguities, singularities, occlusion, etc. - sampled representation of posterior. * Particle filtering for incremental estimation. * Automatic learning of cyclic motion prior. Rich framework for modeling the complexity of human motion.
Initialization Using 2D Model * Full-body walking model. * Constructed from 3D mocap data. * 2D, view-based (every 30 degrees) * 4 subjects, 14 cycles
2D, View-Based Walker * Construct linear optical flow basis * Use similar Bayesian framework for tracking (Black CVPR’99) * Coarse estimate of 3D parameters * Automatic initialization Example Bases:... 0 degrees 90 degrees
Recent Results * Box indicates mean position and scale. * Recovers distribution over phase and 3D scale.
Contrast Normalization Locally weight image derivatives by Global contrast normalization (Lee, Mumford & Huang)
Optimizing the Filters Choose contrast normalization to maximize detection accuracy. Measures: ROC curve, Bhattacharyya, Kullback-Leibler
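Two of the measures named on this slide can be computed directly from a pair of discrete histograms (for example, foreground and background filter-response distributions). The sketch below assumes both inputs are normalized histograms over the same bins; the function names and the `eps` guard are illustrative.

```python
import math


def bhattacharyya_distance(p, q, eps=1e-300):
    """-log of the Bhattacharyya coefficient between two discrete distributions."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return -math.log(max(coefficient, eps))


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q); bins where p is zero contribute nothing."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0.0)
```

Larger values of either measure indicate that the two distributions are easier to tell apart, which is the property a normalization choice would aim to increase.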
Local Contrast Normalization
Representing the Posterior represented by discrete set of N samples Normalized likelihood:
Condensation 1. Selection Sample from posterior at t-1 Most probable states selected most often. 2. Prediction. 3. Updating
Condensation 1. Selection 2. Prediction/Diffusion (sample from ), i.e. from the temporal prior: (a) Compute (b) Sample from (c) Sample from 3. Updating
Condensation 1. Selection 2. Prediction/Diffusion (sample from ) Models the dynamics: 3. Updating
Condensation 1. Selection 2. Prediction 3. Updating (the distribution) Evaluate new likelihood. Repeat until N new samples have been generated. Compute normalized probability distribution.
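The three Condensation steps can be sketched for a toy one-dimensional state. This is an illustration under strong assumptions: random-walk dynamics and a Gaussian likelihood stand in for the talk's temporal prior and robust image likelihood, and all names and noise scales are hypothetical.

```python
import math
import random


def condensation_step(particles, weights, observation, rng,
                      process_sigma=0.5, obs_sigma=1.0):
    """One select / predict / update cycle for a scalar state."""
    n = len(particles)
    # 1. Selection: resample so that probable states are chosen most often.
    selected = rng.choices(particles, weights=weights, k=n)
    # 2. Prediction/diffusion: sample from the (assumed random-walk) temporal prior.
    predicted = [s + rng.gauss(0.0, process_sigma) for s in selected]
    # 3. Updating: evaluate the (assumed Gaussian) likelihood of each new sample.
    likelihoods = [math.exp(-0.5 * ((observation - s) / obs_sigma) ** 2) for s in predicted]
    total = sum(likelihoods)
    if total == 0.0:
        # Every likelihood underflowed; fall back to uniform weights.
        return predicted, [1.0 / n] * n
    # Normalize so the weights form a probability distribution.
    return predicted, [l / total for l in likelihoods]


rng = random.Random(0)
particles = [rng.uniform(-10.0, 10.0) for _ in range(500)]
weights = [1.0 / len(particles)] * len(particles)
for _ in range(5):
    particles, weights = condensation_step(particles, weights, 3.0, rng)
```

The real tracker follows the same loop, but each particle is a full pose vector and the likelihood is computed from image measurements.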
Visualizing Results Expected value of state parameter
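For a sampled posterior, the expected value of each state parameter is the weight-averaged sample value. A minimal sketch, with hypothetical names, assuming each sample is a list of parameters and the weights are non-negative:

```python
def expected_state(samples, weights):
    """Weighted mean of each state dimension over a set of samples."""
    total = sum(weights)
    dims = len(samples[0])
    return [sum(w * s[d] for s, w in zip(samples, weights)) / total for d in range(dims)]
```

When the posterior is multi-modal the mean can fall between modes, so it is a display convenience rather than a full summary of the distribution.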
Likelihood * To cope with occluded limbs or those viewed at narrow angles, we introduce a probability of occlusion. * likelihood of observing limb j is then * likelihood of the model is product of limb likelihoods
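The equations on this slide were lost, so the sketch below shows one plausible form only: each limb's likelihood mixes a "visible" term (product of per-pixel likelihoods) with a constant "occluded" term, weighted by an occlusion probability, and the model likelihood is the product over limbs (a sum in the log domain). The mixing form, names, and constants are assumptions.

```python
import math


def limb_likelihood(pixel_likelihoods, occlusion_prob, occluded_likelihood):
    """Assumed mixture of visible and occluded explanations for one limb."""
    visible = math.prod(pixel_likelihoods)
    return (1.0 - occlusion_prob) * visible + occlusion_prob * occluded_likelihood


def model_log_likelihood(limbs):
    """Log of the product of limb likelihoods.

    `limbs` is a list of (pixel_likelihoods, occlusion_prob, occluded_likelihood).
    """
    return sum(math.log(limb_likelihood(*limb)) for limb in limbs)
```

With many pixels per limb the raw product can underflow, so a practical version would accumulate per-pixel logs as well.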
Indexing/Search The crux of the problem. The parameter space is huge. Brute force search is infeasible (ditto discretely sampling the space). Need to index into correct part of the space. Use a hierarchy of models of increasing complexity Images Generic Models (expansion, rotation,…) Coarse Object Models (EigenPeople) Detailed Models (shape & activity) Compute likelihood Index w/ Jepson & Fleet.
Initialization * new spatially constrained mixture model * find appropriate mid-level representations * initialize high level models using mid-level cues
Digital Video Analysis Social display of puzzlement To automatically analyze such a sequence we must * detect moving regions * estimate and interpret the motion * model complex articulated objects such as humans * model temporal patterns of activity
Tracking Moving Structure Next steps * split/merge/kill/initialize/grow/shrink operations * probabilistic search for best interpretation of the scene * detect more complex structures (articulation)
Generative Model
Mouth Training Data * 3000 image training sequence * motion estimated between pairs of frames * utterances: “center”, “print”, “track”, “release”
Learned Spatial Model * 3 basis flow fields account for 85% of variance. * fewer needed for recognition than for accurate estimation.
Mouth Temporal Models
Mouth Results
Results
Mouth Results
Bayesian Formulation Let be the image measurements at time t. Let be a sequence of measurements from 0 to t. Let be a state. We want * not Gaussian. Measurement likelihood. Can’t be represented in closed form. Temporal prior. Can be sampled
Generative Model (Brightness Constancy) Optical flow
Representing the Posterior represented by discrete set of S samples
Stochastic Search * Particle filtering (Condensation): 1. Sample from posterior at time t. 2. Predict using temporal prior. 3. Evaluate likelihood. * Predict non-Gaussian distribution over time. * Update posterior with new measurements. * Allocate computational resources to effectively explore the space.
Generative Model: Motion t-1 t
Learned Walking Model * sample with large