3D Head Pose Tracking with Linear Depth and Brightness Constraints

1 3D Head Pose Tracking with Linear Depth and Brightness Constraints
Hello everyone, and thanks for having me here today. I'm going to talk about some of the projects I worked on at Interval Research Corporation over the last few years. Most of these projects were real-time, person-centric computer vision applications. A large part of what people use their visual systems for is to observe and interact with other people, so we felt this would be a good thing to study and try to replicate. The first thing I'll talk about today, and what I'll spend the most time on, is a project we started about a year and a half ago and presented last September at the International Conference on Computer Vision. The title of the paper was "…", and you can see the names of the other people who worked on it with me. Last time I checked, the paper was still online at the URL shown here, but no guarantees any more...
M. Harville (1), A. Rahimi (2), T. Darrell (2), G. Gordon (3), J. Woodfill (3). 1: Hewlett-Packard Labs; 2: MIT AI Lab; 3: Tyzx Inc. Part of this work was done while all authors were employed by Interval Research.

2 The Basic Problem to be Solved
We want to know the rotation (3 DOF) and translation (3 DOF) that a rigid object undergoes from one frame of a video to the next.
So what do I mean by 3D pose tracking? The basic problem we are trying to solve on the way to 3D pose tracking is as follows: suppose you have a video sequence of some object moving around in a scene. We want to look at consecutive frames of this sequence and figure out how the object rotated and translated from one frame to the next. There are six degrees of freedom to this motion: 3 rotational and 3 translational.
[Figure: the object at times t and t + Δt. In this case, the inter-frame motion can be expressed as a rotation about a vertical axis, followed by a rightward translation.]

3 The Basic Problem to be Solved (cont.)
Add up these incremental motions to get the cumulative motion since the start of the video.
Motion estimation is equivalent to tracking the object's "pose": its position and orientation in some reference coordinate system. One way to visualize a pose estimate: render axes in the image as if they were rigidly affixed to the object.
If we can compute this reliably, then to get the motion of the object over the course of the entire video, we can simply compose these inter-frame motions (a sketch follows below). Computing the object motion in this way is equivalent to computing the object's pose. The pose is defined as the object's position and orientation relative to some reference coordinate system. One way to visualize pose is to imagine a coordinate system rigidly affixed to the object as it moves through the scene. I've drawn a cartoon of this here, where I've glued a little set of axes to the middle of the front face of this cube. As the cube moves around, so do the axes, and the position and orientation of the axes at any given time tell me the pose of the cube at that time.
[Figure: the cube, with axes attached, at times t and t + Δt.]
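To make "add up" concrete, here is a minimal sketch (ours, not the authors') that composes per-frame rigid motions as 4x4 homogeneous transforms; `per_frame_motions` stands in for whatever the per-frame estimator produces.

    import numpy as np

    def to_homogeneous(R, T):
        """Pack rotation matrix R (3x3) and translation T (3,) into a 4x4 transform."""
        M = np.eye(4)
        M[:3, :3] = R
        M[:3, 3] = T
        return M

    def accumulate_pose(per_frame_motions):
        """Compose a sequence of (R, T) inter-frame motions into a cumulative pose."""
        pose = np.eye(4)
        for R, T in per_frame_motions:
            pose = to_homogeneous(R, T) @ pose   # newest motion applied last
        return pose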

4 Applications - Lots!
- Perceptual user interface: understanding of head gaze, gestures
- Virtual reality: avatars; prosthetic input devices
- Camera ego-motion: robot or mobile vehicle self-localization; panoramic scene reconstruction from video
- Augmented reality: make a rendered object in a scene move with the scene even as the camera turns
- Object tracking: pick-and-place assembly machines; surveillance; automobile collision avoidance

5 Example: Head pose estimation
Approximate the head as a rigid body. We want to know which way the head is turned, and where it is in space.

6 The Inspiration In most situations, all you have is color or grayscale video from a single camera, and most prior methods have focused on how to solve the problem under these conditions => very difficult! Suppose you had a little more information: a registered, companion video of dense (per-pixel) depth. Now what would be the best thing to do, and how good is it?

7 Registered Intensity and Depth

8 The Sales Pitch for Our Solution
Under the assumption that, in addition to intensity and/or color information, you have dense depth from some source (e.g. stereo, laser, structured light), here is a method that...
- Is designed for speed (a single linear system of equations) => good for real-time applications
- Does not require an approximate shape model or prior knowledge of the object's shape
- Provides accuracy superior or comparable to other methods

9 Prior Work: Feature-Based Methods
Common approaches:
- General feature tracking + structure-from-motion
- Eye / nose / mouth tracking + rigid head model
- State of the art: Zelinsky et al. (Australia)
Common problems:
- Features disappear
- Rotation appears as translation
- Depth change must be inferred from scale change
- Data are noisy: need to integrate information optimally over the entire observation

10 An Alternative: Direct Motion Estimation
Use measurements based on changes in image values rather than tracked features -> more robust: doesn't discard uncertainty information.
- Express constraints directly on image values
- Pool information with a least-squares estimate over all pixels -> not dependent on a small set of key features
Lots of prior work: Horn and Weldon '88; Bergen et al. '92; Black and Yacoob '95; Bregler and Malik '98; Stein and Shashua '98; ...

11 Some Variable Definitions
Points in space and points in image; 3D coordinate system and motion parameters.
[Figure: camera-centered coordinate system with origin O at the camera's center of projection and axes X, Y, Z.]
System input: I(x, y) and Z(x, y) at times t and t+1. System output: inter-frame motion T (translation) and Ω (rotation).

12 Direct Motion Estimation Using BCCE
Brightness Change Constraint Equation (BCCE): image brightness is preserved along the motion,
    I(x + u\,\delta t, \; y + v\,\delta t, \; t + \delta t) = I(x, y, t)
First-order Taylor series expansion:
    I_x u + I_y v + I_t = 0
Matrix formulation:
    [\, I_x \;\; I_y \,] \, [\, u \;\; v \,]^\top = -I_t
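For illustration (not the authors' code), a minimal numpy sketch of the per-pixel BCCE terms, assuming two grayscale frames I0 and I1 given as float arrays:

    import numpy as np

    def bcce_terms(I0, I1):
        """Spatial gradients of frame I0 and the temporal difference to frame I1."""
        Ix = np.gradient(I0, axis=1)   # derivative along image columns (x)
        Iy = np.gradient(I0, axis=0)   # derivative along image rows (y)
        It = I1 - I0                   # temporal derivative for a unit frame step
        return Ix, Iy, It

    # Each pixel then contributes one linear constraint on its 2D velocity (u, v):
    #   Ix*u + Iy*v + It = 0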

13 Direct Motion Estimation Using BCCE
Relate 2D image velocities to 3D velocities via a camera projection model:
    Orthographic:  u = \dot{X}, \quad v = \dot{Y}
    Perspective (x = fX/Z, \; y = fY/Z):  u = (f\dot{X} - x\dot{Z})/Z, \quad v = (f\dot{Y} - y\dot{Z})/Z
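A minimal sketch of the perspective relation above; the function name and argument layout are ours, and f is the focal length in pixel units:

    def image_velocity_perspective(x, y, Z, Xdot, Ydot, Zdot, f):
        """2D image velocity (u, v) of a 3D point under perspective projection.
        (x, y) are image coordinates measured from the principal point."""
        u = (f * Xdot - x * Zdot) / Z
        v = (f * Ydot - y * Zdot) / Z
        return u, v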

14 Direct Motion Estimation Using BCCE
Constrain 3D velocities to be consistent with the rotation and translation of a single rigid body:
    \dot{\mathbf{X}} = \mathbf{T} + \mathbf{\Omega} \times \mathbf{X}
For small-angle rotations, R \approx I + [\Omega]_\times, where [\Omega]_\times is the skew-symmetric cross-product matrix of \Omega.
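A small sketch of this rigid-body velocity model (our helper names):

    import numpy as np

    def skew(w):
        """Cross-product matrix [w]x, so that skew(w) @ X == np.cross(w, X)."""
        return np.array([[0.0, -w[2], w[1]],
                         [w[2], 0.0, -w[0]],
                         [-w[1], w[0], 0.0]])

    def rigid_velocity(X, T, Omega):
        """Velocity of body point X under translation T and angular velocity Omega."""
        return T + skew(Omega) @ X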

15 Direct Motion Estimation Using BCCE
Chain these relations together to get one constraint equation per pixel, linear in the six motion parameters \phi = [T; \Omega]:
    Orthographic:  I_x \dot{X} + I_y \dot{Y} + I_t = 0,  with  \dot{\mathbf{X}} = T + \Omega \times \mathbf{X}
    Perspective:   I_x (f\dot{X} - x\dot{Z})/Z + I_y (f\dot{Y} - y\dot{Z})/Z + I_t = 0,  with  \dot{\mathbf{X}} = T + \Omega \times \mathbf{X}
Combine across pixels into one linear system and solve for [T, \Omega] via QR or SVD (a sketch follows below). This equation is the basis for a host of methods, yet you rarely see it written this way. The reason is that...
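Here is a sketch of how those per-pixel rows might be assembled and solved for the perspective case, using the rigid-flow coefficients implied by the equations above; this is an illustrative layout, not the paper's implementation. Inputs are flattened float arrays over the pixels being used, with (x, y) measured from the principal point:

    import numpy as np

    def solve_motion_bcce(x, y, Z, Ix, Iy, It, f):
        """Least-squares phi = [Tx, Ty, Tz, wx, wy, wz] from per-pixel BCCE rows."""
        zeros = np.zeros_like(x)
        # Coefficients of u and v in phi, from the perspective rigid-flow model.
        U = np.stack([f / Z, zeros, -x / Z, -x * y / f, f + x * x / f, -y], axis=1)
        V = np.stack([zeros, f / Z, -y / Z, -(f + y * y / f), x * y / f, x], axis=1)
        A = Ix[:, None] * U + Iy[:, None] * V   # one BCCE row per pixel
        b = -It
        # lstsq is SVD-based, matching the "QR or SVD" suggestion on the slide.
        phi, *_ = np.linalg.lstsq(A, b, rcond=None)
        return phi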

16 Direct Motion Estimation Using BCCE
Z is unknown!
Past solutions:
- Assume an approximate shape: planar (Black and Yacoob), ellipsoidal (Basu and Pentland; Bregler and Malik), polygonal (Essa et al.), hyperquadrics, etc.
- Use a laser-scanned 3D model of the object to be tracked
- Estimate depth and motion successively via linear or non-linear methods, or together with non-linear optimization => "open loop" issues
... the Z's in this equation are unknown! The ugly little fact of the matter is that you usually do not have real values to plug in here. People have taken a number of approaches to overcoming this, and this is largely the source of variety among the papers on direct motion estimation. One common approach is to assume an approximate shape, which leads to errors. Another is to obtain a detailed model of the object in advance, which is not possible in many applications. Finally, perhaps the most mathematically impressive class of approaches is to solve for both the motion and the depth at the same time via some nonlinear optimization scheme, or to solve for one after the other in successive linear or non-linear optimizations. These approaches are usually either very slow, because of the heavy optimization involved, or very unstable: error compounds quickly, and you rarely see results for more than a few dozen frames.

17 “Direct Depth”: two new ideas
1. Use (independently measured) Z directly in the BCCE.
- Believe it or not, this appears to be novel.
- Frees us from a shape model that is either approximate (e.g. planar, ellipsoidal, etc.) or must be known a priori.
- The shape model can change (slowly) over time: allows for 360-degree rotations, better handles non-rigidity.
- Related to the direct motion stereo of [Shieh et al.] and [Stein and Shashua], but their methods assume infinitesimal camera baselines and require a coarse-to-fine solution if disparities > 1 pixel are generated. Also, they compute motion before depth; we use depth directly.

18 “Direct Depth”: two new ideas
2. Express a direct constraint on the depth gradient. It operates on the depth image very much as the classic Brightness Change Constraint Equation (BCCE) operates on the intensity image. We call this the "Depth Change Constraint Equation", or "DCCE".

19 The DCCE
The depth image obeys an analogous constraint, except that a point's depth is not constant under motion: it changes at exactly the point's velocity in Z. This gives the Depth Change Constraint Equation:
    Z_x u + Z_y v + Z_t = \dot{Z}
Add in perspective projection and constrain to a single rigid motion (\dot{\mathbf{X}} = T + \Omega \times \mathbf{X}): each pixel again contributes one equation linear in [T; \Omega]. Very similar to our result for the BCCE.
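Continuing the sketch from the BCCE slide (again illustrative, not the paper's code), the corresponding DCCE rows, with the \dot{Z} term moved onto the coefficient side:

    import numpy as np

    def dcce_rows(x, y, Z, Zx, Zy, Zt, f):
        """Per-pixel DCCE rows: Zx*u + Zy*v + Zt = Zdot, all linear in phi."""
        zeros = np.zeros_like(x)
        ones = np.ones_like(x)
        U = np.stack([f / Z, zeros, -x / Z, -x * y / f, f + x * x / f, -y], axis=1)
        V = np.stack([zeros, f / Z, -y / Z, -(f + y * y / f), x * y / f, x], axis=1)
        # Zdot = Tz + wx*Y - wy*X, with X = x*Z/f and Y = y*Z/f.
        Zd = np.stack([zeros, zeros, ones, y * Z / f, -x * Z / f, zeros], axis=1)
        A = Zx[:, None] * U + Zy[:, None] * V - Zd
        b = -Zt
        return A, b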

20 DCCE vs. BCCE
Advantages of DCCE over BCCE:
- Depth information is more robust to lighting changes in space and time. The BCCE is an assumption that holds only for perfectly uniform illumination and Lambertian surfaces, whereas the DCCE is just a linearization of a generic description of motion in 3D.
But... real-time depth data tends to be very noisy and full of holes! Smoothing seems to help.

21 Joint Constraint on Rigid Motion
Our proposal: combine the BCCE and DCCE constraint equations into a single linear system by stacking both sets of per-pixel rows:
    [\, A_{BCCE} ;\; A_{DCCE} \,] \, \phi = [\, b_{BCCE} ;\; b_{DCCE} \,]
This is a least-squares problem; solve for the six-parameter vector \phi = [T; \Omega] via QR or SVD.
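A sketch of the joint solve, where `lam` is an assumed relative weighting between the two constraint sets (the slide does not state one):

    import numpy as np

    def solve_joint(A_bcce, b_bcce, A_dcce, b_dcce, lam=1.0):
        """Stack brightness and depth constraint rows and solve jointly."""
        A = np.vstack([A_bcce, lam * A_dcce])
        b = np.concatenate([b_bcce, lam * b_dcce])
        phi, *_ = np.linalg.lstsq(A, b, rcond=None)
        return phi   # (Tx, Ty, Tz, wx, wy, wz)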

22 Some Important Practical Details
Support maps:
- Only use constraint equations where depth and all depth derivatives are valid.
- Ignore locations of very high depth gradient (due to self-occlusion/disocclusion).
Coordinate shift:
- If the center of the coordinate system is far from the object, it is easy to confuse translation with rotation about a distant axis, and vice versa -> numerical instability.
- Solution: at each time step, find the object centroid, compute the motion in a coordinate system centered there, then transform the motion parameters back to the world coordinate system. (See the sketch below.)
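Sketches of both details, under stated assumptions: holes encoded as Z == 0, an illustrative gradient threshold, and the centroid shift derived from \dot{X} = T_c + \Omega \times (X - c).

    import numpy as np

    def support_mask(Z, Zx, Zy, max_grad=0.5):
        """Keep pixels with valid depth, finite derivatives, and moderate gradient.
        Z == 0 encoding and the max_grad threshold are our assumptions."""
        return ((Z > 0) & np.isfinite(Zx) & np.isfinite(Zy)
                & (np.abs(Zx) < max_grad) & (np.abs(Zy) < max_grad))

    def motion_to_world(T_c, Omega, c):
        """Map motion estimated about centroid c back to world coordinates.
        Xdot = T_c + Omega x (X - c) = (T_c - Omega x c) + Omega x X,
        so the rotation is unchanged and only the translation shifts."""
        return T_c - np.cross(Omega, c), Omega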

23 Experiments
Synthetic and real sequences of moving heads:
- Synthetic sequences provide us with ground truth for quantitative analysis.
- Real sequences show it's not just theory.
- Hard cases: translation in Z, out-of-plane rotation.
Compare four motion estimation methods:
- BCCE only with planar depth -> representative of standard methods
- BCCE only with measured depth
- DCCE only
- BCCE + DCCE

24 Synthetic Image Sequences
Generated color and depth image sequences by rendering a laser-scanned model of a human face with a standard graphics package.
[Figures: rotation sequence; Z-translation sequence.]

25 Synthetic Results - Rotation Sequence

26 Synthetic Results - Z-Trans Sequence

27 Real Data Sequence

28 Real Results: Still-Frame Comparison
[Figure: selected frames tracked with BCCE + planar depth (top row) vs. BCCE + DCCE (bottom row).]

29 Real Results: Still-Frame Comparison
[Figure: selected frames tracked with BCCE + planar depth (top row) vs. BCCE + DCCE (bottom row).]

30 Real Results: BCCE with Planar Depth

31 Real Results: BCCE + DCCE

32 Extensions and Future Work
- Complement it with a slower, non-differential approach that helps detect and remove gross errors
- Real-time implementation!
- Experiment with some mathematical tweaks: constrained or weighted least squares; a second iteration per frame; coarse-to-fine to handle large motions, if needed
- More ambitious tests: 360-degree rotation, slow non-rigidity, etc. => things few or no other methods can do

33 Extensions & Future Work
- Apply the direct depth and brightness constraints without a rigid model: 3D direct optic flow.
- Ego-motion: use the joint depth and brightness constraint to recover camera motion.
- Articulated bodies: extend to use the exponential twist formalism, a la Bregler and Malik.
M. Covell, A. Rahimi, M. Harville, T. Darrell, "Articulated-pose estimation using brightness- and depth-constancy constraints," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, S.C., June 2000.

34 The End

