Real-Time Dense 3D Reconstructions: KinectFusion (2011) and Fusion4D (2016)
Eleanor Tursman
Previous Work
- Multiview stereo: not real-time
- Monocular SLAM: RGB camera, real-time, but sparse
- Pose estimation using iterative closest point (ICP)
- Active sensors using a signed distance function (SDF) for dense reconstruction instead of meshes
- Feature-based monocular SLAM system: ORB-SLAM
Previous Work contd.
- Dense, real-time reconstruction using SDF fusion

Gap to fill?
- Dense, real-time reconstruction using SDF fusion
- Global reconstruction instead of frame-to-frame matching
- Independence from scene lighting, using structured light
KinectFusion Overview
- Real-time dense reconstruction of static scenes
- Works with sensors that generate depth maps quickly (30 Hz)
- Implicit surface construction. An implicit surface is the set of all points satisfying some f(x, y, z) = 0; for example, a sphere of radius r is the zero set of x² + y² + z² − r².
- Simultaneous Localization and Mapping (SLAM)
Pipeline (image from Microsoft's KinectFusion page)
Depth Map Conversion
- Process the raw depth data so that the camera's rigid transform can be calculated
- Input: raw depth map from the Kinect at some time k
- Output: vertex map and normal map in the camera coordinate system
Depth Map Conversion contd.
Details:
- Apply a bilateral filter to smooth noise while preserving sharp edges
- Calculate the vertex map and normal map using the intrinsic camera parameters (see the sketch after this list)
- Build a 3-level pyramid representation of both maps
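A minimal sketch of the back-projection step, assuming a pinhole model with intrinsics fx, fy, cx, cy (hypothetical parameter names; the paper works with the full intrinsic matrix K) and approximating normals with forward differences:

```python
import numpy as np

def depth_to_vertex_map(depth, fx, fy, cx, cy):
    """Back-project a depth map (meters) into a per-pixel 3D vertex map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.dstack((x, y, depth))  # (h, w, 3) in camera coordinates

def vertex_to_normal_map(vertices):
    """Approximate normals from cross products of neighboring vertices.

    Border pixels wrap around here; a real implementation masks them out.
    """
    dx = np.roll(vertices, -1, axis=1) - vertices  # right neighbor minus self
    dy = np.roll(vertices, -1, axis=0) - vertices  # lower neighbor minus self
    n = np.cross(dx, dy)
    return n / np.maximum(np.linalg.norm(n, axis=2, keepdims=True), 1e-8)
```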
Volumetric Integration
- Fuse the current raw depth map into the global scene's truncated signed distance function (TSDF), using the camera pose estimate
- Input: raw depth map, camera pose T_{g,k} (estimated relative to the previous frame's transform T_{g,k-1})
- Output: updated global TSDF
Volumetric Integration contd.
Details:
- TSDF: a discrete version of the SDF, further truncated to points within ±μ of the surface
- Parallelizable
- Nearest-neighbor lookup instead of interpolation of depth values, to avoid smearing
Volumetric Integration contd.
Details:
- Converge toward SDF fusion by taking a weighted average of the TSDFs from every depth map (see the sketch after this list)
- Use the raw depth map, not the bilaterally filtered one
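A minimal sketch of the per-voxel weighted running average, assuming a flat voxel array and a constant per-frame weight (the paper's weights can depend on, e.g., the viewing angle; `depth_lookup` is a hypothetical nearest-neighbor lookup into the raw depth map):

```python
import numpy as np

def integrate_frame(tsdf, weights, voxel_pts_world, depth_lookup, world_to_cam,
                    mu=0.03, max_weight=64.0):
    """Fuse one raw depth frame into the global TSDF (weighted running average).

    tsdf, weights:    flat arrays, one entry per voxel.
    voxel_pts_world:  (N, 3) voxel centers in world coordinates.
    depth_lookup:     maps camera-space points to the measured depth of the
                      nearest pixel (nearest neighbor, no interpolation).
    world_to_cam:     4x4 rigid transform, the inverse of the camera pose.
    """
    pts = (world_to_cam[:3, :3] @ voxel_pts_world.T).T + world_to_cam[:3, 3]
    measured = depth_lookup(pts)
    # signed distance approximated along the optical axis, then truncated
    sdf = measured - pts[:, 2]
    valid = sdf >= -mu                 # ignore voxels far behind the surface
    f = np.clip(sdf / mu, -1.0, 1.0)
    w_new = 1.0                        # constant weight per frame
    tsdf[valid] = (weights[valid] * tsdf[valid] + w_new * f[valid]) \
                  / (weights[valid] + w_new)
    weights[valid] = np.minimum(weights[valid] + w_new, max_weight)
    return tsdf, weights
```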
Ray-casting
- Render a surface prediction from the zero level set of the global TSDF, using the previous camera pose estimate
- Input: global TSDF, rigid transform of the previous frame T_{g,k-1}
- Output: predicted vertex and normal maps
Ray-casting contd.
Details:
- Raycast the TSDF (the current world volume) to render the 3D volume
- Use ray skipping to accelerate the march through empty space (see the sketch after this list)
- Use a simplified version of cubic interpolation to predict the vertex and normal maps
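A minimal per-ray sketch, assuming `tsdf_at` returns the interpolated TSDF value at a world point and that `origin`/`direction` are numpy vectors; the coarse/fine step sizes are made-up values standing in for the paper's skipping scheme:

```python
def raycast(tsdf_at, origin, direction, t_max=4.0, coarse=0.04, fine=0.002):
    """March one ray through the TSDF and return the zero-crossing point.

    Takes large steps while far from any surface (ray skipping), fine steps
    near it, and linearly interpolates the exact zero crossing.
    """
    t, prev = 0.0, tsdf_at(origin)
    while t < t_max:
        step = coarse if abs(prev) > 0.5 else fine
        t += step
        cur = tsdf_at(origin + t * direction)
        if prev > 0.0 and cur <= 0.0:            # sign change: surface crossed
            t_hit = t - step * cur / (cur - prev)
            return origin + t_hit * direction
        prev = cur
    return None  # ray exited the volume without hitting a surface
```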
Camera Tracking
- Calculate the world camera pose using ICP
- Input: predicted vertex and normal maps, rigid transform of the previous frame T_{g,k-1}
- Output: rigid transform matrix T_{g,k}
Details:
- Use all of the data for pose estimation
- Assume small motion between frames
- Parallelizable processing
Camera Tracking contd.
Details:
- Align the current surface measurement maps with the predicted surface maps
- Do this with 3D iterative closest point (ICP): take the closest points as initial correspondences, then iteratively improve the result until convergence (a point-to-plane sketch follows this list)
- Outliers from ICP are tracked
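A minimal sketch of one linearized point-to-plane ICP step, assuming correspondences have already been found (KinectFusion finds them with projective data association on the GPU) and small inter-frame rotation:

```python
import numpy as np

def icp_point_to_plane_step(src, dst, dst_normals):
    """Solve one small-angle point-to-plane step.

    Minimizes sum_i (((R @ src_i + t) - dst_i) . n_i)^2 with R linearized
    as I + skew(r); returns the incremental rotation R and translation t.
    """
    A = np.hstack((np.cross(src, dst_normals), dst_normals))   # (N, 6)
    b = np.einsum('ij,ij->i', dst_normals, dst - src)          # (N,)
    rx, ry, rz, tx, ty, tz = np.linalg.lstsq(A, b, rcond=None)[0]
    R = np.array([[1.0, -rz,  ry],
                  [ rz, 1.0, -rx],
                  [-ry,  rx, 1.0]])   # small-angle approximation of R
    return R, np.array([tx, ty, tz])
```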
Pipeline Recap (image from Microsoft's KinectFusion page)
Results
- Tracking and mapping run in constant time
- Qualitative results show it is superior to frame-to-frame matching
- No expensive explicit global optimization is needed
- (video: t = 1m30s)
Demo! Can we break it? LET’S FIND OUT.
Assumptions and Limitations
- The scene is static (no sustained motion)
- The scene is fairly contained
- Initialization determines accuracy
- If the currently observed scene doesn't constrain all degrees of freedom of the camera, solving for the camera pose can yield many arbitrary solutions
- Currently, if tracking fails, the user is asked to realign the camera with the last known good position
Previous Work
- Offline multi-camera reconstructions: not real-time
- Real-time parametric methods
- Real-time reconstruction of non-static scenes with one depth camera
- Fusing new frames bit by bit into one world model (like KinectFusion)
Previous Work contd.
Gap to fill?
- Templates make it hard to model features that differ drastically from the template (clothing, a smaller or wider person, etc.)
- Don't assume small motion between frames
- Long runs of template-free systems accrue drift and smooth out high-frequency detail
Fusion4D Overview
- Builds off of KinectFusion (with overlap in authors, too)
- Real-time dense reconstruction
- Temporal coherence
- Resistant to topology changes and large motion between frames
- No templates or a priori assumptions about the scene
- Multi-view setup with many RGB-D cameras
Pipeline
Input and Pre-processing
- RGB images provide texture for the reconstruction
- Depth for the RGB-D frames is computed from the IR cameras with PatchMatch Stereo
- Depth maps are segmented into ROIs to keep the foreground distinct throughout the pipeline
- A dense conditional random field (a machine learning model) provides neighborhood smoothing of the segmentation
Correspondence
- A dense correspondence field initializes the non-rigid alignment, computed with the Global Patch Collider (decision-tree-based machine learning)
- Input: two consecutive image frames
- Output: an energy term that encourages matched pixels and their corresponding key-volume points to line up
Correspondence contd.
Details:
- Decision trees as in HyperDepth, but normalized by a depth term to give scale invariance
- Goal: find correspondences between pixel positions in two consecutive frames at the patch level
- Take the union of the trees, then minimize false positives by voting, with each tree voting for a match (see the sketch after this list)
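A minimal sketch of the voting idea, assuming each trained tree maps an image patch to a leaf id (`trees` is a hypothetical stand-in for the Global Patch Collider forest):

```python
from collections import defaultdict

def propose_matches(pixels_a, pixels_b, trees, min_votes=3):
    """Match pixels across two frames by per-tree leaf collisions.

    pixels_a, pixels_b: iterables of (pixel_id, patch) pairs.
    trees:              callables mapping a patch to a leaf id.
    A pair is kept only if it collides in at least min_votes trees,
    which suppresses false positives.
    """
    votes = defaultdict(int)
    for tree in trees:
        buckets = defaultdict(list)            # leaf id -> frame-B pixels
        for pid_b, patch_b in pixels_b:
            buckets[tree(patch_b)].append(pid_b)
        for pid_a, patch_a in pixels_a:
            for pid_b in buckets[tree(patch_a)]:
                votes[(pid_a, pid_b)] += 1     # one vote per colliding tree
    return [pair for pair, v in votes.items() if v >= min_votes]
```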
Non-rigid Alignment: Embedded Deformation Graph
- The key volume is a TSDF. How do we calculate a deformation field that warps this key volume onto new raw depth maps?
- Input: key volume
- Output: functions that warp the local areas around the ED nodes to the raw depth maps
Non-rigid Alignment: Embedded Deformation Graph contd.
Details:
- Use the Embedded Deformation (ED) model
- Uniformly sample k ED nodes from the key volume's representative mesh
- Warp the key-volume points and normals off the mesh in each ED node's region, using a per-node affine transformation and translation plus a global rotation and translation (see the sketch after this list)
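A minimal sketch of the warp for a single point, following the standard embedded deformation formulation the paper builds on: the point is skinned to nearby ED nodes, then the global rotation and translation are applied:

```python
import numpy as np

def warp_point(v, nodes, affines, translations, weights, R_global, t_global):
    """Warp one point v with an embedded deformation graph.

    nodes:        (K, 3) ED node positions g_k.
    affines:      (K, 3, 3) per-node affine transformations A_k.
    translations: (K, 3) per-node translations t_k.
    weights:      (K,) skinning weights for v (zero for distant nodes).
    """
    warped = np.zeros(3)
    for k in range(len(nodes)):
        if weights[k] == 0.0:
            continue
        # deform v in the local frame of node k, then move with the node
        warped += weights[k] * (affines[k] @ (v - nodes[k]) + nodes[k]
                                + translations[k])
    return R_global @ warped + t_global
```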
Non-rigid Alignment: Alignment Error
- An energy function constrains the allowed deformations of the key volume and best aligns the model to the raw data
- Input: the warp function from key volume to raw depth data, the raw depth data, and the correspondence energy term
- Output: the transformation from key volume to raw depth data that minimizes the energy function
Non-rigid Alignment: Alignment Error contd.
Terms of energy function:
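The equation itself did not survive the transcript. As a hedged reconstruction from the Fusion4D paper, the energy over the deformation parameters G combines a dense data term, a visual-hull term, the correspondence term from the previous step, a rotation (orthogonality) penalty on the per-node affines, and a smoothness term between neighboring ED nodes, with per-term weights λ:

E(G) = λ_data E_data(G) + λ_hull E_hull(G) + λ_corr E_corr(G) + λ_rot E_rot(G) + λ_smooth E_smooth(G)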
Non-rigid Alignment: Alignment Error contd.
Details:
- Minimizing the energy function is a nonlinear least-squares problem
- First fix the ED nodes' affine transformations and translations, and use ICP to approximate the global motion parameters (rotation and translation)
- Accept an update only if E(X + h) < E(X), where X is the full parameter vector and h is a small step
- Use preconditioned conjugate gradient (PCG) to iteratively solve the resulting system of linear equations (see the sketch after this list)
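A minimal sketch of the outer minimization loop under those details, assuming dense residual and Jacobian callbacks (the paper solves the normal equations with a custom GPU PCG solver; scipy's `cg` stands in here):

```python
import numpy as np
from scipy.sparse.linalg import cg

def gauss_newton_pcg(x, residuals, jacobian, iters=10, damping=1e-6):
    """Minimize E(x) = ||r(x)||^2 with Gauss-Newton steps solved by CG.

    residuals: function x -> r(x), shape (M,).
    jacobian:  function x -> J(x), shape (M, N).
    """
    for _ in range(iters):
        r = residuals(x)
        J = jacobian(x)
        JtJ = J.T @ J + damping * np.eye(len(x))  # normal equations (+ damping)
        h, _ = cg(JtJ, -J.T @ r)                  # PCG would precondition here
        # accept the step only if the energy actually decreases: E(x+h) < E(x)
        if np.sum(residuals(x + h) ** 2) < np.sum(r ** 2):
            x = x + h
        else:
            break
    return x
```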
Volumetric Fusion and Blending
- Accumulated data improves the TSDF model; maintain both a key volume and a data volume
- Input: the energy-minimizing transformation of the key volume, plus the data volume
- Output: reconstructed TSDF
Volumetric Fusion and Blending contd.
Details:
- Data volume: the volume at the current frame, made from the fused depth maps
- Key volume: an instance of the reference volume
Volumetric Fusion and Blending contd.
Details:
- Selective fusion: skip voxels affected by collisions or misalignment
- Fusion and blending steps (see the sketch after this list):
  1. Fuse the depth maps to make the data volume
  2. Warp the key volume into the data frame
  3. Blend the data volume and the warped key volume together
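A minimal sketch of the final blending step, assuming per-voxel TSDF values and accumulated weights for both volumes plus a per-voxel validity mask from the selective-fusion tests (all array names hypothetical):

```python
import numpy as np

def blend_volumes(data_tsdf, data_w, key_tsdf_warped, key_w, valid):
    """Blend the warped key volume into the data volume, voxel by voxel.

    valid: boolean mask, False where selective fusion rejected the key
           volume (collision or misalignment); only data survives there.
    """
    key_w_eff = np.where(valid, key_w, 0.0)
    out_w = data_w + key_w_eff
    blended = (data_w * data_tsdf + key_w_eff * key_tsdf_warped) \
              / np.maximum(out_w, 1e-8)
    return np.where(out_w > 0, blended, data_tsdf), out_w
```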
Pipeline Recap
Results
- Not limited to human subjects, since there are no prior assumptions
- Non-rigid registration (the energy-function calculations) takes the most processing time (64%, at about 20 ms)
- Correspondence results are better than those of the SIFT detector, the FAST detector, etc.
- (video: t = 4m11s)
Assumptions and Limitations
- Non-rigid alignment errors
  - Overly smooth model
- Segmentation errors
  - Incorrect visual hull estimate, so noise is fused into the model
- Reliance on temporal coherence
  - The frame rate can't be low
  - Motion between frames can't be too big
Similarities
- Both are real-time dense reconstruction algorithms
- Both use volumetric fusion from Curless and Levoy (1996): accumulated depth data improves the current world model
- Both use a TSDF to represent their world surfaces
- Both use ICP to estimate the camera's global rotation and translation parameters
- Both make it possible to use depth images to obtain a view of the world
Differences

KinectFusion          | Fusion4D
----------------------|--------------------------
Rigid reconstruction  | Non-rigid reconstruction
No keyframes          | Key volumes
Static scenes         | Non-static scenes
Future Work
- Non-rigid matching algorithms designed specifically for topology changes
- 3D streaming of live concerts
- Reconstruction of larger, more complex scenes