Real-Time Dense 3D Reconstructions: KinectFusion (2011) and Fusion4D (2016)
Eleanor Tursman
Previous Work
- Multiview stereo: not real-time
- Monocular SLAM: RGB camera, real-time, but sparse
- Pose estimation using iterative closest point (ICP)
- Active sensors using a signed distance function (SDF) for dense reconstruction instead of meshes
- Feature-based monocular SLAM system: ORB-SLAM
Previous Work contd.
- Dense, real-time reconstruction using SDF fusion

Gap to fill?
- Dense, real-time reconstruction using SDF fusion
- Global reconstruction instead of frame-to-frame matching
- Independence from scene lighting, using structured light
KinectFusion Overview
- Real-time dense reconstruction of static scenes
- Works with sensors that generate depth maps quickly (30 Hz)
- Implicit surface construction. An implicit surface is the set of all points satisfying some f(x, y, z) = 0; for example, a sphere of radius r is the zero set of x² + y² + z² − r².
- Simultaneous Localization and Mapping (SLAM)
Pipeline (image from Microsoft's KinectFusion page)
Depth Map Conversion
- Process the raw depth data so that the camera's rigid transform can be calculated
- Input: raw depth map from the Kinect at some time k
- Output: vertex map and normal map in the camera coordinate system
Depth Map Conversion contd.
Details:
- Apply a bilateral filter to smooth noise while preserving sharp edges
- Calculate the vertex map and normal map using the intrinsic camera parameters (see the sketch after this list)
- Build a 3-level pyramid representation of both maps
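A minimal sketch of the back-projection step, assuming a pinhole model with intrinsics fx, fy, cx, cy (hypothetical parameter names; the paper works with the full intrinsic matrix K) and approximating normals with forward differences:

```python
import numpy as np

def depth_to_vertex_map(depth, fx, fy, cx, cy):
    """Back-project a depth map (meters) into a per-pixel 3D vertex map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.dstack((x, y, depth))  # (h, w, 3) in camera coordinates

def vertex_to_normal_map(vertices):
    """Approximate normals from cross products of neighboring vertices.

    Border pixels wrap around here; a real implementation masks them out.
    """
    dx = np.roll(vertices, -1, axis=1) - vertices  # right neighbor minus self
    dy = np.roll(vertices, -1, axis=0) - vertices  # lower neighbor minus self
    n = np.cross(dx, dy)
    return n / np.maximum(np.linalg.norm(n, axis=2, keepdims=True), 1e-8)
```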
Volumetric Integration
- Fuse the current raw depth map into the global scene's truncated signed distance function (TSDF), using the camera pose estimate
- Input: raw depth map, camera pose T_{g,k} (estimated relative to the previous frame's transform T_{g,k-1})
- Output: updated global TSDF
Volumetric Integration contd.
Details:
- TSDF: a discrete version of the SDF, further truncated to points within ±μ of the surface
- Parallelizable
- Nearest-neighbor lookup instead of interpolation of depth values, to avoid smearing
Volumetric Integration contd.
Details:
- Converge toward SDF fusion by taking a weighted average of the TSDFs from every depth map (see the sketch after this list)
- Use the raw depth map, not the bilaterally filtered one
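A minimal sketch of the per-voxel weighted running average, assuming a flat voxel array and a constant per-frame weight (the paper's weights can depend on, e.g., the viewing angle; `depth_lookup` is a hypothetical nearest-neighbor lookup into the raw depth map):

```python
import numpy as np

def integrate_frame(tsdf, weights, voxel_pts_world, depth_lookup, world_to_cam,
                    mu=0.03, max_weight=64.0):
    """Fuse one raw depth frame into the global TSDF (weighted running average).

    tsdf, weights:    flat arrays, one entry per voxel.
    voxel_pts_world:  (N, 3) voxel centers in world coordinates.
    depth_lookup:     maps camera-space points to the measured depth of the
                      nearest pixel (nearest neighbor, no interpolation).
    world_to_cam:     4x4 rigid transform, the inverse of the camera pose.
    """
    pts = (world_to_cam[:3, :3] @ voxel_pts_world.T).T + world_to_cam[:3, 3]
    measured = depth_lookup(pts)
    # signed distance approximated along the optical axis, then truncated
    sdf = measured - pts[:, 2]
    valid = sdf >= -mu                 # ignore voxels far behind the surface
    f = np.clip(sdf / mu, -1.0, 1.0)
    w_new = 1.0                        # constant weight per frame
    tsdf[valid] = (weights[valid] * tsdf[valid] + w_new * f[valid]) \
                  / (weights[valid] + w_new)
    weights[valid] = np.minimum(weights[valid] + w_new, max_weight)
    return tsdf, weights
```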
Ray-casting
- Render a surface prediction from the zero level set of the global TSDF, using the previous camera pose estimate
- Input: global TSDF, rigid transform of the previous frame T_{g,k-1}
- Output: predicted vertex and normal maps
Ray-casting contd.
Details:
- Raycast the TSDF (the current world volume) to render the 3D volume
- Use ray skipping to accelerate the march through empty space (see the sketch after this list)
- Use a simplified version of cubic interpolation to predict the vertex and normal maps
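A minimal per-ray sketch, assuming `tsdf_at` returns the interpolated TSDF value at a world point and that `origin`/`direction` are numpy vectors; the coarse/fine step sizes are made-up values standing in for the paper's skipping scheme:

```python
def raycast(tsdf_at, origin, direction, t_max=4.0, coarse=0.04, fine=0.002):
    """March one ray through the TSDF and return the zero-crossing point.

    Takes large steps while far from any surface (ray skipping), fine steps
    near it, and linearly interpolates the exact zero crossing.
    """
    t, prev = 0.0, tsdf_at(origin)
    while t < t_max:
        step = coarse if abs(prev) > 0.5 else fine
        t += step
        cur = tsdf_at(origin + t * direction)
        if prev > 0.0 and cur <= 0.0:            # sign change: surface crossed
            t_hit = t - step * cur / (cur - prev)
            return origin + t_hit * direction
        prev = cur
    return None  # ray exited the volume without hitting a surface
```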
Camera Tracking
- Calculate the world camera pose using ICP
- Input: predicted vertex and normal maps, rigid transform of the previous frame T_{g,k-1}
- Output: rigid transform matrix T_{g,k}
Details:
- Use all of the data for pose estimation
- Assume small motion between frames
- Parallelizable processing
Camera Tracking contd.
Details:
- Align the current surface measurement maps with the predicted surface maps
- Do this with 3D iterative closest point (ICP): take the closest points as initial correspondences, then iteratively improve the result until convergence (a point-to-plane sketch follows this list)
- Outliers from ICP are tracked
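A minimal sketch of one linearized point-to-plane ICP step, assuming correspondences have already been found (KinectFusion finds them with projective data association on the GPU) and small inter-frame rotation:

```python
import numpy as np

def icp_point_to_plane_step(src, dst, dst_normals):
    """Solve one small-angle point-to-plane step.

    Minimizes sum_i (((R @ src_i + t) - dst_i) . n_i)^2 with R linearized
    as I + skew(r); returns the incremental rotation R and translation t.
    """
    A = np.hstack((np.cross(src, dst_normals), dst_normals))   # (N, 6)
    b = np.einsum('ij,ij->i', dst_normals, dst - src)          # (N,)
    rx, ry, rz, tx, ty, tz = np.linalg.lstsq(A, b, rcond=None)[0]
    R = np.array([[1.0, -rz,  ry],
                  [ rz, 1.0, -rx],
                  [-ry,  rx, 1.0]])   # small-angle approximation of R
    return R, np.array([tx, ty, tz])
```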
Pipeline Recap (image from Microsoft's KinectFusion page)
Results
- Tracking and mapping run in constant time
- Qualitative results show it is superior to frame-to-frame matching
- No expensive explicit global optimization is needed
- (video: t = 1m30s)
Demo! Can we break it? LET’S FIND OUT.
Assumptions and Limitations
- The scene is static (no sustained motion)
- The scene is fairly contained
- Initialization determines accuracy
- If the currently observed scene doesn't constrain all degrees of freedom of the camera, solving for the camera pose can yield many arbitrary solutions
- Currently, if tracking fails, the user is asked to realign the camera with the last known good position
Previous Work
- Offline multi-camera reconstructions: not real-time
- Real-time parametric methods
- Real-time reconstruction of non-static scenes with one depth camera
- Fusing new frames bit by bit into one world model (like KinectFusion)
Previous Work contd.
Gap to fill?
- Templates make it hard to model features that differ drastically from the template (clothing, a smaller or wider person, etc.)
- Don't assume small motion between frames
- Long runs of template-free systems accrue drift and smooth out high-frequency detail
Fusion4D Overview
- Builds off of KinectFusion (with overlap in authors, too)
- Real-time dense reconstruction
- Temporal coherence
- Resistant to topology changes and large motion between frames
- No templates or a priori assumptions about the scene
- Multi-view setup with many RGB-D cameras
Pipeline
Input and Pre-processing
- RGB images provide texture for the reconstruction
- Depth for the RGB-D frames is computed from the IR cameras with PatchMatch Stereo
- Depth maps are segmented into ROIs to keep the foreground distinct throughout the pipeline
- A dense conditional random field (a machine learning model) provides neighborhood smoothing of the segmentation
Correspondence
- A dense correspondence field initializes the non-rigid alignment, computed with the Global Patch Collider (decision-tree-based machine learning)
- Input: two consecutive image frames
- Output: an energy term that encourages matched pixels and their corresponding key-volume points to line up
Correspondence contd.
Details:
- Decision trees as in HyperDepth, but normalized by a depth term to give scale invariance
- Goal: find correspondences between pixel positions in two consecutive frames at the patch level
- Take the union of the trees, then minimize false positives by voting, with each tree voting for a match (see the sketch after this list)
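A minimal sketch of the voting idea, assuming each trained tree maps an image patch to a leaf id (`trees` is a hypothetical stand-in for the Global Patch Collider forest):

```python
from collections import defaultdict

def propose_matches(pixels_a, pixels_b, trees, min_votes=3):
    """Match pixels across two frames by per-tree leaf collisions.

    pixels_a, pixels_b: iterables of (pixel_id, patch) pairs.
    trees:              callables mapping a patch to a leaf id.
    A pair is kept only if it collides in at least min_votes trees,
    which suppresses false positives.
    """
    votes = defaultdict(int)
    for tree in trees:
        buckets = defaultdict(list)            # leaf id -> frame-B pixels
        for pid_b, patch_b in pixels_b:
            buckets[tree(patch_b)].append(pid_b)
        for pid_a, patch_a in pixels_a:
            for pid_b in buckets[tree(patch_a)]:
                votes[(pid_a, pid_b)] += 1     # one vote per colliding tree
    return [pair for pair, v in votes.items() if v >= min_votes]
```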
Non-rigid Alignment: Embedded Deformation Graph
- The key volume is a TSDF. How do we calculate a deformation field that warps this key volume onto new raw depth maps?
- Input: key volume
- Output: functions that warp the local areas around the ED nodes to the raw depth maps
Non-rigid Alignment: Embedded Deformation Graph contd.
Details:
- Use the Embedded Deformation (ED) model
- Uniformly sample k ED nodes from the key volume's representative mesh
- Warp the key-volume points and normals off the mesh in each ED node's region, using a per-node affine transformation and translation plus a global rotation and translation (see the sketch after this list)
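A minimal sketch of the warp for a single point, following the standard embedded deformation formulation the paper builds on: the point is skinned to nearby ED nodes, then the global rotation and translation are applied:

```python
import numpy as np

def warp_point(v, nodes, affines, translations, weights, R_global, t_global):
    """Warp one point v with an embedded deformation graph.

    nodes:        (K, 3) ED node positions g_k.
    affines:      (K, 3, 3) per-node affine transformations A_k.
    translations: (K, 3) per-node translations t_k.
    weights:      (K,) skinning weights for v (zero for distant nodes).
    """
    warped = np.zeros(3)
    for k in range(len(nodes)):
        if weights[k] == 0.0:
            continue
        # deform v in the local frame of node k, then move with the node
        warped += weights[k] * (affines[k] @ (v - nodes[k]) + nodes[k]
                                + translations[k])
    return R_global @ warped + t_global
```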
Non-rigid Alignment: Alignment Error
- An energy function constrains the allowed deformations of the key volume and best aligns the model to the raw data
- Input: the warp function from key volume to raw depth data, the raw depth data, and the correspondence energy term
- Output: the transformation from key volume to raw depth data that minimizes the energy function
Non-rigid Alignment: Alignment Error contd.
Terms of energy function:
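The equation itself did not survive the transcript. As a hedged reconstruction from the Fusion4D paper, the energy over the deformation parameters G combines a dense data term, a visual-hull term, the correspondence term from the previous step, a rotation (orthogonality) penalty on the per-node affines, and a smoothness term between neighboring ED nodes, with per-term weights λ:

E(G) = λ_data E_data(G) + λ_hull E_hull(G) + λ_corr E_corr(G) + λ_rot E_rot(G) + λ_smooth E_smooth(G)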
Non-rigid Alignment: Alignment Error contd.
Details:
- Minimizing the energy function is a nonlinear least-squares problem
- First fix the ED nodes' affine transformations and translations, and use ICP to approximate the global motion parameters (rotation and translation)
- Accept an update only if E(X + h) < E(X), where X is the full parameter vector and h is a small step
- Use preconditioned conjugate gradient (PCG) to iteratively solve the resulting system of linear equations (see the sketch after this list)
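A minimal sketch of the outer minimization loop under those details, assuming dense residual and Jacobian callbacks (the paper solves the normal equations with a custom GPU PCG solver; scipy's `cg` stands in here):

```python
import numpy as np
from scipy.sparse.linalg import cg

def gauss_newton_pcg(x, residuals, jacobian, iters=10, damping=1e-6):
    """Minimize E(x) = ||r(x)||^2 with Gauss-Newton steps solved by CG.

    residuals: function x -> r(x), shape (M,).
    jacobian:  function x -> J(x), shape (M, N).
    """
    for _ in range(iters):
        r = residuals(x)
        J = jacobian(x)
        JtJ = J.T @ J + damping * np.eye(len(x))  # normal equations (+ damping)
        h, _ = cg(JtJ, -J.T @ r)                  # PCG would precondition here
        # accept the step only if the energy actually decreases: E(x+h) < E(x)
        if np.sum(residuals(x + h) ** 2) < np.sum(r ** 2):
            x = x + h
        else:
            break
    return x
```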
Volumetric Fusion and Blending
- Accumulated data improves the TSDF model; maintain both a key volume and a data volume
- Input: the energy-minimizing transformation of the key volume, plus the data volume
- Output: reconstructed TSDF
Volumetric Fusion and Blending contd.
Details:
- Data volume: the volume at the current frame, made from the fused depth maps
- Key volume: an instance of the reference volume
Volumetric Fusion and Blending contd.
Details:
- Selective fusion: skip voxels affected by collisions or misalignment
- Fusion and blending steps (see the sketch after this list):
  1. Fuse the depth maps to make the data volume
  2. Warp the key volume into the data frame
  3. Blend the data volume and the warped key volume together
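A minimal sketch of the final blending step, assuming per-voxel TSDF values and accumulated weights for both volumes plus a per-voxel validity mask from the selective-fusion tests (all array names hypothetical):

```python
import numpy as np

def blend_volumes(data_tsdf, data_w, key_tsdf_warped, key_w, valid):
    """Blend the warped key volume into the data volume, voxel by voxel.

    valid: boolean mask, False where selective fusion rejected the key
           volume (collision or misalignment); only data survives there.
    """
    key_w_eff = np.where(valid, key_w, 0.0)
    out_w = data_w + key_w_eff
    blended = (data_w * data_tsdf + key_w_eff * key_tsdf_warped) \
              / np.maximum(out_w, 1e-8)
    return np.where(out_w > 0, blended, data_tsdf), out_w
```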
Pipeline Recap
Results
- Not limited to human subjects, since there are no prior assumptions
- Non-rigid registration (the energy-function calculations) takes the most processing time (64%, at about 20 ms)
- Correspondence results are better than those of the SIFT detector, the FAST detector, etc.
- (video: t = 4m11s)
Assumptions and Limitations
- Non-rigid alignment errors
  - Overly smooth model
- Segmentation errors
  - Incorrect visual hull estimate, so noise is fused into the model
- Reliance on temporal coherence
  - The frame rate can't be low
  - Motion between frames can't be too big
Similarities
- Both are real-time dense reconstruction algorithms
- Both use volumetric fusion from Curless and Levoy (1996): accumulated depth data improves the current world model
- Both use a TSDF to represent their world surfaces
- Both use ICP to estimate the camera's global rotation and translation parameters
- Both make it possible to use depth images to obtain a view of the world
Differences

KinectFusion          | Fusion4D
----------------------|--------------------------
Rigid reconstruction  | Non-rigid reconstruction
No keyframes          | Key volumes
Static scenes         | Non-static scenes
Future Work
- Non-rigid matching algorithms designed specifically for topology changes
- 3D streaming of live concerts
- Reconstruction of larger, more complex scenes