1
High-Quality Video View Interpolation
Larry Zitnick Interactive Visual Media Group Microsoft Research
2
3D video: the image-centric to geometry-centric spectrum
Figure from Kang, Szeliski, and Anandan's ICIP paper. We will be developing an imaging model that captures this spectrum and permits easy use of all of these techniques. The important thing to understand is that finding a common platform to accommodate this entire spectrum gains us the flexibility to make use of each technique represented in the spectrum and the efficiency to mix the representations without a performance penalty.
Representations along the spectrum: fixed geometry, view-dependent texture, view-dependent geometry, sprites with depth, layered depth images, Lumigraph, light field. Rendering techniques: polygon rendering + texture mapping, warping, interpolation.
3
Current practice in free-viewpoint video: many cameras vs. motion jitter
4
Current practice in free-viewpoint video: many cameras vs. motion jitter
5
Video view interpolation
Fewer cameras, smooth motion, automatic, real-time rendering
6
System overview (Video Capture stage highlighted)
OFFLINE: Video Capture → Stereo → Representation → Compression → File
ONLINE: Selective Decompression → Render
Our rendering system consists of offline and online components. The video capture and processing are done offline. Video processing consists of stereo computation and data compression. The dynamic scene can then be interactively viewed by selectively decompressing the data file and rendering it.
7
Capture hardware: cameras, concentrators, hard disks, controlling laptop
Our video capture system consists of 8 cameras, each with a resolution of 1024 by 768, capturing at 15 frames per second. Each group of 4 cameras is synchronized using a device called a concentrator, which pipes all the uncompressed video data to a bank of hard disks via a fiber-optic cable. The two concentrators are themselves synchronized and are controlled by a single laptop.
8
Calibration (Zhengyou Zhang, 2000)
9
Input videos
10
System overview (Stereo stage highlighted)
OFFLINE: Video Capture → Stereo → Representation → Compression → File
ONLINE: Selective Decompression → Render
11
Key to view interpolation: Geometry
Figure: stereo geometry relating Image 1 (Camera 1), Image 2 (Camera 2), and the virtual camera.
12
Image correspondence between Image 1 and Image 2
Figure: for a textured region (the leg), the correct match score is clearly distinguished from incorrect ones; for a low-texture region (the wall), good and bad matches give similar match scores.
13
Why segments? Better delineation of boundaries.
14
Why segments? Larger support for matching.
Handle gain and offset differences without a global model (Kim, Kolmogorov, and Zabih, 2003).
15
Why segments? More efficient: 786,432 pixels (1024 × 768) vs. ~1,000 segments.
Compute disparities per segment rather than per pixel.
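As a rough sketch of the efficiency point above (not the paper's exact cost function), the matching cost can be aggregated over all pixels of a segment so that a single disparity is estimated per segment; the SSD cost, horizontal-shift warp, and function names below are illustrative assumptions.

```python
import numpy as np

def segment_disparity_costs(img1, img2, segments, max_disp):
    """Aggregate a simple SSD matching cost over each segment instead of per pixel.

    img1, img2 : HxWx3 float arrays (rectified stereo pair) -- hypothetical inputs
    segments   : HxW int array of segment labels for img1
    max_disp   : number of disparity hypotheses to test
    Returns an array of shape (num_segments, max_disp) of aggregated costs.
    """
    h, w, _ = img1.shape
    num_segments = segments.max() + 1
    costs = np.zeros((num_segments, max_disp))
    for d in range(max_disp):
        # Shift img2 by d pixels to simulate a horizontal disparity hypothesis.
        shifted = np.roll(img2, d, axis=1)
        ssd = ((img1 - shifted) ** 2).sum(axis=2)
        ssd[:, :d] = 0.0  # pixels with no valid correspondence contribute nothing
        # Sum the per-pixel cost within each segment (one number per segment).
        costs[:, d] = np.bincount(segments.ravel(),
                                  weights=ssd.ravel(),
                                  minlength=num_segments)
    return costs

# One disparity per segment: the minimizing hypothesis.
# disparities = segment_disparity_costs(img1, img2, segments, 64).argmin(axis=1)
```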
16
Segmentation Many methods will work:
Graph-based (Felzenszwalb and Huttenlocher, 2004)
Mean shift (Comaniciu et al., 2001)
Min-cut (Boykov et al., 2001)
Others…
17
Segmentation: Important properties
Not too large, not too small… As large as possible while not spanning multiple objects.
18
Segmentation: Important properties
Stable Regions
19
Segmentation: Our Approach
First average (anisotropic smoothing), then segment.
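A minimal sketch of the "first average, then segment" idea using a Perona-Malik style anisotropic diffusion; the specific diffusion function and parameters are assumptions, not the paper's exact smoothing, and the subsequent segmentation can be any of the methods listed earlier.

```python
import numpy as np

def anisotropic_smooth(img, iterations=10, kappa=20.0, step=0.15):
    """Edge-preserving smoothing (Perona-Malik style) applied per color channel.

    img : HxWx3 float array. Smoothing is strong in flat regions and weak
    across strong edges, so segment boundaries stay sharp.
    """
    out = img.astype(np.float64).copy()
    for _ in range(iterations):
        # Finite differences to the four neighbors.
        north = np.roll(out, -1, axis=0) - out
        south = np.roll(out, 1, axis=0) - out
        east = np.roll(out, -1, axis=1) - out
        west = np.roll(out, 1, axis=1) - out
        # Conduction coefficient: small where the gradient is large (an edge).
        def g(d):
            return np.exp(-(d / kappa) ** 2)
        out += step * (g(north) * north + g(south) * south +
                       g(east) * east + g(west) * west)
    return out

# smoothed = anisotropic_smooth(frame)
# segments = any_segmentation_method(smoothed)  # e.g. Felzenszwalb or mean shift
```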
20
Segmentation: result (close-up)
21
Matching segments: many measures will work:
SSD, normalized correlation, mutual information. The choice depends on color balancing and image quality.
22
Matching segments: Important properties
Never remove correct matches. Remove as many false matches as possible. Use global methods to remove the remaining false positives.
23
Matching segments: Our approach
Create a gain histogram of pixel ratios between the two candidate segments. (Figure: gain histograms over the range 0.8 to 1.25 for a good match and for a bad match.)
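A minimal sketch of how such a gain histogram could be formed for a candidate segment match; the bin range (0.8 to 1.25, taken from the figure) and the peakedness test are illustrative assumptions rather than the paper's exact criterion.

```python
import numpy as np

def gain_histogram(pixels1, pixels2, bins=20, lo=0.8, hi=1.25):
    """Histogram of per-pixel intensity ratios (gains) between two candidate
    matching segments. A correct match tends to produce a concentrated
    histogram; a wrong match spreads the mass over many bins.

    pixels1, pixels2 : 1-D arrays of corresponding pixel intensities.
    """
    eps = 1e-6
    gains = pixels1 / (pixels2 + eps)
    hist, _ = np.histogram(np.clip(gains, lo, hi), bins=bins, range=(lo, hi))
    return hist / max(hist.sum(), 1)

def looks_like_good_match(hist, peak_fraction=0.5):
    # Illustrative test: accept if a single bin and its neighbors hold
    # most of the probability mass.
    peak = hist.argmax()
    window = hist[max(peak - 1, 0):peak + 2].sum()
    return window >= peak_fraction
```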
24
Local matching between Image 1 and Image 2; low-texture regions remain ambiguous.
25
Global regularization
Create an MRF (Markov Random Field): each segment is a node (segments A–F in Image 1 and P–U in Image 2 in the figure), and the number of states equals the number of depth levels.
26
Global regularization
The posterior over disparities given the images factors into a likelihood (data term) and a prior (regularization term).
27
Global regularization
Figure: neighboring segments A–F in Image 1 and P–U in Image 2; if color_A ≈ color_B then z_A ≈ z_B.
28
Global regularization
The prior is a normal distribution over a neighbor's disparity; its variance depends on the percentage of shared border and on the similarity of color (figure: distributions over disparity for segments A–F).
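A minimal sketch of segment-level regularization with this kind of neighbor prior, using a simple iterative relaxation instead of the paper's exact MRF solver; the inputs data_cost, neighbors, border_frac, and color_sim are assumed structures.

```python
import numpy as np

def regularize_disparities(data_cost, neighbors, border_frac, color_sim,
                           iterations=20, base_sigma=2.0):
    """Iteratively refine per-segment disparity beliefs on a segment MRF.

    data_cost   : (num_segments, num_disparities) matching costs (lower = better).
    neighbors   : dict segment -> list of adjacent segments.
    border_frac : dict (s, t) -> fraction of shared border (0..1).
    color_sim   : dict (s, t) -> color similarity (0..1).
    The prior pulls a segment's disparity toward its neighbors', more strongly
    when they share a long border and similar color (smaller sigma).
    """
    num_seg, num_disp = data_cost.shape
    disp_values = np.arange(num_disp)
    belief = np.exp(-data_cost)
    belief /= belief.sum(axis=1, keepdims=True)
    for _ in range(iterations):
        new_belief = np.exp(-data_cost)
        for s in range(num_seg):
            for t in neighbors.get(s, []):
                mean_t = (belief[t] * disp_values).sum()
                weight = border_frac[(s, t)] * color_sim[(s, t)]
                sigma = base_sigma / (weight + 1e-3)
                # Normal-distribution prior centered on the neighbor's expected disparity.
                prior = np.exp(-0.5 * ((disp_values - mean_t) / sigma) ** 2)
                new_belief[s] *= prior
        belief = new_belief / new_belief.sum(axis=1, keepdims=True)
    return belief.argmax(axis=1)
```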
29
Multiple disparity maps
Compute a disparity map for each image. We want the disparity maps to be consistent across images…
30
Consistent disparities
Figure: segment A in Image 1 (segments A–F) projects onto segments P, Q, and S in Image 2 (segments P–U), so z_A ≈ z_P, z_Q, z_S.
31
Consistent disparities
A segment's disparity depends on the neighboring images' disparities; the likelihood term includes those neighboring disparities.
32
Consistent disparities
Use the original data term if the segment is not occluded. If it is occluded, bias its disparity to lie behind the known surfaces.
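A minimal sketch of switching between the two data terms; the occlusion test, the flat cost behind the occluder, and the penalty value are illustrative assumptions.

```python
import numpy as np

def data_term(match_cost, occluded, occluder_disp, penalty=10.0):
    """Occlusion-aware data term for one segment.

    match_cost    : (num_disparities,) original matching cost.
    occluded      : bool, whether the segment is occluded in the other view.
    occluder_disp : disparity of the occluding surface (larger disparity = nearer).
    If occluded, keep a flat cost behind the occluder and penalize disparities
    that would place the segment in front of it.
    """
    if not occluded:
        return match_cost
    num_disp = match_cost.shape[0]
    cost = np.full(num_disp, match_cost.mean())
    cost[np.arange(num_disp) > occluder_disp] += penalty
    return cost
```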
33
Is the segment occluded?
Figure: the occluded and not-occluded cases with respect to image I_i.
34
If occluded… (Figure: the occluded segment's disparity relative to image I_i.)
35
Iteratively solve MRF
36
Depth through time
37
Matting
Figure: an interpolated view without matting shows artifacts at foreground/background boundaries. A strip of some width is extracted along the boundary between the foreground and background surfaces, and Bayesian matting (Chuang et al., 2001) estimates foreground color, background color, and alpha within that strip.
38
Rendering with matting
No Matting Matting
39
System overview (Representation stage highlighted)
OFFLINE: Video Capture → Stereo → Representation → Compression → File
ONLINE: Selective Decompression → Render
40
Representation: two layers
Boundary layer (a strip of some width around each foreground/background boundary): color, depth, alpha.
Main layer: color, depth.
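A minimal sketch of what the two-layer per-camera representation could look like as a data structure; the class and field names and array shapes are illustrative, not the actual file format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MainLayer:
    color: np.ndarray  # (H, W, 3) texture for the whole frame
    depth: np.ndarray  # (H, W) per-pixel depth (or disparity)

@dataclass
class BoundaryLayer:
    # Sparse strip around depth discontinuities; only boundary pixels are stored.
    color: np.ndarray     # (N, 3) colors of boundary pixels
    depth: np.ndarray     # (N,) depths of boundary pixels
    alpha: np.ndarray     # (N,) matting alpha of boundary pixels
    position: np.ndarray  # (N, 2) pixel coordinates of the boundary strip

@dataclass
class CameraFrame:
    main: MainLayer
    boundary: BoundaryLayer
```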
41
System overview (Compression stage highlighted)
OFFLINE: Video Capture → Stereo → Representation → Compression → File
ONLINE: Selective Decompression → Render
42
Compression (Figure: cameras 1–4 at time 0 and time 1)
Compression is used to reduce the large data set to a manageable size and to allow fast playback from disk. We developed our own codec to make use of both temporal and between-camera redundancy. Temporal prediction is used to compress the reference camera's data in terms of previously decoded results from an earlier frame time. Spatial prediction makes use of the reference camera's disparity map to transform its texture and disparity data into the viewpoint of a spatially adjacent camera. The differences between predicted and actual images are coded using a novel transform-based compression scheme which can simultaneously handle texture, disparity, and alpha-map data. To obtain real-time interactivity, the overall decoding scheme is highly optimized for speed and makes use of the GPU where possible.
43
Compression: temporal prediction. A reference camera's frame at time 1 is predicted from its previously decoded frame at time 0 (figure: cameras 1–4).
44
Compression: spatial prediction. The reference camera's texture and disparity are warped into the viewpoint of a spatially adjacent camera to predict it (figure: cameras 1–4 at times 0 and 1).
45
Spatial prediction (figure): the reference camera's depth and texture, and the camera to be predicted.
46
Spatial prediction (figure): the reference camera's depth and texture warped toward the predicted camera's viewpoint.
47
Spatial prediction (figure): the error signal is the difference between the predicted camera's actual image and the warped depth and texture of the reference camera.
48
Spatial prediction (figure): the predicted camera's view is reconstructed by adding the error signal back to the warped depth and texture of the reference camera.
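A minimal sketch of the spatial prediction and error-signal idea: forward-warp the reference texture with its disparity, take the difference with the predicted camera's actual image, and reconstruct by adding the error back. The purely horizontal warp and the lack of transform coding are simplifying assumptions.

```python
import numpy as np

def warp_with_disparity(ref_color, ref_disp):
    """Forward-warp the reference texture into the adjacent camera's view,
    shifting each pixel horizontally by its disparity (rectified-camera assumption)."""
    h, w, _ = ref_color.shape
    warped = np.zeros_like(ref_color)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xd = x + int(round(ref_disp[y, x]))
            if 0 <= xd < w:
                warped[y, xd] = ref_color[y, x]
                filled[y, xd] = True
    return warped, filled

def encode_decode(ref_color, ref_disp, target_color):
    predicted, filled = warp_with_disparity(ref_color, ref_disp)
    error = target_color - predicted   # error signal (would be transform coded)
    reconstructed = predicted + error  # decoder adds the error signal back
    return reconstructed, error, filled
```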
50
Boundary layer coding: depth, color texture, and alpha matte.
We use our own shape coding method, similar to MPEG-4.
51
System overview (Selective Decompression and Render highlighted)
OFFLINE: Video Capture → Stereo → Representation → Compression → File
ONLINE: Selective Decompression → Render
52
Rendering: source cameras and the virtual camera
Next we describe the process of using the GPU to render a novel viewpoint from the compressed data. Given a novel viewpoint, the rendering program determines the two nearest cameras. The data from these two cameras is blended to create the new viewpoint. A block diagram of the rendering process is shown next.
53
Rendering block diagram: for each of the two nearest source cameras, project the main layer and the boundary layer, then composite the projections.
54
Rendering the main layer (Step 1)
At every frame time the background depth map is used to create a dense mesh, which is texture-mapped with the background color map. The mesh (position and texture coordinates) is projected into the virtual camera's view by a vertex shader, and a pixel shader writes the projected color buffer and Z-buffer on the GPU, rejecting any pixels that lie on large depth gradients. (Figure: videos of background depth and background color feed the vertex and pixel shaders.)
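A minimal CPU-side sketch of step 1: back-project the depth map into a dense mesh of 3D points and project it into the virtual camera. The pinhole projection math is standard, but the matrices, depth convention, and function names are assumptions; the actual system performs this with vertex and pixel shaders on the GPU.

```python
import numpy as np

def depth_to_points(depth, K):
    """Back-project a depth map into 3D camera-space points using intrinsics K."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x (h*w)
    rays = np.linalg.inv(K) @ pix
    return (rays * depth.reshape(-1)).T  # (h*w, 3) points

def project_points(points, K_virtual, R, t):
    """Project 3D points into the virtual camera (rotation R, translation t)."""
    cam = R @ points.T + t.reshape(3, 1)
    pix = K_virtual @ cam
    return (pix[:2] / pix[2]).T, cam[2]  # 2-D pixel coords and depths for the Z-buffer

def grid_triangles(h, w):
    """Two triangles per pixel quad: the dense mesh connecting neighboring depth samples."""
    idx = np.arange(h * w).reshape(h, w)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    return np.concatenate([np.stack([a, b, c], 1), np.stack([b, d, c], 1)])
```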
57
Rendering the main layer (Step 2)
Locate depth discontinuities in the depth map: main-layer triangles connecting background to foreground (1 pixel wide) must be removed. To avoid modifying the mesh on a frame-by-frame basis, the CPU generates an "erase mesh" along the boundary, and a pixel shader sets the Z-buffer to far away and the colors to transparent there, updating the projected color buffer on the GPU.
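A minimal sketch of locating the depth discontinuities that the erase mesh covers; the gradient threshold is an illustrative assumption.

```python
import numpy as np

def depth_discontinuities(depth, threshold=0.05):
    """Mark pixels whose depth differs from a horizontal or vertical neighbor
    by more than a threshold; these are the 1-pixel-wide boundary strips where
    main-layer triangles would wrongly connect foreground to background."""
    dx = np.abs(np.diff(depth, axis=1))
    dy = np.abs(np.diff(depth, axis=0))
    mask = np.zeros_like(depth, dtype=bool)
    mask[:, :-1] |= dx > threshold
    mask[:, 1:] |= dx > threshold
    mask[:-1, :] |= dy > threshold
    mask[1:, :] |= dy > threshold
    return mask  # vertices here go into the "erase mesh"
```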
58
Rendering boundary layer
The boundary depth and RGBA data are used to generate a boundary mesh with vertex colors (CPU); the mesh is rendered and composited with the projected main layer into the projected color buffer, using the GPU Z-buffer.
59
Graphics for Vision Use the GPU for vision.
Real-time stereo (Yang and Pollefeys, CVPR 2003)
60
Rendering block diagram (revisited): project the boundary layer and the main layer from each of the two nearest cameras, then composite.
61
Compositing views
The projected views from Camera 1 and Camera 2 are blended in a pixel shader, with weights based on each camera's proximity to the virtual viewpoint, followed by a normalization pass (also a pixel shader on the GPU) to produce the final result.
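A minimal sketch of proximity-weighted blending of the two projected views followed by normalization; the scalar distance weights and hole handling are illustrative assumptions.

```python
import numpy as np

def blend_views(color1, color2, valid1, valid2, dist1, dist2):
    """Blend two projected camera views into the virtual view.

    color1, color2 : (H, W, 3) projected colors from the two nearest cameras.
    valid1, valid2 : (H, W) masks of pixels each camera actually covers.
    dist1, dist2   : scalar distances from the virtual viewpoint to each camera.
    The closer camera gets a higher weight; pixels covered by only one camera use it alone.
    """
    w1 = dist2 / (dist1 + dist2)  # proximity weight for camera 1
    w2 = dist1 / (dist1 + dist2)
    weight1 = w1 * valid1.astype(float)
    weight2 = w2 * valid2.astype(float)
    total = weight1 + weight2
    total[total == 0] = 1.0  # avoid division by zero in holes (normalization pass)
    out = (weight1[..., None] * color1 + weight2[..., None] * color2) / total[..., None]
    return out
```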
62
DEMO: running in real time on a xxx machine. Pause, interpolate, with/without playback. Decompressed and rendered in real time. 640x480 x N frames = 300 MB.
63
“Massive Arabesque” video clip
64
Future work: mesh simplification; more complicated scenes; temporal interpolation (using optical flow); wider range of virtual motion; 2D grid of cameras.
65
Summary: sparse camera configuration; high-quality depth recovery; automatic matting; new two-layer representation; inter-camera compression; real-time rendering.