Real-time Foreground Extraction with RGBD Camera for 3D Telepresence


1 Real-time Foreground Extraction with RGBD Camera for 3D Telepresence
Presenter: ZHAO Mengyao, PhD, SCE Supervisor: Asst/P FU Chi-Wing, Philip Co-Supervisor: A/P CAI Jianfei Hello, everyone. My name is ZHAO Mengyao, and I am a PhD student at SCE. Today I am going to talk about Real-time Foreground Extraction with RGBD Camera for 3D Telepresence.

2 Outline Motivation Related Work Challenges Our Approach Results Limitation Future Work
This is today's outline. First, I will explain the motivation, the related work, and the challenges we face; then I will present the main idea and some technical details of our approach. Finally, I will show some results, along with the limitations and future work.

3 3D Telepresence So first, what is 3D telepresence exactly?
It is a state-of-the-art application in which remote collaborators can communicate with each other as if they were co-located, as illustrated by this figure.

4 3D Telepresence BeingThere Centre's RBT Project IBM's Holographic 3D cell phone These images show some 3D telepresence applications that we have now or can expect in the near future. The man on the left is not really standing on the stage; this is actually a 3D projection of him. You can still see the blue screen around his head.

5 Foreground Extraction
All these 3D telepresence applications require foreground extraction in real time. That is why we want to accelerate progress in this area.

6 Related Work Chroma Key [1, 2]
The most heavily used method of foreground extraction is chroma keying, especially in newscasting and film production. It is also known as green screen or blue screen, because it requires a uniformly colored background, mostly green or blue, since these colors differ most distinctly from human skin tones. The foreground is captured against the green screen, and then all the green pixels are replaced with another background. However, this method requires a complicated setup, e.g., lighting control, avoidance of shadows, and performers not wearing colors similar to the background. So this is not the best choice.
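To make the idea concrete, here is a minimal chroma-key sketch in Python with OpenCV. The HSV bounds for "green" are illustrative assumptions; a real production would tune them per lighting setup.

```python
import cv2
import numpy as np

def chroma_key(frame, new_bg, lower=(35, 80, 80), upper=(85, 255, 255)):
    """Replace green-screen pixels with a new background.

    `lower`/`upper` are illustrative HSV bounds for 'green', not a
    standard; real setups tune them for their lighting.
    """
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Binary mask: 255 where the pixel falls inside the green range.
    green = cv2.inRange(hsv, np.array(lower), np.array(upper))
    # Keep the original frame where the mask is 0, the new background elsewhere.
    fg = cv2.bitwise_and(frame, frame, mask=cv2.bitwise_not(green))
    bg = cv2.bitwise_and(new_bg, new_bg, mask=green)
    return cv2.add(fg, bg)
```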

7 Related Work Interactive Approach [3-8]
Interactive approaches have long been proven effective, for example, intelligent scissors, graph cut, and GrabCut. They allow users to draw manual markup on the input image, using color to distinguish foreground from background, in order to initialize the segmentation or refine the results. These are some example images with markup and the corresponding output. Some of the results are quite good, but of course we cannot expect users to draw markup for every frame in a telepresence system, so interactive methods are not appropriate for real-time use.

8 Related Work Microsoft Kinect [9] PrimeSense Carmine 3D Sensor [10]
Since the emergence of consumer-grade RGBD cameras, for example the Microsoft Kinect and the PrimeSense Carmine 3D Sensor, depth data have become available and convenient for real-time use.

9 Related Work Real-time Foreground Extraction with RGBD camera: FreeCam [11]
Several real-time foreground extraction techniques with RGBD cameras have therefore been proposed. FreeCam is one such technique: a hybrid camera system combining multiple high-quality cameras and a Kinect to support teleconferencing/telepresence. The images here show the 3D segmentation results with and without texture. Their work is impressive, but we found that most such real-time foreground extraction systems do not tackle a very important challenge: temporal coherency. Incoherency around the foreground boundary becomes a serious flickering artifact, and human eyes are extremely sensitive to this problem. It is detrimental to the user experience in 3D telepresence.

10 Challenges Convenience: arbitrary background. High quality: natural and smooth boundary. Automation: no manual markup. Real-time: support teleconference/telepresence. Temporal coherency: free of flickering artifacts.
To summarize, a real-time foreground extraction system that supports telepresence faces the following challenges. First, convenience: we want to avoid the use of a uniformly colored background. Then, high-quality results, which are of course very important. Next, automation and real-time performance. Finally, temporal coherency, which is the biggest problem we want to solve in our work.

11 Challenges Inaccurate depth map Red: Color Green: Depth
Another problem is the inaccuracy of the depth map. You might wonder: since we already have the depth map, why not just threshold the depth to get a direct segmentation? The truth is that this cannot give a good result. This is a sample frame from the Kinect, with depth overlaid on color. In the zoom-in view, I denote the color boundary and the depth boundary in red and green respectively. There is obviously a big gap between them, particularly when the foreground is moving fast. This has three causes: the depth map itself is noisy, depth and color are not well aligned, and the depth and color streams are not synchronized. Therefore, the inaccuracy of the depth map is another problem we need to consider. Noisy depth map. Depth/Color not well aligned. Depth/Color not synchronized. Depth on Color.
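For illustration, here is a minimal sketch of the naive depth thresholding that the slide argues against, assuming Kinect-style depth in millimeters with 0 marking unmeasured pixels; the range values are arbitrary.

```python
import numpy as np

def naive_depth_segmentation(depth_mm, near=500, far=1500):
    """Naive foreground mask by depth thresholding.

    Assumes Kinect-style depth in millimeters, where 0 means no
    measured depth. This is the baseline the slide argues against:
    the resulting boundary is noisy, misaligned with the color image,
    and flickers over time.
    """
    return (depth_mm > near) & (depth_mm < far)
```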

12 Our Approach We aim to: perform high-quality, coherent foreground extraction in real time that can support teleconferencing and telepresence. We propose: an integrated pipeline for robust foreground extraction with an RGBD camera; a temporally coherent matting approach; and a CUDA-based GPU implementation of our approach that achieves real-time performance.
So in one sentence: in this work, we aim to achieve high-quality, coherent foreground extraction in real time that can support teleconferencing and telepresence. We therefore propose an integrated pipeline for robust foreground extraction with an RGBD camera, a temporally coherent matting approach, and a CUDA-based GPU implementation of our approach that achieves real-time performance.

13 Our Approach – Matting $I_i = \alpha_i F_i + (1 - \alpha_i) B_i$ (1), where $i$ is the pixel index, $I$ is the intensity, $\alpha$ is the alpha (foreground opacity), $F$ is the foreground color, and $B$ is the background color. Since our approach is based on standard closed-form matting, I want to talk about matting a little. Matting is an accurate technique for extracting the foreground, similar to segmentation. But unlike segmentation, besides the input image, matting first lets you use white, black, and grey to denote the foreground, background, and unknown regions; this is called a trimap. The output of matting is then not a binary mask but a greyscale mask, so fine structures and foreground parts that share similar colors with the background can still be accurately extracted. This is the advantage of matting over segmentation. On the right is the formulation of matting: for each pixel, the intensity $I_i$ is assumed to be a linear combination of the corresponding foreground and background colors, where $\alpha_i$ is the foreground opacity. Input Trimap Alpha
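As a concrete illustration of equation (1), here is a minimal sketch that composites a foreground over a new background given an alpha matte:

```python
import numpy as np

def composite(fg, new_bg, alpha):
    """Apply I = alpha * F + (1 - alpha) * B per pixel.

    fg, new_bg: HxWx3 float arrays in [0, 1]; alpha: HxW matte in [0, 1].
    """
    a = alpha[..., None]          # add a channel axis to broadcast over RGB
    return a * fg + (1.0 - a) * new_bg
```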

14 Our Approach – Workflow
Now we come to our approach. This is the general workflow; you can see the input, output, and intermediate results in the image. First, we perform offline background modeling for a few seconds so that we have a stable background model. Then, for each incoming frame, there are three main stages: in preprocessing, we use a depth-map shadow detection technique to refine the depth map; in the trimap generation stage, we adaptively combine the two binary masks from background subtraction and generate the trimap; and in non-local temporal matting, we extend closed-form matting to the temporal domain to obtain an alpha map that is more temporally coherent (a sketch of this loop follows below).
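A skeleton of the per-frame loop, for orientation only. The three stage functions are passed in rather than defined here, since each corresponds to a later slide; nothing in this sketch is the paper's actual API.

```python
def run_pipeline(frames, bg_model, preprocess, make_trimap, matting):
    """Skeleton of the per-frame loop from the workflow figure.

    `frames` yields aligned (color, depth) pairs; `bg_model` is the
    offline background model; the three stage functions stand in for
    the stages described on the following slides.
    """
    for color, depth in frames:
        depth = preprocess(depth, bg_model)          # shadow-aware hole filling
        trimap = make_trimap(color, depth, bg_model)  # adaptive mask fusion
        yield matting(color, trimap)                 # temporally coherent alpha
```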

15 Our Approach – Pipeline
This is a detailed illustration of our proposed pipeline, including the four main stages and technical details. Since time is limited, I will talk about three of them.

16 Our Approach – Pipeline
First, the temporal hole-filling with depth map shadow detection.

17 Our Approach – Temporal Hole Filling with Depth Map Shadow Detection
NMD: no-measured depth [12]. Black: NMD regions. Green: out-of-range regions. Yellow: mirror-like regions. Red: shadow regions. According to previous work, the NMD (no-measured-depth) pixels of the depth map arise for different reasons. The left image shows the raw depth map with many NMD regions, and the right image denotes the types of NMD regions using three colors. This is another example image. Since a shadow region is really the projection of the foreground onto the background, it can be detected, and it is useful for identifying the boundary. Conventional hole-filling fills all NMD regions in the same way, which can lose this cue for the boundary. Raw depth map. Detected shadow. Types of NMD regions.

18 Our Approach – Temporal Hole Filling with Depth Map Shadow Detection
For shadow regions, apply the update below; for other NMD regions, apply a joint-bilateral filter. Instead of applying a universal hole-filling strategy to all NMD regions, we propose to treat shadow regions and other NMD regions differently. For a shadow region, since it is the projection of the foreground onto the background, its true depth value should be the background value, which can be computed in a temporal fashion. For other NMD regions, we simply apply a joint-bilateral filter. This way we obtain a more accurate depth map.
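A minimal sketch of this two-branch strategy, assuming a temporally accumulated background depth model and using OpenCV's contrib joint bilateral filter as a stand-in for the paper's filter; a production version would weight only valid depth samples and iterate until the holes close.

```python
import cv2
import numpy as np

def fill_nmd(depth, color, shadow_mask, bg_depth):
    """Two-branch hole filling for no-measured-depth (NMD) pixels.

    `bg_depth` is assumed to be a temporally accumulated background
    depth model; `shadow_mask` marks detected depth-shadow pixels.
    """
    filled = depth.astype(np.float32).copy()
    # Shadow regions are the foreground's projection onto the background,
    # so their true depth is the (temporally estimated) background depth.
    filled[shadow_mask] = bg_depth[shadow_mask]
    # Other NMD pixels: a single joint-bilateral pass guided by the color
    # image (requires the opencv-contrib `ximgproc` module). Simplified:
    # a real implementation would exclude invalid samples from the filter.
    guide = color.astype(np.float32)
    smoothed = cv2.ximgproc.jointBilateralFilter(
        guide, filled, d=9, sigmaColor=25.0, sigmaSpace=9.0)
    nmd = (depth == 0) & ~shadow_mask
    filled[nmd] = smoothed[nmd]
    return filled
```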

19 Our Approach – Pipeline
Next I will talk about our adaptive mask generation.

20 Our Approach – Adaptive Binary Mask Generation
Color mask. Final mask. Depth mask. After background subtraction, we have two binary masks, one from color and one from depth. As you can see in the images, they have different pros and cons. The color mask usually has a smooth boundary but produces large errors when the illumination changes significantly. The depth mask usually captures the correct shape of the foreground but is sometimes not so smooth around the boundary because of noisy data. Simply merging color and depth as four channels also does not give good results. So we propose an adaptive way of generating the new mask according to the differences between color and depth: the closer a pixel is to the depth boundary, the more reliable the color mask is; the farther a pixel is from the depth boundary, the more reliable the depth mask is (see the sketch below). This gives a better final mask.
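A minimal sketch of this adaptive weighting, assuming an exponential falloff with distance to the depth boundary; the falloff shape and `sigma` are illustrative, not the paper's exact scheme.

```python
import cv2
import numpy as np

def adaptive_mask(color_mask, depth_mask, sigma=10.0):
    """Blend the color/depth binary masks by distance to the depth boundary.

    color_mask, depth_mask: HxW boolean arrays from background subtraction.
    """
    edges = cv2.Canny(depth_mask.astype(np.uint8) * 255, 100, 200)
    # Distance of every pixel to the nearest depth-boundary pixel.
    dist = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 5)
    w_color = np.exp(-dist / sigma)   # near the boundary: trust the color mask
    blended = w_color * color_mask + (1.0 - w_color) * depth_mask
    return blended > 0.5
```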

21 Our Approach – Pipeline
Next I will talk about our non-local temporal matting.

22 Our Approach – Non-local Temporal Matting: intro of closed-form matting
Assumption: both F and B are approximately constant over a small window around each pixel. Our approach is based on standard closed-form matting, so let me introduce it briefly. Closed-form matting rests on the assumption that both the foreground and background colors are approximately constant over a small window around each pixel. The matting Laplacian encodes the affinity between each pixel and its neighbors and can be computed by equation (2):

$$L_{ij} = \sum_{k \mid (i,j) \in w_k}\left(\delta_{ij} - \frac{1}{|w_k|}\Big(1 + (I_i - \mu_k)^{\top}\big(\Sigma_k + \tfrac{\epsilon}{|w_k|} E_3\big)^{-1}(I_j - \mu_k)\Big)\right) \qquad (2)$$

where $w_k$ is a small window, $\mu_k$ and $\Sigma_k$ are the mean and covariance of the colors in $w_k$, $\epsilon$ is a regularization weight, and $E_3$ is the $3 \times 3$ identity matrix. The alpha map can then be obtained by solving:

$$(L + \lambda D_S)\,\alpha = \lambda\, b_S \qquad (3)$$

where $D_S$ is a diagonal matrix whose entries are 1 for trimap-constrained pixels and 0 otherwise, and $b_S$ holds the specified alpha values for those pixels.
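For illustration, here is a minimal sketch of solving equation (3) on the CPU with SciPy's sparse solver; the paper's implementation runs on the GPU with CULA Sparse instead.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_alpha(L, trimap_flat, lam=100.0):
    """Solve (L + lam * D_S) alpha = lam * b_S from equation (3).

    L: matting Laplacian (sparse, n x n); trimap_flat: per-pixel values
    with 1 = foreground, 0 = background, anything else = unknown.
    """
    known = (trimap_flat == 0) | (trimap_flat == 1)
    D = sp.diags(known.astype(np.float64))   # selects the constrained pixels
    b = lam * (trimap_flat * known)          # target alpha for known pixels
    alpha = spla.spsolve((L + lam * D).tocsr(), b)
    return np.clip(alpha, 0.0, 1.0)
```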

23 Our Approach – Non-local Temporal Matting: extension of closed-form matting
Closed-form matting assumption: F and B are smooth in a local window. Temporal coherency assumption: F and B are smooth in both the spatial and temporal domains. So if the assumption of closed-form matting is local smoothness in a small window, how do we achieve temporal coherency? We extend the assumption so that the foreground and background are smooth in both the spatial and temporal domains. In this way, we arrive at non-local temporal matting.

24 Our Approach – Non-local Temporal Matting: 3d non-local neighbor
[Figure: 2D neighbors within frame $I_{t_0}$ versus 3D non-local neighbors across frames $I_{t-1}$, $I_{t_0}$, $I_{t+1}$.] In the Laplacian matrix, the affinity between each pixel and its neighbors needs to be encoded. In closed-form matting, 2D neighbors are used: each pixel's neighbors are all the pixels in its local window, as the figure illustrates, with the blue crosses being the neighbors of the red point; equation (4) gives this formulation. In our approach, we instead use 3D non-local neighbors: each pixel's neighbors are the pixels most similar to it in both the spatial and temporal domains, found with an approximate nearest-neighbor search. Equation (4) is then reformulated into equation (5), which can still be solved via equation (3).
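A minimal sketch of the neighbor search, assuming each pixel is embedded as an (x, y, t, color) feature vector and queried with an exact kd-tree; the feature weights are illustrative, and a real-time system would use an approximate, GPU-friendly search instead.

```python
import numpy as np
from scipy.spatial import cKDTree

def nonlocal_neighbors(frames, k=9, w_spatial=1.0, w_time=2.0, w_color=20.0):
    """Find 3D non-local neighbors across a small window of frames.

    `frames`: list of HxWx3 float images. Each pixel becomes a feature
    vector (x, y, t, r, g, b); the weights trade off spatial, temporal,
    and color similarity and are illustrative assumptions.
    """
    feats = []
    for t, img in enumerate(frames):
        h, w = img.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w]
        feats.append(np.column_stack([
            xs.ravel() * w_spatial, ys.ravel() * w_spatial,
            np.full(h * w, t * w_time),
            img.reshape(-1, 3) * w_color]))
    feats = np.vstack(feats)
    tree = cKDTree(feats)
    # For each pixel, its k most similar pixels in space, time, and color
    # (the first hit is the pixel itself).
    _, idx = tree.query(feats, k=k)
    return idx
```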

25 Our Approach – Non-local Temporal Matting: volume partition
The linear equation system is too large. Use kd-tree segmentation to partition the volume, recursing until the number of unknowns within each block is smaller than a threshold. The next problem is that the linear equation system is too large to solve directly. So we use a kd-tree-style segmentation to partition the 3D volume into several blocks; the algorithm recurses until the number of unknown pixels within each block is smaller than a threshold (a sketch follows below). This is a simple illustration of the partition.
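A minimal sketch of the recursive partition, assuming a median split along the longest axis of the (x, y, t) volume; the threshold value is illustrative.

```python
import numpy as np

def partition(unknown_idx, coords, max_unknowns=5000):
    """Recursively split the (x, y, t) volume, kd-tree style, until each
    block contains at most `max_unknowns` unknown pixels.

    `coords`: N x 3 array of (x, y, t) positions of the unknown pixels;
    `unknown_idx`: their indices into the full linear system.
    """
    if len(unknown_idx) <= max_unknowns:
        return [unknown_idx]               # leaf: block is small enough
    # Split along the axis with the largest extent, at the median.
    axis = np.ptp(coords, axis=0).argmax()
    left = coords[:, axis] <= np.median(coords[:, axis])
    if left.all() or not left.any():       # degenerate split: stop here
        return [unknown_idx]
    return (partition(unknown_idx[left], coords[left], max_unknowns) +
            partition(unknown_idx[~left], coords[~left], max_unknowns))
```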

26 Our Approach – GPU implementation: CUDA and CULA Sparse
Quantitative Performance:

Stage                 Average Time (ms/frame)   FPS (frames/s)
Background Modeling
Preprocessing
Trimap Generation
Temporal Matting
Total

Finally, since our approach is GPU-friendly and parallelizable, we implemented it on the GPU using CUDA and the CULA Sparse library. The table shows the quantitative performance of our real-time implementation.

27 Results Comparison with three other state-of-the-art works
Here I show some results. We compare our results with three other approaches: background subtraction, closed-form matting, and the FreeCam system. Our results have the best temporal coherency and the fewest flickering artifacts, outperforming the other methods.

28 Limitation Shadow-like regions when moving fast
Reason 1: color/depth not well aligned. Reason 2: color/depth not synchronized. Sometimes inaccurate when the foreground and background share similar colors. The limitation of our approach is that you can sometimes still see shadow-like regions and inaccurate parts. This can be further improved.

29 Future Work Refine the depth map using the attained alpha map to achieve a better 3D representation. Our future work is to use the accurate alpha map to refine the depth map so that we can demonstrate our results in 3D.

30 Thank You! Q & A That’s all. Thanks for listening. Any questions?

