Indoor Scene Segmentation using a Structured Light Sensor


Indoor Scene Segmentation using a Structured Light Sensor
Nathan Silberman and Rob Fergus
ICCV 2011 Workshop on 3D Representation and Recognition, Courant Institute

Overview
- Indoor scene recognition using the Kinect
- Introduce a new indoor scene depth dataset
- Describe a CRF-based model
- Explore the use of RGB/depth cues

Motivation
Indoor scene recognition is hard:
- Far less texture than outdoor scenes
- More geometric structure
The Kinect gives us a depth map (and RGB): direct access to shape and geometry information.

Overview
- Indoor scene recognition using the Kinect
- Introduce a new indoor scene depth dataset
- Describe a CRF-based model
- Explore the use of RGB/depth cues

Capturing our Dataset
- We wanted a dataset usable for scene understanding, not just detection
- The Kinect normally runs off an AC adapter; we replaced it with a battery so the sensor could be carried around
- Open-source drivers were used to record the data

Statistics of the Dataset

Scene Type    Scenes   Frames   Labeled Frames*
Bathroom           6    5,588        76
Bedroom           17   22,764       480
Bookstore          3   27,173       784
Cafe               1    1,933        48
Kitchen           10   12,643       285
Living Room       13   19,262       355
Office            14   19,254       319
Total             64  108,617     2,347

Each image is 480x640; each pixel is a 5-tuple: RGB (3), depth (1), label (1). The labeled frames are densely labeled. Most scenes were collected from friends' apartments in New York.
* Labels obtained via LabelMe

Dataset Examples: Living Room
[Images: RGB | Raw Depth | Labels]
Note the noise in the raw depth image.

Dataset Examples: Living Room
[Images: RGB | Depth* | Labels]
* Bilateral filtering used to clean up the raw depth image and fill in holes

Dataset Examples: Bathroom
[Images: RGB | Depth | Labels]
Holes in the depth image are filled via projection plus bilateral filtering.

Dataset Examples: Bedroom
[Images: RGB | Depth | Labels]

Existing Depth Datasets
Most depth datasets are very small (a few images), but there are a few bigger ones:
- RGB-D Object Dataset [1]: uses the Kinect, but covers small objects only (similar to COIL), not full scenes
- Stanford Make3D [2]: not densely labeled, and all outdoors
- B3DO: uses the Kinect, but is aimed at detection rather than scene understanding (849 labeled frames, 75 scenes, around 50 object classes)
[1] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. ICRA 2011.
[2] B. Liu, S. Gould, and D. Koller. Single Image Depth Estimation from Predicted Semantic Labels. CVPR 2010.

Existing Depth Datasets
- Point Cloud Data (PCD) [1]: 52 scenes
- B3DO [2]
[1] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Semantic Labeling of 3D Point Clouds for Indoor Scenes. NIPS 2011.
[2] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A Category-Level 3-D Object Dataset: Putting the Kinect to Work. ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011.

Dataset Freely Available
http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html
Includes images, labels, train/test splits, code, and the raw data streams.

Overview
- Indoor scene recognition using the Kinect
- Introduce a new indoor scene depth dataset
- Describe a CRF-based model
- Explore the use of RGB/depth cues

Segmentation using a CRF Model
Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j)
- Standard CRF formulation: a cost function (energy) over a discrete label set (13 classes: the 12 most common plus background)
- Optimized via a standard graph-cuts solver
- The interesting part is how depth is incorporated, through both the local appearance and the spatial smoothness terms
We now detail each term and the alternative choices for it, and finish with the evaluation.
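The cost function on this slide can be written down directly. Below is a minimal illustrative evaluation of the energy (function and variable names are hypothetical; the actual model minimizes this cost with graph cuts rather than evaluating it by hand):

```python
import numpy as np

def crf_energy(labels, unary, pairs, potts_weight):
    """Evaluate the CRF cost for a labeling.

    labels: (N,) int array, one label per node (pixel/superpixel)
    unary: (N, C) array of local costs (appearance/location terms)
    pairs: list of (i, j) index pairs of adjacent nodes
    potts_weight: penalty when adjacent labels disagree (Potts model)
    """
    # Sum of local terms: pick each node's cost under its assigned label
    local = unary[np.arange(len(labels)), labels].sum()
    # Spatial smoothness: constant penalty per disagreeing neighbor pair
    smooth = sum(potts_weight for (i, j) in pairs if labels[i] != labels[j])
    return local + smooth
```

Graph cuts can find the exact minimum of this energy for two labels and a strong local minimum (via alpha-expansion) for the multi-label case used here.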

Model
Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j)
where Local Terms(label_i) = Appearance(label_i | descriptor_i) × Location(i)

Appearance Term
Appearance(label_i | descriptor_i)
Several descriptor types to choose from:
- RGB-SIFT
- Depth-SIFT
- Depth-SPIN
- RGBD-SIFT
- RGB-SIFT/D-SPIN
We'll evaluate these alternatives later.

Descriptor Type: RGB-SIFT
Standard 128-D SIFT, extracted over a discrete grid on the RGB image from the Kinect.

Descriptor Type: Depth-SIFT
128-D SIFT extracted over a discrete grid on the depth image (pixel intensity proportional to depth, with linear scaling).
Why this is a good idea: it captures large-scale shape information, and its small-magnitude directional gradients are essentially surface normals.

Descriptor Type: Depth-SPIN
50-D spin-image descriptor, extracted over a discrete grid on the linearly scaled depth image; each descriptor is a histogram over (radius, depth).
A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE PAMI, 21(5):433-449, 1999.

Descriptor Type: RGBD-SIFT
Concatenation of RGB-SIFT (from the RGB image) and Depth-SIFT (from the linearly scaled depth image): 256-D.

Descriptor Type: RGB-SIFT/D-SPIN
Concatenation of RGB-SIFT (from the RGB image) and Depth-SPIN (from the linearly scaled depth image): 178-D.
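The combined descriptors above are plain concatenations of per-location features extracted over the same grid. A minimal sketch (names are hypothetical):

```python
import numpy as np

def concat_descriptors(rgb_sift, depth_desc):
    """Concatenate per-location descriptors extracted over the same grid.

    rgb_sift:   (N, 128) dense SIFT on the RGB image
    depth_desc: (N, D)   dense descriptor on the depth image
                (D=128 for Depth-SIFT -> 256-D RGBD-SIFT;
                 D=50  for Depth-SPIN -> 178-D RGB-SIFT/D-SPIN)
    """
    assert rgb_sift.shape[0] == depth_desc.shape[0]
    return np.hstack([rgb_sift, depth_desc])
```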

Appearance Model
Appearance(label_i | descriptor_i) is modeled by a neural network with a single hidden layer:
- Input: the 128/178/256-D descriptor at each grid location
- Hidden layer: 1000-D
- Output: softmax layer over the 13 classes, interpreted as p(label | descriptor), a probability distribution over classes
- Trained with backpropagation
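A minimal NumPy sketch of such a network; the hidden nonlinearity (tanh) and the initialization are assumptions, since the slides only specify the layer sizes and the softmax output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes from the slide: descriptor -> 1000-D hidden -> 13 classes
D_IN, D_HID, N_CLASSES = 256, 1000, 13
W1 = rng.normal(0, 0.01, (D_IN, D_HID))
b1 = np.zeros(D_HID)
W2 = rng.normal(0, 0.01, (D_HID, N_CLASSES))
b2 = np.zeros(N_CLASSES)

def appearance(descriptors):
    """p(label | descriptor): one hidden layer plus a softmax output."""
    h = np.tanh(descriptors @ W1 + b1)           # hidden activations
    logits = h @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)      # rows sum to 1
```

Training with backpropagation would minimize the cross-entropy between these outputs and the per-pixel labels.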

Model
Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j)
where Local Terms(label_i) = Appearance(label_i | descriptor_i) × Location(i)
Location(i) is built from 2D priors and 3D priors.

Location Priors: 2D
2D priors are histograms of P(class, location), smoothed to avoid image-specific artifacts.
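One plausible construction of these smoothed histograms; the Gaussian smoothing and its bandwidth are assumptions (the slide only says the histograms are smoothed):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def location_prior_2d(label_maps, n_classes, sigma=8.0):
    """Estimate P(class, location) from training label maps, then smooth.

    label_maps: (M, H, W) int array of per-pixel class labels
    Returns an (n_classes, H, W) array of smoothed location priors.
    sigma is an assumed smoothing bandwidth in pixels.
    """
    M, H, W = label_maps.shape
    counts = np.zeros((n_classes, H, W))
    for c in range(n_classes):
        counts[c] = (label_maps == c).sum(axis=0)   # occurrences per pixel
    counts /= counts.sum()                          # joint P(class, location)
    # Smooth spatially only (sigma 0 on the class axis)
    return gaussian_filter(counts, sigma=(0, sigma, sigma))
```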

Motivation: 3D Location Priors
- 2D priors don't capture 3D geometry
- 3D priors can be built from the depth data
- But rooms have different shapes and sizes: how do we align them?
The idea of a 3D prior is to capture the high degree of regularity in indoor scene layout.

Motivation: 3D Location Priors
To align rooms, we use a normalized cylindrical coordinate system: each vertical scanline is divided by its maximum depth, so the band of maximum depths along each scanline maps to unit distance.
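The per-scanline normalization can be sketched as follows (treating zero depth as missing is an assumption about how holes are handled):

```python
import numpy as np

def normalized_relative_depth(depth):
    """Divide each vertical scanline (column) by its maximum depth,
    so the farthest surface in every column sits at relative depth 1.0.

    depth: (H, W) array of metric depths; zeros treated as missing.
    """
    d = depth.astype(float)
    col_max = np.where(d > 0, d, 0).max(axis=0)  # max depth per column
    col_max[col_max == 0] = 1.0                  # avoid divide-by-zero
    return d / col_max[None, :]
```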

Relative Depth Distributions
[Plots: density vs. relative depth (0 to 1) for Table, Television, Bed, Wall]
The cylindrical normalization gives relative depth, so each class has a characteristic relative-depth distribution.

Location Priors: 3D
[Figure: each column is a bin of relative depth]
The prior is built over 3 dimensions: log depth, height, and angle.

Model
Cost(labels) = Local Terms(label_i) + Spatial Smoothness(label_i, label_j)
The spatial smoothness term is a penalty for adjacent labels disagreeing (standard Potts model).

Spatial Modulation of Smoothness
The Potts penalty can be modulated by local edge evidence:
- None
- RGB edges
- Depth edges
- RGB + depth edges
- Superpixel edges
- Superpixel + RGB edges
- Superpixel + depth edges
In practice these variants don't make a huge difference.
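A common contrast-sensitive way to modulate the Potts penalty with edge evidence; the exact functional form and parameters below are assumptions, since the slide only lists which edge cues were tried:

```python
import numpy as np

def potts_weight(intensity_i, intensity_j, base=1.0, beta=10.0):
    """Edge-modulated Potts penalty between adjacent pixels i and j.

    The penalty for a label disagreement shrinks where the RGB or depth
    difference is large (a likely object boundary), so label changes are
    encouraged to align with image edges.
    """
    diff2 = float(np.sum((intensity_i - intensity_j) ** 2))
    return base * np.exp(-beta * diff2)
```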

Experimental Setup
- 60% train (~1,408 images), 40% test (~939 images)
- 10-fold cross-validation
- Images of the same scene must stay on the same side of the split
- Performance criterion: pixel-level classification accuracy (mean diagonal of the confusion matrix)
- 12 most common classes, plus 1 background class covering the rest
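The constraint that images of the same scene never straddle the train/test boundary amounts to splitting scenes rather than images. A sketch (names are hypothetical):

```python
import numpy as np

def scene_split(scene_ids, train_frac=0.6, seed=0):
    """Split image indices into train/test so that all images of a scene
    land on the same side: we shuffle and split scenes, not images.

    scene_ids: list of per-image scene identifiers.
    """
    rng = np.random.default_rng(seed)
    scenes = np.array(sorted(set(scene_ids)))
    rng.shuffle(scenes)
    n_train = int(round(train_frac * len(scenes)))
    train_scenes = set(scenes[:n_train].tolist())
    train = [i for i, s in enumerate(scene_ids) if s in train_scenes]
    test = [i for i, s in enumerate(scene_ids) if s not in train_scenes]
    return train, test
```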

Evaluating Descriptors
[Bar chart: pixel-level accuracy (%) for 2D descriptors vs. 3D descriptors]
RGBD gives a 5% gain over RGB alone and a 7% gain over depth alone.

Evaluating Location Priors
[Bar chart: pixel-level accuracy (%) with 2D descriptors vs. 3D descriptors]

Qualitative Results
[Figure: the 2nd column shows results using standard RGB features; the 3rd column leverages the depth cues]

[Failure case in the top row]: the model is still somewhat unreliable and still does not understand the 3D structure of objects. The absolute numbers may look low, but an indoor scene contains many, many objects.

Conclusion
- The Kinect depth signal helps scene parsing
- Still a long way from great performance
- We have shown standard approaches on RGB-D data; lots of potential for more sophisticated methods (no complicated geometric reasoning yet)
http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html

Preprocessing the Data
We use open-source calibration software [1] to infer:
- Parameters of the RGB and depth cameras
- The homography between the cameras (depth and RGB are not aligned)
Pixels are missing from the depth map due to:
- Shadow caused by the displacement between the infrared emitter and camera
- Dark/specular surfaces
- Random noise
Raw depth values lie between 0 and 65,536 and need to be inverted to obtain metric depth.
[1] N. Burrus. Kinect RGB Demo v0.4.0. http://nicolas.burrus.name/index.php/Research/KinectRgbDemoV4?from=Research.KinectRgbDemoV2, Feb. 2011.
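The inversion step can be sketched as below; the functional form and constants come from Nicolas Burrus's widely circulated Kinect calibration notes and should be treated as illustrative, not as this paper's exact calibration:

```python
def raw_to_depth_m(raw):
    """Convert a raw Kinect reading to approximate metric depth.

    The sensor reports a disparity-like value, so depth ~ 1 / (a + b * raw);
    larger raw readings correspond to surfaces farther away. The constants
    are illustrative; a per-device calibration should be used in practice.
    """
    return 1.0 / (3.3309495161 - 0.0030711016 * raw)
```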

Preprocessing the Data
A cross-bilateral filter is used to diffuse depth across regions of similar RGB intensity, filling the holes in the depth map. A naive GPU implementation runs in ~100 ms.
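A rough CPU illustration of the idea: each missing depth value is estimated from nearby valid depths, weighted by spatial distance and intensity similarity, so depth diffuses within regions of similar appearance. The parameter values and the restriction to hole pixels only are assumptions, not the paper's implementation:

```python
import numpy as np

def cross_bilateral_fill(depth, gray, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Fill holes in a depth map with a cross-bilateral weighted average.

    depth: (H, W) array, zeros mark missing values
    gray:  (H, W) guide image (e.g. RGB converted to grayscale) in [0, 1]
    """
    H, W = depth.shape
    out = depth.copy()
    for y in range(H):
        for x in range(W):
            if depth[y, x] > 0:
                continue                        # only fill holes
            num = den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < H and 0 <= xx < W) or depth[yy, xx] == 0:
                        continue
                    # Spatial closeness times guide-image similarity
                    w = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2)
                               - (gray[y, x] - gray[yy, xx]) ** 2 / (2 * sigma_r ** 2))
                    num += w * depth[yy, xx]
                    den += w
            if den > 0:
                out[y, x] = num / den
    return out
```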

Motivation
[Figure: results from spatial-pyramid-based classification [1] on 5 indoor scene types]
Contrast this with the 81% that [1] achieves on a 13-class (mostly outdoor) scene dataset; they note similar confusion within indoor scenes.
[1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.