1
A Grand Unifying Architecture for Scene Understanding
Marc Eder
March 23, 2016
Eigen, David, and Rob Fergus. "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture." Proceedings of the IEEE International Conference on Computer Vision. 2015.
2
Attributed (unverifiably) to Albert Einstein
3
The 4 C's of Scene Understanding
Content – What is in the scene?
Composition – How is the content laid out?
Configuration – What is the scene's spatial layout?
Context – What's the past/present/future of the scene?
4
One Ring to Rule Them All
Eigen and Fergus propose a single network for multiple understanding tasks [1]
Architecture has state-of-the-art performance on 3 out of 4 C's
▫Doesn't address scene context
An earlier version scored top results for depth map estimation from RGB and surface normal estimation from RGB in the "Reconstruction Meets Recognition Challenge" at ECCV 2014 [2]
5
(Before neural networks)
6
Detecting Content
Ex. object detection, identification, recognition
Templates and other appearance-based models
▫(1991) Turk, Matthew, and Alex Pentland. "Eigenfaces for recognition."
Low-level feature-based approaches
▫(1999) Lowe, David G. "Object recognition from local scale-invariant features."
▫(2001) Viola, Paul, and Michael Jones. "Rapid object detection using a boosted cascade of simple features."
Intermediate-feature-based methods
▫(2008) Felzenszwalb, Pedro, et al. "A discriminatively trained, multiscale, deformable part model."
▫(2009) Kumar, Neeraj, et al. "Attribute and simile classifiers for face verification."
7
Determining Scene Composition
Ex. object localization, semantic segmentation
Bottom-up, low-level feature approaches
▫(1985) Haralick, Robert M., and Linda G. Shapiro. "Image segmentation techniques."
▫(1999) Comaniciu, Dorin, and Peter Meer. "Mean shift analysis and applications."
Top-down, Gestalt-inspired segmentation
▫(1997/2000) Shi, Jianbo, and Jitendra Malik. "Normalized cuts and image segmentation."
▫(2003) Ren, Xiaofeng, and Jitendra Malik. "Learning a classification model for segmentation."
Joint inference (things and stuff)
▫(2004) Torralba, Antonio, Kevin P. Murphy, and William T. Freeman. "Contextual models for object detection using boosted random fields."
▫(2013) Tighe, Joseph, and Svetlana Lazebnik. "Finding things: Image parsing with regions and per-exemplar detectors."
8
Estimating Scene Configuration
Ex. 3D structure, depth estimation
Robust estimation
▫(1981) Fischler, Martin A., and Robert C. Bolles. "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography."
Leveraging multiple views
▫(1987) Longuet-Higgins, H. Christopher. "A computer algorithm for reconstructing a scene from two projections."
▫(2004) Nistér, David. "An efficient solution to the five-point relative pose problem."
Structure from Motion (SfM) and dozens of other applications
▫Hartley, Richard, and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
9
(Where neural networks solve all of your problems)
LeNet from http://deeplearning.net/tutorial/lenet.html
ImageNet from http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
R-CNN from Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
10
Convolutional Neural Networks
Detecting Content
▫(2014) Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition."
Determining Scene Composition
▫(2014) Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation."
Estimating Scene Configuration
▫(2015) Flynn, John, et al. "DeepStereo: Learning to Predict New Views from the World's Imagery."
11
"If you want something new, you have to stop doing something old." – Peter Drucker
12
Single Multi-Purpose Architecture
Pixel-map regression is a common task for many applications
Shifts development focus to defining the proper training set and cost function
Shares information across modalities
▫e.g. (Segmentation OR Depth) vs. (Segmentation AND Depth)
13
"Study the past if you wish to divine the future" – Confucius
14
Single-Image Depth Prediction
Rooted in stereo depth estimation
▫Deterministic
▫Requires multiple views with static scenes, a proper baseline, the right amount of overlap, …
Monocular depth is hard
▫Non-deterministic
▫No verification
▫Must handle both local and global scale ambiguity (illustrated below)
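One way to see the global scale ambiguity (a standard fact, not from the slides): under a pinhole projection with focal length f, a scene point (X, Y, Z) maps to image coordinates

u = f·X/Z,  v = f·Y/Z

so scaling the whole scene by any s > 0, i.e. (X, Y, Z) → (sX, sY, sZ), produces exactly the same image. A single view therefore cannot determine absolute scale without additional cues such as known object sizes, texture gradients, or haze.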
15
A Long Long Time Ago…
(2005) Saxena, Ashutosh, Sung H. Chung, and Andrew Y. Ng. "Learning depth from single monocular images." [3]
Used texture and haze features in an MRF to estimate depth maps
Premise:
▫"The depth of a particular patch depends on the features of the patch, but is also related to the depths of other parts of the image"
Features computed with 9 Laws' energy masks and 6 oriented gradient filters
Filter images from [3]
16
A Long Long Time Ago… RESULTS
Column 1: Original image
Column 2: Ground truth
Column 3: Gaussian model prediction
Column 4: Laplacian model prediction (more computationally efficient due to linear programming)
Image from [3]
17
Recent History
(2014) Eigen, David, Christian Puhrsch, and Rob Fergus. "Depth map prediction from a single image using a multi-scale deep network." [2]
Direct precursor to [1]
Won the "Reconstruction Meets Recognition Challenge" in depth map and surface normal estimation at ECCV 2014
First introduced the multi-scale approach used in [1]
18
Recent History
Coarse Network
▫Convolutions and max pooling reduce the spatial dimension of global image information
▫Useful for the network to learn vanishing points, alignment, and object locations
▫Final layers are fully connected, so they see the full image; output is at ¼ scale
Fine Network
▫Refines the coarse output
▫Each unit only operates over a patch of the scene
▫Returns a depth map at ¼ scale
Image from [2]
19
Recent History
20
RESULTS
Top – NYUDepth v2 [4]
Bottom – KITTI [5]
A: Input image
B: Coarse network output
C: Fine network output
D: Ground truth
Image from [2]
21
(Or at least December 2015)
22
Predicting Depth, Normals, and Semantic Segmentation
Generalization of the architecture from the 2014 paper [2]
Adds an extra scale to the pipeline
More convolutional layers
Only one output layer
▫Pass feature maps from scale to scale instead of coarse predictions
▫Simplifies training: the network can now be trained (nearly) jointly
23
Coarse-to-Fine Approach
Images from [1]
24
Model Comparison
ECCV 2014 [2]
ICCV 2015 [1]
25
The Architecture
Coarse block is nearly identical to [2], but deeper
▫Trained at two sizes: AlexNet [6] and VGG [7]
Mid-level resolution block builds on the global output of the coarse block
▫Concatenates coarse features with a single layer of finer-stride convolution/pooling
▫Continues processing features at mid-level resolution
Highest-resolution block does the same as the mid-level block, but with a yet finer stride, aligning the output to a higher resolution (sketched below)
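A minimal PyTorch-style sketch of this coarse-to-fine design. Channel counts, kernel sizes, and the 8x8 fully connected bottleneck are illustrative placeholders, not the exact configuration of [1]; the point is the structure: a global coarse block ending in fully connected layers, and two finer blocks that each concatenate a single finer-stride conv/pool of the raw image with the previous scale's feature maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleNet(nn.Module):
    """Sketch of the coarse-to-fine design in [1]; layer sizes are illustrative."""
    def __init__(self, out_channels=1):
        super().__init__()
        # Scale 1 (coarse): conv/pool over the whole image, ending in fully connected
        # layers whose output is reshaped back into a low-resolution feature map.
        self.scale1_conv = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.scale1_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
            nn.Linear(128 * 8 * 8, 64 * 8 * 8), nn.ReLU(),
        )
        # Scales 2 and 3: one finer-stride conv/pool on the raw image, concatenated
        # with the previous scale's feature maps (not its predictions).
        self.scale2_in = nn.Sequential(nn.Conv2d(3, 96, 9, stride=2, padding=4),
                                       nn.ReLU(), nn.MaxPool2d(2))
        self.scale2 = nn.Sequential(nn.Conv2d(96 + 64, 64, 5, padding=2), nn.ReLU(),
                                    nn.Conv2d(64, 64, 5, padding=2), nn.ReLU())
        self.scale3_in = nn.Sequential(nn.Conv2d(3, 64, 9, stride=1, padding=4),
                                       nn.ReLU(), nn.MaxPool2d(2))
        self.scale3 = nn.Sequential(nn.Conv2d(64 + 64, 64, 5, padding=2), nn.ReLU(),
                                    nn.Conv2d(64, out_channels, 5, padding=2))  # single output layer

    def forward(self, x):
        h, w = x.shape[-2:]
        f1 = self.scale1_fc(self.scale1_conv(x)).view(-1, 64, 8, 8)      # global features
        f1 = F.interpolate(f1, size=(h // 4, w // 4), mode="bilinear", align_corners=False)
        f2 = self.scale2_in(x)
        f2 = F.interpolate(f2, size=(h // 4, w // 4), mode="bilinear", align_corners=False)
        f2 = self.scale2(torch.cat([f2, f1], dim=1))                      # mid-level features
        f3 = self.scale3_in(x)                                            # finest stream, 1/2 resolution
        f2_up = F.interpolate(f2, size=f3.shape[-2:], mode="bilinear", align_corners=False)
        return self.scale3(torch.cat([f3, f2_up], dim=1))

# Same backbone, different output heads: depth -> MultiScaleNet(1),
# normals -> MultiScaleNet(3), segmentation -> MultiScaleNet(num_classes)
```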
26
The Training Procedure
Scales 1 and 2 are trained jointly by SGD
Scale 3 is subsequently trained with scales 1 and 2 held fixed (see the sketch below)
At scale 3, random 74x55 crops are used
▫Taken from the output of scales 1 and 2 and from the original input
All 3 tasks use roughly the same initialization and learning rates at each layer
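A minimal sketch of this two-stage schedule, assuming placeholder sub-networks (scale12 standing in for the jointly trained scales 1+2, scale3 for the refinement stage) and a generic task loss; the toy random tensors stand in for dataset batches and the 74x55 crops.

```python
import torch
import torch.nn as nn

# Stand-ins for the scale sub-networks (placeholders, not the paper's layers)
scale12 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 1, 3, padding=1))       # scales 1+2, trained jointly
scale3  = nn.Sequential(nn.Conv2d(3 + 1, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 1, 3, padding=1))       # scale 3, refines the scale-2 output
loss_fn = nn.MSELoss()                                         # stand-in for the task loss

# Stage 1: train scales 1 and 2 jointly with SGD
opt12 = torch.optim.SGD(scale12.parameters(), lr=1e-3, momentum=0.9)
for _ in range(5):                                             # toy loop; real code iterates a dataset
    img, target = torch.randn(4, 3, 55, 74), torch.randn(4, 1, 55, 74)
    loss = loss_fn(scale12(img), target)
    opt12.zero_grad(); loss.backward(); opt12.step()

# Stage 2: freeze scales 1 and 2; train scale 3 on the input together with the fixed scale-2 output
for p in scale12.parameters():
    p.requires_grad_(False)
opt3 = torch.optim.SGD(scale3.parameters(), lr=1e-3, momentum=0.9)
for _ in range(5):
    img, target = torch.randn(4, 3, 55, 74), torch.randn(4, 1, 55, 74)
    with torch.no_grad():
        coarse = scale12(img)                                  # fixed coarse/mid prediction
    loss = loss_fn(scale3(torch.cat([img, coarse], dim=1)), target)
    opt3.zero_grad(); loss.backward(); opt3.step()
```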
27
"It is a marvelous pain to find out but a short way by long wandering." – Roger Ascham, from "The Schoolmaster"
28
But first!
Remember: this paper aims to create a multi-purpose architecture
Each task is thus defined only by its training set and loss function
29
Task 1: Depth Estimation
Train the network on NYUDepth v2
Similar loss to [2], but including a gradient-matching term (sketched below)
Better results using VGG than AlexNet
▫Attributed to larger model size
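A sketch of that loss, per my reading of [1] and [2]: the scale-invariant log-depth error of [2] plus a first-order gradient-matching term on the log-depth difference. Coefficients follow the paper as closely as this sketch allows; treat it as illustrative.

```python
import numpy as np

def depth_loss(pred_log_d, gt_log_d):
    """Scale-invariant log-depth error plus a first-order gradient-matching term.
    pred_log_d, gt_log_d: (H, W) arrays of log depths."""
    d = pred_log_d - gt_log_d                            # per-pixel log-depth difference
    n = d.size
    scale_inv = (d ** 2).sum() / n - 0.5 * (d.sum() ** 2) / n ** 2
    gx, gy = np.diff(d, axis=1), np.diff(d, axis=0)      # horizontal / vertical gradients of d
    grad_match = ((gx ** 2).sum() + (gy ** 2).sum()) / n
    return scale_inv + grad_match
```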
30
Task 1: Depth Estimation RESULTS
A: RGB input
B: Result of [2]
C: Output of the multi-purpose net
D: Ground truth
Sharpness improvement over [2]
Substantial numerical improvement over peer papers as well
Image from [1]
31
Task 2: Surface Normals
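Following the "training set plus loss function" recipe, [1] trains this task with a loss that rewards alignment between predicted and ground-truth normals: the negative mean dot product, with predictions normalized to unit length. A small sketch, assuming an H x W x 3 array layout:

```python
import numpy as np

def normals_loss(pred, gt, eps=1e-8):
    """Negative mean dot product between predicted and ground-truth unit normals.
    pred, gt: arrays of shape (H, W, 3); gt assumed already unit length."""
    pred_unit = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    return -np.mean(np.sum(pred_unit * gt, axis=-1))
```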
32
Comparison of surface normal results
Image from [1]
33
Task 3: Semantic Segmentation
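For this task the per-pixel loss is the standard pixelwise cross-entropy over class logits (which, to my reading, is also what [1] uses). A small sketch with illustrative shapes:

```python
import torch
import torch.nn as nn

# Pixelwise cross-entropy over semantic classes; shapes are illustrative.
loss_fn = nn.CrossEntropyLoss()
logits = torch.randn(2, 13, 55, 74)             # (batch, classes, H, W), e.g. 13 NYUDepth classes
labels = torch.randint(0, 13, (2, 55, 74))      # (batch, H, W) integer class labels
loss = loss_fn(logits, labels)
```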
34
RESULTS
Tested on NYUDepth v2 (top) as well as Pascal VOC 2011 (bottom)
For Pascal VOC 2011:
▫A: Original RGB input
▫B: Predicted labeling
▫C: Ground truth
Image from [1]
35
(All good things must end)
36
Empirical Multi-Scale Effects
Improvement increases as more scales are added
Coarse scale most important for depth and normal estimation
Mid-level scale most important for segmentation
▫But when only using RGB, the coarse scale contributes more
37
Discussion
39
Selected References
[1] Eigen, David, and Rob Fergus. "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture." Proceedings of the IEEE International Conference on Computer Vision. 2015.
[2] Eigen, David, Christian Puhrsch, and Rob Fergus. "Depth map prediction from a single image using a multi-scale deep network." Advances in Neural Information Processing Systems. 2014.
[3] Saxena, Ashutosh, Sung H. Chung, and Andrew Y. Ng. "Learning depth from single monocular images." Advances in Neural Information Processing Systems. 2005.
[4] Silberman, Nathan, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. "Indoor segmentation and support inference from RGBD images." ECCV, 2012.
[5] Geiger, Andreas, Philip Lenz, and Raquel Urtasun. "Are we ready for autonomous driving? The KITTI vision benchmark suite." Computer Vision and Pattern Recognition (CVPR), 2012.
[6] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
[7] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).