Learning Deconvolution Network for Semantic Segmentation


Learning Deconvolution Network for Semantic Segmentation 2015.05.14 Computer Vision Lab. Hyeonwoo Noh, Seunghoon Hong, Bohyung Han

Contents Semantic Segmentation Previous CNN based Semantic Segmentation Our Approach Discussion Future Direction

Semantic Segmentation Objective: recognition of objects in an image with pixel-level detail. [Figure: example images comparing Image Classification, Object Detection, and Semantic Segmentation, with labels such as person, dog, bicycle, lion, giraffe, ball]

Previous CNN based Semantic Segmentation Fully Convolutional Networks for Semantic Segmentation [FCN] [1] Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs [Deeplab-CRF] [2] [1] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015 [2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015

Fully Convolutional Networks for Semantic Segmentation [FCN] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015

Approach Apply a classification CNN to a large image with large padding (FCN)

How to Apply CNN to Large Input A fully connected layer is also a convolution layer. [Figure: VGG's fc6 and fc7 layers (4096-d) reinterpreted as 7×7 and 1×1 convolutions over the 7×7×512 pool5 feature map; applied to a larger 22×22 input, they produce a 16×16 spatial output instead of a single vector]
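The equivalence above can be checked numerically: reshaping a fully connected layer's weight matrix into convolution filters gives identical outputs on the original input size, and sliding those same filters over a larger input yields a spatial score map. This is a toy numpy sketch (the small sizes stand in for pool5/fc6; all names are illustrative, not from the paper's code):

```python
import numpy as np

# Toy sizes standing in for VGG's pool5 (7x7x512) and fc6 (4096):
H = W = 2; C = 3; D = 4            # feature map 2x2x3, fc output dim 4

rng = np.random.default_rng(0)
fc_weight = rng.standard_normal((D, H * W * C))  # ordinary fc layer weights
x = rng.standard_normal((H, W, C))               # pooled feature map

# fc layer applied to the flattened feature map
fc_out = fc_weight @ x.reshape(-1)

# the same weights viewed as D convolution filters of size HxWxC,
# evaluated at the single valid position: identical result
conv_filters = fc_weight.reshape(D, H, W, C)
conv_out = np.array([(f * x).sum() for f in conv_filters])
assert np.allclose(fc_out, conv_out)

# On a larger input, the "fc-as-conv" layer slides over all valid
# positions and produces a spatial score map instead of one vector.
Hb = Wb = 5
big = rng.standard_normal((Hb, Wb, C))
score_map = np.zeros((Hb - H + 1, Wb - W + 1, D))
for i in range(Hb - H + 1):
    for j in range(Wb - W + 1):
        patch = big[i:i + H, j:j + W]
        score_map[i, j] = conv_filters.reshape(D, -1) @ patch.reshape(-1)
print(score_map.shape)
```

With the toy sizes this prints `(4, 4, 4)`: a 4×4 grid of 4-class scores from one forward pass.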

How to Obtain Higher Resolution Output - 1 A single deconvolution layer with a large stride. The deconvolution filter is initialized with bilinear weights, so it is equivalent to bilinear interpolation. [Figure: a 16×16 score map upsampled to 544×544]
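The bilinear initialization mentioned above can be sketched in numpy: build the standard bilinear kernel and apply it as a naive strided transposed convolution. This is an illustrative sketch, not the paper's implementation; the cropping convention is one common choice:

```python
import numpy as np

def bilinear_kernel(size):
    """2D bilinear interpolation kernel used to initialize a deconv filter."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

def upsample(x, factor):
    """Naive stride-`factor` transposed convolution with a bilinear kernel."""
    size = 2 * factor - factor % 2
    k = bilinear_kernel(size)
    H, W = x.shape
    out = np.zeros((H * factor + size - factor, W * factor + size - factor))
    for i in range(H):
        for j in range(W):
            out[i*factor:i*factor+size, j*factor:j*factor+size] += x[i, j] * k
    c = (size - factor) // 2           # crop to H*factor x W*factor
    return out[c:c + H * factor, c:c + W * factor]

y = upsample(np.ones((3, 3)), 2)      # 3x3 -> 6x6
```

On a constant input the interior of the upsampled map stays constant (the shifted bilinear kernels form a partition of unity there), which is exactly the bilinear-interpolation behavior the slide claims.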

How to Obtain Higher Resolution Output - 2 Skip Architecture Pros: low-level features are spatially larger than high-level features. Cons: low-level features are less discriminative.

Limitations Coarse output score maps The skip architecture generates noisy predictions Fixed receptive field size (causes label confusion)

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs [Deeplab-CRF] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015

Approach Better than FCN in two aspects: produces denser output predictions using the “Hole Algorithm”, and applies CRF-based post-processing. Note: this approach doesn’t use a skip architecture; it simply upscales the output score map without a deconvolution layer.

Hole Algorithm Why is the output score map smaller than the input image? Because of pooling with stride. Remove pooling, or pool with no stride? That makes it hard to utilize a pre-trained CNN. The hole algorithm solves this problem. [Figure: FCN with conventional pooling vs. FCN with the hole algorithm]
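The hole (à trous) trick can be sketched in 1D: the same filter weights are applied with taps spaced `rate` samples apart, so the receptive field grows without subsampling the input and without adding parameters. A minimal numpy sketch (illustrative, not the Caffe implementation):

```python
import numpy as np

def dilated_conv1d(x, w, rate):
    """'Hole algorithm' convolution: filter taps are `rate` samples apart,
    enlarging the receptive field without striding or extra weights."""
    k = len(w)
    span = (k - 1) * rate + 1          # effective receptive field
    return np.array([sum(w[t] * x[i + t * rate] for t in range(k))
                     for i in range(len(x) - span + 1)])

x = np.arange(10.0)
w = np.array([1.0, 1.0, 1.0])
plain = dilated_conv1d(x, w, 1)        # ordinary conv, receptive field 3
holed = dilated_conv1d(x, w, 2)        # same 3 weights, receptive field 5
```

Here `plain` is the ordinary 3-tap sum, while `holed` sums `x[i] + x[i+2] + x[i+4]`: identical weights, twice the spatial extent, no loss of resolution.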

Limitations Still coarse output score maps Produces a 39×39 output map from a 306×306 input image Fixed receptive field size (causes label confusion)

Learning Deconvolution Network for Semantic Segmentation Hyeonwoo Noh, Seunghoon Hong, Bohyung Han

Our Approach Deconvolution Network CNN architecture designed to generate large outputs Enables dense output score prediction Instance-wise Prediction Inference on object proposals, then aggregation Enables recognition of objects at multiple scales

Deconvolution Network Generates dense segmentation from the fc7 feature representation Multiple deconvolution layers with ReLU non-linearity Unpooling-layer-based upscaling

Deconvolution Network Unpooling: places activations back at their pooled locations, preserving the structure of activations Deconvolution: densifies sparse activations; filter bases reconstruct shape [Figure: forward propagation of an input through the network]
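The unpooling step above can be sketched concretely: max pooling records the argmax locations ("switches"), and unpooling places each pooled activation back at its recorded location, leaving the rest of the map zero (hence sparse, to be densified by the following deconvolution). A toy numpy sketch, with names of my own choosing:

```python
import numpy as np

def max_pool_with_switches(x, s=2):
    """2x2 max pooling that also records argmax ('switch') locations."""
    H, W = x.shape
    out = np.zeros((H // s, W // s))
    switches = np.zeros((H // s, W // s, 2), dtype=int)
    for i in range(H // s):
        for j in range(W // s):
            patch = x[i*s:(i+1)*s, j*s:(j+1)*s]
            di, dj = np.unravel_index(patch.argmax(), patch.shape)
            out[i, j] = patch[di, dj]
            switches[i, j] = (i*s + di, j*s + dj)
    return out, switches

def unpool(pooled, switches, shape):
    """Place each activation back at its recorded location; rest stays 0."""
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            out[tuple(switches[i, j])] = pooled[i, j]
    return out

x = np.array([[1., 2., 0., 4.],
              [3., 0., 1., 0.],
              [0., 5., 2., 0.],
              [6., 0., 0., 3.]])
p, sw = max_pool_with_switches(x)
u = unpool(p, sw, x.shape)
```

After the round trip, `u` contains each 2×2-block maximum at its original position and zeros elsewhere, so the spatial structure of the strongest activations survives the pooling/unpooling pair.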

Deconvolution Network Training a deconvolution network is difficult: it is a very deep network with a large output space. Batch Normalization [3] Normalizes the input of every layer to a standard Gaussian distribution Prevents drastic changes of the input distribution in upper layers Two-stage training First stage: training with object-centered examples Second stage: training with real object proposals This approach makes the network generalize better [3] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Instance-wise Prediction Inference on object proposals, followed by pixel-wise aggregation Objects at multiple scales are detected in different proposals [Figure: DeconvNet pipeline — 1. Input image 2. Object proposals 3. Prediction and aggregation 4. Results]
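The aggregation step can be sketched as a pixel-wise maximum over per-proposal score maps pasted back into image coordinates. This is an illustrative numpy sketch of the idea (function and variable names are mine, and max is one of the aggregation choices the slides later discuss):

```python
import numpy as np

def aggregate(image_shape, proposals, num_classes):
    """Pixel-wise max aggregation of per-proposal class score maps.
    proposals: list of ((y0, x0, y1, x1), scores) where scores has shape
    (num_classes, y1-y0, x1-x0) for that proposal's region."""
    H, W = image_shape
    agg = np.full((num_classes, H, W), -np.inf)
    for (y0, x0, y1, x1), scores in proposals:
        region = agg[:, y0:y1, x0:x1]
        agg[:, y0:y1, x0:x1] = np.maximum(region, scores)
    agg[agg == -np.inf] = 0.0          # pixels covered by no proposal
    return agg.argmax(axis=0)          # final class label per pixel

# Two-class toy example: one proposal votes class 1 in its box.
props = [((0, 0, 2, 2), np.stack([np.zeros((2, 2)), np.ones((2, 2))]))]
labels = aggregate((4, 4), props, num_classes=2)
```

Pixels inside the proposal box get class 1; uncovered pixels fall back to class 0 (background). Because each proposal is scored independently, a small and a large instance of the same class can both be recovered from differently sized proposals.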

Further Performance Enhancement Ensemble with an FCN-based model FCN-based models have characteristics complementary to ours Ours: captures fine detail, handles objects at various scales FCN: captures context within the image Fully connected CRF [4] based post-processing [4] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.

Quantitative Result Best among models trained with VOC2012 training data (VOC2012: 12,031 images; MSCOCO: 123,287 images)

Qualitative Result

Discussion Understanding the details of our work

Importance of Two-Stage Training Two-stage training is not that critical: DeconvNet can be trained well with single-stage training (directly applying the 2nd stage). But two-stage training is better: in our experiments, the validation accuracy of two-stage training is higher (by almost 1.00) than single-stage training. Based on previous experiments, this margin could make a big difference in mean IoU score. A detailed study has not yet been performed. [Figure: training curves for the 1st and 2nd stages]

Importance of Batch Normalization [3] Without batch normalization, DeconvNet gets stuck in local minima. Experimental results with an early DeconvNet model (binary segmentation, a single deconv layer between unpooling layers) — maximum segmentation accuracy: with batch normalization 92.61 (0.18 loss), without batch normalization 72.41 (0.59 loss). What is batch normalization? Brief introduction: normalize the output of every layer to a standard Gaussian distribution (within each mini-batch), and learn the “mean” and “variance” instead. Why does it work? My opinion: its interaction with ReLU. [3] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
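The normalization described above fits in a few lines of numpy: each feature is normalized to zero mean and unit variance over the mini-batch, and a learned scale and shift (the "learned mean and variance") are applied afterward. A minimal training-mode sketch, omitting the running statistics used at inference:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization (training mode): normalize each feature over the
    mini-batch axis, then apply learned scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# A mini-batch of 32 examples with 8 features, deliberately shifted/scaled:
batch = np.random.default_rng(0).standard_normal((32, 8)) * 5 + 3
y = batch_norm(batch)
```

After normalization every feature column has mean ≈ 0 and standard deviation ≈ 1, so the distribution each following layer sees stays stable regardless of how earlier layers' outputs drift during training.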

Training Deconvolution Network Training plots of the best-performing model, illustrating network learning capacity and generalization ability. [Figure: train+val/val accuracy 87.53; train/val accuracy 92.98]

Training Deconvolution Network Network learning capacity: fc7 is a 4096-dimensional vector. Is that enough to encode every possible segmentation?

Training Deconvolution Network Improving generalization: Dropout? Not effective. Baseline: min train loss 0.08, min test loss 0.24, max test accuracy 92.8; Dropout[fc6,fc7](0.2): min train loss 0.11, min test loss 0.23, max test accuracy 92.6; Dropout[fc6,fc7](0.5): min train loss 0.12, min test loss 0.23, max test accuracy 92.6; Dropout[final layer](0.2): min train loss 0.11, min test loss 0.24, max test accuracy 92.7

Instance-wise Prediction What is instance-wise prediction really doing? (short demo) An instance works like “attention” rather than an object proposal for detection: the network sees the image from various aspects, and the observations are aggregated into an image-level prediction. Disadvantages of instance-wise prediction: it is an obstacle to end-to-end training (When to stop training? How to aggregate predictions: max? average?); how to construct training data (object-centered bounding boxes? random cropping? hard negative mining?); predictions for each proposal are independent.

Results on COCO+VOC Why we didn’t use MSCOCO: DeconvNet: 63.683; DeconvNet+CRF: 64.843; EDeconvNet+CRF: 69.799

Possible Future Directions Data augmentation with MSCOCO Enhance performance on the training set Make the network more flexible (apply the hole algorithm to encode more features) Study why dropout doesn’t work in our setting Tackle the disadvantages of instance-wise prediction Attention models [4][5][6] [4] Mnih, Volodymyr, Nicolas Heess, and Alex Graves. "Recurrent models of visual attention." Advances in Neural Information Processing Systems. 2014. [5] Tang, Yichuan, Nitish Srivastava, and Ruslan R. Salakhutdinov. "Learning generative models with visual attention." Advances in Neural Information Processing Systems. 2014. [6] Xu, Kelvin, et al. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." arXiv preprint arXiv:1502.03044 (2015).