Learning Deconvolution Network for Semantic Segmentation


Learning Deconvolution Network for Semantic Segmentation 2015.05.14 Computer Vision Lab. Hyeonwoo Noh, Seunghoon Hong, Bohyung Han

Contents Semantic Segmentation Previous CNN based Semantic Segmentation Our Approach Discussion Future Direction

Semantic Segmentation Objective: recognition of objects in an image with pixel-level detail. [Figure: example images comparing Image Classification, Object Detection, and Semantic Segmentation, with labels such as person, dog, bicycle, lion, giraffe, ball]

Previous CNN based Semantic Segmentation Fully Convolutional Networks for Semantic Segmentation [FCN] [1] Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs [Deeplab-CRF] [2] [1] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015 [2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015

Fully Convolutional Networks for Semantic Segmentation [FCN] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015

Approach Apply a classification CNN to a large image with large padding (FCN)

How to Apply CNN to Large Input A fully connected layer is also a convolution layer. [Figure: VGG's fc6 and fc7 layers (4096-d) reinterpreted as 7×7 and 1×1 convolutions over the 7×7×512 pool5 feature map; applied to a larger 22×22 input, they produce a 16×16 spatial output instead of a single vector]
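The equivalence above can be checked numerically: reshaping a fully connected layer's weight matrix into convolution filters gives identical outputs on the original input size, and sliding those same filters over a larger input yields a spatial score map. This is a toy numpy sketch (the small sizes stand in for pool5/fc6; all names are illustrative, not from the paper's code):

```python
import numpy as np

# Toy sizes standing in for VGG's pool5 (7x7x512) and fc6 (4096):
H = W = 2; C = 3; D = 4            # feature map 2x2x3, fc output dim 4

rng = np.random.default_rng(0)
fc_weight = rng.standard_normal((D, H * W * C))  # ordinary fc layer weights
x = rng.standard_normal((H, W, C))               # pooled feature map

# fc layer applied to the flattened feature map
fc_out = fc_weight @ x.reshape(-1)

# the same weights viewed as D convolution filters of size HxWxC,
# evaluated at the single valid position: identical result
conv_filters = fc_weight.reshape(D, H, W, C)
conv_out = np.array([(f * x).sum() for f in conv_filters])
assert np.allclose(fc_out, conv_out)

# On a larger input, the "fc-as-conv" layer slides over all valid
# positions and produces a spatial score map instead of one vector.
Hb = Wb = 5
big = rng.standard_normal((Hb, Wb, C))
score_map = np.zeros((Hb - H + 1, Wb - W + 1, D))
for i in range(Hb - H + 1):
    for j in range(Wb - W + 1):
        patch = big[i:i + H, j:j + W]
        score_map[i, j] = conv_filters.reshape(D, -1) @ patch.reshape(-1)
print(score_map.shape)
```

With the toy sizes this prints `(4, 4, 4)`: a 4×4 grid of 4-class scores from one forward pass.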

How to Obtain Higher Resolution Output - 1 A single deconvolution layer with a large stride. The deconvolution filter is initialized with bilinear weights, so it is equivalent to bilinear interpolation. [Figure: a 16×16 score map upsampled to 544×544]
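The bilinear initialization mentioned above can be sketched in numpy: build the standard bilinear kernel and apply it as a naive strided transposed convolution. This is an illustrative sketch, not the paper's implementation; the cropping convention is one common choice:

```python
import numpy as np

def bilinear_kernel(size):
    """2D bilinear interpolation kernel used to initialize a deconv filter."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

def upsample(x, factor):
    """Naive stride-`factor` transposed convolution with a bilinear kernel."""
    size = 2 * factor - factor % 2
    k = bilinear_kernel(size)
    H, W = x.shape
    out = np.zeros((H * factor + size - factor, W * factor + size - factor))
    for i in range(H):
        for j in range(W):
            out[i*factor:i*factor+size, j*factor:j*factor+size] += x[i, j] * k
    c = (size - factor) // 2           # crop to H*factor x W*factor
    return out[c:c + H * factor, c:c + W * factor]

y = upsample(np.ones((3, 3)), 2)      # 3x3 -> 6x6
```

On a constant input the interior of the upsampled map stays constant (the shifted bilinear kernels form a partition of unity there), which is exactly the bilinear-interpolation behavior the slide claims.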

How to Obtain Higher Resolution Output - 2 Skip Architecture Pros: low-level features are spatially larger than high-level features. Cons: low-level features are less discriminative.

Limitations Coarse output score maps The skip architecture generates noisy predictions Fixed receptive field size (causes label confusion)

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs [Deeplab-CRF] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015

Approach Better than FCN in two aspects: produces denser output predictions using the “Hole Algorithm”, and applies CRF-based post-processing. Note: this approach doesn’t use a skip architecture; it simply upscales the output score map without a deconvolution layer.

Hole Algorithm Why is the output score map smaller than the input image? Because of pooling with stride. Remove pooling, or pool with no stride? That makes it hard to utilize a pre-trained CNN. The hole algorithm solves this problem. [Figure: FCN with conventional pooling vs. FCN with the hole algorithm]
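The hole (à trous) trick can be sketched in 1D: the same filter weights are applied with taps spaced `rate` samples apart, so the receptive field grows without subsampling the input and without adding parameters. A minimal numpy sketch (illustrative, not the Caffe implementation):

```python
import numpy as np

def dilated_conv1d(x, w, rate):
    """'Hole algorithm' convolution: filter taps are `rate` samples apart,
    enlarging the receptive field without striding or extra weights."""
    k = len(w)
    span = (k - 1) * rate + 1          # effective receptive field
    return np.array([sum(w[t] * x[i + t * rate] for t in range(k))
                     for i in range(len(x) - span + 1)])

x = np.arange(10.0)
w = np.array([1.0, 1.0, 1.0])
plain = dilated_conv1d(x, w, 1)        # ordinary conv, receptive field 3
holed = dilated_conv1d(x, w, 2)        # same 3 weights, receptive field 5
```

Here `plain` is the ordinary 3-tap sum, while `holed` sums `x[i] + x[i+2] + x[i+4]`: identical weights, twice the spatial extent, no loss of resolution.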

Limitations Still coarse output score maps Produces a 39×39 output map from a 306×306 input image Fixed receptive field size (causes label confusion)

Learning Deconvolution Network for Semantic Segmentation Hyeonwoo Noh, Seunghoon Hong, Bohyung Han

Our Approach Deconvolution Network CNN architecture designed to generate large outputs Enables dense output score prediction Instance-wise Prediction Inference on object proposals, then aggregation Enables recognition of objects at multiple scales

Deconvolution Network Generates dense segmentation from the fc7 feature representation Multiple deconvolution layers with ReLU non-linearity Unpooling-layer-based upscaling

Deconvolution Network Unpooling: places activations back at their pooled locations, preserving the structure of activations Deconvolution: densifies sparse activations; filter bases reconstruct shape [Figure: forward propagation of an input through the network]
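The unpooling step above can be sketched concretely: max pooling records the argmax locations ("switches"), and unpooling places each pooled activation back at its recorded location, leaving the rest of the map zero (hence sparse, to be densified by the following deconvolution). A toy numpy sketch, with names of my own choosing:

```python
import numpy as np

def max_pool_with_switches(x, s=2):
    """2x2 max pooling that also records argmax ('switch') locations."""
    H, W = x.shape
    out = np.zeros((H // s, W // s))
    switches = np.zeros((H // s, W // s, 2), dtype=int)
    for i in range(H // s):
        for j in range(W // s):
            patch = x[i*s:(i+1)*s, j*s:(j+1)*s]
            di, dj = np.unravel_index(patch.argmax(), patch.shape)
            out[i, j] = patch[di, dj]
            switches[i, j] = (i*s + di, j*s + dj)
    return out, switches

def unpool(pooled, switches, shape):
    """Place each activation back at its recorded location; rest stays 0."""
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            out[tuple(switches[i, j])] = pooled[i, j]
    return out

x = np.array([[1., 2., 0., 4.],
              [3., 0., 1., 0.],
              [0., 5., 2., 0.],
              [6., 0., 0., 3.]])
p, sw = max_pool_with_switches(x)
u = unpool(p, sw, x.shape)
```

After the round trip, `u` contains each 2×2-block maximum at its original position and zeros elsewhere, so the spatial structure of the strongest activations survives the pooling/unpooling pair.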

Deconvolution Network Training a deconvolution network is difficult: it is a very deep network with a large output space. Batch Normalization [3] Normalizes the input of every layer to a standard Gaussian distribution Prevents drastic changes of the input distribution in upper layers Two-stage training First stage: training with object-centered examples Second stage: training with real object proposals This approach makes the network generalize better [3] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Instance-wise Prediction Inference on object proposals, followed by pixel-wise aggregation Objects at multiple scales are detected in different proposals [Figure: DeconvNet pipeline — 1. Input image 2. Object proposals 3. Prediction and aggregation 4. Results]
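The aggregation step can be sketched as a pixel-wise maximum over per-proposal score maps pasted back into image coordinates. This is an illustrative numpy sketch of the idea (function and variable names are mine, and max is one of the aggregation choices the slides later discuss):

```python
import numpy as np

def aggregate(image_shape, proposals, num_classes):
    """Pixel-wise max aggregation of per-proposal class score maps.
    proposals: list of ((y0, x0, y1, x1), scores) where scores has shape
    (num_classes, y1-y0, x1-x0) for that proposal's region."""
    H, W = image_shape
    agg = np.full((num_classes, H, W), -np.inf)
    for (y0, x0, y1, x1), scores in proposals:
        region = agg[:, y0:y1, x0:x1]
        agg[:, y0:y1, x0:x1] = np.maximum(region, scores)
    agg[agg == -np.inf] = 0.0          # pixels covered by no proposal
    return agg.argmax(axis=0)          # final class label per pixel

# Two-class toy example: one proposal votes class 1 in its box.
props = [((0, 0, 2, 2), np.stack([np.zeros((2, 2)), np.ones((2, 2))]))]
labels = aggregate((4, 4), props, num_classes=2)
```

Pixels inside the proposal box get class 1; uncovered pixels fall back to class 0 (background). Because each proposal is scored independently, a small and a large instance of the same class can both be recovered from differently sized proposals.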

Further Performance Enhancement Ensemble with an FCN-based model FCN-based models have characteristics complementary to ours Ours: captures fine detail, handles objects at various scales FCN: captures context within the image Fully connected CRF [4] based post-processing [4] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.

Quantitative Result Best among models trained with VOC2012 training data (VOC2012: 12,031 images; MSCOCO: 123,287 images)

Qualitative Result

Discussion Understanding the details of our work

Importance of Two-Stage Training Two-stage training is not that critical: DeconvNet can be trained well with single-stage training (directly applying the 2nd stage). But two-stage training is better: in our experiments, the validation accuracy of two-stage training is higher (by almost 1.00) than single-stage training. Based on previous experiments, this margin could make a big difference in mean IoU score. A detailed study has not yet been performed. [Figure: training curves for the 1st and 2nd stages]

Importance of Batch Normalization [3] Without batch normalization, DeconvNet gets stuck in local minima. Experimental results with an early DeconvNet model (binary segmentation, a single deconv layer between unpooling layers) — maximum segmentation accuracy: with batch normalization 92.61 (0.18 loss), without batch normalization 72.41 (0.59 loss). What is batch normalization? Brief introduction: normalize the output of every layer to a standard Gaussian distribution (within each mini-batch), and learn the “mean” and “variance” instead. Why does it work? My opinion: its interaction with ReLU. [3] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
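The normalization described above fits in a few lines of numpy: each feature is normalized to zero mean and unit variance over the mini-batch, and a learned scale and shift (the "learned mean and variance") are applied afterward. A minimal training-mode sketch, omitting the running statistics used at inference:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization (training mode): normalize each feature over the
    mini-batch axis, then apply learned scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# A mini-batch of 32 examples with 8 features, deliberately shifted/scaled:
batch = np.random.default_rng(0).standard_normal((32, 8)) * 5 + 3
y = batch_norm(batch)
```

After normalization every feature column has mean ≈ 0 and standard deviation ≈ 1, so the distribution each following layer sees stays stable regardless of how earlier layers' outputs drift during training.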

Training Deconvolution Network Training plots of the best-performing model, illustrating network learning capacity and generalization ability. [Figure: train+val/val accuracy 87.53; train/val accuracy 92.98]

Training Deconvolution Network Network learning capacity: fc7 is a 4096-dimensional vector. Is that enough to encode every possible segmentation?

Training Deconvolution Network Improving generalization: Dropout? Not effective. Baseline: min train loss 0.08, min test loss 0.24, max test accuracy 92.8; Dropout[fc6,fc7](0.2): min train loss 0.11, min test loss 0.23, max test accuracy 92.6; Dropout[fc6,fc7](0.5): min train loss 0.12, min test loss 0.23, max test accuracy 92.6; Dropout[final layer](0.2): min train loss 0.11, min test loss 0.24, max test accuracy 92.7

Instance-wise Prediction What is instance-wise prediction really doing? (short demo) An instance works like “attention” rather than an object proposal for detection: the network sees the image from various aspects, and the observations are aggregated into an image-level prediction. Disadvantages of instance-wise prediction: it is an obstacle to end-to-end training (When to stop training? How to aggregate predictions: max? average?); how to construct training data (object-centered bounding boxes? random cropping? hard negative mining?); predictions for each proposal are independent.

Results on COCO+VOC Why we didn’t use MSCOCO: DeconvNet: 63.683; DeconvNet+CRF: 64.843; EDeconvNet+CRF: 69.799

Possible Future Directions Data augmentation with MSCOCO Enhance performance on the training set Make the network more flexible (apply the hole algorithm to encode more features) Study why dropout doesn’t work in our setting Tackle the disadvantages of instance-wise prediction Attention models [4][5][6] [4] Mnih, Volodymyr, Nicolas Heess, and Alex Graves. "Recurrent models of visual attention." Advances in Neural Information Processing Systems. 2014. [5] Tang, Yichuan, Nitish Srivastava, and Ruslan R. Salakhutdinov. "Learning generative models with visual attention." Advances in Neural Information Processing Systems. 2014. [6] Xu, Kelvin, et al. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." arXiv preprint arXiv:1502.03044 (2015).