Learning Deconvolution Network for Semantic Segmentation 2015.05.14 Computer Vision Lab. Hyeonwoo Noh, Seunghoon Hong, Bohyung Han
Contents
  Semantic Segmentation
  Previous CNN-based Semantic Segmentation
  Our Approach
  Discussion
  Future Direction
Semantic Segmentation
  Objective: recognition of objects in the image with pixel-level detail.
  [Figure: example images for Image Classification, Object Detection, and Semantic Segmentation, labeled with classes such as lion, person, dog, bicycle, giraffe, and ball.]
Previous CNN-based Semantic Segmentation
  Fully Convolutional Networks for Semantic Segmentation [FCN] [1]
  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs [Deeplab-CRF] [2]
  [1] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  [2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
Fully Convolutional Networks for Semantic Segmentation [FCN] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015
Approach
  Apply a classification CNN to a large image, with large padding.
  [Figure: FCN]
How to Apply CNN to Large Input
  A fully connected layer is also a convolution layer.
  [Figure: pool5 (512 channels), fc6 (4096), and fc7 (4096) drawn three ways: as fully connected layers, as convolution layers, and applied to a large input. A 7x7 pool5 map gives a 1x1 fc7 output; a 22x22 pool5 map gives a 16x16 fc7 output.]
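The equivalence on this slide can be checked numerically. The following is a minimal NumPy sketch, not the FCN implementation: `fc_as_conv` is a hypothetical helper, and the channel counts are shrunk (8 channels and 16 outputs, an assumption made for speed) while the 7x7 and 22x22 spatial sizes follow the figure.

```python
import numpy as np

def fc_as_conv(feature_map, fc_weights, kernel_size):
    """Slide a fully connected layer over a feature map as a convolution.

    feature_map: array of shape (C, H, W)
    fc_weights:  array of shape (out_dim, C * k * k)
    Returns an array of shape (out_dim, H - k + 1, W - k + 1).
    """
    _, height, width = feature_map.shape
    k = kernel_size
    out_h, out_w = height - k + 1, width - k + 1
    output = np.empty((fc_weights.shape[0], out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            # Each k x k window is flattened exactly as the FC layer expects.
            window = feature_map[:, y:y + k, x:x + k].reshape(-1)
            output[:, y, x] = fc_weights @ window
    return output

rng = np.random.default_rng(0)
# Toy sizes (assumption): 8 channels and 16 outputs instead of 512 and 4096.
weights = rng.standard_normal((16, 8 * 7 * 7))
small = rng.standard_normal((8, 7, 7))     # classification-sized input
large = rng.standard_normal((8, 22, 22))   # larger input

small_out = fc_as_conv(small, weights, 7)  # one prediction
large_out = fc_as_conv(large, weights, 7)  # a grid of predictions
print(small_out.shape, large_out.shape)
```

On the 7x7 map the result is a single vector identical to the plain matrix product; on the 22x22 map the same weights yield a 16x16 grid of outputs, one per window position.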
How to Obtain Higher Resolution Output - 1
  Single deconvolution layer with large stride
  The deconvolution filter is initialized with bilinear weights
  Equivalent to bilinear interpolation
  [Figure: a 16x16 score map upsampled to 544x544.]
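A bilinear-initialized deconvolution can be sketched in NumPy as follows. The slide only gives the 16x16 and 544x544 sizes; the 64x64 kernel and stride 32 used here are assumptions chosen because they reproduce those sizes, and `bilinear_kernel` and `transposed_conv2d` are illustrative helper names.

```python
import numpy as np

def bilinear_kernel(size):
    """2-D bilinear interpolation weights for a square kernel."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return (1 - abs(rows - center) / factor) * (1 - abs(cols - center) / factor)

def transposed_conv2d(score_map, kernel, stride):
    """Single-channel deconvolution: each input value stamps a scaled kernel."""
    height, width = score_map.shape
    k = kernel.shape[0]
    output = np.zeros(((height - 1) * stride + k, (width - 1) * stride + k))
    for y in range(height):
        for x in range(width):
            top, left = y * stride, x * stride
            output[top:top + k, left:left + k] += score_map[y, x] * kernel
    return output

# Assumed setting: 64x64 kernel, stride 32, constant 16x16 score map.
upsampled = transposed_conv2d(np.ones((16, 16)), bilinear_kernel(64), 32)
print(upsampled.shape)
```

Away from the border the overlapping kernels sum to one, so a constant score map stays constant after upsampling, which is the behavior expected of bilinear interpolation.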
How to Obtain Higher Resolution Output - 2
  Skip Architecture
  Pros: low-level features are spatially larger than high-level features.
  Cons: low-level features are less discriminative.
Limitations
  Coarse output score maps
  Skip architecture generates noisy predictions
  Fixed receptive field size (causes label confusion)
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs [Deeplab-CRF] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015
Approach
  Better than FCN in two aspects:
    Produces denser output predictions using the “Hole Algorithm”
    CRF-based post-processing
  Note: this algorithm doesn't use a skip architecture; it simply upscales the output score map without a deconvolution layer.
Hole Algorithm
  Why is the output score map smaller than the input image? Because of pooling with stride.
  Remove pooling? Pool with no stride? Either makes it hard to use a pre-trained CNN.
  The Hole Algorithm solves this problem.
  [Figure: FCN with conventional pooling vs. FCN with the Hole Algorithm.]
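The idea can be illustrated in one dimension. This is a simplified sketch, not the Deeplab code: the pooling stride is dropped to 1 and the following filter is applied with holes (gaps) between its taps, so the pre-trained weights are reused unchanged. The helper names, the signal, and the filter are all made up for illustration.

```python
import numpy as np

def max_pool_1d(signal, window, stride):
    """Max pooling over a 1-D signal."""
    count = (len(signal) - window) // stride + 1
    return np.array([signal[i * stride:i * stride + window].max()
                     for i in range(count)])

def conv_1d(signal, kernel, rate=1):
    """Valid cross-correlation with `rate - 1` holes between filter taps."""
    span = (len(kernel) - 1) * rate + 1
    count = len(signal) - span + 1
    return np.array([
        sum(kernel[j] * signal[i + j * rate] for j in range(len(kernel)))
        for i in range(count)
    ])

signal = np.array([3., 1., 4., 1., 5., 9., 2., 6., 5., 3., 5., 8.])
kernel = np.array([1., -2., 1.])  # stands in for a pre-trained filter

# Conventional: stride-2 pooling, then the filter as trained.
coarse = conv_1d(max_pool_1d(signal, 2, 2), kernel)
# Hole algorithm: stride-1 pooling, then the same filter with rate 2.
dense = conv_1d(max_pool_1d(signal, 2, 1), kernel, rate=2)
print(len(coarse), len(dense))
```

Every second value of `dense` coincides with `coarse`, so the denser map contains the original responses plus the positions in between, without retraining the filter.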
Limitations
  Still coarse output score maps: a 306x306 input image produces a 39x39 output map
  Fixed receptive field size (causes label confusion)
Learning Deconvolution Network for Semantic Segmentation Hyeonwoo Noh, Seunghoon Hong, Bohyung Han
Our Approach
  Deconvolution Network
    CNN architecture designed to generate large output
    Enables dense output score prediction
  Instance-wise prediction
    Inference on object proposals, then aggregation
    Enables recognition of objects at multiple scales
Deconvolution Network
  Generates a dense segmentation from the fc7 feature representation
  Multiple deconvolution layers with ReLU non-linearity
  Unpooling-layer-based upscaling
Deconvolution Network
  Unpooling
    Places activations back at their pooled locations
    Preserves the structure of activations
  Deconvolution
    Densifies sparse activations
    Bases to reconstruct shape
  [Figure labels: input, forward propagation]
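Unpooling can be sketched by recording, during max pooling, where each maximum came from (these recorded positions are often called switches) and writing each pooled value back to that position. This is a single-channel NumPy sketch with hypothetical helper names, not the authors' layer.

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Non-overlapping max pooling that also records each argmax location."""
    rows, cols = x.shape[0] // size, x.shape[1] // size
    pooled = np.empty((rows, cols))
    switches = np.empty((rows, cols), dtype=int)
    for i in range(rows):
        for j in range(cols):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            switches[i, j] = window.argmax()  # flat index inside the window
            pooled[i, j] = window.max()
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Place each pooled activation back at its recorded location."""
    output = np.zeros((pooled.shape[0] * size, pooled.shape[1] * size))
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            dy, dx = divmod(switches[i, j], size)
            output[i * size + dy, j * size + dx] = pooled[i, j]
    return output

activations = np.array([[1., 2., 0., 0.],
                        [3., 4., 0., 5.],
                        [0., 0., 7., 0.],
                        [6., 0., 0., 0.]])
pooled, switches = max_pool_with_switches(activations)
unpooled = unpool(pooled, switches)
print(unpooled)
```

The unpooled map is sparse: only the maxima survive, each at its original position. The deconvolution layers that follow are what densify it.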
Deconvolution Network
  Training a deconvolution network is difficult
    Very deep network
    Large output space
  Batch Normalization [3]
    Normalizes the input of every layer to a standard Gaussian distribution
    Prevents drastic changes of the input distribution in upper layers
  Two-stage training
    First stage: training with object-centered examples
    Second stage: training with real object proposals
    This approach makes the network generalize better
  [3] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Instance-wise Prediction
  Inference on object proposals, followed by pixel-wise aggregation
  Objects at multiple scales are detected in different proposals
  [Figure: 1. Input image; 2. Object proposals; 3. Prediction (DeconvNet) and aggregation; 4. Results]
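The aggregation step might look like the sketch below. The slides leave the aggregation rule open, so both pixel-wise max and average are shown. `aggregate_proposals`, the box format, and the toy scores are assumptions, and scores are taken to be non-negative (for example class probabilities) so that pixels covered by no proposal stay at zero.

```python
import numpy as np

def aggregate_proposals(image_shape, num_classes, proposals, mode="max"):
    """Combine per-proposal score maps into one image-level score map.

    proposals: list of ((top, left, bottom, right), scores), where scores
    has shape (num_classes, bottom - top, right - left).
    """
    height, width = image_shape
    total = np.zeros((num_classes, height, width))
    count = np.zeros((height, width))
    for (top, left, bottom, right), scores in proposals:
        region = total[:, top:bottom, left:right]  # view into `total`
        if mode == "max":
            np.maximum(region, scores, out=region)
        else:
            region += scores
        count[top:bottom, left:right] += 1
    if mode == "average":
        total /= np.maximum(count, 1)  # avoid dividing by zero
    return total

def constant_scores(values, height, width):
    """Toy score map: one constant score per class."""
    return np.stack([np.full((height, width), v) for v in values])

proposals = [
    ((0, 0, 2, 2), constant_scores([0.2, 0.8], 2, 2)),
    ((1, 1, 4, 4), constant_scores([0.6, 0.4], 3, 3)),
]
by_max = aggregate_proposals((4, 4), 2, proposals, mode="max")
by_average = aggregate_proposals((4, 4), 2, proposals, mode="average")
labels = by_max.argmax(axis=0)  # per-pixel class decision
```

At the pixel where the two boxes overlap, max keeps the strongest response per class, whereas averaging blends the two proposals.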
Further performance enhancement
  Ensemble with an FCN-based model
    FCN-based models have characteristics complementary to ours
    Ours: captures fine detail, handles objects at various scales
    FCN: captures context within the image
  Fully connected CRF [4] based post-processing
  [4] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
Quantitative Result
  Best among the models trained with VOC2012 training data
  VOC2012: 12,031 images
  MSCOCO: 123,287 images
Qualitative Result
Discussion
  Understanding the details of our work
Importance of Two Stage Training
  Two-stage training is not that critical
    DeconvNet can be trained well with single-stage training (directly applying the 2nd stage)
  But two-stage training is better
    In our experiment, the validation accuracy of two-stage training is almost 1.00 higher than that of single-stage training
    Based on previous experiments, this margin could make a big difference in mean IoU score
    A detailed study has not yet been conducted
  [Figure panels: 1st stage, 2nd stage]
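One way to see why a small accuracy margin can matter for mean IoU is to compute both metrics from the same confusion matrix. This is a generic sketch of the two standard definitions with made-up toy data, not the evaluation code behind the numbers on the slides.

```python
import numpy as np

def confusion_matrix(truth, prediction, num_classes):
    """Rows index the true class, columns the predicted class."""
    index = truth.ravel() * num_classes + prediction.ravel()
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def pixel_accuracy(matrix):
    return np.trace(matrix) / matrix.sum()

def mean_iou(matrix):
    intersection = np.diag(matrix)
    union = matrix.sum(axis=0) + matrix.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both maps
    return (intersection[present] / union[present]).mean()

# Toy example: a 2x2 object on a 10x10 background, missed entirely.
truth = np.zeros((10, 10), dtype=int)
truth[4:6, 4:6] = 1
prediction = np.zeros((10, 10), dtype=int)

matrix = confusion_matrix(truth, prediction, num_classes=2)
accuracy = pixel_accuracy(matrix)
miou = mean_iou(matrix)
print(accuracy, miou)
```

Because pixel accuracy is dominated by the background while mean IoU weights every class equally, a few accuracy points concentrated on small objects can shift mean IoU considerably.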
Importance of Batch Normalization [3]
  Without Batch Normalization, DeconvNet gets stuck in a local minimum
    Experimental result with an early DeconvNet model (binary segmentation, a single deconv layer between unpooling layers)
    Maximum segmentation accuracy:
      With Batch Normalization: 92.61 (0.18 loss)
      Without Batch Normalization: 72.41 (0.59 loss)
  What is Batch Normalization? A brief introduction
    Normalize the output of every layer to a standard Gaussian distribution (within each mini-batch)
    Learn “mean” and “variance” instead
  Why does it work? (My opinion)
    ReLU
  [3] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
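The training-time computation can be sketched in a few lines of NumPy. This covers only the per-mini-batch normalization for a fully connected layer and omits the running statistics used at test time. Reading the slide's learned “mean” and “variance” as the shift `beta` and scale `gamma` below is an interpretation, and the function name is illustrative.

```python
import numpy as np

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift.

    batch: array of shape (batch_size, num_features)
    gamma, beta: learned per-feature scale and shift
    """
    mean = batch.mean(axis=0)
    variance = batch.var(axis=0)
    normalized = (batch - mean) / np.sqrt(variance + eps)
    return gamma * normalized + beta

rng = np.random.default_rng(0)
# Activations with a large offset and spread, as might reach an upper layer.
activations = rng.standard_normal((64, 3)) * 5.0 + 10.0

normalized = batch_norm(activations, gamma=np.ones(3), beta=np.zeros(3))
shifted = batch_norm(activations, gamma=np.full(3, 2.0), beta=np.full(3, 0.5))
print(normalized.mean(axis=0), normalized.std(axis=0))
```

With `gamma = 1` and `beta = 0` every feature has roughly zero mean and unit variance within the mini-batch; other values let the network restore whatever scale and offset it finds useful.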
Training Deconvolution Network
  Training plot of the best-performing model
  [Figure: two training plots; labels listed in slide order. Panel titles: Network Learning Capacity, Generalization Ability. Accuracy: 87.53, 92.98. Train / test data: Train + val / val, Train / val.]
Training Deconvolution Network
  Network Learning Capacity
    fc7 is a 4096-dimensional vector
    Is it enough to encode every possible segmentation?
  [Figure label: fc7]
Training Deconvolution Network
  Improving generalization: dropout? Not effective.
  Setting                      Min train loss  Min test loss  Max test accuracy
  Baseline                     0.08            0.24           92.8
  Dropout [fc6, fc7] (0.2)     0.11            0.23           92.6
  Dropout [fc6, fc7] (0.5)     0.12            0.23           92.6
  Dropout [final layer] (0.2)  0.11            0.24           92.7
Instance-wise Prediction
  What is instance-wise prediction really doing? (short demo)
    An instance works like “attention” rather than like an object proposal for detection
    The image is seen from various aspects
    Observations are aggregated into an image-level prediction
  Disadvantages of instance-wise prediction
    Obstacles for end-to-end training
      When to stop training?
      How to aggregate predictions (max? average? ...)
      How to construct training data: object-centered bounding boxes? Random cropping? Hard negative mining?
    Predictions for each proposal are independent
  Quiz: What is this? [Images: quiz and answer]
Results on COCOVOC
  Why didn't we use MSCOCO?
  DeconvNet: 63.683
  DeconvNet+CRF: 64.843
  EDeconvNet+CRF: 69.799
Possible Future Direction
  Data augmentation: MSCOCO
  Enhance performance on the training set
    Make the network more flexible (apply the hole algorithm to encode more features)
  Study why dropout doesn't work in our setting
  Tackle the disadvantages of instance-wise prediction
    Attention models [5] [6] [7]
  [5] V. Mnih, N. Heess, and A. Graves. Recurrent models of visual attention. In NIPS, 2014.
  [6] Y. Tang, N. Srivastava, and R. R. Salakhutdinov. Learning generative models with visual attention. In NIPS, 2014.
  [7] K. Xu, et al. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.