DeepID-Net: deformable deep convolutional neural network for generic object detection Wanli Ouyang, Ping Luo, Xingyu Zeng, Shi Qiu, Yonglong Tian, Hongsheng.

Slides:



Advertisements
Similar presentations
Rich feature Hierarchies for Accurate object detection and semantic segmentation Ross Girshick, Jeff Donahue, Trevor Darrell, Jitandra Malik (UC Berkeley)
Advertisements

Large Scale Visual Recognition Challenge (ILSVRC) 2013: Detection spotlights.
Many slides based on P. FelzenszwalbP. Felzenszwalb General object detection with deformable part-based models.
Object-centric spatial pooling for image classification Olga Russakovsky, Yuanqing Lin, Kai Yu, Li Fei-Fei ECCV 2012.
Large-Scale Object Recognition with Weak Supervision
More sliding window detection: Discriminative part-based models Many slides based on P. FelzenszwalbP. Felzenszwalb.
Good morning, everyone, thank you for coming to my presentation.
R-CNN By Zhang Liliang.
DeeperVision and DeepInsight Solutions
Spatial Pyramid Pooling in Deep Convolutional
From R-CNN to Fast R-CNN
Generic object detection with deformable part-based models
“Secret” of Object Detection Zheng Wu (Summer intern in MSRNE) Sep. 3, 2010 Joint work with Ce Liu (MSRNE) William T. Freeman (MIT) Adam Kalai (MSRNE)
Generic Object Detection
Detection, Segmentation and Fine-grained Localization
Marco Pedersoli, Jordi Gonzàlez, Xu Hu, and Xavier Roca
Object Detection with Discriminatively Trained Part Based Models
O BJECT D ETECTION WITH D ISCRIMINATIVELY T RAINED P ART B ASED M ODELS PRESENTED BY Xiaolong Wang.
Deformable Part Model Presenter : Liu Changyu Advisor : Prof. Alex Hauptmann Interest : Multimedia Analysis April 11 st, 2013.
Deformable Part Models (DPM) Felzenswalb, Girshick, McAllester & Ramanan (2010) Slides drawn from a tutorial By R. Girshick AP 12% 27% 36% 45% 49% 2005.
Training and Evaluating of Object Bank Models Presenter : Changyu Liu Advisor : Prof. Alex Interest : Multimedia Analysis May 16 th, 2013.
Object detection, deep learning, and R-CNNs
Fully Convolutional Networks for Semantic Segmentation
CS 1699: Intro to Computer Vision Detection II: Deformable Part Models Prof. Adriana Kovashka University of Pittsburgh November 12, 2015.
Object Detection Overview Viola-Jones Dalal-Triggs Deformable models Deep learning.
Learning Features and Parts for Fine-Grained Recognition Authors: Jonathan Krause, Timnit Gebru, Jia Deng, Li-Jia Li, Li Fei-Fei ICPR, 2014 Presented by:
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
Feedforward semantic segmentation with zoom-out features
Unsupervised Visual Representation Learning by Context Prediction
Cascade Region Regression for Robust Object Detection
Convolutional Neural Network
Lecture 4a: Imagenet: Classification with Localization
Fine-grained Fine-grained Recognition( 细粒度分类 ) 沈志强.
Rich feature hierarchies for accurate object detection and semantic segmentation 2014 IEEE Conference on Computer Vision and Pattern Recognition Ross Girshick,
Spatial Localization and Detection
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition arXiv: v4 [cs.CV(CVPR)] 23 Apr 2015 Kaiming He, Xiangyu Zhang, Shaoqing.
Scale Up Video Understanding with Deep Learning May 30, 2016 Chuang Gan Tsinghua University 1.
Learning Multi-Domain Convolutional Neural Networks for Visual Tracking arXiv : [cs.CV] v1 2015, v Hyeonseob Nam, Bohyung Han Dept.
Recent developments in object detection
Faster R-CNN – Concepts
Deeply learned face representations are sparse, selective, and robust
Object Detection based on Segment Masks
Object detection with deformable part-based models
Convolutional Neural Fabrics by Shreyas Saxena, Jakob Verbeek
Data Driven Attributes for Action Detection
Krishna Kumar Singh, Yong Jae Lee University of California, Davis
Regularizing Face Verification Nets To Discrete-Valued Pain Regression
Combining CNN with RNN for scene labeling (segmentation)
YOLO9000:Better, Faster, Stronger
Object detection as supervised classification
R-CNN region By Ilia Iofedov 11/11/2018 BGU, DNN course 2016.
Object detection.
A Convolutional Neural Network Cascade For Face Detection
Computer Vision James Hays
Introduction to Neural Networks
Image Classification.
NormFace:
Counting in Dense Crowds using Deep Learning
Object Detection + Deep Learning
Faster R-CNN By Anthony Martinez.
Outline Background Motivation Proposed Model Experimental Results
RCNN, Fast-RCNN, Faster-RCNN
边缘检测年度进展概述 Ming-Ming Cheng Media Computing Lab, Nankai University
Heterogeneous convolutional neural networks for visual recognition
Course Recap and What’s Next?
Abnormally Detection
Human-object interaction
Eliminating Background-Bias for Robust Person Re-identification
CVPR2019 Jiahe Li SiamRPN introduces the region proposal network after the Siamese network and performs joint classification and regression.
Presentation transcript:

DeepID-Net: deformable deep convolutional neural network for generic object detection Wanli Ouyang, Ping Luo, Xingyu Zeng, Shi Qiu, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Yuanjun Xiong, Chen Qian, Zhenyao Zhu, Ruohui Wang, Chen-Change Loy, Xiaogang Wang, Xiaoou Tang  Wanli Ouyang et al. DeepID-Net: multi-stage and deformable deep convolutional neural network for generic object detection, arXiv:1409.3505 [cs.CV]

DeepID-Net: deformable deep convolutional neural network for generic object detection Wanli Ouyang, Ping Luo, Xingyu Zeng, Shi Qiu, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Yuanjun Xiong, Chen Qian, Zhenyao Zhu, Ruohui Wang, Chen-Change Loy, Xiaogang Wang, Xiaoou Tang  Wanli Ouyang et al. DeepID-Net: multi-stage and deformable deep convolutional neural network for generic object detection, arXiv:1409.3505 [cs.CV]

http://mmlab.ie.cuhk.edu.hk/index.html Our most recent deep model on face verification reached 99.15%, the best performance on LFW.

RCNN Selective search AlexNet+SVM Bounding box regression person horse Selective search AlexNet+SVM Bounding box regression Proposed bounding boxes Detection results Refined bounding boxes Image Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR, 2014

RCNN Our approach mAP 31 to 40.9 (45) on val2 Selective search person horse Selective search AlexNet+SVM Bounding box regression Proposed bounding boxes Detection results Refined bounding boxes Image Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

RCNN Our approach Selective search AlexNet+SVM Bounding box regression person horse Selective search AlexNet+SVM Bounding box regression Proposed bounding boxes Detection results Refined bounding boxes Image Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

Bounding box rejection Motivation Speed up feature extraction by ~10 times Improve mean AP by 1% RCNN Selective search: ~ 2400 bounding boxes per image ILSVRC val: ~20,000 images, ~2.4 days ILSVRC test: ~40,000 images, ~4.7days Bounding box rejection by RCNN: For each box, RCNN has 200 scores S1…200 for 200 classes If max(S1…200) < -1.1, reject. 6% remaining bounding boxes Remaining window 100% 20% 6% Recall (val1) 92.2% 89.0% 84.4% Feature extraction time (seconds per image) 10.24 2.88 1.18 Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR, 2014

RCNN Our approach Selective search AlexNet+SVM Bounding box regression person horse Selective search AlexNet+SVM Bounding box regression Proposed bounding boxes Detection results Refined bounding boxes Image Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

DeepID-Net

RCNN Our approach Selective search AlexNet+SVM Bounding box regression person horse Selective search AlexNet+SVM Bounding box regression Proposed bounding boxes Detection results Refined bounding boxes Image Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

Pretraining the deep model RCNN (Cls+Det) AlexNet Pretrain on image-level annotation data with 1000 classes Finetune on object-level annotation data with 200+1 classes Our investigation Classification vs. detection (image vs. tight bounding box)? 1000 classes vs. 200 classes AlexNet or Clarifai or other choices, e.g. GoogleLenet? Complementary

Deep model training – pretrain RCNN (Image Cls+Det) Pretrain on image-level annotation with 1000 classes Finetune on object-level annotation with 200 classes Gap: classification vs. detection, 1000 vs. 200 Image classification Object detection

Deep model training – pretrain RCNN (ImageNet Cls+Det) Pretrain on image-level annotation with 1000 classes Finetune on object-level annotation with 200 classes Gap: classification vs. detection, 1000 vs. 200 Our approach (ImageNet Cls+Loc+Det) Pretrain on image-level annotation with 1000 classes Finetune on object-level annotation with 1000 classes Training scheme Cls+Det Cls+Loc+Det Net structure AlexNet Clarifai mAP (%) on val2 29.9 31.8 33.4

Deep model training – pretrain RCNN (Cls+Det) Pretrain on image-level annotation with 1000 classes Finetune on object-level annotation with 200 classes Gap: classification vs. detection, 1000 vs. 200 Our approach (Loc+Det) Pretrain on object-level annotation with 1000 classes Training scheme Cls+Det Cls+Loc+Det Loc+Det Net structure AlexNet Clarifai mAP (%) on val2 29.9 31.8 33.4 36.0

Deep model design AlexNet or Clarifai Net structure AlexNet Clarifai Annotation level Image Object Bbox rejection n mAP (%) 29.9 34.3 35.6

Result and discussion RCNN (Cls+Det), Our investigation Better pretraining on 1000 classes Image annotation 200 classes (Det) 20.7 1000 classes (Cls-Loc) 31.8

Result and discussion RCNN (Cls+Det), Our investigation Better pretraining on 1000 classes Object-level annotation is more suitable for pretraining Image annotation Object annotation 200 classes (Det) 20.7 32 1000 classes (Cls-Loc) 31.8 36 23% AP increase for rugby ball 17.4% AP increase for hammer

Result and discussion RCNN (ImageNet Cls+Det), Our investigation Better pretraining on 1000 classes Object-level annotation is more suitable for pretraining Clarifai is better. But Alex and Clarifai are complementary on different classes. scorpion AP diff Net structure AlexNet Clarifai Annotation level Image Object Bbox rejection n mAP (%) 29.9 34.3 35.6 class hamster

RCNN Our approach Selective search AlexNet+SVM Bounding box regression person horse Selective search AlexNet+SVM Bounding box regression Proposed bounding boxes Detection results Refined bounding boxes Image Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

Deep model training – def-pooling layer RCNN (ImageNet Cls+Det) Pretrain on image-level annotation with 1000 classes Finetune on object-level annotation with 200 classes Gap: classification vs. detection, 1000 vs. 200 Our approach (ImageNet Loc+Det) Pretrain on object-level annotation with 1000 classes Finetune on object-level annotation with 200 classes with def- pooling layers Net structure Without Def Layer With Def layer mAP (%) on val2 36.0 38.5

Deformation Learning deformation [a] is effective in computer vision society. Missing in deep model. We propose a new deformation constrained pooling layer. [a] P. Felzenszwalb, R. B. Grishick, D.McAllister, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Trans. PAMI, 32:1627–1645, 2010. 

Modeling Part Detectors Different parts have different sizes Design the filters with variable sizes Part models learned from HOG Part models Learned filtered at the second convolutional layer

Deformation Layer [b] [b] Wanli Ouyang, Xiaogang Wang, "Joint Deep Learning for Pedestrian Detection ", ICCV 2013. 

Deformation layer for repeated patterns Pedestrian detection General object detection Assume no repeated pattern Repeated patterns

Deformation layer for repeated patterns Pedestrian detection General object detection Assume no repeated pattern Repeated patterns Only consider one object class Patterns shared across different object classes

Deformation constrained pooling layer Can capture multiple patterns simultaneously

Our deep model with deformation layer Patterns shared across different classes Training scheme Cls+Det Loc+Det Net structure AlexNet Clarifai Clarifai+Def layer Mean AP on val2 0.299 0.360 0.385

RCNN Our approach Selective search AlexNet+SVM Bounding box regression person horse Selective search AlexNet+SVM Bounding box regression Proposed bounding boxes Detection results Refined bounding boxes Image Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

Sub-box features Take the per-channel max/average features of the last fully connected layer from 4 subboxes of the root window. Concatenate subbox features and the features in the root window. Learn an SVM for combining these features. Subboxes are proposed regions that has >0.5 overlap with the four quarter regions. Need not compute features. 0.5 mAP improvement. So far not combined with deformation layer. Used as one of the models in model averaging

RCNN Our approach Selective search AlexNet+SVM Bounding box regression person horse Selective search AlexNet+SVM Bounding box regression Proposed bounding boxes Detection results Refined bounding boxes Image Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

Deep model training – SVM-net RCNN Fine-tune using soft-max loss (Softmax-Net) Train SVM based on the fc7 features of the fine-tuned net.

Deep model training – SVM-net RCNN Fine-tune using soft-max loss (Softmax-Net) Train SVM based on the fc7 features of the fine-tuned net. Replace Soft-max loss by Hinge loss when fine-tuning (SVM-Net) Merge the two steps of RCNN into one Require no feature extraction from training data (~60 hours)

RCNN Our approach Selective search AlexNet+SVM Bounding box regression person horse Selective search AlexNet+SVM Bounding box regression Proposed bounding boxes Detection results Refined bounding boxes Image Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

Context modeling Use the 1000 class Image classification score. ~1% mAP improvement.

Context modeling Use the 1000-class Image classification score. ~1% mAP improvement. Volleyball: improve ap by 8.4% on val2. Volleyball Golf ball Bathing cap

RCNN Our approach Selective search AlexNet+SVM Bounding box regression person horse Selective search AlexNet+SVM Bounding box regression Proposed bounding boxes Detection results Refined bounding boxes Image Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

Model averaging Not only change parameters Net structure: AlexNet(A), Clarifai (C), Deep-ID Net (D), DeepID Net2 (D2) Pretrain: Classification (C), Localization (L) Region rejection or not Loss of net, softmax (S), Hinge loss (H) Choose different sets of models for different object class Model 1 2 3 4 5 6 7 8 9 10 Net structure A C D D2 Pretrain C+L L Reject region? Y N Loss of net S H Mean ap 0.31 0.312 0.321 0.336 0.353 0.36 0.37 0.371 0.374

RCNN Our approach Selective search AlexNet+SVM Bounding box regression person horse Selective search AlexNet+SVM Bounding box regression Proposed bounding boxes Detection results Refined bounding boxes Image Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

Component analysis Our approach Detection Pipeline RCNN Box rejection Clarifai Loc+Det +Def layer +context +bbox regr. Model avg. avg. cls mAP on val2 29.9 30.9 31.8 36.0 37.4 38.2 39.3 40.9 mAP on test 40.3 New result on val2 38.5 39.2 40.1 42.4 45.0 New result on test 38.0 38.6 39.4 41.7 Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

Component analysis Our approach Selective search Box rejection person horse DeepID-Net Pretrain, def-pooling layer, sub-box, hinge-loss Proposed bounding boxes Remaining bounding boxes Context modeling Image person horse Bounding box regression person horse Model averaging person horse

Component analysis New results (training time, time limit (context)) Detection Pipeline RCNN Box rejection Clarifai Loc+Det +Def layer +context +bbox regr. Model avg. avg. cls mAP on val2 29.9 30.9 31.8 36.0 37.4 38.2 39.3 40.9 mAP on test 40.3 New result on val2 38.5 39.2 40.1 42.4 45.0 New result on test 38.0 38.6 39.4 41.7 New results (training time, time limit (context))

Take home message 1. Bounding rejection. Save feature extraction by about 10 times, slightly improve mAP (~1%). 2. Pre-training with object-level annotation, more classes. 4.2% mAP 3. Def-pooling layer. 2.5% mAP 4. Hinge loss. Save feature computation time (~60 h). 5. Model averaging. Different model designs and training schemes lead to high diversity 6. Multi-stage Deep-id Net (poster)

Acknowledgement We gratefully acknowledge the support of NVIDIA for this research.

Code available soon on http://mmlab. ie. cuhk. edu