RETINA vs YOLO on Road Signs & Face Detection

CHALLENGE INTRODUCTION

OBJECT DETECTION: a computer vision technique that identifies objects in digital images or videos. It provides information about both "what" each object is and "where" it is.

This work compares the performance and complexity of our object detector, Retina, with the deep CNN architecture YOLO (You Only Look Once).

The systems are evaluated on public datasets available online for two localization tasks: road sign detection and face detection.

ROAD SIGNS DETECTION

- Reference Dataset: http://www.cvl.isy.liu.se/research/datasets/traffic-signs-dataset/download/ [1]
- Info: ≈ 20,000 partially labelled images (3000 used)
- Dataset organization: 2000 train, 200 val, 800 test
- N classes: 7 (Prohibitory, Speed Prohibitory, Priority Road, Mandatory, Warning, Give Way, Pedestrian Crossing)
- Min. Object Size: 24x24 pixels

[1] Published in conjunction with the paper by Fredrik Larsson and Michael Felsberg, Using Fourier Descriptors and Spatial Models for Traffic Sign Recognition, In Proceedings of the 17th Scandinavian Conference on Image Analysis, SCIA 2011.

FACE DETECTION

- Reference Dataset: http://vis-www.cs.umass.edu/fddb/ [2]
- Info: 2845 labelled images (all used)
- Dataset organization: 1800 train, 200 val, 845 test
- N classes: 1 (Face)
- Min. Object Size: 20x20 pixels

N.B. The dataset includes critical instances with extreme orientations, occlusions and blurring.

[2] Vidit Jain and Erik Learned-Miller, FDDB: A Benchmark for Face Detection in Unconstrained Settings, Technical Report UM-CS-2010-009, Dept. of Computer Science, University of Massachusetts, Amherst, 2010.
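The slides give only the split sizes, not how images were assigned to each subset. A minimal sketch, assuming a seeded random shuffle; the function name `split_dataset`, the seed and the integer stand-ins for file names are illustrative and do not come from the original work:

```python
import random


def split_dataset(items, n_train, n_val, n_test, seed=0):
    """Shuffle items reproducibly and cut them into train/val/test lists."""
    if n_train + n_val + n_test > len(items):
        raise ValueError("split sizes exceed the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# FDDB sizes from the slide: 2845 images -> 1800 train / 200 val / 845 test.
image_ids = list(range(2845))  # hypothetical stand-in for real file names
train, val, test = split_dataset(image_ids, 1800, 200, 845)
print(len(train), len(val), len(test))
```

The same call with sizes 2000, 200 and 800 would cover the 3000 road sign images.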

FACE DETECTION: CRITICAL INSTANCES

[Example images: strong occlusion; blurring + partial occlusions; partial occlusions]

Some partially occluded faces are not labelled!

YOLO: DETAILS & PARAMETERS SETTING

Architecture details:
- Version: YOLOv2 608x608
- Pre-training: COCO dataset (weights and configuration file downloaded)
- Useful guidelines, scripts and functions are linked from the original slide

Parameters setting (equal for both detection tasks):

TRAINING
- Pre-Processing: YES (random saturation, exposure and sharpness)
- N layers trained: entire model
- Optimizer: SGD (η = 0.01, decay = 1·10⁻⁵, γ = 0.9)*
- Batch Size: 20
- N epochs: chosen by the best results on validation (194 for road signs, 297 for faces)

TEST
- Pre-Processing: NO
- Conf. score threshold: 0.5
- IOU for NMS**: 0.3
- Ground truth/Prediction IOU: 0.5

* η = learning rate, γ = momentum
** IOU = Intersection Over Union, NMS = Non-Maximum Suppression
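To make the test-time thresholds concrete, here is a minimal sketch of IoU, confidence filtering and greedy non-maximum suppression using the values listed above (0.5, 0.3, 0.5). It assumes boxes in `(x1, y1, x2, y2)` format and class-agnostic suppression; the function names are illustrative and this is not the code used in the original work.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, conf_thr=0.5, nms_iou=0.3):
    """Drop low-confidence boxes, then apply greedy NMS.

    dets is a list of (box, score) pairs. A box is kept only if its IoU
    with every higher-scoring kept box does not exceed nms_iou.
    """
    candidates = sorted(
        (d for d in dets if d[1] >= conf_thr),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        if all(iou(box, k[0]) <= nms_iou for k in kept):
            kept.append((box, score))
    return kept


def is_true_positive(pred_box, gt_box, match_iou=0.5):
    """A prediction counts as correct if it overlaps the ground truth enough."""
    return iou(pred_box, gt_box) >= match_iou


dets = [
    ((0, 0, 10, 10), 0.9),    # kept: highest score
    ((1, 1, 11, 11), 0.8),    # suppressed: overlaps the first box too much
    ((20, 20, 30, 30), 0.6),  # kept: no overlap with kept boxes
    ((40, 40, 50, 50), 0.3),  # dropped: below the confidence threshold
]
kept = filter_detections(dets)
print([score for _, score in kept])
```

A low NMS threshold such as 0.3 removes near-duplicate boxes aggressively, which suits scenes where objects rarely overlap.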

RETINA: DETAILS & SETTING

GUI & Library details:
- Version: Retina v1.6.0 (demo version linked from the original slide)

Models setting:

| PROPERTY | ROAD SIGNS | FACES |
|---|---|---|
| Model Dimensions | 40 x 40 pixels | 48 x 64 pixels |
| Object Distance | x = 20, y = 20 | x = 22, y = 22 |
| Coarse & Fine Step | coarse = (4,4), fine = (4,4) | coarse = (8,8), fine = (4,4) |
| Perturbations | NO | NO |

Training Options:

| OPTION | ROAD SIGNS | FACES |
|---|---|---|
| Goodness Target | 0.5 | 0.5 |
| Optimization Mode | Selected | Slow |
| Features | 0A, 2B | 2B.R, 2A.R, 2B.G, 2B.B, 2A.B, 0B |

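PERFORMANCE

Road Signs Dataset

| Class | YOLOv2 (608x608) Precision | YOLOv2 Recall | YOLOv2 Accuracy | Retina v1.6.0 Precision | Retina Recall | Retina Accuracy |
|---|---|---|---|---|---|---|
| Prohibitory | 96.39% | 93.02% | 90.63% | 96.64% | 91.90% | 88.82% |
| Speed Proh. | 95.24% | 99.62% | 95.14% | 99.20% | 97.31% | 96.56% |
| Mandatory | 96.09% | 94.45% | 91.80% | 93.82% | 91.56% | 86.36% |
| Warning | 97.40% | 92.59% | 91.75% | 94.44% | 85.89% | 82.70% |
| Total | 96.41% | 96.19% | 92.94% | 97.78% | 91.12% | 89.39% |

Rows with fewer than six surviving values (listed in source order; column assignment is not recoverable):
- Priority Road: 95.56%, 97.33%, 93.73%, 100.0%, 90.14%
- Give-Way: 98.46%, 97.50%, 78.13%
- Pedestrian Cr.: 98.05%, 95.26%, 93.89%, 89.47%

Faces Dataset

| Class | YOLOv2 (608x608) Precision | YOLOv2 Recall | YOLOv2 Accuracy | Retina v1.6.0 Precision | Retina Recall | Retina Accuracy |
|---|---|---|---|---|---|---|
| Face | 94.43% | 84.7% | 80.66% | 93.04% | 73.57% | 69.74% |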

PERFORMANCE: TOP RECALL SCORES / TOP PRECISION SCORES

[Two further slides repeat the performance tables above, highlighting the top recall scores and the top precision scores respectively.]
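The slides do not define "Accuracy". One plausible reading, stated here as an assumption rather than as the authors' definition, is TP / (TP + FP + FN), which is common in detection because true negatives are not countable. Under that reading accuracy can be derived from precision and recall alone, and the sketch below compares the derived value with the reported single-class face figures. For the road sign totals the same formula lands within roughly 0.1 percentage points of the reported values.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and (assumed) accuracy from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = tp / (tp + fp + fn)  # assumed definition, not from the slides
    return precision, recall, accuracy


def accuracy_from_pr(precision, recall):
    """Same accuracy written in terms of P and R: 1 / (1/P + 1/R - 1)."""
    return 1.0 / (1.0 / precision + 1.0 / recall - 1.0)


# (precision, recall, reported accuracy) taken from the Faces Dataset table.
reported = {
    "YOLOv2 608x608, faces": (0.9443, 0.847, 0.8066),
    "Retina v1.6.0, faces": (0.9304, 0.7357, 0.6974),
}
for name, (p, r, acc) in reported.items():
    derived = accuracy_from_pr(p, r)
    print(f"{name}: derived {derived:.4f}, reported {acc:.4f}")
```

The identity follows from 1/P = (TP + FP)/TP and 1/R = (TP + FN)/TP, so 1/P + 1/R - 1 = (TP + FP + FN)/TP.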

TESTING RESULTS

On the Road Signs Dataset, Retina generally proves to have higher precision than YOLO.

EX: YOLO localizes a priority road sign where there is actually a satellite dish.

[Example images: Retina v1.6.0 (OK); YOLOv2 608x608 (OK, and KO: no road sign here!)]

TESTING RESULTS

On the Road Signs Dataset, Retina generally proves to have higher precision than YOLO.

EX: YOLO confuses an unclassified sign with a priority road sign.

[Example images: Retina v1.6.0 (OK); YOLOv2 608x608 (OK, and KO: this is not a priority road sign!)]

TESTING RESULTS

However, YOLO, thanks to its high generalization capability, is able to detect more critical instances.

EX: partial occlusions

[Example images: Retina v1.6.0 (KO: object is not detected); YOLOv2 608x608 (OK)]

TESTING RESULTS

However, YOLO, thanks to its high generalization capability, is able to detect more critical instances.

EX: unusual orientations

[Example images: Retina v1.6.0 (KO: object is not detected); YOLOv2 608x608 (OK)]

COMPLEXITY ANALYSIS

What kind of computational resource do you need?

(The following results were obtained using: *CPU: Intel Core i7-8700K, 3.70 GHz; **GPU: Nvidia Quadro P5000, 16.0 GB)

Road Signs Dataset

| Detector | Comp. Resource | % Train Images used | Train Time | Train. Parameters ('float32') | Test Time | Testing Computations | Project Size |
|---|---|---|---|---|---|---|---|
| Retina v1.6 | CPU* | 4.9% (100) | 9h 7m | 14,364 | 1003 ms | ≈ 712 million | 822 kB |
| YOLOv2 608x608 | GPU** | 100% (2000) | 6h 56m | 50,552,889 | 30 ms | ≈ 2 billion (only on the first 2 Conv. Layers) | ≈ 200 MB |

Faces Dataset

| Detector | Comp. Resource | % Train Images used | Train Time | Train. Parameters ('float32') | Test Time | Project Size |
|---|---|---|---|---|---|---|
| Retina v1.6 | CPU* | 4% (73) | 59m | 20,800 | 240 ms | 960 kB |
| YOLOv2 608x608 | GPU** | 100% (1800) | 8h 10m | 50,552,889 | 30 ms | ≈ 200 MB |

COMPLEXITY ANALYSIS (continued)

The following slides repeat the two complexity tables above, each highlighting the column that answers one question:
- How many training images? (% Train Images used)
- How long does the training procedure take? (Train Time)
- How many parameters have to be trained? (Train. Parameters)
- How much time does a detection take? (Test Time)
- How many inner computations does a detection require? Only multiplications are taken into account. (Testing Computations)
- Is the project easily portable? (Project Size)
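A short worked calculation relates the parameter counts to storage and the test times to throughput. It is only a sketch: it assumes 4 bytes per float32 parameter and decimal megabytes, and it counts raw weights only, so it is not expected to equal the project sizes on the slide, which presumably include other files. The dictionary layout and function name are illustrative.

```python
BYTES_PER_FLOAT32 = 4

# Parameter counts and test times copied from the complexity tables.
detectors = {
    "Retina v1.6 (road signs)": {"params": 14_364, "test_ms": 1003},
    "Retina v1.6 (faces)": {"params": 20_800, "test_ms": 240},
    "YOLOv2 608x608": {"params": 50_552_889, "test_ms": 30},
}


def weights_megabytes(n_params):
    """Storage needed for n_params float32 values, in decimal megabytes."""
    return n_params * BYTES_PER_FLOAT32 / 1e6


for name, d in detectors.items():
    size_mb = weights_megabytes(d["params"])
    images_per_second = 1000 / d["test_ms"]
    print(f"{name}: {size_mb:.3f} MB of weights, {images_per_second:.1f} images/s")
```

For YOLOv2 this gives about 202 MB of weights, in line with the ≈ 200 MB project size, while the Retina road sign model needs about 57 kB for its parameters. The throughput figures are not directly comparable, since Retina ran on the CPU and YOLO on the GPU.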

CONCLUSIONS (1)

| Property | Retina v1.6, Road Signs | Retina v1.6, Faces | YOLO 608x608, Road Signs | YOLO 608x608, Faces |
|---|---|---|---|---|
| Training images | 100 | 73 | 2000 | 1800 |
| Accuracy | 89.39% | 69.74% | 92.94% | 80.66% |
| Hardware | CPU | CPU | GPU | GPU |

This work is the result of a master's thesis at the Information Engineering Department of the University of Brescia. Its objective is to compare the approach used by the Retina library with the more complex one based on a CNN architecture such as YOLO.

CONCLUSIONS (2)

Retina was developed for industrial applications, to be easy to use and portable on traditional hardware without the complexity of GPUs.

As the previous slides show, Retina requires far fewer hardware resources and can be trained on datasets two orders of magnitude smaller than those used by YOLO (which is pre-trained on 300K images).

The detection systems were evaluated on public datasets unrelated to the industrial world. Despite this, the results achieved by Retina were comparable with those of YOLO, confirming its potential.
