RETINA vs YOLO on Road Signs & Face Detection

Presentation transcript:

RETINA vs YOLO on Road Signs & Face Detection

CHALLENGE INTRODUCTION
OBJECT DETECTION: a computer vision technique for identifying objects in digital images or videos. It provides information about both “what” the object is and “where” it is.
This work compares the performance and complexity of our object detector, Retina, with the deep CNN architecture YOLO (You Only Look Once).
The systems are evaluated on publicly available datasets for two localization tasks: road sign detection and face detection.

ROAD SIGNS DETECTION
Reference dataset: http://www.cvl.isy.liu.se/research/datasets/traffic-signs-dataset/download/ [1]
Info: ≈ 20,000 partially labelled images (3000 used)
Dataset organization: 2000 train, 200 val, 800 test
N classes: 7 (Prohibitory, Speed Prohibitory, Priority Road, Mandatory, Warning, Give Way, Pedestrian Crossing)
Min. object size: 24x24 pixels
[1] Published in conjunction with the paper: Fredrik Larsson and Michael Felsberg, "Using Fourier Descriptors and Spatial Models for Traffic Sign Recognition", in Proceedings of the 17th Scandinavian Conference on Image Analysis (SCIA), 2011.

FACE DETECTION
Reference dataset: http://vis-www.cs.umass.edu/fddb/ [2]
Info: 2845 labelled images (all used)
Dataset organization: 1800 train, 200 val, 845 test
N classes: 1 (Face)
Min. object size: 20x20 pixels
N.B. The dataset includes critical instances with extreme orientations, occlusions and blurring.
[2] Vidit Jain and Erik Learned-Miller, "FDDB: A Benchmark for Face Detection in Unconstrained Settings", Technical Report UM-CS-2010-009, Dept. of Computer Science, University of Massachusetts, Amherst, 2010.

FACE DETECTION: CRITICAL INSTANCES
Example images: strong occlusion; blurring + partial occlusions; partial occlusions.
Some partially occluded faces are not labelled!

YOLO: DETAILS & PARAMETER SETTINGS
Architecture details:
Version: YOLOv2 608x608
Pre-training: COCO dataset (downloadable weights and configuration file)
Useful guidelines, scripts and functions: linked from the original slide
Parameter settings (identical for both detection tasks):
TRAINING
Pre-processing: YES (random saturation, exposure and sharpness)
N layers trained: entire model
Optimizer: SGD (η = 0.01, decay = 1·10^-5, γ = 0.9), where η = learning rate and γ = momentum
Batch size: 20
N epochs: chosen by the best results on the validation set (194 for road signs, 297 for faces)
TEST
Pre-processing: NO
Confidence score threshold: 0.5
IOU for NMS: 0.3
Ground truth/prediction IOU: 0.5
IOU = Intersection over Union; NMS = Non-Maximum Suppression
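The test-time settings above (confidence threshold 0.5, IOU threshold 0.3 for NMS) can be illustrated with a minimal Python sketch. This is a generic greedy NMS, not the code used in the thesis or in Darknet; the box format (x_min, y_min, x_max, y_max) and the function names `iou` and `nms` are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        conf_threshold: float = 0.5, iou_threshold: float = 0.3) -> List[int]:
    """Greedy NMS: drop low-confidence boxes, then keep the best-scoring box
    and discard any remaining box that overlaps a kept one too much."""
    candidates = sorted(
        (i for i, s in enumerate(scores) if s >= conf_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept: List[int] = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical detections, for illustration only
example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
example_scores = [0.9, 0.8, 0.7, 0.4]
print(nms(example_boxes, example_scores))
```

With these hypothetical inputs, the second box overlaps the first with an IOU above 0.3 and is suppressed, and the fourth falls below the confidence threshold.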

RETINA: DETAILS & SETTING GUI & Library details: Version: Retina v1.6.0 (demo version available here) Models setting: PROPERTY ROAD SIGNS FACES Model Dimensions Object Distance Coarse & Fine Step Perturbations 40 x 40 pixels x = 20, y = 20 coarse = (4,4), fine = (4,4) NO 48 x 64 pixels x = 22, y = 22 coarse = (8,8), fine = (4,4) NO Training Options: OPTION ROAD SIGNS FACES Goodness Target Optimization Mode Features 0.5 Selected 0A, 2B 0.5 Slow 2B.R, 2A.R, 2B.G, 2B.B, 2A.B, 0B

PERFORMANCE

Road Signs Dataset
                 YOLOv2 (608x608)                Retina v1.6.0
Class            Precision  Recall   Accuracy    Precision  Recall   Accuracy
Prohibitory      96.39%     93.02%   90.63%      96.64%     91.90%   88.82%
Speed Proh.      95.24%     99.62%   95.14%      99.20%     97.31%   96.56%
Mandatory        96.09%     94.45%   91.80%      93.82%     91.56%   86.36%
Warning          97.40%     92.59%   91.75%      94.44%     85.89%   82.70%
Total            96.41%     96.19%   92.94%      97.78%     91.12%   89.39%

Rows with cells lost in the transcript (surviving values, in their original order):
Priority Road: 95.56% 97.33% 93.73% 100.0% 90.14%
Give-Way: 98.46% 97.50% 78.13%
Pedestrian Cr.: 98.05% 95.26% 93.89% 89.47%

Faces Dataset
                 YOLOv2 (608x608)                Retina v1.6.0
Class            Precision  Recall   Accuracy    Precision  Recall   Accuracy
Face             94.43%     84.7%    80.66%      93.04%     73.57%   69.74%

PERFORMANCE: TOP RECALL SCORES (same table as above)

PERFORMANCE: TOP PRECISION SCORES (same table as above)
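The slides do not say how the "Accuracy" column is computed. The sketch below assumes a common choice for detection, accuracy = TP / (TP + FP + FN), where a prediction counts as a true positive when its IOU with a ground-truth box reaches 0.5. Under that assumption the Faces rows are reproduced to within rounding, while the road sign rows are close but not exact, so the actual definition may differ. The function names and the counts are hypothetical.

```python
from typing import Tuple


def detection_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and (assumed) accuracy from detection counts.

    tp: predictions matched to a ground-truth box
    fp: predictions with no matching ground-truth box
    fn: ground-truth boxes that were not detected
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return precision, recall, accuracy


def accuracy_from_pr(precision: float, recall: float) -> float:
    """Same assumed accuracy, expressed through precision and recall:
    TP / (TP + FP + FN) = 1 / (1/P + 1/R - 1)."""
    return 1.0 / (1.0 / precision + 1.0 / recall - 1.0)


# Hypothetical counts, for illustration only
print(detection_metrics(tp=90, fp=10, fn=30))

# Faces rows from the table: precision and recall as reported on the slide
print(round(accuracy_from_pr(0.9443, 0.8470), 4))  # YOLOv2, reported accuracy 80.66%
print(round(accuracy_from_pr(0.9304, 0.7357), 4))  # Retina, reported accuracy 69.74%
```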

TESTING RESULTS (Road Signs Dataset)
Retina generally achieves higher precision than YOLO.
Example: YOLO localizes a priority road sign where there is only a satellite dish.
Panel labels: Retina v1.6.0 (OK); YOLOv2 608x608 (OK, plus KO: no road sign here!)

TESTING RESULTS (Road Signs Dataset)
Retina generally achieves higher precision than YOLO.
Example: YOLO confuses an unclassified sign with a priority road sign.
Panel labels: Retina v1.6.0 (OK); YOLOv2 608x608 (OK, plus KO: this is not a priority road sign!)

TESTING RESULTS
However, YOLO, thanks to its high generalization capability, is able to detect more critical instances.
Example: partial occlusions.
Panel labels: Retina v1.6.0 (KO: object is not detected); YOLOv2 608x608 (OK)

TESTING RESULTS
However, YOLO, thanks to its high generalization capability, is able to detect more critical instances.
Example: unusual orientations.
Panel labels: Retina v1.6.0 (KO: object is not detected); YOLOv2 608x608 (OK)

COMPLEXITY ANALYSIS
What kind of computational resource do you need?
(Results obtained using: *CPU: Intel Core i7-8700K, 3.70 GHz; **GPU: Nvidia Quadro P5000, 16 GB)

Road Signs Dataset
Property                         Retina v1.6      YOLOv2 608x608
Comp. resource                   CPU*             GPU**
Train images used                4.9% (100)       100% (2000)
Train time                       9h 7m            6h 56m
Trained parameters (float32)     14,364           50,552,889
Test time                        1003 ms          30 ms
Testing computations             ≈ 712 million    ≈ 2 billion (first 2 conv. layers only)
Project size                     822 kB           ≈ 200 MB

Faces Dataset
Property                         Retina v1.6      YOLOv2 608x608
Comp. resource                   CPU*             GPU**
Train images used                4% (73)          100% (1800)
Train time                       59m              8h 10m
Trained parameters (float32)     20,800           50,552,889
Test time                        240 ms           30 ms
Project size                     960 kB           ≈ 200 MB

COMPLEXITY ANALYSIS (continued)
The same tables answer the following questions:
How many training images are needed?
How long does the training procedure take?
How many parameters have to be trained?
How much time does a detection take?
How many inner computations does a detection require? (Only multiplications are taken into account.)
Is the project easily portable?
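Two of the YOLOv2 figures in the complexity tables can be sanity-checked with simple arithmetic. The sketch below assumes the first layers of the publicly available YOLOv2 (Darknet-19) configuration: a 3x3 convolution with 32 filters on the 608x608 RGB input, a 2x2 max pooling, then a 3x3 convolution with 64 filters at 304x304, all with stride 1 and "same" padding. These layer shapes are not given in the slides, and the helper names are hypothetical.

```python
def conv_multiplications(out_h: int, out_w: int, kernel: int,
                         in_channels: int, out_channels: int) -> int:
    """Multiplications of one convolutional layer (additions and biases ignored):
    one kernel*kernel*in_channels dot product per output value."""
    return out_h * out_w * kernel * kernel * in_channels * out_channels


def float32_megabytes(n_params: int) -> float:
    """Storage for n_params float32 values, in MB (4 bytes each, 1 MB = 10^6 bytes)."""
    return n_params * 4 / 1e6


# Assumed Darknet-19 front end at 608x608 input resolution
layer1 = conv_multiplications(608, 608, kernel=3, in_channels=3, out_channels=32)
layer2 = conv_multiplications(304, 304, kernel=3, in_channels=32, out_channels=64)
first_two_layers = layer1 + layer2

print(f"first two conv layers: {first_two_layers:,} multiplications")
print(f"YOLOv2 weights: {float32_megabytes(50_552_889):.1f} MB")
print(f"Retina road sign model parameters: {float32_megabytes(14_364) * 1000:.1f} kB")
```

Under these assumptions the first two layers alone need about 2.02 billion multiplications, in line with the "≈ 2 billion" entry, and 50,552,889 float32 parameters occupy about 202 MB, in line with the "≈ 200 MB" project size.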

CONCLUSIONS (1)
                    Retina v1.6               YOLO 608x608
Property            Road Signs   Faces        Road Signs   Faces
Training images     100          73           2000         1800
Accuracy            89.39%       69.74%       92.94%       80.66%
Hardware            CPU          CPU          GPU          GPU
This work is the result of a master's thesis at the Information Engineering Department of the University of Brescia. Its objective is to compare the approach used by the Retina library with the more complex approach based on CNN architectures such as YOLO.

CONCLUSIONS (2)
Retina was developed for industrial applications, to be easy to use and portable on traditional hardware without the complexity of GPUs.
As the previous slides show, Retina requires far fewer hardware resources and can be trained on datasets two orders of magnitude smaller than those used by YOLO (pre-trained on 300K images).
The detection systems were evaluated on public datasets unrelated to the industrial world. Despite this, the results achieved by Retina were comparable with those of YOLO, confirming its potential.
