1
RETINA vs YOLO on Road Signs & Face Detection
2
CHALLENGE INTRODUCTION
OBJECT DETECTION: a computer vision technique for identifying objects in digital images or videos. It provides information about both “what” the object is and “where” it is.
This work compares the performance and complexity of our object detector Retina with the deep CNN architecture YOLO (You Only Look Once).
The systems are evaluated on public datasets available online for two localization tasks: road sign detection and face detection.
3
ROAD SIGNS DETECTION
Reference Dataset: [1]
Info: ≈ 20,000 partially labelled images (3000 used)
Dataset organization: 2000 train, 200 val, 800 test
N classes: 7 (Prohibitory, Speed Prohibitory, Priority Road, Mandatory, Warning, Give Way, Pedestrian Crossing)
Min. Object Size: 24x24 pixels
[1] Published in conjunction with the paper by Fredrik Larsson and Michael Felsberg, Using Fourier Descriptors and Spatial Models for Traffic Sign Recognition, in Proceedings of the 17th Scandinavian Conference on Image Analysis, SCIA 2011
4
FACE DETECTION
Reference Dataset: http://vis-www.cs.umass.edu/fddb/ [2]
Info: 2845 labelled images (all used)
Dataset organization: 1800 train, 200 val, 845 test
N classes: 1 (Face)
Min. Object Size: 20x20 pixels
N.B. The dataset includes critical instances with extreme orientations, occlusions and blurring
[2] Vidit Jain and Erik Learned-Miller, FDDB: A Benchmark for Face Detection in Unconstrained Settings, Technical Report UM-CS, Dept. of Computer Science, University of Massachusetts, Amherst
5
FACE DETECTION: CRITICAL INSTANCES
Example images: strong occlusion; blurring + partial occlusions; partial occlusions.
Some partially occluded faces are not labelled!
6
YOLO: DETAILS & PARAMETERS SETTING
Architecture details:
Version: YOLOv2 608x608
Pre-training: COCO dataset (download weights, configuration file)
Useful guidelines, scripts and functions available here
Parameters setting (equal for both detection tasks):
TRAINING
Pre-Processing: YES (random saturation, exposure and sharpness)
N layers trained: entire model
Optimizer: SGD (η = 0.01, decay = 1∙10^-5, γ = 0.9)*
Batch Size: 20
N epochs: chosen according to the best results on validation (194 for road signs, 297 for faces)
TEST
Pre-Processing: NO
Conf. score threshold: 0.5
IOU for NMS**: 0.3
Ground truth/Prediction IOU: 0.5
* η = learning rate, γ = momentum
** IOU = Intersection over Union, NMS = Non-Maximum Suppression
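The test-time settings above (confidence threshold 0.5, IOU threshold 0.3 for NMS) can be illustrated with a minimal sketch of greedy non-maximum suppression. This is a generic textbook version, not the code used in the thesis; the box format (x1, y1, x2, y2) and the function names are assumptions made for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        conf_thr: float = 0.5, iou_thr: float = 0.3) -> List[int]:
    """Greedy NMS: drop low-confidence boxes, then keep the best-scoring box
    and discard any later box that overlaps a kept one by more than iou_thr."""
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two overlapping boxes, one separate box,
# and one box below the confidence threshold.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.4]
kept = nms(boxes, scores)
```

With these example values the second box overlaps the first with IOU 81/119 ≈ 0.68 and is suppressed, and the fourth box is removed by the confidence threshold, so `kept` holds the indices of the first and third boxes.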
7
RETINA: DETAILS & SETTING
GUI & Library details:
Version: Retina v1.6.0 (demo version available here)
Models setting:
PROPERTY              ROAD SIGNS                      FACES
Model Dimensions      40 x 40 pixels                  48 x 64 pixels
Object Distance       x = 20, y = 20                  x = 22, y = 22
Coarse & Fine Step    coarse = (4,4), fine = (4,4)    coarse = (8,8), fine = (4,4)
Perturbations         NO                              NO
Training Options:
OPTION                ROAD SIGNS                      FACES
Goodness Target       0.5                             0.5
Optimization Mode     Selected                        Slow
Features              0A, 2B                          2B.R, 2A.R, 2B.G, 2B.B, 2A.B, 0B
8
PERFORMANCE
Road Signs Dataset
                  YOLOv2 (608x608)                 Retina v1.6.0
Class             Precision  Recall   Accuracy     Precision  Recall   Accuracy
Prohibitory       96.39%     93.02%   90.63%       96.64%     91.90%   88.82%
Speed Proh.       95.24%     99.62%   95.14%       99.20%     97.31%   96.56%
Priority Road     95.56%, 97.33%, 93.73%, 100.0%, 90.14% (5 of 6 values survive)
Mandatory         96.09%     94.45%   91.80%       93.82%     91.56%   86.36%
Warning           97.40%     92.59%   91.75%       94.44%     85.89%   82.70%
Give-Way          98.46%, 97.50%, 78.13% (3 of 6 values survive)
Pedestrian Cr.    98.05%, 95.26%, 93.89%, 89.47% (4 of 6 values survive)
Total             96.41%     96.19%   92.94%       97.78%     91.12%   89.39%
Faces Dataset
                  YOLOv2 (608x608)                 Retina v1.6.0
Class             Precision  Recall   Accuracy     Precision  Recall   Accuracy
Face              94.43%     84.7%    80.66%       93.04%     73.57%   69.74%
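The slides do not define how accuracy is computed. A minimal sketch, assuming the common detection convention accuracy = TP / (TP + FP + FN), shows how the three columns relate; under that assumption 1/accuracy = 1/precision + 1/recall - 1. The Face rows are consistent with this definition to within rounding, while the per-class road-sign rows match it only approximately, so treat the definition as an assumption rather than the authors' formula. Function names are hypothetical.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall and accuracy from detection counts.
    Accuracy is ASSUMED to be TP / (TP + FP + FN); the slides do not define it."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = tp / (tp + fp + fn)
    return precision, recall, accuracy


def accuracy_from_pr(precision: float, recall: float) -> float:
    """Same assumed accuracy, derived from precision and recall only:
    (TP+FP+FN)/TP = (TP+FP)/TP + (TP+FN)/TP - 1."""
    return 1.0 / (1.0 / precision + 1.0 / recall - 1.0)


# Face rows from the table (precision, recall); reported accuracies
# are 80.66% (YOLOv2) and 69.74% (Retina).
yolo_face_acc = accuracy_from_pr(0.9443, 0.847)
retina_face_acc = accuracy_from_pr(0.9304, 0.7357)
```

Both derived values land within about 0.01 percentage points of the reported Face accuracies.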
9
PERFORMANCE: TOP RECALL SCORES
Same table as slide 8, with the top recall scores highlighted.
YOLOv2 (608x608) has the higher total recall on both datasets: 96.19% vs 91.12% on road signs, 84.7% vs 73.57% on faces.
10
PERFORMANCE: TOP PRECISION SCORES
Same table as slide 8, with the top precision scores highlighted.
Retina v1.6.0 has the higher total precision on road signs (97.78% vs 96.41%); YOLOv2 (608x608) has the higher precision on faces (94.43% vs 93.04%).
11
TESTING RESULTS
On the Road Signs Dataset, Retina generally proves to have a higher precision than YOLO.
EX: YOLO localizes a priority road sign where there is a satellite dish.
Retina v1.6.0: OK
YOLOv2 608x608: OK; KO: no road sign here!
12
TESTING RESULTS
On the Road Signs Dataset, Retina generally proves to have a higher precision than YOLO.
EX: YOLO confuses an unclassified sign with a priority road sign.
Example images (Retina v1.6.0 and YOLOv2 608x608), detections marked: OK, OK, OK
YOLOv2 608x608: KO: this is not a priority road sign!
13
TESTING RESULTS
However, YOLO, thanks to its high generalization capability, is able to detect more critical instances.
EX: partial occlusions
Retina v1.6.0: KO: object is not detected
YOLOv2 608x608: OK
14
TESTING RESULTS
However, YOLO, thanks to its high generalization capability, is able to detect more critical instances.
EX: unusual orientations
Retina v1.6.0: KO: object is not detected
Example images (Retina v1.6.0 and YOLOv2 608x608), other detections marked: OK, OK, OK
15
COMPLEXITY ANALYSIS
What kind of computational resource do you need?
(The following results are obtained using: *CPU: Intel Core i7-8700K, 3.70 GHz; **GPU: Nvidia Quadro P5000, 16.0 GB)
Road Signs Dataset
Detector          Comp. Resource   % Train Images used   Train Time   Train. Parameters (‘float32’)   Test Time   Testing Computations                               Project Size
Retina v1.6       CPU*             4.9% (100)            9h 7m        14,364                          1003 ms     ≈ 712 million                                      822 kB
YOLOv2 608x608    GPU**            100% (2000)           6h 56m       50,552,889                      30 ms       ≈ 2 billion (only on the first 2 Conv. Layers)     ≈ 200 MB
Faces Dataset
Detector          Comp. Resource   % Train Images used   Train Time   Train. Parameters (‘float32’)   Test Time   Project Size
Retina v1.6       CPU*             4% (73)               59m          20,800                          240 ms      960 kB
YOLOv2 608x608    GPU**            100% (1800)           8h 10m       50,552,889                      30 ms       ≈ 200 MB
16
COMPLEXITY ANALYSIS
How many training images?
Road Signs Dataset: Retina v1.6 (CPU) 4.9% (100); YOLOv2 608x608 (GPU) 100% (2000)
Faces Dataset: Retina v1.6 (CPU) 4% (73); YOLOv2 608x608 (GPU) 100% (1800)
17
COMPLEXITY ANALYSIS
How long does the training procedure take?
Road Signs Dataset: Retina v1.6 (CPU) 9h 7m; YOLOv2 608x608 (GPU) 6h 56m
Faces Dataset: Retina v1.6 (CPU) 59m; YOLOv2 608x608 (GPU) 8h 10m
18
COMPLEXITY ANALYSIS
How many parameters have to be trained?
Road Signs Dataset: Retina v1.6 14,364; YOLOv2 608x608 50,552,889 (‘float32’)
Faces Dataset: Retina v1.6 20,800; YOLOv2 608x608 50,552,889 (‘float32’)
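A quick back-of-the-envelope check connects the parameter counts to storage: each 'float32' parameter takes 4 bytes. The sketch below only does that arithmetic with the counts from the table; it says nothing about how either project actually stores its model.

```python
BYTES_PER_FLOAT32 = 4  # size of one 'float32' value

# Trainable parameter counts taken from the complexity table.
TRAINABLE_PARAMS = {
    "Retina v1.6 (road signs)": 14_364,
    "Retina v1.6 (faces)": 20_800,
    "YOLOv2 608x608": 50_552_889,
}


def param_bytes(n_params: int) -> int:
    """Raw storage needed for n_params float32 values, in bytes."""
    return n_params * BYTES_PER_FLOAT32


sizes = {name: param_bytes(n) for name, n in TRAINABLE_PARAMS.items()}
for name, size in sizes.items():
    print(f"{name}: {size / 1e6:.3f} MB")
```

The YOLOv2 weights alone come to 202,211,556 bytes, about 202 MB, which agrees with the ≈200 MB project size reported for YOLO. The Retina parameters need roughly 57 kB (road signs) and 83 kB (faces).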
19
COMPLEXITY ANALYSIS
How much time does it take to perform a detection?
Road Signs Dataset: Retina v1.6 (CPU) 1003 ms; YOLOv2 608x608 (GPU) 30 ms
Faces Dataset: Retina v1.6 (CPU) 240 ms; YOLOv2 608x608 (GPU) 30 ms
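To read the test times as throughput, the following small sketch converts milliseconds per image into images per second and computes the CPU-to-GPU time ratio. It uses only the figures from the table; note that the two detectors run on different hardware (CPU vs GPU), so the ratio compares complete setups, not algorithms alone.

```python
def images_per_second(test_time_ms: float) -> float:
    """Throughput corresponding to a per-image test time in milliseconds."""
    return 1000.0 / test_time_ms


def time_ratio(slower_ms: float, faster_ms: float) -> float:
    """How many times longer the slower detector takes per image."""
    return slower_ms / faster_ms


# Test times from the complexity table (ms per image).
retina_signs_ms, retina_faces_ms, yolo_ms = 1003.0, 240.0, 30.0

signs_ratio = time_ratio(retina_signs_ms, yolo_ms)   # about 33x
faces_ratio = time_ratio(retina_faces_ms, yolo_ms)   # 8x
yolo_fps = images_per_second(yolo_ms)                # about 33 images/s
```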
20
COMPLEXITY ANALYSIS
Amount of inner computations to perform a detection? (only multiplications are taken into account)
Road Signs Dataset: Retina v1.6 ≈ 712 million; YOLOv2 608x608 ≈ 2 billion (only on the first 2 Conv. Layers)
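The "≈ 2 billion" figure for the first two convolutional layers can be reproduced with a short calculation. The sketch assumes the layer layout of the public YOLOv2 (Darknet-19) configuration, which the slides do not spell out: a 3x3 convolution with 32 filters on the 608x608x3 input (stride 1, same padding), a 2x2 max-pool, then a 3x3 convolution with 64 filters on the resulting 304x304x32 map. Only multiplications are counted, as on the slide.

```python
def conv_multiplications(out_h: int, out_w: int,
                         in_ch: int, out_ch: int, kernel: int) -> int:
    """Multiplications in one convolutional layer: every output value
    needs kernel * kernel * in_ch products."""
    return out_h * out_w * out_ch * kernel * kernel * in_ch


# ASSUMED layout (public YOLOv2 / Darknet-19 configuration, 608x608 input).
conv1 = conv_multiplications(608, 608, in_ch=3, out_ch=32, kernel=3)
conv2 = conv_multiplications(304, 304, in_ch=32, out_ch=64, kernel=3)
first_two_layers = conv1 + conv2

print(f"conv1: {conv1:,}  conv2: {conv2:,}  total: {first_two_layers:,}")
```

Under these assumptions the total is 2,022,801,408 multiplications, in line with the slide's ≈ 2 billion and already almost three times Retina's ≈ 712 million for a whole detection.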
21
COMPLEXITY ANALYSIS
Is the project easily portable?
Road Signs Dataset, project size: Retina v1.6 822 kB; YOLOv2 608x608 ≈ 200 MB
Faces Dataset, project size: Retina v1.6 960 kB; YOLOv2 608x608 ≈ 200 MB
22
CONCLUSIONS (1)
Property           Retina v1.6                 YOLO 608x608
                   Road Signs    Faces         Road Signs    Faces
Training images    100           73            2000          1800
Accuracy           89.39%        69.74%        92.94%        80.66%
Hardware           CPU                         GPU
This work is the result of a master's thesis at the Information Engineering Department of the University of Brescia, whose objective is to compare the approach used by the Retina library with the more complex one based on CNN architectures such as YOLO.
23
CONCLUSIONS (2)
Retina was developed for industrial applications, to be easy to use and portable on traditional hardware without the complexity of GPUs.
As demonstrated in the previous slides, Retina requires far fewer hardware resources and can be trained with datasets two orders of magnitude smaller than those used for YOLO (which is pre-trained on 300K images).
The detection systems were evaluated on public datasets not related to the industrial world. Despite this, the results achieved by Retina were comparable with those of YOLO, confirming its potential.