Object Detection Creation from Scratch Samsung R&D Institute Ukraine


Object Detection Creation from Scratch Samsung R&D Institute Ukraine Vitaliy Bulygin

Problem formulation and dataset. Udacity dataset: nearly 22,000 images (21,000 train, 1,000 test). Bounding boxes with area $S < 0.5\%$ of the image are not used. Problem: find bounding boxes for cars.

Naive solution: sliding window. Slide rectangles of different aspect ratios and sizes over the image; each crop is passed through a small network (convolution layers + max pooling + fully connected layer) acting as a binary classifier: is it a car? Yes (0,1) / No (1,0).

Naive solution: sliding window. Very slow! Every rectangle of every aspect ratio and size requires a separate pass through the binary classifier: is it a car? Yes (0,1) / No (1,0). A sketch of this brute-force loop is below.
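A minimal sketch of this brute-force loop, assuming a hypothetical `classifier(crop)` callable that returns P(car) for a single crop (nothing here is the repo's code):

```python
def sliding_window_detect(image, classifier, window_sizes, stride=32, threshold=0.5):
    """Naive detection: slide windows of several sizes/aspect ratios over the
    image and run the binary car / no-car classifier on every crop."""
    h, w = image.shape[:2]
    detections = []
    for win_w, win_h in window_sizes:          # e.g. [(64, 64), (128, 64), (64, 128)]
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                crop = image[y:y + win_h, x:x + win_w]
                if classifier(crop) > threshold:
                    detections.append((x, y, win_w, win_h))
    return detections   # hundreds of classifier calls per image: very slow
```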

Several words about two-stage detectors. The first stage generates proposals; the second stage is a classifier. Two-stage detectors are slower but more accurate than single-stage ones; however, the difference in accuracy has been getting smaller in 2018.

Naive solution: location as output. The network directly outputs $(x_c^1, y_c^1, w^1, h^1), (x_c^2, y_c^2, w^2, h^2), \dots$; the NN output size is $4 \cdot N$, where $N$ is the number of bounding boxes.

Naive solution: location as output. Problem: we do not know the number of objects in advance! The output size $4 \cdot N$ is fixed, but $N$ varies from image to image.

Output in the view of the grid. Split the image into an $N_x \times N_y$ grid and predict a rectangle and a class inside each cell. Ground truth (GT): $Y = \{y_{i,j}\}_{i,j=1}^{N_x, N_y}$ with $y_{i,j} = (p, x_c, y_c, w, h)$, where $p = 1$ if the cell contains an object and $p = 0$ otherwise, $(x_c, y_c)$ is the rectangle center, and $(w, h)$ are the rectangle width and height.

Output in the view of the grid (calculate it!). $x, y, w, h$ are in cell-relative coordinates: $x, y \in [0, 1]$, while $w, h$ can be $> 1$. If $p = 0$, set $x = y = w = h = 0$. Example: $y_{i,j} = (0, 0, 0, 0, 0)^t$ for $(i,j) \ne (1,0), (1,1)$; $y_{1,0} = (1, 0.6, 0.6, 0.5, 0.4)$; $y_{1,1} = (1, 0.6, 0.6, 0.5, 0.4)$. GitHub: data_generator.py -> convert_GT_to_YOLO(...)
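A minimal sketch of this encoding, assuming GT boxes come as pixel-space centers and sizes. Note that, per the example above, the repo's convert_GT_to_YOLO can mark two cells for a box that spans both; this simplified version fills only the cell containing the center:

```python
import numpy as np

def convert_gt_to_grid(boxes, img_w, img_h, nx, ny):
    """Encode GT boxes (x_c, y_c, w, h in pixels) into an (ny, nx, 5) grid of
    (p, x, y, w, h): x, y are the center offset inside the cell (in [0, 1]),
    w, h are the box size in cell units (may exceed 1)."""
    y_grid = np.zeros((ny, nx, 5), dtype=np.float32)
    cell_w, cell_h = img_w / nx, img_h / ny
    for xc, yc, w, h in boxes:
        i = min(int(xc / cell_w), nx - 1)   # cell column containing the center
        j = min(int(yc / cell_h), ny - 1)   # cell row
        y_grid[j, i] = (1.0,                # p = 1: the cell holds an object
                        xc / cell_w - i,    # x in [0, 1]
                        yc / cell_h - j,    # y in [0, 1]
                        w / cell_w,         # w, can be > 1
                        h / cell_h)         # h, can be > 1
    return y_grid
```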

Output in the view of the grid (papers). Recent papers with a similar output: RFB Net (Songtao Liu et al., 2018), RefineDet (Shifeng Zhang et al., 2018), YOLOv3 (Joseph Redmon et al., 2018), Pelee Net (Robert J. Wang et al., 2018), FSSD (Zuo-Xin Li et al., 2018), DSOD (Zhiqiang Shen et al., 2018), ...

Output in the view of the grid (general case). $C$ is the number of classes, $N$ the number of boxes per cell. The feature extractor predicts several boxes for the same cell, with aspect ratios 1:1, 2:1, 1:2, 3:1, ...

Output in the view of the grid (general case). $C$ is the number of classes, $N_1$ the number of boxes per cell. A class-prediction branch outputs $N_1 \cdot C$ values and a box-prediction branch $N_1 \cdot 4$ values per cell of the finest grid: small object predictions.

Output in the view of the grid (general case). $C$ is the number of classes, $N_2$ the number of boxes per cell for middle-size objects. The grid size is smaller; the class branch outputs $N_2 \cdot C$ and the box branch $N_2 \cdot 4$ values per cell: middle-size object predictions.

Output in the view of the grid (general case). $C$ is the number of classes, $N_3$ the number of boxes per cell for large objects. The class branch outputs $N_3 \cdot C$ and the box branch $N_3 \cdot 4$ values per cell: large object predictions.

Single-stage object detector components. We have an image dataset and GT rectangles; what do we need to transform the data into model input? I. Preprocessing: image normalization, augmentation, GT encoding, batch generator. (data_preprocessing.py, data_generator.py)

Single-stage object detector components. II. Feature extractor. (model.py)

Single-stage object detector components. III. Model head (output): $box_1, box_2, \dots$

Single-stage object detector components. IV. Loss function: $L = \frac{1}{N_{obj}} \sum_{i=1}^{W \cdot H} \delta_i^{obj} \cdot (\dots)$ (train.ipynb)

Single-stage object detector components. V. Postprocessing: filtering + NMS. (data_postprocessing.py)

Single-stage object detector components. VI. Accuracy evaluation: the precision-recall curve. (evaluator.py)

I. Preprocessing: data augmentation. Possible augmentations: horizontal flip, vertical flip, zoom in/out, width and height shift, rotation within some range, shear, brightness shift, channel shift, hue change, saturation change, contrast change, gamma correction, histogram equalization.

I. Preprocessing: data augmentation (data_preprocessing.py). Augmentation gives more than a 10% accuracy (mAP) improvement. Here only horizontal flip and width/height shift are used (original vs. augmented). A sketch follows.
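A minimal sketch of the two augmentations used here (horizontal flip plus width/height shift), assuming boxes are an array of $(x_c, y_c, w, h)$ in pixels; boxes pushed outside the image should additionally be clipped or dropped, which is omitted:

```python
import numpy as np

def augment(image, boxes, rng=np.random):
    """Horizontal flip plus width/height shift, with matching box updates.
    `boxes` is an (n, 4) array of (x_c, y_c, w, h) in pixels."""
    h, w = image.shape[:2]
    boxes = boxes.astype(np.float32).copy()
    if rng.rand() < 0.5:                        # horizontal flip
        image = image[:, ::-1]
        boxes[:, 0] = w - boxes[:, 0]
    dx = int(rng.uniform(-0.1, 0.1) * w)        # shift up to 10% of width
    dy = int(rng.uniform(-0.1, 0.1) * h)        # shift up to 10% of height
    shifted = np.zeros_like(image)
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    boxes[:, 0] += dx                           # boxes pushed off-image should be
    boxes[:, 1] += dy                           # clipped or dropped (omitted here)
    return shifted, boxes
```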

I. Preprocessing: data normalization (data_preprocessing.py). Normalization can include: mapping pixel values from $(0, 255)$ to $(0, 1)$ or $(-1, 1)$; mean subtraction; division by the standard deviation; scaling rectangle coordinates to $[0, 1]$, which makes them independent of image scale.
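A minimal normalization sketch covering the listed options; the mean/std arguments are assumptions to be replaced with dataset statistics:

```python
import numpy as np

def normalize_image(image, mean=None, std=None):
    """Map pixels from (0, 255) to (-1, 1), or apply mean subtraction and
    division by the standard deviation when dataset statistics are given."""
    x = image.astype(np.float32)
    if mean is not None and std is not None:
        return (x - mean) / std
    return x / 127.5 - 1.0

def normalize_rect(box, img_w, img_h):
    """Scale rectangle coordinates to [0, 1]: independent of image scale."""
    xc, yc, w, h = box
    return (xc / img_w, yc / img_h, w / img_w, h / img_h)
```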

I. Preprocessing: data generator (data_preprocessing.py). __getitem__() reads a batch of images and their GT labels.

I. Preprocessing: data generator (data_preprocessing.py). __getitem__() generates a batch $(X, Y)$: the images $X_1, \dots, X_n$ are augmented and normalized; the labels $Y_1, \dots, Y_n$ are the grid outputs $(p, x_c, y_c, w, h)$. A generator sketch follows.
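A minimal generator sketch in Keras style (__getitem__ is the tf.keras Sequence protocol), reusing the augment, normalize_image and convert_gt_to_grid sketches above; holding all images in memory is a simplification:

```python
import numpy as np
from tensorflow.keras.utils import Sequence

class DetectionGenerator(Sequence):
    """Batch generator: augment and normalize images, encode GT boxes into
    the grid. `images` and `gt_boxes` are kept in memory here for brevity;
    the real generator would read from disk."""
    def __init__(self, images, gt_boxes, batch_size, nx=9, ny=9):
        self.images, self.gt_boxes = images, gt_boxes
        self.batch_size, self.nx, self.ny = batch_size, nx, ny

    def __len__(self):
        return len(self.images) // self.batch_size

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        X, Y = [], []
        for img, boxes in zip(self.images[sl], self.gt_boxes[sl]):
            img, boxes = augment(img, boxes)            # sketches from above
            h, w = img.shape[:2]
            X.append(normalize_image(img))
            Y.append(convert_gt_to_grid(boxes, w, h, self.nx, self.ny))
        return np.stack(X), np.stack(Y)
```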

II. Feature extractor (model.py). Only 3×3 filters (convolution + ReLU, 2×2 max pooling): 300×300×3 → 300×300×16 → 150×150×24 → 75×75×32 → 37×37×48 → 18×18×64 → 9×9×64. It is not an optimal feature extractor!

II. Feature extractor (model.py). The final 9×9×64 map is where the bounding boxes are encoded. Why such an architecture? Why 9×9? See the receptive field analysis below; a Keras sketch of the network follows.
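A Keras sketch reproducing the slide's shapes; the exact layer hyperparameters of model.py are assumptions:

```python
from tensorflow.keras import layers, models

def build_feature_extractor():
    """Reproduce the slide's shapes: 300x300x3 -> 300x300x16 -> 150x150x24
    -> 75x75x32 -> 37x37x48 -> 18x18x64 -> 9x9x64, using only 3x3 filters
    (convolution + ReLU) and 2x2 max pooling."""
    x = inp = layers.Input(shape=(300, 300, 3))
    filters_per_block = (16, 24, 32, 48, 64, 64)
    for i, filters in enumerate(filters_per_block):
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        if i < len(filters_per_block) - 1:   # pool after every conv but the last
            x = layers.MaxPooling2D(2)(x)
    return models.Model(inp, x)              # output: 9x9x64 feature map
```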

II. Feature extractor: effective receptive field. The effective receptive field is the area of the original image that can possibly influence the activation of a neuron. The building blocks here are 3×3 convolutions and 2×2 max pooling with stride 2. After the first 3×3 convolution (zero padding at the border): $r_1^{conv} = 3$.

II. Feature extractor: effective receptive field. Adding a 2×2 max pooling with stride 2 after the first convolution: $r_1^{conv} = 3$, $r_1^{pool} = 4$.

II. Feature extractor: effective receptive field, i.e. the area of the input image that the chosen feature is looking at. After conv → pool → conv the field grows: $r_1^{conv} = 3$, $r_1^{pool} = 4$, $r_2^{conv} = 8$.

II. Feature extractor: effective receptive field. For the stack of alternating 3×3 convolutions and 2×2 poolings: $r_1^{conv} = 3$, $r_2^{conv} = 8$, $r_3^{conv} = 18$.

II. Feature extractor: effective receptive field. Continuing: $r_4^{conv} = 38$, $r_5^{conv} = 78$.

II. Feature extractor: effective receptive field. At the final 9×9 grid: $r_6^{conv} = 158$.

II. Feature extractor: effective receptive field. A car in the 300×300 input spans roughly $2 \cdot 32 = 64$ pixels, about two cells of the 9×9 grid, so the receptive field $r_6^{conv} = 158$ covers it with a margin. The recursion behind these numbers is sketched below.
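These values follow from the standard recursion $r_{out} = r_{in} + (k - 1) \cdot j_{in}$, where $j$ is the product of the strides so far; a small sketch that reproduces the numbers above:

```python
def receptive_field(layer_stack):
    """Effective receptive field after a stack of (kernel_size, stride)
    layers: r grows by (k - 1) * j at each layer, where j is the product
    of the strides applied so far."""
    r, j = 1, 1
    for k, s in layer_stack:
        r += (k - 1) * j
        j *= s
    return r

# Alternating 3x3 conv (stride 1) and 2x2 max pool (stride 2), as in the slides.
stack = []
for _ in range(6):
    stack.append((3, 1))
    stack.append((2, 2))
print(receptive_field(stack[:1]))    # 3   = r1_conv
print(receptive_field(stack[:2]))    # 4   = r1_pool
print(receptive_field(stack[:3]))    # 8   = r2_conv
print(receptive_field(stack[:5]))    # 18  = r3_conv
print(receptive_field(stack[:11]))   # 158 = r6_conv
```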

II. Feature extractor: effective receptive field. The receptive field has to contain the object with a margin: at the 3rd convolution (the 37×37×48 map) it is only 18×18, which cannot recognize the car position.

II. Feature extractor: effective receptive field. Conversely, a very large receptive field makes it hard to localize small objects.

III. Model head (model.py). Feature extractor → head (a 3×3 convolution) → output.

III. Model head (model.py). Possible improvement: attach additional heads to earlier, higher-resolution feature maps for smaller objects.

IV. Loss function (model.ipynb, function YOLO_loss(y_true, y_pred)). Compare GT with the prediction: for an 'object' cell all of $(p, x, y, w, h)$ matter; for a 'no object' cell we do not care about $x, y, w, h$ and compare only $p$ (denoted $y_0$):
$L = \frac{1}{N_{obj}} \sum_{y^{GT} \text{ is obj}} \| y^{GT} - y^{pred} \| + \frac{1}{N_{no\_obj}} \sum_{y^{GT} \text{ is not obj}} \| y_0^{GT} - y_0^{pred} \|$
A TensorFlow sketch follows.
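A TensorFlow sketch of this loss under the (batch, $N_y$, $N_x$, 5) layout used above; it mirrors the slide's formula with an L2 distance, not necessarily the exact YOLO_loss from model.ipynb:

```python
import tensorflow as tf

def yolo_loss(y_true, y_pred):
    """L2 loss on (p, x, y, w, h) for 'object' cells and on p alone for
    'no object' cells, each normalized by its cell count.
    Tensors have shape (batch, ny, nx, 5) with p = y[..., 0]."""
    obj_mask = y_true[..., 0]                       # 1 where the cell has an object
    no_obj_mask = 1.0 - obj_mask
    n_obj = tf.maximum(tf.reduce_sum(obj_mask), 1.0)
    n_no_obj = tf.maximum(tf.reduce_sum(no_obj_mask), 1.0)
    obj_loss = tf.reduce_sum(
        obj_mask[..., None] * tf.square(y_true - y_pred)) / n_obj
    no_obj_loss = tf.reduce_sum(                    # only p matters here
        no_obj_mask * tf.square(y_true[..., 0] - y_pred[..., 0])) / n_no_obj
    return obj_loss + no_obj_loss
```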

IV. Loss function: possible improvements. 1. Hard negative mining: take into account only $3 \cdot n$ negatives, where $n$ is the number of positives. 2. Use binary classification + a cross-entropy loss: probabilities for $c$ classes + 1 background (here $p_{car}$, $p_{no\_car}$) alongside the coordinates $(x, y, w, h)$.

IV. Loss function: possible improvements. 3. Use $M$ anchor boxes with different aspect ratios in each cell, so the box output becomes $W \times H \times M \times 5$; each anchor gets its own tuple, e.g. for $M = 2$: $(p_1, x_1, y_1, w_1, h_1, c_1, c_2)$ for $box_1$ and $(p_2, x_2, y_2, w_2, h_2, c_1, c_2)$ for $box_2$, with the class probabilities ($c$ classes + 1 background) repeated $M$ times.

V. Postprocessing. The prediction is a $W \times H \times 5$ tensor over the $W \times H$ cells: a confidence $p$ plus cell-relative coordinates $(x, y, w, h)$. First convert: relative to cell → relative to image.

V. Postprocessing. After the coordinate conversion the $W \times H$ raw predictions still need filtering.

V. Postprocessing: filtering. Pick a threshold, e.g. $T = 0.5$, and keep the $N$ rectangles with $p > T$: the result is $N$ scores and an $N \times 4$ array of rectangles in image coordinates. A sketch follows.
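A minimal decode-and-filter sketch for a single-box-per-cell prediction:

```python
import numpy as np

def decode_and_filter(pred, img_w, img_h, conf_thr=0.5):
    """Convert an (ny, nx, 5) prediction from cell-relative to image
    coordinates and keep the rectangles with confidence p > conf_thr."""
    ny, nx = pred.shape[:2]
    cell_w, cell_h = img_w / nx, img_h / ny
    scores, rects = [], []
    for j in range(ny):
        for i in range(nx):
            p, x, y, w, h = pred[j, i]
            if p > conf_thr:
                scores.append(p)
                rects.append(((i + x) * cell_w,   # center x in pixels
                              (j + y) * cell_h,   # center y in pixels
                              w * cell_w,         # width in pixels
                              h * cell_h))        # height in pixels
    return np.array(scores), np.array(rects)      # N scores, N x 4 rectangles
```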

V. Postprocessing: non-maximum suppression. Sort the rectangles $rect_1, rect_2, rect_3, \dots, rect_N$ by confidence $p$: 0.9, 0.85, ...

V. Postprocessing: non-maximum suppression. Compare the IoU of the 1st (highest-confidence) rectangle with all the others.

V. Postprocessing: non-maximum suppression. The rectangle with confidence 0.85 overlaps the 1st one with $IOU > 0.5$ (the threshold): it will be thrown out.

V. Postprocessing: non-maximum suppression. Next, the rectangle with confidence 0.82 is compared with the 1st one against the same $IOU > 0.5$ threshold.

V. Postprocessing: non-maximum suppression. For the rectangle with confidence 0.82, $IOU = 0$ with the 1st one: do nothing.

V. Postprocessing: non-maximum suppression. The rectangle with confidence 0.77 also has $IOU = 0$ with the chosen rectangle: do nothing!

V. Postprocessing: non-maximum suppression. The rectangle with confidence 0.75 has $IOU < 0.5$ with the 1st one: do nothing.

V. Postprocessing: non-maximum suppression. Now compare the IoU of the 2nd remaining rectangle (confidence 0.82) with the ones after it; the list length is now $N_1 \le N$ because rectangles have been thrown out.

V. Postprocessing: non-maximum suppression. Again, a rectangle with $IOU > 0.5$ relative to the 2nd one is thrown out.

V. Postprocessing: non-maximum suppression. The procedure repeats down the list (survivors here: 0.9, 0.82, 0.77); in total the IoU is computed $(N-1) + (N_1-2) + (N_2-3) + \dots$ times. A sketch follows.
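A minimal greedy NMS sketch; boxes are assumed to be in corner format $(x_1, y_1, x_2, y_2)$, so the center-format rectangles from the previous step would be converted first:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in corner format (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def nms(rects, scores, iou_thr=0.5):
    """Greedy NMS as in the slides: keep the highest-confidence rectangle,
    throw out the rest with IoU above the threshold, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        top = order.pop(0)
        keep.append(top)
        order = [k for k in order if iou(rects[top], rects[k]) <= iou_thr]
    return keep   # indices of the surviving rectangles
```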

VI. Accuracy evaluation. Given $n$ predicted boxes and $m$ ground-truth boxes, how do we establish the correspondence between GT and predictions? Via $IoU = \frac{|A \cap B|}{|A \cup B|}$, where $A$ is a ground-truth box and $B$ a detector result.

VI. Accuracy evaluation. Build a descending sorted array $(iou_1, iou_2, iou_3, \dots)$ of all GT/prediction pairs with $IOU > T$ (maximum length $n \cdot m$). evaluator.py -> sort_ious(gt_boxes, pred_boxes, iou_thr)

VI. Accuracy evaluation. Walk through the sorted array: a pair is accepted as a match only if both its GT box and its predicted box appear for the first time. evaluator.py -> get_single_image_results(gt_boxes, pred_boxes, iou_thr)

VI. Accuracy evaluation: true predicted. The accepted pairs form the matched GT and matched prediction sets; their size is the number of true predictions. evaluator.py -> get_single_image_results(gt_boxes, pred_boxes, iou_thr). A matching sketch follows.
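A minimal matching sketch in the spirit of get_single_image_results, reusing the iou helper from the NMS sketch; the exact repo implementation may differ:

```python
def match_boxes(gt_boxes, pred_boxes, iou_thr):
    """Greedy matching as in the slides: sort all GT/prediction pairs with
    IoU > iou_thr in descending order and accept a pair only when both its
    GT box and its predicted box appear for the first time."""
    pairs = [(iou(g, p), gi, pi)
             for gi, g in enumerate(gt_boxes)
             for pi, p in enumerate(pred_boxes)]
    pairs = sorted((t for t in pairs if t[0] > iou_thr), reverse=True)
    matched_gt, matched_pred = set(), set()
    for _, gi, pi in pairs:
        if gi not in matched_gt and pi not in matched_pred:
            matched_gt.add(gi)
            matched_pred.add(pi)
    return matched_gt, matched_pred   # len(matched_pred) = true predictions
```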

VI. Accuracy evaluation: precision, recall (getTruePredicted, $T = 0.5$). precision = true predicted / predicted: what part of the predictions is true. recall = true predicted / ground truth: what part of all GT objects is truly predicted.

VI. Accuracy evaluation: precision, recall (getTruePredicted, $T = 0.5$). Left example: predicted = 1, ground truth = 2, true predicted = 1, so precision = 1/1 = 1.0 and recall = 1/2 = 0.5. Right example: predicted = 4, ground truth = 3, true predicted = 1, so precision = 1/4 = 0.25 and recall = 1/3 ≈ 0.33.

VI. Accuracy evaluation: precision, recall. evaluator.py -> calc_precision_recall(img_results); a sketch follows.
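A minimal sketch of such a calc_precision_recall, applied to the two examples above (the repo's function takes per-image results; this simplified version takes the counts directly):

```python
def calc_precision_recall(n_true_pred, n_pred, n_gt):
    """precision = true predicted / predicted, recall = true predicted / GT."""
    precision = n_true_pred / n_pred if n_pred else 0.0
    recall = n_true_pred / n_gt if n_gt else 0.0
    return precision, recall

print(calc_precision_recall(1, 1, 2))   # (1.0, 0.5)      - left example
print(calc_precision_recall(1, 4, 3))   # (0.25, 0.33...) - right example
```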

VI. Accuracy evaluation: confidence ↔ precision, recall. Every predicted box has a confidence $p$; sort all box confidences into $(p_1, p_2, \dots, p_i, \dots, p_{N-1}, p_N)$, e.g. 0.1, 0.2, 0.3, 0.4, 0.4, ...

VI. Accuracy evaluation: confidence ↔ precision, recall. I. Take the boxes with $p \ge p_1$, i.e. all boxes, and calcPrecisionRecall over all images together: e.g. score $p_1$ → precision 0.1, recall 0.9 (high recall, low precision).

VI. Accuracy evaluation: confidence ↔ precision, recall. II. Repeat for $p \ge p_2$: precision 0.2, recall 0.8; and so on, always over all images together.

VI. Accuracy evaluation: confidence ↔ precision, recall. ... up to $p \ge p_N$: high precision, low recall.

VI. Accuracy evaluation: confidence ↔ precision, recall. The sweep yields a table of thresholds and metrics, get_thr_prec_rec(...):
score | precision | recall
p_1   | 0.1       | 0.9
p_2   | 0.2       | 0.8
...   | ...       | ...
p_N   | high      | low

VI. Accuracy evaluation: average precision calculation ($T_{IOU} = 0.5$). Plotting the (recall, precision) pairs from this table, hundreds of values for real datasets, gives the precision-recall curve.

VI. Accuracy evaluation: average precision calculation ($T_{IOU} = 0.5$). $AP_T = \frac{1}{11} \sum_{r \in \{0, 0.1, \dots, 1\}} \max_{\tilde{r}: \tilde{r} \ge r} p(\tilde{r})$: interpolated precision averaged over 11 recall thresholds.

VI. Accuracy evaluation: average precision calculation. For example, at $r = 0.3$ the term is $\max_{\tilde{r} \ge 0.3} p(\tilde{r})$. Many mAP tutorials get this wrong: the "area under curve" (AUC) is not the same!

VI. Accuracy evaluation: average precision calculation. The Pascal VOC metric is $AP_{0.5}$. A sketch follows.
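A minimal sketch of the 11-point interpolated $AP_T$, assuming the recalls and precisions come from the confidence sweep above:

```python
import numpy as np

def average_precision_11pt(recalls, precisions):
    """Pascal VOC 11-point AP: at each recall threshold r in {0, 0.1, ..., 1}
    take the maximum precision among points with recall >= r, then average.
    Note: this interpolated AP is not the raw area under the PR curve."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11.0
```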

VI. Accuracy evaluation: average precision calculation. $AP = \frac{1}{10} \sum_{T \in \{0.5, 0.55, \dots, 0.95\}} AP_T$; MS COCO reports the metrics $AP_{0.5}$, $AP_{0.75}$ and $AP$.

VI. Accuracy evaluation: average precision calculation. MS COCO uses 10 IoU thresholds 0.5, 0.55, ..., 0.95. [Plot: precision-recall curves for $T_{IOU} = 0.5, 0.6, 0.7, 0.8, 0.9$.]