Object Detection Creation from Scratch Samsung R&D Institute Ukraine


Object Detection Creation from Scratch Samsung R&D Institute Ukraine Vitaliy Bulygin

Problem formulation and dataset. Udacity dataset: nearly 22,000 images (21,000 train, 1,000 test). Bounding boxes with area $S < 0.5\%$ of the image are not used. Problem: find bounding boxes for cars.

Naive solution: sliding window. Slide rectangles of different aspect ratios and sizes over the image; each crop is passed through a small network (convolution layers + max pooling + fully connected layer) acting as a binary classifier: is it a car? Yes (0,1) / No (1,0).

Naive solution: sliding window. Very slow! Every rectangle of every aspect ratio and size requires a separate pass through the binary classifier: is it a car? Yes (0,1) / No (1,0). A sketch of this brute-force loop is below.
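A minimal sketch of this brute-force loop, assuming a hypothetical `classifier(crop)` callable that returns P(car) for a single crop (nothing here is the repo's code):

```python
def sliding_window_detect(image, classifier, window_sizes, stride=32, threshold=0.5):
    """Naive detection: slide windows of several sizes/aspect ratios over the
    image and run the binary car / no-car classifier on every crop."""
    h, w = image.shape[:2]
    detections = []
    for win_w, win_h in window_sizes:          # e.g. [(64, 64), (128, 64), (64, 128)]
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                crop = image[y:y + win_h, x:x + win_w]
                if classifier(crop) > threshold:
                    detections.append((x, y, win_w, win_h))
    return detections   # hundreds of classifier calls per image: very slow
```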

Several words about two-stage detectors. The first stage generates proposals; the second stage is a classifier. Two-stage detectors are slower but more accurate than single-stage ones; however, the difference in accuracy has been getting smaller in 2018.

Naive solution: location as output. The network directly outputs $(x_c^1, y_c^1, w^1, h^1), (x_c^2, y_c^2, w^2, h^2), \dots$; the NN output size is $4 \cdot N$, where $N$ is the number of bounding boxes.

Naive solution: location as output. Problem: we do not know the number of objects in advance! The output size $4 \cdot N$ is fixed, but $N$ varies from image to image.

Output in the view of the grid. Split the image into an $N_x \times N_y$ grid and predict a rectangle and a class inside each cell. Ground truth (GT): $Y = \{y_{i,j}\}_{i,j=1}^{N_x, N_y}$ with $y_{i,j} = (p, x_c, y_c, w, h)$, where $p = 1$ if the cell contains an object and $p = 0$ otherwise, $(x_c, y_c)$ is the rectangle center, and $(w, h)$ are the rectangle width and height.

Output in the view of the grid (calculate it!). $x, y, w, h$ are in cell-relative coordinates: $x, y \in [0, 1]$, while $w, h$ can be $> 1$. If $p = 0$, set $x = y = w = h = 0$. Example: $y_{i,j} = (0, 0, 0, 0, 0)^t$ for $(i,j) \ne (1,0), (1,1)$; $y_{1,0} = (1, 0.6, 0.6, 0.5, 0.4)$; $y_{1,1} = (1, 0.6, 0.6, 0.5, 0.4)$. GitHub: data_generator.py -> convert_GT_to_YOLO(...)
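A minimal sketch of this encoding, assuming GT boxes come as pixel-space centers and sizes. Note that, per the example above, the repo's convert_GT_to_YOLO can mark two cells for a box that spans both; this simplified version fills only the cell containing the center:

```python
import numpy as np

def convert_gt_to_grid(boxes, img_w, img_h, nx, ny):
    """Encode GT boxes (x_c, y_c, w, h in pixels) into an (ny, nx, 5) grid of
    (p, x, y, w, h): x, y are the center offset inside the cell (in [0, 1]),
    w, h are the box size in cell units (may exceed 1)."""
    y_grid = np.zeros((ny, nx, 5), dtype=np.float32)
    cell_w, cell_h = img_w / nx, img_h / ny
    for xc, yc, w, h in boxes:
        i = min(int(xc / cell_w), nx - 1)   # cell column containing the center
        j = min(int(yc / cell_h), ny - 1)   # cell row
        y_grid[j, i] = (1.0,                # p = 1: the cell holds an object
                        xc / cell_w - i,    # x in [0, 1]
                        yc / cell_h - j,    # y in [0, 1]
                        w / cell_w,         # w, can be > 1
                        h / cell_h)         # h, can be > 1
    return y_grid
```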

Output in the view of the grid (papers). Recent papers with a similar output: RFB Net (Songtao Liu et al., 2018), RefineDet (Shifeng Zhang et al., 2018), YOLOv3 (Joseph Redmon et al., 2018), Pelee Net (Robert J. Wang et al., 2018), FSSD (Zuo-Xin Li et al., 2018), DSOD (Zhiqiang Shen et al., 2018), ...

Output in the view of the grid (general case). $C$ is the number of classes, $N$ the number of boxes per cell. The feature extractor predicts several boxes for the same cell, with aspect ratios 1:1, 2:1, 1:2, 3:1, ...

Output in the view of the grid (general case). $C$ is the number of classes, $N_1$ the number of boxes per cell. A class-prediction branch outputs $N_1 \cdot C$ values and a box-prediction branch $N_1 \cdot 4$ values per cell of the finest grid: small object predictions.

Output in the view of the grid (general case). $C$ is the number of classes, $N_2$ the number of boxes per cell for middle-size objects. The grid size is smaller; the class branch outputs $N_2 \cdot C$ and the box branch $N_2 \cdot 4$ values per cell: middle-size object predictions.

Output in the view of the grid (general case). $C$ is the number of classes, $N_3$ the number of boxes per cell for large objects. The class branch outputs $N_3 \cdot C$ and the box branch $N_3 \cdot 4$ values per cell: large object predictions.

Single-stage object detector components. We have an image dataset and GT rectangles; what do we need to transform the data into model input? I. Preprocessing: image normalization, augmentation, GT encoding, batch generator. (data_preprocessing.py, data_generator.py)

Single-stage object detector components. II. Feature extractor. (model.py)

Single-stage object detector components. III. Model head (output): $box_1, box_2, \dots$

Single-stage object detector components. IV. Loss function: $L = \frac{1}{N_{obj}} \sum_{i=1}^{W \cdot H} \delta_i^{obj} \cdot (\dots)$ (train.ipynb)

Single-stage object detector components. V. Postprocessing: filtering + NMS. (data_postprocessing.py)

Single-stage object detector components. VI. Accuracy evaluation: the precision-recall curve. (evaluator.py)

I. Preprocessing: data augmentation. Possible augmentations: horizontal flip, vertical flip, zoom in/out, width and height shift, rotation within some range, shear, brightness shift, channel shift, hue change, saturation change, contrast change, gamma correction, histogram equalization.

I. Preprocessing: data augmentation (data_preprocessing.py). Augmentation gives more than a 10% accuracy (mAP) improvement. Here only horizontal flip and width/height shift are used (original vs. augmented). A sketch follows.
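A minimal sketch of the two augmentations used here (horizontal flip plus width/height shift), assuming boxes are an array of $(x_c, y_c, w, h)$ in pixels; boxes pushed outside the image should additionally be clipped or dropped, which is omitted:

```python
import numpy as np

def augment(image, boxes, rng=np.random):
    """Horizontal flip plus width/height shift, with matching box updates.
    `boxes` is an (n, 4) array of (x_c, y_c, w, h) in pixels."""
    h, w = image.shape[:2]
    boxes = boxes.astype(np.float32).copy()
    if rng.rand() < 0.5:                        # horizontal flip
        image = image[:, ::-1]
        boxes[:, 0] = w - boxes[:, 0]
    dx = int(rng.uniform(-0.1, 0.1) * w)        # shift up to 10% of width
    dy = int(rng.uniform(-0.1, 0.1) * h)        # shift up to 10% of height
    shifted = np.zeros_like(image)
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    boxes[:, 0] += dx                           # boxes pushed off-image should be
    boxes[:, 1] += dy                           # clipped or dropped (omitted here)
    return shifted, boxes
```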

I. Preprocessing: data normalization (data_preprocessing.py). Normalization can include: mapping pixel values from $(0, 255)$ to $(0, 1)$ or $(-1, 1)$; mean subtraction; division by the standard deviation; scaling rectangle coordinates to $[0, 1]$, which makes them independent of image scale.
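A minimal normalization sketch covering the listed options; the mean/std arguments are assumptions to be replaced with dataset statistics:

```python
import numpy as np

def normalize_image(image, mean=None, std=None):
    """Map pixels from (0, 255) to (-1, 1), or apply mean subtraction and
    division by the standard deviation when dataset statistics are given."""
    x = image.astype(np.float32)
    if mean is not None and std is not None:
        return (x - mean) / std
    return x / 127.5 - 1.0

def normalize_rect(box, img_w, img_h):
    """Scale rectangle coordinates to [0, 1]: independent of image scale."""
    xc, yc, w, h = box
    return (xc / img_w, yc / img_h, w / img_w, h / img_h)
```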

I. Preprocessing: data generator (data_preprocessing.py). __getitem__() reads a batch of images and their GT labels.

I. Preprocessing: data generator (data_preprocessing.py). __getitem__() generates a batch $(X, Y)$: the images $X_1, \dots, X_n$ are augmented and normalized; the labels $Y_1, \dots, Y_n$ are the grid outputs $(p, x_c, y_c, w, h)$. A generator sketch follows.
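A minimal generator sketch in Keras style (__getitem__ is the tf.keras Sequence protocol), reusing the augment, normalize_image and convert_gt_to_grid sketches above; holding all images in memory is a simplification:

```python
import numpy as np
from tensorflow.keras.utils import Sequence

class DetectionGenerator(Sequence):
    """Batch generator: augment and normalize images, encode GT boxes into
    the grid. `images` and `gt_boxes` are kept in memory here for brevity;
    the real generator would read from disk."""
    def __init__(self, images, gt_boxes, batch_size, nx=9, ny=9):
        self.images, self.gt_boxes = images, gt_boxes
        self.batch_size, self.nx, self.ny = batch_size, nx, ny

    def __len__(self):
        return len(self.images) // self.batch_size

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        X, Y = [], []
        for img, boxes in zip(self.images[sl], self.gt_boxes[sl]):
            img, boxes = augment(img, boxes)            # sketches from above
            h, w = img.shape[:2]
            X.append(normalize_image(img))
            Y.append(convert_gt_to_grid(boxes, w, h, self.nx, self.ny))
        return np.stack(X), np.stack(Y)
```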

II. Feature extractor (model.py). Only 3×3 filters (convolution + ReLU, 2×2 max pooling): 300×300×3 → 300×300×16 → 150×150×24 → 75×75×32 → 37×37×48 → 18×18×64 → 9×9×64. It is not an optimal feature extractor!

II. Feature extractor (model.py). The final 9×9×64 map is where the bounding boxes are encoded. Why such an architecture? Why 9×9? See the receptive field analysis below; a Keras sketch of the network follows.
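A Keras sketch reproducing the slide's shapes; the exact layer hyperparameters of model.py are assumptions:

```python
from tensorflow.keras import layers, models

def build_feature_extractor():
    """Reproduce the slide's shapes: 300x300x3 -> 300x300x16 -> 150x150x24
    -> 75x75x32 -> 37x37x48 -> 18x18x64 -> 9x9x64, using only 3x3 filters
    (convolution + ReLU) and 2x2 max pooling."""
    x = inp = layers.Input(shape=(300, 300, 3))
    filters_per_block = (16, 24, 32, 48, 64, 64)
    for i, filters in enumerate(filters_per_block):
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        if i < len(filters_per_block) - 1:   # pool after every conv but the last
            x = layers.MaxPooling2D(2)(x)
    return models.Model(inp, x)              # output: 9x9x64 feature map
```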

II. Feature extractor: effective receptive field. The effective receptive field is the area of the original image that can possibly influence the activation of a neuron. The building blocks here are 3×3 convolutions and 2×2 max pooling with stride 2. After the first 3×3 convolution (zero padding at the border): $r_1^{conv} = 3$.

II. Feature extractor: effective receptive field. Adding a 2×2 max pooling with stride 2 after the first convolution: $r_1^{conv} = 3$, $r_1^{pool} = 4$.

II. Feature extractor: effective receptive field, i.e. the area of the input image that the chosen feature is looking at. After conv → pool → conv the field grows: $r_1^{conv} = 3$, $r_1^{pool} = 4$, $r_2^{conv} = 8$.

II. Feature extractor: effective receptive field. For the stack of alternating 3×3 convolutions and 2×2 poolings: $r_1^{conv} = 3$, $r_2^{conv} = 8$, $r_3^{conv} = 18$.

II. Feature extractor: effective receptive field. Continuing: $r_4^{conv} = 38$, $r_5^{conv} = 78$.

II. Feature extractor: effective receptive field. At the final 9×9 grid: $r_6^{conv} = 158$.

II. Feature extractor: effective receptive field. A car in the 300×300 input spans roughly $2 \cdot 32 = 64$ pixels, about two cells of the 9×9 grid, so the receptive field $r_6^{conv} = 158$ covers it with a margin. The recursion behind these numbers is sketched below.
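These values follow from the standard recursion $r_{out} = r_{in} + (k - 1) \cdot j_{in}$, where $j$ is the product of the strides so far; a small sketch that reproduces the numbers above:

```python
def receptive_field(layer_stack):
    """Effective receptive field after a stack of (kernel_size, stride)
    layers: r grows by (k - 1) * j at each layer, where j is the product
    of the strides applied so far."""
    r, j = 1, 1
    for k, s in layer_stack:
        r += (k - 1) * j
        j *= s
    return r

# Alternating 3x3 conv (stride 1) and 2x2 max pool (stride 2), as in the slides.
stack = []
for _ in range(6):
    stack.append((3, 1))
    stack.append((2, 2))
print(receptive_field(stack[:1]))    # 3   = r1_conv
print(receptive_field(stack[:2]))    # 4   = r1_pool
print(receptive_field(stack[:3]))    # 8   = r2_conv
print(receptive_field(stack[:5]))    # 18  = r3_conv
print(receptive_field(stack[:11]))   # 158 = r6_conv
```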

II. Feature extractor: effective receptive field. The receptive field has to contain the object with a margin: at the 3rd convolution (the 37×37×48 map) it is only 18×18, which cannot recognize the car position.

II. Feature extractor: effective receptive field. Conversely, a very large receptive field makes it hard to localize small objects.

III. Model head (model.py). Feature extractor → head (a 3×3 convolution) → output.

III. Model head (model.py). Possible improvement: attach additional heads to earlier, higher-resolution feature maps for smaller objects.

IV. Loss function (model.ipynb, function YOLO_loss(y_true, y_pred)). Compare GT with the prediction: for an 'object' cell all of $(p, x, y, w, h)$ matter; for a 'no object' cell we do not care about $x, y, w, h$ and compare only $p$ (denoted $y_0$):
$L = \frac{1}{N_{obj}} \sum_{y^{GT} \text{ is obj}} \| y^{GT} - y^{pred} \| + \frac{1}{N_{no\_obj}} \sum_{y^{GT} \text{ is not obj}} \| y_0^{GT} - y_0^{pred} \|$
A TensorFlow sketch follows.
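A TensorFlow sketch of this loss under the (batch, $N_y$, $N_x$, 5) layout used above; it mirrors the slide's formula with an L2 distance, not necessarily the exact YOLO_loss from model.ipynb:

```python
import tensorflow as tf

def yolo_loss(y_true, y_pred):
    """L2 loss on (p, x, y, w, h) for 'object' cells and on p alone for
    'no object' cells, each normalized by its cell count.
    Tensors have shape (batch, ny, nx, 5) with p = y[..., 0]."""
    obj_mask = y_true[..., 0]                       # 1 where the cell has an object
    no_obj_mask = 1.0 - obj_mask
    n_obj = tf.maximum(tf.reduce_sum(obj_mask), 1.0)
    n_no_obj = tf.maximum(tf.reduce_sum(no_obj_mask), 1.0)
    obj_loss = tf.reduce_sum(
        obj_mask[..., None] * tf.square(y_true - y_pred)) / n_obj
    no_obj_loss = tf.reduce_sum(                    # only p matters here
        no_obj_mask * tf.square(y_true[..., 0] - y_pred[..., 0])) / n_no_obj
    return obj_loss + no_obj_loss
```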

IV. Loss function: possible improvements. 1. Hard negative mining: take into account only $3 \cdot n$ negatives, where $n$ is the number of positives. 2. Use binary classification + a cross-entropy loss: probabilities for $c$ classes + 1 background (here $p_{car}$, $p_{no\_car}$) alongside the coordinates $(x, y, w, h)$.

IV. Loss function: possible improvements. 3. Use $M$ anchor boxes with different aspect ratios in each cell, so the box output becomes $W \times H \times M \times 5$; each anchor gets its own tuple, e.g. for $M = 2$: $(p_1, x_1, y_1, w_1, h_1, c_1, c_2)$ for $box_1$ and $(p_2, x_2, y_2, w_2, h_2, c_1, c_2)$ for $box_2$, with the class probabilities ($c$ classes + 1 background) repeated $M$ times.

V. Postprocessing. The prediction is a $W \times H \times 5$ tensor over the $W \times H$ cells: a confidence $p$ plus cell-relative coordinates $(x, y, w, h)$. First convert: relative to cell → relative to image.

V. Postprocessing. After the coordinate conversion the $W \times H$ raw predictions still need filtering.

V. Postprocessing: filtering. Pick a threshold, e.g. $T = 0.5$, and keep the $N$ rectangles with $p > T$: the result is $N$ scores and an $N \times 4$ array of rectangles in image coordinates. A sketch follows.
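A minimal decode-and-filter sketch for a single-box-per-cell prediction:

```python
import numpy as np

def decode_and_filter(pred, img_w, img_h, conf_thr=0.5):
    """Convert an (ny, nx, 5) prediction from cell-relative to image
    coordinates and keep the rectangles with confidence p > conf_thr."""
    ny, nx = pred.shape[:2]
    cell_w, cell_h = img_w / nx, img_h / ny
    scores, rects = [], []
    for j in range(ny):
        for i in range(nx):
            p, x, y, w, h = pred[j, i]
            if p > conf_thr:
                scores.append(p)
                rects.append(((i + x) * cell_w,   # center x in pixels
                              (j + y) * cell_h,   # center y in pixels
                              w * cell_w,         # width in pixels
                              h * cell_h))        # height in pixels
    return np.array(scores), np.array(rects)      # N scores, N x 4 rectangles
```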

V. Postprocessing: non-maximum suppression. Sort the rectangles $rect_1, rect_2, rect_3, \dots, rect_N$ by confidence $p$: 0.9, 0.85, ...

V. Postprocessing: non-maximum suppression. Compare the IoU of the 1st (highest-confidence) rectangle with all the others.

V. Postprocessing: non-maximum suppression. The rectangle with confidence 0.85 overlaps the 1st one with $IOU > 0.5$ (the threshold): it will be thrown out.

V. Postprocessing: non-maximum suppression. Next, the rectangle with confidence 0.82 is compared with the 1st one against the same $IOU > 0.5$ threshold.

V. Postprocessing: non-maximum suppression. For the rectangle with confidence 0.82, $IOU = 0$ with the 1st one: do nothing.

V. Postprocessing: non-maximum suppression. The rectangle with confidence 0.77 also has $IOU = 0$ with the chosen rectangle: do nothing!

V. Postprocessing: non-maximum suppression. The rectangle with confidence 0.75 has $IOU < 0.5$ with the 1st one: do nothing.

V. Postprocessing: non-maximum suppression. Now compare the IoU of the 2nd remaining rectangle (confidence 0.82) with the ones after it; the list length is now $N_1 \le N$ because rectangles have been thrown out.

V. Postprocessing: non-maximum suppression. Again, a rectangle with $IOU > 0.5$ relative to the 2nd one is thrown out.

V. Postprocessing: non-maximum suppression. The procedure repeats down the list (survivors here: 0.9, 0.82, 0.77); in total the IoU is computed $(N-1) + (N_1-2) + (N_2-3) + \dots$ times. A sketch follows.
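A minimal greedy NMS sketch; boxes are assumed to be in corner format $(x_1, y_1, x_2, y_2)$, so the center-format rectangles from the previous step would be converted first:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in corner format (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def nms(rects, scores, iou_thr=0.5):
    """Greedy NMS as in the slides: keep the highest-confidence rectangle,
    throw out the rest with IoU above the threshold, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        top = order.pop(0)
        keep.append(top)
        order = [k for k in order if iou(rects[top], rects[k]) <= iou_thr]
    return keep   # indices of the surviving rectangles
```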

VI. Accuracy evaluation. Given $n$ predicted boxes and $m$ ground-truth boxes, how do we establish the correspondence between GT and predictions? Via $IoU = \frac{|A \cap B|}{|A \cup B|}$, where $A$ is a ground-truth box and $B$ a detector result.

VI. Accuracy evaluation. Build a descending sorted array $(iou_1, iou_2, iou_3, \dots)$ of all GT/prediction pairs with $IOU > T$ (maximum length $n \cdot m$). evaluator.py -> sort_ious(gt_boxes, pred_boxes, iou_thr)

VI. Accuracy evaluation. Walk through the sorted array: a pair is accepted as a match only if both its GT box and its predicted box appear for the first time. evaluator.py -> get_single_image_results(gt_boxes, pred_boxes, iou_thr)

VI. Accuracy evaluation: true predicted. The accepted pairs form the matched GT and matched prediction sets; their size is the number of true predictions. evaluator.py -> get_single_image_results(gt_boxes, pred_boxes, iou_thr). A matching sketch follows.
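A minimal matching sketch in the spirit of get_single_image_results, reusing the iou helper from the NMS sketch; the exact repo implementation may differ:

```python
def match_boxes(gt_boxes, pred_boxes, iou_thr):
    """Greedy matching as in the slides: sort all GT/prediction pairs with
    IoU > iou_thr in descending order and accept a pair only when both its
    GT box and its predicted box appear for the first time."""
    pairs = [(iou(g, p), gi, pi)
             for gi, g in enumerate(gt_boxes)
             for pi, p in enumerate(pred_boxes)]
    pairs = sorted((t for t in pairs if t[0] > iou_thr), reverse=True)
    matched_gt, matched_pred = set(), set()
    for _, gi, pi in pairs:
        if gi not in matched_gt and pi not in matched_pred:
            matched_gt.add(gi)
            matched_pred.add(pi)
    return matched_gt, matched_pred   # len(matched_pred) = true predictions
```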

VI. Accuracy evaluation: precision, recall (getTruePredicted, $T = 0.5$). precision = true predicted / predicted: what part of the predictions is true. recall = true predicted / ground truth: what part of all GT objects is truly predicted.

VI. Accuracy evaluation: precision, recall (getTruePredicted, $T = 0.5$). Left example: predicted = 1, ground truth = 2, true predicted = 1, so precision = 1/1 = 1.0 and recall = 1/2 = 0.5. Right example: predicted = 4, ground truth = 3, true predicted = 1, so precision = 1/4 = 0.25 and recall = 1/3 ≈ 0.33.

VI. Accuracy evaluation: precision, recall. evaluator.py -> calc_precision_recall(img_results); a sketch follows.
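A minimal sketch of such a calc_precision_recall, applied to the two examples above (the repo's function takes per-image results; this simplified version takes the counts directly):

```python
def calc_precision_recall(n_true_pred, n_pred, n_gt):
    """precision = true predicted / predicted, recall = true predicted / GT."""
    precision = n_true_pred / n_pred if n_pred else 0.0
    recall = n_true_pred / n_gt if n_gt else 0.0
    return precision, recall

print(calc_precision_recall(1, 1, 2))   # (1.0, 0.5)      - left example
print(calc_precision_recall(1, 4, 3))   # (0.25, 0.33...) - right example
```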

VI. Accuracy evaluation: confidence ↔ precision, recall. Every predicted box has a confidence $p$; sort all box confidences into $(p_1, p_2, \dots, p_i, \dots, p_{N-1}, p_N)$, e.g. 0.1, 0.2, 0.3, 0.4, 0.4, ...

VI. Accuracy evaluation: confidence ↔ precision, recall. I. Take the boxes with $p \ge p_1$, i.e. all boxes, and calcPrecisionRecall over all images together: e.g. score $p_1$ → precision 0.1, recall 0.9 (high recall, low precision).

VI. Accuracy evaluation: confidence ↔ precision, recall. II. Repeat for $p \ge p_2$: precision 0.2, recall 0.8; and so on, always over all images together.

VI. Accuracy evaluation: confidence ↔ precision, recall. ... up to $p \ge p_N$: high precision, low recall.

VI. Accuracy evaluation: confidence ↔ precision, recall. The sweep yields a table of thresholds and metrics, get_thr_prec_rec(...):
score | precision | recall
p_1   | 0.1       | 0.9
p_2   | 0.2       | 0.8
...   | ...       | ...
p_N   | high      | low

VI. Accuracy evaluation: average precision calculation ($T_{IOU} = 0.5$). Plotting the (recall, precision) pairs from this table, hundreds of values for real datasets, gives the precision-recall curve.

VI. Accuracy evaluation: average precision calculation ($T_{IOU} = 0.5$). $AP_T = \frac{1}{11} \sum_{r \in \{0, 0.1, \dots, 1\}} \max_{\tilde{r}: \tilde{r} \ge r} p(\tilde{r})$: interpolated precision averaged over 11 recall thresholds.

VI. Accuracy evaluation: average precision calculation. For example, at $r = 0.3$ the term is $\max_{\tilde{r} \ge 0.3} p(\tilde{r})$. Many mAP tutorials get this wrong: the "area under curve" (AUC) is not the same!

VI. Accuracy evaluation: average precision calculation. The Pascal VOC metric is $AP_{0.5}$. A sketch follows.
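A minimal sketch of the 11-point interpolated $AP_T$, assuming the recalls and precisions come from the confidence sweep above:

```python
import numpy as np

def average_precision_11pt(recalls, precisions):
    """Pascal VOC 11-point AP: at each recall threshold r in {0, 0.1, ..., 1}
    take the maximum precision among points with recall >= r, then average.
    Note: this interpolated AP is not the raw area under the PR curve."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11.0
```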

VI. Accuracy evaluation: average precision calculation. $AP = \frac{1}{10} \sum_{T \in \{0.5, 0.55, \dots, 0.95\}} AP_T$; MS COCO reports the metrics $AP_{0.5}$, $AP_{0.75}$ and $AP$.

VI. Accuracy evaluation: average precision calculation. MS COCO uses 10 IoU thresholds 0.5, 0.55, ..., 0.95. [Plot: precision-recall curves for $T_{IOU} = 0.5, 0.6, 0.7, 0.8, 0.9$.]