Slide 1: Object Tracking: Comparison of VGG16 and SSD
Sabhatina Selvam
Slide 2: Executive Summary

Proposed work: performance comparison of the two detectors.

Results compiled for:
- Mean average precision
- Validation loss
- Convergence time
- Single-object vs. multi-object tracking

Pending: analysis of further parameters for frame-rate vs. resolution trade-offs.
Slide 3: Key difference in concept

VGG16 regressor: a two-stage method.
- Stage 1: feature extraction with VGG16 (transfer learning).
- Stage 2: bounding box regression on top.

Single Shot Detector (SSD): a one-stage method.
- Dense network with a VGG16 base.
- A CNN with two parallel predictors, for bounding boxes and class scores.
Slide 4: VGG16 Transfer Learning Inference

Figure 1: Feature hierarchy from edges and shapes up to high-level features feeding the bounding box regression. Image source: Inception V3, Google Research.
Slide 5: SSD parallel layers

Figure 2: Object detection pipeline with region-of-interest pooling (source: deepdense.ai).
Figure 3: ROI result on a Pascal VOC cat image.
Slide 6: A GIF for better understanding!

Figure 4: Feature map animation showing the ROI pooling steps: input, ROI selector, max pooling. A code sketch of the same operation follows.
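To make the animation concrete, here is a minimal NumPy sketch of ROI max pooling: a region of the feature map is divided into a fixed grid and each cell is max-pooled. The function name and arguments are illustrative, not taken from the slides.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool a region of interest of a 2-D feature map down to a
    fixed output size, as in region-of-interest pooling."""
    x0, y0, x1, y1 = roi  # ROI corners in feature-map coordinates
    region = feature_map[y0:y1, x0:x1]
    rows, cols = output_size
    # Split the region into a rows x cols grid; take the max of each cell.
    h_edges = np.linspace(0, region.shape[0], rows + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], cols + 1).astype(int)
    out = np.empty((rows, cols), dtype=feature_map.dtype)
    for i in range(rows):
        for j in range(cols):
            out[i, j] = region[h_edges[i]:h_edges[i + 1],
                               w_edges[j]:w_edges[j + 1]].max()
    return out

# Example: pool a 3x3 ROI of a 4x4 feature map down to 2x2.
fm = np.arange(16, dtype=float).reshape(4, 4)
print(roi_max_pool(fm, roi=(0, 0, 3, 3)))  # [[0., 2.], [8., 10.]]
```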
Slide 7: Architectures

Table 1: Regression units

FCN layer   Neurons   Activation
Layer 1     4096      Leaky ReLU
Layer 2     1024
Layer 3     512
Layer 4     100
Layer 5     4         Linear

Feature layers are extracted from the VGG16 base.
Fig 5: VGG16 network.
Fig 6: SSD network. Image source: Wei Liu, et al., 2016.
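A minimal Keras sketch of the regression architecture described by Table 1, assuming the blank activation cells for layers 2 through 4 are also Leaky ReLU (the fine-tuning slide mentions Leaky ReLUs) and that the convolutional base is frozen for transfer learning:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten, LeakyReLU
from tensorflow.keras.models import Sequential

# Frozen VGG16 convolutional base (transfer learning), ImageNet weights.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Regression head per Table 1: 4096 -> 1024 -> 512 -> 100 -> 4.
model = Sequential([
    base,
    Flatten(),
    Dense(4096), LeakyReLU(),
    Dense(1024), LeakyReLU(),  # activation assumed; cell blank in Table 1
    Dense(512), LeakyReLU(),   # activation assumed; cell blank in Table 1
    Dense(100), LeakyReLU(),   # activation assumed; cell blank in Table 1
    Dense(4, activation="linear"),  # bounding-box coordinates
])
```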
Slide 8: Dataset

VGG16
- Training: Pascal VOC 2012 (17,125 images)
- Validation: Pascal VOC 2007 (9,963 images)
- Testing: Pascal VOC 2007

SSD
- Training: Pascal VOC + COCO
- Validation: an 80:20 partition of the training data
- Testing: Pascal VOC + COCO

Table 2: PASCAL VOC description
- VOC 2007: 20 classes (Person, Animal, Vehicle, Indoor, etc.); train/validation/test: 9,963 images containing 24,640 annotated objects.
- VOC 2012: 20 classes; train/validation data has 11,530 images containing 27,450 ROI-annotated objects and 6,929 segmentations.

Table 3: COCO description
- 164K complex images
- 80 "thing" classes, 91 "stuff" classes, and 1 unlabeled class
- Instance-level annotations for things
- 5 captions per image
Slide 9: Customizing the dataset

- Scaling: read the XML annotation files of the PASCAL VOC and COCO datasets for bounding box coordinates and scaled them by the width and height of the image.
- Stored the image filenames, sizes, object names, object locations, and difficulty attributes in a text file.
- Resized input images to 3 x 224 x 224 for VGG16; for SSD, to 512 x 512 x 3 (VOC) and 300 x 300 x 3 (COCO).
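A sketch of the annotation step for the VOC side, assuming the standard PASCAL VOC XML layout; `read_voc_annotation` is a hypothetical helper, not the project's actual code:

```python
import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Parse one PASCAL VOC annotation file; return the image size and the
    bounding boxes scaled to [0, 1] by image width and height."""
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        objects.append({
            "name": obj.find("name").text,
            "difficult": int(obj.find("difficult").text),
            "box": (float(box.find("xmin").text) / w,
                    float(box.find("ymin").text) / h,
                    float(box.find("xmax").text) / w,
                    float(box.find("ymax").text) / h),
        })
    return (w, h), objects
```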
Slide 10: Fine-tuning VGG16

- Leaky ReLUs and dropout for regularization.
- Loss functions tried: logcosh, hinge, MSE, and IoU; logcosh turned out to be the best.
- Tried two optimizers, RMSProp and SGD, concluding that both perform equally well.
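Continuing the Keras sketch above, the winning configuration might be compiled like this; the `decay` argument assumes the legacy Keras SGD signature:

```python
from tensorflow.keras.optimizers import SGD

# Log-cosh regression loss with the SGD settings from the training slide;
# optimizer="rmsprop" performed comparably in the experiments.
model.compile(
    optimizer=SGD(learning_rate=1e-2, momentum=0.9, decay=1e-6),
    loss="logcosh",
)
```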
Slide 11: Code and software platform

- VGG16 implemented with Keras on Euler with 4 NVIDIA GTX GPUs.
- SSD implemented with PyTorch on Euler with 4 NVIDIA GTX GPUs.
- Code for SSD taken from GitHub; many pull-request fixes had to be applied to make it work on my end. Changed some of the function flow to store state dicts after every 100 iterations and weights after every 1,000 iterations, as sketched below.
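A minimal sketch of that checkpointing change; `save_checkpoints` is a hypothetical helper standing in for the modified training-loop code:

```python
import torch

def save_checkpoints(model, optimizer, iteration):
    """Store full state dicts every 100 iterations and a weights-only
    snapshot every 1,000, mirroring the modified SSD training flow."""
    if iteration % 100 == 0:
        torch.save({
            "iteration": iteration,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }, f"checkpoint_{iteration}.pth")
    if iteration % 1000 == 0:
        torch.save(model.state_dict(), f"weights_{iteration}.pth")
```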
Slide 12: VGG16 training

- Input resolution: 3 x 224 x 224
- Feature extractor: VGG16
- Regression loss: logcosh
- Optimizers: SGD (lr = 1e-2, momentum = 0.9, decay = 1e-6) and RMSProp
- IoU threshold: 0.5
- Accuracy: 60.3%
- Epochs: 50
- Training set: 27,188 JPEG images and annotations
- Convergence time: ~5 hours

Figure 5: Intersection over union.
Figure 6: Validation loss vs. epoch.
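For reference, intersection over union for two corner-format boxes can be computed as in this small sketch; a prediction counts as correct when its IoU with the ground truth reaches the 0.5 threshold:

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Overlap of 1 against a union of 7: IoU ~0.143, a miss at threshold 0.5.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```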
Slide 13: SSD training

- Input resolution: 512 x 512 (VOC), 300 x 300 (COCO)
- Base feature-extraction model: VGG16
- Optimizer: SGD (lr = 1e-2, momentum = 0.9, decay = 1e-6)
- Localization loss: SmoothL1
- Confidence loss: softmax loss
- Scaling and hard negative mining
- IoU threshold: 0.5
- Accuracy: 75%
- Training time: ~10 hours
- Training set: 27,188 JPEG images and annotations, plus the COCO dataset

Figure 7: Class-wise predictions.
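A simplified PyTorch sketch of the objective named above: SmoothL1 localization loss on matched boxes plus a softmax confidence loss with 3:1 hard negative mining. This illustrates the idea and is not the repository's implementation:

```python
import torch
import torch.nn.functional as F

def ssd_loss(loc_pred, loc_true, conf_pred, labels, neg_pos_ratio=3):
    """Per-image SSD loss over default boxes. labels is a long tensor with
    0 = background (negative); loc_* are (N, 4), conf_pred is (N, classes)."""
    pos = labels > 0
    num_pos = int(pos.sum())

    # Localization loss only over positive (matched) default boxes.
    loc_loss = F.smooth_l1_loss(loc_pred[pos], loc_true[pos], reduction="sum")

    # Confidence loss for every box, then keep only the hardest negatives.
    conf_loss = F.cross_entropy(conf_pred, labels, reduction="none")
    neg_loss = conf_loss.clone()
    neg_loss[pos] = 0  # exclude positives from the negative ranking
    num_neg = min(neg_pos_ratio * num_pos, int((~pos).sum()))
    hard_neg, _ = neg_loss.sort(descending=True)
    conf_total = conf_loss[pos].sum() + hard_neg[:num_neg].sum()

    return (loc_loss + conf_total) / max(num_pos, 1)
```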
Slide 14: Every 100 and every 1,000 iterations

Figure 8: Loss vs. iteration.
Slide 15: State-of-the-art results

VGG16
- Mean average precision (mAP): 60.3%
- Mean IoU: 0.65
- No classification, only localization.

SSD
- Mean AP: ~75%
- Total training loss: ~2
- Mean IoU: 0.85
- Multiple-object detection.

Source: Object detection: speed and accuracy comparison (Faster R-CNN, R-FCN, SSD, FPN, RetinaNet and YOLOv3).
Slide 16: References

- Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition" (Visual Geometry Group, Department of Engineering Science, University of Oxford).
- Jifeng Dai, Yi Li, Kaiming He, and Jian Sun, "R-FCN: Object Detection via Region-based Fully Convolutional Networks," Advances in Neural Information Processing Systems 29 (NIPS 2016).
- Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, "SSD: Single Shot MultiBox Detector," Lecture Notes in Computer Science, vol. 9905.
Slide 17: Thank You