1
Object Detection Implementations
Ryan Luna and Rene Reyes, 4/16/2019
2
Methods Researched
3
You Only Look Once (YOLO)
YOLO’s architecture resembles a standard CNN: 24 convolutional layers followed by 2 fully connected (FC) layers. Some convolutional layers alternate with 1 × 1 reduction layers to reduce the depth of the feature maps. The last convolutional layer outputs a tensor of shape (7, 7, 1024), which is then flattened. The 2 fully connected layers act as a form of linear regression, outputting 7 × 7 × 30 parameters that are reshaped to (7, 7, 30), i.e. 2 bounding box predictions per grid location.
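A minimal PyTorch sketch of that head (not the authors' code); S = 7, B = 2, C = 20 and the 4096-unit hidden layer follow the original YOLO paper's Pascal VOC configuration:

```python
import torch
import torch.nn as nn

class YOLOHead(nn.Module):
    """Flatten the final (7, 7, 1024) feature map, apply two fully
    connected layers, and reshape to (S, S, B*5 + C) = (7, 7, 30)."""
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S, self.depth = S, B * 5 + C            # depth = 30 for VOC
        self.fc = nn.Sequential(
            nn.Flatten(),                            # (N, 1024, 7, 7) -> (N, 50176)
            nn.Linear(S * S * 1024, 4096),
            nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * self.depth),     # linear-regression-style output
        )

    def forward(self, feats):                        # feats: (N, 1024, 7, 7)
        return self.fc(feats).view(-1, self.S, self.S, self.depth)

print(YOLOHead()(torch.rand(2, 1024, 7, 7)).shape)   # torch.Size([2, 7, 7, 30])
```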
4
You Only Look Once (YOLO)
YOLO splits the n × n input image into an S × S grid, where each cell predicts B bounding boxes. Each bounding box consists of 5 predictions:
Coordinates (x, y) representing the center of the bounding box.
The height and width (h, w) of the box, predicted relative to the whole image.
A confidence score representing the intersection over union (IOU) between the predicted box and any ground-truth box.
YOLO predicts only one set of class probabilities per grid cell, regardless of the number of boxes generated in that cell.
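To make the layout concrete, a small NumPy sketch that unpacks one cell's prediction vector; the ordering (B boxes of 5 values each, then C class probabilities) is an assumption for illustration, as implementations pack the tensor differently:

```python
import numpy as np

B, C = 2, 20
cell = np.random.rand(B * 5 + C)       # stand-in for one cell of the (7, 7, 30) tensor

boxes = cell[:B * 5].reshape(B, 5)     # each row: x, y, w, h, confidence
class_probs = cell[B * 5:]             # one set of C class probabilities per cell

for x, y, w, h, conf in boxes:
    print(f"center=({x:.2f}, {y:.2f})  size=({w:.2f}, {h:.2f})  conf={conf:.2f}")
```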
5
You Only Look Once (YOLO)
After dividing the image into S × S grid cells, YOLO generates a class probability map along with the bounding boxes and confidence scores for those boxes. Its system models detection as a regression problem, producing an S × S × (B·5 + C) tensor. The class confidence score is the product of the box confidence score and the conditional class probability.
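A short NumPy sketch of that product, assuming the packing above with the VOC-style settings S = 7, B = 2, C = 20 (so the output tensor is 7 × 7 × 30):

```python
import numpy as np

S, B, C = 7, 2, 20                        # VOC-style settings from the slides
pred = np.random.rand(S, S, B * 5 + C)    # stand-in for the (7, 7, 30) network output

box_conf = pred[..., :B * 5].reshape(S, S, B, 5)[..., 4]   # (S, S, B)
class_prob = pred[..., B * 5:]                             # (S, S, C)

# class confidence = box confidence * conditional class probability
class_scores = box_conf[..., None] * class_prob[:, :, None, :]
print(class_scores.shape)                 # (7, 7, 2, 20): one score per box per class
```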
6
You Only Look Once v3 (YOLOv3)
About 30 FPS with an mAP of 57.9% on COCO test-dev using a Pascal Titan X.
Uses an FPN-style network to run detections at three different scales, downsampling the input dimensions by factors of 32, 16, and 8 respectively. This helps detect and classify objects in an image more accurately, and in particular helps detect smaller objects.
Generates up to 9 anchor boxes (3 for each scale), which helps localize objects of different sizes more precisely.
Class Predictions: where YOLO uses a softmax layer to convert scores into probabilities, YOLOv3 uses binary cross-entropy for each label, so it can handle non-exclusive labels when calculating the probability that the input belongs to a specific label. This also reduces computational complexity by avoiding the softmax layer.
Tiny YOLOv3 uses scaled-down tensors to speed up detection on an image, but this comes at a cost: it loses accuracy.
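A few lines of NumPy show why the swap matters; the logits are made-up scores for three hypothetical labels, two of which overlap:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.8, -1.0])   # scores for, say, "person", "woman", "car"

# Softmax (YOLO): probabilities compete and must sum to 1, so two
# overlapping, non-exclusive labels cannot both score high.
print(softmax(logits).round(2))       # [0.54 0.44 0.03]

# Independent sigmoids with binary cross-entropy (YOLOv3): each label
# is scored on its own, so "person" and "woman" can both be near 1.
print(sigmoid(logits).round(2))       # [0.88 0.86 0.27]
```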
7
Mask R-CNN
An extension of Faster R-CNN (Region-based Convolutional Neural Network).
Adds a branch for predicting an object mask in parallel with the branch for bounding box recognition.
Uses RoIAlign as an alternative to RoIPool to better preserve exact spatial locations within a region.
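The RoIPool vs. RoIAlign difference can be sketched with torchvision.ops; the feature-map size, spatial_scale, and RoI below are illustrative values, not numbers from the paper:

```python
import torch
from torchvision.ops import roi_align, roi_pool

feat = torch.rand(1, 256, 50, 50)              # feature map at 1/16 input resolution
rois = torch.tensor([[0, 100.0, 120.0, 300.0, 360.0]])  # (batch_idx, x1, y1, x2, y2)

# RoIPool quantizes box edges and bin boundaries to integer coordinates,
# losing sub-pixel alignment.
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)

# RoIAlign keeps fractional coordinates and bilinearly samples the feature
# map, preserving exact spatial locations (important for mask quality).
aligned = roi_align(feat, rois, output_size=(7, 7),
                    spatial_scale=1 / 16, sampling_ratio=2)
print(pooled.shape, aligned.shape)             # both torch.Size([1, 256, 7, 7])
```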
8
Mask R-CNN: FPN (Feature Pyramid Network)-style deep neural network
Uses a bottom-up pathway, a top-down pathway, and lateral connections. Going up the bottom-up pathway, spatial resolution decreases; while higher-level structures are detected in the reduced maps, the semantic value of each layer increases. A top-down pathway is then used to construct higher-resolution layers from a semantically rich layer.
SSD (Single-Shot Detector) makes detections from multiple feature maps, but its bottom layers are not selected for object detection: they are high resolution, yet their semantic value is not high enough to justify the significant slow-down of using them. So SSD only uses upper layers for detection and therefore performs much worse on small objects.
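A minimal PyTorch sketch of the top-down pathway with lateral connections, assuming ResNet-style stage widths (256 through 2048 channels) and nearest-neighbor upsampling; this illustrates the idea rather than reproducing Mask R-CNN's exact FPN:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """1x1 lateral convs project each bottom-up stage to a common width;
    each semantically rich map is upsampled and added to the next lateral."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), width=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(width, width, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                  # [c2, c3, c4, c5], fine -> coarse
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        outs = [laterals[-1]]                  # start from the semantic-rich top
        for lat in reversed(laterals[:-1]):
            up = F.interpolate(outs[0], size=lat.shape[-2:], mode="nearest")
            outs.insert(0, lat + up)           # lateral connection + upsampling
        return [s(o) for s, o in zip(self.smooth, outs)]   # [p2, p3, p4, p5]

feats = [torch.rand(1, c, s, s) for c, s in
         [(256, 64), (512, 32), (1024, 16), (2048, 8)]]
print([tuple(p.shape) for p in TopDownFPN()(feats)])
```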
9
Mask R-CNN can be split into two stages.
The 1st stage is an RPN (Region Proposal Network), a lightweight neural network that scans the FPN top-down pathway and proposes regions of the image where objects may reside. To bind features to their raw image locations after scanning, it generates anchors: a set of boxes with predefined locations and scales relative to the image. Ground-truth classes and bounding boxes are assigned to individual anchors according to an IoU (Intersection over Union) threshold. Because anchors of different scales bind to different feature map levels, the RPN uses them to determine which locations on the feature map should contain an object and what the size of its bounding box should be.
The 2nd stage is another neural network that takes the proposed regions and assigns them to specific areas of a feature map level, using RoIAlign to locate the relevant areas. It then scans those areas and generates the object class, bounding box, and mask.
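A small NumPy sketch of the IoU-based anchor assignment; the 0.7 / 0.3 thresholds are the usual RPN defaults from Faster R-CNN, and the boxes are made up for illustration:

```python
import numpy as np

def iou(anchors, gt):
    """IoU between each anchor and one ground-truth box, as (x1, y1, x2, y2)."""
    x1 = np.maximum(anchors[:, 0], gt[0])
    y1 = np.maximum(anchors[:, 1], gt[1])
    x2 = np.minimum(anchors[:, 2], gt[2])
    y2 = np.minimum(anchors[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter)

anchors = np.array([[0, 0, 64, 64], [32, 32, 96, 96], [0, 0, 128, 128]], float)
gt = np.array([30, 30, 100, 100], float)

scores = iou(anchors, gt)
labels = np.where(scores >= 0.7, 1, np.where(scores < 0.3, 0, -1))
print(scores.round(2), labels)    # 1 = object, 0 = background, -1 = ignored
```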
10
Results
11
Methods Used and Time per Image
Mask R-CNN processed about 5 to 10 seconds per image.
YOLOv3 processed about 0.5 to 1 second per image; the test video took about 818 seconds, or about 13.6 minutes.
Tiny YOLOv3 processed about 0.05 to 0.08 seconds per image, but was much less accurate than the other two methods; the test video took about 85 seconds, or about 1.4 minutes.
12
Website Implementation
13
Website for Object Detection Services
Problems to consider:
What language and framework to use?
How to process requests from multiple consumers, and provide asynchronous communication to show progress?
14
Website for Object Detection Services
Django
Django is a web framework written in Python.
It makes websites fast to build, and is a very secure and scalable framework.
15
Using Celery and RabbitMQ
Problems
Adjusting the number of worker and pool processes: the number of tasks that can run concurrently is limited by the number of workers.
We weren't able to run Mask R-CNN successfully through Celery, but we were able to run it by adding the code to the app's views.py, which was not the ideal method.
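A minimal sketch of the kind of Celery task this setup implies, assuming a local RabbitMQ broker; run_detector is a hypothetical stand-in for the actual inference code:

```python
from celery import Celery

app = Celery("detector", broker="amqp://guest:guest@localhost//",
             backend="rpc://")

def run_detector(image_path):
    # Stand-in for the real YOLOv3 / Mask R-CNN inference call.
    return [{"label": "person", "box": [0, 0, 10, 10]}]

@app.task(bind=True)
def detect_objects(self, image_path):
    """Runs detection outside the request/response cycle: workers pull
    tasks from RabbitMQ, so multiple consumers can be served, and
    update_state lets the site poll and report progress."""
    self.update_state(state="PROGRESS", meta={"step": "running detector"})
    return {"image": image_path, "boxes": run_detector(image_path)}

# In a Django view, enqueue instead of running inline:
#   result = detect_objects.delay(path)   # returns immediately
#   result.id can then be polled to report progress to the client.
```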
16
Motivation
17
Real-Time Object Detection for Security
18
Real-Time Object Detection for Security
Entering Scene / Leaving Scene
19
Future Work
20
Generative Adversarial Networks (GAN)
21
Perceptual GAN