Object detection, deep learning, and R-CNNs

Slides:

Advertisements

Similar presentations

Rich feature Hierarchies for Accurate object detection and semantic segmentation Ross Girshick, Jeff Donahue, Trevor Darrell, Jitandra Malik (UC Berkeley)

Advertisements

Histograms of Oriented Gradients for Human Detection

Lecture 6: Classification & Localization

ImageNet Classification with Deep Convolutional Neural Networks

Ivan Laptev IRISA/INRIA, Rennes, France September 07, 2006 Boosted Histograms for Improved Object Detection.

Computer Vision for Human-Computer InteractionResearch Group, Universität Karlsruhe (TH) cv:hci Dr. Edgar Seemann 1 Computer Vision: Histograms of Oriented.

Lecture 31: Modern object recognition

Many slides based on P. FelzenszwalbP. Felzenszwalb General object detection with deformable part-based models.

Histograms of Oriented Gradients for Human Detection Navneet Dalal and Bill Triggs CVPR 2005 Another Descriptor.

Object Category Detection: Sliding Windows Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem 04/10/12.

Enhancing Exemplar SVMs using Part Level Transfer Regularization 1.

More sliding window detection: Discriminative part-based models Many slides based on P. FelzenszwalbP. Felzenszwalb.

Recognition: A machine learning approach

The SIFT (Scale Invariant Feature Transform) Detector and Descriptor

Generic object detection with deformable part-based models

Computer Vision CS 776 Spring 2014 Recognition Machine Learning Prof. Alex Berg.

“Secret” of Object Detection Zheng Wu (Summer intern in MSRNE) Sep. 3, 2010 Joint work with Ce Liu (MSRNE) William T. Freeman (MIT) Adam Kalai (MSRNE)

Object Detection with Discriminatively Trained Part Based Models

Lecture 31: Modern recognition CS4670 / 5670: Computer Vision Noah Snavely.

Pedestrian Detection and Localization

Deformable Part Model Presenter ： Liu Changyu Advisor ： Prof. Alex Hauptmann Interest ： Multimedia Analysis April 11 st, 2013.

Deformable Part Models (DPM) Felzenswalb, Girshick, McAllester & Ramanan (2010) Slides drawn from a tutorial By R. Girshick AP 12% 27% 36% 45% 49% 2005.

Recognition II Ali Farhadi. We have talked about Nearest Neighbor Naïve Bayes Logistic Regression Boosting.

Object detection, deep learning, and R-CNNs

Methods for classification and image representation

Histograms of Oriented Gradients for Human Detection Navneet Dalal and Bill Triggs CVPR 2005 Another Descriptor.

CS 1699: Intro to Computer Vision Detection II: Deformable Part Models Prof. Adriana Kovashka University of Pittsburgh November 12, 2015.

Object Detection Overview Viola-Jones Dalal-Triggs Deformable models Deep learning.

Recognition Using Visual Phrases

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Object Recognizing. Object Classes Individual Recognition.

More sliding window detection: Discriminative part-based models

Rich feature hierarchies for accurate object detection and semantic segmentation 2014 IEEE Conference on Computer Vision and Pattern Recognition Ross Girshick,

Deep Learning Overview Sources: workshop-tutorial-final.pdf

1 Bilinear Classifiers for Visual Recognition Computational Vision Lab. University of California Irvine To be presented in NIPS 2009 Hamed Pirsiavash Deva.

When deep learning meets object detection: Introduction to two technologies: SSD and YOLO Wenchi Ma.

Recent developments in object detection

Cascade for Fast Detection

The Relationship between Deep Learning and Brain Function

Object detection with deformable part-based models

Data Driven Attributes for Action Detection

Lecture 24: Convolutional neural networks

Lit part of blue dress and shadowed part of white dress are the same color

Recap: Advanced Feature Encoding

Object Localization Goal: detect the location of an object within an image Fully supervised: Training data labeled with object category and ground truth.

Lecture 5 Smaller Network: CNN

Object detection as supervised classification

R-CNN region By Ilia Iofedov 11/11/2018 BGU, DNN course 2016.

Non-linear classifiers Neural networks

Cheng-Ming Huang, Wen-Hung Liao Department of Computer Science

Object detection.

A Tutorial on HOG Human Detection

HOGgles Visualizing Object Detection Features

An HOG-LBP Human Detector with Partial Occlusion Handling

Computer Vision James Hays

Recognition IV: Object Detection through Deep Learning and R-CNNs

Introduction to Neural Networks

Image Classification.

“The Truth About Cats And Dogs”

Smart Robots, Drones, IoT

Object Classes Most recent work is at the object level We perceive the world in terms of objects, belonging to different classes. What are the differences.

Object Detection Creation from Scratch Samsung R&D Institute Ukraine

Faster R-CNN By Anthony Martinez.

On Convolutional Neural Network

Lecture: Deep Convolutional Neural Networks

Outline Background Motivation Proposed Model Experimental Results

RCNN, Fast-RCNN, Faster-RCNN

VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION

Semantic Segmentation

Image recognition.

Presentation transcript:

Object detection, deep learning, and R-CNNs Prof. Linda Shapiro Computer Science & Engineering University of Washington (Partly from Ross Girshick, Microsoft Research) Original slides: http://homes.cs.washington.edu/~shapiro/EE562/notes/Vision2-16.pptx

Outline Object detection Convolutional Neural Networks (CNNs) the task, evaluation, datasets Convolutional Neural Networks (CNNs) overview and history Region-based Convolutional Networks (R-CNNs)

Image classification 𝐾 classes Task: assign correct class label to the whole image Digit classification (MNIST) Object recognition (Caltech-101)

Classification vs. Detection Dog Dog

Problem formulation { airplane, bird, motorbike, person, sofa } Input Desired output

Evaluating a detector Test image (previously unseen)

First detection ... 0.9 ‘person’ detector predictions

Second detection ... 0.9 0.6 ‘person’ detector predictions

Third detection ... 0.2 0.9 0.6 ‘person’ detector predictions

Compare to ground truth 0.2 0.9 0.6 ‘person’ detector predictions ground truth ‘person’ boxes

Sort by confidence ... ... ... ... ... ✓ ✓ ✓ true positive false 0.9 0.8 0.6 0.5 0.2 0.1 ... ... ... ... ... ✓ ✓ ✓ X X X true positive (high overlap) false positive (no overlap, low overlap, or duplicate) Let’s define the problem a bit more so we’re all on the same page.

Evaluation metric ... ... ... ... ... ✓ ✓ ✓ ✓ ✓ + X 0.9 0.8 0.6 0.5 0.2 0.1 ... ... ... ... ... ✓ ✓ ✓ X X X 𝑡 ✓ 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛@𝑡= #𝑡𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠@𝑡 #𝑡𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠@𝑡+#𝑓𝑎𝑙𝑠𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠@𝑡 Let’s define the problem a bit more so we’re all on the same page. ✓ + X 𝑟𝑒𝑐𝑎𝑙𝑙@𝑡= #𝑡𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠@𝑡 #𝑔𝑟𝑜𝑢𝑛𝑑 𝑡𝑟𝑢𝑡ℎ 𝑜𝑏𝑗𝑒𝑐𝑡𝑠

Evaluation metric ... ... ... ... ... ✓ ✓ ✓ Average Precision (AP) 0.9 0.8 0.6 0.5 0.2 0.1 ... ... ... ... ... ✓ ✓ ✓ X X X Average Precision (AP) 0% is worst 100% is best mean AP over classes (mAP) Let’s define the problem a bit more so we’re all on the same page.

Histograms of Oriented Gradients for Human Detection, Dalal and Triggs, CVPR 2005 AP ~77% More sophisticated methods: AP ~90% Pedestrians (a) average gradient image over training examples (b) each “pixel” shows max positive SVM weight in the block centered on that pixel (c) same as (b) for negative SVM weights (d) test image (e) its R-HOG descriptor (f) R-HOG descriptor weighted by positive SVM weights (g) R-HOG descriptor weighted by negative SVM weights

Overview of HOG Method 1. Compute gradients in the region to be described 2. Put them in bins according to orientation 3. Group the cells into large blocks 4. Normalize each block 5. Train classifiers to decide if these are parts of a human

Details Gradients [-1 0 1] and [-1 0 1]T were good enough filters. Cell Histograms Each pixel within the cell casts a weighted vote for an orientation-based histogram channel based on the values found in the gradient computation. (9 channels worked) Blocks Group the cells together into larger blocks, either R-HOG blocks (rectangular) or C-HOG blocks (circular).

More Details Block Normalization They tried 4 different kinds of normalization. Let  be the block to be normalized and e be a small constant.

Example: Dalal-Triggs pedestrian detector Extract fixed-sized (64x128 pixel) window at each position and scale Compute HOG (histogram of gradient) features within each window Score the window with a linear SVM classifier Perform non-maxima suppression to remove overlapping detections with lower scores Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Outperforms centered diagonal uncentered cubic-corrected Sobel Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Histogram of gradient orientations Votes weighted by magnitude Bilinear interpolation between cells Orientation: 9 bins (for unsigned angles) Histograms in 8x8 pixel cells Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Normalize with respect to surrounding cells Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

# normalizations by neighboring cells # orientations # features = 15 x 7 x 9 x 4 = 3780 X= # cells # normalizations by neighboring cells Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Training set

pos w neg w Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

pedestrian Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Detection examples

Deformable Parts Model Takes the idea a little further Instead of one rigid HOG model, we have multiple HOG models in a spatial arrangement One root part to find first and multiple other parts in a tree structure.

The Idea Articulated parts model Object is configuration of parts Each part is detectable Images from Felzenszwalb

Deformable objects Images from Caltech-256 Slide Credit: Duan Tran

Deformable objects Images from D. Ramanan’s dataset Slide Credit: Duan Tran

How to model spatial relations? Tree-shaped model

Hybrid template/parts model Detections Template Visualization Felzenszwalb et al. 2008

Pictorial Structures Model Geometry likelihood Appearance likelihood

Results for person matching

Results for person matching

BMVC 2009

2012 State-of-the-art Detector: Deformable Parts Model (DPM) Lifetime Achievement For the local window detector implementation, we used the publicly available implementation of the DPM detector. This is the best performing sliding window implementation and has consistently performed well in the VOC challenge receiving the lifetime achievement award. Strong low-level features based on HOG Efficient matching algorithms for deformable part-based models (pictorial structures) Discriminative learning with latent variables (latent SVM) Felzenszwalb et al., 2008, 2010, 2011, 2012

Why did gradient-based models work? Average gradient image

Generic categories Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep …? PASCAL Visual Object Categories (VOC) dataset

Generic categories Why doesn’t this work (as well)? Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep …? PASCAL Visual Object Categories (VOC) dataset

Quiz time (Back to Girshick)

This is an average image of which object class? Warm up This is an average image of which object class?

Warm up pedestrian

A little harder ?

Hint: airplane, bicycle, bus, car, cat, chair, cow, dog, dining table A little harder ? Hint: airplane, bicycle, bus, car, cat, chair, cow, dog, dining table

A little harder bicycle (PASCAL)

A little harder, yet ?

Hint: white blob on a green background A little harder, yet ? Hint: white blob on a green background

A little harder, yet sheep (PASCAL)

Impossible? ?

Impossible? dog (PASCAL)

Impossible? dog (PASCAL) Why does the mean look like this? There’s no alignment between the examples! How do we combat this?

PASCAL VOC detection history 41% 41% 37% DPM++, MKL, Selective Search Selective Search, DPM++, MKL 28% DPM++ 23% DPM, MKL 17% DPM, HOG+ BOW DPM

Part-based models & multiple features (MKL) 41% 41% rapid performance improvements 37% DPM++, MKL, Selective Search Selective Search, DPM++, MKL 28% DPM++ 23% DPM, MKL 17% DPM, HOG+ BOW DPM

Kitchen-sink approaches increasing complexity & plateau 41% 41% 37% DPM++, MKL, Selective Search Selective Search, DPM++, MKL 28% DPM++ 23% DPM, MKL 17% DPM, HOG+ BOW DPM

Region-based Convolutional Networks (R-CNNs) 62% 53% R-CNN v2 R-CNN v1 41% 41% 37% DPM++, MKL, Selective Search Selective Search, DPM++, MKL 28% DPM++ 23% DPM, MKL 17% DPM, HOG+ BOW DPM [R-CNN. Girshick et al. CVPR 2014]

Region-based Convolutional Networks (R-CNNs) ~1 year ~5 years [R-CNN. Girshick et al. CVPR 2014]

Convolutional Neural Networks Overview

Standard Neural Networks “Fully connected” 𝑔 𝑡 = 1 1+ 𝑒 −𝑡 𝒙= 𝑥 1 ,…, 𝑥 784 𝑇 𝑧 𝑗 = 𝑔(𝒘 𝑗 𝑇 𝒙)

From NNs to Convolutional NNs Local connectivity Shared (“tied”) weights Multiple feature maps Pooling

Convolutional NNs Local connectivity compare Each green unit is only connected to (3) neighboring blue units

Convolutional NNs Shared (“tied”) weights 𝑤 1 𝑤 2 𝑤 3 All green units share the same parameters 𝒘 Each green unit computes the same function, but with a different input window 𝑤 1 𝑤 2 𝑤 3

Convolutional NNs Convolution with 1-D filter: [ 𝑤 3 , 𝑤 2 , 𝑤 1 ] All green units share the same parameters 𝒘 Each green unit computes the same function, but with a different input window

Convolutional NNs Convolution with 1-D filter: [ 𝑤 3 , 𝑤 2 , 𝑤 1 ] All green units share the same parameters 𝒘 Each green unit computes the same function, but with a different input window 𝑤 3

Convolutional NNs Convolution with 1-D filter: [ 𝑤 3 , 𝑤 2 , 𝑤 1 ] All green units share the same parameters 𝒘 Each green unit computes the same function, but with a different input window 𝑤 1 𝑤 2 𝑤 3

Convolutional NNs Convolution with 1-D filter: [ 𝑤 3 , 𝑤 2 , 𝑤 1 ] All green units share the same parameters 𝒘 Each green unit computes the same function, but with a different input window 𝑤 1 𝑤 2 𝑤 3

Convolutional NNs Convolution with 1-D filter: [ 𝑤 3 , 𝑤 2 , 𝑤 1 ] All green units share the same parameters 𝒘 Each green unit computes the same function, but with a different input window 𝑤 1 𝑤 2 𝑤 3

Convolutional NNs Multiple feature maps 𝑤′ 1 𝑤′ 2 All orange units compute the same function but with a different input windows Orange and green units compute different functions 𝑤′ 3 𝑤 1 Feature map 2 (array of orange units) 𝑤 2 𝑤 3 Feature map 1 (array of green units)

Convolutional NNs Pooling (max, average) 1 Pooling area: 2 units 4 Pooling stride: 2 units Subsamples feature maps 4 4 3 3

2D input Pooling Convolution Image

1989 Backpropagation applied to handwritten zip code recognition, Lecun et al., 1989

Historical perspective – 1980

Historical perspective – 1980 Hubel and Wiesel 1962 Included basic ingredients of ConvNets, but no supervised learning algorithm

Supervised learning – 1986 Gradient descent training with error backpropagation Early demonstration that error backpropagation can be used for supervised training of neural nets (including ConvNets)

Supervised learning – 1986 “T” vs. “C” problem Simple ConvNet

Practical ConvNets Gradient-Based Learning Applied to Document Recognition, Lecun et al., 1998

Demo http://cs.stanford.edu/people/karpathy/convnetjs/ demo/mnist.html ConvNetJS by Andrej Karpathy (Ph.D. student at Stanford) Software libraries Caffe (C++, python, matlab) Torch7 (C++, lua) Theano (python)

The fall of ConvNets The rise of Support Vector Machines (SVMs) Mathematical advantages (theory, convex optimization) Competitive performance on tasks such as digit classification Neural nets became unpopular in the mid 1990s

The key to SVMs It’s all about the features HOG features SVM weights (+) (-) Histograms of Oriented Gradients for Human Detection, Dalal and Triggs, CVPR 2005

Core idea of “deep learning” Input: the “raw” signal (image, waveform, …) Features: hierarchy of features is learned from the raw input

If SVMs killed neural nets, how did they come back (in computer vision)?

What’s new since the 1980s? More layers LeNet-3 and LeNet-5 had 3 and 5 learnable layers Current models have 8 – 20+ “ReLU” non-linearities (Rectified Linear Unit) 𝑔 𝑥 = max 0, 𝑥 Gradient doesn’t vanish “Dropout” regularization Fast GPU implementations More data 𝑔(𝑥) 𝑥

What else? Object Proposals Sliding window based object detection Object proposals Fast execution High recall with low # of candidate boxes Iterate over window size, aspect ratio, and location

The number of contours wholly enclosed by a bounding box is indicative of the likelihood of the box containing an object.

Ross’s Own System: Region CNNs

Competitive Results

Top Regions for Six Object Classes