Rotational Rectification Network (R2N): Enabling Pedestrian Detection for Mobile Vision

Presentation transcript:

Rotational Rectification Network (R2N): Enabling Pedestrian Detection for Mobile Vision Xinshuo Weng1, Shangxuan Wu1, Fares Beainy2, Kris M. Kitani1 1Carnegie Mellon University, 2Volvo Construction Equipment WACV 2018, Lake Tahoe Hi everyone, I'm Xinshuo Weng from CMU. This work, called R2N (Rotational Rectification Network), was done with Shangxuan Wu and Kris Kitani from CMU and Fares Beainy from Volvo.

Pedestrian Detection Pedestrian detection has been a very active topic in computer vision over the past decade. Given an input image, the goal is to localize the pedestrians.

Pedestrian Detection Results on Caltech dataset A lot of work has been done in this area. This figure summarizes results on the Caltech dataset as of 2016. You can see that the best entry at that time already achieved an average miss rate of less than 10%, which is quite good. Zhang et al. Is Faster R-CNN Doing Well for Pedestrian Detection? ECCV, 2016.

Arbitrary-Oriented Pedestrian Detection Okay, now: what if we rotate the input image and run the same detector? What do you think will happen?

Arbitrary-Oriented Pedestrian Detection As you can see in the bottom row, even though the detector is exactly the same as before and we only rotate the input image by a small angle, detection performance degrades substantially. The key difference is that the pedestrians are no longer upright after the rotation.

Arbitrary-Oriented Pedestrian Detection Random failure cases on Caltech dataset. Here are some more random failure cases. When there is no bounding box at all, the detector has failed completely.

Why is it interesting? Imagine the cases: Mobile phones So, why is it interesting? Why do we need to study arbitrary-oriented pedestrian detection? The truth is that in the real world there are many cases where pedestrians are not upright in the image. This is a picture I took yesterday during the poster session; admittedly I tilted the camera intentionally, but the point is that there is a good chance of getting an angled picture when shooting with a mobile phone.

Why is it interesting? Imagine the cases: Mobile phones UAVs/drones The same is true for cameras on drones.

Why is it interesting? Imagine the cases: Mobile phones UAVs/drones Construction vehicles on rugged terrain The same happens when driving a construction vehicle over rugged terrain.

Why is it interesting? Imagine the cases: Mobile phones UAVs/drones Construction vehicles on rugged terrain Wearable cameras …. And of course, wearable cameras such as GoPro.

Why is it interesting? Imagine the cases: Mobile phones UAVs/drones Construction vehicles on rugged terrain Wearable cameras …. Camera orientation can be very flexible with respect to the ground in the real world. In all of these cases, the camera orientation can vary greatly with respect to the ground.

Modeling Rotation Invariance or Equivariance In order to obtain the expected results shown in the figure, a natural idea is to model rotation invariance or equivariance in the network.

Modeling Rotation Invariance/Equivariance Rotating the inputs Data augmentation TI-Pooling [Laptev et al, CVPR'16] …. Cons: Low efficiency More parameters Rotating the filters Changing sampling grids Most methods can be categorized into three families: rotating the inputs, rotating the filters, or changing the sampling grids. Obviously, data augmentation belongs to the first family. Its main disadvantage is that the network often ends up learning many redundant filters at different angles in the low-level layers.

Modeling Rotation Invariance/Equivariance Rotating the inputs Data augmentation TI-Pooling [Laptev et al, CVPR'16] …. Cons: Low efficiency More parameters Rotating the filters RotEqNet [Marcos et al, ICCV'17] ORNs [Zhou et al, CVPR'17] …. Cons: Approximated rotations Memory issues Changing sampling grids We can instead rotate the filters, which, however, produces additional intermediate activation maps at different angles. Due to memory constraints we can afford only a few filter orientations; usually people use four, at 90-degree intervals, so small rotations can only be approximated.

Modeling Rotation Invariance/Equivariance Rotating the inputs Data augmentation TI-Pooling [Laptev et al, CVPR'16] …. Cons: Low efficiency More parameters Rotating the filters RotEqNet [Marcos et al, ICCV'17] ORNs [Zhou et al, CVPR'17] …. Cons: Approximated rotations Memory issues Changing sampling grids Spatial Transformer [Jaderberg et al, NIPS'15] Deformable ConvNets [Dai et al, ICCV'17] GPPooling (Ours) …. The third family changes the sampling grid, so rotation can be modeled accurately at any angle; this is usually the smarter approach. Our proposed method, called Global Polar Pooling (GPPooling), also belongs to this family, as sketched below.
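To make the contrast between the second and third families concrete, here is a small PyTorch sketch of my own (not code from any of the cited papers); the filter-bank size, image size, and the 0.1 rad angle are arbitrary assumptions. Family 2 replicates filters at coarse angle steps, while family 3 warps the sampling grid and can represent any continuous angle:

```python
import math
import torch
import torch.nn.functional as F

# Family 2 (rotating the filters): cheap only at coarse steps such as
# 90 degrees; the filter bank (and its activation maps) grows 4x.
w = torch.randn(8, 3, 5, 5)  # a learned 5x5 filter bank
bank = torch.cat([torch.rot90(w, k, dims=(2, 3)) for k in range(4)])

# Family 3 (changing the sampling grid): a continuous angle is modeled
# by warping the grid, as in a spatial transformer.
a = 0.1  # radians
theta = torch.tensor([[[math.cos(a), -math.sin(a), 0.0],
                       [math.sin(a),  math.cos(a), 0.0]]])
x = torch.randn(1, 3, 64, 64)
grid = F.affine_grid(theta, x.shape, align_corners=False)
x_rot = F.grid_sample(x, grid, align_corners=False)  # 0.1 rad rotation
```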

Global Polar Pooling (GPPooling) This is a demo showing the activations produced on the right by our GPPooling as the input on the left rotates. The key observation is that rotational changes in the input lead to translational shifts in the output activations, which higher-level layers can easily handle because, as we all know, translation equivariance is naturally encoded in CNNs.
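Here is a toy numeric check of that observation (my own demo, not the paper's code; the grid resolution and sizes are illustrative): resampling features onto a polar grid turns an input rotation into a circular shift along the angular axis. I use an exact 90-degree rotation so the shift is a whole number of angular bins:

```python
import math
import torch
import torch.nn.functional as F

def to_polar(x, n_r=32, n_phi=36):
    """Resample (N, C, H, W) features onto an (N, C, n_r, n_phi) polar grid."""
    r = torch.linspace(0.05, 0.95, n_r)                   # normalized radius
    phi = torch.linspace(0, 2 * math.pi, n_phi + 1)[:-1]  # angle, no wrap bin
    rr, pp = torch.meshgrid(r, phi, indexing="ij")
    # grid_sample expects (x, y) in [-1, 1]; the image center is (0, 0).
    grid = torch.stack((rr * torch.cos(pp), rr * torch.sin(pp)), dim=-1)
    return F.grid_sample(x, grid.unsqueeze(0).expand(x.size(0), -1, -1, -1),
                         align_corners=False)

x = torch.randn(1, 1, 64, 64)
p0 = to_polar(x)
p90 = to_polar(torch.rot90(x, 1, dims=(2, 3)))  # rotate input by 90 degrees
# The rotated polar map should equal p0 circularly rolled by 90/10 = 9
# angular bins (the sign depends on the rotation convention).
err = min((p90 - torch.roll(p0, s, dims=-1)).abs().max().item() for s in (9, -9))
print(f"max abs difference after rolling: {err:.2e}")  # ~0
```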

GPPooling vs Pooling The core idea of GPPooling is very simple: compared with a standard pooling layer, we define the sampling grid in polar coordinates instead of rectangular ones. The stride and kernel size are defined in the same spirit, but along the radial and angular axes. If you take the max over the sampled grid shown in the figure, you get a max-GPPooling layer. The kernel size and stride can be set to very small values if one wants to model a small amount of rotation. Noh et al. Learning Deconvolution Network for Semantic Segmentation. ICCV, 2015.
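One plausible way to realize this is sketched below (a minimal sketch under my own assumptions, not the authors' released code): resample the features onto a polar grid, then apply a standard max pooling whose two spatial dimensions are interpreted as radius and angle. The grid resolution (n_r, n_phi) and the kernel/stride values are illustrative; smaller angular kernels/strides correspond to modeling smaller rotations, as the slide describes.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxGPPool(nn.Module):
    """Max-GPPooling sketch: polar resampling followed by max pooling whose
    kernel and stride run along the radial and angular axes."""
    def __init__(self, n_r=32, n_phi=72, kernel=(2, 3), stride=(2, 3)):
        super().__init__()
        self.n_r, self.n_phi = n_r, n_phi
        # kernel/stride are (radial, angular); small values -> small rotations.
        self.pool = nn.MaxPool2d(kernel_size=kernel, stride=stride)

    def forward(self, x):  # x: (N, C, H, W)
        r = torch.linspace(0.05, 0.95, self.n_r, device=x.device)
        phi = torch.linspace(0, 2 * math.pi, self.n_phi + 1, device=x.device)[:-1]
        rr, pp = torch.meshgrid(r, phi, indexing="ij")
        grid = torch.stack((rr * torch.cos(pp), rr * torch.sin(pp)), dim=-1)
        grid = grid.unsqueeze(0).expand(x.size(0), -1, -1, -1)
        polar = F.grid_sample(x, grid, align_corners=False)  # (N, C, n_r, n_phi)
        return self.pool(polar)

y = MaxGPPool()(torch.randn(2, 16, 64, 64))  # -> (2, 16, 16, 24)
```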

What is Rotational Rectification Network (R2N)? R2N = Rotation Estimation Module (including GPPooling) + Spatial Transformer To handle oriented pedestrian detection, we propose R2N, shown as the light blue box in the figure. R2N is composed of a rotation estimation module (the dark blue box) and a spatial transformer (the green box). The detector in the yellow box can be any general detection framework, such as Faster R-CNN. Looking closely at the rotation estimation module, we use the proposed GPPooling inside it to capture the rotation present in the input. The estimated rotation is then passed to the spatial transformer to rectify the image features so that the pedestrians become upright in feature space. In this way, angled pedestrians can be detected as easily as upright ones.
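A rough sketch of that wiring, under my own assumptions: the tiny angle regressor below is a stand-in (the paper's rotation estimation module uses GPPooling internally), and the layer sizes, the single-angle parameterization, and the unnamed downstream detector are all illustrative, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotationEstimator(nn.Module):
    """Stand-in for the rotation estimation module (paper uses GPPooling)."""
    def __init__(self, in_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)
        self.fc = nn.Linear(16, 1)

    def forward(self, feat):
        h = F.relu(self.conv(feat)).mean(dim=(2, 3))  # global average pool
        return self.fc(h).squeeze(1)                  # predicted angle (radians)

def rectify(feat, angle):
    """Spatial transformer: rotate features by -angle so people are upright."""
    cos, sin = torch.cos(-angle), torch.sin(-angle)
    zeros = torch.zeros_like(cos)
    theta = torch.stack([torch.stack([cos, -sin, zeros], dim=1),
                         torch.stack([sin,  cos, zeros], dim=1)], dim=1)
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

feat = torch.randn(2, 64, 80, 120)   # backbone features for 2 images
angle = RotationEstimator(64)(feat)  # per-image roll estimate
upright = rectify(feat, angle)       # rectified features -> any detector head
```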

Results Here are some example results. You get a much tighter bounding box, along with an estimated angle.

Take Home Messages GPPooling can be used to model global rotation equivariance/invariance in general CNNs. R2N is easy to plug in and improves performance on oriented detection without bells and whistles. Some take-home messages: if you want to model rotation equivariance, use GPPooling; it is simply a polar counterpart of the standard pooling layer. If your task involves detecting oriented objects, give R2N a try. If you find our work interesting, please remember our title, Rotational Rectification Network, and stop by our poster. Thanks!