Rotational Rectification Network (R2N): Enabling Pedestrian Detection for Mobile Vision Xinshuo Weng1, Shangxuan Wu1, Fares Beainy2, Kris M. Kitani1 1Carnegie Mellon University, 2Volvo Construction Equipment WACV 2018, Lake Tahoe Hi everyone, I’m Xinshuo Weng from CMU. This work, called R2N (Rotational Rectification Network), was done with Shangxuan Wu and Kris Kitani from CMU and Fares Beainy from Volvo.
Pedestrian Detection Pedestrian detection has been a very active topic in computer vision over the past decade. Given an input image, the goal is to localize the pedestrians.
Pedestrian Detection Results on Caltech dataset A lot of work has been done in this area. This figure summarizes the results on the Caltech dataset as of 2016. You can see that the best entry at that time already achieves an average miss rate below 10%, which is pretty good. Zhang et al. Is Faster R-CNN Doing Well for Pedestrian Detection? ECCV, 2016.
Arbitrary-Oriented Pedestrian Detection Okay, now. What if we rotate the input image and run the same detector? What do you think will happen?
Arbitrary-Oriented Pedestrian Detection As you can see in the bottom row, even though the detector is exactly the same as before and we only rotate the input image by a small angle, detection performance gets worse. The key difference here is that the pedestrians are no longer upright after the rotation.
Arbitrary-Oriented Pedestrian Detection Random failure cases on Caltech dataset. Here are some more random failure cases. Where there is no bounding box, the detector has failed completely.
Why is it interesting? Imagine the cases: Mobile phones So, why is it interesting? Why do we need to study arbitrary-oriented pedestrian detection? Well, the truth is that in the real world there are lots of cases where the pedestrians are not upright. This is an image I took yesterday during the poster session. Obviously, the people in it are not upright, although I admit I tilted the camera intentionally. But what I am trying to say is that there is a good chance of getting a tilted picture when using a mobile phone.
Why is it interesting? Imagine the cases: Mobile phones UAVs/drones The same goes for cameras on drones.
Why is it interesting? Imagine the cases: Mobile phones UAVs/drones Construction vehicles on a rugged terrain The same thing happens when driving a construction vehicle on a rugged road.
Why is it interesting? Imagine the cases: Mobile phones UAVs/drones Construction vehicles on a rugged terrain Wearable cameras …. And of course, wearable cameras such as a GoPro.
Why is it interesting? Imagine the cases: Mobile phones UAVs/drones Construction vehicles on a rugged terrain Wearable cameras …. Camera orientation can be very flexible with respect to the ground in the real world. In all of these real-world cases, the camera orientation relative to the ground can vary freely.
Modeling Rotation Invariance or Equivariance To get the expected results shown in the figure, a natural idea is to model rotation invariance or equivariance in the network.
Modelling Rotation Invariance/Equivariance Rotating the inputs Data augmentation TI-Pooling [Laptev et al, CVPR’ 16] …. Cons: Low efficiency More parameters Rotating the filters Changing sampling grids Most methods fall into three families: rotating the inputs, rotating the filters, or changing the sampling grid. Data augmentation obviously belongs to the first family. The main disadvantage of the first family is that the network often ends up learning many redundant filters at different angles in the low-level layers.
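To make the "rotating the inputs" family concrete, here is a minimal NumPy sketch in the spirit of pooling over transformed inputs. It is only an illustration, not the TI-Pooling implementation: `toy_feature` is a hypothetical stand-in for a network, and only the four exact 90-degree rotations are used.

```python
import numpy as np

def toy_feature(img, weights):
    """Hypothetical stand-in for a network: one linear response, not rotation invariant."""
    return float(np.sum(weights * img))

def rotation_pooled_feature(img, weights):
    """Evaluate the feature on four rotated copies of the input and keep the max."""
    return max(toy_feature(np.rot90(img, k), weights) for k in range(4))

rng = np.random.default_rng(0)
img = rng.random((8, 8))
weights = rng.random((8, 8))

# The pooled feature does not change when the input is rotated by 90 degrees,
# because the set of four rotated copies is the same in both cases.
pooled_same = np.isclose(
    rotation_pooled_feature(img, weights),
    rotation_pooled_feature(np.rot90(img), weights),
)
print(pooled_same)
```

Note the cost: every prediction needs one forward pass per rotated copy, which is the efficiency issue listed on the slide.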
Modelling Rotation Invariance/Equivariance Rotating the inputs Data augmentation TI-Pooling [Laptev et al, CVPR’ 16] …. Cons: Low efficiency More parameters Rotating the filters RotEqNet [Marcos et al, ICCV’ 17] ORNs [Zhou et al, CVPR’ 17] …. Cons: Approximated rotations Memory issues Changing sampling grids We can instead rotate the filters, but this produces more intermediate activation maps, one set per angle. So, because of memory limits, we can only afford a few filter orientations. Usually, people use 4 orientations spaced 90 degrees apart, which means a small rotation can only be approximated.
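As a rough illustration of the "rotating the filters" family, the sketch below builds a bank of four copies of one filter at 90-degree intervals. This is a toy under my own assumptions (a single filter, a single patch, the helper names are made up), not the RotEqNet or ORN implementation.

```python
import numpy as np

def rotated_filter_bank(w):
    """Stack four copies of a square filter, rotated by 0, 90, 180 and 270 degrees."""
    return np.stack([np.rot90(w, k) for k in range(4)])

def bank_responses(bank, patch):
    """Correlate every filter in the bank with one patch of the same size."""
    return np.array([np.sum(f * patch) for f in bank])

rng = np.random.default_rng(0)
w = rng.random((5, 5))
patch = rng.random((5, 5))

bank = rotated_filter_bank(w)
r0 = bank_responses(bank, patch)
r90 = bank_responses(bank, np.rot90(patch))

# Rotating the patch by 90 degrees cyclically shifts the four responses by one slot.
shifted = np.allclose(r90, np.roll(r0, 1))
print(bank.shape, shifted)
```

The bank stores four activation channels per filter, and a rotation of, say, 10 degrees has no matching slot, which is where the approximation and memory concerns come from.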
Modelling Rotation Invariance/Equivariance Rotating the inputs Data augmentation TI-Pooling [Laptev et al, CVPR’ 16] …. Cons: Low efficiency More parameters Rotating the filters RotEqNet [Marcos et al, ICCV’ 17] ORNs [Zhou et al, CVPR’ 17] …. Cons: Approximated rotations Memory issues Changing sampling grids Spatial Transformer [Jaderberg et al, NIPS’ 15] Deformable ConvNets [Dai et al, ICCV’ 17] GPPooling (Ours) …. The third family changes the sampling grid. In this way, rotation can be modeled accurately, so this is usually the smarter way. Our proposed method, called global polar pooling or GPPooling, also belongs to the third family.
Global Polar Pooling (GPPooling) This is a demo showing the activations that our GPPooling produces, on the right, when we rotate the inputs on the left. The key observation here is that rotational changes in the inputs lead to translational shifts in the output activations, which can be easily recognized by higher-level layers because, as we all know, translation equivariance is naturally encoded in CNNs. Inputs Activations
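The rotation-to-translation observation can be checked numerically. The sketch below is an assumption-laden toy, not the GPPooling layer itself: it resamples a square image onto a polar grid with bilinear interpolation (the helper `to_polar` and its grid sizes are my own choices) and compares a 90-degree rotation of the input with a circular shift along the angular axis.

```python
import numpy as np

def to_polar(img, n_r=16, n_theta=32):
    """Bilinearly resample a square 2-D array onto an (n_r, n_theta) polar grid."""
    n = img.shape[0]
    c = (n - 1) / 2.0                      # rotation centre in pixel coordinates
    radii = np.linspace(0.0, c - 1.0, n_r)  # stay one pixel inside the border
    thetas = 2.0 * np.pi * np.arange(n_theta) / n_theta
    rr, tt = np.meshgrid(radii, thetas, indexing="ij")
    ys = c + rr * np.sin(tt)
    xs = c + rr * np.cos(tt)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    wy = ys - y0
    wx = xs - x0
    return ((1 - wy) * (1 - wx) * img[y0, x0]
            + (1 - wy) * wx * img[y0, x0 + 1]
            + wy * (1 - wx) * img[y0 + 1, x0]
            + wy * wx * img[y0 + 1, x0 + 1])

rng = np.random.default_rng(0)
img = rng.random((33, 33))

polar = to_polar(img)
polar_rot = to_polar(np.rot90(img))

# A 90-degree rotation of the input is a quarter-turn circular shift
# along the angular axis (32 angular bins, so 8 bins).
is_shift = np.allclose(polar_rot, np.roll(polar, -8, axis=1))
print(is_shift)
```

A 90-degree rotation is used only because `np.rot90` is exact; for other angles the same shift holds up to interpolation error.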
GPPooling vs Pooling GPPooling Pooling The core idea of GPPooling is very simple. Compared to an ordinary pooling layer, we define the sampling grid in polar coordinates instead of rectangular ones. The stride and kernel size are defined in a similar spirit, but along the radial and angular axes. If you take the max over the sampled grid shown in the figure, you get a max-GPPooling layer. The kernel size and stride can be set to very small values if one wants to model a small amount of rotation. Noh et al. Learning Deconvolution Network for Semantic Segmentation. ICCV, 2015.
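Here is a minimal sketch of what a max pooling over a polar sampling grid could look like, with kernel size and stride along the radial and angular axes. Everything specific in it is an assumption for illustration: nearest-neighbour sampling, a single-channel square feature map, wrap-around along the angle, and the names `polar_sample` and `max_gppool`. It is not the layer from the paper.

```python
import numpy as np

def polar_sample(feat, n_r, n_theta):
    """Nearest-neighbour sampling of a square feature map on an (n_r, n_theta) polar grid."""
    n = feat.shape[0]
    c = (n - 1) / 2.0
    radii = np.linspace(0.0, c, n_r)
    thetas = 2.0 * np.pi * np.arange(n_theta) / n_theta
    rr, tt = np.meshgrid(radii, thetas, indexing="ij")
    ys = np.clip(np.rint(c + rr * np.sin(tt)).astype(int), 0, n - 1)
    xs = np.clip(np.rint(c + rr * np.cos(tt)).astype(int), 0, n - 1)
    return feat[ys, xs]

def max_gppool(feat, n_r=8, n_theta=16, k_r=2, k_t=4, s_r=2, s_t=4):
    """Max over windows of size (k_r, k_t) with stride (s_r, s_t) in (radius, angle)."""
    polar = polar_sample(feat, n_r, n_theta)
    # The angular axis is periodic, so pad it with its own first columns.
    polar = np.concatenate([polar, polar[:, :k_t - 1]], axis=1)
    out_r = (n_r - k_r) // s_r + 1
    out_t = (n_theta - 1) // s_t + 1
    out = np.empty((out_r, out_t), dtype=feat.dtype)
    for i in range(out_r):
        for j in range(out_t):
            window = polar[i * s_r:i * s_r + k_r, j * s_t:j * s_t + k_t]
            out[i, j] = window.max()
    return out

feat = np.zeros((17, 17))
feat[8, 8] = 5.0            # a single activation at the centre
pooled = max_gppool(feat)
print(pooled.shape)
```

Shrinking `k_t` and `s_t` (with a larger `n_theta`) gives finer angular resolution, which matches the remark about modeling small rotations.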
What is Rotational Rectification Network (R2N)? R2N = Rotation Estimation Module (including GPPooling) + Spatial Transformer To solve oriented pedestrian detection, we propose R2N, which is the light blue box shown in the figure. R2N is composed of a rotation estimation module (the dark blue one) and a spatial transformer (the green one). The detector in the yellow box can be any general detection framework, such as Faster R-CNN. If you look at the rotation estimation module closely, we use the proposed GPPooling inside it to capture any rotation present in the inputs. The estimated rotation is then passed to the spatial transformer, which rectifies the image features so that the pedestrians become upright in the feature space. In this way, you can detect angled pedestrians as easily as upright ones.
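The rectification step can be sketched as resampling the feature map with a rotation about its centre, which is the kind of operation a spatial transformer performs. The code below is a NumPy toy under my own assumptions (single-channel map, bilinear sampling, zeros outside the map, and the angle is simply given rather than predicted by a rotation estimation module); `rotate_features` is a hypothetical helper, not the R2N code.

```python
import numpy as np

def rotate_features(feat, angle):
    """Resample a 2-D feature map under a rotation by `angle` radians about its centre."""
    n_y, n_x = feat.shape
    cy, cx = (n_y - 1) / 2.0, (n_x - 1) / 2.0
    yy, xx = np.meshgrid(np.arange(n_y), np.arange(n_x), indexing="ij")
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    # Source coordinates for every output pixel (an affine sampling grid).
    xs = cx + cos_a * (xx - cx) - sin_a * (yy - cy)
    ys = cy + sin_a * (xx - cx) + cos_a * (yy - cy)
    eps = 1e-9
    valid = (xs >= -eps) & (xs <= n_x - 1 + eps) & (ys >= -eps) & (ys <= n_y - 1 + eps)
    xs = np.clip(xs, 0, n_x - 1)
    ys = np.clip(ys, 0, n_y - 1)
    x0 = np.minimum(np.floor(xs).astype(int), n_x - 2)
    y0 = np.minimum(np.floor(ys).astype(int), n_y - 2)
    wx, wy = xs - x0, ys - y0
    out = ((1 - wy) * (1 - wx) * feat[y0, x0]
           + (1 - wy) * wx * feat[y0, x0 + 1]
           + wy * (1 - wx) * feat[y0 + 1, x0]
           + wy * wx * feat[y0 + 1, x0 + 1])
    return np.where(valid, out, 0.0)   # samples outside the map become zero

rng = np.random.default_rng(0)
upright = rng.random((8, 8))

estimated_angle = np.pi / 2            # assumed output of the rotation estimator
tilted = rotate_features(upright, estimated_angle)
rectified = rotate_features(tilted, -estimated_angle)
print(np.allclose(rectified, upright))
```

The round trip is exact here only because a 90-degree rotation maps the square grid onto itself; for other angles the corners fall outside the map and interpolation blurs the result slightly.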
Results Here are some example results. With the estimated angle, you get a much tighter bounding box.
Take Home Messages GPPooling can be used to model global rotation equivariance/invariance in general CNNs. R2N is easy to plug in and improves the performance on oriented detection without bells and whistles. Some take-home messages. If you want to model rotation equivariance, use GPPooling. It is simply the counterpart of an ordinary pooling layer. If you have a task that involves detecting oriented objects, give R2N a try. If you think our work is interesting, please note down our title, Rotational Rectification Network, and stop by our poster. Thanks!