1
Spatial Transformer Networks
Authors: Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu, Google DeepMind. Neural Information Processing Systems (NIPS) 2015. Presented by: Asher Fredman
2
Motivation: In order to successfully classify objects, neural networks should be invariant to spatial transformations. But are they? Scale? Rotation? Translation?
3
Spatial transformations
4
Translation
5
Euclidean
6
Affine
7
Spatial transformations
aspect, translation, rotation, perspective, cylindrical, affine
8
CNNs and RNNs are not really invariant to most spatial transformations. Scale? No. Rotation? No. Translation? RNNs – no; CNNs – sort of…
9
Max-pooling layer. The figure above shows the 8 possible one-pixel translations of an image. The 4×4 matrix represents the 16 pixel values of an example image. When applying 2×2 max-pooling, the maximum of each colored rectangle becomes the new entry of the corresponding bold rectangle. As one can see, the output stays identical for 3 out of the 8 translations, making the output to the next layer somewhat translation invariant. With 3×3 max-pooling, 5 out of 8 translation directions give an identical result, so translation invariance grows with the pooling size. As described before, translation is only the simplest geometric transformation; the other transformations listed in the table above can only be handled by the spatial transformer module.
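To make the partial-invariance point concrete, here is a toy numpy sketch (the image and the shift direction are invented for illustration): a one-pixel shift leaves the 2×2 max-pooled output unchanged when every block's maximum stays inside its pooling window.

```python
import numpy as np

def max_pool_2x2(img):
    """2x2 max-pooling with stride 2 on a square array."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Toy 4x4 image whose block-wise maxima sit away from the block borders.
img = np.array([[1, 2, 1, 2],
                [3, 9, 3, 8],
                [1, 2, 1, 2],
                [3, 7, 3, 6]], dtype=float)

# Shift the image one pixel to the left (zero-padded on the right).
shifted = np.zeros_like(img)
shifted[:, :-1] = img[:, 1:]

print(max_pool_2x2(img))      # [[9. 8.] [7. 6.]]
print(max_pool_2x2(shifted))  # identical here; other shift directions would change the output
```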
10
Visualizing CNNs Harley, Adam W. "An interactive node-link visualization of convolutional neural networks." International Symposium on Visual Computing. Springer International Publishing, 2015.
11
Spatial Transformer Module!
So we would like our NNs to be implemented in a way that the output of the network is invariant to the position, size and rotation of an object in the image. How can we achieve this? With a Spatial Transformer Module!
12
Spatial Transformer: a dynamic mechanism that actively spatially transforms an image or feature map by learning an appropriate transformation matrix. The transformation matrix is capable of expressing translation, rotation, scaling and cropping. It allows for end-to-end trainable models using standard back-propagation. Furthermore, spatial transformers allow the localisation of objects in an image and the sub-classification of an object, such as distinguishing between the body and the head of a bird, in an unsupervised manner.
13
Architecture of a spatial transformer module
14
Anything strange? Why not just transform the image directly?
15
Image Warping – quick overview
In order to understand the different components of the spatial transformer module, we first need to understand the idea behind inverse image warping. Image warping is the process of digitally manipulating an image such that any shapes portrayed in the image are significantly distorted. Warping may also be used for correcting image distortion.
16
Image warping
17
Image warping [diagram: a source image f(x,y) is mapped by a coordinate transform T(x,y) to a target image g(x',y')]
Given a coordinate transform (x’,y’) = T(x,y) and a source image f(x,y), how do we compute a transformed image g(x’,y’) = f(T(x,y))?
18
Forward warping [diagram: a pixel of f(x,y) is sent through T(x,y) to a location (x',y') in g(x',y')]
Send each pixel f(x,y) to its corresponding location (x',y') = T(x,y) in the second image. Q: what if a pixel lands "between" two pixels?
19
Forward warping [diagram: a pixel of f(x,y) is sent through T(x,y) to a location (x',y') in g(x',y')]
Send each pixel f(x,y) to its corresponding location (x',y') = T(x,y) in the second image. Q: what if a pixel lands "between" two pixels? A: distribute its color among the neighboring pixels around (x',y'), known as "splatting".
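A rough numpy sketch of forward warping with bilinear splatting (grayscale image, with the transform T passed in as a Python function; purely illustrative, not taken from the slides):

```python
import numpy as np

def forward_warp(src, T, out_shape):
    """Send each source pixel through T and splat its value bilinearly
    over the four nearest target pixels; normalise by the splat weights."""
    out = np.zeros(out_shape)
    weight = np.zeros(out_shape)
    H, W = src.shape
    for y in range(H):
        for x in range(W):
            xp, yp = T(x, y)                      # target location, usually non-integer
            x0, y0 = int(np.floor(xp)), int(np.floor(yp))
            for dy in (0, 1):                     # the 2x2 neighbourhood around (xp, yp)
                for dx in (0, 1):
                    xi, yi = x0 + dx, y0 + dy
                    if 0 <= xi < out_shape[1] and 0 <= yi < out_shape[0]:
                        w = (1 - abs(xp - xi)) * (1 - abs(yp - yi))
                        out[yi, xi] += w * src[y, x]
                        weight[yi, xi] += w
    return out / np.maximum(weight, 1e-8)         # pixels that received nothing stay 0 ("holes")
```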
20
Inverse warping [diagram: each pixel of g(x',y') is pulled from the location T⁻¹(x',y') in f(x,y)]
Get each pixel g(x',y') from its corresponding location (x,y) = T⁻¹(x',y') in the first image. Q: what if a pixel comes from "between" two pixels?
21
Inverse warping [diagram: each pixel of g(x',y') is pulled from the location T⁻¹(x',y') in f(x,y)]
Get each pixel g(x',y') from its corresponding location (x,y) = T⁻¹(x',y') in the first image. Q: what if a pixel comes from "between" two pixels? A: interpolate the color value from the neighbors – nearest neighbor, bilinear, Gaussian, or bicubic.
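The corresponding numpy sketch of inverse warping with bilinear interpolation (again a toy illustration; T_inv maps target coordinates back to source coordinates):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinearly interpolate a grayscale image at a real-valued (x, y)."""
    H, W = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    a, b = x - x0, y - y0
    return ((1 - a) * (1 - b) * img[y0, x0] + a * (1 - b) * img[y0, x1] +
            (1 - a) * b * img[y1, x0] + a * b * img[y1, x1])

def inverse_warp(src, T_inv, out_shape):
    """For every target pixel, look up its source location and interpolate."""
    out = np.zeros(out_shape)
    for yp in range(out_shape[0]):
        for xp in range(out_shape[1]):
            x, y = T_inv(xp, yp)
            if 0 <= x <= src.shape[1] - 1 and 0 <= y <= src.shape[0] - 1:
                out[yp, xp] = bilinear_sample(src, x, y)
    return out                                    # no holes: every output pixel gets a value
```

This per-target-pixel lookup is the same pattern the spatial transformer's sampler uses, with the grid generator supplying the source coordinates.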
22
Forward vs. inverse warping
Q: which is better? A: usually inverse – it eliminates holes; however, it requires an invertible warp function, which is not always possible…
23
Resampling. Image resampling is the process of geometrically transforming digital images. This is a 2D signal reconstruction problem!
24
Resampling: compute a weighted sum of the pixel neighborhood
- Weights are normalized values of a kernel function
- Equivalent to convolution with the kernel at the sample points
- Good filters can be chosen using standard signal-processing ideas
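Written out as a formula (a generic statement of this weighted-sum view, not copied verbatim from the slides), with the pulled-back location (x, y) = T^{-1}(x', y') and a kernel k:

g(x', y') = \sum_{(i,j) \in \mathcal{N}(x,y)} k(x - i,\; y - j)\, f[i, j]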
25
Point Sampling – nearest neighbor
Copies the color of the pixel with the closest integer coordinate. A fast and efficient way to process textures if the size of the target is similar to the size of the reference. Otherwise, the result can be a chunky, aliased, or blurred image.
26
Bilinear Filter: weighted sum of the four neighboring pixels.
27
Bilinear Filter – unit square. If we choose a coordinate system in which the four points where f is known are (0, 0), (0, 1), (1, 0), and (1, 1), then the interpolation formula simplifies to:
f(x, y) \approx f(0,0)\,(1-x)(1-y) + f(1,0)\,x(1-y) + f(0,1)\,(1-x)\,y + f(1,1)\,x\,y
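As a quick worked example (values invented for illustration): with f(0,0) = 10, f(1,0) = 20, f(0,1) = 30 and f(1,1) = 40, sampling at (x, y) = (0.5, 0.5) gives 0.25·10 + 0.25·20 + 0.25·30 + 0.25·40 = 25, i.e. the average of the four corners, as expected.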
28
Bilinear Filter [diagram: a sample point surrounded by its four neighboring pixels (i,j), (i,j+1), (i+1,j), (i+1,j+1)]
29
Architecture of a spatial transformer module
Back to the spatial transformer…
30
Localisation Network: takes in a feature map U ∈ R^{H×W×C} and outputs the parameters θ of the transformation. It can be realised using a fully-connected or convolutional network with a final regression layer producing the transformation parameters.
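A minimal PyTorch-style sketch of such a localisation network for the affine case (layer sizes are illustrative, not the ones used in the paper); note the identity-transform initialisation, a common trick so that training starts from "no warp":

```python
import torch
import torch.nn as nn

class LocalisationNet(nn.Module):
    """Regresses the 6 parameters of a 2D affine transform from a feature map U."""
    def __init__(self, in_channels, height, width):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(32 * (height // 4) * (width // 4), 6)
        # Start from the identity transform: zero weights, identity bias.
        nn.init.zeros_(self.fc.weight)
        self.fc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, U):                  # U: (N, C, H, W)
        theta = self.fc(self.features(U).flatten(1))
        return theta.view(-1, 2, 3)        # one 2x3 affine matrix per sample
```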
31
Parameterised Sampling Grid (Grid Generator)
Generates the sampling grid by using the transformation predicted by the localisation network.
32
Parameterised Sampling Grid (Grid Generator)
Attention model: the regular grid over the target is mapped to a transformed (scaled and translated) grid over the source; the identity transform corresponds to s = 1, t_x = 0, t_y = 0.
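In the paper's parameterisation, the attention model constrains the affine matrix to an isotropic scale plus translation:

A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}

so the identity transform above is simply s = 1, t_x = t_y = 0.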
33
Parameterised Sampling Grid (Grid Generator)
Affine transform: the regular grid over the target is mapped to an affine-transformed grid over the source.
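For the full affine case, the grid generator maps each target grid coordinate (x_i^t, y_i^t) to a source coordinate (x_i^s, y_i^s), as in the paper:

\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal{T}_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}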
34
Differentiable Image Sampling (Sampler)
Samples the input feature map at the points of the sampling grid and produces the output map.
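In modern PyTorch the grid generator and sampler are available as built-ins, so the whole module can be sketched in a few lines (a sketch, not the authors' original implementation):

```python
import torch.nn.functional as F

def spatial_transformer(U, theta, out_size):
    """Warp a feature map U (N, C, H, W) with per-sample 2x3 affine matrices theta (N, 2, 3)."""
    grid = F.affine_grid(theta, out_size, align_corners=False)           # target -> source coordinates
    return F.grid_sample(U, grid, mode='bilinear', align_corners=False)  # differentiable bilinear sampling
```

Here theta could come from the LocalisationNet sketched earlier, and out_size is the desired (N, C, H', W') shape of the output map.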
35
Mathematical Formulation of Sampling
Kernels: the sampling is written with a generic sampling kernel; the two choices discussed in the paper are the integer (nearest-neighbour) sampling kernel and the bilinear sampling kernel.
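In the paper's notation, each output value V_i^c is a kernel-weighted sum over the input feature map U:

V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \, k(x_i^s - m;\, \Phi_x)\, k(y_i^s - n;\, \Phi_y)

With the integer sampling kernel this simply copies the nearest pixel,
V_i^c = \sum_{n}\sum_{m} U_{nm}^c \, \delta(\lfloor x_i^s + 0.5 \rfloor - m)\, \delta(\lfloor y_i^s + 0.5 \rfloor - n),
while the bilinear sampling kernel gives
V_i^c = \sum_{n}\sum_{m} U_{nm}^c \, \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|).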
36
Backpropagation through Sampling Mechanism
Gradient with bilinear sampling kernel
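For the bilinear kernel, the partial derivatives used in back-propagation are (as derived in the paper; the gradient with respect to y_i^s is analogous to the one with respect to x_i^s):

\frac{\partial V_i^c}{\partial U_{nm}^c} = \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|)

\frac{\partial V_i^c}{\partial x_i^s} = \sum_{n}\sum_{m} U_{nm}^c \, \max(0, 1 - |y_i^s - n|) \cdot \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases}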
37
Experiments
- Distorted versions of the MNIST handwritten digit dataset, for classification
- A challenging real-world dataset, Street View House Numbers, for number recognition
- The CUB birds dataset, for fine-grained classification using multiple parallel spatial transformers
38
Experiments: MNIST data that has been distorted in various ways: rotation (R); rotation, scale and translation (RTS); projective transformation (P); and elastic warping (E). Baseline fully-connected (FCN) and convolutional (CNN) neural networks are trained, as well as networks with spatial transformers acting on the input before the classification network (ST-FCN and ST-CNN).
39
Experiments: the spatial transformer networks all use different transformation functions: an affine transformation (Aff), a projective transformation (Proj), and a 16-point thin plate spline (TPS).
40
Affine Transform (error %)
Experiments: classification error (%) on distorted MNIST, CNN baseline vs. ST-CNN with an affine transformer:
Distortion   CNN   ST-CNN (Aff)
R            1.2   0.7
RTS          0.8   0.5
P            1.5   0.8
E            1.4   1.2
41
Projective Transform (error %)
Experiments: classification error (%) on distorted MNIST, CNN baseline vs. ST-CNN with a projective transformer:
Distortion   CNN   ST-CNN (Proj)
R            1.2   0.8
RTS          0.8   0.6
P            1.5   0.8
E            1.4   1.3
42
Thin Plate Spline (error %)
Experiments: classification error (%) on distorted MNIST, CNN baseline vs. ST-CNN with a thin plate spline transformer:
Distortion   CNN   ST-CNN (TPS)
R            1.2   0.7
RTS          0.8   0.5
P            1.5   0.8
E            1.4   1.1
43
Distorted MNIST results. Legend: R – rotation; RTS – rotation, translation and scaling; P – projective distortion; E – elastic distortion. Each panel shows the input to the network, the transformation applied by the spatial transformer, and the output of the spatial transformer network.
44
Experiments: Street View House Numbers (SVHN). This dataset contains around 200k real-world images of house numbers, with the task of recognising the sequence of digits in each image.
45
Experiments: data is preprocessed by taking 64×64 crops and, more loosely, 128×128 crops around each digit sequence.
46
Comparative results (error %)
Experiments: comparative results on SVHN (error %):
Model           64 px   128 px
Maxout CNN      4.0     –
CNN (ours)      4.0     5.6
DRAM            3.9     4.5
ST-CNN Single   3.7     3.9
ST-CNN Multi    3.6     3.9
47
Experiments: fine-grained classification. The CUB-200-2011 birds dataset contains 6k training images and 5.8k test images, covering 200 species of birds. The birds appear at a range of scales and orientations and are not tightly cropped. Only image class labels are used for training.
48
Experiments: the baseline CNN model is an Inception architecture with batch normalisation, pretrained on ImageNet and fine-tuned on CUB. It achieves a state-of-the-art accuracy of 82.3% (the previous best result is 81.0%). Then spatial transformer networks, ST-CNNs, containing 2 or 4 parallel spatial transformers are trained.
49
Experiments: the transformations predicted by 2×ST-CNN (top row) and 4×ST-CNN (bottom row).
50
Experiments One of the transformers learns to detect heads, while the other detects the body.
51
Experiments: accuracy on CUB (%)
Prior published methods:    66.7, 74.9, 75.7, 80.9, 81.0
CNN baseline (this paper):  82.3
2×ST-CNN:                   83.1 / 83.9
4×ST-CNN:                   84.1
52
Conclusion: introduced a new module – the spatial transformer.
It helps in learning explicit spatial transformations of features, such as translation, rotation, scaling, cropping and non-rigid deformations. It can be used in any network, at any layer, and is learnt in an end-to-end trainable manner. It provides an improvement in the performance of existing models.
53
Thank you! QUESTIONS?