In spatial transformer networks, the idea of the localisation network is to learn a transformation that maps the input to a canonical form. Think of the network's output $\theta$ as just another activation that is passed on to the next stage: it specifies how the sampling should be performed, and the key point is that the whole sampling sequence of operations is differentiable. The sampler usually uses bilinear interpolation, which, although not differentiable at every point because of the floor and ceiling operations, is differentiable almost everywhere in its inputs, so the error can still be backpropagated through it. In short, treat $\theta$ simply as an activation that is fed to the bilinear sampler to transform the input of the next network, and bilinear sampling is considered differentiable.
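
To make the "differentiable almost everywhere" point concrete, here is a minimal sketch (not code from the paper) of bilinear sampling at a single point, written with PyTorch autograd. The floor operation itself carries no gradient, but the fractional offsets do, so the sampled value is piecewise linear in the coordinates and gradients flow back to them except exactly at integer positions:

```python
import torch

def bilinear_sample(img, x, y):
    """img: (H, W) tensor; x, y: scalar tensors in pixel coordinates."""
    x0 = torch.floor(x).long()      # left corner index (no gradient through floor)
    y0 = torch.floor(y).long()      # top corner index
    x1, y1 = x0 + 1, y0 + 1         # right / bottom corner indices
    # Fractional offsets: these DO carry gradient with respect to x and y.
    wx, wy = x - x0.float(), y - y0.float()
    # Weighted sum of the four neighbouring pixels.
    return ((1 - wx) * (1 - wy) * img[y0, x0] +
            wx       * (1 - wy) * img[y0, x1] +
            (1 - wx) * wy       * img[y1, x0] +
            wx       * wy       * img[y1, x1])

img = torch.arange(16.0).reshape(4, 4)
x = torch.tensor(1.3, requires_grad=True)
y = torch.tensor(2.7, requires_grad=True)
out = bilinear_sample(img, x, y)
out.backward()
print(out.item(), x.grad.item(), y.grad.item())  # gradients reach the sampling coordinates
```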

To understand this better, consider the following figure, which illustrates the process inside a spatial transformer more clearly than the one in the original paper.

[Figure: the spatial transformer pipeline — localisation network, grid generator, and bilinear sampler producing the transformed image.]

As the figure shows, the output of the localisation network, $\theta$, is passed to the grid generator. The regular sampling grid is multiplied by $\theta$ to find the corresponding locations in the original image. Note that $\theta$ is not multiplied by the original image itself: if it were, a single output pixel could end up with multiple candidate values, whereas applying $\theta$ to the sampling grid gives exactly one source location per grid entry. Next, the transformed grid and the original image are used in the interpolation step to produce the transformed image. As you can see, $\theta$ behaves just like any other activation.
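
As a minimal sketch of this grid-generator-plus-sampler path, assuming a 2D affine $\theta$ of shape $(N, 2, 3)$, you can use PyTorch's `affine_grid` and `grid_sample`: `affine_grid` applies $\theta$ to a regular grid of output coordinates, and `grid_sample` bilinearly interpolates the input image at those coordinates, so gradients flow back through both steps into $\theta$:

```python
import torch
import torch.nn.functional as F

N, C, H, W = 1, 1, 8, 8
image = torch.rand(N, C, H, W)

# Pretend this is the localisation network's output (here: the identity transform).
theta = torch.tensor([[[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]], requires_grad=True)

grid = F.affine_grid(theta, size=(N, C, H, W), align_corners=False)  # (N, H, W, 2) sampling locations
warped = F.grid_sample(image, grid, align_corners=False)             # bilinear interpolation of the input

warped.sum().backward()
print(theta.grad)  # non-None: the error backpropagates to theta like any other activation
```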
