How to explain the connection between the input layer and H1 of this CNN Architecture?

I am currently reading the paper by LeCun et al. on handwritten zip code recognition. There is this figure below visualizing the CNN architecture. But I do not really understand how the connection between layer H1 and the input layer makes sense. If there are 12 kernels of size 5x5, shouldn't layer H1 be 12x144? Or is there some downsampling taking place here too?



Yes, the spatial dimensions (height and width) are reduced: the input is 16x16, H1 is 8x8 and H2 is 4x4.

Also see the first paragraph in the architecture section of the paper.

In modern terms, you would say that they use a stride of 2, which reduces the spatial dimensions accordingly.
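As a quick sanity check (a PyTorch sketch of my own, not the framework used in the paper; it also uses full connectivity between H1 and H2, unlike the paper's partial connection scheme), two stride-2 convolutions with 5x5 kernels take a 16x16 input to 8x8 and then to 4x4:

```python
import torch
import torch.nn as nn

# Single-channel 16x16 input, batch size 1 (mirrors the normalized digit image).
x = torch.randn(1, 1, 16, 16)

# H1 in modern terms: 12 feature maps, 5x5 kernels, stride 2.
# padding=2 together with PyTorch's flooring reproduces the 8x8 size discussed in the edit below.
h1 = nn.Conv2d(in_channels=1, out_channels=12, kernel_size=5, stride=2, padding=2)

# H2: again 5x5 kernels with stride 2, halving the spatial size to 4x4.
h2 = nn.Conv2d(in_channels=12, out_channels=12, kernel_size=5, stride=2, padding=2)

print(h1(x).shape)      # torch.Size([1, 12, 8, 8])
print(h2(h1(x)).shape)  # torch.Size([1, 12, 4, 4])
```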

EDIT (based on your comment)

The formula for the spatial output dimension $O$ of a (square-shaped) convolutional layer is the following:

$$O = \frac{I - K + 2P}{S} + 1$$ with $I$ being the input size, $K$ the kernel size, $P$ the padding and $S$ the stride. Now you might think that in your example $O = \frac{16 - 5 + 2 \cdot 2}{2} + 1 = 8.5$ (assuming a padding of $P = 2$ on all sides).
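As a minimal sketch (the helper name is mine, not from the paper), the formula can be evaluated directly:

```python
def conv_output_size(input_size, kernel_size, padding, stride):
    """Spatial output size O = (I - K + 2P) / S + 1 of a square convolution."""
    return (input_size - kernel_size + 2 * padding) / stride + 1

# Naively assuming a symmetric padding of 2 on every side:
print(conv_output_size(16, 5, padding=2, stride=2))  # 8.5 -- not an integer
```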

But take a closer look at how it actually plays out when the 5x5 kernel of layer H1 scans the 16x16 input image with a stride of 2 (illustrated in the figure, where the light grey area marks the padded border):

As you can see from the light grey area, the required and effective padding is actually not 2 on all sides. Instead, for the width and the height respectively, it is 2 on one side and 1 on the other, i.e. on average $(2+1)/2 = 1.5$.

And if you plug that into the equation to calculate the output size, it gives $O = \frac{16 - 5 + 2 \cdot 1.5}{2} + 1 = 8$. Accordingly, the convolutional layer H1 will have spatial dimensions of 8x8.
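To check this numerically, one can pad the 16x16 input asymmetrically (2 pixels on one side, 1 on the other, per dimension) and then run an unpadded stride-2 convolution on it; this is a sketch with arbitrary weights, using PyTorch only for illustration:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 16, 16)   # the 16x16 input image
w = torch.randn(12, 1, 5, 5)    # 12 arbitrary 5x5 kernels

# Asymmetric padding (left, right, top, bottom) = (2, 1, 2, 1) -> 19x19
x_padded = F.pad(x, (2, 1, 2, 1))

# "Valid" 5x5 convolution with stride 2 on the padded input:
h1 = F.conv2d(x_padded, w, stride=2)
print(h1.shape)  # torch.Size([1, 12, 8, 8])

# Equivalently, with the formula and the average padding of 1.5:
# (16 - 5 + 2 * 1.5) / 2 + 1 = 8
```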
