How does inception decrease the computational cost?

From the second paragraph of section 3.1, "Factorization into smaller convolutions", in the paper Rethinking the Inception Architecture for Computer Vision:

This setup clearly reduces the parameter count by sharing the weights between adjacent tiles. To analyze the expected computational cost savings,

...

This way, we end up with a net (9+9)/25 × reduction of computation, resulting in a relative gain of 28% by this factorization

Apparently this design decreases the number of parameters, but I can't understand why it decreases the computational cost.

For the case of using two 3*3 convnets to replace a 5*5, I think it increases the computational cost by (3*3*9+3*3)/5*5 = 3.6 times.

What do I miss here?

Topic: inception, convolutional-neural-network

Category: Data Science


Just to add on a bit. I'm not sure where you got "(3*3*9+3*3)/5*5". A 3x3 filter has 9 parameters and a 5x5 has 25 parameters. Therefore, even if you stack two 3x3 filters, you still have only 18 parameters compared to the 25. That is in fact a 28% reduction in the parameters that need to be calculated. This simple reduction, spread out across an entire network, can dramatically improve speed. In fact, you can take this principle further and apply an nx1 filter that moves only horizontally, followed by a 1xn filter that moves only vertically, to see similar gains. (Although this may introduce bottlenecks.)
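The parameter comparison above is easy to verify with a few lines of arithmetic. This is just a sanity check of the counts per spatial position, ignoring channel counts (which scale both options equally):

```python
# One 5x5 kernel vs. two stacked 3x3 kernels, weights per spatial position.
# Channel dimensions are omitted since they multiply both counts equally.
params_5x5 = 5 * 5            # 25 weights
params_two_3x3 = 2 * (3 * 3)  # 18 weights

reduction = 1 - params_two_3x3 / params_5x5
print(params_5x5, params_two_3x3)           # 25 18
print(f"{reduction:.0%} relative reduction")  # 28% relative reduction
```

This is exactly the (9+9)/25 figure quoted from the paper: 18/25 of the original cost, a 28% saving.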

I think a good way to understand the computational cost is to actually calculate a simple filter by hand. Keep track of all the operations you're doing and find out what is reusable.
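To make that hand calculation concrete, here is a small sketch that counts the multiplications a sliding filter performs (the function name and the choice of valid, stride-1 convolution are my assumptions, not from the original posts):

```python
# Count multiplications for sliding a k x k filter over an n x n input
# (valid convolution, stride 1, single channel).
def conv_mults(n, k):
    out = n - k + 1          # output spatial size for a valid convolution
    return out * out * k * k  # one k*k dot product per output cell

n = 32
one_5x5 = conv_mults(n, 5)
# Two stacked 3x3 layers: the second slides over the first layer's output.
two_3x3 = conv_mults(n, 3) + conv_mults(n - 3 + 1, 3)
print(one_5x5, two_3x3)  # the stacked 3x3 pair needs fewer multiplications
```

With same-padding (as the paper assumes, so the grid sizes match), the ratio comes out to exactly 18/25; with valid convolution as above it is slightly higher but still well under 1.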

Lastly, I'll also mention that stacking two 3x3 kernels gives you the receptive field of a 5x5 kernel, stacking three 3x3 kernels gives you the receptive field of a 7x7, and so on. Understanding the receptive field is critical for any computer vision task. Here's some more info: DCNN Receptive Field Info
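The receptive-field growth described above follows a simple rule for stride-1 stacks: each extra 3x3 layer widens the receptive field by 2 pixels. A minimal sketch (the helper function is mine, for illustration):

```python
# Receptive field of a stack of stride-1 k x k convolutions:
# each layer adds (k - 1) pixels on top of the initial single pixel.
def receptive_field(num_layers, k=3):
    return 1 + num_layers * (k - 1)

print(receptive_field(1))  # 3
print(receptive_field(2))  # 5 -> matches a single 5x5
print(receptive_field(3))  # 7 -> matches a single 7x7
```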


However, in the overall scheme, sliding this network can be represented by two 3 x 3 convolutional layers which reuse the activations between adjacent tiles.

Since the replacement (two 3x3 instead of one 5x5) shares weights between adjacent tiles, we don't have to calculate them twice. That is where the gain comes from.

Edit:

The gain comes from the sliding: the formula (n - filtersize + 1) x (n - filtersize + 1) gives the output size of a filter applied to an n x n input.
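That output-size formula also shows why the two factorizations are interchangeable spatially: two valid 3x3 layers end at the same output size as one valid 5x5 layer. A quick check (the helper function is just the formula above, named by me):

```python
# Output size of a valid, stride-1 convolution on an n x n input.
def output_size(n, filter_size):
    return n - filter_size + 1

print(output_size(32, 5))                  # 28
print(output_size(output_size(32, 3), 3))  # 28: two 3x3 layers end at the same size
```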

A more detailed answer can be found here: Reducing Filter Size in Convolutional Neural Network. Thank you @Thomas W.
