Reducing the number of kernels in a CNN by mapping just some of the input channels to each output channel?
So, I am currently learning about CNNs, and I am using PyTorch to implement small models. What I don't understand yet is why a new channel is typically formed as the sum of the kernel outputs of all input channels. The number of parameters in a conv(m, n, kernel) operation is:
m x n x size(kernel)
i.e., for conv(3, 5, kernel(2, 4)) we would have 3 * 5 * (2 * 4) = 120 parameters, correct?
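To sanity-check this count in PyTorch (a minimal sketch; bias disabled, since I am only counting kernel weights):

```python
import torch.nn as nn

# conv(3, 5, kernel(2, 4)): 3 input channels, 5 output channels, 2x4 kernels
conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=(2, 4), bias=False)
print(tuple(conv.weight.shape))  # (5, 3, 2, 4)
print(conv.weight.numel())       # 3 * 5 * (2 * 4) = 120
```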
In general, for each of the n output channels, m kernel maps need to be learned, and the outputs of the m kernel convolutions are summed to form the entries of the new channel:
$$x^{\text{out}}_i = \sum_{j=1}^{m} k_{ji} * x^{\text{in}}_j \qquad \text{for each output channel } i = 1, \dots, n,$$

where $x^{\text{out}}_i$ is the feature output of output channel $i$ (of $n$), $k_{ji}$ is the kernel map from input channel $j$ (of $m$) to output channel $i$, $x^{\text{in}}_j$ is the feature input on the $j$-th channel, and $*$ denotes the convolution product.
Roughly, that's my understanding of how the channel outputs are computed.
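This per-channel sum can be verified in PyTorch by spelling it out by hand and comparing against the built-in convolution (a minimal sketch with dummy shapes):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
m, n = 3, 5                   # input / output channels
x = torch.randn(1, m, 8, 8)   # dummy input
w = torch.randn(n, m, 2, 4)   # n * m kernels of size 2x4

# Built-in convolution: each output channel sums over all m input channels.
out = F.conv2d(x, w)

# The same computation, spelled out channel by channel.
manual = torch.stack([
    sum(F.conv2d(x[:, j:j+1], w[i:i+1, j:j+1]) for j in range(m)).squeeze(1)
    for i in range(n)
], dim=1)

print(torch.allclose(out, manual, atol=1e-5))  # True
```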
Apparently, the biggest advantage of CNNs is parameter sharing, introduced by the kernel maps that slide over the input. This gives the model a large receptive field with respect to the input while maintaining sparse connections between input and output.
What I don't understand is why every single output channel is typically formed as a sum over the kernels of ALL m input channels. Would it not be even more efficient if each output channel were formed from the kernels of just a subset of the input channels?
Say, with the numbers above, each of the 5 output channels were formed from only 2 input kernel maps (instead of three). Then, instead of having to learn 120 parameters (15 kernels), only 2 * 5 * (2 * 4) = 80 parameters (10 kernels) would need to be learned. This is a significant reduction, and it could be overwhelmingly larger for large networks.
An issue would obviously be how to select the subsets of input channels that should be mapped to each output channel. For instance, in the example with two of three input channels, we could choose ((1,2), (1,3), (2,3)), but there is no obvious heuristic for how to select them. Maybe the subsets could be selected randomly upon initialization of the model, as in the sketch below?
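Here is a minimal sketch of what I have in mind. `SubsetConv2d` is a hypothetical layer name, and the masking only emulates the sparse connectivity; an actual implementation would store and compute only the selected kernels in order to really save parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubsetConv2d(nn.Module):
    """Hypothetical conv layer: each output channel only sees a fixed
    random subset of the input channels, chosen once at initialization."""
    def __init__(self, in_ch, out_ch, kernel_size, subset_size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, *kernel_size) * 0.1)
        # One random subset per output channel, e.g. (1,2), (1,3), (2,3), ...
        mask = torch.zeros(out_ch, in_ch, 1, 1)
        for i in range(out_ch):
            mask[i, torch.randperm(in_ch)[:subset_size]] = 1.0
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Zeroed kernels never contribute, so effectively only
        # out_ch * subset_size kernels are used.
        return F.conv2d(x, self.weight * self.mask)

layer = SubsetConv2d(3, 5, (2, 4), subset_size=2)
print(layer(torch.randn(1, 3, 8, 8)).shape)  # torch.Size([1, 5, 7, 5])
```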
This could maybe even act as a form of regularization. However, it would introduce more variability when working with several instances of the same base model, because different combinations of kernels would be selected in each instance. But then again, if the number of in-channels and out-channels is large, this variation would be less significant, which is good. And, especially for models with many in- and out-channels, the reduction in the number of kernels would significantly speed up the computation.
Possibly I totally misunderstand something here, as I am just learning about this topic, so please correct me if I am wrong. In case my understanding is correct, there might still be a relatively simple answer, because this seems like quite an obvious question to ask, I think. Or maybe this is even done somewhere already? I appreciate any feedback!
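(For comparison: PyTorch's `nn.Conv2d` does offer a structured version of this idea via its `groups` argument, where the input channels are split into disjoint groups and each output channel only sees the channels of one group. Both channel counts must be divisible by `groups`, which is why it doesn't cover the 2-of-3 example above.)

```python
import torch.nn as nn

# Standard convolution: every output channel sees all 4 input channels.
full = nn.Conv2d(4, 8, kernel_size=3, bias=False)
print(full.weight.numel())     # 4 * 8 * 9 = 288 parameters

# Grouped convolution: 2 groups of 2 input channels; each output
# channel only sees the 2 channels of its group.
grouped = nn.Conv2d(4, 8, kernel_size=3, groups=2, bias=False)
print(grouped.weight.numel())  # (4/2) * 8 * 9 = 144 parameters
```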
Best, JZ
Tags: kernel, convolutional-neural-network, deep-learning, machine-learning