Why do we need convolutions over volume in convolutional neural networks for image recognition?

In convolutional neural networks, we convolve the three channels of an image (red, green, blue) with a filter of dimensions $k \times k \times 3$.

Each filter consists of adjustable weights and can learn to detect primitive features, like edges. The weights can be different for each channel: one $k \times k$ slice for R, another for G, yet another for B.

My question is: Why do we need separate filters for each channel?

If there's an edge, it will appear in each channel at the same place. The only case where this is not true is when we have lines that are purely red, green, or blue, which is possible only in synthetic images, not in real ones.

For real images, these three filter slices will contain the same information, which means we have three times as many weights and therefore need more data to train the network.
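To make the weight count concrete, here is a minimal sketch (the $3 \times 3$ filter size and the 32 filters are arbitrary choices for illustration):

```python
# Weight count of a first convolutional layer, bias terms ignored.
k = 3            # spatial filter size (3x3, chosen for illustration)
n_filters = 32   # number of filters in the layer (chosen for illustration)

weights_rgb = k * k * 3 * n_filters   # each filter spans all 3 input channels
weights_gray = k * k * 1 * n_filters  # each filter covers a single gray channel

print(weights_rgb, weights_gray)      # 864 vs. 288: three times as many weights
```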

Is this construction redundant, or am I missing something?

Isn't it more practical to detect features on a gray-scale image, where we just need a 2-dimensional convolution? I mean, somewhere in the pre-processing step the image could be separated into a gray-scale part and some other representation containing only the color information. Each part would then be processed by a different network. Would this be more efficient? Do such networks already exist?
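One hedged sketch of such a pre-processing step, assuming Pillow and NumPy are available and using the YCbCr colour space (Y is the gray-scale/luminance part, Cb and Cr carry the color information; the file name is just a placeholder):

```python
from PIL import Image
import numpy as np

# Split an RGB image into a gray-scale (luminance) part and a color part.
# "photo.jpg" is a placeholder path; YCbCr is one possible choice of color space.
img = Image.open("photo.jpg").convert("YCbCr")
ycbcr = np.asarray(img)

luminance = ycbcr[..., 0]     # H x W array: could feed a 2-D convolutional branch
chrominance = ycbcr[..., 1:]  # H x W x 2 array: the color information

print(luminance.shape, chrominance.shape)
```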

Topic image-preprocessing cnn image-recognition image-classification neural-network

Category Data Science


The number of neurons and architecture of the model are hyperparameters. There's no reason you couldn't test greyscale architectures alongside 3-channel models.

Depending on the data set and the problem, a model might actually perform better by compressing the training data down to greyscale. For example, a small data set might overfit more with colour images than with greyscale ones.
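As a rough PyTorch sketch (the layer sizes are arbitrary), switching between the two mostly amounts to changing the number of input channels of the first layer:

```python
import torch.nn as nn

# Two candidate first layers: one for greyscale input, one for RGB input.
conv_gray = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
conv_rgb = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(conv_gray), n_params(conv_rgb))  # 320 vs. 896 (weights + biases)
```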


In medical images the colours are often very vivid and the channels can look very different; take a look at the data from these contests: https://www.kaggle.com/c/human-protein-atlas-image-classification and https://www.kaggle.com/c/data-science-bowl-2018. The images differ across channels, sometimes by a lot. So if you have a picture of a bear, a car, or any "normal" scene, you can probably get away with greyscale or with analysing just one channel in most cases, but on data like these, convolutions over volume will produce better models.
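One quick way to check how redundant the channels of a given image actually are is to look at their pairwise correlation; a small sketch, assuming the image is already loaded as an H x W x 3 NumPy array (random data is used here as a stand-in):

```python
import numpy as np

def channel_correlation(img):
    """Pairwise Pearson correlation between the colour channels of an H x W x 3 image."""
    flat = img.reshape(-1, 3).astype(np.float64)
    return np.corrcoef(flat, rowvar=False)

# Random data as a stand-in for a real image; replace with a loaded image.
img = np.random.rand(64, 64, 3)
print(channel_correlation(img))  # 3x3 matrix; values near 1 mean redundant channels
```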


I'm not sure the information in the colours is as redundant for real images as you say. I could imagine a scenario where the colours around an edge are relevant, e.g. training a network to tell brown bears from polar bears. In terms of edges and shapes they are very similar; the colour, however, is not. So while the edges show up in all channels, I would guess that they show up with different contrast, thus propagating colour information into the classifier.

If you check out this blog where the author visualized some of the learned features, it seems that some features are fairly similar in terms of shape, but not in colour.

There might be a work-around, as you describe it, where you separate shapes and colours, but I think that would involve a lot of prior knowledge, and it might be easier to just let the network figure this out during training. This is even more true if you move to multi-spectral images, such as those from Earth-observation satellites. There you definitely have features that show up only in certain channels and not so much in others.
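For example, you can inspect the first-layer filters of a pretrained network directly; a minimal sketch with torchvision (assuming torchvision is installed, with ResNet-18 used purely as an example):

```python
from torchvision import models

# Load a pretrained ResNet-18 and look at its first convolutional layer.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
filters = model.conv1.weight.detach()  # shape: (64, 3, 7, 7)

# Spread of each filter across the R, G, B channels: a colour-insensitive
# filter has nearly identical weights in all three channels (spread close to 0).
spread = filters.std(dim=1).mean(dim=(1, 2))
print(filters.shape)
print(spread.sort(descending=True).values[:5])  # the most colour-sensitive filters
```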
