Why do we need convolutions over volume in convolutional neural networks for image recognition?
In convolutional neural networks, we convolve the three color channels (red, green, blue) with a filter of dimensions $k\times k\times 3$, like in the picture:
Each filter consists of adjustable weights and can learn to detect primitive features, like edges. The filter's weights can differ from channel to channel: one $k\times k$ slice for R, another for G, yet another for B.
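To make this concrete, here is a minimal NumPy sketch of what I mean by a convolution over volume (the function name `conv_volume` and the toy data are mine, just for illustration): each channel has its own $k\times k$ weight slice, and the three per-channel responses are summed into a single 2-D feature map.

```python
import numpy as np

def conv_volume(image, kernel):
    """'Valid' cross-correlation (as in most deep-learning frameworks)
    of an H x W x 3 image with a k x k x 3 filter.

    Each channel is correlated with its own k x k weight slice and the
    three responses are summed, giving one 2-D feature map.
    """
    H, W, C = image.shape
    k = kernel.shape[0]
    assert kernel.shape == (k, k, C)
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k, :]   # k x k x 3 patch
            out[i, j] = np.sum(patch * kernel)   # sum over rows, cols AND channels
    return out

# Example: a vertical-edge detector replicated across R, G and B.
edge_2d = np.array([[1., 0., -1.],
                    [1., 0., -1.],
                    [1., 0., -1.]])
kernel = np.stack([edge_2d, edge_2d, edge_2d], axis=-1)  # 3 x 3 x 3 = 27 weights
image = np.random.rand(32, 32, 3)                        # toy RGB image
feature_map = conv_volume(image, kernel)                 # shape (30, 30)
```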
My question is: Why do we need separate filters for each channel?
If there is an edge, it will appear in each channel at the same place. The only case where this is not true is when we have lines that are purely red, green, or blue, which is possible only in synthetic images, not in real ones.
For real images these three per-channel slices will contain the same information, which means we have three times as many weights, and so we need more data to train the network.
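To put numbers on the "three times as many weights" point (ignoring the bias term):

$$
\underbrace{k \times k \times 3}_{\text{RGB volume filter}} = 3k^{2}
\qquad\text{vs.}\qquad
\underbrace{k \times k}_{\text{single-channel filter}} = k^{2},
$$

so for $k = 3$ each filter has $27$ weights instead of $9$.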
Is this construction redundant, or am I missing something?
Isn't it more practical to detect features on a gray-scale image, where we need only a 2-dimensional convolution? I mean, somewhere in the pre-processing step the image could be separated into a gray-scale part and some other representation containing only the color information. Then each part would be processed by a different network. Would this be more efficient? Do such networks already exist?
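To illustrate what I have in mind (this is only a hypothetical sketch of the proposed pre-processing, not an existing, named architecture; `split_luma_chroma` is my own name, and the weights roughly follow the ITU-R BT.601 YCbCr conversion): the image is split into a luma (gray-scale) plane and two chroma planes, edge detection runs as a plain 2-D convolution on the luma plane, and the chroma planes would be fed to a separate branch.

```python
import numpy as np
from scipy.signal import correlate2d  # 2-D cross-correlation

def split_luma_chroma(image):
    """Split an H x W x 3 RGB image (values in [0, 1]) into luma + chroma.

    Uses approximate ITU-R BT.601 weights; the luma plane Y carries most of
    the structural (edge) information, Cb/Cr carry the color information.
    """
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b   # gray-scale / luma
    cb = 0.564 * (b - y)                     # blue-difference chroma
    cr = 0.713 * (r - y)                     # red-difference chroma
    return y, np.stack([cb, cr], axis=-1)

# Edge detection now needs only a 2-D (k x k) filter on the luma plane,
# i.e. 9 weights instead of 27 for a 3 x 3 kernel.
edge_2d = np.array([[1., 0., -1.],
                    [1., 0., -1.],
                    [1., 0., -1.]])
image = np.random.rand(32, 32, 3)
luma, chroma = split_luma_chroma(image)
edges = correlate2d(luma, edge_2d, mode="valid")  # shape (30, 30)
# `chroma` (H x W x 2) would go to a separate, hypothetical color branch.
```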