Using 1x1(x1) convolutions as a learned alternative to max pooling (3D)?
I have a semantic segmentation network that ingests 3D images (hyperspectral cubes, $(x, y, b)$) and predicts 2D images (a semantic map, $(x, y)$). The network takes the form of a classic UNet, except that it uses 3D convolutions on the encoder side and 2D (de)convolutions on the decoder side.
In the skip connections I have been using a 3D max-pooling to collapse the hyperspectral band dimension, $b$, to $1$, so that I keep the receptive field and structural information flowing through the skip connections while still being able to perform them: to carry out a skip connection I need to concatenate a 3D tensor from the encoder side onto a 2D tensor on the decoder side.
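For reference, this is roughly how I have wired up the max-pool skip connection (I am assuming TensorFlow/Keras with a channels-last layout here purely for illustration; `maxpool_skip`, `encoder_feats`, `decoder_feats`, and `n_bands` are placeholder names):

```python
# Minimal sketch of the max-pool skip connection; assumes TensorFlow/Keras with a
# channels-last layout (sample, x, y, band, filter). All names are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

def maxpool_skip(encoder_feats, decoder_feats, n_bands):
    # Collapse the band axis: (s, x, y, b, f) -> (s, x, y, 1, f)
    pooled = layers.MaxPooling3D(pool_size=(1, 1, n_bands))(encoder_feats)
    # Drop the singleton band axis: (s, x, y, 1, f) -> (s, x, y, f)
    squeezed = layers.Lambda(lambda t: tf.squeeze(t, axis=3))(pooled)
    # Concatenate with the 2D decoder features along the filter axis
    return layers.Concatenate(axis=-1)([squeezed, decoder_feats])
```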
Max-pooling does the trick, but it has been hinted to me that I could use 1x1x1 convolutions to perform an operation with the same effect, except that its parameters would be learned, making it a "smart" pooling layer. The kernel would essentially act as a single small fully connected network that learns which bands contribute most usefully to the decoding process.
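If I understand the suggestion correctly, the layer would compute something like $\hat{F}(x, y, f) = \sum_{b=1}^{B} w_b \, F(x, y, b, f) + \beta$, i.e. one learned weight per band (plus a bias), shared across all spatial positions and filters.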
I have implemented this in my network, but I am not sure whether it is technically correct. Currently my network works with 5D tensors of shape $(sample, x, y, band, filter)$. To apply this 1x1 convolution (which, per the Inception paper and other sources, performs dimensionality reduction along the filter dimension) I had to permute the tensor to $(s, x, y, f, b)$ so that the band dimension sits where the channel/filter dimension is expected, apply the convolution to obtain $(s, x, y, f, 1)$, and finally reshape to $(s, x, y, f)$.
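Concretely, my implementation looks roughly like this (again assuming TensorFlow/Keras with a channels-last layout purely for illustration; the names are placeholders):

```python
# Sketch of the 1x1x1-convolution variant; assumes TensorFlow/Keras with a
# channels-last layout (sample, x, y, band, filter). All names are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

def learned_band_pool(encoder_feats, decoder_feats):
    # Move the band axis into the channel position: (s, x, y, b, f) -> (s, x, y, f, b)
    permuted = layers.Permute((1, 2, 4, 3))(encoder_feats)
    # 1x1x1 convolution with one output filter: learns one weight per band,
    # shared across x, y and the filter axis: (s, x, y, f, b) -> (s, x, y, f, 1)
    reduced = layers.Conv3D(filters=1, kernel_size=1)(permuted)
    # Drop the singleton band axis: (s, x, y, f, 1) -> (s, x, y, f)
    squeezed = layers.Lambda(lambda t: tf.squeeze(t, axis=-1))(reduced)
    # Concatenate with the 2D decoder features along the filter axis
    return layers.Concatenate(axis=-1)([squeezed, decoder_feats])
```

(I used a dimension permutation rather than a flat reshape for the first step so that the band and filter axes are actually swapped along with their values, not just reinterpreted in memory.)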
Is this use of a 1x1 convolution in place of a pooling layer technically correct as described? And is it proper to permute/reshape my tensors inside the network in this way to perform the operation?
Topic: semantic-segmentation pooling convolution dimensionality-reduction
Category: Data Science