Combining image and scalar inputs into a neural network
I'm looking for the best way to combine a CNN that takes an image as input with an additional scalar input.
I know that one way is to concatenate the output of the flatten layer with this scalar value. But the flatten layer contains, for example, 2048 values whose magnitudes differ from that of the single scalar input. And what if, in a real task, the scalar has more influence on the target than the image does? Another example I've seen combines text and images and then applies some fusion on top, but I think that is a slightly different task, because the text model and the CNN both produce vectors of much the same kind. Yet another solution is to apply a classical ML algorithm, such as XGBoost, to the CNN's flatten layer output together with the scalar value. But in that case the CNN has to be trained separately, which is not good.
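To make the concatenation approach concrete, here is a minimal sketch of what I mean. Everything specific in it is an assumption for illustration only: the choice of PyTorch, the tiny stand-in CNN, the layer sizes, and the name `ImageScalarNet`. It shows one possible variant where the scalar first passes through its own small dense branch and both branches are layer-normalised before concatenation, so the scalar is not a single raw value next to hundreds of image features. I'm not claiming this is the best design.

```python
import torch
import torch.nn as nn


class ImageScalarNet(nn.Module):
    """Hypothetical late-fusion model: CNN features plus one scalar input."""

    def __init__(self, scalar_embed_dim: int = 32, num_outputs: int = 1):
        super().__init__()
        # Small stand-in CNN; a real backbone would replace this.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),  # -> (N, 32, 4, 4)
            nn.Flatten(),             # -> (N, 512)
        )
        # Normalise the image features so their scale is comparable
        # to the scalar branch output.
        self.img_norm = nn.LayerNorm(512)
        # Expand the single scalar into a small learned feature vector.
        self.scalar_branch = nn.Sequential(
            nn.Linear(1, scalar_embed_dim),
            nn.ReLU(),
            nn.LayerNorm(scalar_embed_dim),
        )
        # Shared head that sees both sources and learns how to weight them.
        self.head = nn.Sequential(
            nn.Linear(512 + scalar_embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_outputs),
        )

    def forward(self, image: torch.Tensor, scalar: torch.Tensor) -> torch.Tensor:
        img_feat = self.img_norm(self.cnn(image))    # (N, 512)
        scalar_feat = self.scalar_branch(scalar)     # (N, scalar_embed_dim)
        fused = torch.cat([img_feat, scalar_feat], dim=1)
        return self.head(fused)


# One joint training step on random data (regression with MSE, as an example).
model = ImageScalarNet()
images = torch.randn(8, 3, 32, 32)
scalars = torch.randn(8, 1)
targets = torch.randn(8, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = nn.functional.mse_loss(model(images, scalars), targets)
optimizer.zero_grad()
loss.backward()   # gradients flow into both the CNN and the scalar branch
optimizer.step()
```

Because the loss is backpropagated through the concatenation, the CNN and the scalar branch are trained together in a single step, which is the property I'm after.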
Can someone tell me the best way to combine an image input with a scalar value, so that I can train the CNN jointly with the scalar input and let the network "decide" which input is more important?
Topic cnn ensemble-modeling deep-learning neural-network machine-learning
Category Data Science