How does attention for feature fusion work?
I am struggling to understand how a self-attention layer would be used to fuse features from different modalities. My understanding so far is this:
Each modality is fed into its own self-attention layer, which produces an attention score for every feature of that modality. These scores indicate which features are most important within that modality. I have also read that the same scores can be used to determine which modality is most important, so that when the final feature vectors are fused, the modality-level score acts as a weight on each modality and the attention scores act as weights on its individual features.
Am I understanding this correctly? How is the modality weight computed, and how are the feature weights applied? Is it by multiplying the features by their scores and then concatenating the results?
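To make the question concrete, here is a minimal sketch of the scheme as I currently picture it. It is a simplified gating-style attention, not full query/key/value self-attention, and it may not match what the papers actually do. The names `fuse`, `feature_params` and `modality_params` are hypothetical, and the parameter values are placeholders standing in for weights that would normally be learned.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(modalities, feature_params, modality_params):
    """Weight features within each modality, weight the modalities, then concatenate."""
    weighted, modality_logits = [], []
    for name, x in modalities.items():
        # Feature-level attention: one score per feature, normalised within the modality.
        alpha = softmax([w * xi for w, xi in zip(feature_params[name], x)])
        # Multiply each feature by its attention score.
        z = [a * xi for a, xi in zip(alpha, x)]
        weighted.append(z)
        # Modality-level logit: a scalar summary of the re-weighted features (assumed form).
        modality_logits.append(sum(v * zi for v, zi in zip(modality_params[name], z)))
    # Modality weights: softmax across modalities, so they sum to 1.
    beta = softmax(modality_logits)
    # Scale each modality by its weight and concatenate into one fused vector.
    fused = [b * zi for b, z in zip(beta, weighted) for zi in z]
    return fused, beta

# Toy inputs: two modalities with different feature counts (made-up numbers).
modalities = {"audio": [0.2, 1.5, -0.3], "video": [0.9, 0.1]}
feature_params = {"audio": [1.0, 1.0, 1.0], "video": [1.0, 1.0]}    # placeholder weights
modality_params = {"audio": [1.0, 1.0, 1.0], "video": [1.0, 1.0]}   # placeholder weights
fused, beta = fuse(modalities, feature_params, modality_params)
```

Here `fused` has one entry per input feature (five in this toy case) and `beta` holds one weight per modality. What I cannot tell is whether the modality weight in the papers is derived from the feature scores like this, or computed by a separate attention layer.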
Here are several papers that have used attention for feature fusion, but I do not understand how it actually works and the math behind it.
Self-attention multimodal fusion,
Topic attention-mechanism deep-learning neural-network
Category Data Science