How does attention for feature fusion work?

I am struggling to understand how a self-attention layer can be used to fuse features from different modalities. What I understand so far is this:

Each modality is fed into its own self-attention layer, which produces attention scores for every feature of that modality, so the scores tell us which features are most important within that modality. I have also read that these scores can be used to work out which modality is most important overall, so that when fusing the final feature vectors, the modality-level score is used as a weight for the whole modality and the attention scores are used as per-feature weights.

Am I understanding this correctly? How is the modality weight score computed, and how are the feature weights applied? By multiplying the features by their scores and then concatenating, as in the sketch below?
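
To make the question concrete, here is a minimal PyTorch sketch of my current understanding. Everything in it is an assumption on my part: the layer names, the dimensions, and especially the pooling step that turns per-feature attention scores into a per-modality weight (I have guessed max-pooling, which is exactly the part I am unsure about). Strictly speaking this is simple soft-attention pooling rather than full query-key-value self-attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Hypothetical attention-based fusion of M modalities.
    This is my guess at the mechanism, not a reference implementation."""

    def __init__(self, dims, d_model=64):
        super().__init__()
        # Project each modality into a shared d_model-dimensional space.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        # One scoring head per modality: produces per-feature attention logits.
        self.score = nn.ModuleList([nn.Linear(d_model, 1) for _ in dims])

    def forward(self, feats):
        # feats: list of tensors, one per modality, each (batch, n_i, dim_i).
        pooled, modality_logits = [], []
        for x, proj, score in zip(feats, self.proj, self.score):
            h = proj(x)                        # (batch, n_i, d_model)
            logits = score(h)                  # (batch, n_i, 1)
            a = F.softmax(logits, dim=1)       # per-feature weights
            pooled.append((a * h).sum(dim=1))  # weighted sum -> (batch, d_model)
            # My guess: pool the feature logits into one "importance" per modality.
            modality_logits.append(logits.max(dim=1).values)  # (batch, 1)
        # Softmax across modalities -> one weight per modality.
        w = F.softmax(torch.cat(modality_logits, dim=1), dim=1)  # (batch, M)
        # Multiply each modality vector by its weight, then concatenate.
        fused = torch.cat(
            [w[:, i:i + 1] * v for i, v in enumerate(pooled)], dim=1
        )
        return fused  # (batch, M * d_model)

# Two made-up modalities: 10 text features of dim 300, 5 sensor features of dim 32.
fusion = AttentionFusion(dims=[300, 32])
text = torch.randn(8, 10, 300)
sensor = torch.randn(8, 5, 32)
out = fusion([text, sensor])
print(out.shape)  # torch.Size([8, 128])
```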

Here are several papers that use attention for feature fusion, but I do not understand how it actually works or the math behind it:

- Self-attention multimodal fusion
- Self-attention for feature fusion in NLP
- Self-attention for feature fusion in sensor data

Topic: attention-mechanism, deep-learning, neural-network

Category: Data Science
