Is a dense layer required for implementing Bahdanau attention?

I saw that everyone adds a Dense() layer in their custom Bahdanau attention layer, which I think isn't needed.

This is an image from a tutorial here. In it, we just multiply two vectors and then perform several operations on those vectors alone. So what is the need for a Dense() layer? Is the tutorial on 'how does attention work' wrong?

Topic attention-mechanism deep-learning machine-learning

Category Data Science


A Dense layer inside the attention layer's logic is a must. It is typically a single-unit, single-layer MLP (though additive attention provides for using more units). This layer is what lets the model learn to derive the attention weights during training.

This Dense layer transforms the input (each encoder LSTM output) while keeping its dimensions the same. Each transformed encoder output is then multiplied with the last decoder hidden state, and the resulting scores are softmaxed to produce the attention weights.
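A minimal sketch of that scoring step is shown below; the tensor shapes, variable names, and the use of tf.einsum are my own assumptions for illustration, not code from the tutorial:

```python
import tensorflow as tf

# Assumed shapes: a tiny batch of encoder outputs and one decoder hidden state
batch, seq_len, hidden = 2, 5, 8
encoder_outputs = tf.random.normal((batch, seq_len, hidden))  # one vector per input step
decoder_state = tf.random.normal((batch, hidden))             # last decoder hidden state

score_dense = tf.keras.layers.Dense(hidden)      # Dense transform, feature dimension unchanged
transformed = score_dense(encoder_outputs)       # (batch, seq_len, hidden)

# Score each transformed encoder step against the decoder state, then softmax over the steps
scores = tf.einsum('bsh,bh->bs', transformed, decoder_state)  # (batch, seq_len)
attention_weights = tf.nn.softmax(scores, axis=1)             # sums to 1 across the sequence
```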

Multiplying the attention weights with the encoder hidden states helps to magnify or diminish the relevant input information, and the resulting context vector can then be used for predictions. (In additive attention, the score is computed by adding the transformed encoder and decoder states rather than taking their dot product.)
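For the additive (Bahdanau) form itself, a custom Keras layer is often sketched along the following lines; the class name, the units argument, and the W1/W2/V names are illustrative choices, not the tutorial's code:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score = V(tanh(W1(values) + W2(query)))."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # transforms each encoder output
        self.W2 = tf.keras.layers.Dense(units)  # transforms the decoder hidden state
        self.V = tf.keras.layers.Dense(1)       # single-unit layer producing the raw score

    def call(self, query, values):
        # query: decoder hidden state (batch, hidden); values: encoder outputs (batch, seq_len, hidden)
        query_with_time_axis = tf.expand_dims(query, 1)                  # (batch, 1, hidden)
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query_with_time_axis)))  # (batch, seq_len, 1)
        attention_weights = tf.nn.softmax(score, axis=1)                 # weights over the input steps
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)  # weighted sum of encoder states
        return context_vector, attention_weights

# Example call with dummy tensors (batch=2, seq_len=5, hidden=8)
attn = BahdanauAttention(units=10)
context, weights = attn(tf.random.normal((2, 8)), tf.random.normal((2, 5, 8)))
```

The three Dense layers here (W1, W2, and the single-unit V) are exactly the trainable weights being referred to: remove them and there is nothing left for the model to learn when producing the attention weights.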

Without the Dense layer in the attention layer, we don't get the attention weights. Without attention weights, there is no context. Without context, there is no attention.

So, in order to incorporate attention, a single-layer Dense MLP is mandatory. The model learns this layer's weights during training, and those learned weights are then used to derive the attention weights.
