Attention weights - do they change during learning and prediction?

Assume a simple LSTM followed by an attention layer, or a full Transformer architecture. The attention weights are learned during training and are multiplied with the queries, keys, and values.

Please correct me if my understanding above is wrong, or if the questions below are based on a misunderstanding.

The question is: when do the weights of the attention layer change, and when do they not?

  1. Do the attention layer weights change for each input sequence? (I assume not, but please confirm.)
  2. Are the attention layer weights frozen during prediction (inference), or do they keep changing?
  3. In Transformers or BERT, are these weights supplied as part of the pretrained model?

Topic: transformer, attention-mechanism, sequence-to-sequence

Category: Data Science


The term "attention weights" seems overloaded to me, as you may refer to the computed attention weights applied to the weighted sum of the values, or you may be referring to the attention head parameters, which are learned during training. I assume you are referring to attention head parameters.

With that assumption:

  1. During training, the attention head parameters are the same for all sequences in the same optimization step (i.e. in each batch). After each optimization step, the parameters are updated and, therefore, the next batch will use the new parameter values.

  2. The attention head parameters are learned during training and, like all the other model parameters, do not change at inference time (see the sketch after this list).

  3. Yes, in pretrained Transformers the attention head parameters are part of the pretrained model.
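The sketch below (reusing the hypothetical `SingleHeadAttention` module from above) contrasts these points: the parameters are shared by all sequences in a batch and change only at `optimizer.step()`, they stay frozen during inference, and the computed attention weights are recomputed for every input. A pretrained Transformer such as BERT simply ships these parameter tensors in its checkpoint.

```python
import torch

model = SingleHeadAttention(d_model=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One batch of 4 sequences: all of them are processed with the same parameters.
x = torch.randn(4, 10, 16)

# Training: parameters change only when the optimizer steps.
w_before = model.w_v.weight.detach().clone()
output, _ = model(x)
output.sum().backward()          # dummy loss, just to produce gradients
optimizer.step()
print(torch.equal(w_before, model.w_v.weight))   # False: parameters were updated

# Inference: parameters are frozen; only the attention weights are recomputed.
model.eval()
with torch.no_grad():
    w_frozen = model.w_v.weight.detach().clone()
    _, attn_a = model(torch.randn(1, 10, 16))
    _, attn_b = model(torch.randn(1, 10, 16))
print(torch.equal(w_frozen, model.w_v.weight))   # True: forward passes do not change them
print(torch.allclose(attn_a, attn_b))            # False (in general): they depend on the input

# Loading a pretrained Transformer (e.g. BERT) restores exactly these kinds
# of trained parameter tensors from the checkpoint.
```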
