Can the attention mask hold values between 0 and 1?

I am new to attention-based models and wanted to understand more about the attention mask in NLP models.

attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. It's the mask that we typically use for attention when a batch has varying length sentences.

So a normal attention mask is supposed to look like this for a particular sequence of length 5 (with the last 2 tokens padded): [1, 1, 1, 0, 0].
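For reference, here is a minimal sketch (assuming the Hugging Face transformers and torch packages are installed, and using bert-base-uncased purely as an example checkpoint) of the binary mask a tokenizer produces for a padded batch:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["a short sentence", "a somewhat longer sentence with more tokens"],
    padding=True,               # pad to the longest sequence in the batch
    return_tensors="pt",
)

print(batch["attention_mask"])        # rows of 1s for real tokens, 0s for padding
print(batch["attention_mask"].dtype)  # torch.int64, i.e. a LongTensor
```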

But can we have an attention mask like this -- [1, 0.8, 0.6, 0, 0] -- where the values are between 0 and 1, to indicate that we still want to pay attention to those tokens, but their contribution to the model's output is reduced because of their lower attention weights (kind of like dealing with class imbalance, where we weight certain classes to compensate for the imbalance)?

Is this approach possible? Is there some other way to have the model not use the information presented by some tokens completely?

Topic imbalanced-data attention-mechanism nlp machine-learning

Category Data Science


In theory, maybe yes, but you would probably need to reimplement the model yourself.

In practice, with the current implementations, probably not. (Judging from the documentation snippet, you are using Hugging Face Transformers.) The documentation says the model expects a LongTensor, i.e., a tensor with integer values. Internally, the attention mask is also used to compute sequence lengths by summing the mask along dimension 1, which a fractional mask would break, and there may be many other places in the code that simply assume the mask values are zeros and ones.
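If you do want to down-weight (rather than fully mask) certain tokens, one way is to implement the attention step yourself and fold the soft mask into the pre-softmax scores as an additive log bias. Below is a minimal sketch in plain PyTorch, not part of the Transformers library; the function name soft_masked_attention is purely illustrative:

```python
import math
import torch
import torch.nn.functional as F

def soft_masked_attention(q, k, v, soft_mask):
    # q, k, v: [batch, seq_len, d]; soft_mask: [batch, seq_len] with values in [0, 1]
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # [batch, seq_len, seq_len]
    # log(mask) as an additive bias: mask 1 -> bias 0, mask 0 -> very large negative
    bias = torch.log(soft_mask.clamp(min=1e-9))
    scores = scores + bias.unsqueeze(1)              # broadcast over query positions
    weights = F.softmax(scores, dim=-1)              # masked tokens get ~0 weight
    return weights @ v

# Example: down-weight the 2nd and 3rd tokens, fully mask the last two.
q = k = v = torch.randn(1, 5, 8)
mask = torch.tensor([[1.0, 0.8, 0.6, 0.0, 0.0]])
out = soft_masked_attention(q, k, v, mask)
```

Since softmax(score + log(m)) is proportional to exp(score) * m, a mask value of 0.8 or 0.6 scales that token's pre-normalization attention weight by the same factor, while a value of 0 drives it to (essentially) zero, just like a standard binary mask.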
