Why does Bahdanau Attention Have to be Causal?

I am using the Bahdanau attention layer in TensorFlow for time series prediction, although conceptually it is similar to NLP applications.

This is what the minimal example code for a single layer looks like:

import tensorflow as tf

dim = 7
Tq = 5   # number of future time steps to predict
Tv = 13  # number of historic lag time steps to consider
batch_size = 2**4

query = tf.random.uniform(shape=(batch_size, Tq, dim))
value = tf.random.uniform(shape=(batch_size, Tv, dim))
key = tf.random.uniform(shape=value.shape)

layer = tf.keras.layers.AdditiveAttention(use_scale=True, causal=True)
output, score = layer(inputs=[query, value, key], return_attention_scores=True)

The score obtained in the last line is a lower triangular matrix (per batch element). But my question is: why does it have to be lower triangular for the system to be causal, whether for language translation applications or for time series?
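To make the question concrete, here is a minimal NumPy sketch (my own, not the Keras internals) of what I understand the causal mask to do: additive scores of shape `(Tq, Tv)` are pushed to a large negative value wherever the key index `j` exceeds the query index `i`, so the softmax weights come out lower triangular. The function names `additive_scores` and `causal_mask` are just illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_scores(query, key):
    # Bahdanau-style additive score: sum over features of tanh(q_i + k_j)
    # query: (Tq, dim), key: (Tv, dim) -> scores: (Tq, Tv)
    return np.tanh(query[:, None, :] + key[None, :, :]).sum(-1)

def causal_mask(Tq, Tv):
    # entry (i, j) is True when key j is allowed for query i, i.e. j <= i
    return np.tril(np.ones((Tq, Tv), dtype=bool))

Tq, Tv, dim = 5, 13, 7
rng = np.random.default_rng(0)
q = rng.uniform(size=(Tq, dim))
k = rng.uniform(size=(Tv, dim))

scores = additive_scores(q, k)
masked = np.where(causal_mask(Tq, Tv), scores, -1e9)  # ~ -inf before softmax
weights = softmax(masked)
# every weight with j > i underflows to zero -> lower triangular pattern
```

Note that with Tq=5 and Tv=13 this means query step 2, say, can only attend to keys 0..2, even though keys 3..12 exist and are fully known.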

In both cases, the whole historic lag data (for time series) or all source-sentence tokens (for translation tasks) are available and already encoded by the preceding RNN/LSTM layers, right? So even if the output context at index 2 (counting from zero) depends linearly on the fifth input state via $s_2=\sum_{i=0}^{12}\alpha_{2,i}h_i$ with $\alpha_{2,5}\neq0$, what is the problem there?
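Nothing in the arithmetic forbids this: an unmasked softmax gives strictly positive weight to every encoder state, including the fifth. A small NumPy sketch of the unmasked context computation from the equation above (the random scores `e` stand in for whatever the alignment model would produce):

```python
import numpy as np

rng = np.random.default_rng(1)
Tv, dim = 13, 7
h = rng.normal(size=(Tv, dim))       # encoder states h_0..h_12, all available
e = rng.normal(size=Tv)              # unmasked alignment scores for output step 2
alpha = np.exp(e) / np.exp(e).sum()  # softmax: every alpha_i > 0, incl. alpha_5
s2 = alpha @ h                       # context s_2 = sum_i alpha_i h_i
```

So mathematically the "non-causal" context is perfectly well defined; the question is what goes wrong at the modeling level.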

Topic: bahdanau, attention-mechanism, lstm, machine-translation, time-series

Category: Data Science
