Why does Bahdanau Attention Have to be Causal?

I am using the Bahdanau attention layer in TensorFlow for time series prediction, although conceptually it is similar to NLP applications.

This is what the minimal example code for a single layer looks like:

import tensorflow as tf

dim = 7
Tq = 5   # number of future time steps to predict
Tv = 13  # number of historic lag time steps to consider
batch_size = 2**4

query = tf.random.uniform(shape=(batch_size, Tq, dim))
value = tf.random.uniform(shape=(batch_size, Tv, dim))
key = tf.random.uniform(shape=value.shape)

layer = tf.keras.layers.AdditiveAttention(use_scale=True, causal=True)
output, score = layer(inputs=[query, value, key], return_attention_scores=True)

The score obtained in the last line is a lower triangular matrix (per batch element). But my question is: why does it have to be lower triangular for the system to be causal, whether for language translation applications or for time series?
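To make the question concrete, here is a minimal NumPy sketch (my own, not the Keras internals) of what I understand the causal mask to do: additive scores of shape `(Tq, Tv)` are pushed to a large negative value wherever the key index `j` exceeds the query index `i`, so the softmax weights come out lower triangular. The function names `additive_scores` and `causal_mask` are just illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_scores(query, key):
    # Bahdanau-style additive score: sum over features of tanh(q_i + k_j)
    # query: (Tq, dim), key: (Tv, dim) -> scores: (Tq, Tv)
    return np.tanh(query[:, None, :] + key[None, :, :]).sum(-1)

def causal_mask(Tq, Tv):
    # entry (i, j) is True when key j is allowed for query i, i.e. j <= i
    return np.tril(np.ones((Tq, Tv), dtype=bool))

Tq, Tv, dim = 5, 13, 7
rng = np.random.default_rng(0)
q = rng.uniform(size=(Tq, dim))
k = rng.uniform(size=(Tv, dim))

scores = additive_scores(q, k)
masked = np.where(causal_mask(Tq, Tv), scores, -1e9)  # ~ -inf before softmax
weights = softmax(masked)
# every weight with j > i underflows to zero -> lower triangular pattern
```

Note that with Tq=5 and Tv=13 this means query step 2, say, can only attend to keys 0..2, even though keys 3..12 exist and are fully known.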

In both cases, the whole historic lag data (for time series) or all source-sentence tokens (for translation tasks) are available and already encoded by the preceding RNN/LSTM layers, right? So even if the output context at index 2 (counting from zero) depends linearly on the fifth input state via $s_2=\sum_{i=0}^{12}\alpha_{2,i}h_i$ with $\alpha_{2,5}\neq0$, what is the problem there?
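Nothing in the arithmetic forbids this: an unmasked softmax gives strictly positive weight to every encoder state, including the fifth. A small NumPy sketch of the unmasked context computation from the equation above (the random scores `e` stand in for whatever the alignment model would produce):

```python
import numpy as np

rng = np.random.default_rng(1)
Tv, dim = 13, 7
h = rng.normal(size=(Tv, dim))       # encoder states h_0..h_12, all available
e = rng.normal(size=Tv)              # unmasked alignment scores for output step 2
alpha = np.exp(e) / np.exp(e).sum()  # softmax: every alpha_i > 0, incl. alpha_5
s2 = alpha @ h                       # context s_2 = sum_i alpha_i h_i
```

So mathematically the "non-causal" context is perfectly well defined; the question is what goes wrong at the modeling level.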

Topic: bahdanau, attention-mechanism, lstm, machine-translation, time-series

Category: Data Science
