Attention model with seq2seq over sequence

On the official TensorFlow page there is an example of a decoder (https://www.tensorflow.org/tutorials/text/nmt_with_attention#next_steps):

import tensorflow as tf

# BahdanauAttention is the attention layer defined earlier in the linked tutorial
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

The output here is of shape (batch_size, vocab). I would like to have an output of shape (batch_size, max_length, vocab), where max_length is the length of the output sequence. I suppose something could be done with the TimeDistributed layer, but I tried several things and nothing really worked out. Is there any workaround to obtain this?

One way would be:

self.attention = tf.keras.layers.TimeDistributed(BahdanauAttention(self.dec_units))

but since BahdanauAttention takes two inputs, I do not know how to make this work: the TimeDistributed layer does not handle multiple inputs easily.

Topic attention-mechanism lstm keras tensorflow

Category Data Science


The output of an attention layer - the context vector - is typically the weighted SUM of the inputs: each input time-step is diminished or magnified by its attention weight depending on how relevant it is at that decoding step.
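For completeness, here is a sketch of a Bahdanau-style attention layer along the lines of the tutorial's BahdanauAttention, with the weighted-sum step made explicit (the exact class is defined in the linked tutorial; treat this as an illustration rather than a drop-in copy):

class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # query  = decoder hidden state, shape (batch_size, hidden_size)
    # values = encoder output,       shape (batch_size, max_length, hidden_size)
    query_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector: weighted SUM over the time axis -> (batch_size, hidden_size)
    context_vector = tf.reduce_sum(attention_weights * values, axis=1)

    return context_vector, attention_weights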

So the context has the same dimensionality as a single encoder output, typically (batch_size, hidden_size). You need to generate this context at every time-step of the decoder, so you automatically get max_length contexts. The attention layer logic above is therefore correct; you just need to call the decoder in a loop, once per output time-step.

See "def train_step(inp, targ, enc_hidden):" in the link you have shared where this is done.
