For an LSTM-based seq2seq model, is reversing the input still necessary or advised when using attention?

The original seq2seq paper reversed the input sequence and cited multiple reasons for doing so. See: Why does LSTM performs better when the source target is reversed? (Seq2seq)

But when using attention, is there still any benefit to doing this? I imagine that, since the decoder has access to the encoder hidden states at each time step, it can learn what to attend to, and the input can be fed in its original order.

Topic attention-mechanism sequence-to-sequence lstm machine-translation

Category Data Science


No, this trick is only relevant to vanilla sequence-to-sequence models. It is exactly as you say: the decoder gets information from the encoder via the attention mechanism, so the long distance between aligned source and target words no longer has to be carried by the RNN state alone. Also, the encoder is typically bidirectional, meaning two RNNs process the source sentence in opposite directions. As a result, each encoder state contains information about the complete source sentence: one half from the left-side context, the other from the right-side context.
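A minimal sketch of that point (PyTorch, with made-up vocabulary and hidden sizes): the source is fed in its original order, the bidirectional encoder gives every state both left and right context, and the decoder attends over all encoder states directly, so no reversal is needed to shorten any source-target distance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnSeq2SeqStep(nn.Module):
    """One decoder step of an attention-based seq2seq model (illustrative only)."""
    def __init__(self, vocab_size=1000, emb=64, hid=128):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, emb)
        self.tgt_emb = nn.Embedding(vocab_size, emb)
        # Bidirectional encoder: each state carries left AND right context.
        self.encoder = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTMCell(emb, 2 * hid)
        self.out = nn.Linear(4 * hid, vocab_size)

    def forward(self, src, tgt_prev, dec_state):
        # Source is encoded in its ORIGINAL order; no reversal.
        enc_states, _ = self.encoder(self.src_emb(src))            # (B, S, 2*hid)
        h, c = self.decoder(self.tgt_emb(tgt_prev), dec_state)
        # Dot-product attention: the decoder reaches any source position
        # in one step, so distance between aligned words no longer matters.
        scores = torch.bmm(enc_states, h.unsqueeze(2)).squeeze(2)  # (B, S)
        weights = F.softmax(scores, dim=1)
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        logits = self.out(torch.cat([h, context], dim=1))
        return logits, (h, c), weights

# Usage: a single decoding step over a toy batch.
model = AttnSeq2SeqStep()
src = torch.randint(0, 1000, (2, 7))          # batch of 2, source length 7
tgt_prev = torch.randint(0, 1000, (2,))       # previous target tokens
state = (torch.zeros(2, 256), torch.zeros(2, 256))
logits, state, weights = model(src, tgt_prev, state)
print(logits.shape, weights.shape)            # (2, 1000) and (2, 7)
```

The attention weights span the whole source sentence at every step, which is exactly why the ordering trick from the original paper buys nothing here.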

With today's Transformer models, reversing the input makes no sense at all: Transformers treat the input as an unordered set, and the only way they know about token order is through position embeddings (as long as the position embeddings stay attached to the right tokens, you can permute the input tokens randomly).
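A small sketch of that property, assuming a plain self-attention layer with no position embeddings added: permuting the input tokens just permutes the outputs in the same way, i.e. the layer itself is order-agnostic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
attn.eval()

x = torch.randn(1, 5, 16)        # 5 "token" vectors, no position information added
perm = torch.randperm(5)

out, _ = attn(x, x, x)                                     # original order
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])     # permuted order

# Same outputs, merely reordered by the permutation.
print(torch.allclose(out[:, perm], out_perm, atol=1e-6))   # True
```

Only once position embeddings are added to the token vectors does the model see order at all, which is why input reversal has no meaning for Transformers.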
