Self Attention vs LSTM with Attention for NMT

I am trying to compare the following two architectures:

  • A: Transformer-based architecture for Neural Machine Translation (NMT) from the Attention is All You Need paper, with

  • B: an architecture based on bidirectional LSTMs in the encoder coupled with a unidirectional LSTM in the decoder, which attends to all the encoder hidden states, forms a weighted combination of them, and uses this context together with the decoder (unidirectional) LSTM output to produce the final output word (a rough sketch of this decoder step follows below).
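To make B concrete, here is a minimal sketch of the decoder step I have in mind, assuming Luong-style dot-product attention; the class and variable names are just illustrative, and I assume the bidirectional encoder states have already been projected down to the decoder's hidden size.

```python
import torch
import torch.nn as nn

class AttentiveLSTMDecoderStep(nn.Module):
    """One decoder step of architecture B: a unidirectional LSTM cell plus
    dot-product attention over the encoder states (illustrative sketch)."""

    def __init__(self, emb_dim, hid_dim, vocab_size):
        super().__init__()
        self.lstm = nn.LSTMCell(emb_dim, hid_dim)
        self.out = nn.Linear(2 * hid_dim, vocab_size)  # [decoder state; context]

    def forward(self, prev_emb, state, enc_states):
        # prev_emb: (batch, emb_dim)
        # enc_states: (batch, src_len, hid_dim), assumed already projected to
        # hid_dim (a bidirectional encoder natively gives 2 * hid_dim)
        h, c = self.lstm(prev_emb, state)                  # sequential LSTM step
        scores = torch.bmm(enc_states, h.unsqueeze(2))     # (batch, src_len, 1)
        weights = torch.softmax(scores, dim=1)             # attention weights
        context = (weights * enc_states).sum(dim=1)        # weighted combination
        logits = self.out(torch.cat([h, context], dim=-1)) # next-word scores
        return logits, (h, c)
```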

My question is: what might be the advantages of Architecture A over B, i.e. self-attention vs. LSTMs with attention?

I would imagine that Architecture A has the big advantage of parallel computation, compared to the sequential nature of computation in Architecture B.

Are there any other advantages? In particular, would Architecture A have the maximum-path-length advantages described in the Attention Is All You Need paper?

Topic attention-mechanism lstm machine-translation

Category Data Science


Transformer-based MT typically performs better than RNN-based MT in terms of translation quality. People used to claim that RNNs are better for low-resource language pairs; however, this is no longer true with pre-trained models such as MASS or mBART.

Another advantage of Transformers is that at training time they can be fully parallelized, whereas an RNN always processes a sentence sequentially: to compute the $n$-th state, you always need to wait until the $(n-1)$-th one is ready.
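As a toy illustration of this point (my own sketch, not from any paper): the RNN recurrence has to be evaluated in a loop over time, while single-head self-attention over the whole sequence is one batched matrix product.

```python
import torch

T, d = 6, 8                        # toy sequence length and model size
x = torch.randn(T, d)

# RNN-style: state t cannot be computed before state t-1 exists
W = torch.randn(d, d)
states, h = [], torch.zeros(d)
for t in range(T):                 # inherently sequential loop over time
    h = torch.tanh(x[t] + h @ W)
    states.append(h)

# Self-attention: all T positions are computed in one batched matmul,
# so they can be processed in parallel at training time
q = k = v = x                      # untrained single-head attention, for illustration
attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
out = attn @ v                     # (T, d), no dependence between time steps
```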

One disadvantage of the Transformer decoder is that at every step it needs to attend to all previously decoded tokens, which makes generation quadratic in the output length in theory (in practice, this can be parallelized quite well). When efficiency is a concern, it might be a good idea to combine a Transformer encoder with an RNN decoder.
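To see where the quadratic cost comes from (again just a toy sketch, with random values standing in for the decoder's key/value cache): at step $t$ the new token's query attends to all $t$ cached tokens, so the total work over an output of length $T$ is $1 + 2 + \dots + T = T(T+1)/2$.

```python
import torch

d, steps = 8, 5
cache = []                                   # keys of already decoded tokens

total_dot_products = 0
for t in range(steps):                       # greedy decoding, one token per step
    q = torch.randn(1, d)                    # query for the new token (toy values)
    cache.append(torch.randn(1, d))          # its key joins the cache
    k = torch.cat(cache, dim=0)              # attend to ALL t+1 tokens so far
    attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
    out = attn @ k                           # context (keys reused as values for brevity)
    total_dot_products += k.shape[0]

# 1 + 2 + ... + steps = steps * (steps + 1) / 2  ->  O(steps^2) overall
print(total_dot_products)                    # 15 for steps = 5
```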
