Self Attention vs LSTM with Attention for NMT
I am trying to compare the
A: Transformer-based architecture for Neural Machine Translation (NMT) from the Attention is All You Need paper, with
B: an architecture based on bi-directional LSTMs in the encoder coupled with a unidirectional LSTM in the decoder. The decoder attends to all of the encoder's hidden states, forms a weighted combination of them, and uses this context together with the decoder LSTM output to produce the final output word (a minimal sketch of this attention step is below).
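For reference, here is a rough sketch of what I mean by the attention step in Architecture B. It assumes simple dot-product scoring; the function name and tensor shapes are just my own for illustration:

```python
import torch
import torch.nn.functional as F

def attention_step(dec_hidden, enc_hiddens):
    # dec_hidden:  (d,)         current decoder LSTM state
    # enc_hiddens: (src_len, d) Bi-LSTM encoder states (forward + backward concatenated)
    scores = enc_hiddens @ dec_hidden      # (src_len,) dot-product scores
    weights = F.softmax(scores, dim=0)     # attention distribution over source positions
    context = weights @ enc_hiddens        # (d,) weighted combination of encoder states
    return context, weights

# The context vector is then combined with dec_hidden (e.g. concatenated and
# passed through a projection + softmax) to predict the next target word.
```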
My question is: what might be the advantages of Architecture A over B, i.e. self-attention vs. LSTMs with attention?
I would imagine that Architecture A has the big advantage of parallel computation, compared to the sequential nature of computation in Architecture B.
Are there any other advantages? In particular, would Architecture A have the maximum path length advantages described in the Attention Is All You Need paper?
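(For context, by "maximum path length" I mean the comparison in Table 1 of the paper: self-attention connects any two positions in the sequence in O(1) steps, whereas a recurrent layer needs O(n) sequential steps to relate tokens that are n positions apart, so signals between distant tokens travel a much shorter path in Architecture A. I would like to confirm whether this advantage still holds against the attention mechanism in Architecture B.)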
Topic attention-mechanism lstm machine-translation
Category Data Science