two different attention methods for seq2seq

I see two different ways of applying attention in seq2seq:

(a) The context vector (the weighted sum of the encoder hidden states) is fed into the output softmax layer.

(b) The context vector is fed into the decoder input at the next time step.

What are the pros and cons of the two approaches? Is there a paper comparing them?

Tags: attention-mechanism, sequence-to-sequence

Category: Data Science


(a) is Luong's attention mechanism (Luong et al., 2015, "Effective Approaches to Attention-based Neural Machine Translation"), while (b) is Bahdanau's mechanism (Bahdanau et al., 2015, "Neural Machine Translation by Jointly Learning to Align and Translate").
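To make the structural difference concrete, here is a minimal PyTorch sketch of one decoder step in each style. It is not code from either paper: the function names (`luong_step`, `bahdanau_step`, `attention_context`) are made up for illustration, dot-product scoring is used for both variants, and the output layers are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, vocab_size, emb_size = 256, 10000, 128

def attention_context(dec_hidden, enc_outputs):
    # dec_hidden: (batch, hidden), enc_outputs: (batch, src_len, hidden)
    # Simple dot-product scores; both papers use more elaborate scoring functions.
    scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = F.softmax(scores, dim=1)
    return torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)       # (batch, hidden)

# (a) Luong-style step: update the RNN first, then use the context in the output layer.
luong_rnn = nn.GRUCell(emb_size, hidden_size)
luong_out = nn.Linear(2 * hidden_size, vocab_size)

def luong_step(y_emb, dec_hidden, enc_outputs):
    dec_hidden = luong_rnn(y_emb, dec_hidden)                  # state updated from input only
    context = attention_context(dec_hidden, enc_outputs)       # attend with the new state
    logits = luong_out(torch.cat([dec_hidden, context], 1))    # context feeds the output softmax
    return logits, dec_hidden

# (b) Bahdanau-style step: compute the context first and feed it into the RNN input.
bahdanau_rnn = nn.GRUCell(emb_size + hidden_size, hidden_size)
bahdanau_out = nn.Linear(hidden_size, vocab_size)

def bahdanau_step(y_emb, dec_hidden, enc_outputs):
    context = attention_context(dec_hidden, enc_outputs)                   # attend with the previous state
    dec_hidden = bahdanau_rnn(torch.cat([y_emb, context], 1), dec_hidden)  # context feeds the decoder input
    logits = bahdanau_out(dec_hidden)
    return logits, dec_hidden
```

In the actual papers, Luong et al. compute attention from the current decoder state and pass the concatenated vector through a tanh layer before the softmax, while Bahdanau et al. compute attention from the previous decoder state using an additive (feed-forward) score; the sketch above keeps only the structural difference the question asks about, namely where the context vector enters the decoder.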
