Attention network without hidden state?
I was wondering how useful the encoder's hidden state actually is in an attention network. When I looked into the structure of an attention model, this is what a model generally looks like:
x: Input.
h: Encoder's hidden state which feeds forward to the next encoder's hidden state.
s: Decoder's hidden state, which takes a weighted sum of all the encoder's hidden states as input and feeds forward to the next decoder's hidden state (see the sketch after this list).
y: Output.
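To make the weighted-sum step concrete, here is a minimal sketch of additive (Bahdanau-style) attention in PyTorch, where the decoder state s scores every encoder hidden state h and the softmax-weighted sum becomes the context fed into the decoder. The function name and dimension names are my own placeholders, not from any particular library:

```python
import torch
import torch.nn.functional as F

def additive_attention_context(s, H, W_s, W_h, v):
    """Score each encoder state h_j against the current decoder state s,
    softmax the scores, and return the weighted sum (context vector).

    s: (batch, dec_dim)      current decoder hidden state
    H: (batch, T, enc_dim)   all encoder hidden states
    W_s: (att_dim, dec_dim), W_h: (att_dim, enc_dim), v: (att_dim,)
    """
    # e_j = v^T tanh(W_s s + W_h h_j), one score per encoder position j
    proj = torch.tanh(s.unsqueeze(1) @ W_s.T + H @ W_h.T)   # (batch, T, att_dim)
    scores = proj @ v                                       # (batch, T)
    weights = F.softmax(scores, dim=1)                      # attention weights over positions
    context = (weights.unsqueeze(-1) * H).sum(dim=1)        # (batch, enc_dim)
    return context, weights

# Toy usage with random tensors, just to show the shapes
batch, T, enc_dim, dec_dim, att_dim = 2, 5, 8, 8, 16
H = torch.randn(batch, T, enc_dim)
s = torch.randn(batch, dec_dim)
W_s, W_h, v = torch.randn(att_dim, dec_dim), torch.randn(att_dim, enc_dim), torch.randn(att_dim)
context, weights = additive_attention_context(s, H, W_s, W_h, v)
```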
For a task like translation, why is it important for the encoder's hidden states to feed forward, or to exist at all? We already know what the next x is going to be, so the order of the input isn't necessarily important for the order of the output, and neither is what has been memorized from previous inputs, since the attention model looks at all inputs simultaneously. Couldn't you just apply attention directly to the embeddings of x?
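By "attention directly on the embeddings" I mean something like the following minimal sketch (PyTorch, hypothetical names): scaled dot-product self-attention applied straight to the input embeddings X, with no recurrent encoder state anywhere.

```python
import torch
import torch.nn.functional as F

def self_attention_on_embeddings(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over input embeddings,
    with no recurrent hidden state.

    X: (batch, T, emb_dim)          input embeddings
    W_q, W_k, W_v: (emb_dim, d)     learned projection matrices
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                     # (batch, T, d)
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5   # (batch, T, T)
    weights = F.softmax(scores, dim=-1)                     # each position attends to all positions
    return weights @ V                                      # (batch, T, d)
```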
Topic: attention-mechanism, rnn, machine-translation, machine-learning
Category: Data Science