Multi-head attention mechanism in the transformer and the need for a feed-forward neural network
After reading the paper "Attention Is All You Need", I have two questions:
1. What is the need for a multi-head attention mechanism?
The paper says that:
"Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions"
My understanding is that it helps with anaphora resolution. For example: "The animal didn't cross the street because it was too ... (tired/wide)." Here "it" can refer to either the animal or the street, depending on the last word. My doubt is: why can't a single attention head learn this link given enough training? (A small sketch of how I picture the single-head vs. multi-head case follows below.)
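To make the question concrete, this is how I currently picture the difference, as a rough NumPy sketch (the dimensions and random weights are made up by me for illustration, this is not the paper's implementation): a single head computes one attention pattern per position, while multiple heads each get their own lower-dimensional projections and can produce different patterns that are then concatenated.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2          # toy sizes, not the paper's
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))      # token representations

# Single head: one set of projections, one attention pattern per position.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
single = attention(X @ Wq, X @ Wk, X @ Wv)   # (seq_len, d_model)

# Multi-head: each head has its own smaller projections, so each head can
# attend with a different pattern; outputs are concatenated and mixed by Wo.
heads = []
for h in range(n_heads):
    Wq_h, Wk_h, Wv_h = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ Wq_h, X @ Wk_h, X @ Wv_h))
Wo = rng.normal(size=(d_model, d_model))
multi = np.concatenate(heads, axis=-1) @ Wo  # (seq_len, d_model)

print(single.shape, multi.shape)
```

If a single head can in principle represent the "it → animal/street" link, why does the paper split the computation into several such heads?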
2. I also don't understand the need for the position-wise feed-forward network in the encoder module of the transformer. (A sketch of the sub-layer I mean is below.)
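For reference, this is the sub-layer I am asking about, written as a small NumPy sketch with toy sizes (the paper's formula is FFN(x) = max(0, xW1 + b1)W2 + b2, with d_model = 512 and d_ff = 2048; the sizes and weights here are placeholders of mine): two linear layers with a ReLU in between, applied to each position independently after the attention sub-layer.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position (row) independently
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
X = rng.normal(size=(seq_len, d_model))              # output of the attention sub-layer
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

print(position_wise_ffn(X, W1, b1, W2, b2).shape)    # (seq_len, d_model)
```

Since attention already mixes information across positions, what does this per-position network add?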
Thanks for your help.
Tags: deep-learning, machine-translation, neural-network, machine-learning
Category: Data Science