Multi-head attention mechanism in transformers and the need for a feed-forward neural network

After reading the paper "Attention Is All You Need", I have two questions:

1. What is the need for a multi-head attention mechanism?

The paper says that:

"Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions"

My understanding is that it helps with anaphora resolution. For example: "The animal didn't cross the street because it was too ..... (tired/wide)". Here "it" can refer to the animal or the street depending on the last word. My doubt is: why can't a single attention head learn this link over time?

2. I also don't understand the need for the feed-forward neural network in the encoder module of the transformer.

Thanks for your help.

Topic: deep-learning, machine-translation, neural-network, machine-learning

Category: Data Science


(Answering just the first part of the question, based on my own understanding.)

Multi-head attention can be thought of as a randomly initialized, multi-dimensional indexing system in which different heads focus on different variations of the indexed token instance (a token initially being, e.g., a word or part of a word) with respect to its containing sequence/context. Such an encoded token instance contains information about all other tokens of the input sequence (this is self-attention), and what each head captures varies based on how that "attention head" was (randomly) initialized.

So we will (hopefully) end up with (at least) both of these variations (of indices/identities):

  1. the "it" which refers to the "street"
  2. the "it" which refers to the "animal"

So when the model is trained on real data, then for similar sentences/contexts that end with "tired" (referring to the animal), the identity/index of variation 2 will be used (i.e., will get a higher weight) when constructing the output sequence.
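
To make this concrete, here is a minimal sketch of multi-head self-attention in NumPy (my own illustrative example, not the paper's implementation; the sizes, the plain random initialization, and the omission of masking are assumptions):

    # Minimal multi-head self-attention sketch (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)

    seq_len, d_model, n_heads = 6, 16, 4      # e.g. 6 tokens, width 16, 4 heads
    d_head = d_model // n_heads
    x = rng.normal(size=(seq_len, d_model))   # token embeddings for one sentence

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    head_outputs = []
    for h in range(n_heads):
        # Each head gets its own randomly initialized projections -> its own "view".
        W_q = rng.normal(size=(d_model, d_head))
        W_k = rng.normal(size=(d_model, d_head))
        W_v = rng.normal(size=(d_model, d_head))

        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        scores = Q @ K.T / np.sqrt(d_head)     # (seq_len, seq_len)
        weights = softmax(scores, axis=-1)     # each token attends to every token
        head_outputs.append(weights @ V)       # (seq_len, d_head)

    # Concatenate the per-head results and mix them with a final projection.
    W_o = rng.normal(size=(d_model, d_model))
    out = np.concatenate(head_outputs, axis=-1) @ W_o   # (seq_len, d_model)
    print(out.shape)                                    # (6, 16)

Because each head starts from its own random projections, the attention patterns in "weights" differ per head, so one head can end up linking "it" to "animal" while another links it to "street"; the final projection then mixes these views back into a single representation per token.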


  1. The basic reasoning, I think, is just to increase capacity. While it is possible in theory for a single head to learn these links, using multiple heads simply makes it easier. More specifically, the paper says (p. 4):

    Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

    In other words, for a given layer, with only one attention head, the weighted averaging performed by the mechanism prevents you from being able to access (differently transformed) information from multiple areas of the input within that single layer. I think this can be mitigated by using multiple layers, but adding this capability simply increases the capacity of a single layer, so you can perform more useful calculations with fewer layers. This is often good because stacking layers (function composition) can lead to more issues than parallelizing layers (as multi-head attention does), especially when it comes to gradient computations. (This is partly why residual/skip connections are so useful.)

    In general, using multiple heads is a common way to increase representational power in attention-based models; for instance, see Graph Attention Networks by Velickovic et al. The basic idea is that, although attention allows you to "focus" on more relevant information and ignore useless noise, it can also eliminate useful information, since the amount of information that can make it through the attention mechanism is usually quite limited. Using multiple heads gives you space to let more through.

  2. Presumably you are asking about the "Position-wise Feed-Forward Networks". Again, I think this is merely a question of model capacity: the model could function without these layers, but presumably not as well. These layers are interesting because they are essentially convolutions with kernel size 1 (pointwise convolutions), which usually appear near the end of a network or block in computer vision architectures (which I am more familiar with). In other words, they are localized transforms that operate identically across the input (spatially).

    The reasoning is often that the previous layers (here, the attention layers; in vision, the conv layers with larger kernel sizes) were responsible for passing or mixing information spatially across the input. E.g., after an attention layer, the latent representation at each position contains information from other positions. After this, however, we want to consolidate a "unique" representation for each position (which has, of course, been informed by the other positions). We do this via a localized layer, which does not consider neighbours or other positions, and simply transforms the local representation on its own.

    One thing to keep in mind is that, while we want information to mix across "space" (e.g., across an image or sentence), for many tasks we often still need each position to maintain some resemblance/connection to its original identity (rather than, say, just averaging all information over all positions). So not every layer need be spatially aware. In vision models, kernel-size-1 convolutions are often used to perform dimensionality reductions; here, however, the input and output dimensionality is the same (the network expands to a larger inner dimension and then projects back down); see the sketch below.
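
For reference, here is a minimal sketch of such a position-wise feed-forward block in NumPy (illustrative; the two-linear-layers-with-ReLU shape and the sizes d_model=512, d_ff=2048 follow the paper, but the random weights and names are my own):

    # Position-wise feed-forward sketch: the same two linear maps are applied
    # to every position independently; no position sees its neighbours here.
    import numpy as np

    rng = np.random.default_rng(0)

    seq_len, d_model, d_ff = 6, 512, 2048
    x = rng.normal(size=(seq_len, d_model))    # output of an attention sub-layer

    W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

    hidden = np.maximum(0.0, x @ W1 + b1)      # ReLU, shape (seq_len, d_ff)
    out = hidden @ W2 + b2                     # back to (seq_len, d_model)
    print(out.shape)                           # (6, 512)

Because the weights are shared across positions, this is equivalent to applying two kernel-size-1 convolutions along the sequence, which is the connection to the vision architectures mentioned above.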
