Initializing weights that are a pointwise product of multiple variables

In two-layer perceptrons that slide across words of text, such as word2vec and fastText, the hidden-layer weights may be a pointwise product of two random variables, such as positional embeddings and word embeddings (Mikolov et al. 2017, Section 2.2): $$v_c = \sum_{p\in P} d_p \odot u_{t+p}$$ However, it is unclear to me how to best initialize the two variables.
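For concreteness, here is a minimal NumPy sketch of how such a positionally weighted context vector could be computed. The sizes, the `positions` list, and the `context_vector` helper are illustrative assumptions of mine, not code taken from word2vec or fastText:

```python
import numpy as np

# Illustrative sizes (not from the question): vocabulary, embedding dim, window.
vocab_size, dim = 10_000, 300
window = 5
positions = [p for p in range(-window, window + 1) if p != 0]  # P = {-5, ..., -1, 1, ..., 5}

rng = np.random.default_rng(0)
U = rng.uniform(-1.0 / dim, 1.0 / dim, size=(vocab_size, dim))  # word embeddings u_w
D = np.ones((len(positions), dim))                              # positional embeddings d_p

def context_vector(word_ids, t):
    """v_c = sum_p d_p ⊙ u_{t+p}, skipping positions outside the sentence."""
    v_c = np.zeros(dim)
    for i, p in enumerate(positions):
        j = t + p
        if 0 <= j < len(word_ids):
            v_c += D[i] * U[word_ids[j]]  # ⊙ is element-wise multiplication
    return v_c

# Example: context vector around the 3rd word of a toy sentence of word ids.
print(context_vector([4, 8, 15, 16, 23, 42], t=2).shape)  # (300,)
```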

When only word embeddings are used for the hidden layer weights, word2vec and fastText initialize them to $\mathcal{U}(-1 / \text{fan_out}; 1 / \text{fan_out})$. When the pointwise product of two random variables is used, we might do one of the following (both options are sketched in code after the list):

  • initialize the first variable with ones and the other variable with $\mathcal{U}(-1 / \text{fan_out}; 1 / \text{fan_out})$: This would maintain the distribution of the weights, but the gradients to the second variable would be way too large.

  • initialize both variables with $\sqrt{\mathcal{U}(0, 1)}$ and then rescale their product to $[-1 / \text{fan_out}; 1 / \text{fan_out}]$: This would maintain the distribution of the weights, but it would enlarge the gradients to both variables, since they are now both initialized to ones or values close to one.
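To make the two options concrete, here is a rough NumPy sketch under my own reading of them; `fan_out`, `n_positions`, the random sign flip, and the $1/\sqrt{\text{fan_out}}$ rescaling step in the second option are illustrative assumptions, not an established scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, fan_out, n_positions = 10_000, 300, 10  # illustrative sizes only

# Option 1: positional embeddings d_p start as ones,
# word embeddings u_w as the usual uniform initialization.
d1 = np.ones((n_positions, fan_out))
u1 = rng.uniform(-1.0 / fan_out, 1.0 / fan_out, size=(vocab_size, fan_out))

# Option 2 (one possible reading): draw both factors as sqrt(U(0, 1)), then rescale
# each by 1 / sqrt(fan_out) so their element-wise product lies in [-1/fan_out, 1/fan_out];
# a random sign on one factor keeps the product symmetric around zero.
d2 = np.sqrt(rng.uniform(0.0, 1.0, size=(n_positions, fan_out))) / np.sqrt(fan_out)
u2 = np.sqrt(rng.uniform(0.0, 1.0, size=(vocab_size, fan_out))) / np.sqrt(fan_out)
u2 *= rng.choice([-1.0, 1.0], size=u2.shape)

# Sanity check: the products stay within the target range.
print(np.abs(d2[0] * u2[0]).max() <= 1.0 / fan_out)  # True
```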

I would appreciate any ideas and pointers to existing research in this direction.

Tags: fasttext, weight-initialization, word2vec, word-embeddings, nlp



The most useful ways to initialize embedding model weights are either randomly or from pre-existing (pre-trained) weights. If random initialization is chosen, the samples should lie between 0 and 1.
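As a rough illustration of both approaches (the `pretrained` mapping and all sizes below are hypothetical stand-ins, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 300

# Random initialization: samples drawn from [0, 1), as suggested above.
E_random = rng.uniform(0.0, 1.0, size=(vocab_size, dim))

# Initialization from pre-existing weights: copy rows for words we already have vectors
# for, and fall back to random rows for the rest. `pretrained` maps word id -> vector.
pretrained = {0: np.full(dim, 0.5), 7: np.full(dim, 0.25)}  # stand-in for real vectors
E_warm = E_random.copy()
for word_id, vec in pretrained.items():
    E_warm[word_id] = vec
```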
