Word2Vec: Identifying many-to-one relationships between words

The standard introductory Word2Vec examples, such as king - queen = man - woman and tokyo - japan = london - uk, involve one-to-one relationships between words: Tokyo is the unique capital of Japan.

More generally, we might want to test for many-to-one relationships: e.g. we might want to ask if Kyoto is a city in Japan. I presume we are still interested in vectors of the form kyoto - japan, houston - us, etc., but these vectors are no longer equal.

Do these relationship vectors form a particularly interesting vector space? Do they sample some known distribution? How can I check a many-to-one relationship from the word embeddings?
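One simple way to probe such a many-to-one relationship is to compare the offset vectors directly: if kyoto - japan, houston - us, etc. point in roughly the same direction, their cosine similarity should be high even though the vectors are not equal. Below is a minimal sketch with hand-made toy embeddings (the 4-dimensional vectors are hypothetical values for illustration only; real word2vec embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

# Hypothetical toy embeddings, chosen only to illustrate the idea;
# in practice you would load trained vectors (e.g. with gensim).
emb = {
    "kyoto":   np.array([0.90, 0.10, 0.40, 0.20]),
    "japan":   np.array([0.80, 0.00, 0.90, 0.10]),
    "houston": np.array([0.25, 0.90, 0.35, 0.30]),
    "us":      np.array([0.10, 0.80, 0.80, 0.20]),
    "paris":   np.array([0.30, 0.20, 0.15, 0.90]),
    "france":  np.array([0.20, 0.10, 0.60, 0.80]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "City-in-country" offset vectors: similar direction, but not equal.
rel_kyoto   = emb["kyoto"] - emb["japan"]
rel_houston = emb["houston"] - emb["us"]
rel_paris   = emb["paris"] - emb["france"]

print("kyoto-japan vs houston-us:", round(cosine(rel_kyoto, rel_houston), 3))
print("kyoto-japan vs paris-france:", round(cosine(rel_kyoto, rel_paris), 3))
```

With trained embeddings, a high pairwise cosine among such offsets (relative to offsets of unrelated word pairs) is evidence that the relation is encoded as a roughly consistent direction in the space.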

Tags: vector-space-models, ai, word2vec, word-embeddings, nlp

Category: Data Science


Those vector relations are not exact. Rest assured that king - queen ≠ man - woman. What we actually do is find the vectors closest to the result of king - man + woman; one of the closest vectors is queen.

Nevertheless, when we apply this "parallelogram approach" to verify word relations, in most cases the closest vector to the result is one of the input words themselves. Analogy demos still return sensible answers only because the original word2vec implementation silently excludes the input words from the candidates, as shown in the study *Fair is Better than Sensational: Man is to Doctor as Woman is to Doctor*. This implementation quirk is actually one of the sources of the gender bias attributed to word embeddings.
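The effect of that exclusion can be sketched as follows. The toy 3-dimensional embeddings below are hypothetical values, hand-picked so that the vector nearest to king - man + woman is "king" itself, mimicking what the cited study reports for real embeddings:

```python
import numpy as np

# Hypothetical toy embeddings, hand-set for illustration; real
# word2vec vectors are learned from a corpus.
emb = {
    "man":   np.array([1.00, 0.00, 0.10]),
    "woman": np.array([0.95, 0.05, 0.10]),
    "king":  np.array([1.00, 0.00, 1.00]),
    "queen": np.array([0.80, 0.20, 0.90]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, exclude_inputs=True):
    """Return the word nearest to b - a + c (the 3CosAdd rule)."""
    query = emb[b] - emb[a] + emb[c]
    banned = {a, b, c} if exclude_inputs else set()
    scores = {w: cosine(query, v) for w, v in emb.items() if w not in banned}
    return max(scores, key=scores.get)

print(analogy("man", "king", "woman", exclude_inputs=False))  # → king
print(analogy("man", "king", "woman", exclude_inputs=True))   # → queen
```

The "queen" answer only appears because the input words are banned from the candidate set, which is exactly the implementation detail the study above criticizes.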

There are other articles studying the limitations of the analogies based on word embeddings, like this and this.

That being said, there are also studies of the relationships that can be found in word representation spaces, like this one, which examines different types of relationships, e.g. class-inclusion, part-whole, attributes, etc. You may find the main figure of the article interesting.
