NLP: Compare tags semantically with machine learning? (finding synonyms)

Let's say I have multiple tags that I need to compare semantically. For example:

tags = ['Python', 'Football', 'Programming', 'Handball', 'Chess', 'Cheese', 'board game']

I would like to compare these tags (and many more) semantically to get a similarity value between 0 and 1. For example, I want to get values like these:

f('Chess', 'Cheese') = 0.0  # tags look similar, but mean very different things
f('Chess', 'board game') = 0.9 # because chess is a board game
f('Football', 'Handball') = 0.3 # because both are sports with a ball
f('Python', 'Programming') = 0.9 # because Python is a programming language

So what is the state-of-the-art approach to get a function f like this? I know that machine learning can probably do this, but the area is huge and overwhelming for me (at first glance, NLP seems to focus on other problems). So what would be the best approach for this specific problem?



You can use a two stage approach:

  1. Apply a knowledge graph to model the complex and varied relationships between tags. A knowledge graph captures relationships as semantic triples (subject, predicate, object).

  2. Then map the knowledge graph relationships to a distance metric. The book "Geometry and Meaning" by Dominic Widdows goes into greater detail on how to convert graph relationships into distance metrics; a toy sketch of the idea follows below.
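As a toy illustration of both stages (this is my own minimal sketch, not the book's formulation): store hand-written triples as edges of a graph with networkx, then turn the shortest-path distance d between two tags into a similarity 1 / (1 + d). The triples below are made up for illustration; a real knowledge graph (e.g. ConceptNet or Wikidata) would supply them.

import networkx as nx

# Hand-written semantic triples (subject, predicate, object) for illustration.
triples = [
    ("chess", "is_a", "board game"),
    ("football", "is_a", "ball sport"),
    ("handball", "is_a", "ball sport"),
    ("python", "is_a", "programming language"),
    ("programming language", "used_for", "programming"),
]

g = nx.Graph()
for subj, _pred, obj in triples:
    g.add_edge(subj, obj)

def graph_similarity(a, b):
    # Map shortest-path distance d to a similarity in (0, 1]: 1 / (1 + d).
    try:
        d = nx.shortest_path_length(g, a, b)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return 0.0
    return 1.0 / (1.0 + d)

print(graph_similarity("chess", "board game"))   # 0.5  (one hop)
print(graph_similarity("football", "handball"))  # 0.33 (two hops via "ball sport")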


You can use word embeddings; take a look at word2vec and GloVe. There are many good tutorials on word2vec, but very briefly: word2vec represents each word as a vector, so you can do similarity comparisons between words, and it also allows addition and subtraction of vectors.

The embeddings are already available, for example here: https://nlp.stanford.edu/projects/glove/

Here is an example using GloVe embeddings:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

word_to_embeddings = dict()

# Each line of the GloVe file is: <word> <value_1> ... <value_25>
with open("/path/to/glove.twitter.27B.25d.txt") as f:
    for line in f:
        parts = line.split()
        word = parts[0]
        embeddings = np.asarray(parts[1:], dtype='float32')
        word_to_embeddings[word] = embeddings


# Short alias to keep the calls below readable.
w = word_to_embeddings

sim_1 = cosine_similarity([w["chess"]], [w["cheese"]])[0][0]
print(f"chess and cheese sim: {sim_1}")

# "board game" is two words, so compare against the sum of the two vectors.
sim_2 = cosine_similarity([w["chess"]], [w["board"] + w["game"]])[0][0]
print(f"chess and board + game sim: {sim_2}")

sim_3 = cosine_similarity([w["football"]], [w["handball"]])[0][0]
print(f"football and handball sim: {sim_3}")

sim_4 = cosine_similarity([w["python"]], [w["programming"]])[0][0]
print(f"python and programming sim: {sim_4}")

sim_5 = cosine_similarity([w["c++"]], [w["programming"]])[0][0]
print(f"c++ and programming sim: {sim_5}")

Result:

chess and cheese sim: 0.4501191973686218
chess and board + game sim: 0.7286249995231628
football and handball sim: 0.7467300891876221
python and programming sim: 0.7076062560081482
c++ and programming sim: 0.8262864351272583

The results are not going to be exactly what you expect; for example, "python" could also refer to the snake. You might want to play around with different embeddings, because they can give different results.


It's a complex problem. The standard method is to train a model which represents this semantic similarity function using a large text corpus (i.e. not only the tags themselves), relying on the idea that the meaning of a word is given by its context words.
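For instance, you could train your own embeddings on a domain corpus with gensim's Word2Vec; here is a minimal sketch, where the toy sentences stand in for a large corpus:

from gensim.models import Word2Vec

# Toy corpus of tokenized sentences; in practice you need a large corpus.
sentences = [
    ["chess", "is", "a", "board", "game"],
    ["python", "is", "a", "programming", "language"],
    ["handball", "and", "football", "are", "ball", "sports"],
]

model = Word2Vec(sentences, vector_size=25, window=5, min_count=1, epochs=50)
print(model.wv.similarity("chess", "board"))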

A more direct approach is to use pre-trained word embedding vectors, which can be compared directly to find their similarity.
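A minimal sketch using gensim's downloader API (assuming gensim is installed; "glove-wiki-gigaword-50" is one of its available pre-trained models):

import gensim.downloader as api

# Downloads the vectors on first use and caches them locally.
vectors = api.load("glove-wiki-gigaword-50")
print(vectors.similarity("chess", "cheese"))
print(vectors.similarity("python", "programming"))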

A similar idea is to use WordNet, a lexical database that groups words into sets of synonyms and records semantic relations between them, from which word-to-word similarity can be computed.
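A minimal sketch with NLTK's WordNet interface (assumes nltk is installed and the WordNet corpus has been downloaded); Wu-Palmer similarity is one of several measures WordNet supports:

import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def wordnet_similarity(word_a, word_b):
    # Take the best Wu-Palmer similarity over all sense (synset) pairs;
    # wup_similarity can return None, which we treat as 0.
    scores = [
        s1.wup_similarity(s2) or 0.0
        for s1 in wn.synsets(word_a)
        for s2 in wn.synsets(word_b)
    ]
    return max(scores, default=0.0)

print(wordnet_similarity("chess", "cheese"))
print(wordnet_similarity("football", "handball"))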
