Cluster words into groups of similar meaning (synonyms)

How can words be clustered into groups of similar meaning (synonyms)?

I started with pre-trained word embeddings (e.g., the Google News vectors). They work well but are not perfect: because the embeddings are learned from surrounding words, words that appear in similar contexts end up close together even when their meanings differ. This leads to some problematic results. For example:

  • polar meanings: word embeddings might find opposites to be similar. Even though these words are semantic opposites, they can readily be interchanged given the same preceding and following words. For example, "terrible" and "fantastic" are interchangeable in the sentence: I heard some ____ news today. Another example is "happy" and "unhappy" in the sentence: I was very ____ with the service.
  • euphemisms: word embeddings may find euphemistic words and doublespeak to be similar. Even though these words are perceived very differently, they can still be interchanged given the same preceding and following words. For example, "headstrong" and "stubborn" are interchangeable in the sentence: He is a ____ colleague. Other examples are pamper/spoil, check out/perv, etc.

Is there a way to separate similar words (in a word-embeddings sense) into more semantic clusters?


Possible solution? Add one extra dimension that places each word on a sentiment spectrum. This one dimension might be enough to separate terrible from fantastic, happy from unhappy, etc.

How might such a feature be engineered?

Note that many words would be neutral and so this additional dimension wouldn't really affect them. For example, Python/R/C++/Java in the following sentence: 5+ years experience of ____ required.
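To make the idea concrete, here is a minimal sketch of the extra sentiment dimension. Everything in it is an assumption for illustration: the toy 3-d vectors stand in for real pre-trained embeddings, the SENTIMENT scores stand in for a sentiment lexicon, and the weight of 2.0 is an arbitrary choice that would need tuning. Each vector is normalised to unit length first so that the weight controls how strongly sentiment counts relative to the original dimensions.

```python
import math

# Toy "embeddings", invented for illustration; real ones would come from
# a pre-trained model and have hundreds of dimensions.
EMBEDDINGS = {
    "terrible": [0.90, 0.10, 0.20],
    "fantastic": [0.88, 0.12, 0.18],
    "awful": [0.91, 0.09, 0.22],
    "python": [0.10, 0.90, 0.30],
    "java": [0.12, 0.88, 0.28],
}

# Hypothetical sentiment scores in [-1, 1]; words not listed are neutral.
SENTIMENT = {"terrible": -1.0, "awful": -0.9, "fantastic": 1.0}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def augment(word, weight=2.0):
    """Return the unit-normalised embedding with a sentiment dimension appended."""
    vec = EMBEDDINGS[word]
    norm = math.sqrt(sum(x * x for x in vec))
    unit = [x / norm for x in vec]
    # Neutral words get 0.0, so their pairwise similarities are unchanged.
    return unit + [weight * SENTIMENT.get(word, 0.0)]


before = cosine(EMBEDDINGS["terrible"], EMBEDDINGS["fantastic"])
after = cosine(augment("terrible"), augment("fantastic"))
print(f"terrible vs fantastic: before={before:.3f}, after={after:.3f}")
```

In this toy setup the opposites move far apart, "terrible" and "awful" stay close, and the neutral pair python/java keeps exactly the similarity it had before. Note that this only addresses polarity; euphemism pairs such as headstrong/stubborn would need a lexicon that scores connotation as well.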



Word embeddings alone are not a reliable way to find synonyms. They group words by the contexts in which they occur, not by meaning. For example, "and" and "but" will be near each other in embedding space, even though those words have contrasting meanings.

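One option for finding semantic clusters is WordNet, a large lexical database of English. WordNet models the meanings of words: it groups them into sets of synonyms (synsets) and records relations between those sets, such as antonymy.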
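As a rough sketch of how a WordNet-style resource could drive the clustering, the code below merges words that share at least one synonym set. The TOY_SYNSETS table and its identifiers are hand-written assumptions, not real WordNet data, and cluster_by_shared_synset is a hypothetical helper name; a real lookup would be passed in place of the toy one.

```python
from collections import defaultdict

# Hand-written stand-in for a WordNet lookup (word -> synonym-set ids).
# The ids are invented for illustration and are not real WordNet entries.
TOY_SYNSETS = {
    "happy": {"syn_happy", "syn_glad"},
    "glad": {"syn_glad"},
    "unhappy": {"syn_unhappy"},
    "sad": {"syn_sad", "syn_unhappy"},
    "stubborn": {"syn_stubborn"},
    "obstinate": {"syn_stubborn"},
}


def cluster_by_shared_synset(words, synsets_of):
    """Group words so that any two words sharing a synonym set end up together."""
    parent = {w: w for w in words}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    first_word_for = {}  # synonym-set id -> first word seen with that id
    for word in words:
        for synset_id in synsets_of(word):
            if synset_id in first_word_for:
                parent[find(word)] = find(first_word_for[synset_id])
            else:
                first_word_for[synset_id] = word

    groups = defaultdict(set)
    for word in words:
        groups[find(word)].add(word)
    return sorted(sorted(group) for group in groups.values())


clusters = cluster_by_shared_synset(
    list(TOY_SYNSETS), lambda w: TOY_SYNSETS.get(w, set())
)
print(clusters)
```

Here "happy" and "unhappy" land in different clusters because they share no synonym set, which is the separation the embeddings could not provide. With NLTK and its WordNet corpus installed, the lookup could plausibly be replaced by something like lambda w: {s.name() for s in wordnet.synsets(w)}. Because many words are polysemous, merging on any shared synset can chain unrelated words together, so restricting the lookup to one part of speech or to the most common senses may be needed.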
