Cluster words into groups of similar meaning (synonyms)
How can words be clustered into groups of similar meaning (synonyms)?
I started with pre-trained word embeddings (e.g., Google News), which is great, but not perfect - a limitation arises because the word embeddings are based on surrounding words. This introduces challenging results. For example:
- polar meanings: word embeddings might find opposites to be similar. Even though these words mean the opposite semantically, they can quite readily be interchanged given the same preceding and following words. For example, terrible and fantastic are easily interchangeable in the following sentence: I heard some ____ news today. Another example is happy and unhappy in the following sentence: I was very ____ with the service.
- euphemisms: word embeddings may find euphemistic words and doublespeak to be similar. Even though these words are perceived very differently, they can still readily be interchanged given the same preceding and following words. For example, headstrong and stubborn are easily interchangeable in the following sentence: He is a ____ colleague. Other examples are: pamper/spoil, check out/perv, etc.
Is there a way to separate similar words (in a word-embeddings sense) into more semantic clusters?
Possible solution? Make one extra dimension that puts the word on a sentiment spectrum? This one dimension might be enough to sufficiently separate terrible from fantastic, and happy from unhappy, etc.
How might such a feature be engineered?
Note that many words would be neutral and so this additional dimension wouldn't really affect them. For example, Python/R/C++/Java in the following sentence: 5+ years experience of ____ required.
Topic semantic-similarity text word-embeddings nlp clustering
Category Data Science