Topic modelling with many synonyms: how to extract 'latent themes'?
Here's my corpus (the comments mark the theme I, as a reader, see in each document):
```
{
    0: "dogs are nice",        # canines are friendly
    1: "mutts are kind",       # canines are friendly
    2: "pooches are lovely",   # canines are friendly
    ...,
    3: "cats are mean",        # felines are unfriendly
    4: "moggies are nasty",    # felines are unfriendly
    5: "pussycats are unkind", # felines are unfriendly
}
```
As a human, the general topics I get from these documents are:
- canines are friendly (0, 1, 2)
- felines are not friendly (3, 4, 5)
But how can a machine reach the same conclusion?
If I took a Latent Dirichlet Allocation (LDA) approach, I suspect it would struggle to find topics because the synonyms 'dilute' the underlying meaning: the documents barely share any raw tokens. For example (a plain-LDA sketch follows this list):
- dogs, pooches, and mutts could all fall under canines
- nice, kind, and lovely could all fall under a friendly personality trait
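For concreteness, here's roughly the plain-LDA baseline I have in mind (a minimal sketch with gensim; the toy corpus, `num_topics`, and `passes` are just illustrative):

```python
# Plain LDA over raw tokens -- a minimal sketch with gensim.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    "dogs are nice", "mutts are kind", "pooches are lovely",
    "cats are mean", "moggies are nasty", "pussycats are unkind",
]
texts = [d.split() for d in docs]

# Standard bag-of-words pipeline: every synonym becomes its own token.
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
               passes=50, random_state=0)
for topic_id, topic in lda.print_topics():
    print(topic_id, topic)
```

Since the six documents share almost no tokens beyond "are", I don't see what would pull the synonyms into the same topic.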
Is there a way to use an already-trained set of latent vectors (e.g., GoogleNews-vectors-negative300.bin.gz) to represent each document in terms of these broader entities, and then find the topics from those representations (i.e., instead of using the 'raw' words)?
Does that even make sense?
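Concretely, something like this sketch is what I'm imagining, assuming gensim and scikit-learn (averaging word vectors per document and then clustering is just one guess at how the second step could work):

```python
# Represent each document as the mean of its pretrained word vectors,
# then cluster the document vectors -- a rough sketch, not a recipe.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# The pretrained Google News vectors mentioned above.
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

docs = ["dogs are nice", "mutts are kind", "pooches are lovely",
        "cats are mean", "moggies are nasty", "pussycats are unkind"]

def doc_vector(doc):
    # Average the vectors of the in-vocabulary words; some informal
    # words (e.g. "moggies") may be missing from the vocabulary.
    vecs = [wv[w] for w in doc.split() if w in wv]
    return np.mean(vecs, axis=0)

X = np.vstack([doc_vector(d) for d in docs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # ideally documents 0-2 and 3-5 land in different clusters
```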
EDIT: Come to think of it, I think my question essentially boils down to: is it possible to replace/redefine a set of similar-meaning words with a single all-encompassing word?
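For example, a crude version of that replacement could cluster the vocabulary's word vectors and rewrite every word as its cluster's most central member, then run ordinary LDA on the rewritten documents (again only a sketch; `n_clusters=4` is a made-up knob):

```python
# Collapse similar-meaning words to one representative word per cluster.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

vocab = ["dogs", "mutts", "pooches", "cats", "moggies", "pussycats",
         "nice", "kind", "lovely", "mean", "nasty", "unkind"]
vocab = [w for w in vocab if w in wv]  # drop out-of-vocabulary words

X = np.vstack([wv[w] for w in vocab])
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Map every word to the cluster member closest to its centroid.
canonical = {}
for c in range(km.n_clusters):
    members = [w for w, lab in zip(vocab, km.labels_) if lab == c]
    rep = min(members,
              key=lambda w: np.linalg.norm(wv[w] - km.cluster_centers_[c]))
    canonical.update({w: rep for w in members})

print(canonical)  # e.g. dogs/mutts/pooches might all map to a single word
```

If the mapping comes out sane, LDA would then see a handful of shared tokens instead of six unrelated ones.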
Tags: lda, topic-model, nlp