Topic modelling with many synonyms - how to extract 'latent themes'

Here's my corpus

{
    0: "dogs are nice",       # canines are friendly
    1: "mutts are kind",      # canines are friendly
    2: "pooches are lovely",  # canines are friendly
    ...,
    3: "cats are mean",        # felines are unfriendly
    4: "moggies are nasty",    # felines are unfriendly
    5: "pussycats are unkind", # felines are unfriendly
}

As a human, the general topics I take away from these documents are:

  • canines are friendly (0, 1, 2)
  • felines are not friendly (3, 4, 5)

But how can a machine reach the same conclusion?

If I were to apply a Latent Dirichlet Allocation (LDA) approach, I suspect it would struggle to find the topics because the synonyms are 'diluting' the underlying meaning: each concept's word counts get split across several distinct terms (see the sketch after this list). For example:

  • dogs, pooches, and mutts could all fall under canines
  • nice, kind, and lovely could all fall under a friendly personality trait
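
To make the dilution concrete, here is a minimal sketch (assuming gensim is installed) showing that a plain bag-of-words vocabulary keeps every synonym as a separate id, so LDA has no way to share counts between them:

from gensim import corpora

docs = [
    "dogs are nice", "mutts are kind", "pooches are lovely",
    "cats are mean", "moggies are nasty", "pussycats are unkind",
]
texts = [doc.split() for doc in docs]

dictionary = corpora.Dictionary(texts)
# Every synonym gets its own integer id, so no counts are shared
# between "dogs", "mutts", and "pooches" -- the dilution described above.
print(dictionary.token2id)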

Is there a way I can use an already-trained set of latent vectors (e.g., GoogleNews-vectors-negative300.bin.gz) to represent each document in terms of these broader entities, and then find the topics (i.e., instead of using the 'raw' words)?

Does that even make sense?


EDIT: Come to think of it, I think my question essentially boils down to: is it possible to replace/redefine a set of similar-meaning words with a single all-encompassing word?
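
For instance, here is a rough sketch of the kind of replacement I have in mind, using gensim's KeyedVectors with the pre-trained vectors mentioned above (the canonical-word list and the 0.4 similarity threshold are arbitrary placeholders of mine, not a recipe):

from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors mentioned above (a large download)
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

# Hand-picked "all-encompassing" words -- an arbitrary illustrative choice
CANONICAL = ["canine", "feline", "friendly", "unfriendly"]

def collapse(word, threshold=0.4):
    # Map a word to its most similar canonical term if it is similar
    # enough; otherwise keep the raw word. The threshold is a guess.
    if word not in kv:
        return word
    best = max(CANONICAL, key=lambda c: kv.similarity(word, c))
    return best if kv.similarity(word, best) >= threshold else word

for word in ["mutts", "pooches", "lovely", "nasty"]:
    print(word, "->", collapse(word))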

Tags: lda, topic-model, nlp



One option is to start with pattern matching using established tools: for each statement, find the subject and its sentiment.

Here are off-the-shelf tools (no training or machine learning required); a short sketch combining them follows the list:

  • WordNet: a large lexical database of English words.
  • Synsets: groupings of synonymous words that express the same concept.
  • SentiWordNet: assigns sentiment scores (positivity, negativity, objectivity) to each synset of WordNet.
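
A minimal sketch of how these could be combined through NLTK (the "first noun / first adjective / first synset" choices below are deliberately crude heuristics of mine, not part of the tools themselves):

import nltk
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn

# One-time downloads of the required NLTK data
for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet", "sentiwordnet"):
    nltk.download(pkg, quiet=True)

docs = {
    0: "dogs are nice",
    1: "mutts are kind",
    2: "pooches are lovely",
    3: "cats are mean",
    4: "moggies are nasty",
    5: "pussycats are unkind",
}

def subject_and_sentiment(text):
    # Crude pattern: the first noun is the subject, the first adjective
    # carries the sentiment.
    tags = nltk.pos_tag(nltk.word_tokenize(text))
    subject = next((w for w, t in tags if t.startswith("NN")), None)
    adjective = next((w for w, t in tags if t.startswith("JJ")), None)

    # Collapse the subject to a broader concept via the first hypernym
    # of its first synset, when WordNet has one (e.g. dog -> canine).
    synsets = wn.synsets(subject, pos=wn.NOUN) if subject else []
    hypernyms = synsets[0].hypernyms() if synsets else []
    concept = hypernyms[0].lemma_names()[0] if hypernyms else subject

    # Score the adjective with its first SentiWordNet adjective synset
    senti = list(swn.senti_synsets(adjective, "a")) if adjective else []
    polarity = senti[0].pos_score() - senti[0].neg_score() if senti else 0.0
    return concept, polarity

for i, text in docs.items():
    print(i, subject_and_sentiment(text))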
