Topic modelling with many synonyms: how to extract 'latent themes'?
Here's my corpus (the comments mark the theme I, as a reader, see in each document):
```
{
    0: "dogs are nice",        # canines are friendly
    1: "mutts are kind",       # canines are friendly
    2: "pooches are lovely",   # canines are friendly
    ...,
    3: "cats are mean",        # felines are unfriendly
    4: "moggies are nasty",    # felines are unfriendly
    5: "pussycats are unkind", # felines are unfriendly
}
```
As a human, the general topics I get from these documents are:
- canines are friendly (0, 1, 2)
- felines are not friendly (3, 4, 5)
But how can a machine reach the same conclusion?
If I took a Latent Dirichlet Allocation (LDA) approach, I suspect it would struggle to find topics because the synonyms 'dilute' the underlying meaning: the documents barely share any raw tokens. For example (a plain-LDA sketch follows this list):
- dogs, pooches, and mutts could all fall under canines
- nice, kind, and lovely could all fall under a friendly personality trait
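For concreteness, here's roughly the plain-LDA baseline I have in mind (a minimal sketch with gensim; the toy corpus, `num_topics`, and `passes` are just illustrative):

```python
# Plain LDA over raw tokens -- a minimal sketch with gensim.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    "dogs are nice", "mutts are kind", "pooches are lovely",
    "cats are mean", "moggies are nasty", "pussycats are unkind",
]
texts = [d.split() for d in docs]

# Standard bag-of-words pipeline: every synonym becomes its own token.
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
               passes=50, random_state=0)
for topic_id, topic in lda.print_topics():
    print(topic_id, topic)
```

Since the six documents share almost no tokens beyond "are", I don't see what would pull the synonyms into the same topic.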
Is there a way to use an already-trained set of latent vectors (e.g., GoogleNews-vectors-negative300.bin.gz) to represent each document in terms of these broader entities, and then find the topics from those representations (i.e., instead of using the 'raw' words)?
Does that even make sense?
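Concretely, something like this sketch is what I'm imagining, assuming gensim and scikit-learn (averaging word vectors per document and then clustering is just one guess at how the second step could work):

```python
# Represent each document as the mean of its pretrained word vectors,
# then cluster the document vectors -- a rough sketch, not a recipe.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# The pretrained Google News vectors mentioned above.
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

docs = ["dogs are nice", "mutts are kind", "pooches are lovely",
        "cats are mean", "moggies are nasty", "pussycats are unkind"]

def doc_vector(doc):
    # Average the vectors of the in-vocabulary words; some informal
    # words (e.g. "moggies") may be missing from the vocabulary.
    vecs = [wv[w] for w in doc.split() if w in wv]
    return np.mean(vecs, axis=0)

X = np.vstack([doc_vector(d) for d in docs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # ideally documents 0-2 and 3-5 land in different clusters
```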
EDIT: Come to think of it, I think my question essentially boils down to: is it possible to replace/redefine a set of similar-meaning words with a single all-encompassing word?
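For example, a crude version of that replacement could cluster the vocabulary's word vectors and rewrite every word as its cluster's most central member, then run ordinary LDA on the rewritten documents (again only a sketch; `n_clusters=4` is a made-up knob):

```python
# Collapse similar-meaning words to one representative word per cluster.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

vocab = ["dogs", "mutts", "pooches", "cats", "moggies", "pussycats",
         "nice", "kind", "lovely", "mean", "nasty", "unkind"]
vocab = [w for w in vocab if w in wv]  # drop out-of-vocabulary words

X = np.vstack([wv[w] for w in vocab])
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Map every word to the cluster member closest to its centroid.
canonical = {}
for c in range(km.n_clusters):
    members = [w for w, lab in zip(vocab, km.labels_) if lab == c]
    rep = min(members,
              key=lambda w: np.linalg.norm(wv[w] - km.cluster_centers_[c]))
    canonical.update({w: rep for w in members})

print(canonical)  # e.g. dogs/mutts/pooches might all map to a single word
```

If the mapping comes out sane, LDA would then see a handful of shared tokens instead of six unrelated ones.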
Tags: lda, topic-model, nlp