Topic Modeling: LDA vs LSA vs ToPMine

Question

Topic Modeling: LDA vs LSA vs ToPMine

Peter

2022年4月6日 11:44

I am new to Topic Modeling.

Is it possible to implement ToPMine in Python? In a quick search, I can't seem to find any Python package with ToPMine.
Is ToPMine better than LDA and LSA? I am aware that LDA LSA have been around for a long time and widely used.

Thank you

Topic lda topic-model

Category Data Science

Pluviophile · Accepted Answer · 2022年4月6日 11:44

ToPMine Python Implementation:

https://github.com/anirudyd/topmine

pip install topmine

from topmine.phrase_lda import PhraseLDA
from topmine.phrase_mining import PhraseMining
a = PhraseMining(["you are a goode boy boy boy boy boy yes joke joke joke."]*10+["you are a big joke"]*20)
p = a.mine()
PhraseLDA(*p).run()
lda = PhraseLDA(*p).run()

This is an implementation of the algorithm detailed in El-Kishky, Ahmed, et al. "Scalable topical phrase mining from text corpora." Proceedings of the VLDB Endowment 8.3 (2014): 305-316.APA

In order to run the code, simply follow these steps:

Put the file on which you want to run topmine in the folder named “input”
python -m  topmine_src.run_phrase_mining input/{filename}
python -m  topmine.run_phrase_lda {num_of_topic}
example
python -m  topmine_src.run_phrase_mining input/dblp_5k.txt
python -m  topmine.run_phrase_lda 4

LDA vs LSA

Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents onto a lower-dimensional space. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Rather than looking at each document isolated from the others, it looks at all the documents as a whole and the terms within them to identify relationships.

Latent Dirichlet Allocation(LDA) algorithm is an unsupervised learning algorithm that works on a probabilistic statistical model to discover topics that the document contains automatically.

LDA assumes that each document in a corpus contains a mix of topics that are found throughout the entire corpus. The topic structure is hidden - we can only observe the documents and words, not the topics themselves. Because the structure is hidden (also known as latent), this method seeks to infer the topic structure given the known words and documents.

Erwan · Accepted Answer · 2022年1月20日 15:42

I don't know ToPMine but from a quick search it looks like the goal is a bit different than regular topic modeling, isn't it?

To my knowledge LSA is not strictly speaking a topic modeling method either, at least it's certainly not used as such anymore.

LDA is the standard traditional topic modeling method. Many variants and extensions have been proposed, I think that the following two are worth mentioning, but there might be others:

Topic Modeling: LDA vs LSA vs ToPMine

ToPMine Python Implementation:

LDA vs LSA

About