ToPMine Python Implementation:
pip install topmine
from topmine.phrase_lda import PhraseLDA
from topmine.phrase_mining import PhraseMining
a = PhraseMining(["you are a goode boy boy boy boy boy yes joke joke joke."]*10+["you are a big joke"]*20)
p = a.mine()
PhraseLDA(*p).run()
lda = PhraseLDA(*p).run()
This is an implementation of the algorithm detailed in El-Kishky, Ahmed, et al. "Scalable topical phrase mining from text corpora." Proceedings of the VLDB Endowment 8.3 (2014): 305-316.APA
In order to run the code, simply follow these steps:
Put the file on which you want to run topmine in the folder named “input”
python -m topmine_src.run_phrase_mining input/{filename}
python -m topmine.run_phrase_lda {num_of_topic}
example
python -m topmine_src.run_phrase_mining input/dblp_5k.txt
python -m topmine.run_phrase_lda 4
LDA vs LSA
Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents onto a lower-dimensional space. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Rather than looking at each document isolated from the others, it looks at all the documents as a whole and the terms within them to identify relationships.
Latent Dirichlet Allocation(LDA) algorithm is an unsupervised learning algorithm that works on a probabilistic statistical model to discover topics that the document contains automatically.
LDA assumes that each document in a corpus contains a mix of topics that are found throughout the entire corpus. The topic structure is hidden - we can only observe the documents and words, not the topics themselves. Because the structure is hidden (also known as latent), this method seeks to infer the topic structure given the known words and documents.