I am curious whether there is a way to automatically generate labels for the topics in topic modelling. It would be really helpful if there is a Python implementation of this.
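A minimal sketch of the most common lightweight approach: label each topic by joining its top-weighted words. The topic-to-word-weight mapping below is made-up illustrative data, not output from any real model.

```python
# Hypothetical topic -> {word: weight} mapping, e.g. from a trained LDA model.
topic_word_weights = {
    0: {"game": 0.12, "team": 0.10, "score": 0.08, "season": 0.05},
    1: {"bank": 0.15, "loan": 0.09, "rate": 0.07, "market": 0.04},
}

def label_topic(word_weights, n=3):
    """Join the n highest-weight words into a crude human-readable label."""
    top = sorted(word_weights, key=word_weights.get, reverse=True)[:n]
    return "_".join(top)

labels = {k: label_topic(v) for k, v in topic_word_weights.items()}
# labels -> {0: 'game_team_score', 1: 'bank_loan_rate'}
```

More sophisticated labelling (e.g. picking a representative phrase) exists, but top-word concatenation is the usual first baseline.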
I have a list of words; these words correspond to labels in news articles and are not duplicated. I would like to cluster this list by topic. I tried WordNet, but I don't know how to do this when I only have a list of unique words and no further information. Thank you!
LDA has two hyperparameters; tuning them changes the induced topics. What do the alpha and beta hyperparameters contribute to LDA? How do the topics change as one or the other hyperparameter increases or decreases? Why are they hyperparameters and not just parameters?
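A quick sketch of the intuition behind the concentration parameters: drawing document–topic mixtures from a Dirichlet with a small versus a large alpha shows why small values yield sparse, peaked mixtures and large values yield near-uniform ones (the topic count of 5 is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 simulated document-topic distributions over 5 topics.
sparse = rng.dirichlet([0.1] * 5, size=1000)  # low alpha -> peaked mixtures
even = rng.dirichlet([10.0] * 5, size=1000)   # high alpha -> even mixtures

# Average mass of the single largest topic per document:
print(sparse.max(axis=1).mean())  # close to 1: each doc dominated by one topic
print(even.max(axis=1).mean())    # close to 1/5: topics spread evenly
```

The same logic applies to beta (often called eta) for the topic–word distributions.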
I am trying to calculate some sort of ambiguity score from text based on topic probabilities from a Latent Dirichlet Allocation model and the Hellinger distance between the topic distributions. Let's say I constructed my LDA model with 3 topics; these topics are related to basketball, football, and banking, respectively. I would like some kind of score that says that if the topic probabilities of a document are Basketball: $0.33$, Football: $0.33$, and Banking: $0.33$, that document is more ambiguous …
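One possible construction (a sketch, not the asker's exact model): compute the Hellinger distance from a document's topic distribution to the uniform distribution, and normalise so that a uniform mixture scores 1 (fully ambiguous) and a one-hot mixture scores 0.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def ambiguity(topic_probs):
    """1 for a uniform topic distribution, 0 for a one-hot distribution."""
    k = len(topic_probs)
    uniform = np.full(k, 1.0 / k)
    # The farthest a distribution can be from uniform is a one-hot vector.
    max_dist = hellinger(np.eye(k)[0], uniform)
    return 1.0 - hellinger(topic_probs, uniform) / max_dist

print(ambiguity([1/3, 1/3, 1/3]))     # 1.0 -> maximally ambiguous
print(ambiguity([0.96, 0.02, 0.02]))  # well below 1 -> much less ambiguous
```

Entropy of the topic distribution would be an equally reasonable ambiguity measure.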
My data includes women's comments on X and Y and men's comments on X and Y. Each comment is of equal length. I want to measure how different the word choice is between men and women when commenting on X. How can I do this?
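One simple baseline for this kind of comparison (a sketch with made-up comments, not the asker's data): compare each word's relative frequency in the two groups via a smoothed log-ratio, so that strongly positive scores mark words typical of one group and strongly negative scores the other.

```python
import math
from collections import Counter

# Hypothetical tokenised comments for the two groups.
women_x = "great product love the design love it".split()
men_x = "decent product bad battery bad value".split()

w, m = Counter(women_x), Counter(men_x)
vocab = set(w) | set(m)

def log_ratio(word):
    """Positive -> more typical of the women's comments, negative -> men's.
    Add-one smoothing avoids division by zero for group-exclusive words."""
    pw = (w[word] + 1) / (sum(w.values()) + len(vocab))
    pm = (m[word] + 1) / (sum(m.values()) + len(vocab))
    return math.log(pw / pm)

ranked = sorted(vocab, key=log_ratio, reverse=True)
print(ranked[0], ranked[-1])  # most distinctive word for each group
```

For a more principled version, the "log-odds with informative Dirichlet prior" method is the standard reference technique for group word-choice comparison.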
Given a collection of documents - each corresponding to some economic entity - I am looking to extract information and populate a table with predetermined headings. I have a small sample of this already done by humans, and I was wondering if there's an efficient way to automate it. Grateful for any suggestions.
I am running a Hierarchical Dirichlet Process (HDP) using gensim in Python, but since my corpus is too large it throws the following error:

    model = gensim.models.HdpModel(corpus, id2word=corpus.id2word, chunksize=50000)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 210, in __init__
        self.update(corpus)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 245, in update
        self.update_chunk(chunk)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 313, in update_chunk
        self.update_lambda(ss, word_list, opt_o)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 415, in update_lambda
        rhot * self.m_D * sstats.m_var_beta_ss / sstats.m_chunksize
    MemoryError

I have loaded my corpus using the following statement:

    corpus = gensim.corpora.MalletCorpus('chunk5000K_records.mallet')

And the data …
I am thinking about using a hierarchical Dirichlet process (HDP) to model a patent dataset. I've seen that HDP uses a base distribution and assumes that every topic comes from it. The problem is: first, I'm wondering what the main results of the HDP procedure are (in the case of LDA we obtain two matrices that we can use to construct word clouds and graphs, but in this case I'm not sure about the results) and what is the exact …
Here's my corpus:

    {
        0: "dogs are nice",         # canines are friendly
        1: "mutts are kind",        # canines are friendly
        2: "pooches are lovely",    # canines are friendly
        ...,
        3: "cats are mean",         # felines are unfriendly
        4: "moggies are nasty",     # felines are unfriendly
        5: "pussycats are unkind",  # felines are unfriendly
    }

As a human, the general topics I get from these documents are that:

    canines are friendly (0, 1, 2)
    felines are not friendly (3, 4, 5)

…
I was trying to understand the gensim Mallet wrapper for topic modeling as explained in this notebook. In point 11, it prepares a corpus in (term id, term frequency) format:

    >>> print(corpus[:1])  # for 1st document
    [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), …
I am trying to perform topic extraction on a pandas DataFrame. I am using LDA topic modeling to extract the topics in my DataFrame. No problem. But I would like to apply LDA topic modeling to each row of my DataFrame. Current DataFrame:

    date       cust_id  words
    3/14/2019  100001   samantha slip skirt pi ski
    1/21/2020  10002    steel skirt solid greenish
    5/19/2020  10003    arizona denim blouse d

The DataFrame I am looking for:

    date       cust_id  words  topic
    …
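A sketch of one common way to do this, assuming scikit-learn and pandas are available (the rows below mirror the question's example): fit a single LDA model on the whole `words` column, then attach each row's dominant topic as a new column.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    "date": ["3/14/2019", "1/21/2020", "5/19/2020"],
    "cust_id": [100001, 10002, 10003],
    "words": [
        "samantha slip skirt pi ski",
        "steel skirt solid greenish",
        "arizona denim blouse d",
    ],
})

# Fit LDA on the full column; per-row corpora would be too small to train on.
X = CountVectorizer().fit_transform(df["words"])
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Dominant topic per row = argmax of the document-topic distribution.
df["topic"] = lda.transform(X).argmax(axis=1)
print(df[["cust_id", "topic"]])
```

Training a separate LDA model per row is rarely meaningful; assigning each row its dominant topic from a shared model is the usual interpretation of this requirement.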
I have a social science background and I'm doing a text mining project. I'm looking for advice about the choice of the number of topics/clusters when analyzing textual data. In particular, I'm analyzing a dataset of more than 200,000 tweets and fitting a Latent Dirichlet Allocation model on them to find clusters that represent the main topics of the tweets in my dataset. However, when trying to decide on the optimal number of clusters, the results I'm finding …
I am new to topic modeling. Is it possible to implement ToPMine in Python? In a quick search, I can't seem to find any Python package with ToPMine. Is ToPMine better than LDA and LSA? I am aware that LDA and LSA have been around for a long time and are widely used. Thank you
I have a set of documents as follows, where each document has a set of words that represents its content:

    Doc1: {fish, moose, wildlife, hunting, bears, polar}
    Doc2: {energy, fuel, costs, oil, gas}
    Doc3: {wildlife, hunt, polar, fishing}

So, if I look at my documents I can deduce that Doc1 and Doc3 are very similar. I want distance metrics for bag-of-words. I followed some tutorials in Gensim about how to do it. However, as I understand it, they initially train …
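For word-set documents like these, a training-free option is the Jaccard distance; a minimal sketch using the question's three documents:

```python
# The three documents from the question, as plain Python sets.
docs = {
    "Doc1": {"fish", "moose", "wildlife", "hunting", "bears", "polar"},
    "Doc2": {"energy", "fuel", "costs", "oil", "gas"},
    "Doc3": {"wildlife", "hunt", "polar", "fishing"},
}

def jaccard_distance(a, b):
    """1 - |intersection| / |union|; 0 means identical word sets."""
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance(docs["Doc1"], docs["Doc3"]))  # 0.75: two shared words
print(jaccard_distance(docs["Doc1"], docs["Doc2"]))  # 1.0: disjoint sets
```

Note that exact matching misses near-duplicates such as "hunting"/"hunt" and "fish"/"fishing"; stemming or lemmatizing first would bring Doc1 and Doc3 much closer, matching the intuition in the question.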
Description: I have 24 documents, each one of around 2.5K tokens. They are public speeches. My text preprocessing pipeline is a generic one, including punctuation removal, expansion of English contractions, removal of stopwords, and tokenization. I have implemented and analyzed both Latent Dirichlet Allocation and Latent Semantic Analysis in Python with gensim. I am calculating the optimal number of topics by the topics' coherence. Issue: For any number of topics K (I have tried many, e.g. 10, 50, 100, 200) …
Is there any way I can map the topics generated by LDA back to the list of documents and identify which topic each document belongs to? I am interested in clustering documents using unsupervised learning and segregating them into appropriate clusters. Any link, code example, or paper would be greatly appreciated.
I am looking to employ unsupervised clustering on a dataset where each observation has a mix of textual and non-textual features. For each observation, I combine the features into a single vector of ~1000 dimensions. To cluster, I have two potential ideas:

    1. Use an autoencoder (or an embedding?) to reduce the dimensionality of the data and then cluster using k-means.
    2. Use a topic model. If so, isn't this the superior method in most circumstances to the above? …
I have a huge dataset of messages/comments classified with topics. The dataset consists of 1,000,000 records and has a total of 90 topics, like this:

    text     topic1  topic2  ...  topic90
    comment  1       0            1
    comment  0       1            0

I want to use a supervised method since I have already labeled all the comments. I want to know what the recommended approaches are to tackle this problem. The topics are quite unbalanced.
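A sketch of the standard baseline for this setup, assuming scikit-learn is available: a one-vs-rest linear classifier over TF-IDF features, with `class_weight="balanced"` as one way to address the unbalanced topics. The four comments and two topic columns below are toy stand-ins for the real 1M-row, 90-column data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

texts = ["refund please", "login broken", "refund and login issue", "slow app"]
# One binary column per topic; just 2 of the 90 shown for illustration.
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# One independent binary classifier per topic; "balanced" reweights each
# topic's positive/negative examples to counter the class imbalance.
clf = OneVsRestClassifier(
    LogisticRegression(class_weight="balanced")).fit(X, labels)

pred = clf.predict(vec.transform(["refund request"]))
print(pred)  # one 0/1 prediction per topic column
```

At 1M records, `SGDClassifier` inside the same one-vs-rest wrapper scales better than `LogisticRegression`, and per-topic precision/recall is a more honest metric than accuracy given the imbalance.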
I am working on creating a number of Top2Vec models on Reddit threads. I am basically changing the HDBSCAN cluster sizes to get different clusters of the Doc2Vec embeddings, representing different numbers of topics. I am trying to compare the different models using their coherence scores. I have tried using Gensim's coherence score but failed: I got an error message indicating that a word in the topics is not included in the dictionary. I also tried using tmtoolkit. While I …
I have a bunch of documents that I want to classify by which ones talk about soccer (unsupervised learning; I do not want to manually label the documents). One way I am thinking about is to go online and search for the most popular words in soccer articles to make a list of vocabulary (for example: score, shoot, World Cup, etc.). Then somehow use that vocabulary list to classify the documents (maybe if a particular document contains 30% of the words …
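The keyword-list idea sketched directly (the vocabulary and 30% threshold below are illustrative, matching the question's example): flag a document as soccer-related when the share of its tokens found in the hand-made vocabulary passes the threshold.

```python
# Hand-made soccer vocabulary; in practice this would come from the
# "most popular words in soccer articles" search the question describes.
SOCCER_VOCAB = {"score", "goal", "shoot", "match", "cup", "striker", "keeper"}

def is_soccer(text, threshold=0.3):
    """True when at least `threshold` of the tokens are soccer vocabulary."""
    tokens = text.lower().split()
    hits = sum(t in SOCCER_VOCAB for t in tokens)
    return hits / len(tokens) >= threshold

print(is_soccer("late goal wins the match"))     # True  (2 of 5 tokens hit)
print(is_soccer("the stock market fell today"))  # False (0 of 5 tokens hit)
```

A fixed token-fraction threshold is crude (long documents dilute the ratio, and multi-word terms like "World Cup" need phrase handling); TF-IDF-weighted keyword scores or a topic model seeded with this vocabulary would be more robust refinements.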