Clustering a list of words by topic

I have a list of unique words that correspond to labels in news articles. I would like to cluster this list by topic. I tried WordNet, but I don't know how to do this when all I have is a list of unique words and no other information. Thank you!
Category: Data Science

Calculate an ambiguity score based on topic models and Hellinger distance

I am trying to calculate some sort of ambiguity score from text based on topic probabilities from a Latent Dirichlet Allocation model and the Hellinger distance between the topic distributions. Let's say I constructed my LDA model with 3 topics, related to basketball, football, and banking, respectively. I would like some kind of score that says that if the topic probabilities of a document are Basketball: $0.33$, Football: $0.33$, and Banking: $0.33$, that document is more ambiguous …
Category: Data Science
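
For the ambiguity question above, one possible sketch (not the asker's method) is to measure how close a document's topic distribution is to the uniform distribution over topics. It assumes that "ambiguous" means "close to uniform"; the function names `hellinger` and `ambiguity_score` are hypothetical. The Hellinger distance to uniform is divided by its largest possible value, which is reached when a single topic has probability 1, so the score falls between 0 (one dominant topic) and 1 (all topics equally likely).

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def ambiguity_score(topic_probs):
    """Return 1.0 for a uniform topic mix and 0.0 for a single-topic document.

    Assumption: ambiguity is defined as closeness to the uniform distribution.
    """
    k = len(topic_probs)
    if k < 2:
        raise ValueError("need at least two topics")
    total = sum(topic_probs)
    if total <= 0:
        raise ValueError("topic probabilities must sum to a positive value")
    # Renormalize so rounded inputs such as 0.33, 0.33, 0.33 still sum to 1.
    p = [x / total for x in topic_probs]
    uniform = [1.0 / k] * k
    # Largest possible distance to uniform: all mass on one topic.
    max_distance = math.sqrt(1.0 - 1.0 / math.sqrt(k))
    return 1.0 - hellinger(p, uniform) / max_distance


if __name__ == "__main__":
    # Topic order: basketball, football, banking (as in the question).
    print(ambiguity_score([0.33, 0.33, 0.33]))
    print(ambiguity_score([0.90, 0.05, 0.05]))
```

Normalized entropy of the topic distribution would be another reasonable choice; the Hellinger version is used here only because the question asks for it.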

Memory error - Hierarchical Dirichlet Process, HDP gensim

I am running a Hierarchical Dirichlet Process (HDP) model with gensim in Python, but because my corpus is too large it throws the following error:
    model = gensim.models.HdpModel(corpus, id2word=corpus.id2word, chunksize=50000)
  File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 210, in __init__
    self.update(corpus)
  File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 245, in update
    self.update_chunk(chunk)
  File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 313, in update_chunk
    self.update_lambda(ss, word_list, opt_o)
  File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 415, in update_lambda
    rhot * self.m_D * sstats.m_var_beta_ss / sstats.m_chunksize
MemoryError
I loaded my corpus using the following statement:
    corpus = gensim.corpora.MalletCorpus('chunk5000K_records.mallet')
And the data …
Category: Data Science

Hierarchical Dirichlet process results

I am thinking about using a hierarchical Dirichlet process to model a patent dataset. I've seen that HDP uses a base distribution and assumes that every topic comes from that base distribution. I have two questions: first, what are the main outputs of the HDP procedure (with LDA we obtain two matrices that we can use to construct word clouds and graphs, but in this case I'm not sure what the outputs are), and what is the exact …
Category: Data Science

Topic modelling with many synonyms - how to extract 'latent themes'

Here's my corpus:
{
    0: "dogs are nice",         # canines are friendly
    1: "mutts are kind",        # canines are friendly
    2: "pooches are lovely",    # canines are friendly
    ...,
    3: "cats are mean",         # felines are unfriendly
    4: "moggies are nasty",     # felines are unfriendly
    5: "pussycats are unkind",  # felines are unfriendly
}
As a human, the general topics I get from these documents are that:
    canines are friendly (0, 1, 2)
    felines are not friendly (3, 4, 5)
…
Category: Data Science

Understanding output of gensim LDA topic modeling API

I was trying to understand the gensim Mallet wrapper for topic modeling as explained in this notebook. In point 11, it prepares the corpus, which is in term-document frequency format: >>> print(corpus[:1]) # for 1st document >>> [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), …
Category: Data Science

Topic Modelling in an existing dataframe in Python

I am trying to perform topic extraction on a pandas DataFrame. I am using LDA topic modeling to extract the topics in my dataframe, which works without problems. However, I would like to apply LDA topic modeling to each row in my dataframe.
Current dataframe:
    date       cust_id  words
    3/14/2019  100001   samantha slip skirt pi ski
    1/21/2020  10002    steel skirt solid greenish
    5/19/2020  10003    arizona denim blouse d
The dataframe I am looking for: date cust_id words topic 0 words topic …
Category: Data Science

Choice of the number of topics (clusters) in textual data

I have a social science background and I'm doing a text mining project. I'm looking for advice on choosing the number of topics/clusters when analyzing textual data. In particular, I'm analyzing a dataset of more than 200,000 tweets and fitting a Latent Dirichlet Allocation model to them to find clusters that represent the main topics of the tweets in my dataset. However, I was trying to decide the optimal number of clusters but the results I'm finding …
Category: Data Science

Topic Modeling: LDA vs LSA vs ToPMine

I am new to topic modeling. Is it possible to implement ToPMine in Python? In a quick search, I couldn't find any Python package for ToPMine. Is ToPMine better than LDA and LSA? I am aware that LDA and LSA have been around for a long time and are widely used. Thank you.
Category: Data Science

Hellinger Distance in Gensim

I have a set of documents, shown below, where each document has a set of words that represents its content.
    Doc1: {fish, moose, wildlife, hunting, bears, polar}
    Doc2: {energy, fuel, costs, oil, gas}
    Doc3: {wildlife, hunt, polar, fishing}
Looking at my documents, I can deduce that Doc1 and Doc3 are very similar. I want a distance metric for bag-of-words. I followed some tutorials in Gensim about how to do it. However, as I understand, initially they train …
Category: Data Science
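
For the bag-of-words question above, here is a minimal sketch that computes the Hellinger distance directly on normalized word frequencies, with no topic model involved. This is an assumption about what the asker wants, not the approach from the Gensim tutorials, and the helper names `bow_distribution` and `hellinger_bow` are hypothetical. Because there is no stemming, "hunting" and "hunt" (or "fish" and "fishing") count as different words.

```python
import math
from collections import Counter


def bow_distribution(tokens):
    """Turn a list of tokens into a word -> relative frequency mapping."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no tokens")
    return {word: count / total for word, count in counts.items()}


def hellinger_bow(tokens_a, tokens_b):
    """Hellinger distance between two documents' word-frequency distributions."""
    p = bow_distribution(tokens_a)
    q = bow_distribution(tokens_b)
    vocabulary = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2
        for w in vocabulary
    )
    return math.sqrt(squared / 2.0)


doc1 = ["fish", "moose", "wildlife", "hunting", "bears", "polar"]
doc2 = ["energy", "fuel", "costs", "oil", "gas"]
doc3 = ["wildlife", "hunt", "polar", "fishing"]

if __name__ == "__main__":
    print("Doc1 vs Doc3:", hellinger_bow(doc1, doc3))
    print("Doc1 vs Doc2:", hellinger_bow(doc1, doc2))
```

Documents with no words in common get the maximum distance of 1, so Doc1 comes out closer to Doc3 than to Doc2. Stemming or lemmatizing the tokens first would likely bring Doc1 and Doc3 closer still.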

Topic modelling on only 24 documents gives the same "topic" for any K

Description: I have 24 documents, each of around 2.5K tokens. They are public speeches. My text preprocessing pipeline is a generic one, including punctuation removal, expansion of English contractions, removal of stopwords and tokenization. I have implemented and analyzed both Latent Dirichlet Allocation and Latent Semantic Analysis in Python with gensim. I am choosing the optimal number of topics using topic coherence. Issue: For any number of topics K (I have tried many, e.g. 10, 50, 100, 200) …
Category: Data Science

Topic models for non-textual data?

I am looking to apply unsupervised clustering to a dataset where each observation has a mix of textual and non-textual features. For each observation, I combine the features into a single vector of ~1000 dimensions. To cluster, I have two potential ideas:
    1. Using an autoencoder (or an embedding?) to reduce the dimensionality of the data and then cluster using k-means.
    2. Could I use a topic model? If so, isn't this the superior method in most circumstances to the above? …
Category: Data Science

Multi-Label Text Topic Classification

I have a huge dataset of messages/comments classified with topics. The dataset consists of 1 million records and has a total of 90 topics, like this:
    text     topic1  topic2  ....  topic90
    comment  1       0             1
    comment  0       1             0
I want to use a supervised method, as I have already labeled all comments. I want to know what the recommended approach is to tackle this problem. The topics are quite unbalanced.
Category: Data Science

Measuring coherence score for Top2Vec models

I am working on creating a number of Top2Vec models on Reddit threads. I am basically changing the HDBSCAN cluster sizes to get different clusters of the Doc2Vec embeddings, representing a different number of topics. I am trying to compare different models using their coherence score. I have tried using Gensim's coherence score but failed. I got an error message indicating that a word in the topics is not included in the dictionary. I also tried using tmtoolkit. While I …
Category: Data Science

Classify documents using a set of known vocabularies

I have a bunch of documents, and I want to identify which ones talk about soccer (unsupervised learning, I do not want to manually label the documents). One approach I am considering is to go online and search for the most popular words in soccer articles to build a vocabulary list (for example: score, shoot, World Cup, etc.). Then somehow use that vocabulary list to classify the documents (maybe if a particular document contains 30% of the words …
Category: Data Science
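
The vocabulary idea in the question above could be sketched roughly as follows. The question is truncated, so this assumes "30% of the words" means that at least 30% of the vocabulary terms appear in the document; the function names, the default threshold, and the extra vocabulary entries ("goal", "penalty") are illustrative assumptions, not part of the question. Multi-word terms such as "World Cup" are matched as whole phrases.

```python
import re


def _normalize(text):
    """Lowercase the text and keep only alphanumeric tokens, joined by spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def vocabulary_coverage(document, vocabulary):
    """Fraction of vocabulary terms that occur in the document."""
    padded = f" {_normalize(document)} "
    terms = {_normalize(term) for term in vocabulary}
    terms.discard("")
    if not terms:
        raise ValueError("vocabulary is empty")
    # Padding with spaces ensures whole-word (or whole-phrase) matches only.
    hits = sum(1 for term in terms if f" {term} " in padded)
    return hits / len(terms)


def is_about_topic(document, vocabulary, threshold=0.3):
    """Assumed rule: label the document as on-topic if coverage >= threshold."""
    return vocabulary_coverage(document, vocabulary) >= threshold


# Hypothetical vocabulary: the first three terms come from the question.
soccer_vocabulary = ["score", "shoot", "World Cup", "goal", "penalty"]

if __name__ == "__main__":
    text = "The striker hoped to score in the World Cup final."
    print(vocabulary_coverage(text, soccer_vocabulary))
    print(is_about_topic(text, soccer_vocabulary))
```

A fixed threshold like this is sensitive to vocabulary size and document length, so it is probably best treated as a baseline to compare against other approaches.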
