Clustering a list of words by topic

I have a list of words; they correspond to labels in news articles and are not duplicated. I would like to cluster this list by topic. I have tried WordNet, but I don't know how to proceed when I only have a list of unique words and no other information. Thank you!
Category: Data Science
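For the question above, one possible direction is a greedy clustering over pairwise word similarities. The sketch below uses a hand-made toy similarity table as a stand-in; with nltk, that function could instead compute WordNet `path_similarity` between the words' first synsets. All names and values here are illustrative assumptions, not a definitive solution.

```python
# Sketch: greedy clustering of unique words by pairwise semantic similarity.
# `toy_similarity` is a stand-in; with nltk it could be replaced by WordNet
# path_similarity between the words' first synsets.

def cluster_words(words, similarity, threshold=0.5):
    """Assign each word to the first cluster whose seed word is similar enough."""
    clusters = []  # list of lists; clusters[i][0] is the seed word
    for w in words:
        for cluster in clusters:
            if similarity(w, cluster[0]) >= threshold:
                cluster.append(w)
                break
        else:
            clusters.append([w])
    return clusters

# Toy similarity table standing in for WordNet scores (hypothetical values).
SIM = {
    frozenset({"dog", "cat"}): 0.8,
    frozenset({"dog", "bank"}): 0.1,
    frozenset({"cat", "bank"}): 0.1,
}

def toy_similarity(a, b):
    return 1.0 if a == b else SIM.get(frozenset({a, b}), 0.0)

clusters = cluster_words(["dog", "cat", "bank"], toy_similarity, threshold=0.5)
print(clusters)  # [['dog', 'cat'], ['bank']]
```

The greedy pass is order-dependent; for a more principled grouping, the same similarity matrix could be fed to hierarchical clustering instead.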

How to choose threshold for gensim Phrases when generating bigrams?

I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim LDA:

```python
from gensim.models.phrases import Phrases, Phraser

# 7k documents, ~500-1k tokens each. Already ran cleanup, stop_words, lemmatization, etc.
docs = get_docs()
phrases = Phrases(docs)
bigram = Phraser(phrases)
docs = [bigram[d] for d in docs]
```

Phrases defaults to min_count=5, threshold=10. I don't quite understand how they interact; they seem related? Anyway, I see threshold take values in different tutorials ranging from 1 to 1000, described as important in determining the number of …
Category: Data Science
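On how the two parameters interact: gensim's default scorer (its `original_scorer`) combines them in one formula, and a candidate bigram is promoted only when the score exceeds `threshold`. The counts below are made-up numbers for illustration.

```python
# How min_count and threshold interact in gensim's default Phrases scorer.
# This mirrors gensim's original_scorer formula; the counts are invented.

def original_score(worda_count, wordb_count, bigram_count, vocab_size, min_count):
    """score = (bigram_count - min_count) / (worda_count * wordb_count) * vocab_size"""
    return (bigram_count - min_count) / (worda_count * wordb_count) * vocab_size

# A candidate bigram is kept only if score > threshold, so raising either
# min_count or threshold prunes more candidates.
score = original_score(worda_count=50, wordb_count=40, bigram_count=30,
                       vocab_size=1000, min_count=5)
print(score)       # (30 - 5) / (50 * 40) * 1000 = 12.5
print(score > 10)  # True: passes the default threshold of 10
```

So `min_count` is subtracted inside the score (and also acts as a hard floor on counts), while `threshold` is the cut-off the score must clear; that is why sensible threshold values vary so widely across corpora.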

Apply Labeled LDA on large data

I'm using a dataset that contains about 1.5M documents. Each document comes with some keywords describing its topics (thus multi-labelled). Each document belongs to some authors (not just one author per document). I want to find out the topics each author is interested in by looking at the documents they write. I'm currently looking at an LDA variation (Labeled LDA, proposed by D. Ramage: https://www.aclweb.org/anthology/D/D09/D09-1026.pdf). I'm using all the documents in my dataset to train a model and using the model to …
Category: Data Science
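Independently of which LDA variant is trained, the author-level step described above can be done post hoc: once any topic model yields a topic distribution per document, an author's interests can be approximated by averaging the distributions of the documents they (co-)wrote. This is a sketch of that aggregation, not Labeled LDA itself; all names and numbers are illustrative.

```python
# Post-hoc sketch: approximate each author's topic interests by averaging the
# topic distributions of the documents they (co-)authored.

def author_topic_profiles(doc_topics, doc_authors):
    """doc_topics: {doc_id: [p_topic0, p_topic1, ...]}; doc_authors: {doc_id: [authors]}."""
    sums, counts = {}, {}
    for doc_id, topics in doc_topics.items():
        for author in doc_authors[doc_id]:
            acc = sums.setdefault(author, [0.0] * len(topics))
            for i, p in enumerate(topics):
                acc[i] += p
            counts[author] = counts.get(author, 0) + 1
    return {a: [v / counts[a] for v in acc] for a, acc in sums.items()}

doc_topics = {"d1": [0.9, 0.1], "d2": [0.5, 0.5]}
doc_authors = {"d1": ["alice"], "d2": ["alice", "bob"]}
profiles = author_topic_profiles(doc_topics, doc_authors)
print(profiles["alice"])  # [0.7, 0.3] -- mean of d1 and d2
print(profiles["bob"])    # [0.5, 0.5]
```

At 1.5M documents this loop would typically be replaced by a sparse matrix product (author-document incidence matrix times document-topic matrix), but the logic is the same.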

reuse of LDA model for new data

I am working with the LDA (Latent Dirichlet Allocation) model from sklearn and I have a question about reusing the model I have. After training my model with data, how do I use it to make a prediction on new data? Basically, the goal is to read the content of an email.

```python
countVectorizer = CountVectorizer(stop_words=stop_words)
termFrequency = countVectorizer.fit_transform(corpus)
featureNames = countVectorizer.get_feature_names()
model = LatentDirichletAllocation(n_components=3)
model.fit(termFrequency)
joblib.dump(model, 'lda.pkl')
# lda_from_joblib = joblib.load('lda.pkl')
```

I save my model using joblib. Now I want …
Category: Data Science
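The key points for the question above are that new documents go through the *same* fitted vectorizer's `.transform()` (never `fit_transform`), and that the vectorizer must be persisted along with the LDA model. A minimal self-contained sketch, with a toy corpus standing in for the real emails:

```python
# Sketch of reusing a trained sklearn LDA: persist BOTH the vectorizer and the
# model, then call .transform() (never fit_transform) on new text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "free money win prize now",
    "meeting agenda project deadline",
    "win free prize money claim",
    "project meeting notes deadline agenda",
]
vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(tf)

# joblib.dump((vectorizer, lda), "lda.pkl") would persist both objects together.
new_email = ["claim your free prize money"]
doc_topics = lda.transform(vectorizer.transform(new_email))
print(doc_topics.shape)  # (1, 2): one row per new document, one column per topic
```

Each row of `doc_topics` is the new document's topic distribution and sums to 1; `argmax` over it gives the dominant topic.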

Calculate an ambiguity score based on topic models and Hellinger distance

I am trying to calculate some sort of ambiguity score from text, based on topic probabilities from a Latent Dirichlet Allocation model and the Hellinger distance between the topic distributions. Let's say I constructed my LDA model with 3 topics, related to basketball, football, and banking, respectively. I would like some kind of score saying that if the topic probabilities of a document are Basketball: $0.33$, Football: $0.33$, and Banking: $0.33$, that document is more ambiguous …
Category: Data Science
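One way to realize the idea above: measure the Hellinger distance between a document's topic distribution and the uniform distribution, since a document equally spread over all topics is the most ambiguous case. The `1 - distance` scaling below is a naming choice of this sketch, not a standard definition.

```python
# Sketch of an ambiguity score: Hellinger distance to the uniform distribution.
# Distance 0 means maximally ambiguous (all topics equally likely), so this
# sketch defines ambiguity = 1 - distance.
from math import sqrt

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range [0, 1])."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

def ambiguity(topic_probs):
    k = len(topic_probs)
    uniform = [1.0 / k] * k
    return 1.0 - hellinger(topic_probs, uniform)

print(ambiguity([1 / 3, 1 / 3, 1 / 3]))      # 1.0 -- maximally ambiguous
print(ambiguity([0.9, 0.05, 0.05]) < 0.7)    # True -- clearly about one topic
```

The same `hellinger` function can also compare two documents' distributions directly, which matches the pairwise use of Hellinger distance mentioned in the question.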

How to construct the document-topic matrix using the word-topic and topic-word matrix calculated using Latent Dirichlet Allocation?

How do I construct the document-topic matrix using the word-topic and topic-word matrix calculated using Latent Dirichlet Allocation? I cannot seem to find it anywhere, not even from the author of LDA, D. Blei. Gensim and sklearn just work, but I want to know how to use the two matrices to construct the document-topic matrix (Spark MLlib LDA only gives me the two matrices and not the document-topic matrix).
Category: Data Science
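The principled answer is to rerun inference ("folding in") with the learned topic-word matrix held fixed. As a rough, self-contained approximation of what that produces: normalize the topic-word matrix column-wise into p(topic | word), then average it over the document's word counts. All numbers below are invented; this is a sketch of the approximation, not Blei's inference procedure.

```python
# Quick approximation of doc-topic proportions from a learned topic-word
# matrix: weight p(topic | word) by the document's word counts and normalize.
# Proper fold-in would rerun variational inference with topics fixed.

def approx_doc_topics(topic_word, doc_counts):
    """topic_word[k][w] = weight of word w in topic k; doc_counts[w] = count in doc."""
    n_topics = len(topic_word)
    n_words = len(topic_word[0])
    doc = [0.0] * n_topics
    for w in range(n_words):
        col_sum = sum(topic_word[k][w] for k in range(n_topics))
        if col_sum == 0 or doc_counts[w] == 0:
            continue
        for k in range(n_topics):
            doc[k] += doc_counts[w] * topic_word[k][w] / col_sum  # count * p(k|w)
    total = sum(doc)
    return [x / total for x in doc]

topic_word = [[8, 1, 1], [1, 1, 8]]  # 2 topics x 3 words (made-up weights)
theta = approx_doc_topics(topic_word, doc_counts=[3, 0, 1])
print(theta)  # topic 0 dominates: word 0 appears 3x and belongs mostly to topic 0
```

For Spark MLlib specifically, the distributed LDA model exposes methods to recover per-document topic distributions, so in practice that path is preferable to hand-rolling the reconstruction.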

Memory error - Hierarchical Dirichlet Process, HDP gensim

I am running Hierarchical Dirichlet Process (HDP) using gensim in Python, but as my corpus is too large it throws the following error:

```python
model = gensim.models.HdpModel(corpus, id2word=corpus.id2word, chunksize=50000)
```

```
File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 210, in __init__
    self.update(corpus)
File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 245, in update
    self.update_chunk(chunk)
File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 313, in update_chunk
    self.update_lambda(ss, word_list, opt_o)
File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 415, in update_lambda
    rhot * self.m_D * sstats.m_var_beta_ss / sstats.m_chunksize
MemoryError
```

I have loaded my corpus using the following statement:

```python
corpus = gensim.corpora.MalletCorpus('chunk5000K_records.mallet')
```

And the data …
Category: Data Science

Topic modelling with many synonyms - how to extract 'latent themes'

Here's my corpus:

```python
{
    0: "dogs are nice",        # canines are friendly
    1: "mutts are kind",       # canines are friendly
    2: "pooches are lovely",   # canines are friendly
    ...,
    3: "cats are mean",        # felines are unfriendly
    4: "moggies are nasty",    # felines are unfriendly
    5: "pussycats are unkind", # felines are unfriendly
}
```

As a human, the general topics I get from these documents are that: canines are friendly (0, 1, 2); felines are not friendly (3, 4, 5) …
Category: Data Science
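One pragmatic approach to the corpus above: collapse synonyms to a canonical token before topic modelling, so "dogs"/"mutts"/"pooches" all count as one term. The synonym map below is hand-written for this toy corpus; in practice it could come from WordNet synsets or from merging nearest neighbours in a word-embedding space.

```python
# Sketch: normalize synonyms to one surface form before running LDA, so the
# model sees shared terms instead of six unrelated vocabulary items.

SYNONYMS = {
    "dogs": "canine", "mutts": "canine", "pooches": "canine",
    "cats": "feline", "moggies": "feline", "pussycats": "feline",
    "nice": "friendly", "kind": "friendly", "lovely": "friendly",
    "mean": "unfriendly", "nasty": "unfriendly", "unkind": "unfriendly",
}

def normalize(doc):
    return [SYNONYMS.get(tok, tok) for tok in doc.lower().split()]

corpus = {0: "dogs are nice", 1: "mutts are kind", 3: "cats are mean"}
normalized = {i: normalize(d) for i, d in corpus.items()}
print(normalized[0])                       # ['canine', 'are', 'friendly']
print(normalized[0] == normalized[1])      # True -- synonyms now share one form
```

After normalization, a standard topic model can recover the two latent themes, since the documents finally share vocabulary.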

Understanding output of gensim LDA topic modeling API

I was trying to understand the gensim MALLET wrapper for topic modeling as explained in this notebook. In point 11, it prepares a corpus in term-document-frequency format:

```python
>>> print(corpus[:1])  # for 1st document
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), …
```
Category: Data Science
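To interpret the output above: each document is a bag-of-words list of `(token_id, frequency)` tuples, so `(4, 5)` means the token with dictionary id 4 occurs 5 times in that document. Given the dictionary's `id2word` mapping, the pairs can be decoded into readable form; the words below are invented for illustration.

```python
# Decode a gensim-style bag-of-words document back into (word, count) pairs.
# The id2word values here are hypothetical stand-ins for the real dictionary.

def decode_bow(bow, id2word):
    return [(id2word[token_id], count) for token_id, count in bow]

id2word = {0: "economy", 4: "market", 6: "growth"}  # hypothetical mapping
bow = [(0, 1), (4, 5), (6, 2)]  # e.g. "market" occurs 5 times in this document
decoded = decode_bow(bow, id2word)
print(decoded)  # [('economy', 1), ('market', 5), ('growth', 2)]
```

With a real gensim `Dictionary`, the same decoding is `[(dictionary[i], c) for i, c in bow]`.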

Topic Modelling in an existing dataframe in python

I am trying to perform topic extraction on a pandas dataframe. I am using LDA topic modeling to extract the topics in my dataframe. No problem. But I would like to apply LDA topic modeling to each row in my dataframe.

Current dataframe:

```
date       cust_id  words
3/14/2019  100001   samantha slip skirt pi ski
1/21/2020  10002    steel skirt solid greenish
5/19/2020  10003    arizona denim blouse d
```

The dataframe I am looking for:

```
date  cust_id  words  topic
0     words    topic  …
```
Category: Data Science
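A minimal sketch of the transformation asked for above: fit LDA on the `words` column and write the dominant topic id back as a new `topic` column. Column names follow the question; the choice of 2 components is arbitrary for the tiny example.

```python
# Sketch: add a per-row dominant-topic column to a pandas dataframe via
# sklearn LDA fitted on the `words` column.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.DataFrame({
    "date": ["3/14/2019", "1/21/2020", "5/19/2020"],
    "cust_id": [100001, 10002, 10003],
    "words": [
        "samantha slip skirt pi ski",
        "steel skirt solid greenish",
        "arizona denim blouse d",
    ],
})

tf = CountVectorizer().fit_transform(df["words"])
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(tf)
df["topic"] = lda.transform(tf).argmax(axis=1)  # dominant topic id per row
print(df[["cust_id", "topic"]])
```

Note that LDA is still trained on the whole column at once; only the *assignment* is per row, which is usually what "apply LDA to each row" means in practice (a per-row model would have a single document to learn from).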

Choice of the number of topics (clusters) in textual data

I have a social science background and I'm doing a text-mining project. I'm looking for advice about choosing the number of topics/clusters when analyzing textual data. In particular, I'm analyzing a dataset of more than 200,000 tweets and fitting a Latent Dirichlet Allocation model on them to find clusters that represent the main topics of the tweets in my dataset. However, I was trying to decide the optimal number of clusters, but the results I'm finding …
Category: Data Science
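One rough, self-contained heuristic for the question above: fit LDA for several values of K and compare perplexity (lower is better). Topic coherence (e.g. gensim's `CoherenceModel`) is often preferred because it correlates better with human judgment; this sketch uses sklearn perplexity only to stay self-contained, and the corpus is toy data.

```python
# Sketch: scan candidate K values and compare LDA perplexity (lower = better).
# Real usage should compute perplexity on held-out documents, not in-sample.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "election vote senate policy", "vote election campaign policy",
    "goal match score team", "team match goal league",
    "stock market price shares", "market shares price trading",
] * 5
tf = CountVectorizer().fit_transform(tweets)

scores = {}
for k in (2, 3, 5):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(tf)
    scores[k] = lda.perplexity(tf)  # in-sample here, held-out in practice

best_k = min(scores, key=scores.get)
print(scores, "-> best K by (in-sample) perplexity:", best_k)
```

For 200,000 tweets the scan is the same, just with a held-out split and a coarser grid of K values first, refined around the minimum.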

Topic Modeling: LDA vs LSA vs ToPMine

I am new to Topic Modeling. Is it possible to implement ToPMine in Python? In a quick search, I can't seem to find any Python package with ToPMine. Is ToPMine better than LDA and LSA? I am aware that LDA & LSA have been around for a long time and widely used. Thank you
Category: Data Science

Deep Regression Ensembles(DRE) - text analysis

I read an article about Deep Regression Ensembles (DRE), which can outperform DNNs trained with SGD. My question is: could I use DRE in text classification? (For example, could I use it instead of LDA?) What about sentiment analysis? Or is it just a method for estimating time-series data? (I am not really a master of DL, and my supervisor sent me this article to use in my research, but I don't know where I should use it. My research field is …
Category: Data Science

Topic modelling on only 24 documents gives the same "topic" for any K

Description: I have 24 documents, each of around 2.5K tokens; they are public speeches. My text-preprocessing pipeline is a generic one, including punctuation removal, expansion of English contractions, stopword removal, and tokenization. I have implemented and analyzed both Latent Dirichlet Allocation and Latent Semantic Analysis in Python with gensim. I am selecting the optimal number of topics by topic coherence. Issue: For any number of topics K (I have tried many, e.g. 10, 50, 100, 200) …
Category: Data Science
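For reference, the generic pipeline described above (punctuation removal, contraction expansion, stopword removal, tokenization) can be sketched in a few lines. The contraction and stopword lists here are tiny illustrations, not complete resources.

```python
# Minimal version of the described preprocessing pipeline: lowercase, expand
# a few contractions, strip punctuation, tokenize, drop stopwords.
import re

CONTRACTIONS = {"don't": "do not", "it's": "it is", "i'm": "i am"}
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "do", "not", "it", "i", "am"}

def preprocess(text):
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return [tok for tok in text.split() if tok not in STOPWORDS]

tokens = preprocess("It's the economy, and I'm certain of it!")
print(tokens)  # ['economy', 'certain']
```

With only 24 documents, a common remedy for the degenerate-topics issue is to split each long speech into smaller chunks so the model sees more (pseudo-)documents; the tokenizer above makes that split straightforward.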

How to Combine tfidf with LSTM in keras?

I am classifying emails as spam or ham using an LSTM and some modified forms of it (adding a convolutional layer at the end). For converting documents into vectors I am using the keras texts_to_sequences function. But now I want to use TF-IDF with the LSTM; can anyone tell me or share code for how to do it? Please also guide me on whether it is possible and a good approach or not. If you are wondering why I would like to do this, there are …
Topic: keras tfidf lda nlp
Category: Data Science
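One common way to combine the two: weight each token position by its TF-IDF-style score before the sequence reaches the LSTM, so frequent-but-uninformative tokens contribute less. The sketch below stops at building the per-token weight sequences with sklearn; the commented keras fragment shows where they would plug in. Layer names and shapes there are illustrative assumptions.

```python
# Sketch: per-token IDF weights that can scale embeddings before an LSTM.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["win free money now", "meeting notes for the project"]
vectorizer = TfidfVectorizer()
vectorizer.fit(docs)
vocab = vectorizer.vocabulary_   # term -> column index
idf = vectorizer.idf_            # learned inverse document frequencies

def token_weights(doc):
    """IDF weight per token position (out-of-vocabulary tokens get 0.0)."""
    return [idf[vocab[t]] if t in vocab else 0.0 for t in doc.lower().split()]

weights = token_weights("free money tonight")
print(weights)  # third token is out-of-vocabulary, so its weight is 0.0

# In keras (illustrative, not run here), the weights could scale embeddings:
#   weighted = layers.Multiply()([embedding_output, weight_input])
#   x = layers.LSTM(64)(weighted)
```

The alternative, feeding the TF-IDF document matrix straight into an LSTM, loses word order, which defeats the point of the LSTM; per-token weighting keeps the sequence structure intact.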

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.