Clustering a list of words by topic

I have a list of words; they correspond to labels in news articles and are not duplicated. I would like to cluster this list by topic. I have tried WordNet, but I don't know how to proceed when I only have a list of unique words and no other information. Thank you!
Category: Data Science
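For the question above, one possible direction is a greedy clustering over pairwise word similarities. The sketch below uses a hand-made toy similarity table as a stand-in; with nltk, that function could instead compute WordNet `path_similarity` between the words' first synsets. All names and values here are illustrative assumptions, not a definitive solution.

```python
# Sketch: greedy clustering of unique words by pairwise semantic similarity.
# `toy_similarity` is a stand-in; with nltk it could be replaced by WordNet
# path_similarity between the words' first synsets.

def cluster_words(words, similarity, threshold=0.5):
    """Assign each word to the first cluster whose seed word is similar enough."""
    clusters = []  # list of lists; clusters[i][0] is the seed word
    for w in words:
        for cluster in clusters:
            if similarity(w, cluster[0]) >= threshold:
                cluster.append(w)
                break
        else:
            clusters.append([w])
    return clusters

# Toy similarity table standing in for WordNet scores (hypothetical values).
SIM = {
    frozenset({"dog", "cat"}): 0.8,
    frozenset({"dog", "bank"}): 0.1,
    frozenset({"cat", "bank"}): 0.1,
}

def toy_similarity(a, b):
    return 1.0 if a == b else SIM.get(frozenset({a, b}), 0.0)

clusters = cluster_words(["dog", "cat", "bank"], toy_similarity, threshold=0.5)
print(clusters)  # [['dog', 'cat'], ['bank']]
```

The greedy pass is order-dependent; for a more principled grouping, the same similarity matrix could be fed to hierarchical clustering instead.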

How to choose threshold for gensim Phrases when generating bigrams?

I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim LDA:

```python
from gensim.models.phrases import Phrases, Phraser

# 7k documents, ~500-1k tokens each. Already ran cleanup, stop_words, lemmatization, etc.
docs = get_docs()
phrases = Phrases(docs)
bigram = Phraser(phrases)
docs = [bigram[d] for d in docs]
```

Phrases defaults to min_count=5, threshold=10. I don't quite understand how they interact; they seem related? Anyway, I see threshold take values in different tutorials ranging from 1 to 1000, described as important in determining the number of …
Category: Data Science
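On how the two parameters interact: gensim's default scorer (its `original_scorer`) combines them in one formula, and a candidate bigram is promoted only when the score exceeds `threshold`. The counts below are made-up numbers for illustration.

```python
# How min_count and threshold interact in gensim's default Phrases scorer.
# This mirrors gensim's original_scorer formula; the counts are invented.

def original_score(worda_count, wordb_count, bigram_count, vocab_size, min_count):
    """score = (bigram_count - min_count) / (worda_count * wordb_count) * vocab_size"""
    return (bigram_count - min_count) / (worda_count * wordb_count) * vocab_size

# A candidate bigram is kept only if score > threshold, so raising either
# min_count or threshold prunes more candidates.
score = original_score(worda_count=50, wordb_count=40, bigram_count=30,
                       vocab_size=1000, min_count=5)
print(score)       # (30 - 5) / (50 * 40) * 1000 = 12.5
print(score > 10)  # True: passes the default threshold of 10
```

So `min_count` is subtracted inside the score (and also acts as a hard floor on counts), while `threshold` is the cut-off the score must clear; that is why sensible threshold values vary so widely across corpora.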

Apply Labeled LDA on large data

I'm using a dataset that contains about 1.5M documents. Each document comes with some keywords describing its topics (thus multi-labelled). Each document belongs to some authors (not just one author per document). I want to find out the topics each author is interested in by looking at the documents they write. I'm currently looking at an LDA variation (Labeled LDA, proposed by D. Ramage: https://www.aclweb.org/anthology/D/D09/D09-1026.pdf). I'm using all the documents in my dataset to train a model and using the model to …
Category: Data Science
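Independently of which LDA variant is trained, the author-level step described above can be done post hoc: once any topic model yields a topic distribution per document, an author's interests can be approximated by averaging the distributions of the documents they (co-)wrote. This is a sketch of that aggregation, not Labeled LDA itself; all names and numbers are illustrative.

```python
# Post-hoc sketch: approximate each author's topic interests by averaging the
# topic distributions of the documents they (co-)authored.

def author_topic_profiles(doc_topics, doc_authors):
    """doc_topics: {doc_id: [p_topic0, p_topic1, ...]}; doc_authors: {doc_id: [authors]}."""
    sums, counts = {}, {}
    for doc_id, topics in doc_topics.items():
        for author in doc_authors[doc_id]:
            acc = sums.setdefault(author, [0.0] * len(topics))
            for i, p in enumerate(topics):
                acc[i] += p
            counts[author] = counts.get(author, 0) + 1
    return {a: [v / counts[a] for v in acc] for a, acc in sums.items()}

doc_topics = {"d1": [0.9, 0.1], "d2": [0.5, 0.5]}
doc_authors = {"d1": ["alice"], "d2": ["alice", "bob"]}
profiles = author_topic_profiles(doc_topics, doc_authors)
print(profiles["alice"])  # [0.7, 0.3] -- mean of d1 and d2
print(profiles["bob"])    # [0.5, 0.5]
```

At 1.5M documents this loop would typically be replaced by a sparse matrix product (author-document incidence matrix times document-topic matrix), but the logic is the same.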

reuse of LDA model for new data

I am working with the LDA (Latent Dirichlet Allocation) model from sklearn and I have a question about reusing the model I have. After training my model with data, how do I use it to make a prediction on new data? Basically, the goal is to read the content of an email.

```python
countVectorizer = CountVectorizer(stop_words=stop_words)
termFrequency = countVectorizer.fit_transform(corpus)
featureNames = countVectorizer.get_feature_names()
model = LatentDirichletAllocation(n_components=3)
model.fit(termFrequency)
joblib.dump(model, 'lda.pkl')
# lda_from_joblib = joblib.load('lda.pkl')
```

I save my model using joblib. Now I want …
Category: Data Science
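The key points for the question above are that new documents go through the *same* fitted vectorizer's `.transform()` (never `fit_transform`), and that the vectorizer must be persisted along with the LDA model. A minimal self-contained sketch, with a toy corpus standing in for the real emails:

```python
# Sketch of reusing a trained sklearn LDA: persist BOTH the vectorizer and the
# model, then call .transform() (never fit_transform) on new text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "free money win prize now",
    "meeting agenda project deadline",
    "win free prize money claim",
    "project meeting notes deadline agenda",
]
vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(tf)

# joblib.dump((vectorizer, lda), "lda.pkl") would persist both objects together.
new_email = ["claim your free prize money"]
doc_topics = lda.transform(vectorizer.transform(new_email))
print(doc_topics.shape)  # (1, 2): one row per new document, one column per topic
```

Each row of `doc_topics` is the new document's topic distribution and sums to 1; `argmax` over it gives the dominant topic.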

Calculate an ambiguity score based on topic models and Hellinger distance

I am trying to calculate some sort of ambiguity score from text, based on topic probabilities from a Latent Dirichlet Allocation model and the Hellinger distance between the topic distributions. Let's say I constructed my LDA model with 3 topics, related to basketball, football, and banking, respectively. I would like some kind of score saying that if the topic probabilities of a document are Basketball: $0.33$, Football: $0.33$, and Banking: $0.33$, that document is more ambiguous …
Category: Data Science
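One way to realize the idea above: measure the Hellinger distance between a document's topic distribution and the uniform distribution, since a document equally spread over all topics is the most ambiguous case. The `1 - distance` scaling below is a naming choice of this sketch, not a standard definition.

```python
# Sketch of an ambiguity score: Hellinger distance to the uniform distribution.
# Distance 0 means maximally ambiguous (all topics equally likely), so this
# sketch defines ambiguity = 1 - distance.
from math import sqrt

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range [0, 1])."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

def ambiguity(topic_probs):
    k = len(topic_probs)
    uniform = [1.0 / k] * k
    return 1.0 - hellinger(topic_probs, uniform)

print(ambiguity([1 / 3, 1 / 3, 1 / 3]))      # 1.0 -- maximally ambiguous
print(ambiguity([0.9, 0.05, 0.05]) < 0.7)    # True -- clearly about one topic
```

The same `hellinger` function can also compare two documents' distributions directly, which matches the pairwise use of Hellinger distance mentioned in the question.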

How to construct the document-topic matrix using the word-topic and topic-word matrix calculated using Latent Dirichlet Allocation?

How do I construct the document-topic matrix using the word-topic and topic-word matrix calculated using Latent Dirichlet Allocation? I cannot seem to find it anywhere, not even from the author of LDA, D. Blei. Gensim and sklearn just work, but I want to know how to use the two matrices to construct the document-topic matrix (Spark MLlib LDA only gives me the two matrices and not the document-topic matrix).
Category: Data Science
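The principled answer is to rerun inference ("folding in") with the learned topic-word matrix held fixed. As a rough, self-contained approximation of what that produces: normalize the topic-word matrix column-wise into p(topic | word), then average it over the document's word counts. All numbers below are invented; this is a sketch of the approximation, not Blei's inference procedure.

```python
# Quick approximation of doc-topic proportions from a learned topic-word
# matrix: weight p(topic | word) by the document's word counts and normalize.
# Proper fold-in would rerun variational inference with topics fixed.

def approx_doc_topics(topic_word, doc_counts):
    """topic_word[k][w] = weight of word w in topic k; doc_counts[w] = count in doc."""
    n_topics = len(topic_word)
    n_words = len(topic_word[0])
    doc = [0.0] * n_topics
    for w in range(n_words):
        col_sum = sum(topic_word[k][w] for k in range(n_topics))
        if col_sum == 0 or doc_counts[w] == 0:
            continue
        for k in range(n_topics):
            doc[k] += doc_counts[w] * topic_word[k][w] / col_sum  # count * p(k|w)
    total = sum(doc)
    return [x / total for x in doc]

topic_word = [[8, 1, 1], [1, 1, 8]]  # 2 topics x 3 words (made-up weights)
theta = approx_doc_topics(topic_word, doc_counts=[3, 0, 1])
print(theta)  # topic 0 dominates: word 0 appears 3x and belongs mostly to topic 0
```

For Spark MLlib specifically, the distributed LDA model exposes methods to recover per-document topic distributions, so in practice that path is preferable to hand-rolling the reconstruction.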

Memory error - Hierarchical Dirichlet Process, HDP gensim

I am running Hierarchical Dirichlet Process (HDP) using gensim in Python, but as my corpus is too large it throws the following error:

```python
model = gensim.models.HdpModel(corpus, id2word=corpus.id2word, chunksize=50000)
```

```
File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 210, in __init__
    self.update(corpus)
File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 245, in update
    self.update_chunk(chunk)
File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 313, in update_chunk
    self.update_lambda(ss, word_list, opt_o)
File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 415, in update_lambda
    rhot * self.m_D * sstats.m_var_beta_ss / sstats.m_chunksize
MemoryError
```

I have loaded my corpus using the following statement:

```python
corpus = gensim.corpora.MalletCorpus('chunk5000K_records.mallet')
```

And the data …
Category: Data Science

Topic modelling with many synonyms - how to extract 'latent themes'

Here's my corpus:

```python
{
    0: "dogs are nice",        # canines are friendly
    1: "mutts are kind",       # canines are friendly
    2: "pooches are lovely",   # canines are friendly
    ...,
    3: "cats are mean",        # felines are unfriendly
    4: "moggies are nasty",    # felines are unfriendly
    5: "pussycats are unkind", # felines are unfriendly
}
```

As a human, the general topics I get from these documents are that: canines are friendly (0, 1, 2); felines are not friendly (3, 4, 5) …
Category: Data Science
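One pragmatic approach to the corpus above: collapse synonyms to a canonical token before topic modelling, so "dogs"/"mutts"/"pooches" all count as one term. The synonym map below is hand-written for this toy corpus; in practice it could come from WordNet synsets or from merging nearest neighbours in a word-embedding space.

```python
# Sketch: normalize synonyms to one surface form before running LDA, so the
# model sees shared terms instead of six unrelated vocabulary items.

SYNONYMS = {
    "dogs": "canine", "mutts": "canine", "pooches": "canine",
    "cats": "feline", "moggies": "feline", "pussycats": "feline",
    "nice": "friendly", "kind": "friendly", "lovely": "friendly",
    "mean": "unfriendly", "nasty": "unfriendly", "unkind": "unfriendly",
}

def normalize(doc):
    return [SYNONYMS.get(tok, tok) for tok in doc.lower().split()]

corpus = {0: "dogs are nice", 1: "mutts are kind", 3: "cats are mean"}
normalized = {i: normalize(d) for i, d in corpus.items()}
print(normalized[0])                       # ['canine', 'are', 'friendly']
print(normalized[0] == normalized[1])      # True -- synonyms now share one form
```

After normalization, a standard topic model can recover the two latent themes, since the documents finally share vocabulary.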

Understanding output of gensim LDA topic modeling API

I was trying to understand the gensim MALLET wrapper for topic modeling as explained in this notebook. In point 11, it prepares a corpus in term-document-frequency format:

```python
>>> print(corpus[:1])  # for 1st document
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), …
```
Category: Data Science
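To interpret the output above: each document is a bag-of-words list of `(token_id, frequency)` tuples, so `(4, 5)` means the token with dictionary id 4 occurs 5 times in that document. Given the dictionary's `id2word` mapping, the pairs can be decoded into readable form; the words below are invented for illustration.

```python
# Decode a gensim-style bag-of-words document back into (word, count) pairs.
# The id2word values here are hypothetical stand-ins for the real dictionary.

def decode_bow(bow, id2word):
    return [(id2word[token_id], count) for token_id, count in bow]

id2word = {0: "economy", 4: "market", 6: "growth"}  # hypothetical mapping
bow = [(0, 1), (4, 5), (6, 2)]  # e.g. "market" occurs 5 times in this document
decoded = decode_bow(bow, id2word)
print(decoded)  # [('economy', 1), ('market', 5), ('growth', 2)]
```

With a real gensim `Dictionary`, the same decoding is `[(dictionary[i], c) for i, c in bow]`.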

Topic Modelling in an existing dataframe in python

I am trying to perform topic extraction on a pandas dataframe. I am using LDA topic modeling to extract the topics in my dataframe. No problem. But I would like to apply LDA topic modeling to each row in my dataframe.

Current dataframe:

```
date       cust_id  words
3/14/2019  100001   samantha slip skirt pi ski
1/21/2020  10002    steel skirt solid greenish
5/19/2020  10003    arizona denim blouse d
```

The dataframe I am looking for:

```
date  cust_id  words  topic
0     words    topic  …
```
Category: Data Science
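A minimal sketch of the transformation asked for above: fit LDA on the `words` column and write the dominant topic id back as a new `topic` column. Column names follow the question; the choice of 2 components is arbitrary for the tiny example.

```python
# Sketch: add a per-row dominant-topic column to a pandas dataframe via
# sklearn LDA fitted on the `words` column.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.DataFrame({
    "date": ["3/14/2019", "1/21/2020", "5/19/2020"],
    "cust_id": [100001, 10002, 10003],
    "words": [
        "samantha slip skirt pi ski",
        "steel skirt solid greenish",
        "arizona denim blouse d",
    ],
})

tf = CountVectorizer().fit_transform(df["words"])
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(tf)
df["topic"] = lda.transform(tf).argmax(axis=1)  # dominant topic id per row
print(df[["cust_id", "topic"]])
```

Note that LDA is still trained on the whole column at once; only the *assignment* is per row, which is usually what "apply LDA to each row" means in practice (a per-row model would have a single document to learn from).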

Choice of the number of topics (clusters) in textual data

I have a social science background and I'm doing a text-mining project. I'm looking for advice about choosing the number of topics/clusters when analyzing textual data. In particular, I'm analyzing a dataset of more than 200,000 tweets and fitting a Latent Dirichlet Allocation model on them to find clusters that represent the main topics of the tweets in my dataset. However, I was trying to decide the optimal number of clusters, but the results I'm finding …
Category: Data Science
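One rough, self-contained heuristic for the question above: fit LDA for several values of K and compare perplexity (lower is better). Topic coherence (e.g. gensim's `CoherenceModel`) is often preferred because it correlates better with human judgment; this sketch uses sklearn perplexity only to stay self-contained, and the corpus is toy data.

```python
# Sketch: scan candidate K values and compare LDA perplexity (lower = better).
# Real usage should compute perplexity on held-out documents, not in-sample.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "election vote senate policy", "vote election campaign policy",
    "goal match score team", "team match goal league",
    "stock market price shares", "market shares price trading",
] * 5
tf = CountVectorizer().fit_transform(tweets)

scores = {}
for k in (2, 3, 5):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(tf)
    scores[k] = lda.perplexity(tf)  # in-sample here, held-out in practice

best_k = min(scores, key=scores.get)
print(scores, "-> best K by (in-sample) perplexity:", best_k)
```

For 200,000 tweets the scan is the same, just with a held-out split and a coarser grid of K values first, refined around the minimum.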

Topic Modeling: LDA vs LSA vs ToPMine

I am new to Topic Modeling. Is it possible to implement ToPMine in Python? In a quick search, I can't seem to find any Python package with ToPMine. Is ToPMine better than LDA and LSA? I am aware that LDA & LSA have been around for a long time and widely used. Thank you
Category: Data Science

Deep Regression Ensembles(DRE) - text analysis

I read an article about Deep Regression Ensembles (DRE), which can outperform DNNs trained with SGD. My question is: could I use DRE in text classification? (For example, could I use it instead of LDA?) What about sentiment analysis? Or is it just a method for estimating time-series data? (I am not really a master of DL, and my supervisor sent me this article to use in my research, but I don't know where I should use it. My research field is …
Category: Data Science

Topic modelling on only 24 documents gives the same "topic" for any K

Description: I have 24 documents, each of around 2.5K tokens; they are public speeches. My text-preprocessing pipeline is a generic one, including punctuation removal, expansion of English contractions, stopword removal, and tokenization. I have implemented and analyzed both Latent Dirichlet Allocation and Latent Semantic Analysis in Python with gensim. I am selecting the optimal number of topics by topic coherence. Issue: For any number of topics K (I have tried many, e.g. 10, 50, 100, 200) …
Category: Data Science
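For reference, the generic pipeline described above (punctuation removal, contraction expansion, stopword removal, tokenization) can be sketched in a few lines. The contraction and stopword lists here are tiny illustrations, not complete resources.

```python
# Minimal version of the described preprocessing pipeline: lowercase, expand
# a few contractions, strip punctuation, tokenize, drop stopwords.
import re

CONTRACTIONS = {"don't": "do not", "it's": "it is", "i'm": "i am"}
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "do", "not", "it", "i", "am"}

def preprocess(text):
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return [tok for tok in text.split() if tok not in STOPWORDS]

tokens = preprocess("It's the economy, and I'm certain of it!")
print(tokens)  # ['economy', 'certain']
```

With only 24 documents, a common remedy for the degenerate-topics issue is to split each long speech into smaller chunks so the model sees more (pseudo-)documents; the tokenizer above makes that split straightforward.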

How to Combine tfidf with LSTM in keras?

I am classifying emails as spam or ham using an LSTM and some modified forms of it (adding a convolutional layer at the end). For converting documents into vectors I am using the keras texts_to_sequences function. But now I want to use TF-IDF with the LSTM; can anyone tell me or share code for how to do it? Please also guide me on whether it is possible and a good approach or not. If you are wondering why I would like to do this, there are …
Topic: keras tfidf lda nlp
Category: Data Science
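One common way to combine the two: weight each token position by its TF-IDF-style score before the sequence reaches the LSTM, so frequent-but-uninformative tokens contribute less. The sketch below stops at building the per-token weight sequences with sklearn; the commented keras fragment shows where they would plug in. Layer names and shapes there are illustrative assumptions.

```python
# Sketch: per-token IDF weights that can scale embeddings before an LSTM.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["win free money now", "meeting notes for the project"]
vectorizer = TfidfVectorizer()
vectorizer.fit(docs)
vocab = vectorizer.vocabulary_   # term -> column index
idf = vectorizer.idf_            # learned inverse document frequencies

def token_weights(doc):
    """IDF weight per token position (out-of-vocabulary tokens get 0.0)."""
    return [idf[vocab[t]] if t in vocab else 0.0 for t in doc.lower().split()]

weights = token_weights("free money tonight")
print(weights)  # third token is out-of-vocabulary, so its weight is 0.0

# In keras (illustrative, not run here), the weights could scale embeddings:
#   weighted = layers.Multiply()([embedding_output, weight_input])
#   x = layers.LSTM(64)(weighted)
```

The alternative, feeding the TF-IDF document matrix straight into an LSTM, loses word order, which defeats the point of the LSTM; per-token weighting keeps the sequence structure intact.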

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.