I was wondering if spaCy supports multi-GPU via mpi4py? I am currently using spaCy's nlp.pipe for Named Entity Recognition on a high-performance-computing cluster that supports the MPI protocol and has many GPUs. It says here that I would need to specify the GPU to use with cupy, but with mpi4py I am not sure whether the following will work (should I import spacy after selecting the cupy device?):

    from mpi4py import MPI
    import cupy

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    if …
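A minimal sketch of how this could look, assuming each MPI rank maps to one local GPU; the pipeline name and batch size are placeholders, and texts_for_rank is a hypothetical helper that shards the corpus by rank:

    from mpi4py import MPI
    import cupy
    import spacy

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Pin this process to one device before loading the pipeline.
    n_gpus = cupy.cuda.runtime.getDeviceCount()
    spacy.require_gpu(rank % n_gpus)
    nlp = spacy.load("en_core_web_trf")      # placeholder model name

    texts = texts_for_rank(rank)             # hypothetical helper: shard the corpus by rank
    for doc in nlp.pipe(texts, batch_size=64):
        entities = [(ent.text, ent.label_) for ent in doc.ents]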
I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim's LDA:

    from gensim.models.phrases import Phrases, Phraser

    # 7k documents, ~500-1k tokens each. Already ran cleanup, stop_words, lemmatization, etc.
    docs = get_docs()
    phrases = Phrases(docs)
    bigram = Phraser(phrases)
    docs = [bigram[d] for d in docs]

Phrases has min_count=5 and threshold=10. I don't quite understand how they interact; they seem related. Anyway, I see threshold taking values in different tutorials ranging from 1 to 1000, described as important in determining the number of …
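For what it's worth, a short sketch of how the two parameters combine under gensim's default scorer (the "original_scorer", i.e. the Mikolov et al. formula), as I understand it; the values are just the defaults quoted above:

    from gensim.models.phrases import Phrases

    # Default scoring:
    #   score(a, b) = (count(a b) - min_count) / (count(a) * count(b)) * len(vocab)
    # A candidate bigram is kept only when score(a, b) > threshold, and words or
    # pairs seen fewer than min_count times are dropped outright, so min_count acts
    # both as a hard count floor and as a penalty inside every score - which is why
    # the two parameters interact.
    phrases = Phrases(docs, min_count=5, threshold=10)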
My data includes women's comments on X and Y and men's comments on X and Y. Each comment is of equal length. I want to calculate how different the word choice is between men and women when commenting on X. How can I do this?
I am running a Hierarchical Dirichlet Process (HDP) using gensim in Python, but as my corpus is too large it is throwing the following error:

    model = gensim.models.HdpModel(corpus, id2word=corpus.id2word, chunksize=50000)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 210, in __init__
        self.update(corpus)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 245, in update
        self.update_chunk(chunk)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 313, in update_chunk
        self.update_lambda(ss, word_list, opt_o)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 415, in update_lambda
        rhot * self.m_D * sstats.m_var_beta_ss / sstats.m_chunksize
    MemoryError

I have loaded my corpus using the following statement:

    corpus = gensim.corpora.MalletCorpus('chunk5000K_records.mallet')

And the data …
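Not a definitive fix, but since the traceback fails while combining the per-chunk sufficient statistics, one hedged thing to try is the same call with a much smaller chunksize (256 below is purely illustrative):

    import gensim

    # streamed corpus, loaded exactly as in the question
    corpus = gensim.corpora.MalletCorpus('chunk5000K_records.mallet')

    # smaller chunks mean smaller per-chunk statistics held in memory at once
    model = gensim.models.HdpModel(corpus, id2word=corpus.id2word, chunksize=256)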
I have a set of documents and I want to identify and remove the outlier documents. I am just wondering whether doc2vec can be used for this task, or whether there are any recently developed, promising algorithms that I can use instead. EDIT: I am currently using a bag-of-words model to identify outliers.
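One hedged sketch of how doc2vec could be applied here: infer a vector per document, then flag the documents whose cosine similarity to the corpus centroid is unusually low. The hyperparameters, the centroid criterion, and the 5% cutoff are all my assumptions, not an established recipe:

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # token_lists: list of tokenized documents (assumed to exist)
    tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(token_lists)]
    model = Doc2Vec(tagged, vector_size=100, min_count=2, epochs=40)

    vecs = np.array([model.dv[i] for i in range(len(tagged))])
    centroid = vecs.mean(axis=0)
    sims = vecs @ centroid / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(centroid))
    outliers = np.argsort(sims)[: max(1, len(sims) // 20)]   # lowest-similarity 5%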
I was trying to understand the gensim Mallet wrapper for topic modeling as explained in this notebook. In point 11, it prepares a corpus in bag-of-words (term id, term frequency) format:

    >>> print(corpus[:1])  # for 1st document
    [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), …
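In case it helps to read that output, here is a small sketch (assuming the id2word dictionary built earlier in the same notebook) that maps the ids back to tokens; each pair is a dictionary id plus how often that token occurs in the document:

    first_doc = corpus[0]
    readable = [(id2word[token_id], count) for token_id, count in first_doc]
    print(readable[:10])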
With Gensim < 4.0, we could retrain a word2vec model using the following code:

    model = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
    model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)

However, as I understand it, Gensim 4.0 no longer supports Word2Vec.load_word2vec_format; I can only load the KeyedVectors. How can I fine-tune a pre-trained word2vec model (such as the model trained on GoogleNews) with my domain-specific corpus using Gensim 4.0?
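As far as I know there is no one-call fine-tuning API in Gensim 4, so this is only a hedged workaround: load the pre-trained KeyedVectors, build a fresh Word2Vec over the domain corpus, seed the shared words with the pre-trained vectors, and continue training. my_corpus is assumed to be a list of token lists:

    from gensim.models import Word2Vec, KeyedVectors

    # pre-trained vectors (lookup only, no training state)
    kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    model = Word2Vec(vector_size=kv.vector_size, min_count=1)
    model.build_vocab(my_corpus)

    # copy the pre-trained vectors for every word both vocabularies share
    for word, idx in model.wv.key_to_index.items():
        if word in kv.key_to_index:
            model.wv.vectors[idx] = kv[word]

    model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)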
I have a set of documents as follows, where each document has a set of words that represents its content:

    Doc1: {fish, moose, wildlife, hunting, bears, polar}
    Doc2: {energy, fuel, costs, oil, gas}
    Doc3: {wildlife, hunt, polar, fishing}

Looking at my documents, I can deduce that Doc1 and Doc3 are very similar. I want distance metrics for bag-of-words. I followed some tutorials in Gensim about how to do it. However, as I understand it, initially they train …
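For reference, a minimal sketch of a plain bag-of-words cosine similarity in Gensim with no training beyond building a Dictionary. Note that with exact token matching, "hunt"/"hunting" and "fish"/"fishing" contribute nothing, so Doc1 and Doc3 only overlap on "wildlife" and "polar":

    from gensim import corpora, similarities

    docs = [
        ["fish", "moose", "wildlife", "hunting", "bears", "polar"],
        ["energy", "fuel", "costs", "oil", "gas"],
        ["wildlife", "hunt", "polar", "fishing"],
    ]
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]

    index = similarities.SparseMatrixSimilarity(bow, num_features=len(dictionary))
    print(index[bow[0]])   # cosine similarity of Doc1 to every document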
I am trying to understand Siamese networks. There, a vector is calculated for an object (say an image) and a distance metric (say Manhattan) is applied to the two vectors produced by the neural network(s). In the tutorials available online, the idea is applied mostly to images. If I compare it with Gensim semantic similarity, there we also have vectors for two objects (words or sentences) and then compute a cosine similarity to calculate the difference. (remember example …
Description: I have 24 documents, each of around 2.5K tokens. They are public speeches. My text preprocessing pipeline is a generic one, including punctuation removal, expansion of English contractions, removal of stopwords, and tokenization. I have implemented and analyzed both Latent Dirichlet Allocation and Latent Semantic Analysis in Python with gensim. I am calculating the optimal number of topics by the topics' coherence. Issue: For any number of topics K (I have tried many, e.g. 10, 50, 100, 200) …
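For context, a hedged sketch of the coherence sweep described above, using gensim's CoherenceModel with the common "c_v" measure; the K values, passes, and measure are illustrative choices of mine:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, CoherenceModel

    dictionary = Dictionary(tokenized_docs)            # the 24 preprocessed speeches
    corpus = [dictionary.doc2bow(d) for d in tokenized_docs]

    for k in (5, 10, 20):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=10)
        cm = CoherenceModel(model=lda, texts=tokenized_docs, dictionary=dictionary, coherence="c_v")
        print(k, cm.get_coherence())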
Given the user data as in the following:

        user   query   date
    0   jack   mango   2020-01-03
    1   jack   banana  2020-01-04
    2   jack   apple   2020-02-03
    3   jack   orange  2020-03-03
    4   john   meat    2020-07-03
    5   john   water   2020-07-03

Now assume a new user enters mango; I am looking for a good way to recommend a product to the user. One approach is the following, based on item2vec:

    import pandas as pd
    df_user = pd.DataFrame(
        {'user': ['jack', 'jack', 'jack', 'jack', 'john', 'john'],
         'query': ['mango', 'banana', 'apple', 'orange', 'meat', 'water'],
         'date': ['2020-1-3', '2020-1-4', '2020-2-3', '2020-3-3', '2020-7-3', '2020-7-3']})
    df_user['date'] = pd.to_datetime(df_user['date'])
    new_query = 'mango'
    from gensim.models import Word2Vec
    model = Word2Vec(sentences …
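A hedged sketch of how that item2vec idea could continue: treat each user's query history as a "sentence", train Word2Vec on those sequences, and recommend the items nearest to the new query. The hyperparameters are mine, and with only two tiny "sentences" the resulting neighbours will be very noisy:

    from gensim.models import Word2Vec

    sessions = df_user.sort_values('date').groupby('user')['query'].apply(list).tolist()
    # [['mango', 'banana', 'apple', 'orange'], ['meat', 'water']]

    model = Word2Vec(sentences=sessions, vector_size=32, window=5, min_count=1, sg=1, epochs=200)
    print(model.wv.most_similar(new_query, topn=3))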
    from gensim.models import Word2Vec
    model = Word2Vec(sentences=[['a', 'b'], ['c', 'd']], window=9999999, min_count=1)
    model.wv.most_similar('a', topn=10)

The code above gives the following result:

    [('d', 0.06363436579704285), ('b', -0.010543467476963997), ('c', -0.039232250303030014)]

Shouldn't 'b' be ranked first, since it is the only word that appears near 'a'?
I am totally new to this topic, which is why I have been confused and stuck on this code for a while, and I am not sure how to solve it correctly. My goal is to build a short-text embedding using a vector representation of the text: the word embeddings are aggregated via mean averaging to infer a vector representation for the text. I generated model vectors using gensim.models, and then I run each through the model and check if the …
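A small sketch of the mean-averaging step described above, assuming a trained gensim model whose vectors live in model.wv: look up each token the model knows and average the vectors, returning a zero vector when none of the tokens are in the vocabulary:

    import numpy as np

    def text_vector(tokens, wv):
        # keep only tokens present in the trained vocabulary
        vecs = [wv[t] for t in tokens if t in wv.key_to_index]
        return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    doc_vec = text_vector(["short", "example", "text"], model.wv)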
I am loading the model using the gensim package this way:

    from gensim.models import FastText
    model = FastText.load_fasttext_format('wiki-news-300d-1M-subword.bin')

as stated here. This raises:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 57: unexpected end of data

The .bin file is downloaded from this source. How do I load the model correctly?
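I can't be sure of the cause from the message alone (it often points to a file that is not in the format the loader expects, or to an incomplete download), but for reference, recent gensim versions expose separate loaders for the native fastText .bin format and for the plain-text .vec format; a sketch of both paths:

    from gensim.models.fasttext import load_facebook_model
    from gensim.models import KeyedVectors

    # native fastText binary (full model, supports out-of-vocabulary words)
    model = load_facebook_model('wiki-news-300d-1M-subword.bin')

    # or, if the downloaded file is actually the text format:
    kv = KeyedVectors.load_word2vec_format('wiki-news-300d-1M-subword.vec')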
I have a set of documents (1 to 11) for which the labeling is done. Let's assume:

    Doc No: 1, 3, 5, 7  - belongs to Type A
    Doc No: 2, 4, 9     - belongs to Type B
    Doc No: 8, 10       - belongs to Type C
    Doc No: 6, 11       - belongs to no one

Now, let us say I have new incoming docs - 11, 12, 13 ... and so on, and I would like to know which type (A, B, C, or none) they belong …
I am working on a sentiment analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:

    # PREPROCESSING THE DATA

    # SPLITTING THE DATA
    from sklearn.model_selection import train_test_split
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=69, stratify=y)
    train_x2 = train_x['review'].to_list()
    test_x2 = test_x['review'].to_list()

    # CONVERT TRAIN DATA INTO NESTED LISTS AS WORD2VEC EXPECTS A LIST OF LISTS OF TOKENS
    train_x3 = [nltk.word_tokenize(k) for k in train_x2]
    test_x3 = [nltk.word_tokenize(k) for k in …
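One hedged sketch of how the vectorization could continue from here: train Word2Vec on the tokenized training reviews, mean-pool the word vectors into one fixed-length vector per review, and fit a classifier. The hyperparameters and the choice of LogisticRegression are my assumptions, not part of the question:

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    w2v = Word2Vec(sentences=train_x3, vector_size=100, window=5, min_count=2, epochs=20)

    def review_vector(tokens):
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv.key_to_index]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

    X_train = np.array([review_vector(t) for t in train_x3])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_y)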
I am developing a similarity program to compare documents, and I've successfully trained my model with Gensim (TF-IDF and LSI) in order to compare two documents with each other, and it works great. I can give it document A and get a list of documents that are similar to it. I wonder: is there a way to take multiple input documents and get a list of documents that are similar to all of them? I.e. I can give it documents A and …
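One hedged way to do this with the pipeline described above (assuming the trained dictionary, tfidf, lsi, and similarity index objects already exist): score each input document separately against the index and combine the similarity rows, here with a simple mean, before ranking. doc_a_tokens and doc_b_tokens are hypothetical tokenized inputs:

    import numpy as np

    def similar_to_all(query_docs, topn=10):
        rows = []
        for tokens in query_docs:
            vec = lsi[tfidf[dictionary.doc2bow(tokens)]]
            rows.append(index[vec])                  # similarity of this query to every document
        combined = np.mean(rows, axis=0)
        return sorted(enumerate(combined), key=lambda x: -x[1])[:topn]

    similar_to_all([doc_a_tokens, doc_b_tokens])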
I have 100 sentences that I want to cluster based on similarity. I've used doc2vec to vectorize the sentences into 20-dimensional vectors and applied k-means to cluster them, but I haven't got the desired results yet. I've read that doc2vec performs well only on large datasets. I want to know whether increasing the length of each data sample would compensate for the low number of samples and help the model train better. For example, if my sentences are originally "making …
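For reference, a hedged sketch of the pipeline described above (Doc2Vec into k-means) with illustrative hyperparameters; with only 100 short sentences the vectors will be noisy regardless of the settings, which matches the cited caveat about small datasets:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.cluster import KMeans

    # sentences: list of 100 raw strings (assumed to exist)
    tagged = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(sentences)]
    d2v = Doc2Vec(tagged, vector_size=20, min_count=1, epochs=100)

    X = [d2v.dv[i] for i in range(len(tagged))]
    labels = KMeans(n_clusters=5, n_init=10).fit_predict(X)   # 5 clusters is arbitrary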