I was wondering if spaCy supports multi-GPU via mpi4py? I am currently using spaCy's nlp.pipe for Named Entity Recognition on a high-performance-computing cluster that supports the MPI protocol and has many GPUs. It says here that I would need to specify the GPU to use with cupy, but with mpi4py I am not sure whether the following will work (should I import spacy after selecting the cupy device?):

    from mpi4py import MPI
    import cupy

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    if …
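A minimal sketch of how this could look, assuming each MPI rank maps to one local GPU; the pipeline name and batch size are placeholders, and texts_for_rank is a hypothetical helper that shards the corpus by rank:

    from mpi4py import MPI
    import cupy
    import spacy

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Pin this process to one device before loading the pipeline.
    n_gpus = cupy.cuda.runtime.getDeviceCount()
    spacy.require_gpu(rank % n_gpus)
    nlp = spacy.load("en_core_web_trf")      # placeholder model name

    texts = texts_for_rank(rank)             # hypothetical helper: shard the corpus by rank
    for doc in nlp.pipe(texts, batch_size=64):
        entities = [(ent.text, ent.label_) for ent in doc.ents]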
I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim's LDA:

    from gensim.models.phrases import Phrases, Phraser

    # 7k documents, ~500-1k tokens each. Already ran cleanup, stop_words, lemmatization, etc.
    docs = get_docs()
    phrases = Phrases(docs)
    bigram = Phraser(phrases)
    docs = [bigram[d] for d in docs]

Phrases has min_count=5 and threshold=10. I don't quite understand how they interact; they seem related. Anyway, I see threshold taking values in different tutorials ranging from 1 to 1000, described as important in determining the number of …
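For what it's worth, a short sketch of how the two parameters combine under gensim's default scorer (the "original_scorer", i.e. the Mikolov et al. formula), as I understand it; the values are just the defaults quoted above:

    from gensim.models.phrases import Phrases

    # Default scoring:
    #   score(a, b) = (count(a b) - min_count) / (count(a) * count(b)) * len(vocab)
    # A candidate bigram is kept only when score(a, b) > threshold, and words or
    # pairs seen fewer than min_count times are dropped outright, so min_count acts
    # both as a hard count floor and as a penalty inside every score - which is why
    # the two parameters interact.
    phrases = Phrases(docs, min_count=5, threshold=10)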
My data includes women's comments on X and Y and men's comments on X and Y. Each comment is of equal length. I want to calculate how different the word choice is between men and women when commenting on X. How can I do this?
I am running a Hierarchical Dirichlet Process (HDP) using gensim in Python, but as my corpus is too large it is throwing the following error:

    model = gensim.models.HdpModel(corpus, id2word=corpus.id2word, chunksize=50000)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 210, in __init__
        self.update(corpus)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 245, in update
        self.update_chunk(chunk)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 313, in update_chunk
        self.update_lambda(ss, word_list, opt_o)
      File "/usr/cluster/contrib/Enthought/Canopy_64/User/lib/python2.7/site-packages/gensim/models/hdpmodel.py", line 415, in update_lambda
        rhot * self.m_D * sstats.m_var_beta_ss / sstats.m_chunksize
    MemoryError

I have loaded my corpus using the following statement:

    corpus = gensim.corpora.MalletCorpus('chunk5000K_records.mallet')

And the data …
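Not a definitive fix, but since the traceback fails while combining the per-chunk sufficient statistics, one hedged thing to try is the same call with a much smaller chunksize (256 below is purely illustrative):

    import gensim

    # streamed corpus, loaded exactly as in the question
    corpus = gensim.corpora.MalletCorpus('chunk5000K_records.mallet')

    # smaller chunks mean smaller per-chunk statistics held in memory at once
    model = gensim.models.HdpModel(corpus, id2word=corpus.id2word, chunksize=256)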
I have a set of documents and I want to identify and remove the outlier documents. I am just wondering whether doc2vec can be used for this task, or whether there are any recently developed, promising algorithms that I can use instead. EDIT: I am currently using a bag-of-words model to identify outliers.
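One hedged sketch of how doc2vec could be applied here: infer a vector per document, then flag the documents whose cosine similarity to the corpus centroid is unusually low. The hyperparameters, the centroid criterion, and the 5% cutoff are all my assumptions, not an established recipe:

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # token_lists: list of tokenized documents (assumed to exist)
    tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(token_lists)]
    model = Doc2Vec(tagged, vector_size=100, min_count=2, epochs=40)

    vecs = np.array([model.dv[i] for i in range(len(tagged))])
    centroid = vecs.mean(axis=0)
    sims = vecs @ centroid / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(centroid))
    outliers = np.argsort(sims)[: max(1, len(sims) // 20)]   # lowest-similarity 5%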
I was trying to understand the gensim Mallet wrapper for topic modeling as explained in this notebook. In point 11, it prepares a corpus in bag-of-words (term id, term frequency) format:

    >>> print(corpus[:1])  # for 1st document
    [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), …
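In case it helps to read that output, here is a small sketch (assuming the id2word dictionary built earlier in the same notebook) that maps the ids back to tokens; each pair is a dictionary id plus how often that token occurs in the document:

    first_doc = corpus[0]
    readable = [(id2word[token_id], count) for token_id, count in first_doc]
    print(readable[:10])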
With Gensim < 4.0, we could retrain a word2vec model using the following code:

    model = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
    model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)

However, as I understand it, Gensim 4.0 no longer supports Word2Vec.load_word2vec_format; I can only load the KeyedVectors. How can I fine-tune a pre-trained word2vec model (such as the model trained on GoogleNews) with my domain-specific corpus using Gensim 4.0?
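As far as I know there is no one-call fine-tuning API in Gensim 4, so this is only a hedged workaround: load the pre-trained KeyedVectors, build a fresh Word2Vec over the domain corpus, seed the shared words with the pre-trained vectors, and continue training. my_corpus is assumed to be a list of token lists:

    from gensim.models import Word2Vec, KeyedVectors

    # pre-trained vectors (lookup only, no training state)
    kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    model = Word2Vec(vector_size=kv.vector_size, min_count=1)
    model.build_vocab(my_corpus)

    # copy the pre-trained vectors for every word both vocabularies share
    for word, idx in model.wv.key_to_index.items():
        if word in kv.key_to_index:
            model.wv.vectors[idx] = kv[word]

    model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)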
I have a set of documents as follows, where each document has a set of words that represents its content:

    Doc1: {fish, moose, wildlife, hunting, bears, polar}
    Doc2: {energy, fuel, costs, oil, gas}
    Doc3: {wildlife, hunt, polar, fishing}

Looking at my documents, I can deduce that Doc1 and Doc3 are very similar. I want distance metrics for bag-of-words. I followed some tutorials in Gensim about how to do it. However, as I understand it, initially they train …
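For reference, a minimal sketch of a plain bag-of-words cosine similarity in Gensim with no training beyond building a Dictionary. Note that with exact token matching, "hunt"/"hunting" and "fish"/"fishing" contribute nothing, so Doc1 and Doc3 only overlap on "wildlife" and "polar":

    from gensim import corpora, similarities

    docs = [
        ["fish", "moose", "wildlife", "hunting", "bears", "polar"],
        ["energy", "fuel", "costs", "oil", "gas"],
        ["wildlife", "hunt", "polar", "fishing"],
    ]
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]

    index = similarities.SparseMatrixSimilarity(bow, num_features=len(dictionary))
    print(index[bow[0]])   # cosine similarity of Doc1 to every document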
I am trying to understand Siamese networks. There, a vector is calculated for an object (say an image) and a distance metric (say Manhattan) is applied to the two vectors produced by the neural network(s). In the tutorials available online, the idea is applied mostly to images. If I compare it with Gensim semantic similarity, there we also have vectors for two objects (words or sentences) and then compute a cosine similarity to calculate the difference. (remember example …
Description: I have 24 documents, each of around 2.5K tokens. They are public speeches. My text preprocessing pipeline is a generic one, including punctuation removal, expansion of English contractions, removal of stopwords, and tokenization. I have implemented and analyzed both Latent Dirichlet Allocation and Latent Semantic Analysis in Python with gensim. I am calculating the optimal number of topics by the topics' coherence. Issue: For any number of topics K (I have tried many, e.g. 10, 50, 100, 200) …
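For context, a hedged sketch of the coherence sweep described above, using gensim's CoherenceModel with the common "c_v" measure; the K values, passes, and measure are illustrative choices of mine:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, CoherenceModel

    dictionary = Dictionary(tokenized_docs)            # the 24 preprocessed speeches
    corpus = [dictionary.doc2bow(d) for d in tokenized_docs]

    for k in (5, 10, 20):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=10)
        cm = CoherenceModel(model=lda, texts=tokenized_docs, dictionary=dictionary, coherence="c_v")
        print(k, cm.get_coherence())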
Given the user data as in the following:

        user   query   date
    0   jack   mango   2020-01-03
    1   jack   banana  2020-01-04
    2   jack   apple   2020-02-03
    3   jack   orange  2020-03-03
    4   john   meat    2020-07-03
    5   john   water   2020-07-03

Now assume a new user enters mango; I am looking for a good way to recommend a product to the user. One approach is the following, based on item2vec:

    import pandas as pd
    df_user = pd.DataFrame(
        {'user': ['jack', 'jack', 'jack', 'jack', 'john', 'john'],
         'query': ['mango', 'banana', 'apple', 'orange', 'meat', 'water'],
         'date': ['2020-1-3', '2020-1-4', '2020-2-3', '2020-3-3', '2020-7-3', '2020-7-3']})
    df_user['date'] = pd.to_datetime(df_user['date'])
    new_query = 'mango'
    from gensim.models import Word2Vec
    model = Word2Vec(sentences …
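A hedged sketch of how that item2vec idea could continue: treat each user's query history as a "sentence", train Word2Vec on those sequences, and recommend the items nearest to the new query. The hyperparameters are mine, and with only two tiny "sentences" the resulting neighbours will be very noisy:

    from gensim.models import Word2Vec

    sessions = df_user.sort_values('date').groupby('user')['query'].apply(list).tolist()
    # [['mango', 'banana', 'apple', 'orange'], ['meat', 'water']]

    model = Word2Vec(sentences=sessions, vector_size=32, window=5, min_count=1, sg=1, epochs=200)
    print(model.wv.most_similar(new_query, topn=3))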
    from gensim.models import Word2Vec
    model = Word2Vec(sentences=[['a', 'b'], ['c', 'd']], window=9999999, min_count=1)
    model.wv.most_similar('a', topn=10)

The code above gives the following result:

    [('d', 0.06363436579704285), ('b', -0.010543467476963997), ('c', -0.039232250303030014)]

Shouldn't 'b' be ranked first, since it is the only word that appears near 'a'?
I am totally new to this topic, which is why I have been confused and stuck on this code for a while, and I am not sure how to solve it correctly. My goal is to build a short-text embedding using a vector representation of the text: the word embeddings are aggregated via mean averaging to infer a vector representation for the text. I generated model vectors using gensim.models, and then I run each through the model and check if the …
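A small sketch of the mean-averaging step described above, assuming a trained gensim model whose vectors live in model.wv: look up each token the model knows and average the vectors, returning a zero vector when none of the tokens are in the vocabulary:

    import numpy as np

    def text_vector(tokens, wv):
        # keep only tokens present in the trained vocabulary
        vecs = [wv[t] for t in tokens if t in wv.key_to_index]
        return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    doc_vec = text_vector(["short", "example", "text"], model.wv)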
I am loading the model using the gensim package this way:

    from gensim.models import FastText
    model = FastText.load_fasttext_format('wiki-news-300d-1M-subword.bin')

as stated here. This raises:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 57: unexpected end of data

The .bin file is downloaded from this source. How do I load the model correctly?
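I can't be sure of the cause from the message alone (it often points to a file that is not in the format the loader expects, or to an incomplete download), but for reference, recent gensim versions expose separate loaders for the native fastText .bin format and for the plain-text .vec format; a sketch of both paths:

    from gensim.models.fasttext import load_facebook_model
    from gensim.models import KeyedVectors

    # native fastText binary (full model, supports out-of-vocabulary words)
    model = load_facebook_model('wiki-news-300d-1M-subword.bin')

    # or, if the downloaded file is actually the text format:
    kv = KeyedVectors.load_word2vec_format('wiki-news-300d-1M-subword.vec')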
I have a set of documents (1 to 11) for which the labeling is done. Let's assume:

    Doc No: 1, 3, 5, 7  - belongs to Type A
    Doc No: 2, 4, 9     - belongs to Type B
    Doc No: 8, 10       - belongs to Type C
    Doc No: 6, 11       - belongs to no one

Now, let us say I have new incoming docs - 11, 12, 13 ... and so on, and I would like to know which type (A, B, C, or none) they belong …
I am working on a sentiment analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:

    # PREPROCESSING THE DATA

    # SPLITTING THE DATA
    from sklearn.model_selection import train_test_split
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=69, stratify=y)
    train_x2 = train_x['review'].to_list()
    test_x2 = test_x['review'].to_list()

    # CONVERT TRAIN DATA INTO NESTED LISTS AS WORD2VEC EXPECTS A LIST OF LISTS OF TOKENS
    train_x3 = [nltk.word_tokenize(k) for k in train_x2]
    test_x3 = [nltk.word_tokenize(k) for k in …
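One hedged sketch of how the vectorization could continue from here: train Word2Vec on the tokenized training reviews, mean-pool the word vectors into one fixed-length vector per review, and fit a classifier. The hyperparameters and the choice of LogisticRegression are my assumptions, not part of the question:

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    w2v = Word2Vec(sentences=train_x3, vector_size=100, window=5, min_count=2, epochs=20)

    def review_vector(tokens):
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv.key_to_index]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

    X_train = np.array([review_vector(t) for t in train_x3])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_y)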
I am developing a similarity program to compare documents, and I've successfully trained my model with Gensim (TF-IDF and LSI) in order to compare two documents with each other, and it works great. I can give it document A and get a list of documents that are similar to it. I wonder: is there a way to take multiple input documents and get a list of documents that are similar to all of them? I.e. I can give it documents A and …
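One hedged way to do this with the pipeline described above (assuming the trained dictionary, tfidf, lsi, and similarity index objects already exist): score each input document separately against the index and combine the similarity rows, here with a simple mean, before ranking. doc_a_tokens and doc_b_tokens are hypothetical tokenized inputs:

    import numpy as np

    def similar_to_all(query_docs, topn=10):
        rows = []
        for tokens in query_docs:
            vec = lsi[tfidf[dictionary.doc2bow(tokens)]]
            rows.append(index[vec])                  # similarity of this query to every document
        combined = np.mean(rows, axis=0)
        return sorted(enumerate(combined), key=lambda x: -x[1])[:topn]

    similar_to_all([doc_a_tokens, doc_b_tokens])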
I have 100 sentences that I want to cluster based on similarity. I've used doc2vec to vectorize the sentences into 20-dimensional vectors and applied k-means to cluster them, but I haven't got the desired results yet. I've read that doc2vec performs well only on large datasets. I want to know whether increasing the length of each data sample would compensate for the low number of samples and help the model train better. For example, if my sentences are originally "making …
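For reference, a hedged sketch of the pipeline described above (Doc2Vec into k-means) with illustrative hyperparameters; with only 100 short sentences the vectors will be noisy regardless of the settings, which matches the cited caveat about small datasets:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.cluster import KMeans

    # sentences: list of 100 raw strings (assumed to exist)
    tagged = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(sentences)]
    d2v = Doc2Vec(tagged, vector_size=20, min_count=1, epochs=100)

    X = [d2v.dv[i] for i in range(len(tagged))]
    labels = KMeans(n_clusters=5, n_init=10).fit_predict(X)   # 5 clusters is arbitrary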