Embedding from a Transformer-based model for a paragraph or document (like Doc2Vec)

I have a dataset that contains sequences of different lengths. On average the sequence length is 600. The dataset is like this:
S1 = ['Walk','Eat','Going school','Eat','Watching movie','Walk'......,'Sleep']
S2 = ['Eat','Eat','Going school','Walk','Walk','Watching movie'.......,'Eat']
.........................................
S50 = ['Walk','Going school','Eat','Eat','Watching movie','Sleep',.......,'Walk']
The number of unique actions in the dataset is fixed. That means some sentences may not contain all of the actions. By using Doc2Vec (Gensim library particularly), I was able to extract an embedding for each of the sequences …
Category: Data Science

DBSCAN on textual and numerical columns

I have a dataset which has two columns:
title      price
sentence1  12
sentence2  13
I have used doc2vec to convert the sentences into vectors of size 100 as below:
    LabeledSentence1 = gensim.models.doc2vec.TaggedDocument
    all_content = []
    j=0
    for title in query_result['title_clean'].values:
        all_content.append(LabeledSentence1(title,[j]))
        j+=1
    print("Number of texts processed: ", j)
    cores = multiprocessing.cpu_count()
    d2v_model = Doc2Vec(dm=1, vector_size=100, negative=5, hs=0, min_count=2, sample = 0, workers=cores, alpha=0.025, min_alpha=0.001)
    d2v_model.build_vocab([x for x in tqdm(all_content)])
    all_content = utils.shuffle(all_content)
    d2v_model.train(all_content,total_examples=len(all_content), epochs=30)
So d2v_model.docvecs.doctag_syn0 returns me vectors of …
Category: Data Science

Clustering using both text and numerical features

I have a dataset that contains 2 types of features: one is generated from doc2vec and one is a numerical feature. I would like to perform clustering analysis on them. However, due to the size of the doc2vec features, if I simply combine them into one array, the clustering algorithm would put more "weight" on the doc2vec features. How do I overcome this problem? For example, for a given label, say I have features from doc2vec that look like this [1,2,3,4,5], and …
Category: Data Science
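
One possible way to keep a wide doc2vec block from dominating a Euclidean distance is to standardize every column and then scale each block by the square root of its width, so that both blocks contribute the same total squared magnitude. This is only a sketch under assumptions: the helper names, the toy values, and the `text_weight` / `numeric_weight` parameters are hypothetical and not taken from the question.

```python
import math
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-score each column; constant columns become all zeros."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid division by zero
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def combine_blocks(text_vecs, numeric_feats, text_weight=1.0, numeric_weight=1.0):
    """Concatenate two feature blocks so each has a chosen share of the distance."""
    text_z = standardize_columns(text_vecs)
    num_z = standardize_columns(numeric_feats)
    # Dividing by sqrt(width) gives a block of d unit-variance columns
    # the same total squared magnitude as a single unit-variance column.
    t_scale = math.sqrt(text_weight / len(text_z[0]))
    n_scale = math.sqrt(numeric_weight / len(num_z[0]))
    return [
        [x * t_scale for x in t] + [x * n_scale for x in n]
        for t, n in zip(text_z, num_z)
    ]


# Toy data (made up): three rows, a 5-dimensional text block and one numeric column.
text_vecs = [[1, 2, 3, 4, 5], [2, 1, 0, 3, 4], [5, 5, 1, 0, 2]]
numeric_feats = [[10.0], [200.0], [35.0]]
combined = combine_blocks(text_vecs, numeric_feats)
```

Raising `numeric_weight` above `text_weight` would deliberately favor the numerical feature; the combined rows can then be passed to whichever clustering algorithm is in use.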

How to implement LSTM using Doc2Vec vectors to get representation?

Hi all. I'm a newbie in ML. I read a paper, A Multi-Level Plagiarism Detection System Based on Deep Learning Algorithms, and want to implement its model, but I can't find a step-by-step guide to build it. How can an LSTM produce a representation when its input is a list of sentence vectors trained by Doc2Vec?
Category: Data Science

Difference between Doc2Vec and BERT

I am trying to understand the difference between Doc2Vec and BERT. I do understand that doc2vec uses a paragraph ID which also serves as a paragraph vector. I am not sure, though, whether that paragraph ID helps the vector capture the context better. Moreover, BERT definitely understands the context and assigns different vectors to words such as "bank". For instance: "I robbed a bank" versus "I was sitting by the bank of a river." BERT would allocate …
Category: Data Science

Treating Word Embeddings as Multivariate Gaussian Random Variables

I want to specify some probabilistic clustering model (such as a mixture model or LDA) over words, and instead of using the traditional method of representing words as an indicator vector, I want to use the corresponding word embeddings extracted from word2vec, GloVe, etc. as input. While treating word embeddings from my word2vec as an input to my GMM model, I observed that my word embeddings for each feature had a normal distribution, i.e. feature 1..100 were normally distributed …
Category: Data Science

Is there a way to train Doc2Vec on a corpus of docs and be able to take a novel doc and see how similar it is to the trained corpus?

I have a project idea, where I train a bunch of documents on Doc2Vec and then take a novel, input doc, and ideally be able to be told how similar it is to the docs supplied for training as a whole or how well it "fits" with the training docs. Is there a way to do this?
Category: Data Science
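
One way to score how well a new document "fits" the corpus as a whole, sketched under assumptions: suppose the new document's vector has already been obtained (in gensim, `Doc2Vec.infer_vector` produces a vector for unseen tokens) and the training document vectors are available as plain lists. The function names and toy vectors below are hypothetical. The sketch reports cosine similarity to the corpus centroid and the mean cosine similarity to the individual documents.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def corpus_fit(new_vec, corpus_vecs):
    """Return (similarity to the centroid, mean similarity to each document)."""
    dim = len(new_vec)
    centroid = [sum(v[i] for v in corpus_vecs) / len(corpus_vecs) for i in range(dim)]
    per_doc = [cosine(new_vec, v) for v in corpus_vecs]
    return cosine(new_vec, centroid), sum(per_doc) / len(per_doc)


# Toy 2-dimensional vectors (made up) standing in for Doc2Vec output.
corpus_vecs = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
on_topic = [1.0, 0.1]
off_topic = [-0.1, 1.0]

on_score = corpus_fit(on_topic, corpus_vecs)
off_score = corpus_fit(off_topic, corpus_vecs)
```

A threshold on either score would have to be chosen empirically, for example by looking at the scores that held-out documents from the same corpus receive.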

How to examine if a Doc2Vec model is sufficiently trained?

I started experimenting with gensim's Doc2Vec for sentiment analysis. For the training of the embedding itself, I have seen examples using a reduced learning rate with a few tens or even a few hundred epochs. However, there does not seem to be a straightforward way to use early stopping to prevent overfitting, and it is not yet clear to me how I should access loss values for each epoch to detect overfitting. What should be the proper way to examine …
Category: Data Science

Imbalanced Classification: BOW vs doc2Vec in XGBoost with sample weights

I am new to machine learning. I have an imbalanced dataset of pages of reports with class 1: 97%, class 2: 2.2%, class 3: 0.25%, which are the different types of pages. I am mostly concerned with correctly predicting classes 2 & 3. I tried:
1. doc2Vec with XGBoost (with sample weight to correct the imbalanced classes)
2. BOW with XGBoost (with sample weight to correct the imbalanced classes)
Oddly, 2 outperformed 1. I thought doc2Vec should be better as it creates …
Category: Data Science
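
For reference, a minimal sketch of one common way to derive the sample weights mentioned above: the "balanced" heuristic, where each class gets weight n_samples / (n_classes * class_count). The question does not say how its weights were computed, so this is an assumption, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """One weight per sample so that every class has the same total weight."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    return [class_weight[y] for y in labels]


# Toy labels (made up): 8 pages of class 1, 3 of class 2, 1 of class 3.
labels = [1] * 8 + [2] * 3 + [3] * 1
weights = balanced_sample_weights(labels)
```

The resulting list can be passed as `sample_weight` when fitting an XGBoost classifier through its scikit-learn interface.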

classification of similar text input features with text output label

I hope somebody can provide guidance/input/advice on my project, where I believe AI can help. I have a general understanding of AI, but I lack formal training. I've never built a neural net from scratch on my own. Task: build a classification model able to assign labels to input text data. Unlike a textbook example, the input is free text, so neither categorical nor numerical. To complicate matters, the predictors in the training data I use are often …
Category: Data Science

Preprocessing for Document Similarity Using Doc2Vec

I'm trying to determine document similarity using Doc2Vec on a large series of legal opinions, which can contain some highly jargonistic language and phrases (e.g. en banc, de novo, etc.). I'm wondering if anyone has any thoughts about the criteria I should consider, if any, about how to treat compound words/phrases in Doc2Vec for the purposes of calculating similarity. Were I just using tf-idf or something more straightforward, I'd consider going through each phrase and combining the words manually during …
Category: Data Science

Word2Vec vs. Doc2Vec Word Vectors

I am doing some analysis on document similarity and was also interested in word similarity. I know that doc2vec inherits from word2vec and by default trains using word vectors which we can access. My question is: Should we expect these word vectors and by association any of the methods such as most_similar to be 'better' than word2vec or are they essentially going to be the same? If in the future I only wanted word similarity should I just default to …
Category: Data Science

What is the meaning of, or explanation for, having multiple tags in a Doc2Vec model's TaggedDocuments?

I've tried reading the other answers on this topic but I'm unsure if I understand completely. For my dataset, I have a series of tagged documents, "good" or "bad." Each document belongs to an entity, and each entity has a different number of documents. Eventually, I'd like to create a classifier to detect whether or not an entity's document is good or bad and to also see what sentences are most similar to the good/bad tag. All that being said, …
Category: Data Science

doc2vec - paragraph or article as document

I'm trying to train a doc2vec model on the German wiki corpus. While looking for the best practice I've found different possibilities on how to create the training data. Should I split every Wikipedia article by each natural paragraph into several documents or use one article as a document to train my model? EDIT: Is there an estimate on how many words per document for doc2vec?
Category: Data Science

Document Similarity to List of Words in Sentiment Analysis

How would you go about finding document similarity to a list of words in Sentiment Analysis? I am looking to find document similarity to multiple lists of words in sentiment analysis. I had been working on this with my intern, but he is sorting by sentiment average to find the most similar score of each list or combinations of the list of words. I assume this isn't the best approach; I was thinking it should be a separate thing like below and I …
Category: Data Science

Word2Vec with CNN

I am trying to classify documents using a CNN (convolutional neural network) with Word2Vec embeddings. However, to do this, I have to trim all texts to the same length. I just pad all the training documents to the size of the longest, and I don't think this is the best solution, as during the testing phase a longer document may arrive and I may remove a significant part of it by trimming. I found that there is Doc2Vec, which …
Category: Data Science
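
A small sketch of an alternative to padding everything to the longest training document: pick a fixed length that covers most training documents, then pad or truncate each document to it. The `coverage` parameter, the function names, and the convention that index 0 is reserved for padding are assumptions for illustration, not part of the question.

```python
import math


def choose_max_len(lengths, coverage=0.95):
    """Smallest length that fully covers the given fraction of documents."""
    ordered = sorted(lengths)
    idx = max(0, math.ceil(coverage * len(ordered)) - 1)
    return ordered[idx]


def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Clip a sequence to max_len, then right-pad it with pad_id (assumed reserved)."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


# Toy token-id sequences (made up); one document is much longer than the rest.
docs = [[5, 3], [7, 1, 4, 2], [9], [6] * 10]
max_len = choose_max_len([len(d) for d in docs], coverage=0.75)
batch = [pad_or_truncate(d, max_len) for d in docs]
```

The same `max_len` would be reused at test time, so an unusually long test document is clipped rather than forcing every other input to grow.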

Topic alignment / topic modelling

What is the most efficient method for detecting whether the article is mostly about a specific topic, but without lots of data for training? My task is to determine how much a document is e.g. about the weather or holidays or several other specific topics. I was looking towards LDA and TFIDF but from what I understand this approach is unsupervised and works well for clustering/grouping large number of documents based on vocabulary frequency. These techniques have a limitation in …
Category: Data Science

T-SNE good clustering but SVM classification poor

I am trying to classify paragraph embedding vectors computed with doc2vec into 4 different classes, using a non-linear SVM over them. When I visualize the embeddings using TensorBoard t-SNE, I can see that they are clustered quite well, as in the image. However, when I train the SVM (with RBF kernel and grid search) I obtain an F1-score of 60%, which seems quite low given the figure. Is it common to obtain good clusters with t-SNE and bad results with …
Category: Data Science

Use embeddings to find similarity between documents

I need to find the cosine similarity between two text documents. I need embeddings that reflect the order of the word sequence, so I don't plan to use document vectors built with bag of words or TF-IDF. Ideally I would use pre-trained document embeddings such as doc2vec from Gensim. How do I map new documents to pre-trained embeddings? Otherwise, what would be the easiest way to generate document embeddings in Keras/TensorFlow or PyTorch?
Category: Data Science
