I am fairly new to BERT, and I want to test two approaches for getting "the most similar words" to a given word, to use in Snorkel labeling functions for weak supervision. The first approach was to use word2vec with the pre-trained "word2vec-google-news-300" embeddings to find the most similar words:

@labeling_function()
def lf_find_good_synonyms(x):
    good_synonyms = word_vectors.most_similar("good", topn=25)  ## Similar words are extracted here
    good_list = syn_list(good_synonyms)  ## syn_list just returns the stemmed similar word
    return POSITIVE if any(word in x.stemmed for …
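A minimal sketch of what the completed labeling function could look like, assuming the POSITIVE/ABSTAIN label values, a Porter stemmer, and that syn_list simply stems the words returned by most_similar (all assumptions; only the decorated function skeleton comes from the snippet above):

import gensim.downloader as api
from nltk.stem import PorterStemmer
from snorkel.labeling import labeling_function

POSITIVE, ABSTAIN = 1, -1                             # assumed label constants
stemmer = PorterStemmer()
word_vectors = api.load("word2vec-google-news-300")   # pre-trained embeddings

def syn_list(similar):
    # most_similar returns (word, score) pairs; keep only the stemmed words
    return [stemmer.stem(word) for word, _ in similar]

@labeling_function()
def lf_find_good_synonyms(x):
    good_synonyms = word_vectors.most_similar("good", topn=25)  # similar words
    good_list = syn_list(good_synonyms)                         # stemmed forms
    return POSITIVE if any(word in x.stemmed for word in good_list) else ABSTAIN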
I am working with a dataset of more than 100,000 records. This is what the data looks like:

email_id  cust_id  campaign_name
123       4567     World of Zoro
123       4567     Boho XYz
123       4567     Guess ABC
234       5678     Anniversary X
234       5678     World of Zoro
234       5678     Fathers day
234       5678     Mothers day
345       7890     Clearance event
345       7890     Fathers day
345       7890     Mothers day
345       7890     Boho XYZ
345       7890     Guess ABC
345       7890     Sale

I am trying to understand the campaign …
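The question is cut off, but since it appears in a word2vec context, here is one hedged sketch (an assumption about the goal, not the asker's stated approach): treat each customer's list of campaigns as a "sentence" and train gensim Word2Vec on those lists, so that campaigns received by similar customers end up with similar vectors.

import pandas as pd
from gensim.models import Word2Vec

# illustrative slice of the table above
df = pd.DataFrame({
    "cust_id": [4567, 4567, 4567, 5678, 5678],
    "campaign_name": ["World of Zoro", "Boho XYz", "Guess ABC",
                      "Anniversary X", "World of Zoro"],
})

# one "sentence" per customer: the campaigns that customer received
sentences = df.groupby("cust_id")["campaign_name"].apply(list).tolist()

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)
print(model.wv.most_similar("World of Zoro", topn=2))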
In two-layer perceptrons that slide across words of text, such as word2vec and fastText, the hidden-layer weights may be a product of two random variables, such as positional embeddings and word embeddings (Mikolov et al. 2017, Section 2.2): $$v_c = \sum_{p\in P} d_p \odot u_{t+p}$$ However, it's unclear to me how best to initialize the two variables. When only word embeddings are used for the hidden layer weights, word2vec and fastText initialize them to $\mathcal{U}(-1 / \text{fan\_out},\; 1 / \text{fan\_out})$. …
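A small numpy sketch of the initialization being asked about, assuming the same uniform range is applied to both tables; how (or whether) to initialize the positional embeddings differently is exactly the open question:

import numpy as np

vocab_size, dim, window = 30_000, 300, 5
fan_out = dim

# word embeddings u_t: uniform in [-1/fan_out, 1/fan_out], as in word2vec/fastText
word_emb = np.random.uniform(-1 / fan_out, 1 / fan_out, size=(vocab_size, dim))

# positional embeddings d_p, one per window position (initialized the same way
# here purely as an assumption)
pos_emb = np.random.uniform(-1 / fan_out, 1 / fan_out, size=(2 * window, dim))

# hidden layer v_c = sum over positions of d_p (element-wise) u_{t+p} for one window
context_ids = np.random.randint(0, vocab_size, size=2 * window)
v_c = (pos_emb * word_emb[context_ids]).sum(axis=0)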
I have a scientific database with articles and co-authors. Using this database, I am training a word2vec model on the co-authors. The use case here is to disambiguate authors. I was wondering whether my approach can be improved; any suggestions would be greatly appreciated. Code
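Since the code itself is missing, here is a hedged sketch of the general approach the question describes (not the asker's code): treat each article's co-author list as a "sentence" and train gensim Word2Vec on those lists, so authors who frequently publish together get nearby vectors.

from gensim.models import Word2Vec

# each inner list is the co-author names of one article (illustrative data)
coauthor_sentences = [
    ["J. Smith", "A. Kumar", "L. Chen"],
    ["A. Kumar", "M. Garcia"],
    ["L. Chen", "J. Smith"],
]

model = Word2Vec(
    sentences=coauthor_sentences,
    vector_size=100,   # embedding dimension
    window=10,         # all co-authors of a paper count as context
    min_count=1,
    sg=1,              # skip-gram
    workers=4,
)

# authors who often co-publish end up close in the embedding space
print(model.wv.most_similar("J. Smith", topn=3))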
I am trying to convert categorical values (zipcodes) with Cat2Vec into a matrix that can be used as an input shape for categorical prediction of a target with binary values. After reading several articles, among which: https://www.yanxishe.com/TextTranslation/1656?from=csdn I am having trouble understanding two things: 1) With respect to which y in Cat2Vec encoding are you creating embeddings? Is it with respect to the actual target in the dataset you are trying to predict, or can you randomly choose any …
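For reference, a minimal Keras sketch of one common entity-embedding setup (an assumption, not necessarily what the linked article does), in which the zipcode embedding is learned jointly with a supervised model against the binary target; sizes and data are illustrative:

import numpy as np
import tensorflow as tf

n_zipcodes, emb_dim = 1000, 8
zip_ids = np.random.randint(0, n_zipcodes, size=(500, 1))   # integer-encoded zipcodes
y = np.random.randint(0, 2, size=(500, 1))                  # binary target

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,)),
    tf.keras.layers.Embedding(n_zipcodes, emb_dim, name="zip_embedding"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(zip_ids, y, epochs=2, verbose=0)

# the learned zipcode vectors can be extracted and reused as features elsewhere
zip_vectors = model.get_layer("zip_embedding").get_weights()[0]  # (n_zipcodes, emb_dim)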
In the skip-gram model, the probability that a word $w$ is part of the set of context words $\{w_o^{(i)}\}$ $(i = 1, \dots, m)$, where $m$ is the context window around the central word, is given by: $$p(w_o \mid w_c) = \frac{\exp(\vec{u}_o \cdot \vec{v}_c)}{\sum_{i\in V}\exp(\vec{u}_i \cdot \vec{v}_c)}$$ where $V$ is the vocabulary, $\vec{u}_i$ is the word embedding for a context word and $\vec{v}_c$ is the word embedding for the central word. But this type of model is defining …
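A small numeric sketch of that softmax, assuming toy embeddings (U holds the context/outside vectors, one row per vocabulary word, and v_c is the center-word vector):

import numpy as np

V, dim = 10, 4
rng = np.random.default_rng(0)
U = rng.normal(size=(V, dim))    # outside-word embeddings u_i
v_c = rng.normal(size=dim)       # center-word embedding v_c

scores = U @ v_c                                  # dot products u_i . v_c
p = np.exp(scores) / np.exp(scores).sum()         # p(w_o | w_c) for every candidate o
print(p.sum())                                    # 1.0: a distribution over the V words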
I have to use a fastText model to return word embeddings. In testing I was calling it through an API. Since there are too many words to compute embeddings for, the API calls turn out to be expensive. I would like to use fastText without the API. For that I need to load the model once and keep it in memory for further calls. How can this be done without using the API? Any help is highly appreciated.
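One way to do this, assuming the official fasttext Python package and a locally downloaded .bin model (the path below is illustrative), is to load the model once at startup and reuse it for every lookup:

import fasttext

model = fasttext.load_model("cc.en.300.bin")   # load once, keep in memory

def embed(words):
    # later calls reuse the in-memory model; no network/API round trip
    return {w: model.get_word_vector(w) for w in words}

vectors = embed(["hello", "world"])
print(vectors["hello"].shape)   # (300,)

If the model was trained with gensim instead, gensim.models.FastText.load would play the same role.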
I am new to NLP and I'm trying to perform embedding for a clustering problem. I have created the word2vec model using Python's gensim library, but I am wondering the following: the word2vec model embeds the words into vectors of size vector_size. However, in further steps of the clustering approach, I realised I was clustering based on single words instead of the sentences I had in my dataset at the beginning. Let's say my vocabulary is composed of the two …
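One common way to move from word-level to sentence-level clustering is to average each sentence's word vectors; a minimal gensim sketch (toy data, and the zero-vector fallback is an assumption):

import numpy as np
from gensim.models import Word2Vec

sentences = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
model = Word2Vec(sentences, vector_size=50, min_count=1)

def sentence_vector(tokens, model):
    # average the vectors of in-vocabulary tokens; zero vector if none remain
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

X = np.vstack([sentence_vector(s, model) for s in sentences])  # one row per sentence
print(X.shape)  # (2, 50), ready for a clustering algorithm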
So, say I have the following sentences: ["The dog says woof", "a king leads the country", "an apple is red"]. I can embed each word using an N-dimensional vector, and represent each sentence as either the sum or the mean of all the words in the sentence (e.g. Word2Vec). When we represent the words as vectors we can do something like vector(king) - vector(man) + vector(woman) = vector(queen), which combines the different "meanings" of each vector and creates a new one, where the mean …
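The analogy arithmetic and the sum/mean sentence representation can both be tried directly with gensim's pre-trained Google News vectors (a large download; the sketch below assumes it):

import numpy as np
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# vector(king) - vector(man) + vector(woman) is closest to vector(queen)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# a crude sentence vector: the mean of the word vectors in the sentence
sentence = ["The", "dog", "says", "woof"]
sent_vec = np.mean([wv[w] for w in sentence if w in wv], axis=0)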
Will word2vec fail if sentences contain only similar words, or in other words, if the window size is equal to the sentence size? I suppose this question boils down to whether word2vec considers words from other sentences as negative samples, or only words from the same sentence but outside of the window.
In the Word2Vec trainable model, there are two different weight matrices: the matrix $W$ from the input to the hidden layer and the matrix $W'$ from the hidden layer to the output layer. Referring to this article, I understand that the reason we have the matrix $W'$ is basically to compensate for the lack of an activation function in the output layer. As the activation function is not needed during runtime, there is no activation function in the output layer. But we need to update the input-to-hidden weight matrix $W$ through …
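A tiny numpy sketch of where the two matrices sit in the forward pass (toy sizes; the initialization choices here are assumptions):

import numpy as np

V, H = 10, 4                     # vocabulary size, hidden/embedding size
rng = np.random.default_rng(0)
W = rng.uniform(-0.5 / H, 0.5 / H, size=(V, H))   # input-to-hidden: "input" vectors
W_prime = np.zeros((H, V))                        # hidden-to-output: "output" vectors

center = 3
h = W[center]                                     # hidden layer = a row of W, no activation
scores = h @ W_prime                              # one score per vocabulary word
p = np.exp(scores) / np.exp(scores).sum()         # softmax over the vocabulary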
I have thousands of headlines and I would like to build a semantic network using word2vec, specifically the Google News vectors. My sentences look like:

Titles
Dogs are humans’ best friends
A dog died because of an accident
You can clean dogs’ paws using natural products.
A cat was found in the kitchen

And so on. What I would like to do is find specific patterns within this data, e.g. similarity in topics on dogs and cats, using semantic networks. …
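A hedged sketch of one way to start such a network: embed each headline as the mean of its pre-trained Google News word vectors and connect headlines whose cosine similarity exceeds a threshold. The tokenization and the 0.5 threshold are illustrative assumptions.

import numpy as np
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
titles = [
    "Dogs are humans best friends",
    "A dog died because of an accident",
    "A cat was found in the kitchen",
]

def title_vector(title):
    tokens = [t.lower() for t in title.split()]
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0)

vectors = [title_vector(t) for t in titles]
edges = []
for i in range(len(titles)):
    for j in range(i + 1, len(titles)):
        sim = float(np.dot(vectors[i], vectors[j]) /
                    (np.linalg.norm(vectors[i]) * np.linalg.norm(vectors[j])))
        if sim > 0.5:                      # illustrative similarity threshold
            edges.append((i, j, round(sim, 2)))
print(edges)                               # candidate edges of the semantic network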
My task is to predict relevant words based on a short description of an idea. For example, "SQL is a domain-specific language used in programming and designed for managing data held in a relational database" should produce words like "mysql", "Oracle", "Sybase", "Microsoft SQL Server", etc. My thinking is to treat the initial text as a set of words (after lemmatization and stop-word removal) and predict words that should be in that set. I can then take all of …
I am interested in a framework for learning the similarity of different input representations based on some common context. I have looked into word2vec, SVD and other recommender systems, which do more or less what I want. I want to know if anyone here has any experience with, or resources on, a more generalized version of this, where I am able to feed in representations of different objects and learn how similar they are. For example: say we have some customers …
I have a dataset with many documents of 50 to 100 words each. I need to clean the data by correcting misspelled words in those documents. I have an algorithm which predicts possible correct words for a misspelled word. The problem is that I need to choose or verify the predictions made by that algorithm in order to clean the spelling errors in the documents. Can I use all the possible correct words predicted for the correct spelling in a word vector to …
I have a set of documents and I want to identify and remove the outlier documents. I am just wondering if doc2vec can be used for this task, or whether there are any recently evolved, promising algorithms that I can use instead. EDIT: I am currently using a bag-of-words model to identify outliers.
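A hedged sketch of the doc2vec idea being asked about: embed each document with gensim's Doc2Vec and flag documents far from the centroid as outlier candidates (the mean + 2·std cutoff is an illustrative assumption, not a recommendation):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [["word", "embeddings", "for", "documents"],
        ["another", "short", "document"],
        ["completely", "unrelated", "banana", "smoothie", "recipe"]]
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
vectors = np.vstack([model.dv[i] for i in range(len(docs))])

centroid = vectors.mean(axis=0)
dists = np.linalg.norm(vectors - centroid, axis=1)
threshold = dists.mean() + 2 * dists.std()        # illustrative cutoff
outliers = np.where(dists > threshold)[0]
print(outliers)                                   # indices of candidate outlier documents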
In NLP, while computing word2vec we try to maximize $\log P(o \mid c)$, where $P(o \mid c)$ is the probability that $o$ is an outside word, given that $c$ is the center word: $$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{T}\exp(u_w^\top v_c)}$$ Here $u_o$ is the word vector for the outside word, $v_c$ is the word vector for the center word, and $T$ is the number of words in the vocabulary. The above equation is a softmax, and the dot product of $u_o$ and $v_c$ acts as a score, which should be higher the better. If words $o$ and $c$ are closer then their dot product should …
I get really confused about why we need to 'train word2vec' when word2vec itself is said to be 'pretrained'. I searched for a word2vec pretrained embedding, thinking I could get a mapping table directly mapping the vocab of my dataset to a pretrained embedding, but to no avail. Instead, all I could find is how we literally train our own: Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4) But I'm confused: isn't word2vec already pretrained? Why do we need to 'train' it again? If …
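For contrast with the training call above, loading a pre-trained word2vec model and looking words up directly (no training step) looks like this with gensim's downloader; this is what "pretrained" usually refers to, while the Word2Vec(...) call trains new vectors on your own corpus:

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pre-trained KeyedVectors, large download

vector = wv["computer"]                     # 300-dimensional vector, no training needed
print(wv.most_similar("computer", topn=3))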
I am very new to Machine Learning and I have recently been exposed to word2vec and BERT. From what I know, word2vec provides a vector representation of words, but is limited to its dictionary definition. This would mean the algorithm may output the unwanted definition of a word with multiple meanings. BERT, on the other hand, is able to use context clues in the sentence to describe the true meaning of the word. To me, it sounds like BERT would …
I am working on a project that detects anomalies in a time series. I wonder if I can use word2vec for anomaly detection on non-string inputs like exchange rates?