How to train custom word2vec embeddings to find related articles?

I am a beginner in machine learning. My project is to build an AI-based search engine that shows related articles when someone searches the website. For this I decided to train my own embeddings. I found two methods: one is to train the network to predict the next word (i.e. inputs = [the quick, the quick brown, the quick brown fox] and outputs = [brown, fox, lazy]); the other is to train on pairs of nearby words (i.e. [brown,fox], [brown,quick], [brown,quick]). Which method should I use, and after training how should I …
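A minimal sketch of the second approach (training on nearby word pairs, i.e. skip-gram) using Gensim; the toy corpus and parameter values below are placeholders, not recommendations:

    from gensim.models import Word2Vec

    corpus = [
        ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
        ["a", "fast", "brown", "fox", "leaps", "over", "a", "sleepy", "dog"],
    ]

    # sg=1 selects skip-gram (predict context words from the centre word);
    # sg=0 would select CBOW (predict the centre word from its context).
    model = Word2Vec(sentences=corpus, vector_size=100, window=2,
                     min_count=1, sg=1, epochs=50)

    # After training, an article vector can be built by averaging its word
    # vectors, and related articles found via cosine similarity.
    print(model.wv.most_similar("fox", topn=3))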
Category: Data Science

How to get vector representations (or embeddings) of time series?

Even though a time series is made up of numbers only, finding an abstract fixed-dimensional vector representation would be interesting for classification/clustering purposes. Just as we can learn abstract representations/embeddings of text and images, can we do something similar with time series? Such representations could give better clustering and related results than traditional approaches based on statistical measures like Pearson correlation. All thoughts are welcome.
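One common way to get such fixed-dimensional vectors is a sequence autoencoder whose bottleneck is used as the embedding; a sketch with illustrative shapes and placeholder data:

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    # Toy data: 1000 univariate series of length 128 (placeholder only).
    x = np.random.randn(1000, 128, 1).astype("float32")

    inp = layers.Input(shape=(128, 1))
    z = layers.LSTM(32)(inp)                        # 32-dim embedding (bottleneck)
    dec = layers.RepeatVector(128)(z)
    dec = layers.LSTM(32, return_sequences=True)(dec)
    out = layers.TimeDistributed(layers.Dense(1))(dec)

    autoencoder = Model(inp, out)
    encoder = Model(inp, z)                         # this sub-model yields the embeddings
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(x, x, epochs=5, batch_size=64)

    embeddings = encoder.predict(x)                 # shape (1000, 32), usable for clustering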
Category: Data Science

Transformer time series classification using time2vec positional embedding

I want to use a transformer model to classify fixed-length time series. I was following this Keras tutorial, which uses Time2Vec as a positional embedding. According to the original Time2Vec paper the representation is calculated as $$ \boldsymbol{t2v}(\tau)[i] = \begin{cases} \omega_i \tau + \phi_i,& i = 0\\ F(\omega_i \tau + \phi_i), & 1 \leq i \leq k \end{cases} $$ The tutorial simply concatenates this embedding with the input. Now, I understand the intention of the …
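For reference, a rough Keras implementation of that formula with sin as the periodic function F (the paper allows other choices); this is a sketch, not the tutorial's exact layer:

    import tensorflow as tf
    from tensorflow.keras import layers

    class Time2Vec(layers.Layer):
        """t2v(tau)[0] = w0*tau + b0 (linear); t2v(tau)[1..k] = sin(w*tau + b)."""
        def __init__(self, k, **kwargs):
            super().__init__(**kwargs)
            self.k = k

        def build(self, input_shape):
            # input expected as (batch, steps, 1): a scalar time value per step
            self.w0 = self.add_weight(name="w0", shape=(1, 1), initializer="random_uniform")
            self.b0 = self.add_weight(name="b0", shape=(1, 1), initializer="random_uniform")
            self.w = self.add_weight(name="w", shape=(1, self.k), initializer="random_uniform")
            self.b = self.add_weight(name="b", shape=(1, self.k), initializer="random_uniform")

        def call(self, tau):
            linear = tau * self.w0 + self.b0               # i = 0 term
            periodic = tf.sin(tau * self.w + self.b)       # 1 <= i <= k terms
            return tf.concat([linear, periodic], axis=-1)  # (batch, steps, k+1)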
Category: Data Science

Can I get un-normalized vectors from the TF USE model?

I'm using this Universal Sentence Encoder (USE) model to get embeddings of a set of texts, each text corresponding to a newspaper article. In order to build a Recommender System, I generate user embeddings by averaging the embeddings of items a user has read, and then I look for other texts that are cosine-similar to this user (basically, the method returns a set of items that are similar to this user embedding). Now, the problem is that the mentioned model …
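A sketch of the recommender step described above; the TF Hub handle and article texts are placeholder assumptions:

    import numpy as np
    import tensorflow_hub as hub

    # Assumed handle for USE v4; adjust to the version you actually use.
    use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    articles = ["article text one ...", "article text two ...", "article text three ..."]
    item_emb = use(articles).numpy()            # USE outputs are commonly reported to be near unit norm

    user_emb = item_emb[:2].mean(axis=0)        # user embedding = mean of the articles the user read

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = [cosine(user_emb, e) for e in item_emb]   # rank items by similarity to the user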
Category: Data Science

Embedding from a Transformer-based model for a paragraph or document (like Doc2Vec)

I have a dataset containing sequences of different lengths; on average the sequence length is 600. The dataset looks like this:

    S1 = ['Walk','Eat','Going school','Eat','Watching movie','Walk'......,'Sleep']
    S2 = ['Eat','Eat','Going school','Walk','Walk','Watching movie'.......,'Eat']
    .........................................
    .........................................
    S50 = ['Walk','Going school','Eat','Eat','Watching movie','Sleep',.......,'Walk']

The number of unique actions in the dataset is fixed, which means some sequences may not contain all of the actions. Using Doc2Vec (the Gensim library in particular), I was able to extract an embedding for each of the sequences …
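For reference, a minimal Gensim Doc2Vec sketch of the step described (the sequences here are abbreviated placeholders):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    sequences = [
        ["Walk", "Eat", "Going school", "Eat", "Watching movie", "Walk", "Sleep"],
        ["Eat", "Eat", "Going school", "Walk", "Walk", "Watching movie", "Eat"],
    ]
    docs = [TaggedDocument(words=seq, tags=[f"S{i+1}"]) for i, seq in enumerate(sequences)]

    model = Doc2Vec(documents=docs, vector_size=64, window=5, min_count=1, epochs=40)
    seq_embedding = model.dv["S1"]      # fixed-length vector for sequence S1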
Category: Data Science

Discriminator of a Conditional GAN with continuous labels

OK, let's say we have well-labeled images with non-discrete labels such as brightness or size, and we want to generate images conditioned on those labels. With a discrete label it could be done like this:

    def forward(self, inputs, label):
        self.batch = inputs.size(0)
        h = self.res1(inputs)
        h = self.attn(h)
        ...
        h = self.res5(h)
        h = torch.sum((F.leaky_relu(h, 0.2)).view(self.batch, -1, 4*4), dim=2)
        outputs = self.fc(h)
        if label is not None:
            embed = self.embedding(label)
            outputs += torch.sum(embed*h, dim=1, keepdim=True)

The embedding can be made to …
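One common workaround (a sketch, not the asker's code) is to replace the discrete nn.Embedding lookup with a small learned projection of the continuous label, keeping the same projection-discriminator structure:

    import torch
    import torch.nn as nn

    class ContinuousProjection(nn.Module):
        """Maps a continuous label (e.g. brightness) into the same space as h."""
        def __init__(self, feature_dim):
            super().__init__()
            # nn.Linear replaces nn.Embedding; accepts a float label of shape (batch, 1)
            self.proj = nn.Linear(1, feature_dim)

        def forward(self, h, label):
            embed = self.proj(label)                          # (batch, feature_dim)
            return torch.sum(embed * h, dim=1, keepdim=True)  # projection term added to the logit

    # usage inside the discriminator's forward, mirroring the snippet above:
    # outputs = self.fc(h)
    # outputs += self.cont_proj(h, label)   # label: float tensor of shape (batch, 1)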
Category: Data Science

TextVectorization and Autoencoder for feature extraction of text

I'm trying to solve the following problem: I need to train an autoencoder to extract useful features from text; I will then use the trained autoencoder in another model for feature extraction. The goal is to teach the autoencoder to compress the information and then reconstruct exactly the same string, and I treat reconstruction as a classification problem for each letter. My dataset:

    X_train_autoencoder_raw:
    15298    some text...
    1127     some text...
    22270    more text...
    ...
    Name: data, Length: 28235, dtype: object …
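A rough sketch of that setup, assuming character-level reconstruction treated as per-character classification (the layer sizes, sequence length, and toy data below are assumptions):

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    texts = ["15298 some text...", "1127 some text...", "22270 more text..."]

    max_len = 64
    vectorizer = layers.TextVectorization(split="character", output_mode="int",
                                          output_sequence_length=max_len)
    vectorizer.adapt(texts)
    vocab_size = vectorizer.vocabulary_size()
    x = vectorizer(tf.constant(texts))                    # (n, max_len) integer character ids

    inp = layers.Input(shape=(max_len,), dtype="int64")
    e = layers.Embedding(vocab_size, 32, mask_zero=True)(inp)
    z = layers.LSTM(64)(e)                                # compressed representation
    d = layers.RepeatVector(max_len)(z)
    d = layers.LSTM(64, return_sequences=True)(d)
    out = layers.TimeDistributed(layers.Dense(vocab_size, activation="softmax"))(d)

    autoencoder = Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    autoencoder.fit(x, x, epochs=5)                        # reconstruct character ids from ids

    encoder = Model(inp, z)                                # reuse for feature extraction later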
Category: Data Science

Keras: Softmax output into embedding layer

I'm trying to build an encoder-decoder network in Keras to generate a sentence of a particular style. As my problem is unsupervised, i.e. I don't have ground truths for the generated sentences, I use a classifier to help during training: I pass the decoder's output into the classifier to tell me what style the decoded sentence is. The decoder outputs a softmax distribution, which I was intending to feed straight into the classifier, but I realised that it has …
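One common way around the non-differentiable argmax (a sketch, not necessarily the right fix for this exact architecture) is to feed the classifier the expected embedding, i.e. the softmax distribution multiplied by the classifier's embedding matrix:

    import tensorflow as tf

    # probs: decoder output, shape (batch, seq_len, vocab_size) -- a softmax distribution
    # embedding_matrix: the classifier's embedding weights, shape (vocab_size, emb_dim)
    def soft_embed(probs, embedding_matrix):
        # Weighted average of all word embeddings instead of a hard lookup;
        # gradients can flow from the classifier back into the decoder.
        return tf.matmul(probs, embedding_matrix)   # (batch, seq_len, emb_dim)

    # usage (assuming the classifier's Embedding layer is named "embedding"):
    # emb = soft_embed(decoder_probs, classifier.get_layer("embedding").embeddings)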
Category: Data Science

Triplet loss - what threshold to use to detect similarity between two embeddings?

I have trained a triplet-loss model using FaceNet's architecture on the 11k Hands dataset. Now I want to see how well my model performs, so I feed it two images of the same class and get back their embeddings. I want to compare the distance between these embeddings: if that distance is not larger than some threshold, I can say the model correctly classifies these two images as belonging to the same class. How do I select the …
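A common way to pick the threshold (a sketch, assuming you have a labelled validation set) is to compute distances for known same-class and different-class pairs and sweep candidate thresholds for the best accuracy:

    import numpy as np

    # pos_dists: distances between embeddings of same-class pairs
    # neg_dists: distances between embeddings of different-class pairs
    def best_threshold(pos_dists, neg_dists, num_steps=200):
        candidates = np.linspace(min(pos_dists.min(), neg_dists.min()),
                                 max(pos_dists.max(), neg_dists.max()), num_steps)
        best_t, best_acc = None, -1.0
        for t in candidates:
            tp = np.sum(pos_dists <= t)     # same-class pairs accepted
            tn = np.sum(neg_dists > t)      # different-class pairs rejected
            acc = (tp + tn) / (len(pos_dists) + len(neg_dists))
            if acc > best_acc:
                best_t, best_acc = t, acc
        return best_t, best_acc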
Category: Data Science

Generalize min-max scaling to vectors

I am combining several vectors, where each vector is a certain kind of embedding of some object. Since the embeddings are on very different scales (some have all components in $[0, 1]$, some have components around 60 or 70, etc.), I want to rescale the vectors before combining them. I thought about using something like min-max rescaling, but I'm not sure how to generalize it to vectors. I could do something of the sort $\frac{v-|v_{min}|}{|v_{max}|-|v_{min}|}$ but I …
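One interpretation (a sketch, assuming the min and max are taken component-wise over all vectors of the same embedding type) is to min-max scale each embedding type separately before concatenating:

    import numpy as np

    def minmax_scale_rows(X, eps=1e-12):
        """Component-wise min-max scaling over a set of vectors X of shape (n, d)."""
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / (hi - lo + eps)

    # emb_a, emb_b: two embeddings of the same objects, on very different scales (toy data)
    emb_a = np.random.rand(100, 16)           # components already in [0, 1]
    emb_b = 70.0 * np.random.rand(100, 32)    # components up to ~70

    combined = np.hstack([minmax_scale_rows(emb_a), minmax_scale_rows(emb_b)])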
Category: Data Science

Is there a sensible notion of 'character embeddings'?

There are several popular word embeddings available (e.g., fastText and GloVe). In short, those embeddings are a tool to encode words along with a sensible notion of semantics attached to them (i.e. words with similar semantics are nearly parallel). Question: Is there a similar notion of character embedding? By 'character embedding' I mean an algorithm that allows us to encode characters in order to capture some syntactic similarity (i.e. similarity of character shapes or contexts).
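For the "similar contexts" sense, one minimal experiment is to treat each character as a token and train a small skip-gram model over character sequences (a sketch with placeholder text, not a standard reference implementation):

    from gensim.models import Word2Vec

    texts = ["character embeddings capture context", "similar characters occur in similar contexts"]
    char_sequences = [list(t) for t in texts]      # each character becomes a token

    model = Word2Vec(sentences=char_sequences, vector_size=16, window=3,
                     min_count=1, sg=1, epochs=100)

    # Characters that appear in similar contexts end up with similar vectors.
    print(model.wv.most_similar("a", topn=5))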
Category: Data Science

What are the differences between Knowledge Graph Embeddings (KGE) and Graph Neural Networks (GNN)?

From page 3 of this paper, Knowledge Graph Embeddings and Explainable AI, the authors write: "Note that knowledge graph embeddings are different from Graph Neural Networks (GNNs). KG embedding models are in general shallow and linear models and should be distinguished from GNNs [78], which are neural networks that take relational structures as inputs." However, it's still vague to me. It seems that we can get embeddings from both of them. What is the difference? How should we choose …
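To make "shallow and linear" concrete, a KGE model such as TransE just learns one vector per entity and relation and scores a triple directly, with no message passing over the graph; a toy sketch with made-up entities:

    import numpy as np

    dim, rng = 50, np.random.default_rng(0)
    entity_emb = {"Paris": rng.normal(size=dim), "France": rng.normal(size=dim)}
    relation_emb = {"capital_of": rng.normal(size=dim)}

    def transe_score(head, relation, tail):
        # TransE: a triple (h, r, t) is plausible if h + r lands close to t.
        return -np.linalg.norm(entity_emb[head] + relation_emb[relation] - entity_emb[tail])

    print(transe_score("Paris", "capital_of", "France"))
    # A GNN, by contrast, computes node representations by aggregating neighbour
    # features through neural layers and uses those representations downstream.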
Category: Data Science

A way to initialize sentence embeddings for unsupervised text clustering, better than GloVe word vectors?

For unsupervised text clustering, the key thing is the initial embedding of the text. If we want to use DeepCluster for text, the problem is how to get this initial embedding from a deep model; BERT does not give a good initial embedding out of the box. If we do not use a deep model, is there a better way to get embeddings than averaged GloVe word vectors?
Category: Data Science

How are the embedding and context matrices created and updated in word embedding?

I am struggling to understand how word embedding works, especially how the embedding matrix $W$ and context matrix $W'$ are created and updated. I understand that the input may be a one-hot encoding of a given word $x_i$, and that the output may be the word most likely to appear near $x_i$. Would you have a very simple mathematical example?
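A tiny numerical sketch of one skip-gram update with a softmax output, showing how both $W$ and $W'$ get adjusted for a single (centre, context) pair; sizes and indices are made up for illustration:

    import numpy as np

    V, d, lr = 5, 3, 0.1                        # vocab size, embedding dim, learning rate
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(V, d))      # embedding matrix (input vectors)
    W_ctx = rng.normal(scale=0.1, size=(d, V))  # context matrix W' (output vectors)

    center, context = 2, 4                      # word indices for one training pair

    # forward: one-hot(center) @ W just selects row `center` of W
    h = W[center]                               # hidden layer, shape (d,)
    scores = h @ W_ctx                          # shape (V,)
    y = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

    # backward: cross-entropy gradient is (predicted - one-hot(context))
    e = y.copy()
    e[context] -= 1.0
    grad_h = W_ctx @ e                          # gradient w.r.t. the hidden vector
    W_ctx -= lr * np.outer(h, e)                # update the context matrix W'
    W[center] -= lr * grad_h                    # update only the centre word's row of W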
Category: Data Science

Are there any graph embedding algorithms like this already?

I wrote an algorithm for generating node embeddings based on the graph's topology. Most of the explanation is done in the readme file and the examples. The question is: Am I reinventing the wheel? Does this approach have any practical advantages over existing solutions for embeddings generation? Yes, I'm aware there are many algorithms for this based on random walks, but this one is pure deterministic linear algebra and it is quite simple, from my perspective. In short, the algorithm …
Category: Data Science

Key generation from feature vectors in high dimensions

I welcome any suggestions for the following hard problem: I have a dataset of float feature vectors of size 512, where each feature vector is extracted from a face image. I want to generate a key from a given feature vector (this key can be a number, binary code, etc.) that is consistent for each person, without comparisons between feature vectors. The only input I have is the given feature vector. For example, if I see a photo of me I want …
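One building block that matches "binary code without pairwise comparisons" is random-hyperplane hashing: project the 512-dim vector onto fixed random directions and keep the signs. This is only a sketch: it produces identical keys only when two feature vectors fall on the same side of every hyperplane, so small feature changes can still flip bits.

    import numpy as np

    rng = np.random.default_rng(42)             # fixed seed so the key is reproducible
    hyperplanes = rng.normal(size=(64, 512))    # 64-bit code from 512-dim features

    def feature_to_key(feature_vec):
        bits = (hyperplanes @ feature_vec) > 0
        return "".join("1" if b else "0" for b in bits)

    key = feature_to_key(np.random.rand(512))   # placeholder feature vector
    print(key)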
Category: Data Science

Why does averaging word embedding vectors (extracted from the NN embedding layer) work to represent sentences?

I'm puzzled about why averaging word embeddings works to obtain a sentence embedding, in particular considering the exercise in this post: How to obtain vector representation of phrases using the embedding layer and do PCA with it. My actual question is to understand the theory behind that more practical post. The answer to the linked question uses a method for sentence embedding that averages the word embeddings (in the most naive and simplest …
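For concreteness, the method under discussion is just this (a sketch with a made-up two-dimensional lookup table standing in for an embedding layer or GloVe):

    import numpy as np

    # toy lookup table: word -> embedding vector
    emb = {"the": np.array([0.1, 0.3]),
           "cat": np.array([0.9, -0.2]),
           "sat": np.array([0.4, 0.5])}

    def sentence_embedding(tokens):
        # naive sentence representation: component-wise mean of the word vectors
        return np.mean([emb[t] for t in tokens], axis=0)

    print(sentence_embedding(["the", "cat", "sat"]))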
Category: Data Science

How to extract embeddings of categorical variables

I am a little bit confused about encoding categorical variables. There are other posts/blog posts on this issue, but none addresses the problem I am facing. I have a dataset with mixed variables (i.e., numerical as well as categorical). Some of the categorical variables have a lot of categories (close to 100), so instead of using one-hot encoding I am looking into using embeddings. My goal is to use the embeddings of the categorical variables and extract them and …
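A minimal sketch of learning an entity embedding for one high-cardinality categorical column with Keras and then extracting the learned vectors; the column sizes, target, and data below are placeholders:

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    n_categories, emb_dim = 100, 8
    cat = np.random.randint(0, n_categories, size=(1000, 1))   # integer-encoded categorical column
    num = np.random.randn(1000, 3)                              # numerical columns
    y = np.random.randint(0, 2, size=(1000,))                   # placeholder binary target

    cat_in = layers.Input(shape=(1,))
    num_in = layers.Input(shape=(3,))
    emb = layers.Embedding(n_categories, emb_dim, name="cat_embedding")(cat_in)
    emb = layers.Flatten()(emb)
    h = layers.Concatenate()([emb, num_in])
    h = layers.Dense(32, activation="relu")(h)
    out = layers.Dense(1, activation="sigmoid")(h)

    model = Model([cat_in, num_in], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit([cat, num], y, epochs=3, verbose=0)

    # one row per category; reusable as features in another model
    embedding_matrix = model.get_layer("cat_embedding").get_weights()[0]   # shape (100, 8)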
Category: Data Science

Graph embeddings of Wikidata items

I'm trying to use PyTorch BigGraph pre-trained embeddings of Wikidata items for disambiguation. The problem is that the results I am getting by using dot (or cosine) similarity are not great. For example, the similarity between the Python programming language and the snake with the same name is greater than between Python and Django. Does anybody know if there is a Wikidata embedding that results in better similarities? The only alternative I've found is Webmembedder embeddings but they are incomplete. …
Category: Data Science

How to choose a good number of dimensions for an autoencoder?

I'm using an autoencoder for feature extraction, and I am stuck on how to choose a good number of dimensions for the encoder (latent) layer. After training on the dataset, the model gives a latent (embedding) layer with some zero values in the resulting vectors. For example, with a 4-dimensional embedding layer, one unit of the embedding layer has the value [0.67 0.0 2.13 0.43], whereas I expected all 4 values to be different from zero. I think my problem is that I chose too many …
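One practical (if crude) way to pick the latent size is to train the same autoencoder with several bottleneck widths and compare validation reconstruction error, then pick the elbow of that curve; a sketch with placeholder data and sizes:

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    x = np.random.rand(2000, 30).astype("float32")     # placeholder feature matrix

    def build_autoencoder(latent_dim, input_dim=30):
        inp = layers.Input(shape=(input_dim,))
        # note: a ReLU bottleneck can legitimately output exact zeros for some units
        z = layers.Dense(latent_dim, activation="relu")(inp)
        out = layers.Dense(input_dim)(z)
        model = Model(inp, out)
        model.compile(optimizer="adam", loss="mse")
        return model

    for latent_dim in [2, 4, 8, 16]:
        model = build_autoencoder(latent_dim)
        hist = model.fit(x, x, epochs=20, validation_split=0.2, verbose=0)
        print(latent_dim, hist.history["val_loss"][-1])  # compare and pick the elbow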
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.