Why is 10000 used as the denominator in Positional Encodings in the Transformer Model?

I was working through the Attention Is All You Need paper, and while the motivation for positional encodings makes sense, and the other Stack Exchange answers filled me in on the motivation for their structure, I still don't understand why $1/10000$ was used as the scaling factor for the $pos$ of a word. Why was this number chosen?
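For reference, a minimal NumPy sketch of the sinusoidal encoding defined in the paper, with the 10000 base appearing as the frequency denominator (function and variable names here are illustrative, not from the paper):

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model, base=10000.0):
    positions = np.arange(max_len)[:, np.newaxis]   # (max_len, 1), the word positions
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model), the embedding dimensions
    # each pair of dimensions shares the frequency 1 / base^(2i/d_model)
    angle_rates = 1.0 / np.power(base, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)

The base only sets how the wavelengths are spread out, from $2\pi$ up to $10000 \cdot 2\pi$ across the dimensions.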
Category: Data Science

How to train custom word2vec embeddings to find related articles?

I am a beginner in machine learning. My project is to build an AI-based search engine that shows related articles when someone searches on the website. For this I decided to train my own embeddings. I found two methods for this: one is to train a network to predict the next word (i.e. inputs=[the quick, the quick brown, the quick brown fox] and outputs=[brown, fox, lazy]); the other method is to train with nearest words (i.e. [brown,fox], [brown,quick], [brown,quick]). Which method should I use, and after training how should I …
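For what it's worth, both formulations correspond to standard word2vec training modes (CBOW-style context prediction versus skip-gram word pairs), and a library such as gensim lets you try either without writing the network yourself. A minimal sketch, assuming gensim 4.x and that the articles are already tokenized (the toy corpus is a placeholder):

from gensim.models import Word2Vec

# each article is a list of tokens; replace with your own corpus
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["machine", "learning", "powers", "the", "search", "engine"],
]

# sg=1 trains skip-gram (predict nearby words); sg=0 trains CBOW
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=20)

# after training, related terms can be looked up by vector similarity
print(model.wv.most_similar("search", topn=5))

To rank related articles you could then average the word vectors of each article (or train Doc2Vec instead) and compare articles by cosine similarity.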
Category: Data Science

Is it possible to add new vocabulary to BERT's tokenizer when fine-tuning?

I want to fine-tune BERT by training it on a domain dataset of my own. The domain is specific and includes many terms that probably weren't included in the original dataset BERT was trained on. I know I have to use BERT's tokenizer, as the model was originally trained on its embeddings. To my understanding, words unknown to the tokenizer will be replaced with [UNK]. What if some of these words are common in my dataset? Does it make sense …
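In case it is useful, Hugging Face's transformers library does let you append terms to the tokenizer and resize the embedding matrix before fine-tuning; the new rows are randomly initialized and learned during fine-tuning. A minimal sketch (the example domain terms are placeholders):

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# hypothetical domain-specific terms
new_terms = ["electrocardiogram", "myocarditis"]
num_added = tokenizer.add_tokens(new_terms)

# grow the embedding matrix so the new token ids have vectors
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")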
Category: Data Science

Cluster words into groups of similar meaning (synonyms)

How can words be clustered into groups of similar meaning (synonyms)? I started with pre-trained word embeddings (e.g., Google News), which is great, but not perfect - a limitation arises because the word embeddings are based on surrounding words. This leads to problematic results. For example, polar meanings: word embeddings might find opposites to be similar. Even though such words mean the opposite semantically, they can quite readily be interchanged given the same preceding and following words. For example, "terrible" and …
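One common baseline is k-means over the pre-trained vectors; a sketch assuming vectors in word2vec format and scikit-learn (the file path, word list and cluster count are placeholders):

import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# load pre-trained vectors (path is a placeholder)
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

words = ["good", "great", "terrible", "awful", "car", "automobile"]
vectors = np.array([kv[w] for w in words])
# normalize so Euclidean k-means approximates cosine-based clustering
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
for word, label in zip(words, labels):
    print(label, word)

As the question notes, purely distributional vectors can still put antonyms in the same cluster; one mitigation is to adjust the vectors with a lexical resource such as WordNet (retrofitting) before clustering.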
Category: Data Science

How to access an embedding table that is too large to fully load into memory?

I'm currently trying to find a way of loading/deserializing a .json file containing Flair word embeddings that is too large to fit in my RAM at once (>60GB .json with 32GB of RAM). My current code for loading the embedding is below.

def get_embedding_table(config):
    words_id2vec = json.load(open(config.words_id2vector_filename, 'r'))
    words_vectors = [0] * len(words_id2vec)
    for id, vec in words_id2vec.items():
        words_vectors[int(id)] = vec
    words_vectors.append(list(np.random.uniform(0, 1, config.embedding_dim)))
    words_embedding_table = tf.Variable(name='words_emb_table', initial_value=words_vectors, dtype=tf.float32)

The rest of the code that I am trying to reproduce …
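One way around the memory limit, sketched below, is to convert the JSON once into a NumPy array on disk and then open it memory-mapped, so only the rows that are actually accessed get paged into RAM; the file names here are placeholders:

import numpy as np

# one-off conversion step (needs a process that can stream the JSON,
# e.g. with a streaming parser such as ijson, writing rows incrementally):
# np.save("words_emb.npy", words_vectors_array)

# afterwards, load it memory-mapped: rows are read from disk on access
emb = np.load("words_emb.npy", mmap_mode="r")
vec = emb[42]            # only this row is pulled into memory

# in TensorFlow you could then look up the rows needed per batch from the
# mmap instead of materializing one giant tf.Variable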
Category: Data Science

How to train millions of doc2vec embeddings using GPU?

I am trying to train a doc2vec model based on user browsing history (URLs tagged to user_id), using the Chainer deep learning framework. There are more than 20 million embeddings (user_ids and URLs) to initialize, which don't fit in GPU memory (maximum available 12 GB). Training on CPU is very slow. I am giving it an attempt using the Chainer code given here. Please advise options to try, if any.
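If the GPU memory ceiling turns out to be a hard blocker, one alternative (not the Chainer approach from the question) is gensim's CPU Doc2Vec, which streams data and parallelizes across cores instead of holding all embeddings on a GPU. A minimal sketch, assuming gensim 4.x and made-up browsing data:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# each "document" is one user's browsing history: URLs are the tokens,
# the user_id is the document tag (placeholder data)
histories = [
    TaggedDocument(words=["example.com/a", "example.com/b"], tags=["user_1"]),
    TaggedDocument(words=["example.com/b", "example.com/c"], tags=["user_2"]),
]

model = Doc2Vec(histories, vector_size=128, window=5, min_count=1,
                workers=8, epochs=10)

user_vec = model.dv["user_1"]   # learned user embedding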
Category: Data Science

Which model is better able to understand the difference that two sentences are talking about different things?

I'm currently working on the task of measuring semantic proximity between sentences. I use fastText train_unsupervised (skipgram) for this. I extract the sentence embeddings and then measure the cosine similarity between them. However, I ran into the following problem: the cosine similarity between the embeddings of the sentences "Create a documentation of product A" and "He is creating a documentation of product B" is very high (>0.9). Obviously that is because both of them are about creating documentation, but the first …
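For reference, a minimal sketch of the setup described, using the fasttext Python package (the corpus file name is a placeholder):

import numpy as np
import fasttext

# train on your own corpus; 'corpus.txt' is a placeholder
model = fasttext.train_unsupervised("corpus.txt", model="skipgram")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = model.get_sentence_vector("Create a documentation of product A")
v2 = model.get_sentence_vector("He is creating a documentation of product B")
print(cosine(v1, v2))

Because fastText's sentence vector is essentially an average of word vectors, sentences that share most of their words will almost always score high; encoders trained on sentence-pair objectives (e.g. SBERT-style bi-encoders) are generally better at separating this kind of difference.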
Category: Data Science

Can I get un-normalized vectors from the TF USE model?

I'm using this Universal Sentence Encoder (USE) model to get embeddings of a set of texts, each text corresponding to a newspaper article. In order to build a Recommender System, I generate user embeddings by averaging the embeddings of items a user has read, and then I look for other texts that are cosine-similar to this user (basically, the method returns a set of items that are similar to this user embedding). Now, the problem is that the mentioned model …
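For concreteness, a small sketch of that pipeline with tensorflow_hub, assuming the standard TF Hub endpoint for USE (the article texts are placeholders); checking the vector norms directly shows whether the model output is already length-normalized:

import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

articles = ["First newspaper article text ...", "Second article text ..."]
item_embeddings = embed(articles).numpy()

# if these norms are all ~1.0, the outputs are normalized and magnitude
# information cannot be recovered from this endpoint
print(np.linalg.norm(item_embeddings, axis=1))

# user embedding = average of the items the user has read
user_embedding = item_embeddings.mean(axis=0)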
Category: Data Science

Initializing weights that are a pointwise product of multiple variables

In two-layer perceptrons that slide across words of text, such as word2vec and fastText, hidden-layer weights may be a pointwise product of two random variables, such as positional embeddings and word embeddings (Mikolov et al. 2017, Section 2.2): $$v_c = \sum_{p\in P} d_p \odot u_{t+p}$$ However, it's unclear to me how best to initialize the two variables. When only word embeddings are used for the hidden-layer weights, word2vec and fastText initialize them to $\mathcal{U}(-1 / \text{fan\_out}; 1 / \text{fan\_out})$. …
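A minimal NumPy sketch of the composition in that equation, with one possible initialization (an assumption, not something taken from the paper): word vectors drawn from the usual small uniform range as described above, and positional vectors started at ones, so the product d_p * u_{t+p} initially reduces to the plain bag-of-words case:

import numpy as np

dim = 100            # embedding dimension
vocab = 10000        # vocabulary size
window = 5           # positions -window..-1, 1..window

rng = np.random.default_rng(0)

# word (input) embeddings: small uniform init, as in the question
U = rng.uniform(-1.0 / dim, 1.0 / dim, size=(vocab, dim))

# positional embeddings d_p: initialized to ones here (an assumed choice),
# so the elementwise product starts out equal to u_{t+p}
D = np.ones((2 * window, dim))

# context vector for one window of word ids (placeholder ids)
window_word_ids = rng.integers(0, vocab, size=2 * window)
v_c = np.sum(D * U[window_word_ids], axis=0)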
Category: Data Science

How to train neural word embeddings?

I am new to Deep Learning and NLP. I have read several blog posts on Medium and Towards Data Science, as well as papers, that talk about pre-training the word embeddings in an unsupervised fashion and then using them in a supervised DNN. But recently I read a blog post which suggested that training the word embeddings while training the neural network gives better results. This is the other link. So my question is: which one should I follow? Some YouTube videos that I …
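Both options can be expressed with the same layer; a minimal Keras sketch (vocabulary size, dimensions and the pretrained matrix are placeholders) showing a randomly initialized trainable embedding versus one seeded from pretrained vectors:

import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 20000, 100
pretrained = np.random.rand(vocab_size, embed_dim)  # stand-in for GloVe/word2vec

# option 1: learn the embeddings from scratch along with the task
learned = tf.keras.layers.Embedding(vocab_size, embed_dim)

# option 2: start from pretrained vectors; trainable=True fine-tunes them
# with the network, trainable=False keeps them frozen
pretrained_layer = tf.keras.layers.Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=True,
)

A common compromise is to initialize from pretrained vectors and fine-tune them, which often works well when the labeled dataset is small.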
Category: Data Science

Contextual word embeddings from pretrained word2vec vectors

I would like to create word embeddings that take context into account, so the vector of the word Jaguar [animal] would be different from the word Jaguar [car brand]. As you know, word2vec only gives one representation for a given word, and I would like to take already pretrained embeddings and enrich them with context. So far I've tried a simple approach of taking the average of the word vector and a category-word vector, for example like this. Now I would …
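A sketch of that averaging approach with gensim KeyedVectors (the file path, category words and blend weight are placeholders):

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def contextual_vector(word, context_word, alpha=0.5):
    # simple linear blend of the word vector and a disambiguating category word
    return alpha * kv[word] + (1 - alpha) * kv[context_word]

jaguar_animal = contextual_vector("jaguar", "animal")
jaguar_car = contextual_vector("jaguar", "car")

The alpha weight controls how strongly the category word pulls the representation; whether a linear blend is enough is exactly the question that contextual models such as ELMo and BERT were designed to address.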
Category: Data Science

Is there a sensible notion of 'character embeddings'?

There are several popular word embeddings available (e.g., fastText and GloVe); in short, those embeddings are a tool to encode words along with a sensible notion of semantics attached to those words (i.e. words with similar semantics are nearly parallel). Question: Is there a similar notion of character embedding? By 'character embedding' I understand an algorithm that allows us to encode characters in order to capture some syntactic similarity (i.e. similarity of character shapes or contexts).
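Character embeddings in the "context" sense are standard in NLP, for example as inputs to character-level language models or character-aware NER taggers. A minimal sketch of a trainable character embedding table in Keras (the alphabet and dimensions are arbitrary):

import tensorflow as tf

alphabet = "abcdefghijklmnopqrstuvwxyz"
char_to_id = {c: i + 1 for i, c in enumerate(alphabet)}  # 0 reserved for padding

char_emb = tf.keras.layers.Embedding(input_dim=len(alphabet) + 1, output_dim=16)

word = "hello"
ids = tf.constant([[char_to_id[c] for c in word]])
vectors = char_emb(ids)   # shape (1, 5, 16), learned during task training

Such vectors capture contextual and orthographic regularities learned from the training task; capturing visual shape similarity would need a different signal, such as glyph images.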
Category: Data Science

How to train NER LSTM on single sentence level

My documents are only a single sentence long, containing one annotation. Sentences with the same named entity are of course similar, but not context-wise. NER training examples (AFAIK) always have documents that are sequentially related, i.e. the next document is context-wise related to the previous document. Consider the example below. The first sentence is about the US, with location annotations. The second sentence is about an organisation but still related to the previous one. The United States of America (LOC), commonly known as …
Category: Data Science

A way to initialize sentence embeddings for unsupervised text clustering, better than GloVe word vectors?

For unsupervised text clustering, the key thing is the initial embedding for the text. If we want to use DeepCluster for text, the problem is how to get the initial embedding from a deep model. BERT does not give a good initial embedding. If we do not use a deep model, is there a better way to get embeddings than GloVe word vectors?
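One option that often gives stronger initial embeddings than averaged GloVe vectors is a sentence-level encoder; a sketch with the sentence-transformers package (the model name is one common choice, not a requirement, and the texts are placeholders):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = ["first document ...", "second document ...", "third document ..."]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(texts, normalize_embeddings=True)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

Plain BERT [CLS] or mean-pooled vectors are indeed weak for clustering out of the box; sentence-pair fine-tuning is generally what makes the difference.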
Category: Data Science

Word-level text generation with word embeddings – outputting a word vector instead of a probability distribution

I am currently researching the topic of text generation for my university project. I decided (of course) to go with an RNN that gets a sequence of tokens as input, with the target of predicting the next token given the sequence. I have been reading through a number of tutorials, and there is one thing that I am wondering about. The sources I have read, regardless of how they encode the X sequences (one-hot or word embeddings), encode the y target tokens …
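For what it's worth, predicting an embedding vector directly instead of a softmax distribution is easy to write down; a minimal Keras sketch with a dense output of embedding size trained against the target word's vector using a cosine-similarity loss (all sizes are placeholders):

import tensorflow as tf

vocab_size, embed_dim = 10000, 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(embed_dim),   # outputs a vector, not a distribution
])

# y_true is the embedding vector of the next word, not a one-hot class
model.compile(optimizer="adam",
              loss=tf.keras.losses.CosineSimilarity(axis=-1))

At inference time the predicted vector is mapped back to a word by nearest-neighbor search over the embedding matrix; in practice the softmax formulation usually trains more reliably, which is why most tutorials use it.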
Category: Data Science

How are the embedding and context matrices created and updated in word embedding?

I am struggling to understand how word embedding works, especially how the embedding matrix $W$ and context matrix $W'$ are created and updated. I understand that as input we may have a one-hot encoding of a given word, and that as output we may have the word most likely to be near this word $x_i$. Would you have a very simple mathematical example?
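A tiny worked example, a sketch of one skip-gram training step with a 4-word vocabulary and 2-dimensional embeddings (all numbers are made up):

import numpy as np

vocab = ["the", "cat", "sat", "mat"]
V, d = 4, 2

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (V, d))        # embedding matrix: one row per input word
W_ctx = rng.normal(0, 0.1, (V, d))    # context matrix W': one row per output word

center, context = 1, 2                # center word "cat", context word "sat"

h = W[center]                               # hidden layer = embedding of "cat"
scores = W_ctx @ h                          # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

# gradient of the cross-entropy loss with respect to the scores
err = probs.copy()
err[context] -= 1.0

lr = 0.1
grad_h = W_ctx.T @ err                      # gradient w.r.t. the hidden layer
W_ctx -= lr * np.outer(err, h)              # update the context matrix W'
W[center] -= lr * grad_h                    # update the "cat" row of W

Both matrices are just randomly initialized tables that get nudged this way for every (center, context) pair in the corpus.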
Category: Data Science

Why is Word2vec regarded as a neural embedding?

In the skip-gram model, the probability that a word $w$ is part of the set of context words $\{w_o^{(i)}\}$ $(i = 1:m)$, where $m$ is the context window around the central word, is given by: $$p(w_o \mid w_c) = \frac{\exp(\vec{u}_o \cdot \vec{v}_c)}{\sum_{i\in V}\exp(\vec{u}_i \cdot \vec{v}_c)}$$ where $V$ is the vocabulary, $\vec{u}_i$ is the word embedding for a context word and $\vec{v}_c$ is the word embedding for the central word. But this type of model is defining …
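For concreteness, the formula computed directly in NumPy (the vectors here are random stand-ins for the learned embeddings):

import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50
U = rng.normal(size=(V, d)) * 0.1   # context ("output") vectors u_i
v_c = rng.normal(size=d) * 0.1      # central word vector v_c

o = 7                                # index of the context word w_o
p = np.exp(U[o] @ v_c) / np.exp(U @ v_c).sum()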
Category: Data Science
