I was working through the "Attention Is All You Need" paper, and while the motivation for positional encodings makes sense and the other Stack Exchange answers filled me in on the motivation for their structure, I still don't understand why $1/10000$ was used as the scaling factor for the $pos$ of a word. Why was this number chosen?
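For reference, a minimal sketch of the sinusoidal encoding the question is about, showing where the $10000$ base enters (shapes and sizes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model, base=10000.0):
    """PE[pos, 2i] = sin(pos / base^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angle = pos / base ** (2 * i / d_model)    # each frequency scales pos by 1/base^(2i/d)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions
    pe[:, 1::2] = np.cos(angle)                # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```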
I am a beginner in machine learning. My project is to build an AI-based search engine that shows related articles when someone searches on the website. For this I decided to train my own embeddings. I found two methods for this: one is to train a network to predict the next word (i.e. inputs=[the quick, the quick brown, the quick brown fox] and outputs=[brown, fox, lazy]); the other is to train on nearby words (i.e. [brown,fox], [brown,quick], [brown,quick]). Which method should I use, and after training how should I …
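The second setup is essentially what word2vec's skip-gram/CBOW training does; a minimal sketch with gensim (the corpus and parameters are placeholders):

```python
from gensim.models import Word2Vec

# Toy corpus: in practice each article would be a list of tokens.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["a", "fast", "brown", "fox", "leaps", "over", "a", "sleepy", "dog"],
]

# sg=1 -> skip-gram (predict nearby words from the centre word); sg=0 -> CBOW.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("fox", topn=3))  # nearest words by cosine similarity
```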
I want to fine-tune BERT by training it on a domain dataset of my own. The domain is specific and includes many terms that probably weren't in the corpus BERT was originally trained on. I know I have to use BERT's tokenizer, as the model was originally trained on its embeddings. To my understanding, words unknown to the tokenizer will be mapped to [UNK]. What if some of these words are common in my dataset? Does it make sense …
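One common approach for frequent domain terms is to add them to the tokenizer and resize the model's embedding matrix before fine-tuning; a sketch with the Hugging Face transformers library (the added tokens are made-up examples):

```python
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical domain-specific terms that the stock vocabulary handles badly.
new_tokens = ["electrocardiogram", "myocarditis"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix; the new rows are randomly initialised
# and get trained during fine-tuning on the domain corpus.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, vocab size is now {len(tokenizer)}")
```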
How can words be clustered into groups of similar meaning (synonyms)? I started with pre-trained word embeddings (e.g., Google News), which are great but not perfect: a limitation arises because the word embeddings are based on surrounding words, and this introduces challenging results. For example, polar meanings: word embeddings may find opposites to be similar. Even though such words mean the opposite semantically, they can quite readily be interchanged given the same preceding and following words. For example, "terrible" and …
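A minimal sketch of the clustering step itself, assuming the Google News vectors via gensim's downloader and k-means over a small word list (the word list and number of clusters are illustrative):

```python
import gensim.downloader as api
from sklearn.cluster import KMeans

wv = api.load("word2vec-google-news-300")  # pre-trained Google News embeddings

words = ["terrible", "awful", "great", "excellent", "car", "truck"]
vectors = [wv[w] for w in words]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
for word, label in zip(words, kmeans.labels_):
    print(word, label)  # antonyms like "terrible"/"great" may still land in one cluster
```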
I have a school project and need to use the embeddings generated by BERT (for example, mBERT) with a classifier such as SVM or CNN. Any help, please. Thank you!
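A minimal sketch of one way to do this, assuming the transformers library for feature extraction and scikit-learn's SVM on top (the texts and labels are toy placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.svm import SVC

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # mBERT
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(texts):
    """Return one [CLS] vector per text (features only, no fine-tuning)."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state[:, 0, :].numpy()  # [CLS] token embedding

texts = ["great movie", "terrible movie"]          # toy training data
labels = [1, 0]

clf = SVC(kernel="linear").fit(embed(texts), labels)
print(clf.predict(embed(["what a great film"])))
```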
I'm currently trying to find a way of loading/deserializing a .json file containing Flair word embeddings that is too large to fit in my RAM at once (a >60 GB .json with 32 GB of RAM). My current code for loading the embedding is below.

    import json
    import numpy as np
    import tensorflow as tf

    def get_embedding_table(config):
        # Loads the entire .json into memory at once -- this is what exceeds RAM.
        words_id2vec = json.load(open(config.words_id2vector_filename, 'r'))
        words_vectors = [0] * len(words_id2vec)
        for word_id, vec in words_id2vec.items():   # place each vector at its word id
            words_vectors[int(word_id)] = vec
        # One extra randomly initialised vector (e.g. for unknown words).
        words_vectors.append(list(np.random.uniform(0, 1, config.embedding_dim)))
        words_embedding_table = tf.Variable(
            name='words_emb_table', initial_value=words_vectors, dtype=tf.float32)
        return words_embedding_table

The rest of the code that I am trying to reproduce …
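One option worth trying is to stream the key/value pairs instead of loading the whole object at once, for example with the ijson library and a pre-allocated NumPy array. A rough sketch, under the assumption that the file is a single top-level JSON object mapping ids to vectors (vocab_size and the config attributes mirror the code above):

```python
import ijson
import numpy as np
import tensorflow as tf

def get_embedding_table_streaming(config, vocab_size):
    # Pre-allocate the final float32 array so vectors never pile up as Python lists.
    table = np.zeros((vocab_size + 1, config.embedding_dim), dtype=np.float32)
    with open(config.words_id2vector_filename, 'rb') as f:
        # Iterate over the top-level "id": [vector] pairs without loading the whole file.
        for word_id, vec in ijson.kvitems(f, ''):
            table[int(word_id)] = np.asarray(vec, dtype=np.float32)
    table[vocab_size] = np.random.uniform(0, 1, config.embedding_dim)  # unknown-word row
    return tf.Variable(table, name='words_emb_table')
```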
I am trying to train a doc2vec model on user browsing history (URLs tagged with user_id). I use the Chainer deep learning framework. There are more than 20 million embeddings (user_ids and URLs) to initialize, which don't fit in GPU memory (maximum available 12 GB). Training on the CPU is very slow. I am attempting this using the Chainer code given here. Please advise on any options to try.
I'm currently working on the task of measuring semantic proximity between sentences. I use fasttext train_unsupervised (skipgram) for this. I extract the sentence embeddings and then measure the cosine similarity between them. However, I ran into the following problem: the cosine similarity between the embeddings of the sentences "Create a documentation of product A" and "he is creating a documentation of product B" is very high (>0.9). Obviously this is because both of them are about creating documentation. But however the first …
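For reference, a minimal sketch of the pipeline described, assuming the official fasttext Python bindings and a plain-text training file (the file name is a placeholder):

```python
import fasttext
import numpy as np

# Train an unsupervised skip-gram model on a plain-text corpus (one sentence per line).
model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = model.get_sentence_vector("Create a documentation of product A")
s2 = model.get_sentence_vector("he is creating a documentation of product B")
print(cosine(s1, s2))  # tends to be high: averaged word vectors blur what differs
```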
I'm using this Universal Sentence Encoder (USE) model to get embeddings for a set of texts, each text corresponding to a newspaper article. In order to build a recommender system, I generate user embeddings by averaging the embeddings of the items a user has read, and then I look for other texts that are cosine-similar to this user embedding (basically, the method returns a set of items that are similar to this user embedding). Now, the problem is that the mentioned model …
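A minimal sketch of that averaging-plus-cosine setup, assuming the TF Hub release of USE (the module URL/version, the articles, and the read history are placeholders):

```python
import numpy as np
import tensorflow_hub as hub

# Assumed TF Hub module; the question refers to a specific USE model.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

articles = ["Article about the economy ...", "Article about football ...",
            "Article about elections ..."]
item_vecs = embed(articles).numpy()              # (n_items, 512)

read_idx = [0, 2]                                # items this user has read
user_vec = item_vecs[read_idx].mean(axis=0)      # user embedding = mean of read items

# Cosine similarity between the user vector and every item, then rank items.
sims = item_vecs @ user_vec / (np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(user_vec))
print(np.argsort(-sims))
```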
In two-layer perceptrons that slide across words of text, such as word2vec and fastText, the hidden-layer weights may be a product of two sets of random variables, such as positional embeddings and word embeddings (Mikolov et al. 2017, Section 2.2): $$v_c = \sum_{p\in P} d_p \odot u_{t+p}$$ However, it's unclear to me how best to initialize the two variables. When only word embeddings are used for the hidden-layer weights, word2vec and fastText initialize them to $\mathcal{U}(-1 / \text{fan\_out}; 1 / \text{fan\_out})$. …
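A small numerical sketch of the quoted combination and the uniform initialization mentioned, under the assumption that both $d_p$ and the word vectors are initialized the same way (the window, dimensions, and ids are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100
P = [-2, -1, 1, 2]                      # context window offsets
bound = 1.0 / dim                        # 1 / fan_out, as described in the question

# Word embeddings u and positional embeddings d_p, both uniformly initialised.
U = rng.uniform(-bound, bound, size=(vocab_size, dim))
D = {p: rng.uniform(-bound, bound, size=dim) for p in P}

def hidden(word_ids, t):
    """v_c = sum over p of the elementwise product d_p * u_{t+p}."""
    return sum(D[p] * U[word_ids[t + p]] for p in P)

sentence = [5, 17, 42, 7, 99]            # toy sentence as word ids
v_c = hidden(sentence, t=2)              # hidden vector for the middle position
print(v_c.shape)                         # (100,)
```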
I have a scientific database with articles and coauthors. Using this database, I am training a word2vec model on the co-authors. The use case here is to disambiguate authors. I was wondering whether my approach can be improved; any suggestions will be greatly appreciated.
Code
So I am new to Deep Learning and NLP. I have read several blog posts on Medium and Towards Data Science, and papers, where they talk about pre-training the word embeddings in an unsupervised fashion and then using them in a supervised DNN. But recently I read a blog post which suggested that training the word embeddings while training the neural network gives better results. This is the other link. So my question is: which one should I follow? Some YouTube videos that I …
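For concreteness, the two setups usually differ only in how the embedding layer is configured; a minimal Keras sketch (the vocabulary size, dimensions, and the pretrained matrix are placeholders):

```python
import numpy as np
from tensorflow.keras import layers, models, initializers

vocab_size, emb_dim = 10000, 100
pretrained = np.random.rand(vocab_size, emb_dim).astype("float32")  # stand-in for GloVe/word2vec

def build(embedding_layer):
    return models.Sequential([
        embedding_layer,
        layers.GlobalAveragePooling1D(),
        layers.Dense(1, activation="sigmoid"),
    ])

# Option 1: pre-trained embeddings, kept frozen during supervised training.
frozen = layers.Embedding(vocab_size, emb_dim,
                          embeddings_initializer=initializers.Constant(pretrained),
                          trainable=False)

# Option 2: embeddings learned jointly with the task (random or pre-trained init).
learned = layers.Embedding(vocab_size, emb_dim, trainable=True)

model_a, model_b = build(frozen), build(learned)
model_a.compile(optimizer="adam", loss="binary_crossentropy")
model_b.compile(optimizer="adam", loss="binary_crossentropy")
```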
I would like to create word embeddings that take context into account, so the vector of the word Jaguar [animal] would be different from the vector of the word Jaguar [car brand]. As you know, word2vec only gives one representation for a given word, and I would like to take already pretrained embeddings and enrich them with context. So far I've tried a simple approach of taking the average of the vector of the word and the vector of a category word, for example like this. Now I would …
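A minimal sketch of the averaging idea described, assuming gensim's pre-trained vectors via the downloader (the model choice and the category/probe words are placeholders):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # any pre-trained KeyedVectors would do

def contextual_vector(word, category):
    """Crude sense vector: average of the word vector and a category word vector."""
    return (wv[word] + wv[category]) / 2.0

jaguar_animal = contextual_vector("jaguar", "animal")
jaguar_car = contextual_vector("jaguar", "car")

# Each averaged "sense" should sit closer to its own neighbourhood.
print(wv.cosine_similarities(jaguar_animal, [wv["tiger"], wv["bmw"]]))
print(wv.cosine_similarities(jaguar_car, [wv["tiger"], wv["bmw"]]))
```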
There are several popular word embeddings available (e.g., fastText and GloVe); in short, those embeddings are a tool to encode words along with a sensible notion of semantics attached to those words (i.e. words with similar semantics are nearly parallel). Question: is there a similar notion of character embedding? By 'character embedding' I understand an algorithm that allows us to encode characters in order to capture some syntactic similarity (i.e. similarity of character shapes or contexts).
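As a point of reference, the simplest form of this is an embedding table indexed by characters instead of words, whose vectors are then trained on some downstream or context-prediction objective; a minimal PyTorch sketch (the alphabet and dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Toy character vocabulary.
alphabet = "abcdefghijklmnopqrstuvwxyz"
char2id = {c: i for i, c in enumerate(alphabet)}

emb = nn.Embedding(num_embeddings=len(alphabet), embedding_dim=16)

def embed_word(word):
    """Encode a word as a sequence of character vectors (to feed into a CNN/RNN)."""
    ids = torch.tensor([char2id[c] for c in word.lower() if c in char2id])
    return emb(ids)                     # shape: (len(word), 16)

print(embed_word("jaguar").shape)       # torch.Size([6, 16])
```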
My documents are only a single sentence long, each containing one annotation. Sentences with the same named entity are of course similar, but not context-wise. NER training examples (as far as I know) always have sequentially related documents, i.e. the next document is context-wise related to the previous one. Consider the example below: the first sentence is about the US, with location annotations; the second sentence is about an organisation but is still related to the previous one. The United States of America (LOC), commonly known as …
For unsupervised text clustering, the key thing is the initial embedding for each text. If we want to use DeepCluster for text, the problem is how to get this initial embedding from a deep model; BERT does not give good initial embeddings. If we do not use a deep model, is there a better way to get embeddings than GloVe word vectors?
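One common non-deep baseline is to average GloVe vectors into a document vector and cluster those; a minimal sketch, assuming gensim's downloader for the vectors (the corpus and number of clusters are placeholders):

```python
import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans

wv = api.load("glove-wiki-gigaword-100")

def doc_embedding(text):
    """Average the GloVe vectors of the in-vocabulary tokens of a document."""
    vecs = [wv[t] for t in text.lower().split() if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

docs = ["the match ended in a draw", "the election results were announced",
        "the striker scored two goals", "parliament passed the new bill"]
X = np.stack([doc_embedding(d) for d in docs])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # these averaged vectors can serve as the initial embeddings for clustering
```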
I am currently researching the topic of text generation for my university project. I decided (of course) to go with an RNN that gets a sequence of tokens as input, with the target of predicting the next token given the sequence. I have been reading through a number of tutorials, and there is one thing that I am wondering about. The sources I have read, regardless of how they encode the X sequences (one-hot or word embeddings), encode the y target tokens …
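For reference, a minimal sketch of the next-token setup described, with integer-encoded inputs going through an embedding layer and integer targets used via sparse categorical cross-entropy (vocabulary size and shapes are illustrative; one-hot targets with plain categorical cross-entropy would be the equivalent alternative):

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, seq_len, emb_dim = 5000, 20, 64

model = models.Sequential([
    layers.Embedding(vocab_size, emb_dim),
    layers.LSTM(128),
    layers.Dense(vocab_size, activation="softmax"),   # distribution over the next token
])
# Integer y targets, so no one-hot encoding of the labels is required.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy data: X holds token-id sequences, y holds the id of the token that follows each one.
X = np.random.randint(0, vocab_size, size=(100, seq_len))
y = np.random.randint(0, vocab_size, size=(100,))
model.fit(X, y, epochs=1, verbose=0)
```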
I am struggling to understand how word embeddings work, especially how the embedding matrix $W$ and context matrix $W'$ are created/updated. I understand that the input may be a one-hot encoding of a given word, and that the output may be the word most likely to be found near this word $x_i$. Would you have any very simple mathematical example?
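A tiny worked example of the forward pass, with made-up numbers: take a vocabulary $\{\text{cat}, \text{sat}, \text{mat}\}$, 2-dimensional embeddings, and

$$W = \begin{pmatrix} 0.1 & 0.3 \\ 0.2 & 0.5 \\ 0.4 & 0.1 \end{pmatrix}, \qquad W' = \begin{pmatrix} 0.2 & 0.4 & 0.1 \\ 0.6 & 0.3 & 0.5 \end{pmatrix}.$$

For the input word "cat" with one-hot vector $x = (1, 0, 0)^\top$, the hidden layer is simply the first row of $W$: $h = W^\top x = (0.1, 0.3)^\top$. The scores for each vocabulary word are $u = W'^\top h = (0.20, 0.13, 0.16)^\top$, and a softmax turns them into probabilities $\approx (0.35, 0.32, 0.33)$. Training then adjusts $W$ and $W'$ by gradient descent on the cross-entropy loss, so that the probability of the word actually observed in the context goes up.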
In the skip-gram model, the probability that a word $w$ is part of the set of context words $\{w_o^{(i)}\}$ $(i = 1:m)$, where $m$ is the size of the context window around the central word $w_c$, is given by: $$p(w_o \mid w_c) = \frac{\exp(\vec{u}_o \cdot \vec{v}_c)}{\sum_{i\in V}\exp(\vec{u}_i \cdot \vec{v}_c)}$$ where $V$ is the vocabulary of the training set, $\vec{u}_i$ is the word embedding of a context word, and $\vec{v}_c$ is the word embedding of the central word. But this type of model is defining …