Temporal Fusion Transformer from PyTorch-Forecasting with Multiple Targets - 'list' error

I'm new to PyTorch and the PyTorch Forecasting library, and I'm trying to predict multiple targets using the Temporal Fusion Transformer model. I have 7 targets in a list as my targets variable. I'm using MultiLoss as my loss function with a list of 7 CrossEntropy loss functions (one per target variable). In the problem I'm trying to model, there are 7 possible outcomes per time step and I'm trying to find which option is most likely. I'm looking for a …
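For reference, a hedged sketch of how such a multi-target setup is usually wired together in pytorch-forecasting; `training_dataset` stands in for an assumed, pre-built TimeSeriesDataSet with a list of 7 categorical target columns, and the per-target class count of 7 is an assumption, not something stated above.

```python
# Sketch only: one CrossEntropy metric per target, wrapped in MultiLoss.
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import CrossEntropy, MultiLoss

n_targets = 7  # the question's 7 target variables

tft = TemporalFusionTransformer.from_dataset(
    training_dataset,                                   # assumed TimeSeriesDataSet (not shown)
    loss=MultiLoss([CrossEntropy() for _ in range(n_targets)]),
    output_size=[7] * n_targets,                        # assumption: 7 classes per target
)
```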
Category: Data Science

What enables transformers or very deep models to "plan" ahead for sequential decision making?

I was watching this amazing lecture by Oriol Vinyals. On one slide there is a question asking whether very deep models plan. Transformer models, or models employed in applications like dialogue generation, do not have an explicit planning component, yet they behave as if they already have the dialogue planned. Dr. Vinyals mentioned that there are papers on "how transformers are building up knowledge to answer questions or do all sorts of very interesting analyses". Can anyone please point me to a few …
Category: Data Science

Why is 10000 used as the denominator in Positional Encodings in the Transformer Model?

I was working through the "Attention Is All You Need" paper. While the motivation for positional encodings makes sense, and other Stack Exchange answers filled me in on the motivation for their structure, I still don't understand why $1/10000$ was used as the scaling factor for the $pos$ of a word. Why was this number chosen?
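As a small illustration of where the constant sits, here is a minimal NumPy sketch of the sinusoidal encoding from the paper; the 10000 base sets the longest wavelength, so the per-dimension frequencies form a geometric progression from $2\pi$ up to $10000 \cdot 2\pi$.

```python
# PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]              # (max_len, 1) word positions
    i = np.arange(0, d_model, 2)[None, :]          # even embedding dimensions
    angle = pos / np.power(10000.0, i / d_model)   # 10000 controls the wavelength range
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

print(sinusoidal_positional_encoding(max_len=50, d_model=16).shape)  # (50, 16)
```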
Category: Data Science

Self-Attention Summation and Loss of Information

In self-attention, the attention output for a word is calculated as: $$ A(q, K, V) = \sum_{i} \frac{\exp(q \cdot k^{<i>})}{\sum_{j} \exp(q \cdot k^{<j>})} v^{<i>} $$ My question is: why do we sum over the softmax-weighted value vectors? Doesn't this lose information about which other words in particular are important to the word under consideration? In other words, how does this summed vector point to which words are relevant? For example, consider two extreme scenarios where practically the entire output depends on the attention vector of word $x^{<t>}$, and …
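A tiny numeric sketch of that formula, with toy vectors of my own choosing: the softmax weights computed before the sum are what indicate which words matter, and the sum then mixes the value vectors according to those weights.

```python
import numpy as np

def attention(q, K, V):
    scores = K @ q                                      # q . k^<i> for every key
    weights = np.exp(scores) / np.exp(scores).sum()     # softmax over positions
    return weights, weights @ V                         # sum_i weights_i * v^<i>

q = np.array([1.0, 0.0])
K = np.array([[5.0, 0.0],    # key of a highly relevant word
              [0.0, 1.0],
              [0.1, 0.1]])
V = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])

w, context = attention(q, K, V)
print(w)        # weights concentrate almost entirely on the first word
print(context)  # so the summed vector is dominated by that word's value vector
```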
Category: Data Science

Transformer time series classification using time2vec positional embedding

I want to use a transformer model to do classification of fixed-length time series. I was following along with this tutorial using Keras, which uses time2vec as a positional embedding. According to the original time2vec paper, the representation is calculated as $$ \boldsymbol{t2v}(\tau)[i] = \begin{cases} \omega_i \tau + \phi_i, & i = 0\\ F(\omega_i \tau + \phi_i), & 1 \leq i \leq k \end{cases} $$ The mentioned tutorial simply concatenates this embedding with the input. Now, I understand the intention of the …
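For concreteness, here is a minimal Keras layer implementing the quoted formula with $F = \sin$ (my own sketch of the idea, not the tutorial's code); it returns the linear component plus $k$ periodic components, which can then be concatenated with the input features.

```python
import tensorflow as tf

class Time2Vec(tf.keras.layers.Layer):
    def __init__(self, k: int, **kwargs):
        super().__init__(**kwargs)
        self.k = k

    def build(self, input_shape):
        # i = 0 term (linear) and 1 <= i <= k terms (periodic)
        self.w0 = self.add_weight(name="w0", shape=(1,), initializer="uniform")
        self.p0 = self.add_weight(name="p0", shape=(1,), initializer="uniform")
        self.W = self.add_weight(name="W", shape=(self.k,), initializer="uniform")
        self.P = self.add_weight(name="P", shape=(self.k,), initializer="uniform")

    def call(self, tau):
        # tau: (batch, seq_len, 1) scalar time steps
        linear = self.w0 * tau + self.p0               # omega_0 * tau + phi_0
        periodic = tf.sin(tau * self.W + self.P)       # F(omega_i * tau + phi_i)
        return tf.concat([linear, periodic], axis=-1)  # (batch, seq_len, k + 1)

t = tf.reshape(tf.range(10, dtype=tf.float32), (1, 10, 1))
print(Time2Vec(k=7)(t).shape)  # (1, 10, 8)
```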
Category: Data Science

Gradient and loss calculation localization in Vision Transformers

Hi all, I am turning to you to figure out where the gradient and loss computations that update the q, k, v weights happen in Vision Transformers. I suspect it is the MLP/FF part of the architecture, but I am not entirely sure. I attach some code from lucidrains:

```python
import torch
from torch import nn

from einops import rearrange, repeat
from einops.layers.torch import Rearrange

# helpers

def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes

class PreNorm(nn.Module):
    def __init__(self, dim, …
```
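Purely as an illustration (using a plain nn.MultiheadAttention stand-in rather than the lucidrains ViT code): the q/k/v projection weights receive gradients from the same backward pass that starts at the classification loss; there is no separate loss local to the attention block.

```python
import torch
from torch import nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
head = nn.Linear(16, 3)                        # stand-in for the MLP classification head

x = torch.randn(2, 5, 16)
out, _ = attn(x, x, x)                         # q, k, v all produced from x via in_proj_weight
logits = head(out.mean(dim=1))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 2]))
loss.backward()

# in_proj_weight stacks the q, k and v projections; its gradient is non-zero,
# so the loss computed after the head reaches the attention parameters.
print(attn.in_proj_weight.grad.abs().sum() > 0)  # tensor(True)
```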
Category: Data Science

mBART training "CUDA out of memory"

I want to train a network with the mBART model in Google Colab, but I get: RuntimeError: CUDA out of memory. Tried to allocate 886.00 MiB (GPU 0; 15.90 GiB total capacity; 13.32 GiB already allocated; 809.75 MiB free; 14.30 GiB reserved in total by PyTorch). I subscribed to Colab with a GPU. I tried using 128 and 64 for the maximum total input sequence length. What can I do to fix the problem?
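Besides shortening sequences, the usual levers are a smaller per-step batch, gradient accumulation, gradient checkpointing and fp16. A hedged sketch using argument names from recent transformers versions (the output directory is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="mbart-finetune",        # placeholder path
    per_device_train_batch_size=2,      # small per-step batch
    gradient_accumulation_steps=8,      # effective batch size of 16
    gradient_checkpointing=True,        # recompute activations to save memory
    fp16=True,                          # half precision on the Colab GPU
)
```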
Category: Data Science

Large jumps in loss in simple transformer model?

As an exercise, I created a very simple transformer model that just sees the same simple batch of dummy data repeatedly and (one would assume) should quickly learn to fit it perfectly. And indeed, training reaches a loss of zero quickly. However I noticed that the loss does not stay at zero, or even close to it: there are occasional large jumps in the loss. The script below counts every time that the loss jumps by 10 or more between …
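The question's own script is truncated above; purely as a generic illustration (not the author's code), an experiment of this shape repeatedly fits one fixed dummy batch and counts loss increases of 10 or more between consecutive steps.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, batch_first=True)
head = nn.Linear(16, 4)
x = torch.randn(8, 10, 16)                    # one fixed dummy batch
y = torch.randint(0, 4, (8,))
opt = torch.optim.Adam(list(layer.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

prev, big_jumps = None, 0
for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(head(layer(x).mean(dim=1)), y)
    loss.backward()
    opt.step()
    if prev is not None and loss.item() - prev >= 10:
        big_jumps += 1
    prev = loss.item()

print("jumps of 10+ between steps:", big_jumps)
```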
Category: Data Science

Which model is better able to understand that two sentences are talking about different things?

I'm currently working on the task of measuring semantic proximity between sentences. I use fasttext train_unsupervised (skipgram) for this: I extract the sentence embeddings and then measure the cosine similarity between them. However, I ran into the following problem: the cosine similarity between the embeddings of the sentences "Create a documentation of product A" and "he is creating a documentation of product B" is very high (>0.9). Obviously this is because both of them are about creating documentation, but the first …
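For reference, a minimal sketch of the measurement described above; "corpus.txt" is a placeholder path to whatever plain-text corpus the model is trained on.

```python
import fasttext
import numpy as np

model = fasttext.train_unsupervised("corpus.txt", model="skipgram")  # placeholder corpus

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = model.get_sentence_vector("Create a documentation of product A")
v2 = model.get_sentence_vector("he is creating a documentation of product B")
print(cosine(v1, v2))  # tends to be high: the sentences share most content words
```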
Category: Data Science

Could Attention_mask in T5 be a float in [0,1]?

I was inspecting the T5 model from HF: https://huggingface.co/docs/transformers/model_doc/t5 . attention_mask is documented as: attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. I was wondering whether something "softer" could be used, not only selecting the non-padding tokens but also selecting "how much" attention should be paid to every token. This question is …
Category: Data Science

Embedding from a Transformer-based model for a paragraph or document (like Doc2Vec)

I have a set of data that contains sequences of different lengths. On average the sequence length is 600. The dataset looks like this:
S1 = ['Walk','Eat','Going school','Eat','Watching movie','Walk'......,'Sleep']
S2 = ['Eat','Eat','Going school','Walk','Walk','Watching movie'.......,'Eat']
.........................................
.........................................
S50 = ['Walk','Going school','Eat','Eat','Watching movie','Sleep',.......,'Walk']
The number of unique actions in the dataset is fixed, which means some sequences may not contain all of the actions. Using Doc2Vec (the Gensim library in particular), I was able to extract an embedding for each of the sequences …
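For context, a minimal sketch of the Doc2Vec step described above (gensim 4.x), using two of the example sequences; the training parameters are illustrative only.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

S1 = ['Walk', 'Eat', 'Going school', 'Eat', 'Watching movie', 'Walk', 'Sleep']
S2 = ['Eat', 'Eat', 'Going school', 'Walk', 'Walk', 'Watching movie', 'Eat']

docs = [TaggedDocument(words=s, tags=[i]) for i, s in enumerate([S1, S2])]
model = Doc2Vec(docs, vector_size=32, window=3, min_count=1, epochs=50)

print(model.dv[0].shape)  # one fixed-size embedding per sequence
```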
Category: Data Science

What's the right input format for GPT-2 in NLP?

I'm fine-tuning pre-trained GPT-2 for text summarization. The dataset contains 'text' and 'reference summary'. My question is how to add special tokens to get the right input format. Currently I'm thinking of doing it like this: example1: <BOS> text <SEP> reference summary <EOS>, example2: <BOS> text <SEP> reference summary <EOS>, ... Is this correct? If so, a follow-up question would be whether the max token length (i.e. 1024 for GPT-2) also covers the concatenated length of text and reference summary. Any comment …
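A hedged sketch of one common way to register such tokens with the GPT-2 tokenizer and build the concatenated example; the token strings are the question's own choice rather than anything GPT-2 requires, and the 1024-token limit applies to the whole concatenated sequence.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

tokenizer.add_special_tokens(
    {"bos_token": "<BOS>", "eos_token": "<EOS>", "sep_token": "<SEP>", "pad_token": "<PAD>"}
)
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens

example = f"{tokenizer.bos_token} some article text {tokenizer.sep_token} its summary {tokenizer.eos_token}"
ids = tokenizer(example, truncation=True, max_length=1024)["input_ids"]
print(len(ids))  # text + summary together must fit in the 1024-token window
```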
Category: Data Science

Why do Transformers need positional encodings?

Given that, at least in the first self-attention layer in the encoder, inputs have a correspondence with outputs, I have the following questions. Isn't ordering already implicitly captured by the query vectors, which themselves are just transformations of the inputs? What do the sinusoidal positional encodings capture that the ordering of the query vectors doesn't already? Am I perhaps mistaken in thinking that transformers take in the entire input at once? How are words fed in? If we feed in the …
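One way to see why the projections alone don't encode order: self-attention without positional encodings is permutation-equivariant, so shuffling the input tokens just shuffles the outputs in the same way. A small PyTorch check (standard nn.MultiheadAttention, no mask):

```python
import torch
from torch import nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(1, 5, 8)                # 5 tokens, no positional information added
perm = torch.tensor([3, 0, 4, 1, 2])
x_perm = x[:, perm, :]

out, _ = attn(x, x, x)
out_perm, _ = attn(x_perm, x_perm, x_perm)

# permuting the input only permutes the output: the layer never "sees" order
print(torch.allclose(out[:, perm, :], out_perm, atol=1e-5))  # True
```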
Category: Data Science

Incorporating structural information in a Transformer?

For a Neural Machine Translation (NMT) task, my input data has relational information. This relation could be modelled using a graph structure, so one approach could be to use a Graph Neural Network (GNN) in a Graph2Seq model, but I can't find a good generation model for GNNs. Instead, I want to use a Transformer. But then the challenge is how to embed the structural information there. Is there any open-source artefact for a Relational Transformer that I can use out …
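Independently of any particular library, one common pattern for this (a hedged sketch of relation-aware self-attention, with names of my own choosing) is to learn an embedding per relation type and add it as a bias to the attention logits:

```python
import torch
from torch import nn

class RelationAwareAttention(nn.Module):
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.rel_bias = nn.Embedding(num_relations, 1)  # scalar bias per relation type

    def forward(self, x, relations):
        # x: (batch, seq, dim); relations: (batch, seq, seq) integer relation ids
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5
        scores = scores + self.rel_bias(relations).squeeze(-1)  # inject structure
        return torch.softmax(scores, dim=-1) @ v

x = torch.randn(1, 4, 8)
rel = torch.randint(0, 3, (1, 4, 4))
print(RelationAwareAttention(dim=8, num_relations=3)(x, rel).shape)  # (1, 4, 8)
```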
Category: Data Science

ValueError: Mixed precision training with AMP or APEX (`--fp16` or `--bf16`) and half precision evaluation (`--fp16_full_eval` or `--bf16_full_eval`) can only be used on CUDA devices

I'm fine-tuning the wav2vec-xlsr model. I've created a virtual env for that and installed CUDA 11.0 and tensorflow-gpu==2.5.0, but I get the following error: ValueError: Mixed precision training with AMP or APEX (--fp16 or --bf16) and half precision evaluation (--fp16_full_eval or --bf16_full_eval) can only be used on CUDA devices. I want to fine-tune the model on GPU. Any help?
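A quick sanity-check sketch: the --fp16 flag of the HF Trainer needs a CUDA-visible PyTorch build (wav2vec2 fine-tuning runs on PyTorch, so tensorflow-gpu does not affect this); the output directory below is a placeholder.

```python
import torch
from transformers import TrainingArguments

use_cuda = torch.cuda.is_available()
print("CUDA available:", use_cuda, "| torch CUDA build:", torch.version.cuda)

args = TrainingArguments(
    output_dir="wav2vec2-finetune",   # placeholder path
    fp16=use_cuda,                    # only request half precision when a GPU is visible
)
```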
Category: Data Science

Pretrained vs. finetuned model

I have a doubt regarding terminology. When dealing with Hugging Face transformer models, I often read about "using pretrained models for classification" vs. "fine-tuning a pretrained model for classification". I fail to understand what the exact difference between these two is. As I understand it, pretrained models by themselves cannot be used for classification, regression, or any relevant task without attaching at least one more dense layer and one output layer, and then training the model. In this case, we would …
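As a small illustration of the distinction (a sketch, not a definition): loading pretrained weights and attaching a freshly initialized classification head is the "pretrained" part; continuing to train those weights on labeled data is the "fine-tuning" part.

```python
from transformers import AutoModelForSequenceClassification

# Pretrained encoder weights are loaded; the classification head on top is new
# and randomly initialized (hence the usual "newly initialized weights" warning).
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# "Fine-tuning" = training this model (the head and, usually, the encoder) on your
# labeled data; without that step the new head has never seen your task.
```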
Category: Data Science

How to optimize hyperparameters in BERT?

I am using the BERT model to classify stereotypes in sentences. I wanted to know if there is a way to automate the optimization of hyperparameters such as the number of epochs, the batch size, or the learning rate with some function similar to GridSearchCV (I don't know whether that function can be used with the BERT model; if it can, let me know), so I don't have to test combinations of values by hand. I attach part of my …
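GridSearchCV itself targets scikit-learn estimators, but the same idea can be hand-rolled around any training loop (the HF Trainer also ships trainer.hyperparameter_search with optuna/ray backends for this). A sketch, where train_and_eval is a hypothetical stand-in for the question's own training and evaluation code:

```python
import itertools

learning_rates = [2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]
num_epochs = [2, 3, 4]

best_score, best_params = None, None
for lr, bs, ep in itertools.product(learning_rates, batch_sizes, num_epochs):
    # train_and_eval is hypothetical: train with these values, return a validation metric
    score = train_and_eval(learning_rate=lr, batch_size=bs, epochs=ep)
    if best_score is None or score > best_score:
        best_score, best_params = score, {"lr": lr, "batch_size": bs, "epochs": ep}

print(best_params, best_score)
```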
Category: Data Science

Can I use Transformer-XL for a text classification task?

I want to use Transformer-XL for text classification tasks, but I don't know what architecture to use on top of it for classification. I use dense layers with softmax activation on the logits output from the Transformer-XL model, but this doesn't seem right: when training, I see that accuracy is very low. Output of my model: My training step:
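As a hedged sketch of one common head design (the backbone is assumed to be an HF-style Transformer-XL encoder returning hidden states of shape (batch, seq, hidden)): pool the hidden states, project to class logits, and leave the softmax to the cross-entropy loss.

```python
import torch
from torch import nn

class TransfoXLClassifier(nn.Module):
    def __init__(self, backbone, hidden_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                 # assumed Transformer-XL encoder
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids).last_hidden_state  # assumption: HF-style output
        pooled = hidden.mean(dim=1)                           # mean-pool over the sequence
        return self.head(pooled)                              # raw logits for CrossEntropyLoss
```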
Category: Data Science

BERT base uncased: required GPU RAM

I'm working on an NLP task using BERT, and I have a little doubt about GPU memory. I already made a model (using DistilBERT) since I had out-of-memory problems with TensorFlow on an RTX 3090 (24 GB of GPU RAM, but ~20.5 GB usable) with the BERT base model. To make it work, I limited my data to 1.1 million sentences in the training set (truncating sentences at 128 words) and about 300k in validation, but used a high batch size (256). Now I have …
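A rough back-of-envelope sketch (approximate numbers, fp32 with an Adam-style optimizer): the fixed cost of BERT-base is only a couple of GB; the activations, which scale with batch_size × sequence_length, are usually what blow past 20 GB at batch size 256.

```python
params = 110e6                 # approx. BERT-base parameter count
bytes_fp32 = 4

weights = params * bytes_fp32
gradients = weights
adam_states = 2 * weights      # first and second moments

fixed_gb = (weights + gradients + adam_states) / 1e9
print(f"{fixed_gb:.2f} GB before activations")   # the rest of the budget goes to activations
```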
Category: Data Science
