Temporal Fusion Transformer from PyTorch-Forecasting with Multiple Targets - 'list' error

I'm new to PyTorch and the PyTorch Forecasting library, and I'm trying to predict multiple targets using the Temporal Fusion Transformer model. I have 7 targets in a list as my targets variable. I'm using MultiLoss as my loss function with a list of 7 CrossEntropy loss functions (one per target variable). In the problem I'm trying to model, there are 7 possible outcomes per time step and I'm trying to find which option is most likely. I'm looking for a …
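For reference, a hedged sketch of how such a multi-target setup is usually wired together in pytorch-forecasting; `training_dataset` stands in for an assumed, pre-built TimeSeriesDataSet with a list of 7 categorical target columns, and the per-target class count of 7 is an assumption, not something stated above.

```python
# Sketch only: one CrossEntropy metric per target, wrapped in MultiLoss.
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import CrossEntropy, MultiLoss

n_targets = 7  # the question's 7 target variables

tft = TemporalFusionTransformer.from_dataset(
    training_dataset,                                   # assumed TimeSeriesDataSet (not shown)
    loss=MultiLoss([CrossEntropy() for _ in range(n_targets)]),
    output_size=[7] * n_targets,                        # assumption: 7 classes per target
)
```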
Category: Data Science

What enables transformers or very deep models to "plan" ahead for sequential decision making?

I was watching this amazing lecture by Oriol Vinyals. On one slide there is a question asking whether very deep models plan. Transformer models, or models employed in applications like dialogue generation, do not have an explicit planning component, yet they behave as if they already have the dialogue planned. Dr. Vinyals mentioned that there are papers on "how transformers are building up knowledge to answer questions or do all sorts of very interesting analyses". Can anyone please point me to a few …
Category: Data Science

Why is 10000 used as the denominator in Positional Encodings in the Transformer Model?

I was working through the "Attention Is All You Need" paper. While the motivation for positional encodings makes sense, and other Stack Exchange answers filled me in on the motivation for their structure, I still don't understand why $1/10000$ was used as the scaling factor for the $pos$ of a word. Why was this number chosen?
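As a small illustration of where the constant sits, here is a minimal NumPy sketch of the sinusoidal encoding from the paper; the 10000 base sets the longest wavelength, so the per-dimension frequencies form a geometric progression from $2\pi$ up to $10000 \cdot 2\pi$.

```python
# PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]              # (max_len, 1) word positions
    i = np.arange(0, d_model, 2)[None, :]          # even embedding dimensions
    angle = pos / np.power(10000.0, i / d_model)   # 10000 controls the wavelength range
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

print(sinusoidal_positional_encoding(max_len=50, d_model=16).shape)  # (50, 16)
```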
Category: Data Science

Self-Attention Summation and Loss of Information

In self-attention, the attention output for a word is calculated as: $$ A(q, K, V) = \sum_{i} \frac{\exp(q \cdot k^{<i>})}{\sum_{j} \exp(q \cdot k^{<j>})} v^{<i>} $$ My question is: why do we sum over the softmax-weighted value vectors? Doesn't this lose information about which other words in particular are important to the word under consideration? In other words, how does this summed vector point to which words are relevant? For example, consider two extreme scenarios where practically the entire output depends on the attention vector of word $x^{<t>}$, and …
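A tiny numeric sketch of that formula, with toy vectors of my own choosing: the softmax weights computed before the sum are what indicate which words matter, and the sum then mixes the value vectors according to those weights.

```python
import numpy as np

def attention(q, K, V):
    scores = K @ q                                      # q . k^<i> for every key
    weights = np.exp(scores) / np.exp(scores).sum()     # softmax over positions
    return weights, weights @ V                         # sum_i weights_i * v^<i>

q = np.array([1.0, 0.0])
K = np.array([[5.0, 0.0],    # key of a highly relevant word
              [0.0, 1.0],
              [0.1, 0.1]])
V = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])

w, context = attention(q, K, V)
print(w)        # weights concentrate almost entirely on the first word
print(context)  # so the summed vector is dominated by that word's value vector
```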
Category: Data Science

Transformer time series classification using time2vec positional embedding

I want to use a transformer model to do classification of fixed-length time series. I was following along with this tutorial using Keras, which uses time2vec as a positional embedding. According to the original time2vec paper, the representation is calculated as $$ \boldsymbol{t2v}(\tau)[i] = \begin{cases} \omega_i \tau + \phi_i, & i = 0\\ F(\omega_i \tau + \phi_i), & 1 \leq i \leq k \end{cases} $$ The mentioned tutorial simply concatenates this embedding with the input. Now, I understand the intention of the …
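For concreteness, here is a minimal Keras layer implementing the quoted formula with $F = \sin$ (my own sketch of the idea, not the tutorial's code); it returns the linear component plus $k$ periodic components, which can then be concatenated with the input features.

```python
import tensorflow as tf

class Time2Vec(tf.keras.layers.Layer):
    def __init__(self, k: int, **kwargs):
        super().__init__(**kwargs)
        self.k = k

    def build(self, input_shape):
        # i = 0 term (linear) and 1 <= i <= k terms (periodic)
        self.w0 = self.add_weight(name="w0", shape=(1,), initializer="uniform")
        self.p0 = self.add_weight(name="p0", shape=(1,), initializer="uniform")
        self.W = self.add_weight(name="W", shape=(self.k,), initializer="uniform")
        self.P = self.add_weight(name="P", shape=(self.k,), initializer="uniform")

    def call(self, tau):
        # tau: (batch, seq_len, 1) scalar time steps
        linear = self.w0 * tau + self.p0               # omega_0 * tau + phi_0
        periodic = tf.sin(tau * self.W + self.P)       # F(omega_i * tau + phi_i)
        return tf.concat([linear, periodic], axis=-1)  # (batch, seq_len, k + 1)

t = tf.reshape(tf.range(10, dtype=tf.float32), (1, 10, 1))
print(Time2Vec(k=7)(t).shape)  # (1, 10, 8)
```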
Category: Data Science

Gradient and loss calculation localization in Vision Transformers

Hi all, I am turning to you to figure out where the gradient and loss computations that update the q, k, v weights happen in Vision Transformers. I suspect it is the MLP/FF part of the architecture, but I am not entirely sure. I attach some code from lucidrains:

```python
import torch
from torch import nn

from einops import rearrange, repeat
from einops.layers.torch import Rearrange

# helpers

def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes

class PreNorm(nn.Module):
    def __init__(self, dim, …
```
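Purely as an illustration (using a plain nn.MultiheadAttention stand-in rather than the lucidrains ViT code): the q/k/v projection weights receive gradients from the same backward pass that starts at the classification loss; there is no separate loss local to the attention block.

```python
import torch
from torch import nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
head = nn.Linear(16, 3)                        # stand-in for the MLP classification head

x = torch.randn(2, 5, 16)
out, _ = attn(x, x, x)                         # q, k, v all produced from x via in_proj_weight
logits = head(out.mean(dim=1))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 2]))
loss.backward()

# in_proj_weight stacks the q, k and v projections; its gradient is non-zero,
# so the loss computed after the head reaches the attention parameters.
print(attn.in_proj_weight.grad.abs().sum() > 0)  # tensor(True)
```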
Category: Data Science

mBART training "CUDA out of memory"

I want to train a network with the mBART model in Google Colab, but I get: RuntimeError: CUDA out of memory. Tried to allocate 886.00 MiB (GPU 0; 15.90 GiB total capacity; 13.32 GiB already allocated; 809.75 MiB free; 14.30 GiB reserved in total by PyTorch). I subscribed to Colab with a GPU. I tried using 128 and 64 for the maximum total input sequence length. What can I do to fix the problem?
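Besides shortening sequences, the usual levers are a smaller per-step batch, gradient accumulation, gradient checkpointing and fp16. A hedged sketch using argument names from recent transformers versions (the output directory is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="mbart-finetune",        # placeholder path
    per_device_train_batch_size=2,      # small per-step batch
    gradient_accumulation_steps=8,      # effective batch size of 16
    gradient_checkpointing=True,        # recompute activations to save memory
    fp16=True,                          # half precision on the Colab GPU
)
```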
Category: Data Science

Large jumps in loss in simple transformer model?

As an exercise, I created a very simple transformer model that just sees the same simple batch of dummy data repeatedly and (one would assume) should quickly learn to fit it perfectly. And indeed, training reaches a loss of zero quickly. However I noticed that the loss does not stay at zero, or even close to it: there are occasional large jumps in the loss. The script below counts every time that the loss jumps by 10 or more between …
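The question's own script is truncated above; purely as a generic illustration (not the author's code), an experiment of this shape repeatedly fits one fixed dummy batch and counts loss increases of 10 or more between consecutive steps.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, batch_first=True)
head = nn.Linear(16, 4)
x = torch.randn(8, 10, 16)                    # one fixed dummy batch
y = torch.randint(0, 4, (8,))
opt = torch.optim.Adam(list(layer.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

prev, big_jumps = None, 0
for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(head(layer(x).mean(dim=1)), y)
    loss.backward()
    opt.step()
    if prev is not None and loss.item() - prev >= 10:
        big_jumps += 1
    prev = loss.item()

print("jumps of 10+ between steps:", big_jumps)
```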
Category: Data Science

Which model is better able to understand that two sentences are talking about different things?

I'm currently working on the task of measuring semantic proximity between sentences. I use fasttext train_unsupervised (skipgram) for this: I extract the sentence embeddings and then measure the cosine similarity between them. However, I ran into the following problem: the cosine similarity between the embeddings of the sentences "Create a documentation of product A" and "he is creating a documentation of product B" is very high (>0.9). Obviously this is because both of them are about creating documentation, but the first …
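For reference, a minimal sketch of the measurement described above; "corpus.txt" is a placeholder path to whatever plain-text corpus the model is trained on.

```python
import fasttext
import numpy as np

model = fasttext.train_unsupervised("corpus.txt", model="skipgram")  # placeholder corpus

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = model.get_sentence_vector("Create a documentation of product A")
v2 = model.get_sentence_vector("he is creating a documentation of product B")
print(cosine(v1, v2))  # tends to be high: the sentences share most content words
```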
Category: Data Science

Could Attention_mask in T5 be a float in [0,1]?

I was inspecting the T5 model from HF: https://huggingface.co/docs/transformers/model_doc/t5 . attention_mask is documented as: attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. I was wondering whether something "softer" could be used, not only selecting the non-padding tokens but also selecting "how much" attention should be paid to every token. This question is …
Category: Data Science

Embedding from a Transformer-based model for a paragraph or document (like Doc2Vec)

I have a set of data that contains sequences of different lengths. On average the sequence length is 600. The dataset looks like this:
S1 = ['Walk','Eat','Going school','Eat','Watching movie','Walk'......,'Sleep']
S2 = ['Eat','Eat','Going school','Walk','Walk','Watching movie'.......,'Eat']
.........................................
.........................................
S50 = ['Walk','Going school','Eat','Eat','Watching movie','Sleep',.......,'Walk']
The number of unique actions in the dataset is fixed, which means some sequences may not contain all of the actions. Using Doc2Vec (the Gensim library in particular), I was able to extract an embedding for each of the sequences …
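For context, a minimal sketch of the Doc2Vec step described above (gensim 4.x), using two of the example sequences; the training parameters are illustrative only.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

S1 = ['Walk', 'Eat', 'Going school', 'Eat', 'Watching movie', 'Walk', 'Sleep']
S2 = ['Eat', 'Eat', 'Going school', 'Walk', 'Walk', 'Watching movie', 'Eat']

docs = [TaggedDocument(words=s, tags=[i]) for i, s in enumerate([S1, S2])]
model = Doc2Vec(docs, vector_size=32, window=3, min_count=1, epochs=50)

print(model.dv[0].shape)  # one fixed-size embedding per sequence
```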
Category: Data Science

What's the right input format for GPT-2 in NLP?

I'm fine-tuning pre-trained GPT-2 for text summarization. The dataset contains 'text' and 'reference summary'. My question is how to add special tokens to get the right input format. Currently I'm thinking of doing it like this: example1: <BOS> text <SEP> reference summary <EOS>, example2: <BOS> text <SEP> reference summary <EOS>, ... Is this correct? If so, a follow-up question would be whether the max token length (i.e. 1024 for GPT-2) also covers the concatenated length of text and reference summary. Any comment …
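A hedged sketch of one common way to register such tokens with the GPT-2 tokenizer and build the concatenated example; the token strings are the question's own choice rather than anything GPT-2 requires, and the 1024-token limit applies to the whole concatenated sequence.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

tokenizer.add_special_tokens(
    {"bos_token": "<BOS>", "eos_token": "<EOS>", "sep_token": "<SEP>", "pad_token": "<PAD>"}
)
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens

example = f"{tokenizer.bos_token} some article text {tokenizer.sep_token} its summary {tokenizer.eos_token}"
ids = tokenizer(example, truncation=True, max_length=1024)["input_ids"]
print(len(ids))  # text + summary together must fit in the 1024-token window
```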
Category: Data Science

Why do Transformers need positional encodings?

Given that, at least in the first self-attention layer in the encoder, inputs have a correspondence with outputs, I have the following questions. Isn't ordering already implicitly captured by the query vectors, which themselves are just transformations of the inputs? What do the sinusoidal positional encodings capture that the ordering of the query vectors doesn't already? Am I perhaps mistaken in thinking that transformers take in the entire input at once? How are words fed in? If we feed in the …
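One way to see why the projections alone don't encode order: self-attention without positional encodings is permutation-equivariant, so shuffling the input tokens just shuffles the outputs in the same way. A small PyTorch check (standard nn.MultiheadAttention, no mask):

```python
import torch
from torch import nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(1, 5, 8)                # 5 tokens, no positional information added
perm = torch.tensor([3, 0, 4, 1, 2])
x_perm = x[:, perm, :]

out, _ = attn(x, x, x)
out_perm, _ = attn(x_perm, x_perm, x_perm)

# permuting the input only permutes the output: the layer never "sees" order
print(torch.allclose(out[:, perm, :], out_perm, atol=1e-5))  # True
```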
Category: Data Science

Incorporating structural information in a Transformer?

For a Neural Machine Translation (NMT) task, my input data has relational information. This relation could be modelled using a graph structure, so one approach could be to use a Graph Neural Network (GNN) in a Graph2Seq model, but I can't find a good generation model for GNNs. Instead, I want to use a Transformer. But then the challenge is how to embed the structural information there. Is there any open-source artefact for a Relational Transformer that I can use out …
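Independently of any particular library, one common pattern for this (a hedged sketch of relation-aware self-attention, with names of my own choosing) is to learn an embedding per relation type and add it as a bias to the attention logits:

```python
import torch
from torch import nn

class RelationAwareAttention(nn.Module):
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.rel_bias = nn.Embedding(num_relations, 1)  # scalar bias per relation type

    def forward(self, x, relations):
        # x: (batch, seq, dim); relations: (batch, seq, seq) integer relation ids
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5
        scores = scores + self.rel_bias(relations).squeeze(-1)  # inject structure
        return torch.softmax(scores, dim=-1) @ v

x = torch.randn(1, 4, 8)
rel = torch.randint(0, 3, (1, 4, 4))
print(RelationAwareAttention(dim=8, num_relations=3)(x, rel).shape)  # (1, 4, 8)
```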
Category: Data Science

ValueError: Mixed precision training with AMP or APEX (`--fp16` or `--bf16`) and half precision evaluation (`--fp16_full_eval` or `--bf16_full_eval`) can only be used on CUDA devices

I'm fine-tuning the wav2vec-xlsr model. I've created a virtual env for that and installed CUDA 11.0 and tensorflow-gpu==2.5.0, but I get the following error: ValueError: Mixed precision training with AMP or APEX (--fp16 or --bf16) and half precision evaluation (--fp16_full_eval or --bf16_full_eval) can only be used on CUDA devices. I want to fine-tune the model on GPU. Any help?
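A quick sanity-check sketch: the --fp16 flag of the HF Trainer needs a CUDA-visible PyTorch build (wav2vec2 fine-tuning runs on PyTorch, so tensorflow-gpu does not affect this); the output directory below is a placeholder.

```python
import torch
from transformers import TrainingArguments

use_cuda = torch.cuda.is_available()
print("CUDA available:", use_cuda, "| torch CUDA build:", torch.version.cuda)

args = TrainingArguments(
    output_dir="wav2vec2-finetune",   # placeholder path
    fp16=use_cuda,                    # only request half precision when a GPU is visible
)
```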
Category: Data Science

Pretrained vs. finetuned model

I have a doubt regarding terminology. When dealing with Hugging Face transformer models, I often read about "using pretrained models for classification" vs. "fine-tuning a pretrained model for classification". I fail to understand what the exact difference between these two is. As I understand it, pretrained models by themselves cannot be used for classification, regression, or any relevant task without attaching at least one more dense layer and one output layer, and then training the model. In this case, we would …
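As a small illustration of the distinction (a sketch, not a definition): loading pretrained weights and attaching a freshly initialized classification head is the "pretrained" part; continuing to train those weights on labeled data is the "fine-tuning" part.

```python
from transformers import AutoModelForSequenceClassification

# Pretrained encoder weights are loaded; the classification head on top is new
# and randomly initialized (hence the usual "newly initialized weights" warning).
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# "Fine-tuning" = training this model (the head and, usually, the encoder) on your
# labeled data; without that step the new head has never seen your task.
```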
Category: Data Science

How to optimize hyperparameters in BERT?

I am using the BERT model to classify stereotypes in sentences. I wanted to know if there is a way to automate the optimization of hyperparameters such as the number of epochs, the batch size, or the learning rate with some function similar to GridSearchCV (I don't know whether that function can be used with the BERT model; if it can, let me know), so I don't have to test combinations of values by hand. I attach part of my …
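GridSearchCV itself targets scikit-learn estimators, but the same idea can be hand-rolled around any training loop (the HF Trainer also ships trainer.hyperparameter_search with optuna/ray backends for this). A sketch, where train_and_eval is a hypothetical stand-in for the question's own training and evaluation code:

```python
import itertools

learning_rates = [2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]
num_epochs = [2, 3, 4]

best_score, best_params = None, None
for lr, bs, ep in itertools.product(learning_rates, batch_sizes, num_epochs):
    # train_and_eval is hypothetical: train with these values, return a validation metric
    score = train_and_eval(learning_rate=lr, batch_size=bs, epochs=ep)
    if best_score is None or score > best_score:
        best_score, best_params = score, {"lr": lr, "batch_size": bs, "epochs": ep}

print(best_params, best_score)
```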
Category: Data Science

Can I use Transformer-XL for a text classification task?

I want to use Transformer-XL for text classification tasks, but I don't know what architecture to use on top of it for classification. I use dense layers with softmax activation on the logits output from the Transformer-XL model, but this doesn't seem right: when training, I see that accuracy is very low. Output of my model: My training step:
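As a hedged sketch of one common head design (the backbone is assumed to be an HF-style Transformer-XL encoder returning hidden states of shape (batch, seq, hidden)): pool the hidden states, project to class logits, and leave the softmax to the cross-entropy loss.

```python
import torch
from torch import nn

class TransfoXLClassifier(nn.Module):
    def __init__(self, backbone, hidden_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                 # assumed Transformer-XL encoder
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids).last_hidden_state  # assumption: HF-style output
        pooled = hidden.mean(dim=1)                           # mean-pool over the sequence
        return self.head(pooled)                              # raw logits for CrossEntropyLoss
```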
Category: Data Science

BERT base uncased: required GPU RAM

I'm working on an NLP task using BERT, and I have a little doubt about GPU memory. I already made a model (using DistilBERT) since I had out-of-memory problems with TensorFlow on an RTX 3090 (24 GB of GPU RAM, but ~20.5 GB usable) with the BERT base model. To make it work, I limited my data to 1.1 million sentences in the training set (truncating sentences at 128 words) and about 300k in validation, but used a high batch size (256). Now I have …
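A rough back-of-envelope sketch (approximate numbers, fp32 with an Adam-style optimizer): the fixed cost of BERT-base is only a couple of GB; the activations, which scale with batch_size × sequence_length, are usually what blow past 20 GB at batch size 256.

```python
params = 110e6                 # approx. BERT-base parameter count
bytes_fp32 = 4

weights = params * bytes_fp32
gradients = weights
adam_states = 2 * weights      # first and second moments

fixed_gb = (weights + gradients + adam_states) / 1e9
print(f"{fixed_gb:.2f} GB before activations")   # the rest of the budget goes to activations
```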
Category: Data Science
