How to add the Luong attention mechanism to a CNN?

I'm writing the CNN model below for binary image classification, and I'm trying to add an attention layer to it. I read the tf.keras.layers.Attention docs (https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention) but I still don't know exactly how to use it; any help is appreciated.

model = keras.Sequential()
model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same', input_shape=(256, 256, 3)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, …
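
One way to use tf.keras.layers.Attention here (a sketch, not the only possible wiring): switch to the functional API, flatten the spatial grid into a sequence of feature vectors, and let the feature map attend to itself. The reshape and pooling choices below are assumptions, and the Keras docs describe this layer as Luong-style dot-product attention:

import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(256, 256, 3))
x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
x = layers.MaxPooling2D((2, 2), strides=(2, 2))(x)
x = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = layers.MaxPooling2D((2, 2), strides=(2, 2))(x)
# Flatten the 64x64 spatial grid into a sequence of 4096 feature vectors
seq = layers.Reshape((64 * 64, 128))(x)
# Luong-style (dot-product) self-attention over spatial positions:
# query = value = the CNN feature sequence
attended = layers.Attention()([seq, seq])
pooled = layers.GlobalAveragePooling1D()(attended)
outputs = layers.Dense(1, activation='sigmoid')(pooled)

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])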
Category: Data Science

Class token in ViT and BERT

I'm trying to understand the architecture in the ViT paper, and I noticed they use a CLASS token as in BERT. To the best of my understanding, this token is used to gather knowledge of the entire input, and is then used on its own to predict the class of the image. My question is: why does this token exist as an input to all the transformer blocks, and why is it treated the same as the word/patch tokens? Treating the class token …
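
For concreteness, a minimal Keras sketch (with hypothetical layer and variable names) of how the class token enters the patch sequence and how only its final state feeds the classifier:

import tensorflow as tf
from tensorflow.keras import layers

class AddClassToken(layers.Layer):
    """Prepends a learnable [CLS] embedding to the patch sequence."""
    def build(self, input_shape):
        # input_shape: (batch, num_patches, embed_dim)
        self.cls = self.add_weight(name="cls", shape=(1, 1, input_shape[-1]),
                                   initializer="zeros", trainable=True)
    def call(self, patches):
        batch = tf.shape(patches)[0]
        cls = tf.tile(self.cls, [batch, 1, 1])      # (batch, 1, dim)
        return tf.concat([cls, patches], axis=1)    # (batch, 1 + N, dim)

# After the transformer blocks, only position 0 (the class token) is read out:
# logits = layers.Dense(num_classes)(encoded[:, 0])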
Category: Data Science

Is a dense layer required for implementing Bahdanau attention?

I saw that everyone adds a Dense() layer to their custom Bahdanau attention layer, which I think isn't needed. This is an image from a tutorial here. Here we are just multiplying two vectors and then doing several operations on those vectors only. So what is the need for the Dense() layer? Is the tutorial on 'how attention works' wrong?
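
The Dense layers are the trainable part of Bahdanau's additive score, score(s, h) = v^T tanh(W1 s + W2 h); without them the score degenerates to a fixed dot product. A minimal sketch following the standard TensorFlow NMT tutorial formulation:

import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(layers.Layer):
    # W1, W2 and V are exactly the Dense layers in question: they are the
    # learned parameters that make the score "additive".
    def __init__(self, units):
        super().__init__()
        self.W1 = layers.Dense(units)
        self.W2 = layers.Dense(units)
        self.V = layers.Dense(1)

    def call(self, query, values):
        # query: (batch, hidden), values: (batch, seq_len, hidden)
        q = tf.expand_dims(query, 1)                               # (batch, 1, hidden)
        score = self.V(tf.nn.tanh(self.W1(q) + self.W2(values)))  # (batch, seq_len, 1)
        weights = tf.nn.softmax(score, axis=1)
        context = tf.reduce_sum(weights * values, axis=1)          # (batch, hidden)
        return context, weights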
Category: Data Science

Self-Attention Summation and Loss of Information

In self-attention, the attention for a word is calculated as: $$ A(q, K, V) = \sum_{i} \frac{\exp(q \cdot k^{\langle i \rangle})}{\sum_{j} \exp(q \cdot k^{\langle j \rangle})} \, v^{\langle i \rangle} $$ My question is: why do we sum over the softmax-weighted value vectors? Doesn't this lose information about which other words in particular are important to the word under consideration? In other words, how does this summed vector point to which words are relevant? For example, consider two extreme scenarios where practically the entire output depends on the attention vector of word $x^{\langle t \rangle}$, and …
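
A toy numerical example of the formula (numpy, made-up vectors): the softmax weights themselves are where "which words matter" lives, and the summed output is dominated by the highest-weighted value:

import numpy as np

# One query attending over 3 key/value pairs.
q = np.array([1.0, 0.0])
K = np.array([[4.0, 0.0],    # word 1: large dot product with q
              [0.0, 4.0],    # word 2
              [0.0, 0.0]])   # word 3
V = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

scores = K @ q                                   # [4.0, 0.0, 0.0]
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: ~[0.96, 0.02, 0.02]
output = weights @ V                             # close to V[0], word 1's value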
Category: Data Science

Working Behavior of BERT vs Transformers vs Self-Attention+LSTM vs Attention+LSTM on the scientific STEM data classification task?

So I just used a pre-trained BERT with focal loss to classify Physics, Chemistry, Biology and Mathematics, and got a good macro F1 of 0.91. That is good, given it only had to look for tokens like triangle, reaction, mitochondria, newton etc. in a broad way. Now I want to classify the chapter name as well. That is a harder task, because when I trained BERT on 208 classes, my score was almost 0. Why? I …
Category: Data Science

Could Attention_mask in T5 be a float in [0,1]?

I was inspecting the T5 model from HF (https://huggingface.co/docs/transformers/model_doc/t5). attention_mask is documented as: attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. I was wondering whether something "softer" could be used: not only selecting the non-padding tokens, but also specifying "how much" attention should be paid to each token. This question is …
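
For intuition, a sketch of the mechanics involved (simplified from how HF transformers converts the mask internally; the soft variant is an assumption, not a T5 feature):

import torch

# Simplified version of HF's internal conversion: 1 -> bias 0 (attend),
# 0 -> very large negative bias (blocked). Note that any fractional value m
# is scaled toward -inf, so a 0.5 does NOT behave as "half attention" here.
def hf_style_bias(mask, dtype=torch.float32):
    return (1.0 - mask[:, None, None, :].to(dtype)) * torch.finfo(dtype).min

# One way to get a genuinely soft mask (an assumption, not library behavior):
# adding log(m) to the logits multiplies each token's softmax weight by m.
def soft_bias(mask, eps=1e-9):
    return torch.log(mask[:, None, None, :].clamp_min(eps))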
Category: Data Science

Can the attention mask hold values between 0 and 1?

I am new to attention-based models and wanted to understand more about the attention mask in NLP models. attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. It's the mask that we typically use for attention when a batch has varying length sentences. So a normal attention mask is supposed to look …
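
As a quick sketch of what the quoted docstring describes, here is a binary padding mask built for a batch of varying-length sequences (hypothetical lengths):

import torch

lengths = torch.tensor([5, 3, 2])
max_len = int(lengths.max())
# 1 for real tokens, 0 for padding, per the docstring
attention_mask = (torch.arange(max_len)[None, :] < lengths[:, None]).long()
# tensor([[1, 1, 1, 1, 1],
#         [1, 1, 1, 0, 0],
#         [1, 1, 0, 0, 0]])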
Category: Data Science

two different attention methods for seq2seq

I see two different ways of applying attention in seq2seq: (a) the context vector (the weighted sum of encoder hidden states) is fed into the output softmax, as shown in the first diagram (the diagram is from here); (b) the context vector is fed into the decoder input, as shown in the second diagram (the diagram is from here). What are the pros and cons of the two approaches? Is there any paper comparing the two?
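
In rough pseudocode (Python comments, a sketch of how the two wirings are usually described), (a) corresponds to Luong-style and (b) to Bahdanau-style attention:

# (a) Luong-style: attend AFTER the decoder step; context feeds the output layer
# s_t = decoder_cell(y_prev, s_prev)
# c_t = attend(s_t, encoder_states)
# logits = W_out(concat(s_t, c_t))

# (b) Bahdanau-style: attend BEFORE the decoder step; context feeds the input
# c_t = attend(s_prev, encoder_states)
# s_t = decoder_cell(concat(y_prev, c_t), s_prev)
# logits = W_out(s_t)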
Category: Data Science

How does attention for feature fusion work?

I am struggling to understand how a self-attention layer would be used to fuse features from different modalities. What I understand so far is this: every modality is fed into its own self-attention layer, which produces attention scores for every feature of that modality. These scores tell us which features are most important within that modality. I have then read that, using these scores, we can find out which of the modalities is most …
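
One common pattern, sketched below under the assumption that each modality is first projected to a shared dimension: treat the modalities themselves as a short sequence and self-attend over it, so the attention scores directly express how much modality i draws on modality j:

import tensorflow as tf
from tensorflow.keras import layers

d = 64  # shared feature dimension (an assumption)
img = layers.Input(shape=(d,))
aud = layers.Input(shape=(d,))
txt = layers.Input(shape=(d,))

# Stack the modalities into a length-3 "sequence": (batch, 3, d)
stack = layers.Lambda(lambda t: tf.stack(t, axis=1))([img, aud, txt])
# Self-attention across modalities; the softmax scores are the
# cross-modality importances, the output is the fused representation
fused = layers.MultiHeadAttention(num_heads=1, key_dim=d)(stack, stack)
pooled = layers.GlobalAveragePooling1D()(fused)
out = layers.Dense(1, activation='sigmoid')(pooled)

model = tf.keras.Model([img, aud, txt], out)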
Category: Data Science

How to add attention mechanism to my sequence-to-sequence architecture in Keras?

Based on this blog entry, I have written a sequence-to-sequence deep learning model in Keras:

model = Sequential()
model.add(LSTM(hidden_nodes, input_shape=(n_timesteps, n_features)))
model.add(RepeatVector(n_timesteps))
model.add(LSTM(hidden_nodes, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=30, batch_size=32)

It works reasonably well, but I intend to improve it by applying an attention mechanism. The aforementioned blog post includes a variation of the architecture with attention, relying on custom attention code, but it doesn't work with my present TensorFlow/Keras versions, and anyway, to my …
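
A hedged sketch of the same architecture rewritten in the functional API with the built-in tf.keras.layers.Attention (dot-product) layer, so no custom attention code is needed; wiring the decoder states as queries against the encoder states is one common choice, not the blog's exact method:

import tensorflow as tf
from tensorflow.keras import layers, Model

n_timesteps, n_features, hidden_nodes = 20, 50, 128   # placeholders, as in the question

inp = layers.Input(shape=(n_timesteps, n_features))
# Encoder: keep the full state sequence for attention, plus the final states
enc_seq, state_h, state_c = layers.LSTM(hidden_nodes, return_sequences=True,
                                        return_state=True)(inp)
# Decoder: same RepeatVector trick as the original, seeded with the encoder state
dec_in = layers.RepeatVector(n_timesteps)(state_h)
dec_seq = layers.LSTM(hidden_nodes, return_sequences=True)(
    dec_in, initial_state=[state_h, state_c])
# Dot-product attention: decoder states query the encoder states
context = layers.Attention()([dec_seq, enc_seq])
merged = layers.Concatenate()([dec_seq, context])
out = layers.TimeDistributed(layers.Dense(n_features, activation='softmax'))(merged)

model = Model(inp, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])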
Category: Data Science

Custom Simulator for Deep Reinforcement Learning

I am trying to develop a control method for a specific industrial process. I have a time series of data for the process and want to develop a prediction model based on an attention mechanism to estimate the output of the system. After developing the prediction model, I want to design a controller based on deep reinforcement learning to learn policies for process optimization. But I need a simulated environment to train and test my DRL algorithm on. How …
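
One common route is to wrap the learned prediction model as the transition function of a Gymnasium environment. A minimal skeleton, with hypothetical state/action sizes and an assumed dynamics_model.predict(state, action) interface:

import gymnasium as gym
import numpy as np

class ProcessEnv(gym.Env):
    """Wraps a learned dynamics model as an RL environment (sketch)."""
    def __init__(self, dynamics_model, horizon=200):
        self.model = dynamics_model          # e.g. the attention-based predictor
        self.horizon = horizon
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(8,))
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,))

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.state = np.zeros(8, dtype=np.float32)
        return self.state, {}

    def step(self, action):
        # Next state comes from the learned model (assumed interface)
        self.state = self.model.predict(self.state, action)
        reward = -float(np.abs(self.state[0]))   # placeholder objective
        self.t += 1
        return self.state, reward, False, self.t >= self.horizon, {}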
Category: Data Science

Attention network without hidden state?

I was wondering how useful the encoder's hidden state is for an attention network. When I looked into the structure of an attention model, I found that a model generally looks like this:

x: Input.
h: Encoder's hidden state, which feeds forward into the next encoder hidden state.
s: Decoder's hidden state, which takes a weighted sum of all the encoder hidden states as input and feeds forward into the next decoder hidden state.
y: Output.

With a process …
Category: Data Science

Is the number of bidirectional LSTMs in encoder-decoder model equal to the maximum length of input text/characters?

I'm confused about this aspect of RNNs while trying to learn how a seq2seq encoder-decoder works, from https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/. It seems to me that the number of LSTMs in the encoder would have to be the same as the number of words in the text (if word embeddings are being used) or the number of characters (if character embeddings are being used). For character embeddings, each embedding would correspond to one LSTM in one direction and one encoder hidden state. Is this understanding …
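
For what the question is probing, a quick sketch may help: in Keras, one (bidirectional) LSTM layer is a single cell unrolled across time, with the same weights applied at every position, so the layer itself is not tied to a fixed number of words or characters:

import tensorflow as tf
from tensorflow.keras import layers

enc = tf.keras.Sequential([
    layers.Input(shape=(None, 32)),              # variable-length sequences
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
])
print(enc(tf.zeros((1, 10, 32))).shape)   # (1, 10, 128)
print(enc(tf.zeros((1, 25, 32))).shape)   # (1, 25, 128) - same layer, same weights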
Category: Data Science

Attention model with seq2seq over sequence

On the official TensorFlow page there is an example of a decoder (https://www.tensorflow.org/tutorials/text/nmt_with_attention#next_steps):

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x) …
Category: Data Science

Why does Keras only have 3 types of attention layers?

The Keras library's list of attention layers (keras attention layers) has only 3 types, which are:

MultiHeadAttention layer
Attention layer
AdditiveAttention layer

However, in theory there are multiple types of attention possible, e.g. (some of these may be synonyms):

Global
Local
Hard
Bahdanau attention
Luong attention
Self
Additive
Latent
what else?

Are the other types just not practical, or can they actually be derived from the existing implementations? Can someone please shed some light, with examples?
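
For context on the naming, the three built-ins already cover two of the listed "theoretical" types, per the Keras docs themselves; a quick usage sketch:

import tensorflow as tf
from tensorflow.keras import layers

q = tf.random.normal((2, 5, 16))   # (batch, Tq, dim)
v = tf.random.normal((2, 7, 16))   # (batch, Tv, dim)

# Keras's names map onto the classic papers:
# Attention         -> Luong-style dot-product attention
# AdditiveAttention -> Bahdanau-style additive attention
luong = layers.Attention()([q, v])
bahdanau = layers.AdditiveAttention()([q, v])
multi = layers.MultiHeadAttention(num_heads=4, key_dim=16)(q, v)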
Category: Data Science

Understanding Transformer's Self attention calculations

I was going through this link: https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/?utm_source=blog&utm_medium=demystifying-bert-groundbreaking-nlp-framework#comment-160771 What are the values of Key and Value in the self-attention calculation of the Transformer model? The Query vector is the embedding vector for the word being queried, is that right? Is the attention calculated in an RNN different from self-attention in a Transformer?
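
A toy numpy sketch of the calculation being asked about: in self-attention, query, key and value are all learned linear projections of the same word embeddings, not separate inputs:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))        # 4 word embeddings, dim 8
# Q, K, V are projections of the SAME embedding matrix X
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(K.shape[-1])                  # scaled dot product
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = weights @ V                                        # (4, 8): one vector per word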
Category: Data Science

How to visualize attention weights in an attention-based encoder-decoder network for time series forecasting

Below is an example of an attention-based encoder-decoder network for a multivariate time series forecasting task. I want to visualize the attention weights.

input_ = Input(shape=(TIME_STEPS, N))
x = attention_block(input_)
x = LSTM(512, return_sequences=True)(x)
x = LSTM(512)(x)
x = RepeatVector(n_future)(x)
x = LSTM(128, activation='relu', return_sequences=True)(x)
x = TimeDistributed(Dense(128, activation='relu'))(x)
x = Dense(1)(x)
model = Model(input_, x)
model.compile(loss="mean_squared_error", optimizer="adam", metrics=["acc"])
print(model.summary())

Here is the implementation of my attention block:

def attention_block(inputs):
    x = Permute((2, 1))(inputs)
    x = Dense(TIME_STEPS, activation="softmax")(x)
    x = Permute((2, 1), name="attention_prob")(x)
    x = multiply([inputs, x])
    return x

I would highly appreciate a fresh implementation of the attention …
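
Since the block already names its softmax layer "attention_prob", one way to visualize the weights is a probe model that exposes that layer's output (a sketch; X_sample is a hypothetical single input window of shape (TIME_STEPS, N)):

import matplotlib.pyplot as plt
from tensorflow.keras.models import Model

# Sub-model from the trained model's input to the named attention layer
probe = Model(model.input, model.get_layer("attention_prob").output)
weights = probe.predict(X_sample[None, ...])[0]   # (TIME_STEPS, N)

plt.imshow(weights.T, aspect="auto", cmap="viridis")
plt.xlabel("time step")
plt.ylabel("feature")
plt.colorbar()
plt.show()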
Category: Data Science

Do the multiple heads in Multi head attention actually lead to more parameters or different outputs?

I am trying to understand Transformers. While I understand the concept of the encoder-decoder structure and the idea behind self-attention, what I am stuck on is the "multi-head" part of the MultiHeadAttention layer. Looking at this explanation (https://jalammar.github.io/illustrated-transformer/), which I generally found very good, it appears that multiple weight matrices (one set of weight matrices per head) are used to transform the original input into the query, key and value, which are then used to calculate the attention scores …
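
A quick empirical sketch of the parameter question using Keras: here key_dim is per head, so holding it fixed makes the parameter count grow with the number of heads (note that the original paper instead sets key_dim = d_model / num_heads, which keeps the total roughly constant):

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 10, 64))

# Each extra head adds its own slice of the Q/K/V projection matrices
for heads in (1, 4, 8):
    mha = layers.MultiHeadAttention(num_heads=heads, key_dim=32)
    _ = mha(x, x)   # call once so the layer builds its weights
    print(heads, mha.count_params())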
Category: Data Science

Attention to multiple areas of same sentence

Let's consider some sentences below: "Datascience exchange is a wonderful platform to get answers to datascience related queries and it helps to learn various concepts too" "Can company1 buy company2? What will be their total turnover then?" "Coronavirus was originated in china. After that it is spreading all over the world. To prevent it everyone has to take care of cleanliness and prefer vegetarians." In all of the above sentences, you can see there are multiple questions or utterances, sometimes separated by …
Category: Data Science
