How to add the Luong attention mechanism to a CNN?

I'm writing the CNN model below for binary image classification, and I'm trying to add an attention layer to it. I read the tf.keras.layers.Attention docs (https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention) but I still don't know exactly how to use it; any help is appreciated.

model = keras.Sequential()
model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same', input_shape=(256, 256, 3)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, …
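
One way to use tf.keras.layers.Attention here (a sketch, not the only possible wiring): switch to the functional API, flatten the spatial grid into a sequence of feature vectors, and let the feature map attend to itself. The reshape and pooling choices below are assumptions, and the Keras docs describe this layer as Luong-style dot-product attention:

import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(256, 256, 3))
x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
x = layers.MaxPooling2D((2, 2), strides=(2, 2))(x)
x = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = layers.MaxPooling2D((2, 2), strides=(2, 2))(x)
# Flatten the 64x64 spatial grid into a sequence of 4096 feature vectors
seq = layers.Reshape((64 * 64, 128))(x)
# Luong-style (dot-product) self-attention over spatial positions:
# query = value = the CNN feature sequence
attended = layers.Attention()([seq, seq])
pooled = layers.GlobalAveragePooling1D()(attended)
outputs = layers.Dense(1, activation='sigmoid')(pooled)

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])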
Category: Data Science

Class token in ViT and BERT

I'm trying to understand the architecture in the ViT paper, and I noticed they use a CLASS token as in BERT. To the best of my understanding, this token is used to gather knowledge of the entire input, and is then used on its own to predict the class of the image. My question is: why does this token exist as an input to all the transformer blocks, and why is it treated the same as the word/patch tokens? Treating the class token …
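
For concreteness, a minimal Keras sketch (with hypothetical layer and variable names) of how the class token enters the patch sequence and how only its final state feeds the classifier:

import tensorflow as tf
from tensorflow.keras import layers

class AddClassToken(layers.Layer):
    """Prepends a learnable [CLS] embedding to the patch sequence."""
    def build(self, input_shape):
        # input_shape: (batch, num_patches, embed_dim)
        self.cls = self.add_weight(name="cls", shape=(1, 1, input_shape[-1]),
                                   initializer="zeros", trainable=True)
    def call(self, patches):
        batch = tf.shape(patches)[0]
        cls = tf.tile(self.cls, [batch, 1, 1])      # (batch, 1, dim)
        return tf.concat([cls, patches], axis=1)    # (batch, 1 + N, dim)

# After the transformer blocks, only position 0 (the class token) is read out:
# logits = layers.Dense(num_classes)(encoded[:, 0])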
Category: Data Science

Is a dense layer required for implementing Bahdanau attention?

I saw that everyone adds a Dense() layer to their custom Bahdanau attention layer, which I think isn't needed. This is an image from a tutorial here. Here we are just multiplying two vectors and then doing several operations on those vectors only. So what is the need for the Dense() layer? Is the tutorial on 'how attention works' wrong?
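
The Dense layers are the trainable part of Bahdanau's additive score, score(s, h) = v^T tanh(W1 s + W2 h); without them the score degenerates to a fixed dot product. A minimal sketch following the standard TensorFlow NMT tutorial formulation:

import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(layers.Layer):
    # W1, W2 and V are exactly the Dense layers in question: they are the
    # learned parameters that make the score "additive".
    def __init__(self, units):
        super().__init__()
        self.W1 = layers.Dense(units)
        self.W2 = layers.Dense(units)
        self.V = layers.Dense(1)

    def call(self, query, values):
        # query: (batch, hidden), values: (batch, seq_len, hidden)
        q = tf.expand_dims(query, 1)                               # (batch, 1, hidden)
        score = self.V(tf.nn.tanh(self.W1(q) + self.W2(values)))  # (batch, seq_len, 1)
        weights = tf.nn.softmax(score, axis=1)
        context = tf.reduce_sum(weights * values, axis=1)          # (batch, hidden)
        return context, weights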
Category: Data Science

Self-Attention Summation and Loss of Information

In self-attention, the attention for a word is calculated as: $$ A(q, K, V) = \sum_{i} \frac{\exp(q \cdot k^{\langle i \rangle})}{\sum_{j} \exp(q \cdot k^{\langle j \rangle})} \, v^{\langle i \rangle} $$ My question is: why do we sum over the softmax-weighted value vectors? Doesn't this lose information about which other words in particular are important to the word under consideration? In other words, how does this summed vector point to which words are relevant? For example, consider two extreme scenarios where practically the entire output depends on the attention vector of word $x^{\langle t \rangle}$, and …
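
A toy numerical example of the formula (numpy, made-up vectors): the softmax weights themselves are where "which words matter" lives, and the summed output is dominated by the highest-weighted value:

import numpy as np

# One query attending over 3 key/value pairs.
q = np.array([1.0, 0.0])
K = np.array([[4.0, 0.0],    # word 1: large dot product with q
              [0.0, 4.0],    # word 2
              [0.0, 0.0]])   # word 3
V = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

scores = K @ q                                   # [4.0, 0.0, 0.0]
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: ~[0.96, 0.02, 0.02]
output = weights @ V                             # close to V[0], word 1's value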
Category: Data Science

Working Behavior of BERT vs Transformers vs Self-Attention+LSTM vs Attention+LSTM on the scientific STEM data classification task?

So I just used a pre-trained BERT with focal loss to classify Physics, Chemistry, Biology and Mathematics, and got a good macro F1 of 0.91. That is good, given it only had to look for tokens like triangle, reaction, mitochondria, newton etc. in a broad way. Now I want to classify the chapter name as well. That is a harder task, because when I trained BERT on 208 classes, my score was almost 0. Why? I …
Category: Data Science

Could Attention_mask in T5 be a float in [0,1]?

I was inspecting the T5 model from HF (https://huggingface.co/docs/transformers/model_doc/t5). attention_mask is documented as: attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. I was wondering whether something "softer" could be used: not only selecting the non-padding tokens, but also specifying "how much" attention should be paid to each token. This question is …
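
For intuition, a sketch of the mechanics involved (simplified from how HF transformers converts the mask internally; the soft variant is an assumption, not a T5 feature):

import torch

# Simplified version of HF's internal conversion: 1 -> bias 0 (attend),
# 0 -> very large negative bias (blocked). Note that any fractional value m
# is scaled toward -inf, so a 0.5 does NOT behave as "half attention" here.
def hf_style_bias(mask, dtype=torch.float32):
    return (1.0 - mask[:, None, None, :].to(dtype)) * torch.finfo(dtype).min

# One way to get a genuinely soft mask (an assumption, not library behavior):
# adding log(m) to the logits multiplies each token's softmax weight by m.
def soft_bias(mask, eps=1e-9):
    return torch.log(mask[:, None, None, :].clamp_min(eps))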
Category: Data Science

Can the attention mask hold values between 0 and 1?

I am new to attention-based models and wanted to understand more about the attention mask in NLP models. attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. It's the mask that we typically use for attention when a batch has varying length sentences. So a normal attention mask is supposed to look …
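
As a quick sketch of what the quoted docstring describes, here is a binary padding mask built for a batch of varying-length sequences (hypothetical lengths):

import torch

lengths = torch.tensor([5, 3, 2])
max_len = int(lengths.max())
# 1 for real tokens, 0 for padding, per the docstring
attention_mask = (torch.arange(max_len)[None, :] < lengths[:, None]).long()
# tensor([[1, 1, 1, 1, 1],
#         [1, 1, 1, 0, 0],
#         [1, 1, 0, 0, 0]])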
Category: Data Science

two different attention methods for seq2seq

I see two different ways of applying attention in seq2seq: (a) the context vector (the weighted sum of encoder hidden states) is fed into the output softmax, as shown in the first diagram (the diagram is from here); (b) the context vector is fed into the decoder input, as shown in the second diagram (the diagram is from here). What are the pros and cons of the two approaches? Is there any paper comparing the two?
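
In rough pseudocode (Python comments, a sketch of how the two wirings are usually described), (a) corresponds to Luong-style and (b) to Bahdanau-style attention:

# (a) Luong-style: attend AFTER the decoder step; context feeds the output layer
# s_t = decoder_cell(y_prev, s_prev)
# c_t = attend(s_t, encoder_states)
# logits = W_out(concat(s_t, c_t))

# (b) Bahdanau-style: attend BEFORE the decoder step; context feeds the input
# c_t = attend(s_prev, encoder_states)
# s_t = decoder_cell(concat(y_prev, c_t), s_prev)
# logits = W_out(s_t)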
Category: Data Science

How does attention for feature fusion work?

I am struggling to understand how a self-attention layer would be used to fuse features from different modalities. What I understand so far is this: every modality is fed into its own self-attention layer, which produces attention scores for every feature of that modality. These scores tell us which features are most important within that modality. I have then read that, using these scores, we can find out which of the modalities is most …
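
One common pattern, sketched below under the assumption that each modality is first projected to a shared dimension: treat the modalities themselves as a short sequence and self-attend over it, so the attention scores directly express how much modality i draws on modality j:

import tensorflow as tf
from tensorflow.keras import layers

d = 64  # shared feature dimension (an assumption)
img = layers.Input(shape=(d,))
aud = layers.Input(shape=(d,))
txt = layers.Input(shape=(d,))

# Stack the modalities into a length-3 "sequence": (batch, 3, d)
stack = layers.Lambda(lambda t: tf.stack(t, axis=1))([img, aud, txt])
# Self-attention across modalities; the softmax scores are the
# cross-modality importances, the output is the fused representation
fused = layers.MultiHeadAttention(num_heads=1, key_dim=d)(stack, stack)
pooled = layers.GlobalAveragePooling1D()(fused)
out = layers.Dense(1, activation='sigmoid')(pooled)

model = tf.keras.Model([img, aud, txt], out)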
Category: Data Science

How to add attention mechanism to my sequence-to-sequence architecture in Keras?

Based on this blog entry, I have written a sequence-to-sequence deep learning model in Keras:

model = Sequential()
model.add(LSTM(hidden_nodes, input_shape=(n_timesteps, n_features)))
model.add(RepeatVector(n_timesteps))
model.add(LSTM(hidden_nodes, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=30, batch_size=32)

It works reasonably well, but I intend to improve it by applying an attention mechanism. The aforementioned blog post includes a variation of the architecture with attention, relying on custom attention code, but it doesn't work with my present TensorFlow/Keras versions, and anyway, to my …
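
A hedged sketch of the same architecture rewritten in the functional API with the built-in tf.keras.layers.Attention (dot-product) layer, so no custom attention code is needed; wiring the decoder states as queries against the encoder states is one common choice, not the blog's exact method:

import tensorflow as tf
from tensorflow.keras import layers, Model

n_timesteps, n_features, hidden_nodes = 20, 50, 128   # placeholders, as in the question

inp = layers.Input(shape=(n_timesteps, n_features))
# Encoder: keep the full state sequence for attention, plus the final states
enc_seq, state_h, state_c = layers.LSTM(hidden_nodes, return_sequences=True,
                                        return_state=True)(inp)
# Decoder: same RepeatVector trick as the original, seeded with the encoder state
dec_in = layers.RepeatVector(n_timesteps)(state_h)
dec_seq = layers.LSTM(hidden_nodes, return_sequences=True)(
    dec_in, initial_state=[state_h, state_c])
# Dot-product attention: decoder states query the encoder states
context = layers.Attention()([dec_seq, enc_seq])
merged = layers.Concatenate()([dec_seq, context])
out = layers.TimeDistributed(layers.Dense(n_features, activation='softmax'))(merged)

model = Model(inp, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])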
Category: Data Science

Custom Simulator for Deep Reinforcement Learning

I am trying to develop a control method for a specific industrial process. I have a time series of data for the process and want to develop a prediction model based on an attention mechanism to estimate the output of the system. After developing the prediction model, I want to design a controller based on deep reinforcement learning to learn policies for process optimization. But I need a simulated environment to train and test my DRL algorithm on. How …
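
One common route is to wrap the learned prediction model as the transition function of a Gymnasium environment. A minimal skeleton, with hypothetical state/action sizes and an assumed dynamics_model.predict(state, action) interface:

import gymnasium as gym
import numpy as np

class ProcessEnv(gym.Env):
    """Wraps a learned dynamics model as an RL environment (sketch)."""
    def __init__(self, dynamics_model, horizon=200):
        self.model = dynamics_model          # e.g. the attention-based predictor
        self.horizon = horizon
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(8,))
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,))

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.state = np.zeros(8, dtype=np.float32)
        return self.state, {}

    def step(self, action):
        # Next state comes from the learned model (assumed interface)
        self.state = self.model.predict(self.state, action)
        reward = -float(np.abs(self.state[0]))   # placeholder objective
        self.t += 1
        return self.state, reward, False, self.t >= self.horizon, {}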
Category: Data Science

Attention network without hidden state?

I was wondering how useful the encoder's hidden state is for an attention network. When I looked into the structure of an attention model, I found that a model generally looks like this:

x: Input.
h: Encoder's hidden state, which feeds forward into the next encoder hidden state.
s: Decoder's hidden state, which takes a weighted sum of all the encoder hidden states as input and feeds forward into the next decoder hidden state.
y: Output.

With a process …
Category: Data Science

Is the number of bidirectional LSTMs in encoder-decoder model equal to the maximum length of input text/characters?

I'm confused about this aspect of RNNs while trying to learn how a seq2seq encoder-decoder works, from https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/. It seems to me that the number of LSTMs in the encoder would have to be the same as the number of words in the text (if word embeddings are being used) or the number of characters (if character embeddings are being used). For character embeddings, each embedding would correspond to one LSTM in one direction and one encoder hidden state. Is this understanding …
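
For what the question is probing, a quick sketch may help: in Keras, one (bidirectional) LSTM layer is a single cell unrolled across time, with the same weights applied at every position, so the layer itself is not tied to a fixed number of words or characters:

import tensorflow as tf
from tensorflow.keras import layers

enc = tf.keras.Sequential([
    layers.Input(shape=(None, 32)),              # variable-length sequences
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
])
print(enc(tf.zeros((1, 10, 32))).shape)   # (1, 10, 128)
print(enc(tf.zeros((1, 25, 32))).shape)   # (1, 25, 128) - same layer, same weights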
Category: Data Science

Attention model with seq2seq over sequence

On the official TensorFlow page there is an example of a decoder (https://www.tensorflow.org/tutorials/text/nmt_with_attention#next_steps):

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x) …
Category: Data Science

Why does Keras only have 3 types of attention layers?

The Keras library's list of attention layers (keras attention layers) has only 3 types, which are:

MultiHeadAttention layer
Attention layer
AdditiveAttention layer

However, in theory there are multiple types of attention possible, e.g. (some of these may be synonyms):

Global
Local
Hard
Bahdanau attention
Luong attention
Self
Additive
Latent
what else?

Are the other types just not practical, or can they actually be derived from the existing implementations? Can someone please shed some light, with examples?
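
For context on the naming, the three built-ins already cover two of the listed "theoretical" types, per the Keras docs themselves; a quick usage sketch:

import tensorflow as tf
from tensorflow.keras import layers

q = tf.random.normal((2, 5, 16))   # (batch, Tq, dim)
v = tf.random.normal((2, 7, 16))   # (batch, Tv, dim)

# Keras's names map onto the classic papers:
# Attention         -> Luong-style dot-product attention
# AdditiveAttention -> Bahdanau-style additive attention
luong = layers.Attention()([q, v])
bahdanau = layers.AdditiveAttention()([q, v])
multi = layers.MultiHeadAttention(num_heads=4, key_dim=16)(q, v)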
Category: Data Science

Understanding Transformer's Self attention calculations

I was going through this link: https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/?utm_source=blog&utm_medium=demystifying-bert-groundbreaking-nlp-framework#comment-160771 What are the values of Key and Value in the self-attention calculation of the Transformer model? The Query vector is the embedding vector for the word being queried, is that right? Is the attention calculated in an RNN different from self-attention in a Transformer?
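
A toy numpy sketch of the calculation being asked about: in self-attention, query, key and value are all learned linear projections of the same word embeddings, not separate inputs:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))        # 4 word embeddings, dim 8
# Q, K, V are projections of the SAME embedding matrix X
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(K.shape[-1])                  # scaled dot product
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = weights @ V                                        # (4, 8): one vector per word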
Category: Data Science

How to visualize attention weights in an attention-based encoder-decoder network for time series forecasting

Below is an example of an attention-based encoder-decoder network for a multivariate time series forecasting task. I want to visualize the attention weights.

input_ = Input(shape=(TIME_STEPS, N))
x = attention_block(input_)
x = LSTM(512, return_sequences=True)(x)
x = LSTM(512)(x)
x = RepeatVector(n_future)(x)
x = LSTM(128, activation='relu', return_sequences=True)(x)
x = TimeDistributed(Dense(128, activation='relu'))(x)
x = Dense(1)(x)
model = Model(input_, x)
model.compile(loss="mean_squared_error", optimizer="adam", metrics=["acc"])
print(model.summary())

Here is the implementation of my attention block:

def attention_block(inputs):
    x = Permute((2, 1))(inputs)
    x = Dense(TIME_STEPS, activation="softmax")(x)
    x = Permute((2, 1), name="attention_prob")(x)
    x = multiply([inputs, x])
    return x

I would highly appreciate a fresh implementation of the attention …
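
Since the block already names its softmax layer "attention_prob", one way to visualize the weights is a probe model that exposes that layer's output (a sketch; X_sample is a hypothetical single input window of shape (TIME_STEPS, N)):

import matplotlib.pyplot as plt
from tensorflow.keras.models import Model

# Sub-model from the trained model's input to the named attention layer
probe = Model(model.input, model.get_layer("attention_prob").output)
weights = probe.predict(X_sample[None, ...])[0]   # (TIME_STEPS, N)

plt.imshow(weights.T, aspect="auto", cmap="viridis")
plt.xlabel("time step")
plt.ylabel("feature")
plt.colorbar()
plt.show()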
Category: Data Science

Do the multiple heads in Multi head attention actually lead to more parameters or different outputs?

I am trying to understand Transformers. While I understand the concept of the encoder-decoder structure and the idea behind self-attention, what I am stuck on is the "multi-head" part of the MultiHeadAttention layer. Looking at this explanation (https://jalammar.github.io/illustrated-transformer/), which I generally found very good, it appears that multiple weight matrices (one set of weight matrices per head) are used to transform the original input into the query, key and value, which are then used to calculate the attention scores …
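
A quick empirical sketch of the parameter question using Keras: here key_dim is per head, so holding it fixed makes the parameter count grow with the number of heads (note that the original paper instead sets key_dim = d_model / num_heads, which keeps the total roughly constant):

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 10, 64))

# Each extra head adds its own slice of the Q/K/V projection matrices
for heads in (1, 4, 8):
    mha = layers.MultiHeadAttention(num_heads=heads, key_dim=32)
    _ = mha(x, x)   # call once so the layer builds its weights
    print(heads, mha.count_params())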
Category: Data Science

Attention to multiple areas of same sentence

Let's consider some sentences below: "Datascience exchange is a wonderful platform to get answers to datascience related queries and it helps to learn various concepts too" "Can company1 buy company2? What will be their total turnover then?" "Coronavirus was originated in china. After that it is spreading all over the world. To prevent it everyone has to take care of cleanliness and prefer vegetarians." In all of the above sentences, you can see there are multiple questions or utterances, sometimes separated by …
Category: Data Science
