How to visualize attention weights in an attention-based encoder-decoder network for time series forecasting

Below is an example of an attention-based encoder-decoder network for a multivariate time series forecasting task. I want to visualize the attention weights.

input_ = Input(shape=(TIME_STEPS, N))
x = attention_block(input_)
x = LSTM(512, return_sequences=True)(x)
x = LSTM(512)(x)
x = RepeatVector(n_future)(x)
x = LSTM(128, activation='relu', return_sequences=True)(x)
x = TimeDistributed(Dense(128, activation='relu'))(x)
x = Dense(1)(x)
model = Model(input_, x)
model.compile(loss="mean_squared_error", optimizer="adam", metrics=["acc"])
print(model.summary())

Here is the implementation of my attention block:

def attention_block(inputs):
    x = Permute((2, 1))(inputs)
    x = Dense(TIME_STEPS, activation="softmax")(x)
    x = Permute((2, 1), name="attention_prob")(x)
    x = multiply([inputs, x])
    return x

I would highly appreciate a fresh implementation of the attention …
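One way to inspect the weights in Keras is to build a sub-model that outputs the layer named "attention_prob" (e.g. Model(model.input, model.get_layer("attention_prob").output)) and plot its predictions as a heatmap. As a framework-free illustration of what the block above computes, here is a numpy sketch; the weight matrix W and bias b stand in for the Dense layer's learned parameters:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_block_np(inputs, W, b):
    # inputs: (batch, TIME_STEPS, N). Permute to (batch, N, TIME_STEPS),
    # apply a Dense(TIME_STEPS) with softmax over the time axis,
    # permute back, and rescale the inputs element-wise.
    x = np.transpose(inputs, (0, 2, 1))           # (batch, N, TIME_STEPS)
    probs = softmax(x @ W + b, axis=-1)           # softmax over TIME_STEPS
    probs = np.transpose(probs, (0, 2, 1))        # (batch, TIME_STEPS, N)
    return inputs * probs, probs                  # weighted inputs + weights

batch, TIME_STEPS, N = 2, 5, 3
rng = np.random.default_rng(0)
inputs = rng.normal(size=(batch, TIME_STEPS, N))
W = rng.normal(size=(TIME_STEPS, TIME_STEPS))
b = np.zeros(TIME_STEPS)
weighted, attn = attention_block_np(inputs, W, b)
print(attn.shape)
```

For each sample, the (TIME_STEPS, N) weight matrix sums to 1 along the time axis per feature, which is exactly what a heatmap of the "attention_prob" output would show.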
Category: Data Science

Special tokens for encoder and decoder in the transformer architecture

I am trying to wrap my head around the different special tokens that the different transformer architectures use. For example, let's say we have the following input and targets, both for a text generation example and for a text classification example:

Input: My cat is black
Target_generation: He is a good cat
Target_classification: Positive

Now, for text classification, using something like BERT, I know I have to do the following:

Encoder input: [CLS, "My", "cat", "is", "black"]

Pool the …
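For comparison, a small sketch of the two layouts. The special-token names follow BERT and T5-style conventions; the exact tokens depend on the tokenizer and model you actually use:

```python
# Hypothetical token layouts for illustration only.
def classification_input(tokens):
    # BERT-style encoder-only classification: [CLS] ... [SEP],
    # then pool the [CLS] position for the classifier head.
    return ["[CLS]"] + tokens + ["[SEP]"]

def generation_pair(src_tokens, tgt_tokens):
    # Encoder-decoder generation: the decoder input is the target
    # shifted right behind a start token; the label ends with an end token.
    decoder_input = ["<s>"] + tgt_tokens
    labels = tgt_tokens + ["</s>"]
    return src_tokens, decoder_input, labels

enc = classification_input(["My", "cat", "is", "black"])
src, dec_in, labels = generation_pair(["My", "cat", "is", "black"],
                                      ["He", "is", "a", "good", "cat"])
print(enc)
print(dec_in)
print(labels)
```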
Category: Data Science

How to add a Decoder & Attention Layer to Bidirectional Encoder with tensorflow 2.0

I am a beginner in machine learning and I'm trying to create a spelling correction model that spell-checks a small vocabulary (approximately 1000 phrases). Currently, I am referring to the TensorFlow 2.0 tutorials for 1. NMT with Attention, and 2. Text Generation. I have completed up to the encoding layer, but currently I am having some issues matching the shapes of the following layers (decoder and attention) with the previous one (encoder). The encoder in the …
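To line up the decoder with the encoder, the key shapes are the encoder outputs, (batch, src_len, units), and the decoder hidden state, (batch, units). A numpy sketch of the Bahdanau-style additive attention used in the NMT-with-attention tutorial; W1, W2, and v stand in for learned weights, and the sizes are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bahdanau_attention(enc_out, dec_hidden, W1, W2, v):
    # enc_out: (batch, src_len, units); dec_hidden: (batch, units)
    # score_t = v^T tanh(W1 h_t + W2 s), a weight per source time step.
    score = np.tanh(enc_out @ W1 + (dec_hidden @ W2)[:, None, :]) @ v
    weights = softmax(score, axis=1)                      # (batch, src_len)
    context = (weights[..., None] * enc_out).sum(axis=1)  # (batch, units)
    return context, weights

batch, src_len, units, attn_units = 2, 7, 8, 4
rng = np.random.default_rng(0)
enc_out = rng.normal(size=(batch, src_len, units))
dec_hidden = rng.normal(size=(batch, units))
W1 = rng.normal(size=(units, attn_units))
W2 = rng.normal(size=(units, attn_units))
v = rng.normal(size=(attn_units,))
context, weights = bahdanau_attention(enc_out, dec_hidden, W1, W2, v)
print(context.shape, weights.shape)
```

The context vector has the same width as the encoder outputs, so it can be concatenated with the decoder input at each decoding step.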
Category: Data Science

Natural Language gender classification task with very small training set

The task involves determining the gender of the creator of a Reddit post. Given a post and its title, I need a model to output a probability vector $[p_{male},p_{female}]$. The difficulty here is that the training set is very small: we have only 5,000 labeled posts. In addition, the average sentence length exceeds 90 words, making it hard to extract features. Currently, we are using non-deep-learning methods to perform this task because of the small dataset size: use …
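A common small-data baseline along those lines is a TF-IDF plus linear-model pipeline, where predict_proba gives exactly the requested probability vector. A minimal sklearn sketch; the corpus and labels below are made up purely for illustration:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data; in practice these are the 5,000 labeled posts.
posts = ["first example post text", "second example post text",
         "another short post", "yet another post here"] * 5
labels = [0, 1, 0, 1] * 5  # 0 = male, 1 = female

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),  # L2-regularized; tune C by CV on small data
)
clf.fit(posts, labels)
proba = clf.predict_proba(["another example post"])[0]  # [p_male, p_female]
print(proba)
```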
Topic: encoder tfidf nlp
Category: Data Science

Get Hidden Layers in PyTorch TransformerEncoder

I am trying to access the hidden layers when using TransformerEncoder and TransformerEncoderLayer. I could not find anything like that in the source code for these classes. I am not using Hugging Face, but I know that there one can get hidden_states and last_hidden_state. I am looking for something similar. Do you know how I can access them?
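One option (a sketch, not the only way) is to register a forward hook on each TransformerEncoderLayer inside encoder.layers and collect what each layer returns:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)

hidden_states = []  # will hold one tensor per layer, like HF's hidden_states
hooks = [
    lyr.register_forward_hook(lambda mod, inp, out: hidden_states.append(out))
    for lyr in encoder.layers
]

x = torch.randn(2, 5, 16)            # (batch, seq_len, d_model)
last_hidden_state = encoder(x)
for h in hooks:
    h.remove()                        # always detach hooks when done

print(len(hidden_states))
print(hidden_states[-1].shape)
```

With the default final norm (None), the last hook output is identical to the encoder's return value, mirroring last_hidden_state.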
Category: Data Science

Can anyone interpret this Recurrent Network Encoder-Decoder question?

I'm trying to earn some extra credit, so the professor won't elaborate further on what's being asked in this question: The dataset that we're given is a line-by-line file of protein sequences (something like this: LVPRGSHMASMTGGQQMGRGSMVSSSSSGSDSLLLLSEECLLSASSGSGIQIQICKQIPKDWIYSYQVEEGSDLT) What on earth is he asking about the encoder-decoder? Aren't these used to encode some information (like an English sentence) and then decode it into some other data (like a Spanish sentence)? What should I be encoding and decoding in this scenario? Thank you
Category: Data Science

Squeeze and excitation blocks in 3D convnet architectures to forecast physical systems

I am using a temporal 3D U-Net (time dimension + 2 spatial dimensions) to forecast physical features of a fluid (pressure, temperature, and velocities) using data from a simulator. I am thinking of using squeeze-and-excitation blocks in the encoder to capture correlations between small-scale and large-scale movements. So my question is: how can I add a squeeze-and-excitation block to the 3D U-Net architecture? Thanks.
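The SE block itself only touches the channel axis; for a 3D U-Net, the squeeze pools over the temporal and both spatial axes. A numpy sketch of the computation, where W1 and W2 stand in for the two learned dense layers (in practice you would insert this after each encoder convolution block):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite_3d(x, W1, W2):
    # x: (batch, T, H, W, C) feature map from a 3D conv stage.
    # Squeeze: global average pool over temporal + spatial axes -> (batch, C).
    s = x.mean(axis=(1, 2, 3))
    # Excite: bottleneck MLP C -> C/r -> C (ratio r set by W1's shape),
    # ReLU then sigmoid gates in (0, 1).
    z = sigmoid(np.maximum(s @ W1, 0.0) @ W2)
    # Rescale each channel of the original feature map.
    return x * z[:, None, None, None, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8, 8, 16))   # toy feature map, C = 16
W1 = rng.normal(size=(16, 4))           # reduction ratio r = 4
W2 = rng.normal(size=(4, 16))
out = squeeze_excite_3d(x, W1, W2)
print(out.shape)
```

The output shape matches the input, so the block can be dropped between any two stages of the U-Net encoder without changing the rest of the architecture.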
Category: Data Science

Encode time-series of different lengths with keras

I have time series as my data (one time series per training example). I would like to encode the data within these series into a fixed-length feature vector using a Keras model. The problem is that the time series of my different examples don't have the same lengths. I haven't found a way of doing that. The problem with the encoder-decoder approach is that if the input lengths vary, the output lengths vary as well. But I would like to have an output of …
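A common workaround is to pad all series to a common length and let the model mask the padded steps (in Keras, padding plus a Masking layer in front of an LSTM, whose final state is then a fixed-length vector regardless of input length). A numpy sketch of the padding and mask construction; the helper name is made up:

```python
import numpy as np

def pad_series(series_list, value=0.0):
    # Pad variable-length (length, features) series to a common length and
    # return a boolean mask marking the real (non-padded) time steps.
    max_len = max(len(s) for s in series_list)
    n_feat = series_list[0].shape[1]
    batch = np.full((len(series_list), max_len, n_feat), value)
    mask = np.zeros((len(series_list), max_len), dtype=bool)
    for i, s in enumerate(series_list):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True
    return batch, mask

series = [np.ones((3, 2)), np.ones((5, 2))]  # two examples, lengths 3 and 5
batch, mask = pad_series(series)
print(batch.shape, mask.sum(axis=1))
```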
Category: Data Science

Changing order of LabelEncoder() result

Assume I have a multi-class classification task. The labels are:

Class 1
Class 2
Class 3

After LabelEncoder(), the labels are transformed into 0-1-2. My questions are:

Do the labels have to start from 0?
Do the labels have to be sequential?
What happens if I replace all label 0s with 3, so that my labels are 1-2-3 instead of 0-1-2? (This is done before training.)
If the labels were numeric, such as 10-100-1000, will I still have to use …
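For what LabelEncoder actually produces, a quick sklearn check: it always maps to 0..K-1 in sorted class order, regardless of the original label values. Many losses (e.g. sparse categorical cross-entropy) index the output layer with these codes, which is why relabeling 0 as 3 would break training:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Class 1", "Class 3", "Class 2", "Class 1"])
print(codes)        # codes follow sorted class order, always starting at 0
print(le.classes_)  # the sorted classes; index in this array == code

# Arbitrary numeric labels are fine as *inputs*; the encoder still
# produces a contiguous 0..K-1 range.
le2 = LabelEncoder()
codes2 = le2.fit_transform([10, 1000, 100, 10])
print(codes2)
```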
Category: Data Science

Encoder-Decoder LSTM for Trajectory Prediction

I need to use an encoder-decoder structure to predict 2D trajectories. As almost all available tutorials are related to NLP (with sparse vectors), I couldn't be sure about how to adapt the solutions to continuous data. In addition to my ignorance of sequence-to-sequence models, the embedding process for words confused me further. I have a dataset that consists of 3,000,000 samples, each having x-y coordinates in (-1, 1) with 125 observations, which means the shape of each sample is (125, 2). I …
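For continuous coordinates, no embedding layer is needed: the raw (125, 2) values can feed the encoder LSTM directly, and the decoder can be trained with teacher forcing (its input is the target shifted by one step). A numpy sketch of the data preparation; the window sizes are illustrative:

```python
import numpy as np

def make_seq2seq_pairs(traj, n_past, n_future):
    # traj: (T, 2) continuous x-y coordinates.
    enc_in, dec_in, dec_target = [], [], []
    for t in range(len(traj) - n_past - n_future):
        past = traj[t : t + n_past]
        future = traj[t + n_past : t + n_past + n_future]
        enc_in.append(past)
        # Decoder input = last observed point + future shifted right by one.
        dec_in.append(np.vstack([past[-1:], future[:-1]]))
        dec_target.append(future)
    return np.array(enc_in), np.array(dec_in), np.array(dec_target)

traj = np.random.default_rng(0).uniform(-1, 1, size=(125, 2))  # toy trajectory
enc_in, dec_in, dec_target = make_seq2seq_pairs(traj, n_past=100, n_future=10)
print(enc_in.shape, dec_in.shape, dec_target.shape)
```

At inference time the decoder runs step by step, feeding each predicted point back in as the next input.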
Category: Data Science

Why transform embedding dimension in sin-cos positional encoding?

Positional encoding using sine-cosine functions is often used in transformer models. Assume that $X \in R^{l\times d}$ is the embedding of an example, where $l$ is the sequence length and $d$ is the embedding size. This positional encoding layer encodes $X$'s position $P \in R^{l\times d}$ and outputs $P + X$. The position $P$ is a 2-D matrix, where $i$ refers to the order in the sentence and $j$ refers to the position along the embedding vector dimension. In this …
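A minimal numpy implementation of the sine-cosine encoding described above, with even columns getting sine and odd columns cosine (assumes $d$ is even):

```python
import numpy as np

def positional_encoding(l, d):
    # P[i, 2j]   = sin(i / 10000^(2j/d))
    # P[i, 2j+1] = cos(i / 10000^(2j/d))
    pos = np.arange(l)[:, None]            # i: position in the sequence
    j = np.arange(0, d, 2)[None, :]        # even embedding indices 2j
    angles = pos / np.power(10000.0, j / d)
    P = np.zeros((l, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

P = positional_encoding(50, 16)
print(P.shape)  # same shape as X, so the layer can output X + P
```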
Category: Data Science

What is the difference between BERT architecture and vanilla Transformer architecture

I'm doing some research on the summarization task and found out that BERT is derived from the Transformer model. Every blog about BERT that I have read focuses on explaining what a bidirectional encoder is, so I think this is what makes BERT different from the vanilla Transformer model. But as far as I know, the Transformer reads the entire sequence of words at once, therefore it is considered bidirectional too. Can someone point out what I'm missing?
Category: Data Science

Role of decoder in Transformer?

I understand the mechanics of the encoder-decoder architecture used in the Attention Is All You Need paper. My question is more high-level, about the role of the decoder. Say we have a sentence translation task: Je suis étudiant -> I am a student. The encoder receives Je suis étudiant as the input and generates an encoder output which, ideally, should embed the context/meaning of the sentence. The decoder receives this encoder output and an input query (I, am, a, student) as …
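One concrete piece of the decoder's role is the look-ahead (causal) mask: during training the decoder is fed the whole target sequence at once, but each position may only attend to earlier positions, so it still learns to predict the next token. A numpy sketch of that mask:

```python
import numpy as np

def causal_mask(seq_len):
    # Position t may attend to positions <= t only; future positions get
    # -inf before the softmax, so their attention weight becomes zero.
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

m = causal_mask(4)
print(m)
```

The mask is added to the decoder's self-attention scores; the encoder needs no such mask because the full source sentence is always visible.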
Category: Data Science

sklearn serialize label encoder for multiple categorical columns

I have a model with several categorical features that need to be converted to numeric format. I am using a combination of LabelEncoder and OneHotEncoder to achieve this. Once in production, I need to apply the same encoding to new incoming data before the model can be used. I've saved the model and the encoders to disk using pickle. The problem here is that the LabelEncoder keeps only the last set of classes (for the last feature it has encoded), …
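One common fix (a sketch, not the only option) is to fit and keep one encoder per column and serialize the whole mapping; sklearn's OrdinalEncoder or OneHotEncoder can also fit many columns in a single object. Using pickle.dumps here to stand in for writing to disk, with made-up data:

```python
import pickle
from sklearn.preprocessing import LabelEncoder

# Fit one encoder per categorical column instead of reusing a single one.
train = {"color": ["red", "blue", "red"], "size": ["S", "M", "L"]}
encoders = {col: LabelEncoder().fit(values) for col, values in train.items()}

# Serialize and restore the whole dict; each column keeps its own classes_.
blob = pickle.dumps(encoders)
encoders = pickle.loads(blob)

new_row = {"color": ["blue"], "size": ["M"]}
encoded = {col: encoders[col].transform(vals) for col, vals in new_row.items()}
print(encoded)
```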
Category: Data Science

Encoding correlation

I have a rather theory-based question, as I'm not that experienced with encoders, embeddings, etc. Scientifically, I'm mostly oriented around novel evolutionary model-based methods. Let's assume we have a data set with highly correlated attributes. Usually encoders are trained to learn a representation in a lower number of dimensions. What I'm wondering about is quite the opposite: would it be possible to learn an encoding to a higher number of dimensions that is less correlated (ideally non-correlated)? The idea is to turn a low-dimensional, very tough problem into …
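One building block for the "less correlated" direction is a whitening transform: after PCA whitening, the encoded features have (near-)identity covariance. A numpy sketch on made-up correlated data; mapping to a *higher* number of dimensions would additionally need, e.g., a random or learned expansion before the whitening step:

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    # Center, rotate onto the principal axes, and rescale so the output
    # covariance is (approximately) the identity: features are decorrelated.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    return Xc @ eigvec / np.sqrt(eigval + eps)

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
# Two attributes that are almost copies of each other (highly correlated).
X = np.column_stack([base[:, 0], 0.9 * base[:, 0] + 0.1 * base[:, 1]])
Z = pca_whiten(X)
print(np.round(np.cov(Z, rowvar=False), 3))  # close to the identity matrix
```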
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.