Below is an example of an attention-based encoder-decoder network for a multivariate time series forecasting task. I want to visualize the attention weights.

    input_ = Input(shape=(TIME_STEPS, N))
    x = attention_block(input_)
    x = LSTM(512, return_sequences=True)(x)
    x = LSTM(512)(x)
    x = RepeatVector(n_future)(x)
    x = LSTM(128, activation='relu', return_sequences=True)(x)
    x = TimeDistributed(Dense(128, activation='relu'))(x)
    x = Dense(1)(x)
    model = Model(input_, x)
    model.compile(loss="mean_squared_error", optimizer="adam", metrics=["acc"])
    print(model.summary())

Here is the implementation of my attention block:

    def attention_block(inputs):
        x = Permute((2, 1))(inputs)
        x = Dense(TIME_STEPS, activation="softmax")(x)
        x = Permute((2, 1), name="attention_prob")(x)
        x = multiply([inputs, x])
        return x

I would highly appreciate it if a fresh implementation of the attention …
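For the visualization itself, I was thinking of something along these lines (a sketch only; X_val is a placeholder for a validation batch of shape (samples, TIME_STEPS, N)), using a sub-model that exposes the output of the layer named "attention_prob":

    from tensorflow.keras.models import Model
    import matplotlib.pyplot as plt

    # Sub-model that returns the softmax scores produced inside the attention block.
    attention_model = Model(inputs=model.input,
                            outputs=model.get_layer("attention_prob").output)

    # The weights come back with shape (samples, TIME_STEPS, N): one score per
    # time step and feature for each example.
    att = attention_model.predict(X_val)

    plt.imshow(att[0].T, aspect="auto", cmap="viridis")
    plt.xlabel("time step")
    plt.ylabel("feature")
    plt.colorbar(label="attention weight")
    plt.show()

I am not sure whether this is the cleanest way to expose the weights, which is why a fresh implementation would help.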
I am trying to wrap my head around the different special tokens that the different transformer architectures use. For example, let's say we have the following input and target, both for a text generation example and for a text classification example:

    Input: My cat is black
    Target_generation: He is a good cat
    Target_classification: Positive

Now, for text classification, using something like BERT, I know I have to do the following:

    Encoder input: [CLS, "My", "cat", "is", "black"]

Pool the …
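To make the classification side concrete, this is the kind of check I have in mind (a small sketch using the Hugging Face tokenizer, only to inspect which special tokens are added; the model name is just an example):

    from transformers import BertTokenizer

    tok = BertTokenizer.from_pretrained("bert-base-uncased")

    # The tokenizer adds the classification special tokens automatically.
    enc = tok("My cat is black")
    print(tok.convert_ids_to_tokens(enc["input_ids"]))
    # roughly: ['[CLS]', 'my', 'cat', 'is', 'black', '[SEP]']

What I am unsure about is how the equivalent bookkeeping looks for the generation target.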
I am a beginner in machine learning and I'm trying to create a spelling correction model that spell-checks a small vocabulary (approximately 1000 phrases). Currently, I am referring to the TensorFlow 2.0 tutorials for 1. NMT with Attention, and 2. Text Generation. I have completed everything up to the encoding layer, but I am having some issues matching up the shapes of the following layers (decoder and attention) with the previous one (encoder). The encoder in the …
I am trying to implement the paper titled Learning Cross-lingual Sentence Representations via a Multi-task Dual-Encoder Model. Here the encoder and decoder share the same weights, but I am unable to put this into code. Any links?
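My current understanding is that weight sharing is expressed by reusing the same layer objects on both inputs; a minimal sketch of that idea in Keras (the vocabulary size and dimensions are placeholders, and this is not claimed to be the paper's exact architecture):

    from tensorflow.keras import layers, Model, Input

    # The *same* layer instances are applied to both inputs, so the two
    # branches use identical weights.
    shared_embed = layers.Embedding(input_dim=32000, output_dim=256)
    shared_encoder = layers.LSTM(256)

    src = Input(shape=(None,), dtype="int32")   # source-language sentence
    tgt = Input(shape=(None,), dtype="int32")   # target-language sentence

    src_vec = shared_encoder(shared_embed(src))
    tgt_vec = shared_encoder(shared_embed(tgt))

    model = Model([src, tgt], [src_vec, tgt_vec])

Is this the right way to express the sharing, or does the paper require something more involved?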
The task involves determining the gender of the creator of a Reddit post. Given a post and its title, I need a model to output a probability vector $[p_{male},p_{female}]$. The difficulty here is that the training set is very small: we have only 5000 labeled posts. In addition, the average sentence length exceeds 90, making it hard to extract features. Currently, we are using non-deep-learning methods to perform this task because of the small size of the dataset: use …
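For illustration, one such non-deep-learning baseline could look like the sketch below (texts and labels are hypothetical placeholders for the 5000 labeled posts; this is not necessarily the exact pipeline we use):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # texts: 5000 "title + post" strings; labels: 0 = male, 1 = female
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)
    proba = clf.predict_proba(texts)   # columns follow clf.classes_, i.e. [p_male, p_female]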
I am trying to access the hidden layers when using TransformerEncoder and TransformerEncoderLayer. I could not find anything like that in the source code for these classes. I am not using Hugging Face, but I know that there one can get hidden_states and last_hidden_state. I am looking for something similar. Do you know how I can access them?
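Is something like the following hook-based approach the right way to do it? (A sketch, assuming the default (seq_len, batch, d_model) layout and no final norm on the encoder.)

    import torch
    import torch.nn as nn

    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
    encoder = nn.TransformerEncoder(layer, num_layers=6)

    # Collect the output of every encoder layer with forward hooks.
    hidden_states = []
    for mod in encoder.layers:
        mod.register_forward_hook(lambda module, inputs, output: hidden_states.append(output))

    src = torch.rand(10, 32, 512)        # (seq_len, batch, d_model)
    last_hidden_state = encoder(src)     # here equal to the last entry of hidden_states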
I am learning about Transformers and studying decoding methods such as beam search and random sampling, which are easy to understand. However, when it comes to Minimum Bayes Risk decoding, it is more difficult. Please help.
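To check my understanding: is MBR roughly the procedure in this toy sketch, where sample_translation and similarity are hypothetical placeholders for a sampler (e.g. random sampling from the model) and a metric (e.g. sentence-level BLEU)?

    # Sample several candidate translations, then return the one with the highest
    # average similarity (i.e. lowest expected risk) to all the other samples.
    def mbr_decode(sample_translation, similarity, n_samples=20):
        candidates = [sample_translation() for _ in range(n_samples)]

        def expected_utility(cand):
            return sum(similarity(cand, other) for other in candidates if other is not cand)

        return max(candidates, key=expected_utility)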
I'm trying to earn some extra credit, so the professor won't elaborate further on what's being asked in this question: The dataset that we're given is a line-by-line file of protein sequences (something like this: LVPRGSHMASMTGGQQMGRGSMVSSSSSGSDSLLLLSEECLLSASSGSGIQIQICKQIPKDWIYSYQVEEGSDLT) What on earth is he asking about the encoder-decoder? Aren't these used to encode some information (like an English sentence) and then decode it into some other data (like a Spanish sentence)? What should I be encoding and decoding in this scenario? Thank you
I am using a temporal 3D U-Net (time dimension + 2 spatial dimensions) to forecast physical features of a fluid (pressure, temperature, and velocities) using data from a simulator. I am thinking of using squeeze-and-excitation in the encoder to capture correlations between small-scale and large-scale movements. So my question is: how can I add a squeeze-and-excitation block to the 3D U-Net architecture? Thanks.
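What I have in mind is something like this sketch (written for channels-last 5D tensors in Keras; the reduction ratio is a placeholder):

    from tensorflow.keras import layers

    def squeeze_excite_3d(x, ratio=8):
        # Squeeze: global average over the temporal and spatial axes;
        # excite: two Dense layers produce one sigmoid gate per channel.
        channels = x.shape[-1]
        s = layers.GlobalAveragePooling3D()(x)
        s = layers.Dense(channels // ratio, activation="relu")(s)
        s = layers.Dense(channels, activation="sigmoid")(s)
        s = layers.Reshape((1, 1, 1, channels))(s)
        return layers.Multiply()([x, s])

I would then apply it to each encoder convolution block's output before downsampling, but I am not sure if that is the intended placement.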
I have time series as my data (one time series per training example). I would like to encode the data within these series into a fixed-length vector of features using a Keras model. The problem is that the time series of my different examples don't have the same lengths, and I haven't found a way of handling that. The problem with the encoder-decoder approach is that if the input lengths vary, the output lengths vary as well. But I would like to have an output of …
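Is something like the following the usual workaround? (A sketch: pad every series to a common length, mask the padding, and take the final recurrent state as the fixed-length encoding; max_len, n_channels and the encoding size are placeholders.)

    from tensorflow.keras import layers, Model, Input

    max_len, n_channels = 200, 1
    inp = Input(shape=(max_len, n_channels))
    x = layers.Masking(mask_value=0.0)(inp)        # padded steps are ignored
    encoding = layers.LSTM(64)(x)                  # (batch, 64) regardless of true length
    encoder = Model(inp, encoding)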
Assume I have a multi-class classification task. The labels are:

    Class 1
    Class 2
    Class 3

After LabelEncoder(), the labels are transformed into 0-1-2. My questions are:

    Do the labels have to start from 0?
    Do the labels have to be sequential?
    What happens if I replace all label 0s with 3, so that my labels are 1-2-3 instead of 0-1-2? (This is done before training.)
    If the labels were numeric, such as 10-100-1000, will I still have to use …
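For context, this is the behaviour I am asking about; a quick sketch with numeric labels shows that LabelEncoder always maps the sorted classes back to 0..n-1:

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    print(le.fit_transform([10, 1000, 100, 10]))   # -> [0 2 1 0]
    print(le.classes_)                             # -> [  10  100 1000]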
I need to use an encoder-decoder structure to predict 2D trajectories. As almost all available tutorials are related to NLP (with sparse vectors), I am not sure how to adapt the solutions to continuous data. In addition to my ignorance of sequence-to-sequence models, the embedding process for words confused me even more. I have a dataset that consists of 3,000,000 samples, each having x-y coordinates in (-1, 1) with 125 observations, which means the shape of each sample is (125, 2). I …
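Is the right adaptation simply to drop the Embedding layer and feed the coordinates to the encoder directly, along the lines of this sketch? (The 100/25 split of the 125 observations is just an assumption for illustration.)

    from tensorflow.keras import layers, Model, Input

    n_in, n_out = 100, 25
    enc_in = Input(shape=(n_in, 2))
    _, h, c = layers.LSTM(128, return_state=True)(enc_in)     # encoder final state

    dec_in = Input(shape=(n_out, 2))                          # teacher-forced previous coordinates
    dec = layers.LSTM(128, return_sequences=True)(dec_in, initial_state=[h, c])
    out = layers.TimeDistributed(layers.Dense(2))(dec)        # predicted (x, y) per step

    model = Model([enc_in, dec_in], out)
    model.compile(optimizer="adam", loss="mse")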
Positional encoding using sine-cosine functions is often used in transformer models. Assume that $X \in \mathbb{R}^{l\times d}$ is the embedding of an example, where $l$ is the sequence length and $d$ is the embedding size. The positional encoding layer encodes $X$'s position as $P \in \mathbb{R}^{l\times d}$ and outputs $P + X$. The position $P$ is a 2-D matrix, where $i$ refers to the order in the sentence and $j$ refers to the position along the embedding vector dimension. In this …
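The formula I am working from is $P_{i,2j} = \sin\!\left(i / 10000^{2j/d}\right)$ and $P_{i,2j+1} = \cos\!\left(i / 10000^{2j/d}\right)$; a small numpy sketch of it (assuming $d$ is even):

    import numpy as np

    def positional_encoding(l, d):
        P = np.zeros((l, d))
        pos = np.arange(l)[:, None]              # i: order in the sentence
        div = 10000 ** (np.arange(0, d, 2) / d)  # 10000^(2j/d)
        P[:, 0::2] = np.sin(pos / div)           # even embedding dimensions
        P[:, 1::2] = np.cos(pos / div)           # odd embedding dimensions
        return P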
I'm doing some research on the summarization task and found out that BERT is derived from the Transformer model. Every blog about BERT that I have read focuses on explaining what a bidirectional encoder is, so I think this is what made BERT different from the vanilla Transformer model. But as far as I know, the Transformer reads the entire sequence of words at once, therefore it is considered bidirectional too. Can someone point out what I'm missing?
I understand the mechanics of the encoder-decoder architecture used in the Attention Is All You Need paper. My question is more high-level, about the role of the decoder. Say we have a sentence translation task: Je suis étudiant -> I am a student. The encoder receives Je suis étudiant as the input and generates the encoder output, which ideally should embed the context/meaning of the sentence. The decoder receives this encoder output and an input query (I, am, a, student) as …
I have a model with several categorical features that need to be converted to numeric format. I am using a combination of LabelEncoder and OneHotEncoder to achieve this. Once in production, I need to apply the same encoding to new incoming data before the model can be used. I've saved the model and the encoders to disk using pickle. The problem here is that the LabelEncoder keeps only the last set of classes (for the last feature it has encoded), …
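Is the right fix simply to keep one fitted encoder per feature, along the lines of this sketch? (df, categorical_cols and new_data are placeholders for my dataframes and column list.)

    import pickle
    from sklearn.preprocessing import LabelEncoder

    # One fitted LabelEncoder per column, pickled together as a dict.
    encoders = {col: LabelEncoder().fit(df[col]) for col in categorical_cols}
    with open("encoders.pkl", "wb") as f:
        pickle.dump(encoders, f)

    # In production: load the dict and reuse each column's own encoder.
    with open("encoders.pkl", "rb") as f:
        encoders = pickle.load(f)
    for col in categorical_cols:
        new_data[col] = encoders[col].transform(new_data[col])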
I have a rather theory-based question, as I'm not that experienced with encoders, embeddings, etc. Scientifically, I'm mostly oriented around novel evolutionary model-based methods. Let's assume we have a data set with highly correlated attributes. Usually, encoders are trained to learn a representation in a smaller number of dimensions. What I'm wondering about is quite the opposite: would it be possible to learn an encoding to a higher number of dimensions that are less correlated (ideally non-correlated)? The idea is to turn a low-dimensional, very tough problem into …
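What I am picturing is roughly a sparse, overcomplete autoencoder, something like this sketch (the sizes are placeholders, and the L1 activity penalty is only one crude way to keep the wider code from simply copying the input):

    from tensorflow.keras import layers, regularizers, Model, Input

    n_in, n_code = 10, 32              # n_code > n_in: code is wider than the input
    inp = Input(shape=(n_in,))
    code = layers.Dense(n_code, activation="relu",
                        activity_regularizer=regularizers.l1(1e-4))(inp)
    out = layers.Dense(n_in)(code)     # reconstruct the original attributes
    autoencoder = Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="mse")

Whether such a code actually ends up less correlated is exactly what I am unsure about.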