What happens when the length of the input is shorter than the length of the output in a transformer architecture?

Given a standard transformer architecture with an encoder and a decoder:

What happens when the input for the encoder is shorter than the expected output from the decoder?

The decoder expects to receive key and value tensors from the encoder, whose size depends on the number of input tokens.

During training, I could solve this problem by padding the inputs and outputs to the same size.

But what about inference, when I don't know the size of the output in advance?

Should I make a prediction and, if the decoder doesn't output the stop token within the available length, re-encode the inputs with more padding and try again?

What are the common approaches to this problem? Thanks in advance, have a great day :)

Tags: transformer, sequence-to-sequence, nlp

Category: Data Science


The input and output lengths are totally independent. Unlike sequence-labeling problems, where we assign one label to each input state, encoder-decoder models have no such limitation.

Each decoder state (and there are as many decoder states as output tokens) queries the encoder states, and no matter how many encoder states there are, the decoder receives a single context vector (a weighted average of the values, possibly followed by some projections).
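
As an illustration, here is a minimal sketch of single-head cross-attention in PyTorch (no learned projections; the tensor sizes are made up). The context tensor has the same shape no matter how many encoder states are attended over:

```python
import torch
import torch.nn.functional as F

d_model = 16
dec_states = torch.randn(1, 5, d_model)  # queries: 5 decoder states

# Two encoder outputs of different lengths (e.g., 3 vs. 11 input tokens)
for src_len in (3, 11):
    enc_states = torch.randn(1, src_len, d_model)            # keys/values
    scores = dec_states @ enc_states.transpose(-2, -1) / d_model ** 0.5
    weights = F.softmax(scores, dim=-1)                      # (1, 5, src_len)
    context = weights @ enc_states                           # (1, 5, d_model)
    print(src_len, context.shape)  # shape is independent of src_len
```

The source length only changes how many attention weights are averaged over, not the size of the result.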

At training time, we know all the decoder states (i.e., queries) in advance, so we can query the encoder states in parallel and receive one context vector per decoder state.
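
This is teacher forcing. A hedged sketch of it with torch.nn.Transformer (positional encodings and the training loop are omitted for brevity, and all sizes are hypothetical), showing a target sequence longer than the source processed in one parallel forward pass:

```python
import torch
import torch.nn as nn

vocab, d_model = 100, 32
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
embed = nn.Embedding(vocab, d_model)
to_logits = nn.Linear(d_model, vocab)

src = torch.randint(0, vocab, (1, 6))  # 6 input tokens
tgt = torch.randint(0, vocab, (1, 9))  # 9 output tokens, longer than the input

# Causal mask so each decoder position only attends to earlier target tokens
causal = model.generate_square_subsequent_mask(tgt.size(1))

# One parallel pass: one context vector (and one logit row) per target position
out = model(embed(src), embed(tgt), tgt_mask=causal)
print(to_logits(out).shape)  # torch.Size([1, 9, 100])
```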

At inference time, this needs to be done step by step: the decoder generates a token, the token goes back into the decoder input, the decoder computes a new state, the state is used to query the encoder states, the cross-attention returns one context vector (once per layer in multi-layer decoders), and the decoder generates the next token. This is iterated until the decoder generates the end-of-sentence token.
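
A minimal greedy-decoding sketch along these lines, again assuming torch.nn.Transformer (the model is untrained here, so the max_len cap is what actually terminates the loop; the BOS/EOS ids are made up):

```python
import torch
import torch.nn as nn

vocab, d_model, BOS, EOS, max_len = 100, 32, 1, 2, 50
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
embed = nn.Embedding(vocab, d_model)
to_logits = nn.Linear(d_model, vocab)

src = torch.randint(3, vocab, (1, 4))  # 4 input tokens
memory = model.encoder(embed(src))     # encode once, reuse at every step

ys = torch.tensor([[BOS]])
for _ in range(max_len):  # hard cap in case EOS is never produced
    causal = model.generate_square_subsequent_mask(ys.size(1))
    out = model.decoder(embed(ys), memory, tgt_mask=causal)
    next_tok = to_logits(out[:, -1]).argmax(-1, keepdim=True)
    ys = torch.cat([ys, next_tok], dim=1)  # feed the token back in
    if next_tok.item() == EOS:             # stop at end-of-sentence
        break
print(ys)
```

Note that the encoder runs only once; its output (memory) serves as the keys and values at every decoding step, so the number of generated tokens is never tied to the source length, and no re-encoding with extra padding is needed.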
