What is the purpose of Sequence Length parameter in RNN (specifically on PyTorch)?

I am trying to understand RNNs. I have a good sense of how they work in theory, but in PyTorch the input data has two extra dimensions: batch size (number of batches) and sequence length. The model I am working on is a simple one-to-one model: it takes in a letter and estimates the following letter. The model is provided here.

First, please correct me if I am wrong about the following: batch size is used to divide the data into batches and feed them through the model in parallel. At least this was the case in regular NNs and CNNs; this way we take advantage of the available processing power. It is not ideal in the sense that, in theory, for an RNN you would just go from one end of the data to the other in one unbroken chain.

But I could not find much information on sequence length. From what I understand, it breaks the data into chunks of the length we provide, instead of keeping it as one unbroken chain, and then unrolls the model for the length of that chunk. If it is 50, the model is unrolled over a sequence of 50 steps. Let's think about the first sequence: we initialize a random hidden state, the model does a forward pass over these 50 inputs, and then it does backpropagation. But my question is: what happens next? Why don't we just continue? What happens when it starts the new sequence? Does it initialize a random hidden state for the next sequence, or does it reuse the hidden state computed at the very last step of the previous sequence? Why do we do this, rather than keeping one big sequence? Doesn't this break the continuity of the model? I also read somewhere that it is memory related: if you fed the whole text as one sequence, the gradient computation would take up all the memory. Does that mean the gradients are reset after each sequence?
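To make the two extra dimensions concrete, here is a minimal sketch of the kind of input I am talking about (the sizes are made up for illustration, not taken from the linked model):

    import torch
    import torch.nn as nn

    # Illustrative sizes only, not the ones from the linked model.
    batch_size, seq_length, vocab_size, hidden_size = 8, 50, 65, 128

    rnn = nn.RNN(input_size=vocab_size, hidden_size=hidden_size, batch_first=True)

    # One batch: 8 sequences, each 50 time steps long, one one-hot character per step.
    x = torch.zeros(batch_size, seq_length, vocab_size)

    out, h = rnn(x)
    print(out.shape)  # torch.Size([8, 50, 128]) -> one output per time step
    print(h.shape)    # torch.Size([1, 8, 128])  -> hidden state after the last step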

Thank you very much for the answers.

  • The RNN receives as input a batch of sequences of characters. The output of the RNN is a tensor of sequences of character predictions, of exactly the same size as the input tensor.
  • The number of sequences in each batch is the batch size.
  • Every sequence in a single batch must be the same length. In this case, all sequences of all batches have the same length, defined by seq_length.
  • Each position of the sequence is normally referred to as a "time step".
  • When back-propagating through an RNN, you collect gradients through all the time steps. This is called "back-propagation through time" (BPTT).
  • You could use one single, very long sequence, but BPTT has to store the activations of every time step, so the memory required would be huge; normally you must choose a maximum sequence length.
  • To somewhat mitigate the need to cut the sequences, people normally apply something called "truncated BPTT", which is what the code you linked uses. The sequences in the batches are arranged so that each sequence in the next batch is the continuation of the text of the corresponding sequence in the previous batch, and the last hidden state of the previous batch is reused (detached from the computation graph) as the initial hidden state of the next one; a minimal sketch of this follows the list.
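Here is a minimal sketch of what that looks like in PyTorch. The model, the sizes and the dummy data are made up for illustration and are not taken from the linked code; the important parts are carrying the hidden state from one batch to the next and detaching it so that gradients stop at the chunk boundary:

    import torch
    import torch.nn as nn

    # Illustrative sizes; none of this is taken from the linked code.
    vocab_size, hidden_size, batch_size, seq_length = 65, 128, 8, 50

    # Dummy data standing in for a real character dataset: two "chunks" of text
    # as one-hot inputs x and target character indices y. In real truncated BPTT
    # each batch continues the text of the previous one.
    batches = []
    for _ in range(2):
        x = nn.functional.one_hot(
            torch.randint(0, vocab_size, (batch_size, seq_length)), vocab_size
        ).float()
        y = torch.randint(0, vocab_size, (batch_size, seq_length))
        batches.append((x, y))

    lstm = nn.LSTM(input_size=vocab_size, hidden_size=hidden_size, batch_first=True)
    head = nn.Linear(hidden_size, vocab_size)
    optimizer = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()))
    criterion = nn.CrossEntropyLoss()

    hidden = None  # PyTorch uses a zero hidden state when None is passed

    for x, y in batches:
        optimizer.zero_grad()

        # Truncated BPTT: keep the values of the previous chunk's final hidden
        # state as the starting point, but detach it so gradients do not flow
        # back past the chunk boundary.
        if hidden is not None:
            hidden = tuple(h.detach() for h in hidden)

        out, hidden = lstm(x, hidden)               # out: (batch, seq_length, hidden_size)
        logits = head(out).reshape(-1, vocab_size)  # one prediction per time step
        loss = criterion(logits, y.reshape(-1))
        loss.backward()                             # gradients cover only this chunk
        optimizer.step()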
