What are h(t-1) and c(t-1) for the first LSTM cell?

I know that in an LSTM chain you connect the h(t) of each cell to the next cell as its h(t-1), and do the same for c(t). But what about the first cell? What does it get as h(t-1) and c(t-1)?

I would also like to know: if we want to build a multi-layer LSTM, what should we give the first cell of the second layer as h and c? And do we throw away the h and c of the last cell of each layer?

And one more small question: do we feed the y (output) of each cell into the x (input) of its corresponding cell in the layer above?

Tags: stacked-lstm, lstm, deep-learning, neural-network



Assuming that the LSTM is going to be used for sequence generation (e.g. in a language model or the decoder part of an encoder-decoder NMT architecture), we have the following:

In supervised learning setups:

In a language model, the LSTM's $h_{-1}$ and $c_{-1}$ are initialized to zeros, for all the layers. If the LSTM is the decoder part of an encoder-decoder NMT architecture, they are initialized with the hidden and cell states of the encoder LSTM, either from its last position or from a combination of all positions computed with an attention mechanism.
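
Here is a minimal PyTorch sketch of the two initialization options above (the sizes and variable names are arbitrary illustrative choices, not part of any specific model):

```python
import torch
import torch.nn as nn

batch_size, seq_len = 4, 7
input_size, hidden_size, num_layers = 16, 32, 2

decoder = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)

# Language-model case: h_{-1} and c_{-1} are zeros for every layer.
h0 = torch.zeros(num_layers, batch_size, hidden_size)
c0 = torch.zeros(num_layers, batch_size, hidden_size)
out_lm, (h_n, c_n) = decoder(x, (h0, c0))   # same as decoder(x), which defaults to zeros

# Encoder-decoder case: reuse the encoder's final hidden and cell states instead.
encoder = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
src = torch.randn(batch_size, 10, input_size)
_, (h_enc, c_enc) = encoder(src)            # states from the last source position
out_nmt, _ = decoder(x, (h_enc, c_enc))     # decoder starts from the encoder's states
```

Regarding your stacking questions: when `num_layers > 1`, `nn.LSTM` feeds each layer's hidden output at step $t$ as the input $x_t$ of the layer above, only the top layer's outputs are returned in `out_lm`, and `(h_n, c_n)` contains the final hidden and cell states of every layer, so nothing is thrown away unless you choose to ignore it.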

During training, the inputs $x$ are not the LSTM's own predictions for the previous tokens, because this causes convergence problems; instead, you use teacher forcing, which consists of feeding the gold tokens (the tokens from the training data) as inputs.
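
As a rough illustration, this is what teacher forcing looks like with a toy PyTorch decoder; `embedding`, `decoder` and `projection` are hypothetical modules, and the point is simply that the gold tokens, shifted by one position, are used as inputs:

```python
import torch
import torch.nn as nn

vocab_size, emb_size, hidden_size = 100, 16, 32
embedding = nn.Embedding(vocab_size, emb_size)
decoder = nn.LSTM(emb_size, hidden_size, batch_first=True)
projection = nn.Linear(hidden_size, vocab_size)
criterion = nn.CrossEntropyLoss()

gold = torch.randint(0, vocab_size, (4, 8))   # gold target tokens from the training data
inputs, targets = gold[:, :-1], gold[:, 1:]   # gold token t is the input used to predict token t+1

out, _ = decoder(embedding(inputs))           # teacher forcing: gold tokens as inputs
logits = projection(out)                      # (batch, seq - 1, vocab)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```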

During inference, you have to use the network's own predictions as inputs for the next tokens.
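
And a corresponding sketch of greedy decoding at inference time, where the model's own argmax prediction at each step becomes the next input (again with made-up module names and sizes; the start-of-sequence id is an assumption):

```python
import torch
import torch.nn as nn

vocab_size, emb_size, hidden_size = 100, 16, 32
embedding = nn.Embedding(vocab_size, emb_size)
decoder = nn.LSTM(emb_size, hidden_size, batch_first=True)
projection = nn.Linear(hidden_size, vocab_size)

bos_id, max_len, batch_size = 1, 10, 4
token = torch.full((batch_size, 1), bos_id, dtype=torch.long)
state = None                                       # (h, c) default to zeros
generated = []

for _ in range(max_len):
    out, state = decoder(embedding(token), state)  # one step, carrying (h, c) forward
    token = projection(out[:, -1]).argmax(dim=-1, keepdim=True)  # the model's own prediction
    generated.append(token)

sequence = torch.cat(generated, dim=1)             # (batch_size, max_len)
```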

The mismatch between the training and inference regimes is said to cause exposure bias: during training the network only ever sees "good" previous tokens, so it never learns to recover from its own bad predictions. However, according to recent research, exposure bias may not actually be a serious problem.

In more exotic learning setups:

In textual GANs, $h_{-1}$ and $c_{-1}$ are sometimes initialized with a latent vector $z$ that follows some prior distribution.
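
One possible way to do this (an assumed sketch for illustration, not the recipe of any particular paper) is to project $z$ into the initial hidden and cell states:

```python
import torch
import torch.nn as nn

batch_size, z_size, emb_size, hidden_size = 4, 64, 16, 32
z = torch.randn(batch_size, z_size)                # z sampled from the prior, e.g. a standard normal

to_h = nn.Linear(z_size, hidden_size)              # learned projections from z to the initial states
to_c = nn.Linear(z_size, hidden_size)
h0 = torch.tanh(to_h(z)).unsqueeze(0)              # (1, batch, hidden) for a single-layer LSTM
c0 = torch.tanh(to_c(z)).unsqueeze(0)

generator = nn.LSTM(emb_size, hidden_size, batch_first=True)
first_input = torch.randn(batch_size, 1, emb_size) # e.g. a start-token embedding
out, state = generator(first_input, (h0, c0))
```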

Also, in textual GANs and other learning setups where there are no gold tokens, teacher forcing is not an option, so you need to use the LSTM predictions as previous token inputs.
