Assuming that the LSTM is going to be used for sequence generation (e.g. in a language model or the decoder part of an encoder-decoder NMT architecture), we have the following:
In supervised learning setups:
In a language model, the LSTM's $h_{-1}$ and $c_{-1}$ are initialized to zeros for all layers. If the LSTM is the decoder of an encoder-decoder NMT architecture, they are initialized with the hidden and cell states of the encoder LSTM, taken either from the last position or from a combination of all positions computed by some attention mechanism.
During training, the inputs $x$ are not the LSTM's own predictions for the previous tokens, because that leads to convergence problems; instead, you use teacher forcing, which consists of feeding the gold tokens (the tokens from the training data) as inputs.
During inference, you have to use the network's own predictions as the inputs for the next tokens; both regimes are illustrated in the sketch below.
This mismatch between the training and inference regimes is referred to as exposure bias: during training the network only ever sees "good" previous tokens, so it never learns to recover from its own bad predictions. However, according to recent research, exposure bias may not actually be a serious problem.
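To make the above concrete, here is a minimal PyTorch sketch (my own illustration, not code from any particular system): zero-initialized states stand in for the language-model case, externally provided states stand in for the NMT-decoder case, `train_step` uses teacher forcing, and `generate` feeds the model its own predictions. All names, dimensions, special-token ids and the greedy decoding strategy are assumptions made for the example.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, num_layers = 1000, 64, 128, 2
bos_id, eos_id = 1, 2  # assumed special-token ids

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
proj = nn.Linear(hidden_dim, vocab_size)

def initial_states(batch_size, provided=None):
    """Zeros (language model) unless states are provided, e.g. by an encoder."""
    if provided is not None:
        return provided
    h0 = torch.zeros(num_layers, batch_size, hidden_dim)
    c0 = torch.zeros(num_layers, batch_size, hidden_dim)
    return h0, c0

def train_step(gold_tokens, provided_states=None):
    """Teacher forcing: the gold tokens themselves are the inputs at every step."""
    h, c = initial_states(gold_tokens.size(0), provided_states)
    inputs = gold_tokens[:, :-1]   # <bos> t1 ... t_n
    targets = gold_tokens[:, 1:]   # t1 ... t_n <eos>
    outputs, _ = lstm(embedding(inputs), (h, c))
    logits = proj(outputs)
    return nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )

@torch.no_grad()
def generate(max_len=20, provided_states=None):
    """Inference: each step receives the model's own previous prediction."""
    h, c = initial_states(1, provided_states)
    token = torch.tensor([[bos_id]])
    generated = []
    for _ in range(max_len):
        output, (h, c) = lstm(embedding(token), (h, c))
        token = proj(output).argmax(dim=-1)  # greedy pick of the next token
        if token.item() == eos_id:
            break
        generated.append(token.item())
    return generated
```

Here `train_step` assumes each training sequence starts with `<bos>` and ends with `<eos>`; for the NMT case, `provided_states` would be the `(h, c)` pair coming from the encoder LSTM.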
In more exotic learning setups:
In textual GANs, $h_{-1}$ and $c_{-1}$ are sometimes initialized with a latent vector $z$ that follows some prior distribution.
Also, in textual GANs and other learning setups where there are no gold tokens, teacher forcing is not an option, so you need to feed the LSTM's own predictions as the previous-token inputs, as in the sketch below.
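Continuing the sketch above (again an illustrative assumption, not any particular GAN architecture), the latent-vector initialization can be implemented by projecting $z$, sampled from a standard normal prior, into the generator LSTM's initial hidden and cell states, and then running the free-running `generate` loop:

```python
# Reuses nn, torch, num_layers, hidden_dim and generate() from the sketch above.
latent_dim = 32  # assumed size of the latent vector z
z_to_h = nn.Linear(latent_dim, num_layers * hidden_dim)
z_to_c = nn.Linear(latent_dim, num_layers * hidden_dim)

def states_from_latent(z):
    """Project z ~ N(0, I) into (h_{-1}, c_{-1}) for the generator LSTM."""
    batch_size = z.size(0)
    h = z_to_h(z).view(batch_size, num_layers, hidden_dim).transpose(0, 1).contiguous()
    c = z_to_c(z).view(batch_size, num_layers, hidden_dim).transpose(0, 1).contiguous()
    return h, c

z = torch.randn(1, latent_dim)                            # sample from the prior
sample = generate(provided_states=states_from_latent(z))  # no gold tokens involved
```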