Simple explanation with images
We know that a non-linear activation is required between matrix multiplications so that a neural network can model non-linear processes.
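Without one, two stacked linear layers collapse into a single linear map:

$$W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2)$$

so the composition would still only be able to represent linear functions.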
A classical LSTM cell already contains quite a few non-linearities: three sigmoid functions and one hyperbolic tangent (tanh) function, here shown in a sequential chain of repeating (unrolled) recurrent LSTM cells:
Images borrowed from "colah's blog"
So far this is just a single LSTM layer, and here we see that the cell output is already the product of two activations: a sigmoid (the output gate) and a hyperbolic tangent (of the cell state). In that case, you could agree there is no need to add another activation layer after the LSTM cell.
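For reference, writing the cell out as equations (in colah's notation) makes these non-linearities explicit; note that the output $h_{t}$ is the element-wise product of a sigmoid and a tanh:

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$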
You are asking about stacked layers, i.e. whether to put an activation between the hidden output of one layer and the input of the layer stacked on top of it. Looking at the central cell in the image above, this would mean a layer between the purple output ($h_{t}$) and the stacked layer's blue input ($X_{t}$). You will notice, then, that this output, like the sequential output, has already been activated: it is the exact same value carried by the black left-to-right arrow to the next cell (the hidden state $h_{t}$). What is more, the first thing the input does in the stacked layer is pass through the sigmoids and hyperbolic tangents of the forget/input/output gates.
So there are plenty of non-linearities being used, meaning it is unnecessary to add yet another one between the stacked LSTM layers. You might like to think of it as applying two ReLU layers after a fully-connected layer: the results might differ slightly compared to using just one, but not by much, as in your experiments with stacked LSTMs.
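As a concrete illustration (a minimal sketch with made-up dimensions, not a recommendation), here is a two-layer stacked LSTM in Keras; the extra activation between the two LSTM layers is shown commented out, since the first layer's output $h_{t}$ is already non-linear:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical dimensions, purely for illustration.
timesteps, features, units = 20, 8, 64

model = models.Sequential([
    layers.Input(shape=(timesteps, features)),
    # First LSTM layer: return_sequences=True feeds the full sequence of
    # hidden states h_t to the next (stacked) LSTM layer.
    layers.LSTM(units, return_sequences=True),
    # layers.Activation("relu"),  # redundant: h_t = o_t * tanh(C_t) is already activated
    layers.LSTM(units),
    layers.Dense(1),
])
model.summary()
```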
Documentation
If you look at the TensorFlow/Keras documentation for the LSTM layer (or any recurrent cell), you will notice that they speak of two activations: an (output) activation and a recurrent activation. This is where you can decide which activations to use, and the output of the entire cell is then already activated, so to speak. PyTorch does not seem to let you change these default activations at all.
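For example, the two activations can be set directly on the Keras layer (the values shown here are the documented defaults):

```python
import tensorflow as tf

# The Keras LSTM exposes both activations as constructor arguments.
lstm = tf.keras.layers.LSTM(
    64,
    activation="tanh",               # applied to the candidate cell state and the output
    recurrent_activation="sigmoid",  # applied to the forget/input/output gates
    return_sequences=True,
)

# torch.nn.LSTM, by contrast, has no activation arguments: the tanh/sigmoid
# non-linearities are fixed inside the cell.
```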
Real world stacked models
Common applications of recurrent networks are found in NLP, for example the ELMo model. If you look through the network design code, you will see only basic LSTM cells being used, with no additional activation layers; activations are only mentioned for the fully-connected layers (namely a ReLU) at the final output.
The first usage of stacked LSTMs (that I know of) was applied to speech recognition (Graves et al.), and the authors also do not mention any need for activation layers between the LSTM cells; only at the final output, in conjunction with a fully-connected layer.
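A minimal PyTorch sketch of that pattern (my own illustration, not the authors' code, with made-up names and sizes): the LSTM layers are stacked directly via `num_layers`, and the only explicit activation appears in the fully-connected head at the output:

```python
import torch
import torch.nn as nn

class StackedLSTMTagger(nn.Module):
    """Illustrative model: stacked LSTMs with no activations in between,
    and a ReLU only in the fully-connected output head."""

    def __init__(self, input_size=40, hidden_size=128, num_layers=2, num_classes=10):
        super().__init__()
        # num_layers=2 stacks two LSTM layers back to back; PyTorch inserts
        # no extra activation between them (and offers no argument to do so).
        self.lstm = nn.LSTM(input_size, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),                       # the only explicit activation layer
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, x):                    # x: (batch, time, input_size)
        out, _ = self.lstm(x)                # out: (batch, time, hidden_size)
        return self.head(out)                # per-timestep class scores

# Example usage with random data.
model = StackedLSTMTagger()
scores = model(torch.randn(4, 100, 40))     # -> shape (4, 100, 10)
```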