Activation function between LSTM layers

I'm aware that an LSTM cell uses both sigmoid and tanh activation functions internally. However, when building a stacked LSTM architecture, does it make sense to pass the layers' outputs through an additional activation function (e.g. ReLU)?

So do we prefer this:

from tensorflow.keras.layers import Input, LSTM

inputs = Input(shape=(timesteps, n_features))
x = LSTM(100, activation="relu", return_sequences=True)(inputs)
x = LSTM(50, activation="relu", return_sequences=True)(x)
...

over this?

inputs = Input(shape=(timesteps, n_features))
x = LSTM(100, return_sequences=True)(inputs)
x = LSTM(50, return_sequences=True)(x)
...

In my experiments building an LSTM autoencoder, I've found the two to behave quite similarly.

Topic stacked-lstm lstm keras deep-learning machine-learning

Category Data Science


Regarding this version:

inputs = Input(shape=(timesteps, n_features))
x = LSTM(100, return_sequences=True)(inputs)
x = LSTM(50, return_sequences=True)(x)
...

The documentation says that this LSTM implementation by default uses

activation="tanh",
recurrent_activation="sigmoid",

so if you want a different activation function, you have to pass it explicitly.
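As a quick sanity check, here is a minimal sketch (assuming TensorFlow 2.x and tf.keras; the variable names are only illustrative) that reads those defaults back from the layer configuration:

import tensorflow as tf

# An LSTM layer created without any activation arguments: the defaults apply.
layer = tf.keras.layers.LSTM(50, return_sequences=True)

config = layer.get_config()
print(config["activation"])            # tanh
print(config["recurrent_activation"])  # sigmoid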


Simple explanation with images

We know that an activation function is required between matrix multiplications to give a neural network the ability to model non-linear processes.

A classical LSTM cell already contains quite a few non-linearities: three sigmoid functions and one hyperbolic tangent (tanh) function, here shown in a sequential chain of repeating (unrolled) recurrent LSTM cells:

[Figures: a sequential chain of unrolled LSTM cells and a symbol legend, borrowed from "colah's blog"]

So far this is just a single LSTM layer, and here we see that the cell output is already the product of two activations (a sigmoid and a hyperbolic tangent). In this case you would probably agree that there is no need to add another activation layer after the LSTM cell.
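For reference, the standard LSTM cell equations (in the usual notation, with $\sigma$ the sigmoid and $\odot$ element-wise multiplication) make these non-linearities explicit:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The last line is the point: the output $h_t$ is already the product of a sigmoid (the output gate $o_t$) and a tanh.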

You are asking about stacked layers, i.e. whether to put an activation between the hidden output of one layer and the input of the next (stacked) layer. Looking at the central cell in the image above, this would mean inserting a layer between the purple output ($h_{t}$) and the stacked layer's blue input ($X_{t}$). You will notice that this output, like the sequential output, has already been activated: it is exactly the same tensor that travels along the black left-to-right arrow (known as $h_{t}$). What is more, the first thing the input does in the stacked layer is pass through the sigmoids and hyperbolic tangents of the forget/input/output gates.
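Written out with a layer superscript (my own notation, not the blog's), the tensor the second layer receives is

$$
x_t^{(2)} = h_t^{(1)} = o_t^{(1)} \odot \tanh\left(c_t^{(1)}\right),
$$

which has already passed through a sigmoid and a tanh before it even reaches the second layer's gates.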

So there are already plenty of non-linearities being used, and it is unnecessary to add yet another one between the stacked LSTM layers. You might think of it as simply applying two ReLU layers after a fully-connected layer: the results might differ slightly from using just one, but not by much; as in your experiments with stacked LSTMs.
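If you want to reproduce that comparison yourself, here is a minimal sketch (assuming tf.keras; timesteps and n_features are placeholder values) of a stacked LSTM with and without an explicit ReLU between the layers:

import tensorflow as tf
from tensorflow.keras import layers

timesteps, n_features = 30, 8  # placeholder shapes, just for illustration

# Plain stacked LSTMs: only the cells' internal sigmoid/tanh non-linearities.
plain = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, n_features)),
    layers.LSTM(100, return_sequences=True),
    layers.LSTM(50, return_sequences=True),
])

# The same stack with an extra ReLU applied to each h_t before the next layer.
with_relu = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, n_features)),
    layers.LSTM(100, return_sequences=True),
    layers.Activation("relu"),
    layers.LSTM(50, return_sequences=True),
])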

Documentation

If you look at the TensorFlow/Keras documentation for the LSTM layer (or any recurrent cell), you will notice that it specifies two activations: an (output) activation and a recurrent activation. This is where you can decide which activations to use, and the output of the entire cell is then, so to speak, already activated. PyTorch, by contrast, does not appear to let you change the default activations.
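For completeness, a minimal tf.keras sketch of overriding both of them (torch.nn.LSTM, by contrast, exposes no comparable arguments):

import tensorflow as tf

# Overriding both activations of a Keras LSTM layer.
custom_lstm = tf.keras.layers.LSTM(
    64,
    activation="relu",                    # used for the candidate state and the cell output
    recurrent_activation="hard_sigmoid",  # used for the forget/input/output gates
    return_sequences=True,
)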

Real world stacked models

Common applications of recurrent networks are found in NLP, for example the ELMo model. If you look through the network design code, you will see only basic LSTM cells being used, without additional activation layers. Activations (namely a ReLU) are only mentioned for the fully-connected layers at the final output.

The first usage of stacked LSTMs (that I know of) was in speech recognition (Graves et al.), and the authors likewise do not mention needing activation layers between the LSTM cells; they use one only at the final output, in conjunction with a fully-connected layer.
