Stacking LSTM layers

Can someone please tell me the difference between these two versions of stacked LSTM layers?

The first image is given in this question and the second image is given in this article. So far, what I have learned about stacking LSTM layers is based on the second image: when you build layers of LSTMs where the output of one layer (which is $h^{1}_{l}$ for $l = \dots, t-1, t, t+1, \dots$) becomes the input to the next, it is called stacking. In stacked LSTMs, each LSTM layer outputs a sequence of vectors which is used as the input to the subsequent LSTM layer. However, in the first image, the input variables are also fed into the second layer. Can someone tell me whether there is something wrong with stacking LSTM layers as shown in the first image?

Topic: stacked-lstm, rnn, self-study, deep-learning, neural-network

Category: Data Science


You are correct that "stacking LSTMs" means to put layers on top of one another, as in your second image.
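To make that concrete, here is a minimal sketch of two stacked LSTM layers in Keras; the input shape and layer widths are made-up values for illustration. The key detail is `return_sequences=True` on the lower layer, so it hands the full sequence of hidden states to the layer above it:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed toy shapes: sequences of 10 timesteps with 8 features each.
model = keras.Sequential([
    keras.Input(shape=(10, 8)),
    # Lower layer: return_sequences=True emits h_t for every timestep,
    # so the layer above receives a full sequence, not just the last state.
    layers.LSTM(32, return_sequences=True),
    # Upper (stacked) layer consumes that sequence of hidden-state vectors.
    layers.LSTM(16),
    layers.Dense(1),
])
model.summary()
```

Leaving out `return_sequences=True` would pass only the final hidden state upward and raise a shape error, since the second LSTM expects a 3D sequence as input.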

The first picture shows a "bi-directional LSTM" (BiLSTM), which lets us analyse a point in a series (e.g. a word in a sentence) from both sides, because we care about the context around that point.

The most common example I know of is in NLP. Here we want a representation of a word that reflects how it sits between other words. If we have the entire sentence, we can look at the words before and the words after our target word. In this case, we can use a bi-directional LSTM, which additionally processes the sequence in the opposite direction; that is what your first diagram shows.

Let's play a game, and say you need to guess the missing word in this text snippet:

i need to review an __________ ...

What could it be? "article", "iPad", "aerial image" ?

Here is the solution:

i need to review an article, ...

It was incredibly hard to get that right - perhaps impossible! Well, maybe not if you have some context with it. How about I give you both sides of that snippet:

i need to review an ________, for tomorrow's newspaper.

A BiLSTM would be fed the sentence from both sides, thus letting it see some more context to understand each word.
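Here is a minimal sketch of how that could be wired up in Keras; the vocabulary size, sequence length, and layer widths are assumed values. The `Bidirectional` wrapper runs one LSTM left-to-right and a second copy right-to-left, then concatenates their hidden states at each position:

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 5000  # assumed vocabulary size
seq_len = 20       # assumed (padded) sentence length

model = keras.Sequential([
    keras.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 64),
    # One LSTM reads left-to-right, its twin reads right-to-left;
    # their hidden states are concatenated at every position, so each
    # word's representation sees both its left and right context.
    layers.Bidirectional(layers.LSTM(32, return_sequences=True)),
    # A word prediction at every position, fill-in-the-blank style.
    layers.Dense(vocab_size, activation="softmax"),
])
model.summary()
```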

Have a look at this article, which eventually gets to bi-directional networks. Here is a similar question to yours with a few nice answers.


In time-series data, such as readings from IoT devices or the stock market, using such a bi-directional model wouldn't make sense, as we would be violating the temporal flow of information: we cannot use information from the future to help predict the present. That isn't a problem in text analysis, voice recordings, or network analysis of sub-network traffic flow.
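For contrast, a forecasting model over that kind of data would keep the LSTM unidirectional. A minimal sketch, assuming the last 30 readings are used to predict the next one (window length and layer width are made-up values):

```python
from tensorflow import keras
from tensorflow.keras import layers

window = 30  # assumed: predict the next reading from the last 30

model = keras.Sequential([
    keras.Input(shape=(window, 1)),
    # Plain unidirectional LSTM: information only flows past -> present,
    # so the forecast never peeks at future values.
    layers.LSTM(32),
    layers.Dense(1),  # next-step forecast
])
model.compile(optimizer="adam", loss="mse")
```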
