RCNN to predict sequence of images (video frames)?

In the following work, the authors apply a convolutional recurrent neural network (RCNN) to predict the spatiotemporal evolution of microstructure, represented as sequences of 2D images. In particular, they use a 3D-CNN + LSTM architecture to predict crystal growth:

[figure: predicted frame sequences versus ground truth]

In the picture above, we can see the RNN predictions (P) versus the ground truth (G) for a test case in which the RNN outputs 50 frames based on 10 input frames.

Now, this is something new to me: how is it possible for an RCNN to generate images as output? To my (limited) knowledge, the only architectures able to generate new images as output are generative adversarial networks (GANs) and convolutional encoder-decoder networks (such as VAEs), but apparently the authors achieve these results solely by stacking together 3D convolutions and RNN units.

Have you ever seen this kind of architecture?

Tags: recurrent-neural-network, generative-models, gan, cnn

Category: Data Science


The authors provide this image in their supplemental information:

[figure: schematic of the encoder-RNN-decoder architecture]

There, you can see their explanation. The convolutional layers encode each image into a latent space representation. The RNN operates in this latent space, generating a new latent representation based on the previous observations. The decoder can then convert any latent representation back into an image.
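To make that concrete, here is a minimal PyTorch sketch of the idea, not the authors' exact model: it uses per-frame 2D convolutions and a plain LSTM instead of the paper's 3D convolutions, and the class name, 64x64 grayscale frames, and latent size of 128 are all illustrative assumptions. An encoder maps each frame to a latent vector, an LSTM steps forward in latent space, and a transposed-convolution decoder maps predicted latents back to images:

```python
# Minimal encoder-RNN-decoder sketch for frame prediction (illustrative only;
# layer sizes, frame size, and the plain LSTM are assumptions, not the paper's).
import torch
import torch.nn as nn

class Seq2SeqFramePredictor(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder: 64x64 grayscale frame -> latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        # RNN: evolves the latent state over time; trained (e.g., with teacher
        # forcing) so its output at step t approximates the latent at step t+1
        self.rnn = nn.LSTM(latent_dim, latent_dim, batch_first=True)
        # Decoder: latent vector -> 64x64 frame
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),   # 32 -> 64
            nn.Sigmoid(),
        )

    def forward(self, frames, n_future):
        # frames: (batch, time, 1, 64, 64)
        b, t = frames.shape[:2]
        # Encode every observed frame into the latent space
        z = self.encoder(frames.reshape(b * t, 1, 64, 64)).reshape(b, t, -1)
        # Warm up the LSTM on the observed latent sequence
        out, state = self.rnn(z)
        z_t = out[:, -1:]                      # predicted latent for next step
        preds = []
        for _ in range(n_future):              # autoregressive rollout
            preds.append(self.decoder(z_t.squeeze(1)))
            z_t, state = self.rnn(z_t, state)  # feed the prediction back in
        return torch.stack(preds, dim=1)       # (batch, n_future, 1, 64, 64)
```

At inference time no new frames are observed, so the model feeds its own latent predictions back into the RNN, and only the decoder ever touches image space.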

Thus the RCNN essentially uses the same procedure as the model types you mentioned (GANs, convolutional encoder-decoders): there is a decoder that maps representations from the latent space back to the image space.
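Applied to the question's setup (10 observed frames in, 50 predicted frames out), the sketch above would be used roughly like this (random tensors here, just to show the shapes):

```python
# Hypothetical usage mirroring the question's 10-in / 50-out setup
model = Seq2SeqFramePredictor()
observed = torch.rand(4, 10, 1, 64, 64)  # batch of 4 sequences, 10 frames each
future = model(observed, n_future=50)
print(future.shape)                       # torch.Size([4, 50, 1, 64, 64])
```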
