Feature extraction from sequence of images with Siamese Neural Network

I am trying to train a neural network to recognize certain actions in short movies.

Each movie consists of a fixed number of frames, and after preliminary preprocessing every frame (image) has, of course, the same size.

And now I'd like to do feature extraction on each of these images using a Siamese Neural Network (SNN). I have found articles suggesting that an SNN might be great for this, but they lack implementation details.

These articles show that the network takes a pair of vectors (in my case these will be images, but it does not matter whether they are 1D or 2D), and the SNN model returns how similar the two inputs are to each other.

To quote the article:

We employ the contrastive loss function [29] while training our models. We choose this formulation over a standard classification loss function like cross entropy since our objective is to differentiate between two audio frames. Let X = (X1, X2) be the pair of inputs X1 and X2, W be the set of parameters to be learnt and Y be the target binary label (Y = 0 if they match and 1 if otherwise).

The article itself here: https://www.eecs.qmul.ac.uk/~simond/pub/2020/AgrawalDixon-EUSIPCO2020.pdf
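For reference, the contrastive loss the quote refers to (Hadsell et al., reference [29] of the paper) is usually written as follows, where $D_W$ is the learned distance between the embeddings of $X_1$ and $X_2$, and $m$ is a margin:

```latex
L(W, Y, X_1, X_2) = (1 - Y)\,\frac{1}{2}\,D_W^2 + Y\,\frac{1}{2}\,\max(0,\; m - D_W)^2
```

With the paper's convention ($Y = 0$ for a matching pair), the first term pulls matching embeddings together, and the second pushes non-matching embeddings at least $m$ apart.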

Well, in general, I do not really understand how to feed this SNN with data. I have 100 movies, each with 35 frames, and each movie is assigned y labels ranging from 0 to 10 (10 frames per action). Since the SNN itself operates on a single frame, how do I train it? Should I randomly choose a pair of frames, regardless of which actions they come from, and push their embeddings close together if they are from the same action and far apart if they are from different ones?

Will such a model be useful at all? My first thought was not to compare single frames in the SNN, but whole sequences, and on that basis compute the similarity of one entire sequence to another.

Which approach do you think is better, and what exactly should be done in the approach that compares single frames to each other?

Thank you in advance

EDIT: To be precise, I have 2 questions. First, which solution could work better: an SNN operating on the entire sequence of 35 frames, or an SNN operating on a single frame?

Second, for the single-frame approach, how exactly is that model supposed to work? Should the two frames be chosen randomly? For example, frame 5 of 35 from action X and frame 25 of 35 from action Y; will that even work? Or should I select the i-th frame but from random actions, e.g. frame 5 from action X and frame 5 from action Y?

Topic siamese-networks computer-vision deep-learning neural-network machine-learning

Category Data Science


Which is better: single-frame similarity check or sequence check?

Sequence check. In the single-frame similarity check (the SNN), you need to feed the network pairs of images, so you have to generate a large number of image pairs during batch generation.

The idea of batch generation is to make usable batches for training the network. We need to create parallel inputs for the A and B images, where the output is the distance between them. Here we make the naive assumption that if two images are in the same group the pair is a match, and otherwise it is not. If we selected all of the pairs completely at random, we would likely end up with most pairs in different groups.

So, to answer your second question: it does not matter much whether you choose the frames randomly or not, as long as you balance matching and non-matching pairs.
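A minimal sketch of this balanced pair sampling in pure Python (the names `make_pairs`, `frames`, and `labels` are illustrative, not from the question):

```python
import random

def make_pairs(frames, labels, n_pairs, seed=0):
    """Sample a balanced set of (frame_a, frame_b, target) pairs.

    frames : list of frame arrays (any objects work for the sketch)
    labels : action label per frame
    target : 0 if both frames share a label (match), 1 otherwise,
             following the Y convention quoted in the question.
    """
    rng = random.Random(seed)
    # index frames by their action label
    by_label = {}
    for idx, lab in enumerate(labels):
        by_label.setdefault(lab, []).append(idx)
    pairs = []
    for k in range(n_pairs):
        i = rng.randrange(len(frames))
        if k % 2 == 0:  # positive pair: same action
            j = rng.choice(by_label[labels[i]])
            target = 0
        else:           # negative pair: a different action
            other = rng.choice([l for l in by_label if l != labels[i]])
            j = rng.choice(by_label[other])
            target = 1
        pairs.append((frames[i], frames[j], target))
    return pairs

# toy example: 6 "frames" belonging to two actions
frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
labels = [0, 0, 0, 1, 1, 1]
pairs = make_pairs(frames, labels, n_pairs=4)
```

Balancing positives and negatives explicitly, as above, avoids the problem that purely random sampling yields mostly non-matching pairs.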

For the sequence check, I recommend this paper: Deepfake Video Detection Using Recurrent Neural Networks.


The paper proposes a two-stage analysis: a CNN extracts features at the frame level, followed by a temporally-aware RNN that captures temporal inconsistencies between frames. In this paper they actually detect deepfakes in videos: they used 600 videos to evaluate the proposed method and achieved more than 94% accuracy.
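A minimal sketch of that two-stage idea in PyTorch, applied to your setting of 35-frame clips with 11 possible labels (the layer sizes and class names are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    """Stage 1: per-frame feature extractor."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global average pool -> (B, 32, 1, 1)
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):              # x: (B, 3, H, W)
        return self.fc(self.conv(x).flatten(1))

class SequenceClassifier(nn.Module):
    """Stage 2: CNN features per frame, then an LSTM over the sequence."""
    def __init__(self, n_classes=11, feat_dim=64, hidden=32):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, clips):          # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        # run the CNN over all frames at once, then restore the time axis
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.rnn(feats)
        return self.head(out[:, -1])   # classify from the last time step

model = SequenceClassifier()
logits = model(torch.randn(2, 35, 3, 64, 64))  # 2 clips of 35 frames each
```

The key design choice is that the CNN weights are shared across all frames, and only the LSTM sees the temporal order.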
