Feature extraction from a sequence of images with a Siamese Neural Network
I am trying to train a neural network to recognize certain actions in short movies.
Each movie consists of a fixed number of frames, and after preprocessing every frame is an image of the same size.
I would now like to extract features from each of these images using a Siamese Neural Network (SNN). I found some articles suggesting that an SNN works well for this, but they give no implementation details.
In these articles, the SNN takes a pair of input vectors (in my case a pair of images, but whether the input is 1D or 2D should not matter) and returns a measure of how similar the two inputs are.
To quote the article:
We employ the contrastive loss function [29] while training our models. We choose this formulation over a standard classification loss function like cross entropy since our objective is to differentiate between two audio frames. Let X = (X1, X2) be the pair of inputs X1 and X2, W be the set of parameters to be learnt and Y be the target binary label (Y = 0 if they match and 1 if otherwise).
The article itself is here: https://www.eecs.qmul.ac.uk/~simond/pub/2020/AgrawalDixon-EUSIPCO2020.pdf
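For reference, here is a minimal sketch of the commonly used form of contrastive loss, using the label convention from the quote (Y = 0 for a matching pair, 1 otherwise). I am assuming the article uses this standard form; the function names, the Euclidean distance, and the margin value of 1.0 are my own choices for illustration and are not taken from the article.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb1, emb2, y, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    y = 0 means the pair matches, y = 1 means it does not (as in the quote).
    margin is an assumed hyperparameter, not a value from the article.
    """
    d = euclidean_distance(emb1, emb2)
    pull_together = 0.5 * d ** 2                    # active when y == 0
    push_apart = 0.5 * max(0.0, margin - d) ** 2    # active when y == 1
    return (1 - y) * pull_together + y * push_apart
```

A matching pair is penalized for any distance between its embeddings, while a non-matching pair is penalized only when its embeddings are closer than the margin.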
In general, I do not understand how to feed data to this SNN. I have 100 movies, each with 35 frames, and each movie is assigned an action category y in the range 0-10 (10 frames per action). Since the SNN itself operates on a single frame, how do I train it? Should I pick pairs of frames at random, regardless of which action they come from, and then train the network so that the two embeddings are close if the frames come from the same action and far apart if they come from different actions?
Would such a model be useful at all? My first idea was not to compare single frames in the SNN but whole sequences, and to compute the similarity of one entire sequence to another on that basis.
Which approach do you think is better, and what exactly should be done in the approach that compares single frames to each other?
Thank you in advance
EDIT: To be precise, I have two questions. First, which solution is likely to work better: an SNN operating on the entire sequence of 35 frames, or an SNN operating on a single frame?
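To make the two options concrete, here is a toy sketch of what each one compares. Everything in it is an assumption for illustration: the "encoder" is a single fixed linear layer rather than a trained network, the frames are tiny made-up vectors, and the sequence embedding is a plain average of the frame embeddings (mean pooling), which ignores frame order. A real sequence model would more likely use something order-aware.

```python
import math

def embed_frame(frame, weights):
    """Toy shared encoder: one linear layer applied to a flattened frame."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]

def embed_sequence(frames, weights):
    """Embed every frame with the same weights, then average over time."""
    embeddings = [embed_frame(f, weights) for f in frames]
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical data: 2 movies, 3 frames each, 4 "pixels" per frame.
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
movie_a = [[1.0, 2.0, 0.0, 0.0], [3.0, 4.0, 0.0, 0.0], [5.0, 6.0, 0.0, 0.0]]
movie_b = [[1.0, 2.0, 9.0, 9.0], [3.0, 4.0, 9.0, 9.0], [5.0, 6.0, 9.0, 9.0]]

# Option 1: compare single frames (here the first frame of A and the last of B).
frame_distance = euclidean(embed_frame(movie_a[0], weights),
                           embed_frame(movie_b[2], weights))

# Option 2: compare whole sequences through their pooled embeddings.
sequence_distance = euclidean(embed_sequence(movie_a, weights),
                              embed_sequence(movie_b, weights))
```

In this toy setup the two movies show the same progression, so their pooled embeddings coincide, while two frames taken at different time steps still look far apart. That is the concern behind the first question: frame-level pairs depend on which moment of the action each frame was taken from.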
Second, for the single-frame approach, how exactly is the model supposed to work? Should the two frames be chosen at random, for example frame 5 of 35 from action X and frame 25 of 35 from action Y? Will this even work? Or should I select the i-th frame from randomly chosen actions, e.g. frame 5 from action X and frame 5 from action Y?
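The two sampling schemes I have in mind can be sketched as follows. This is only an illustration of the pair construction, not a claim that either scheme trains well. The function name `sample_pair`, the `same_index` flag, and the example labels are hypothetical; the target follows the convention from the quoted article (0 for the same action, 1 otherwise).

```python
import random

def sample_pair(labels, frames_per_movie, same_index=False, rng=random):
    """Pick one frame from each of two different movies.

    labels[m] is the action label of movie m.
    Returns ((movie_a, frame_a), (movie_b, frame_b), y), where y = 0 if both
    movies share an action label and y = 1 otherwise.
    """
    movie_a, movie_b = rng.sample(range(len(labels)), 2)
    frame_a = rng.randrange(frames_per_movie)
    # same_index=True forces the i-th frame in both movies;
    # otherwise the second frame index is drawn independently.
    frame_b = frame_a if same_index else rng.randrange(frames_per_movie)
    y = 0 if labels[movie_a] == labels[movie_b] else 1
    return (movie_a, frame_a), (movie_b, frame_b), y

# Made-up labels for 100 movies, only to exercise the function.
example_labels = [m % 10 for m in range(100)]
random_pair = sample_pair(example_labels, 35)
aligned_pair = sample_pair(example_labels, 35, same_index=True)
```

With several action classes, most purely random pairs will come from different actions, so the sampler would probably need to draw same-action and different-action pairs in a more balanced way.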