Action Recognition for multiple objects and localization

Question

Dmitry

March 12, 2021 05:57

I want to ask about action detection in video with proposed frames. I trained a Temporal 3D ConvNet for action recognition on video, and it successfully recognizes actions.

At inference time, I collect 20 frames from the video, feed them to the model, and get a result. The problem is that events in different videos are not similar in size: some cover 90% of the frame, while others may cover only 10%. For example, two objects colliding can happen at very different scales, and I want to detect this action.

How can I give the model the exact position of the action when it can happen at different scales and with different objects? What comes to mind is to use YOLO to collect regions of interest and feed the cropped frames for each region to the 3D ConvNet. But if there are a lot of objects, this will be very slow. How can I handle that?

Are there any end-to-end solutions for action recognition that include object location proposals for the action recognition network?

I've already looked through papers and blogs to see what people suggest, but I couldn't find a solution to the localization problem, that is, how to make sure the action recognition model gets the correct frames.

Topic activity-recognition object-detection classification machine-learning

Category Data Science

thanatoz · Accepted Answer · March 23, 2019 10:07

Finding actions in videos is a tricky task. I'm not familiar with Temporal 3D ConvNets, but to tackle a problem like this I would apply a CNN to the individual frames of the video and then feed the resulting per-frame features, in time order, into an LSTM layer to capture the temporal context of the video.
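
To make the CNN-plus-LSTM idea more concrete, here is a rough PyTorch sketch. It is only an illustration under my own assumptions: the class name `FrameCNNLSTM`, the tiny two-layer CNN, the layer sizes, and the 20-frame 64x64 input are all made up for the example, and in practice you would likely swap in a pretrained backbone.

```python
import torch
import torch.nn as nn


class FrameCNNLSTM(nn.Module):
    """Per-frame CNN features followed by an LSTM over the frame timeline."""

    def __init__(self, num_classes: int, feat_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        # Small illustrative 2D CNN, applied to every frame independently.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (N, feat_dim, 1, 1)
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        # Fold time into the batch dimension so the CNN sees single frames.
        feats = self.cnn(clip.reshape(b * t, c, h, w)).flatten(1)
        feats = feats.reshape(b, t, -1)  # (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(feats)   # h_n: (num_layers, batch, hidden_dim)
        return self.head(h_n[-1])        # (batch, num_classes)


model = FrameCNNLSTM(num_classes=5)
clip = torch.randn(2, 20, 3, 64, 64)  # two clips of 20 RGB frames each
logits = model(clip)                  # one score vector per clip
```

The last hidden state of the LSTM summarizes the whole clip, so the classifier sees the ordering of frames rather than each frame in isolation.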

Since the action can cover anywhere from 10% to 90% of the frame, you can apply test-time augmentation (TTA) to the video to detect the action with higher confidence. A similar approach can be found in this video by Google.
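
As a minimal sketch of what test-time augmentation could look like here, the NumPy code below runs the same clip through the model at several zoom levels and averages the predicted probabilities. Everything named in it is hypothetical: `predict_fn` stands in for your trained model, the crop fractions and the 112-pixel input size are arbitrary, and the horizontal flip assumes the action label does not change when the frame is mirrored. It only uses center crops, so an off-center action would need a grid of crops instead.

```python
import numpy as np


def center_crop(clip: np.ndarray, fraction: float) -> np.ndarray:
    """Keep the central `fraction` of each frame; clip is (T, H, W, C)."""
    _, h, w, _ = clip.shape
    ch, cw = max(1, int(h * fraction)), max(1, int(w * fraction))
    top, left = (h - ch) // 2, (w - cw) // 2
    return clip[:, top:top + ch, left:left + cw, :]


def resize_nearest(clip: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize of every frame to (size, size)."""
    _, h, w, _ = clip.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return clip[:, rows][:, :, cols]


def predict_with_tta(predict_fn, clip, fractions=(1.0, 0.6, 0.3), size=112):
    """Average class probabilities over several zoom levels of one clip."""
    probs = []
    for fraction in fractions:
        view = resize_nearest(center_crop(clip, fraction), size)
        probs.append(predict_fn(view))
        # Mirrored view; assumes the label is unchanged by a horizontal flip.
        probs.append(predict_fn(view[:, :, ::-1]))
    return np.mean(probs, axis=0)


def dummy_predict(view: np.ndarray) -> np.ndarray:
    # Stand-in for the trained model: fixed 3-class probabilities.
    return np.array([0.2, 0.5, 0.3])


clip = np.random.rand(20, 240, 320, 3)  # 20 frames, 240x320, RGB
avg_probs = predict_with_tta(dummy_predict, clip)
```

Because every view is resized to the same input size, a small action in the tightest crop appears to the model at roughly the scale of a large action in the full frame.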
