Multimodal learning can be complex (like anything else), but it can also be fairly simple.

The general idea of multimodal modeling is to take data consumed in parallel across very different "modes" (such as audio, video, and a text description) and use it to predict something (whether a video is about cats, for instance). This kind of data can be difficult to work with because the modeling strategies for audio, video, and text are all very different.

The general approach to multimodal learning is to create one (or more) models per mode, then create a high-level model that consumes the outputs of those models to generate the final output. Something like this:

audio -> recognize cat noises -> ?is cat noise in audio ----------------v
video -> recognize cat images -> ?is cat in video -----------------------> final model
text -> recognize text with or relating to cats -> ?is cat in the text -^
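Here is a minimal sketch of that architecture in PyTorch. The per-mode models, the feature sizes, and the single "cat score" output are all placeholder assumptions, not a specific published design; each modal model would normally be something appropriate to its mode (a CNN for video frames, a spectrogram model for audio, a text encoder, etc.).

```python
# A minimal late-fusion sketch (assumes PyTorch; all sizes are placeholders).
import torch
import torch.nn as nn

class ModalModel(nn.Module):
    """Stand-in for a per-mode model (audio, video, or text) that outputs
    a single score, e.g. "how cat-like is this input"."""
    def __init__(self, in_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.net(x)

class FusionModel(nn.Module):
    """The final model: consumes the outputs of the per-mode models."""
    def __init__(self, num_modes=3):
        super().__init__()
        self.classifier = nn.Linear(num_modes, 1)

    def forward(self, modal_scores):
        return self.classifier(modal_scores)

# Hypothetical feature dimensions for each mode.
audio_model = ModalModel(in_features=128)
video_model = ModalModel(in_features=512)
text_model = ModalModel(in_features=300)
fusion = FusionModel(num_modes=3)

# One fake example per mode, batch size 1.
audio_x = torch.randn(1, 128)
video_x = torch.randn(1, 512)
text_x = torch.randn(1, 300)

# Each modal model produces a score; the fusion model combines them.
scores = torch.cat([audio_model(audio_x),
                    video_model(video_x),
                    text_model(text_x)], dim=1)   # shape (1, 3)
logit = fusion(scores)                            # "is this video about cats?"
prob = torch.sigmoid(logit)
```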

It's a lot of work, but it's not very different from simpler modeling strategies: you train each modal model, then create a dataset in which the outputs of those models are the inputs to your final model. To me, it looks like a specific use case of ensemble learning.
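A sketch of that two-stage, stacking-like process, using scikit-learn with synthetic data (the per-mode models here are plain logistic regressions over made-up features, purely for illustration):

```python
# Two-stage training sketch: per-mode models first, then a final model
# trained on their outputs (assumes scikit-learn; data is synthetic).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)  # 1 = "about cats"

# Fake per-mode feature matrices (audio, video, text), weakly tied to the label.
X_audio = rng.normal(size=(n, 20)) + y[:, None] * 0.5
X_video = rng.normal(size=(n, 40)) + y[:, None] * 0.5
X_text  = rng.normal(size=(n, 30)) + y[:, None] * 0.5

# Stage 1: train one model per mode on its own data.
modal_models = [LogisticRegression(max_iter=1000).fit(X, y)
                for X in (X_audio, X_video, X_text)]

# Stage 2: the modal models' predicted probabilities become the
# inputs of the final model.
stacked = np.column_stack(
    [m.predict_proba(X)[:, 1]
     for m, X in zip(modal_models, (X_audio, X_video, X_text))])
final_model = LogisticRegression().fit(stacked, y)
print(final_model.score(stacked, y))
```

In practice you would generate the stage-1 predictions on held-out data (or with cross-validation) before fitting the final model, so it doesn't just learn from the modal models' overfitting.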
