Multimodal learning can be complex (like anything else), but it can also be fairly simple.

The general idea of multimodal modeling is to take data consumed in parallel across very different "modes" (such as audio, video, and a text description) and use it to predict something (whether a video is about cats, for instance). This kind of data can be difficult to work with because the modeling strategies for audio, video, and text are all very different.

The general approach to multimodal learning is to create one (or more) models per mode, then create a high-level model that consumes the outputs of those models to generate the final output. Something like this:

audio -> recognize cat noises -> ?is cat noise in audio ----------------v
video -> recognize cat images -> ?is cat in video -----------------------> final model
text -> recognize text with or relating to cats -> ?is cat in the text -^
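Here is a minimal sketch of that architecture in PyTorch. The per-mode models, the feature sizes, and the single "cat score" output are all placeholder assumptions, not a specific published design; each modal model would normally be something appropriate to its mode (a CNN for video frames, a spectrogram model for audio, a text encoder, etc.).

```python
# A minimal late-fusion sketch (assumes PyTorch; all sizes are placeholders).
import torch
import torch.nn as nn

class ModalModel(nn.Module):
    """Stand-in for a per-mode model (audio, video, or text) that outputs
    a single score, e.g. "how cat-like is this input"."""
    def __init__(self, in_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.net(x)

class FusionModel(nn.Module):
    """The final model: consumes the outputs of the per-mode models."""
    def __init__(self, num_modes=3):
        super().__init__()
        self.classifier = nn.Linear(num_modes, 1)

    def forward(self, modal_scores):
        return self.classifier(modal_scores)

# Hypothetical feature dimensions for each mode.
audio_model = ModalModel(in_features=128)
video_model = ModalModel(in_features=512)
text_model = ModalModel(in_features=300)
fusion = FusionModel(num_modes=3)

# One fake example per mode, batch size 1.
audio_x = torch.randn(1, 128)
video_x = torch.randn(1, 512)
text_x = torch.randn(1, 300)

# Each modal model produces a score; the fusion model combines them.
scores = torch.cat([audio_model(audio_x),
                    video_model(video_x),
                    text_model(text_x)], dim=1)   # shape (1, 3)
logit = fusion(scores)                            # "is this video about cats?"
prob = torch.sigmoid(logit)
```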

It's a lot of work, but it's not very different from simpler modeling strategies: you train each modal model, then create a dataset in which the outputs of those models are the inputs to your final model. To me, it looks like a specific use case of ensemble learning.
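A sketch of that two-stage, stacking-like process, using scikit-learn with synthetic data (the per-mode models here are plain logistic regressions over made-up features, purely for illustration):

```python
# Two-stage training sketch: per-mode models first, then a final model
# trained on their outputs (assumes scikit-learn; data is synthetic).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)  # 1 = "about cats"

# Fake per-mode feature matrices (audio, video, text), weakly tied to the label.
X_audio = rng.normal(size=(n, 20)) + y[:, None] * 0.5
X_video = rng.normal(size=(n, 40)) + y[:, None] * 0.5
X_text  = rng.normal(size=(n, 30)) + y[:, None] * 0.5

# Stage 1: train one model per mode on its own data.
modal_models = [LogisticRegression(max_iter=1000).fit(X, y)
                for X in (X_audio, X_video, X_text)]

# Stage 2: the modal models' predicted probabilities become the
# inputs of the final model.
stacked = np.column_stack(
    [m.predict_proba(X)[:, 1]
     for m, X in zip(modal_models, (X_audio, X_video, X_text))])
final_model = LogisticRegression().fit(stacked, y)
print(final_model.score(stacked, y))
```

In practice you would generate the stage-1 predictions on held-out data (or with cross-validation) before fitting the final model, so it doesn't just learn from the modal models' overfitting.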
