How can I train my speech recognition model on unpaired speech and text?

By unpaired data, I mean I have a dataset of audio recordings and a dataset of texts, but they are not associated with each other. As far as I know, to build a speech recognition model, the input dataset must pair each utterance with its corresponding transcript; the model is then trained on those pairs so it can convert new audio into text.

In my case, data collection left me with audio and text that are not matched, because I am working with a dialect of a low-resource language. Since I do not have much data, data augmentation also seems necessary, but how should I do it? What do you suggest?
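For the augmentation part of the question, here is a minimal sketch of two common waveform-level augmentations for speech (noise injection at a target SNR, and speed perturbation via resampling), using only NumPy. The function names, parameters, and the synthetic test tone are all illustrative assumptions, not part of any specific toolkit:

```python
import numpy as np

def add_noise(wave, snr_db=20.0, rng=None):
    """Mix white Gaussian noise into a waveform at a target SNR in dB.
    (Illustrative helper; real pipelines often mix in recorded noise.)"""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def speed_perturb(wave, factor):
    """Change playback speed (and pitch) by resampling with
    linear interpolation; factor > 1 makes the clip shorter/faster."""
    n_out = int(round(len(wave) / factor))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_idx, old_idx, wave)

# Example: a 1-second 440 Hz tone at 16 kHz, augmented two ways.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

noisy = add_noise(tone, snr_db=15.0)   # same length, degraded SNR
fast = speed_perturb(tone, factor=1.1) # ~10% faster, shorter clip
```

Each augmented copy can be treated as an extra training utterance with the same transcript as the original, which is how such augmentation is typically used to stretch a small paired corpus.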

Topic: speech-to-text

Category: Data Science
