How to prepare audio-text data for speech recognition
I have gathered raw audio from conferences, meetings, lectures, and casual conversations that I was part of. Machine transcription (Azure, AWS, etc.) did not give good results, so I want to transcribe it myself to have both data and label (audio + text) for ML training.
My question is about the data format: should I split the audio at silences into small (3-10 second) files and transcribe each one, or keep large files with a timestamped transcript in subtitle (.srt) format? I have heard that long audio files are more error-prone and lead to less accurate training. Does adding .srt-style timestamps solve that, or do I still need small audio files? When I tried to train and test Azure Custom Speech, it threw errors saying it won't process large audio files (so small chunks seem to be required there). What data-labelling criteria do the other ML platforms (AWS, Watson, GCP) have? Sorry, I couldn't find documentation other than Microsoft Azure's. Ideally I would build my own speech recognition system from a clean slate (I'm open to suggestions on model selection), but first I need to know what format the data should be created in.
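For context, the silence-based splitting of option (b) is roughly what I have in mind below: a minimal sketch using pydub's `split_on_silence` helper, where the thresholds are guesses I would need to tune per recording and the file names are just examples:

```python
# Sketch: split a long recording at silences into short chunks with pydub.
# Thresholds below are guesses to tune per recording; file names are examples.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("meeting.wav")

chunks = split_on_silence(
    audio,
    min_silence_len=500,             # a pause of >= 500 ms counts as a split point
    silence_thresh=audio.dBFS - 16,  # anything 16 dB below average loudness is "silence"
    keep_silence=200,                # keep 200 ms of padding so words aren't clipped
)

# Write each chunk as its own short wav file, ready to be transcribed.
for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i:04d}.wav", format="wav")
```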
The way I see it, the audio splitting itself (say, cutting a 30-minute recording into 200 parts) can be automated along the lines of the sketch above, but how would I then separate the transcript into 200 matching lines? Inserting the line breaks manually doesn't scale, so that seems like a poor option for large datasets. This is why I need to settle on the data format before starting the work (so I can give the transcribers proper instructions). So, the question again, starting from a clean slate: (a) large audio files with a timestamped transcript, or (b) small audio files each paired with a single line of text? And how? Please guide me. I did a bit of research before daring to post this question.
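Conversely, if I go with option (a), I imagine the .srt timestamps could later drive the split automatically, so the alignment problem disappears. Something like this sketch is what I picture (assuming pydub plus the `srt` parsing package; file names are made up):

```python
# Sketch: slice a long recording into utterance-level clips using an
# existing .srt transcript, writing one short wav plus one text line per cue.
# Assumes the pydub and srt packages are installed; file names are examples.
import srt
from pydub import AudioSegment

audio = AudioSegment.from_file("lecture.wav")

with open("lecture.srt", encoding="utf-8") as f:
    cues = list(srt.parse(f.read()))

with open("transcripts.txt", "w", encoding="utf-8") as out:
    for i, cue in enumerate(cues):
        # cue.start and cue.end are timedeltas; pydub slices by milliseconds.
        start_ms = int(cue.start.total_seconds() * 1000)
        end_ms = int(cue.end.total_seconds() * 1000)
        clip = audio[start_ms:end_ms]
        clip.export(f"clip_{i:04d}.wav", format="wav")
        # One "filename <TAB> transcript" line per clip, a common ASR manifest style.
        out.write(f"clip_{i:04d}.wav\t{cue.content.strip()}\n")
```

If that picture is right, option (a) would mean instructing transcribers to produce timestamped .srt files, and the small-file datasets most platforms expect could be generated from them afterwards.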
Topic: speech-to-text, dataset, data-cleaning
Category: Data Science