Segment 5-7 min audio into sentence-wise audio clips for creating a speech recognition dataset

I am trying to create a speech recognition dataset, especially for Indian accents. I am taking help from colleagues to build this: every day I send them an article link and ask them to record themselves reading it and upload the recording to Google Drive.

I have a problem with this approach: all the audio recordings are 5-7 minutes long, while the DeepSpeech model I am using requires roughly 10-second, sentence-length audio clips.

Please suggest an approach, if possible, to segment the audio files into the corresponding sentence phrases, or a better way to work with the 5-minute audio files. Suggestions on a better way to create a speech-to-text dataset are more than welcome.

Topic: speech-to-text, audio-recognition, dataset

Category: Data Science


The typical approach is to just cut the clips into consecutive sections and run the model on each section. Sometimes a bit of overlap is used, say 10%; then you have to decide how to resolve potential conflicts in the overlapping regions. A good model is usually robust to silence; otherwise, you can trim the silence at the start and end of each 10-second window. A sketch of this windowing approach is shown below.
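A minimal sketch of that windowing approach, assuming a recording saved as `recording.wav`, 16 kHz mono audio (the rate DeepSpeech expects), and the `soundfile` package for writing the clips; the `top_db` threshold and file names are illustrative choices, not anything prescribed by DeepSpeech:

```python
import librosa
import numpy as np
import soundfile as sf  # one common choice for writing WAV files

# Load at 16 kHz, the sample rate DeepSpeech expects.
y, sr = librosa.load("recording.wav", sr=16000)

frame_length = 10 * sr  # 10-second windows
hop_length = 9 * sr     # 10% overlap between consecutive windows

# Pad the tail so the last partial window is kept.
pad = -(len(y) - frame_length) % hop_length
y = np.pad(y, (0, pad))

# Each column of `clips` is one 10-second window.
clips = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)

for i in range(clips.shape[1]):
    clip = clips[:, i]
    # Trim leading/trailing silence inside the window.
    clip, _ = librosa.effects.trim(clip, top_db=30)
    sf.write(f"clip_{i:03d}.wav", clip, sr)
```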

librosa.util.frame, as used in the sketch above, is a practical way of doing the windowing in Python.
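Since the goal is sentence-level clips, splitting on pauses rather than fixed windows may get closer to actual sentence boundaries: in read speech, sentences are usually separated by short silences. A hedged sketch using librosa.effects.split, which returns the non-silent intervals of a signal (the `top_db` value and file names are again assumptions to tune for your recordings):

```python
import librosa
import soundfile as sf

y, sr = librosa.load("recording.wav", sr=16000)

# Non-silent intervals as (start, end) sample indices;
# raise top_db to split less aggressively, lower it to split more.
intervals = librosa.effects.split(y, top_db=30)

for i, (start, end) in enumerate(intervals):
    segment = y[start:end]
    # Keep only clips that fit DeepSpeech's ~10-second limit;
    # longer ones can be re-split with a lower top_db.
    if len(segment) <= 10 * sr:
        sf.write(f"sentence_{i:03d}.wav", segment, sr)
```

Whichever way the audio is cut, each clip still needs a matching transcript for training, so the article text has to be segmented consistently with the audio.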
