How can I double the size of an audio dataset?
I am trying to develop a mispronunciation detection model for English speech. I use the TIMIT dataset, which is a phoneme-labeled audio dataset.
A phoneme is any of the perceptually distinct units of sound in a language. So each item in my dataset is an audio file plus the string of phonemes corresponding to that audio. For example:
SX141.wav - p l eh zh tcl t ax-h pcl p axr tcl t ih s pcl p ey dx ih n ax v aa dx ix z ix kcl k w aa dx ix kcl k ah m pcl p tcl t ih sh ix n
The problem is overfitting: my model performs very well on the training set but poorly on the test set. Because of this, I want to try synthetically enlarging my dataset, for example by changing the speed of the audio or adding background noise.
Are there any ready-made solutions for doubling an audio dataset? If not, how can I change the speed of an audio file and add noise to it? And would that be helpful?
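
To make the question concrete, here is a minimal sketch of the two augmentations I have in mind, using only NumPy. The function names `change_speed` and `add_noise`, the rate of 1.1, and the 20 dB SNR are my own illustrative choices, and the sine wave is only a synthetic stand-in for a real 16 kHz TIMIT recording. Note that this naive resampling also shifts the pitch, and any time-aligned phoneme boundaries would have to be divided by the same rate.

```python
import numpy as np

def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 gives faster (shorter) audio.
    This simple approach also raises or lowers the pitch."""
    n_out = int(round(len(signal) / rate))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Synthetic stand-in for one utterance: 1 second of a 440 Hz tone at 16 kHz.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = 0.5 * np.sin(2 * np.pi * 440 * t)

fast = change_speed(wave, 1.1)     # about 10% faster
noisy = add_noise(wave, 20.0, rng) # noise at roughly 20 dB SNR
```

Each augmented copy would keep the same phoneme string as the original file, so applying one such transform to every utterance would double the number of training examples.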