How to double an audio dataset?

I am trying to develop a mispronunciation detection model for English speech. I use the TIMIT dataset, which is a phoneme-labeled audio dataset.

A phoneme is any of the perceptually distinct units of sound in a language. So each item in my dataset is an audio file plus the string of phonemes corresponding to that audio. For example:

SX141.wav - p l eh zh tcl t ax-h pcl p axr tcl t ih s pcl p ey dx ih n ax v aa dx ix z ix kcl k w aa dx ix kcl k ah m pcl p tcl t ih sh ix n

The problem is overfitting: my model performs very well on the training data but poorly on the test data. Because of this, I want to synthetically enlarge my dataset, for example by changing the speed of the audio or adding some background noise.

Are there any ready-made solutions for doubling an audio dataset? If not, how can I change the speed of an audio file or add noise to it, and will it be helpful?

Topic speech-to-text machine-learning-model audio-recognition neural-network dataset

Category Data Science


I did not find a ready-made solution for this, so I solved the task myself.

  1. Increase speed.

     from scipy.io.wavfile import read, write
    
     Fs, data = read(filename)
     # writing the same samples with a 1.25x sample rate makes the
     # file play 25% faster (and raises the pitch accordingly)
     write(destination, int(Fs * 1.25), data)
    

I save the file with a sample rate 1.25 times higher, which makes it play back 25% faster.
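A caveat with this trick: rewriting the sample rate means the augmented files no longer share one rate with the originals, which matters if your pipeline expects fixed-rate input. A sketch of an alternative (the helper `change_speed` is my own, not from the answer) that resamples with plain NumPy linear interpolation so the sample rate stays unchanged:

```python
import numpy as np

def change_speed(data, factor):
    """Resample a 1-D signal by linear interpolation so it plays
    `factor` times faster at the *unchanged* sample rate.
    Note: like the sample-rate trick, this still shifts pitch."""
    old_idx = np.arange(len(data))
    # factor > 1 -> fewer output samples -> shorter, faster audio
    new_idx = np.arange(0, len(data), factor)
    resampled = np.interp(new_idx, old_idx, data.astype(np.float64))
    return resampled.astype(data.dtype)
```

Usage mirrors the snippet above, but keeps `Fs` as read from the file: `write(destination, Fs, change_speed(data, 1.25))`. A speed change that also preserves pitch would need a time-stretch algorithm (e.g. a phase vocoder) instead.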

  2. Add noise.

     import numpy as np
     from scipy.io.wavfile import read, write
    
     Fs, data = read(filename)
     # for 16-bit integer WAV data a fixed std of 0.2 is inaudible, so
     # scale the noise to the signal; cast back so write() keeps the
     # original sample format instead of saving 64-bit floats
     data_noise = np.random.normal(0, 0.2 * np.abs(data).mean(), data.shape)
     write(destination, Fs, (data + data_noise).astype(data.dtype))
    

Here I generate a noise array and add it to the original WAV signal.
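A more controllable variant ties the noise level to a target signal-to-noise ratio instead of a fixed standard deviation. This is a sketch under my own naming (`add_noise_snr` is not a library function), assuming integer WAV input; it clips before casting back so loud samples do not wrap around:

```python
import numpy as np

def add_noise_snr(data, snr_db=20.0, rng=None):
    """Add white Gaussian noise at a target SNR in dB and
    return the same integer dtype as the input signal."""
    rng = np.random.default_rng() if rng is None else rng
    x = data.astype(np.float64)
    signal_power = np.mean(x ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noisy = x + rng.normal(0.0, np.sqrt(noise_power), x.shape)
    info = np.iinfo(data.dtype)
    # clip to the integer range before casting back
    return np.clip(noisy, info.min, info.max).astype(data.dtype)
```

Lower `snr_db` means louder noise; sweeping a range of SNR values (say 10 to 30 dB) produces several augmented copies from each original file.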
