What is the suggested way to create features (mel spectrograms) from a speech signal for classification with ResNet?
At the moment I have this piece of code, which cuts a spectrogram into fixed-length tensors:
    def chunks(l, n):
        """Yield successive n-sized chunks from l along the time axis (dim 2)."""
        for i in range(0, l.size(2), n):
            # Only yield full-width chunks; a shorter tail is dropped.
            if i + n <= l.size(2):
                yield l.narrow(2, i, n)
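The boundary behaviour of this kind of chunking can be checked with plain integers before running it on tensors. The following is a minimal sketch, assuming the intent is to keep only full-width chunks; the helper names `full_chunk_starts` and `dropped_frames` are hypothetical and not part of my code.

```python
def full_chunk_starts(total_frames, chunk_size):
    """Return the start index of every chunk that fits entirely in total_frames."""
    return list(range(0, total_frames - chunk_size + 1, chunk_size))


def dropped_frames(total_frames, chunk_size):
    """Return how many trailing frames do not fill a whole chunk."""
    return total_frames % chunk_size


# A spectrogram with 10 frames and a chunk width of 4 gives two chunks
# and leaves 2 frames unused at the end.
starts = full_chunk_starts(10, 4)
unused = dropped_frames(10, 4)
```

When the number of frames is an exact multiple of the chunk width, the last chunk ends exactly at the final frame, so the comparison in the boundary check has to include equality to keep it.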
The following piece of code:
- downsamples the audio
- creates mel spectrograms and takes their log
- applies cepstral mean and variance normalization (CMVN)
- cuts each spectrogram into fixed-length chunks with the code above and appends them to an array
    import speechpy
    import torch
    import torchaudio

    for index, row in df.iterrows():
        wave_form, sample_rate = torchaudio.load(row[path], normalization=True)
        downsample_resample = torchaudio.transforms.Resample(
            sample_rate, downsample_rate, resampling_method='sinc_interpolation')
        wav = downsample_resample(wave_form)
        mel = torchaudio.transforms.MelSpectrogram(downsample_rate)(wav)
        mellog = torch.log(mel + 1e-9)  # log-mel; the offset avoids log(0)
        # Note: cmvnw treats each row as one frame, while mellog is (n_mels, time).
        X_sample = speechpy.processing.cmvnw(
            mellog.squeeze().numpy(), win_size=301, variance_normalization=True)
        X_sample = torch.tensor(X_sample).unsqueeze(0)
        # Track the global value range across all files.
        _min = min(X_sample.min().item(), _min)
        _max = max(X_sample.max().item(), _max)
        for chunked_X_sample in chunks(X_sample, max_total_context):
            if chunked_X_sample.size(2) == max_total_context:
                pass  # append chunked_X_sample to the output array (body not shown here)
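To make the normalization step concrete, here is a minimal pure-Python sketch of global CMVN, where each feature column is shifted to zero mean and scaled to unit variance. This is a simplification: `speechpy.processing.cmvnw` computes its statistics over a sliding window of `win_size` frames rather than over the whole utterance. The helper name `cmvn`, the `eps` parameter, and the toy input are assumptions for illustration only.

```python
from statistics import fmean, pstdev


def cmvn(frames, eps=1e-9):
    """Normalize each feature column to zero mean and unit variance.

    `frames` is a list of rows, one row per time frame (frames x features).
    """
    columns = list(zip(*frames))            # one tuple per feature
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(value - m) / (s + eps) for value, m, s in zip(row, means, stds)]
        for row in frames
    ]


# Toy input: two frames with two features each.
normalized = cmvn([[1.0, 10.0], [3.0, 30.0]])
```

Note the layout: one row per frame and one column per feature, so the statistics are taken over time for each feature.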
My question: is this a common way to create features for deep learning? Do you have any suggestions for optimizing this code? I'm also not sure whether it is right to split the mel spectrograms rather than splitting the audio earlier.
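To compare the two splitting options, it helps to know how many audio samples one spectrogram chunk covers. The sketch below assumes the `torchaudio.transforms.MelSpectrogram` defaults (`n_fft=400`, `hop_length=200`, `center=True`), under which a signal of `num_samples` samples yields `1 + num_samples // hop_length` frames. The helper names and the example values (100 frames, 16 kHz) are hypothetical.

```python
def num_frames(num_samples, hop_length=200):
    """Frames produced by a centered STFT: 1 + floor(num_samples / hop_length)."""
    return 1 + num_samples // hop_length


def samples_for_frames(frames, hop_length=200):
    """Smallest number of samples that yields exactly `frames` centered frames."""
    return (frames - 1) * hop_length


# Example: how much audio a 100-frame chunk spans at an assumed 16 kHz rate.
chunk_frames = 100
chunk_samples = samples_for_frames(chunk_frames)
chunk_seconds = chunk_samples / 16000
```

If the audio were split first, each piece would be padded at its own edges by the centered STFT, and a windowed normalization would only see the frames inside that piece. Splitting after the spectrogram has been computed avoids both effects.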