What is the suggested way to create features (mel-spectrograms) from a speech signal for classification with ResNet?
At the moment I have this piece of code, which cuts a spectrogram into fixed-length tensors:
def chunks(l, n):
    """Yield successive n-sized chunks from l along the time axis (dim 2)."""
    for i in range(0, len(l[0][0]), n):
        if i + n <= len(l[0][0]):
            # narrow(2, i, n) slices n frames starting at index i on the time dimension
            yield l.narrow(2, i, n)
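For illustration, calling it on a (channel, n_mels, time) tensor yields only full-width slices; a minimal sketch (the shapes here are made up):

import torch

# hypothetical input: 1 channel, 128 mel bins, 500 time frames
spec = torch.randn(1, 128, 500)

# with n=200 this prints torch.Size([1, 128, 200]) twice;
# the trailing 100 frames are dropped by the length guard above
for c in chunks(spec, 200):
    print(c.shape)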
The following piece of code
- downsamples the audio,
- creates mel-spectrograms and takes their log,
- applies cepstral mean and variance normalization (CMVN),
- then cuts the spectrogram into fixed-length chunks with the code above and appends them to an array.
import numpy as np
import torch
import torchaudio
import speechpy

for index, row in df.iterrows():
    # load and resample
    wave_form, sample_rate = torchaudio.load(row['path'], normalization=True)
    downsample_resample = torchaudio.transforms.Resample(
        sample_rate, downsample_rate, resampling_method='sinc_interpolation')
    wav = downsample_resample(wave_form)

    # mel-spectrogram and log compression
    mel = torchaudio.transforms.MelSpectrogram(downsample_rate)(wav)
    mel_log = np.log(mel.numpy() + 1e-9)

    # cepstral mean and variance normalization over a sliding window
    X_sample = speechpy.processing.cmvnw(mel_log.squeeze(), win_size=301,
                                         variance_normalization=True)
    X_sample = torch.tensor(X_sample).unsqueeze(0)

    # track global extremes for later min-max scaling
    _min = min(np.amin(X_sample.numpy()), _min)
    _max = max(np.amax(X_sample.numpy()), _max)

    # cut into fixed-length chunks and keep only the full-size ones
    for chunked_X_sample in chunks(X_sample, max_total_context):
        print(chunked_X_sample.shape[2])
        if chunked_X_sample.shape[2] == max_total_context:
            X.append(chunked_X_sample)
            y.append(row['label'])
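For context, after the loop I would scale with the tracked _min/_max and stack everything into a batch for the ResNet, roughly like this (just a sketch; the 3-channel repeat for a stock torchvision ResNet is my assumption):

import torch

# sketch: min-max scale each chunk to [0, 1] with the tracked extremes,
# then stack into a (N, 1, n_mels, max_total_context) batch
X_batch = torch.stack([(x - _min) / (_max - _min) for x in X])

# a stock torchvision ResNet expects 3 input channels, so repeat
# the single spectrogram channel (assumption on my side)
X_batch = X_batch.repeat(1, 3, 1, 1)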
My question: Is this a common way to create features for deep learning? Do you have any suggestions for optimizing this code? Furthermore, I'm not sure whether it is right to split the mel-spectrograms rather than splitting the audio earlier (roughly the sketch below).
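By splitting the audio earlier I mean cutting the raw waveform into fixed-length windows first and computing one mel-spectrogram per window, something like this (just a sketch; audio_chunks and chunk_samples are hypothetical names, wav and downsample_rate as above):

import torch
import torchaudio

def audio_chunks(wav, chunk_samples):
    """Yield successive fixed-length windows from a (1, num_samples) waveform."""
    for i in range(0, wav.shape[1] - chunk_samples + 1, chunk_samples):
        yield wav[:, i:i + chunk_samples]

# one log-mel spectrogram per fixed-length audio window
mel_transform = torchaudio.transforms.MelSpectrogram(downsample_rate)
specs = [torch.log(mel_transform(w) + 1e-9) for w in audio_chunks(wav, chunk_samples)]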
Topic audio-recognition preprocessing feature-extraction python
Category Data Science