Converting a speech recognition model from CNNs to GRUs

I am trying to convert the simple audio recognition example from TensorFlow to use GRUs instead of CNNs.

The idea is to classify an audio clip into one of 8 labels: ['go', 'down', 'up', 'stop', 'yes', 'left', 'right', 'no']

The original code builds a model as follows:

from tensorflow.keras import layers, models
from tensorflow.keras.layers.experimental import preprocessing

norm_layer = preprocessing.Normalization()
norm_layer.adapt(spectrogram_ds.map(lambda x, _: x))

model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32), 
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])

The input shape is (124, 129, 1): single-channel spectrograms with 124 time steps and 129 frequency bins. The X data is the spectrogram, and the Y data is an integer label that indexes into the labels array.
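For reference, these dimensions follow directly from the STFT framing; assuming the tutorial's parameters (1-second clips at 16 kHz, frame_length=255, frame_step=128, which I believe are the values used in that example), the shape works out as:

```python
# Spectrogram shape from STFT framing (a sketch; the parameter values are
# assumed from the TensorFlow simple audio recognition tutorial)
samples = 16000            # 1 second of audio at 16 kHz
frame_length, frame_step = 255, 128
fft_length = 256           # smallest power of two >= frame_length

time_steps = 1 + (samples - frame_length) // frame_step   # number of frames
freq_bins = fft_length // 2 + 1                           # one-sided spectrum

print(time_steps, freq_bins)  # → 124 129
```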

I have tried converting the above to use GRUs as follows:

for spectrogram, _ in spectrogram_ds.take(1):
  input_shape = spectrogram.shape #(None, spectrogram.shape[0], spectrogram.shape[1])
print('Input shape:', input_shape)
num_labels = len(commands)

model = models.Sequential([
    layers.Input(shape=input_shape),

    # Step 1: CONV layer (≈4 lines)
    layers.Conv1D(196, 15, strides=2),
    layers.BatchNormalization(),                                         
    layers.Activation('relu'),                
    layers.Dropout(0.8),                       

    # Step 2: First GRU Layer (≈4 lines)
    layers.GRU(128, return_sequences=True, reset_after=True),
    layers.Dropout(0.8),                                           
    layers.BatchNormalization(),                                   
    
    # Step 3: Second GRU Layer (≈4 lines)
    layers.GRU(128, return_sequences=True, reset_after=True), 
    layers.Dropout(0.8),                      
    layers.BatchNormalization(),              
    layers.Dropout(0.8),                     

    # Step 4: Time-distributed dense layer (see given code in instructions) (≈1 line)
    layers.TimeDistributed(layers.Dense(num_labels, activation='sigmoid'))

])

The input shape in this case is (m, 124, 129). The Conv1D layer reduces the time steps from 124 to 55.
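The 124 → 55 reduction follows the standard "valid" convolution output-length rule; a quick sanity check in plain Python, using the kernel_size=15 and strides=2 from the model above:

```python
def conv1d_output_length(steps, kernel_size, strides, padding="valid"):
    """Output length of a 1D convolution, matching Keras' shape rule."""
    if padding == "valid":
        return (steps - kernel_size) // strides + 1
    # "same" padding keeps ceil(steps / strides) steps
    return -(-steps // strides)

print(conv1d_output_length(124, kernel_size=15, strides=2))  # → 55
```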

In this case, the X data is still the spectrogram data. For Y data, I had to replicate the label index over 55 time steps. So Y has shape: (m, 55, 1).
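For illustration, this replication can be done with a single broadcasted tile; a minimal NumPy sketch (the batch size here is made up, the other shapes are from the question):

```python
import numpy as np

m, time_steps = 4, 55              # batch size m is illustrative
labels = np.array([2, 0, 7, 3])    # one integer label per clip, shape (m,)

# (m,) -> (m, 55, 1): repeat each clip's label at every Conv1D output step
y = np.tile(labels[:, None, None], (1, time_steps, 1))
print(y.shape)  # → (4, 55, 1)
```

One side effect worth knowing: with a per-time-step target like this, the reported accuracy is averaged over all 55 steps, so it is not directly comparable to the per-clip accuracy of the CNN.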

The training is done as follows:

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

EPOCHS = 10
history = model.fit(
    train_ds, 
    validation_data=val_ds,  
    epochs=EPOCHS,
    callbacks=[tf.keras.callbacks.EarlyStopping(verbose=1, patience=2)],
)

The issue is that I am getting a very low accuracy for the GRU model compared to the CNN. The GRU training is also very slow. My feeling is that I have not set up the model correctly; in particular, I am not sure the Y data is set up correctly for the GRU.

I'd appreciate any insights into setting up this model correctly. Thanks!



Looking further, the Coursera article and its proposed architecture are excellent; I would start with it as-is. The TensorFlow dataset is also valuable (I'm impressed that it is publicly available).

Summary: the GRU network should be fed sequential 1D vectors (row by row, capturing the event evolving in time: t, t+1, t+2, ...), whereas TensorFlow's classic CNN is fed whole 2D snapshots of the spectrogram. That is the difference.

I'm not familiar with TensorFlow's `models.Sequential` way of feeding data; I use PyTorch / MXNet and always feed networks with my own routines (including any small preprocessing needed). It is good exercise, and it avoids the "smart one-liner" data feeders exposed by many modern DL frameworks.


The [Coursera] article uses the GRU properly, so these changes would be necessary:

  1. Architecture: use only one GRU layer, with a Conv1D (not Conv2D) in front of it. This part is trivial.

  2. The input should be a fixed-length 1D vector (a single row extracted from the 2D spectrogram at t0; a vector of 5511 values in the article). Then feed the vector from the next row at t+1, then the next at t+2, and so on, in small sequential steps as the spectrogram evolves over the audio sample. The article used 101 rows (it refers to the spectrogram rows as "timesteps").

  3. Once learning starts and yields results, you can experiment with stacking a second GRU layer, but start with one: RNNs are much harder to train in deep/stacked form.
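Putting points 1–3 together, a minimal single-GRU variant for this per-clip task might look like the sketch below. This is an illustration, not the article's exact network: the layer sizes are borrowed from the question, and it uses return_sequences=False so the model emits one label per clip, which means the original integer labels can be reused without replication.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

num_labels = 8  # ['go', 'down', 'up', 'stop', 'yes', 'left', 'right', 'no']

model = models.Sequential([
    layers.Input(shape=(124, 129)),     # 124 time steps of 129 frequency bins
    layers.Conv1D(196, 15, strides=2),  # downsample the time axis: 124 -> 55
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.GRU(128),                    # return_sequences=False: final state only
    layers.Dropout(0.5),
    layers.Dense(num_labels),           # logits, to match from_logits=True below
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
```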

RNNs are also hard to quantize and accelerate because of their sequential nature, so running them on hardware, especially low-resource edge devices, is difficult. Classic 2D CNNs are often the better choice, even for audio and other sequential data types.


I think your network's design is fine, but you need to change the way you feed the training data.

In essence, you are trying to convert a classic CNN (which takes "image" data, in this case spectrograms) into a sequential, time-domain network (a recurrent one, with a GRU or LSTM at its heart).

To do that, you have to reconsider how you feed the input data:

  1. The input data should not be "image" snapshots of the spectra (the audio spectrum over a whole time window); instead, feed evolving sequences, as with text (i.e., the .wav audio samples in sequential order).

  2. If you would still like to feed spectrograms rather than raw sequential wave samples, try feeding each spectrogram as a set of time-evolving variants of the same audio snapshot, offset by t, t+1, t+2, ..., t+N (recomputing the FFT each time), so that the recurrent GRU unit has a chance to capture sequence information from it.
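Note that the row-by-row feeding described here maps directly onto the question's data: a (124, 129, 1) spectrogram is already a sequence of 124 frames of 129 frequency bins each, so as a starting point it only needs its channel axis dropped. A small NumPy sketch of the shapes involved:

```python
import numpy as np

spectrogram = np.random.rand(124, 129, 1)    # (time, freq, channel), as in the question

# Drop the channel axis so each time step is a 129-dim frequency vector
sequence = np.squeeze(spectrogram, axis=-1)  # shape (124, 129)

# A GRU then consumes one row per step: t, t+1, t+2, ...
first_frame = sequence[0]                    # the frame at t0, shape (129,)
print(sequence.shape, first_frame.shape)  # → (124, 129) (129,)
```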

Also:

  • BatchNorm and ReLU were added, a good way to improve generalization and reduce overfitting, at the price of possibly needing more data to see the benefit.
  • The Pooling & Flatten summarization was replaced by recurrent GRU units, but those units only learn when they capture time-evolving data.
  • The Conv layer in front of the GRUs helps generalization, but combining Conv with GRU also opens up a large combinatorial space, which may implicitly require more input data or prolonged training.

Anyway, it is a fun experiment worth trying!
