Converting a speech recognition model from CNNs to GRUs
I am trying to convert the simple audio recognition example from TensorFlow to use GRUs instead of CNNs.
The idea is to classify an audio clip into one of 8 labels: ['go', 'down', 'up', 'stop', 'yes', 'left', 'right', 'no']
The original code builds a model as follows:
norm_layer = preprocessing.Normalization()
norm_layer.adapt(spectrogram_ds.map(lambda x, _: x))

model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32),
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])
The input shape is (124, 129, 1): single-channel spectrograms with 124 time steps and 129 frequency bins. The X data is the spectrogram, and the Y data is an integer label that indexes into the labels array.
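As a minimal sketch of that encoding (the shapes and label list are the ones described above; the arrays here are just placeholders):

```python
import numpy as np

commands = ['go', 'down', 'up', 'stop', 'yes', 'left', 'right', 'no']

x = np.zeros((124, 129, 1), dtype=np.float32)  # one single-channel spectrogram
y = commands.index('stop')                     # integer index into the labels
print(x.shape, y)  # (124, 129, 1) 3
```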
I have tried converting the above to use GRUs as follows:
for spectrogram, _ in spectrogram_ds.take(1):
    input_shape = spectrogram.shape  # (None, spectrogram.shape[0], spectrogram.shape[1])
print('Input shape:', input_shape)
num_labels = len(commands)

model = models.Sequential([
    layers.Input(shape=input_shape),
    # Step 1: CONV layer
    layers.Conv1D(196, 15, strides=2),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dropout(0.8),
    # Step 2: first GRU layer
    layers.GRU(128, return_sequences=True, reset_after=True),
    layers.Dropout(0.8),
    layers.BatchNormalization(),
    # Step 3: second GRU layer
    layers.GRU(128, return_sequences=True, reset_after=True),
    layers.Dropout(0.8),
    layers.BatchNormalization(),
    layers.Dropout(0.8),
    # Step 4: time-distributed dense layer
    layers.TimeDistributed(layers.Dense(num_labels, activation='sigmoid')),
])
The input shape in this case is (m, 124, 129). The Conv1D layer reduces the time steps from 124 to 55.
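For reference, the 55 follows from the usual output-length formula for a 'valid'-padding convolution, using the kernel size 15 and stride 2 from the Conv1D layer above:

```python
# Output length of a 1-D convolution with 'valid' padding:
#   out = floor((in - kernel) / stride) + 1
time_steps, kernel, stride = 124, 15, 2
out = (time_steps - kernel) // stride + 1
print(out)  # 55
```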
In this case, the X data is still the spectrogram data. For the Y data, I replicated the label index over the 55 time steps, so Y has shape (m, 55, 1).
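The replication I did can be sketched in NumPy like this (the exact code isn't shown above, so this is an illustrative reconstruction with a made-up batch of 4 labels):

```python
import numpy as np

m, time_steps = 4, 55            # batch of 4 clips, 55 output time steps
labels = np.array([2, 0, 7, 5])  # one integer label per clip

# Repeat each clip's label across every time step: (m,) -> (m, 55, 1)
y = np.repeat(labels[:, None], time_steps, axis=1)[..., None]
print(y.shape)  # (4, 55, 1)
```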
The training is done as follows:
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

EPOCHS = 10
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    callbacks=[tf.keras.callbacks.EarlyStopping(verbose=1, patience=2)],
)
The issue is that I am getting very low accuracy with the GRU model compared to the CNN. Training the GRU is also very slow. My feeling is that I have not set up the model correctly; in particular, I am not sure the Y data is set up correctly for the GRU.
I'd appreciate any insights in setting up this model correctly. Thanks!
Tags: gru, tensorflow
Category: Data Science