TensorFlow Speech Emotion Recognition Model gives same prediction for all inputs

Dataset used: RAVDESS (I've only used the audio-only files)

Here's a sample after I've processed the data:

And the code for the label encoding:


    # encode labels as ints, then one-hot encode them
    from sklearn.preprocessing import LabelEncoder
    from keras.utils import np_utils
    import numpy as np

    lb = LabelEncoder()

    y_train = np_utils.to_categorical(lb.fit_transform(y_train))
    y_test = np_utils.to_categorical(lb.fit_transform(y_test))

    # Not sure if this is needed
    x_train = np.expand_dims(x_train, axis=2)
    x_test = np.expand_dims(x_test, axis=2)
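
For context, I believe the expand_dims calls are just there so each sample gets a channel dimension matching the Conv1D input_shape of (259, 1) in the model below. A quick shape check (illustrative, exact sample counts omitted):

    # After expand_dims each sample should be (259, 1); labels are one-hot vectors
    print(x_train.shape)  # expected: (num_train_samples, 259, 1)
    print(y_train.shape)  # expected: (num_train_samples, num_classes)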

Model:

    import keras
    from keras.models import Sequential
    from keras.layers import Conv1D, MaxPooling1D, Dropout, LSTM, Flatten, Dense

    # Conv1D feature extractor -> single-unit LSTM -> two Dense layers
    model = Sequential()
    model.add(Conv1D(16, 5, strides=2, padding='same', input_shape=(259, 1)))
    model.add(Conv1D(16, 5, padding='same', activation='relu'))
    model.add(Dropout(0.1))
    model.add(MaxPooling1D(pool_size=6))
    model.add(LSTM(1))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='softmax'))

    model.summary()

    opt = keras.optimizers.RMSprop(lr=0.00001, decay=1e-6)

    model.compile(metrics=['accuracy'], optimizer=opt, loss='categorical_crossentropy')

    history = model.fit(x_train, y_train, batch_size=1, epochs=15,
                        validation_data=(x_train, y_train))
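
The plateau I describe below is what I see in the history object returned by model.fit; roughly how I read it (the keys may be 'acc'/'val_acc' on older Keras versions):

    # Per-epoch accuracy recorded during training
    print(history.history['accuracy'])      # training accuracy per epoch
    print(history.history['val_accuracy'])  # validation accuracy per epoch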

When I call model.predict() on the test data, the model gives the exact same output for every input. I have 10 classes, and the accuracy basically always plateaus around 0.1399.
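
To be concrete, this is the kind of check I'm doing (just a sketch, using the x_test from above):

    # Every row of preds comes out identical, so argmax picks the same class every time
    preds = model.predict(x_test)
    print(preds[:3])
    print(np.unique(np.argmax(preds, axis=1)))  # only a single class index shows up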

Here's a sample of what my model.predict() returns:

I've played around with the model quite a bit now, but I just can't get these results to change.

Any ideas what I could do?

Topic: audio-recognition, keras, tensorflow, neural-network, python
