ValueError: Error when checking input: expected the_input to have 3 dimensions, but got array with shape (14174, 1)

Hope you're all doing well!

I am working on Automatic Speech Recognition in Python with the LibriSpeech dataset. After preprocessing the audio data and extracting MFCC features, I append everything into a list and get an array of shape (14174,). Each sample has a different length but the same number of features, for example:

print(X[0].shape)
print(X[12000].shape)
 (615, 13)
 (301, 13)

Now when I feed the data into my network with an Input layer defined as

input_data = Input(name='the_input', shape=(None, input_dim)) # with input_dim = 13 MFCC features

I get the following error

ValueError: Error when checking input: expected the_input to have 3 dimensions, but got array with shape (14174, 1)

I tried reshaping the array into various shapes, but I am still struggling.
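For what it's worth, the shape in the error can be reproduced with a minimal sketch: when NumPy is given sequences of different lengths, it falls back to a 1-D object array, so the time and feature axes are invisible to Keras (the toy shapes below mirror the ones printed above and are only stand-ins for the real MFCC matrices):

```python
import numpy as np

# Stand-ins for X[0] and X[12000]: different frame counts, 13 features each.
x0 = np.zeros((615, 13))
x1 = np.zeros((301, 13))

# Stacking ragged arrays yields a 1-D object array, not a 3-D tensor:
X = np.array([x0, x1], dtype=object)
print(X.shape)      # (2,) -- no time or feature axis visible to Keras
print(X[0].shape)   # (615, 13) -- each element still keeps its own shape
```

This is why the Input layer, which expects (batch, time, features), complains about the number of dimensions.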

This is the model

def final_model(input_dim, units, output_dim=29):
    """Build a bidirectional recurrent network for speech."""
    
    # Main acoustic input
    input_data = Input(name='the_input', shape=(None, input_dim))
    
    # =============== 1st Layer =============== #
    # Add bidirectional recurrent layer
    bidirectional_rnn = Bidirectional(GRU(units, activation=None, return_sequences=True, implementation=2, name='bidir_rnn_1'))(input_data)
    # Add batch normalization
    batch_normalization = BatchNormalization(name='bn_bidir_rnn_1')(bidirectional_rnn)
    # Add activation function
    activation = Activation('relu')(batch_normalization)
    # Add dropout
    #drop = Dropout(rate = 0.1)(activation)
    
    # =============== 2nd Layer =============== #
    # Add bidirectional recurrent layer
    bidirectional_rnn = Bidirectional(GRU(units, activation=None, return_sequences=True, implementation=2, name='bidir_rnn_2'))(activation)
    # Add batch normalization
    batch_normalization = BatchNormalization(name='bn_bidir_rnn_2')(bidirectional_rnn)
    # Add activation function
    activation = Activation('relu')(batch_normalization)
    # Add dropout
    #drop = Dropout(rate = 0.1)(activation)
    
    # =============== 3rd Layer =============== #
    # Add a TimeDistributed(Dense(output_dim)) layer
    time_dense = TimeDistributed(Dense(output_dim))(activation)
    # Add softmax activation layer
    y_pred = Activation('softmax', name='softmax')(time_dense)
    
    # Specify the model
    model = Model(inputs=input_data, outputs=y_pred)
    model.output_length = lambda x: x
    print(model.summary())
    return model

Thanks



Your inputs have different lengths, so as suggested by @skrrrt, you should pad your data and apply a mask in your model.

The following pads all your inputs with 0.0 values so that every sequence has the same length.

from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_inputs = pad_sequences(X, padding="post", dtype='float')

You can choose which value to use for padding with the parameter value=0.0 (documentation)
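To see what the padded result looks like, here is a small NumPy-only sketch that mimics post-padding with zeros on two toy MFCC matrices (the shapes are illustrative, not from the dataset):

```python
import numpy as np

# Toy stand-ins for two MFCC matrices with different frame counts, 13 features each.
X = [np.ones((5, 13)), np.ones((3, 13))]

# Pad each sequence with rows of 0.0 at the end ("post") up to the longest one.
max_len = max(x.shape[0] for x in X)
padded = np.zeros((len(X), max_len, 13))
for i, x in enumerate(X):
    padded[i, :x.shape[0], :] = x

print(padded.shape)  # (2, 5, 13): samples x max_length x features
```

The result is a proper 3-D tensor of shape (samples, max_length, features), which matches what the Input layer `shape=(None, input_dim)` expects.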

Then, add a masking layer just after your input layer in your model.

    # Main acoustic input
    input_data = Input(name='the_input', shape=(None, input_dim))
    masked_input = Masking(mask_value=0.0)(input_data)
    bidirectional_rnn = Bidirectional(GRU(units, activation=None,return_sequences=True, implementation=2, name='bidir_rnn'))(masked_input)

Refer to this tutorial for more information.
