Test accuracy very low, while training and validation accuracy are ~85%

I have a training dataset of 10000 pictures and a test dataset of 15000 pictures. There are 23 types of birds.

First of all, I imported the necessary libraries:

import tensorflow as tf 
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator 
from tensorflow.keras import layers 
from tensorflow.keras import Model 
import matplotlib.pyplot as plt

from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

batch_size = 32
IM_WIDTH, IM_HEIGHT = 150, 150 # input size fed to InceptionV3 (its default is 299x299)
nb_epochs = 13

train_dir = '/kaggle/output/working_directory/'

I am using ImageDataGenerator for image augmentation:

#test_datagen = ImageDataGenerator(rescale = 1.0/255.)
test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

train_datagen = ImageDataGenerator(
            rotation_range = 40, 
            width_shift_range = 0.2, 
            height_shift_range = 0.2,
            shear_range = 0.2, 
            zoom_range = 0.2, 
            horizontal_flip = True,
            validation_split=0.2) # set validation split
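
A side note on the two generators above: test_datagen applies preprocess_input while train_datagen does not, so the two pipelines may feed differently scaled pixels to the model. As a rough illustration only, assuming InceptionV3's preprocess_input maps pixel values from 0..255 to -1..1 (the helper name inception_style_scale is hypothetical and stands in for the real function):

```python
def inception_style_scale(pixels):
    """Scale 0..255 pixel values to the -1..1 range (assumed InceptionV3 convention)."""
    return [p / 127.5 - 1.0 for p in pixels]


raw_pixels = [0.0, 127.5, 255.0]             # what a generator without preprocessing yields
scaled_pixels = inception_style_scale(raw_pixels)  # what a generator with preprocessing yields

print("raw:   ", raw_pixels)
print("scaled:", scaled_pixels)
```

If the model is trained on one range and evaluated on the other, test predictions can degrade even when validation accuracy looks fine, so it is worth checking that both generators use the same preprocessing.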

Then I load the data using flow_from_directory:

train_generator = train_datagen.flow_from_directory(train_dir, 
                                                    batch_size = batch_size, 
                                                    class_mode = 'categorical', 
                                                    target_size = (IM_WIDTH, IM_HEIGHT),
                                                    subset = 'training') # 80% of the images

validation_generator = train_datagen.flow_from_directory(train_dir, 
                                                              batch_size = batch_size, 
                                                              class_mode = 'categorical', 
                                                              target_size = (IM_WIDTH, IM_HEIGHT),
                                                              subset = 'validation') # remaining 20%

test_generator = test_datagen.flow_from_directory(
    directory = '/kaggle/input/test/',
    target_size = (IM_WIDTH, IM_HEIGHT),
    color_mode = 'rgb',
    batch_size = 1,
    class_mode = None,
    shuffle = False)

Found 8225 images belonging to 23 classes.

Found 2045 images belonging to 23 classes.

Found 15009 images belonging to 1 classes.

Finally, I loaded the pretrained model and added a classification head:

from tensorflow.keras.applications.inception_v3 import InceptionV3
base_model = InceptionV3(input_shape = (IM_WIDTH, IM_HEIGHT, 3), include_top = False, weights = 'imagenet')

for layer in base_model.layers:
    layer.trainable = True

from tensorflow.keras.optimizers import Adam

# Classification head on top of the InceptionV3 feature maps
x = layers.Flatten()(base_model.output)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dropout(0.4)(x)
x = layers.Dense(23, activation='softmax')(x) # one output per bird class

model = tf.keras.models.Model(base_model.input, x)

model.compile(optimizer = Adam(learning_rate=0.0001), loss = 'categorical_crossentropy', metrics = ['acc'])

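from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

filepath = 'best_model.h5'

# The arguments after the first one are assumptions chosen to match the
# training log below; the patience value in particular is a guess.
es = EarlyStopping(monitor='val_acc', mode='max', patience=5)

checkpoint = ModelCheckpoint(filepath, monitor='val_acc', mode='max',
                             save_best_only=True, verbose=1)

callbacks_list = [checkpoint, es]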

inception = model.fit(train_generator, 
                      steps_per_epoch = train_generator.samples // batch_size,
                      validation_data = validation_generator,
                      validation_steps = validation_generator.samples// batch_size,
                      epochs = nb_epochs,
                      callbacks = callbacks_list)

Epoch 00012: val_acc did not improve from 0.86210
Epoch 13/13
257/257 [==============================] - 91s 355ms/step - loss: 0.2282 - acc: 0.9288 - val_loss: 0.5141 - val_acc: 0.8676

Epoch 00013: val_acc improved from 0.86210 to 0.86756, saving model to best_model.h5

Now, testing:

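import numpy as np
import pandas as pd
from tensorflow.keras.models import load_model

model = load_model('best_model.h5')

# STEP_SIZE_TEST was not defined in the snippet; assumed to be one step per batch
STEP_SIZE_TEST = test_generator.samples // test_generator.batch_size

y_pred = model.predict(test_generator,
                       steps = STEP_SIZE_TEST)

# Index of the highest-probability output for each test image
predictions = [np.argmax(pred) for pred in y_pred]

pd.DataFrame(predictions, columns=['label']).to_csv('prediction.csv')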

After I submit the .csv file, the accuracy is 4.5%. I am very confused: validation accuracy is approx. 85%, and the validation data is not compromised, since the model does not train on it. So why does my model achieve only 4.5% on the test dataset? I believe something is wrong with how I run model.predict and store the predicted values, but I cannot figure out what.

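I believe this could help someone. The problem was that the output indices did not match my class labels. My classes are called 0, 1, 2, 3, 4, ..., 22. However, flow_from_directory assigned output 5 to class 13, output 7 to class 15, and so on, because it orders the class folders alphanumerically by name, as strings ('0', '1', '10', '11', ...), not numerically. So the classes only looked shuffled. It is important to map each output index back to its class label (the mapping is stored in train_generator.class_indices) before writing the predictions.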
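
The index-to-class mapping can be reproduced without Keras. This is a minimal sketch, assuming the class folders are named '0' to '22' as in the question and that flow_from_directory orders class folders alphanumerically by name (which matches output 5 mapping to class 13 and output 7 to class 15). The names index_to_label and decode are hypothetical helpers; with a real generator, the class_indices dictionary built here by hand is available as train_generator.class_indices.

```python
# Folder names as described in the question: '0', '1', ..., '22'
class_names = [str(i) for i in range(23)]

# Assumed behaviour: indices follow the sorted (string) order of the folder names
class_indices = {name: idx for idx, name in enumerate(sorted(class_names))}

# Invert the mapping: model output index -> original class label
index_to_label = {idx: int(name) for name, idx in class_indices.items()}


def decode(predicted_indices):
    """Translate argmax outputs of the model into the original class labels."""
    return [index_to_label[i] for i in predicted_indices]


print(decode([5, 7, 2]))  # [13, 15, 10]
```

Applied to the question's code, the list passed to the DataFrame would be decode(predictions) instead of the raw argmax indices.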


