dataset split for image classification

Question

dataset split for image classification

Hello-experts

2022年6月5日 00:06

I am trying to do image classification for 14 categories (around 1000 images for each cat). And i initially created two folders for training and validation. In this case, do I still need to set a validation split or a subset in a code? or I can use the whole files as train_ds and val_ds by deleting them

Folder names in the training and validation directory are same.

data_dir = 'trainingdatav1'
data_val = 'Validationv1'



train_ds = tf.keras.preprocessing.image_dataset_from_directory(
                                                                data_dir,
                                                                validation_split=0.1, #is it required if I'm gonna use the whole folders and files for training?
                                                                subset=training,
                                                                seed=123,
                                                                image_size=(img_height, img_width),
                                                                batch_size=batch_size)


val_ds = tf.keras.preprocessing.image_dataset_from_directory(
                                                              data_val,
                                                              validation_split=0.8, #need to check
                                                              subset=validation,
                                                              seed=455,
                                                              image_size=(img_height, img_width),
                                                              batch_size=batch_size)


num_classes = 14

model = tf.keras.Sequential([
  layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
  
  layers.Conv2D(16, 3, padding='same', activation='softmax'),
  layers.MaxPooling2D(),
  
  layers.Conv2D(32, 3, padding='same', activation='relu'),  #from renu
  layers.MaxPooling2D(),
  
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Dropout(.2),             #prevent overfitting        

  layers.Flatten(),
  layers.Dense(128, activation='sigmoid'),
  layers.Dense(num_classes)
])

model.compile(optimizer='SGD', #adam
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()

epochs=50
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

Another question is the overfitting issue - validation accuracy is not over 0.4 and val_loss is around 2.xxx. Suggestions from Stacexchange are:

Reduce the layers of the neural network.
Reduce the number of neurons in each layer of the network to reduce the number of parameters.
Add dropout and tune its rate.
Use L2 normalisation on the parameter weights and tune the lambda value.
If possible add more data for training.

Are there any other suggestions?

Topic validation overfitting image-classification dataset

Category Data Science

Luis Ezequiel Muñoz · Accepted Answer · 2021年4月16日 02:18

The problem with a pre-defined validation set is that it can lead to overfitting more easily: the primary purpose of a validation set is to detect overfitting and if you keep tuning your hyperparameters for your model using a fixed validation set every time you train, then your model hyperparameters may be overfitting to that specific validation set.

VRaina · Accepted Answer · 2021年4月16日 00:06

If you have already split your training and validation sets into separate directories then there is no need to technically do the splitting in your code. However, the problem with a pre-defined validation set is that it can lead to overfitting more easily: the primary purpose of a validation set is to detect overfitting and if you keep tuning your hyperparameters for your model using a fixed validation set every time you train, then your model hyperparameters may be overfitting to that specific validation set.

Another problem with a fixed validation set is that it prevents you from exploiting approaches such as K-fold cross validation, where you split your data into K groups and each each group takes its turn as a validation set during training (prevents hyperparameter overfitting to a specific validation set as discussed above)...

With regard to your question on overfitting, your list is very comprehensive. However, I would add Early Stopping to that list - this you may already be doing - but good to mention explicitly.

dataset split for image classification

About