When using data augmentation, is it OK to validate only with the original images?

I'm working on a multi-class deep learning classifier and I was seeing severe overfitting:

My model is supposed to classify sunglasses into 17 different brands, but I only had around 400 images per brand, so I created a folder with the data augmented 3x, generating images with these parameters:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

After doing so I got these results:

I don't know whether it's correct to validate only on the original images or whether I should include the augmented images in the validation set as well. It also seems strange to me to get higher accuracy on validation than on training.

Topic data-augmentation image-recognition image-classification deep-learning neural-network

Category Data Science


I just want to point out that, although we generally do not apply data augmentation to the validation data, there is a technique called "test-time augmentation" (TTA). In its simplest form, we make several predictions for each image, some on augmented copies (transformed the same way as during training), and then ensemble these predictions into the final prediction.
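As a rough illustration of the idea, here is a minimal sketch of test-time augmentation in NumPy. The model here is a stand-in: `toy_predict` is purely illustrative, and the flip/shift transforms are hypothetical examples of "the same augmentations as training".

```python
import numpy as np

def toy_predict(batch):
    # Stand-in "model": turns mean pixel intensity per image into
    # two-class probabilities. Any real model.predict would go here.
    p = batch.mean(axis=(1, 2, 3))
    return np.stack([p, 1 - p], axis=1)

def tta_predict(predict_fn, image, n_aug=4, rng=None):
    """Predict on the original image plus augmented copies, then average."""
    rng = rng or np.random.default_rng(0)
    views = [image]
    for _ in range(n_aug):
        aug = image.copy()
        if rng.random() < 0.5:           # random horizontal flip
            aug = aug[:, ::-1, :]
        shift = int(rng.integers(-2, 3)) # small width shift
        aug = np.roll(aug, shift, axis=1)
        views.append(aug)
    batch = np.stack(views)              # (n_aug + 1, H, W, C)
    probs = predict_fn(batch)            # (n_aug + 1, n_classes)
    return probs.mean(axis=0)            # ensemble by averaging

image = np.random.default_rng(42).random((8, 8, 3))
final = tta_predict(toy_predict, image)
```

The averaging step is the "ensemble"; with a real classifier you would average the softmax outputs over the augmented views in the same way.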


You don't need to validate using data augmentation. You are using augmentation only for training (because you don't have enough data). If you had enough data, there would be no point in augmenting it.

You need data augmentation to reduce overfitting, but there are other ways of reducing overfitting too, such as dropout.
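For reference, here is a minimal NumPy sketch of (inverted) dropout, the other regularizer mentioned above. During training each unit is zeroed with probability `rate` and the survivors are scaled by 1 / (1 - rate) so the expected activation is unchanged; at inference time the layer is a no-op.

```python
import numpy as np

def dropout(x, rate=0.5, training=True, rng=None):
    """Inverted dropout: zero units at random during training only."""
    if not training or rate == 0.0:
        return x                                  # no-op at inference
    rng = rng or np.random.default_rng(0)
    keep = (rng.random(x.shape) >= rate).astype(x.dtype)
    return x * keep / (1.0 - rate)                # rescale survivors

activations = np.ones((4, 8))
train_out = dropout(activations, rate=0.5, rng=np.random.default_rng(1))
eval_out = dropout(activations, rate=0.5, training=False)
```

In Keras this is just a `Dropout(rate)` layer between your dense or convolutional layers; the sketch only shows what it does under the hood.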


You should validate only on the original images. The augmentation is there so that it can help your model generalize better, but to evaluate your model you need actual images, not transformed ones.

To do this in Keras you need to define two instances of ImageDataGenerator, one for training and one for validation, then pass both generators to the fit_generator function.

train_gen = ImageDataGenerator(**aug_params).flow_from_directory(train_dir)
valid_gen = ImageDataGenerator().flow_from_directory(valid_dir)

model.fit_generator(train_gen, validation_data=valid_gen)

It is possible to achieve higher validation accuracy than training accuracy if you heavily augment the training data, since the augmented training samples are harder to classify than the unmodified validation images.


Ideally, data augmentation is a step in your training pipeline that comes after splitting your data into train/validation/test sets. Otherwise you end up with the same data point in both training and testing, even if it is a little rotated.

So your training pipeline could be something like this:

          +-> training set ---> data augmentation --+
          |                                         |
          |                                         +-> model training --+
          |                                         |                    |
all data -+-> validation set -----------------------+                    |
          |                                                              +-> model testing
          |                                                              |
          |                                                              |
          +-> test set --------------------------------------------------+
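The pipeline above can be sketched in a few lines of NumPy. Everything here is a placeholder: the 100-image array stands in for the real dataset, the flip/shift stands in for the real augmentations, and the 3x factor matches the question. The point is only the ordering: split first, then augment the training set alone.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((100, 8, 8, 3))        # stand-in dataset

# 1. Split into train / validation / test BEFORE any augmentation.
idx = rng.permutation(len(images))
train_idx, valid_idx, test_idx = idx[:70], idx[70:85], idx[85:]
train, valid, test = images[train_idx], images[valid_idx], images[test_idx]

# 2. Augment the training set only (3x, as in the question).
def augment(img, rng):
    out = img[:, ::-1, :] if rng.random() < 0.5 else img   # random flip
    return np.roll(out, int(rng.integers(-2, 3)), axis=1)  # small shift

augmented = np.stack([augment(img, rng)
                      for img in train for _ in range(3)])
train_full = np.concatenate([train, augmented])

# Validation and test stay untouched, and no index is shared
# between the three splits.
```

Because the split happens on the original indices, no augmented copy of a validation or test image can leak into training.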
