Same validation accuracy, different train accuracy for two neural network models

I'm performing emotion classification on the FER2013 dataset. I'm comparing the performance of different models, and when I tried ImageDataGenerator with a model I had already used (setup sketched below), I ran into the following situation:

Model without data augmentation got:

  • train_accuracy = 0.76
  • val_accuracy = 0.70

Model with data augmentation got:

  • train_accuracy = 0.86
  • val_accuracy = 0.70

As you can see, validation accuracy is the same for both models, but train accuracy is significantly different. In this case:

  • Should I go with the model that uses data augmentation, since its train accuracy is higher?
  • Should I expect overfitting from it, and choose the model without data augmentation, since its accuracy values are closer together?
  • Third option: should I perform further checks? If so, which ones?
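
For reference, this is roughly the kind of setup I'm comparing. The parameter values and directory layout below are illustrative, not my exact configuration:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Augmented pipeline (parameter values here are illustrative)
    train_datagen = ImageDataGenerator(
        rescale=1.0 / 255,
        rotation_range=10,
        width_shift_range=0.1,
        height_shift_range=0.1,
        horizontal_flip=True,
    )

    # Baseline pipeline: rescaling only, no augmentation
    plain_datagen = ImageDataGenerator(rescale=1.0 / 255)

    # FER2013 images are 48x48 grayscale; the directory layout is hypothetical
    train_generator = train_datagen.flow_from_directory(
        "fer2013/train",
        target_size=(48, 48),
        color_mode="grayscale",
        class_mode="categorical",
        batch_size=64,
    )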

Thanks for your time.

Topic: data-augmentation, model-selection, accuracy, neural-network

Category: Data Science


Given two models that have the same out-of-sample performance but different in-sample performance, I would go with the simpler model. In other words, you get no performance gain by going with the second model, but you take on drawbacks: greater complexity, perhaps even worse overfitting, and increased computing time.

However, accuracy is a flawed metric!

Compare the two models on their cross-entropy loss, which I suspect you're using to optimize them. Cross-entropy loss is a strictly proper scoring rule, while accuracy is not.
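
As a concrete check, here is a minimal sketch of that comparison in Keras, assuming both models were compiled with categorical cross-entropy and an accuracy metric (model_plain, model_aug, and val_generator are placeholder names):

    # evaluate() returns [loss, accuracy] when the models were compiled with
    # loss="categorical_crossentropy" and metrics=["accuracy"]
    loss_plain, acc_plain = model_plain.evaluate(val_generator, verbose=0)
    loss_aug, acc_aug = model_aug.evaluate(val_generator, verbose=0)

    # Prefer the model with lower validation cross-entropy: the loss measures
    # the quality of the predicted probabilities, not just the argmax decisions
    print(f"plain:     val loss {loss_plain:.4f}, val acc {acc_plain:.4f}")
    print(f"augmented: val loss {loss_aug:.4f}, val acc {acc_aug:.4f}")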

Please read the post linked below on Cross Validated, the statistics Stack Exchange site, and follow the links there to Frank Harrell's blog.

https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models/312787#312787

I also have a post over there about how to talk to your boss about using a proper scoring rule instead of accuracy.

https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email


Based on these numbers, both models generalize equally well. However, both have overfit, the second one more severely, and that is what you want to avoid. For example, if you're using early stopping, I'd expect the best validation loss to end up closer to the training loss. You can also turn up regularization such as dropout; see the sketch below.
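
A minimal sketch of what that might look like in Keras, assuming a simple CNN; the architecture, dropout rate, and patience are placeholder values:

    from tensorflow.keras import callbacks, layers, models

    # Small CNN with dropout; layer sizes and rates are placeholders
    model = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(48, 48, 1)),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                    # turn this up if still overfitting
        layers.Dense(7, activation="softmax"),  # 7 FER2013 emotion classes
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    # Early stopping: halt when validation loss stops improving and keep the
    # weights from the best epoch
    early_stop = callbacks.EarlyStopping(monitor="val_loss",
                                         patience=5,
                                         restore_best_weights=True)
    model.fit(train_generator,
              validation_data=val_generator,
              epochs=100,
              callbacks=[early_stop])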

If you do so, I think you'll find that the augmented setup ends up producing the better model (lower validation loss).
