Test accuracy on real-world data is lower than validation accuracy on data collected in a simulation environment

Background:

Problem type: Multi-class classification

The dataset contains around 1,000 samples (a simulated dataset of sensor signals), i.e. an array of shape (1000 × 1000 × 8), where each sample is a 2D array of shape (1000 × 8). Additionally, I have a small amount of real-world data, of shape (100 × 1000 × 8).

I split the simulated data into training and validation sets and use the real-world data as the test set.

I performed 5-fold cross-validation plus data augmentation, since I have few training samples; the augmentation also takes care of the class imbalance.
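For reference, a minimal sketch of the 5-fold setup described above, assuming the simulated data lives in NumPy arrays `X_sim` (shape `(1000, 1000, 8)`) and `y_sim`; the array names and the `augment`/`model` placeholders are assumptions, not code from the question:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed: X_sim has shape (1000, 1000, 8), y_sim holds the class labels.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_sim, y_sim)):
    X_train, y_train = X_sim[train_idx], y_sim[train_idx]
    X_val, y_val = X_sim[val_idx], y_sim[val_idx]
    # Placeholder for whatever augmentation pipeline is actually used:
    # X_train, y_train = augment(X_train, y_train)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val))
```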

I built a convolutional neural network (CNN) and evaluated the model using accuracy as the metric.

Problem:

The model performs well on the validation data (around 85% accuracy), whereas it reaches only about 60% accuracy on the test data.

What does this mean? Why is my model not performing well on the test set (the real-world data)?

I found a similar question asked before, but I was not able to understand what the accepted answer was explaining. So I am posting this question again to gain more insight into the problem and the required actions.

Thank you...

Topic sensors cnn deep-learning accuracy neural-network

Category Data Science


The issue is that your simulated data and your real-world data come from different statistical distributions. When the training and validation data come from the same distribution, the model performs well because it has learned the underlying distribution of the simulated dataset from the training data. But when you test the model on data from a different distribution, i.e. the real-world dataset, the model knows very little about that distribution and hence performs poorly.
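One way to see this shift concretely is to compare simple per-channel statistics of the two datasets; if they differ markedly, you have direct evidence of the distribution gap. A minimal sketch, assuming the arrays are named `X_sim` and `X_real` with shapes `(1000, 1000, 8)` and `(100, 1000, 8)`:

```python
import numpy as np

# Assumed names: X_sim (1000, 1000, 8) simulated, X_real (100, 1000, 8) real-world.
# Compare per-channel mean and std, pooled over samples and time steps.
for ch in range(X_sim.shape[-1]):
    sim_ch, real_ch = X_sim[..., ch], X_real[..., ch]
    print(f"channel {ch}: sim mean={sim_ch.mean():.3f} std={sim_ch.std():.3f} | "
          f"real mean={real_ch.mean():.3f} std={real_ch.std():.3f}")
```

Large gaps in these statistics (or in histograms per channel) usually explain most of the accuracy drop.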

Solution: Don't just split the simulated dataset into train and validation sets and then infer on the real-world data. Instead, split both the real-world data and the simulated data into train, validation, and test sets. For example, with 1,000 simulated samples and 100 real-world samples and an 80-10-10 split, your training set would be 800 simulated + 80 real-world samples, your validation set 100 simulated + 10 real-world samples, and your test set 100 simulated + 10 real-world samples.
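A minimal sketch of such a mixed split, applying scikit-learn's `train_test_split` twice to get 80-10-10 and then merging the simulated and real-world pieces; the array names `X_sim`, `y_sim`, `X_real`, `y_real` are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_80_10_10(X, y, seed=42):
    # First carve off 20% for val+test, then halve that into val and test.
    # Note: with very few real-world samples per class, stratification
    # may fail and you may need to drop the stratify argument.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)

# Split simulated and real-world data separately, then merge the pieces
# so every subset contains samples from both distributions.
sim_tr, sim_val, sim_te = split_80_10_10(X_sim, y_sim)
real_tr, real_val, real_te = split_80_10_10(X_real, y_real)

X_train = np.concatenate([sim_tr[0], real_tr[0]])
y_train = np.concatenate([sim_tr[1], real_tr[1]])
X_valid = np.concatenate([sim_val[0], real_val[0]])
y_valid = np.concatenate([sim_val[1], real_val[1]])
X_test  = np.concatenate([sim_te[0], real_te[0]])
y_test  = np.concatenate([sim_te[1], real_te[1]])
```

This way the model sees some real-world samples during training, and the validation score becomes a more honest estimate of real-world performance.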
