Drawing validation set from test set

I am building 3 neural network models on a dataset that comes already split into train and test sets. From my analysis, I found that the test set contains values that do not appear in the train set. This puts a ceiling on what my models can achieve: I cannot improve the accuracy no matter how I change the hyperparameters or the parameters of my models.

I have created 3 neural network models and varied almost everything:

  1. Number of nodes/hidden layers,
  2. Input features (performed feature selection and dimensionality reduction),
  3. Activation functions and loss functions,
  4. Regularization, optimizer, and more.

When I average the predictions of the 3 models, I don't see any improvement. I've read in many places that varying such parameters should give me somewhat uncorrelated models, but that wasn't the case for me: whenever I compute the Pearson correlation between my models' predictions, they turn out to be highly correlated.
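For reference, the pairwise check I run looks roughly like this (a minimal sketch; `preds_a`, `preds_b`, `preds_c` are placeholder arrays standing in for my three models' outputs on the same examples):

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder prediction arrays from the three trained models on the same
# held-out examples; replace with your own model outputs.
preds_a = np.array([0.91, 0.40, 0.77, 0.12, 0.66])
preds_b = np.array([0.88, 0.35, 0.80, 0.20, 0.60])
preds_c = np.array([0.95, 0.45, 0.70, 0.15, 0.72])

# Pairwise Pearson correlation between model predictions; values close to 1
# mean the models make nearly the same errors, so averaging them adds little.
pairs = {
    "A vs B": (preds_a, preds_b),
    "A vs C": (preds_a, preds_c),
    "B vs C": (preds_b, preds_c),
}
for name, (x, y) in pairs.items():
    r, p = pearsonr(x, y)
    print(f"{name}: r = {r:.3f} (p = {p:.3f})")
```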

After building all these models, I am fairly sure that the training set and test set are not drawn from the same distribution (i.e., they are not a random split of some full original dataset), which means that other features probably also have different distributions across the two sets.
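One simple way I check this suspicion feature by feature is a two-sample Kolmogorov-Smirnov test (a sketch assuming tabular data; `train_df`, `test_df`, and their values are placeholders for my actual feature tables):

```python
import pandas as pd
from scipy.stats import ks_2samp

# Placeholder DataFrames; substitute your actual train/test feature tables.
train_df = pd.DataFrame({"x1": [0.1, 0.2, 0.3, 0.4], "x2": [1.0, 1.1, 0.9, 1.2]})
test_df = pd.DataFrame({"x1": [0.8, 0.9, 1.0, 1.1], "x2": [1.0, 1.0, 1.1, 0.9]})

# Two-sample Kolmogorov-Smirnov test per numeric feature: a small p-value
# suggests that feature's train and test distributions differ.
for col in train_df.columns:
    stat, p = ks_2samp(train_df[col], test_df[col])
    print(f"{col}: KS statistic = {stat:.3f}, p = {p:.3f}")
```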

Some have proposed that I merge the training and test sets, but I don't want to do that, since the dataset was deliberately published with this split. What I would like to do instead is draw my validation set from the test set. Is this possible? Can I use a validation set randomly drawn from the test set to tune my models?

Topics: ensemble-modeling, correlation, cross-validation, neural-network, python

Category: Data Science


I don't think training with the test set is justified. What is more justified is augmenting the train set with external data, if any is available (in that case, please cite the external data when producing your report, especially for academic reporting).

If your data is tabular, you can try working on feature engineering or improving your preprocessing. If your data consists of images, you can try adding external data (anything except your own test set) or applying data augmentation, e.g. along the lines of the sketch below.
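As one illustration, a typical augmentation pipeline with torchvision might look like the following (the choice of torchvision and the specific transforms are just an example, not a prescription):

```python
from torchvision import transforms

# A sketch of a common augmentation pipeline; the transforms and their
# parameters are illustrative and should be adapted to your images.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),           # random crop and rescale
    transforms.RandomHorizontalFlip(),           # mirror images at random
    transforms.ColorJitter(brightness=0.2,       # mild photometric changes
                           contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Pass train_transforms as the `transform` argument of your Dataset, e.g.
# torchvision.datasets.ImageFolder("path/to/train", transform=train_transforms)
```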

As long as your network does not overfit and is able to generalize well, this should not be an issue. I assume the results will be compared within an academic scope. If so, do not worry too much, since other people will likely face the same issue. If you manage to solve it, that is great, because it implies you can design a network that generalizes very well; if not, as long as you can explain the issue you faced, a less desirable result is quite understandable.


Forget that you are working with a neural network for a moment. Hopefully you are also taking time into account. If you were performing an ordinary regression and time was one component among others, you would have to apply an extrapolation penalty to your confidence intervals to penalize your model for deviating from the observed range.

Another possibility is that there is an intervention in your dataset. Meaning, it is possible that something occurred, and what you really need to do is a variance test between the training data and the test data.
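As one concrete instance of such a test, Levene's test from SciPy compares the spread of two samples (a sketch with placeholder arrays; substitute the column you suspect):

```python
import numpy as np
from scipy.stats import levene

# Placeholder values; substitute the target or feature column you suspect.
train_values = np.array([2.1, 2.4, 1.9, 2.2, 2.0, 2.3])
test_values = np.array([1.2, 3.5, 0.8, 4.1, 2.7, 0.5])

# Levene's test compares the variances of two samples; a small p-value
# indicates the spread differs between the training and test data.
stat, p = levene(train_values, test_values)
print(f"Levene statistic = {stat:.3f}, p = {p:.3f}")
```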

Traditional regression minimizes variance around the mean, but there is also median regression, which is designed for exactly this type of issue, where dispersion in the model is the problem.
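A minimal sketch of median regression with statsmodels, using made-up data, to show the idea next to ordinary least squares:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data with heavy-tailed noise; replace with your own columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 2.0 * df["x"] + rng.standard_t(3, size=200)

# OLS minimizes squared error around the conditional mean; quantile
# regression at q=0.5 (median regression) minimizes absolute error and is
# less sensitive to dispersion and outliers.
ols_fit = smf.ols("y ~ x", df).fit()
median_fit = smf.quantreg("y ~ x", df).fit(q=0.5)

print("OLS coefficients:       ", ols_fit.params.values)
print("Median regression coefs:", median_fit.params.values)
```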
