Is test data required to be transformed by training data statistics?

I am using a dataset from the literature to build an MLP, and then classifying real-world samples (from a wet-lab experiment) with that MLP. The MLP's performance on the literature dataset is good enough. I follow the standard preprocessing procedure: after splitting, I first standardize the training data with fit_transform, and then the test data with transform, so that only the training data's statistics (mean and std) are used to standardize unseen data.
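
For reference, this is the procedure in sklearn terms (a minimal sketch with synthetic placeholder data standing in for the literature dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data standing in for the literature dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 6, size=200)  # 6 classes, as in my problem

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on unseen data
```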

However, when I standardize the experimental samples with the training mean and std and then predict, the predictions are very bad. When I instead standardized the experimental data on its own with fit_transform, the predictions turned out great. This did not change even when we swapped the literature dataset for another one (we first thought our experimental data was simply incompatible with the literature data): the model performed terribly when we standardized the experimental data with the training statistics, and very well when we used the experimental data's own mean and std.
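
For concreteness, these are the two variants I am comparing (continuing the snippet above; `exp_X` is just a stand-in for our wet-lab matrix, and the outcomes in the comments are what I observe on my real data, not what this synthetic data would show):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in for the wet-lab samples (shifted on purpose to mimic a
# distribution mismatch)
exp_X = rng.normal(loc=0.5, size=(30, 10))

mlp = MLPClassifier(max_iter=500, random_state=0).fit(X_train_scaled, y_train)

# Variant 1: training statistics (the standard procedure) -> bad predictions
preds_train_stats = mlp.predict(scaler.transform(exp_X))

# Variant 2: the experimental data's own statistics -> good predictions
preds_own_stats = mlp.predict(StandardScaler().fit_transform(exp_X))
```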

So, my question is: is it really necessary to standardize unseen data (whether test data or real-world data) with the training statistics? For example, it is strictly standard procedure to split before standardizing; is this a similar rule? I know that since the test and training data are drawn from the same distribution, using only the training mean and std should be fine. However, in my case the sample sizes are quite small, and now I am not sure I should use this method even for my test data, let alone for real-world data.

My other question is: why does the latter work? Why does the model classify the real-world data very well (on 6 classes!) when I transform that data with its own statistics?

I tried all the other normalization and scaling techniques in sklearn, but the result is the same. I also realize that once the model is deployed, new data will arrive one sample at a time, not as a matrix (so it is impossible to use the real-world data's own statistics then, right?).
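
In other words, the deployment situation would look like this (continuing the snippets above): a single row has no meaningful mean/std of its own, so only the frozen training scaler can be applied.

```python
x_new = exp_X[0].reshape(1, -1)              # a single incoming sample
pred = mlp.predict(scaler.transform(x_new))  # must reuse training mean/std
```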

Please advise me on whether I can or cannot scale unseen data using its own statistics. Thank you!

Topic preprocessing feature-scaling classification python machine-learning

Category Data Science


The train/test split should always be done first; any other order carries a risk of data leakage and consequently overestimated performance on the test set.
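
One way to make this ordering automatic is an sklearn Pipeline; here is a minimal sketch with placeholder data (substitute your own X and y):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data; substitute your own features X and labels y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 6, size=200)

# The scaler is (re)fitted inside each training fold only, so no
# held-out statistics leak into the preprocessing
pipe = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))
scores = cross_val_score(pipe, X, y, cv=5)
```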

In the logic of supervised learning, training is always done with a set of instances, but the "testing" (whether for evaluation or for production) should be doable with any number of instances, including a single one.

Of course, testing is usually done with a set of instances too (in particular for the reliability of the performance estimate), but the principle is that once the model has been trained, it can be applied to any instance and predict a single result. This assumption is broken if the result depends on the other instances in the test set. This is why the parameters needed to apply the model must be fixed at training time.
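
A sketch of this principle in sklearn terms, continuing the snippet above (the file name is hypothetical): the fitted scaler parameters are frozen with the model at training time and applied unchanged to each new instance.

```python
import joblib

pipe.fit(X, y)                            # scaler statistics and weights fixed here
joblib.dump(pipe, "mlp_pipeline.joblib")  # persist the frozen pair

# Later, in production: one instance in, one prediction out
loaded = joblib.load("mlp_pipeline.joblib")
single_pred = loaded.predict(X[:1])       # a single sample, scaled with training stats
```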

What you describe is a symptom of a different problem: the training set is not representative enough. Clearly the training data and the real-world test data don't follow the same distribution, and this breaks an even more fundamental assumption of supervised learning. There are alternatives to strict supervised learning, especially with recent models, where a pretrained model goes through an additional short training stage to adapt to new data (this is a different topic and I'm not really knowledgeable about it).
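
A quick way to check this diagnosis is to compare per-feature statistics of the two sets; a sketch with placeholder arrays:

```python
import numpy as np

# Placeholder arrays; substitute your training and real-world matrices
X_train = np.random.default_rng(0).normal(size=(160, 10))
X_new = np.random.default_rng(1).normal(loc=0.5, size=(30, 10))

# Large gaps between these statistics indicate that the training set
# is not representative of the real-world samples
print("train mean:", X_train.mean(axis=0).round(2))
print("new   mean:", X_new.mean(axis=0).round(2))
print("train std: ", X_train.std(axis=0).round(2))
print("new   std: ", X_new.std(axis=0).round(2))
```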
