Is test data required to be transformed by training data statistics?
I am using a dataset from the literature to build an MLP and then classifying real-world samples (from a wetlab experiment) with it. The MLP's performance on the literature dataset is good enough. I am following the standard preprocessing procedure: after splitting, I first standardize the training data with `fit_transform` and then the test data with `transform`, so that only the training data statistics (mean and std) are used to standardize unseen data.
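For concreteness, here is a minimal sketch of what I do (random placeholder arrays stand in for my literature features `X` and labels `y`):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: 10 features, 6 classes (stand-ins for the literature dataset)
X = np.random.rand(200, 10)
y = np.random.randint(0, 6, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std on the test set
```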
However, when I standardize the experimental samples with the training mean and std and then predict them, the predictions are very bad. I then tried standardizing the experimental data on its own with `fit_transform`, and the predictions turned out great. This did not change even when we swapped the literature dataset for another one (at first we thought our experimental data was simply incompatible with the literature data): the model performed terribly when we standardized the experimental data with the training statistics, and very well when we used the experimental data's own mean and std.
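In code, the two approaches I compared look like this (again with placeholder arrays; `X_experiment` stands in for my wetlab samples):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholders: X_train is the literature training set, X_experiment the wetlab samples
X_train = np.random.rand(160, 10)
X_experiment = np.random.rand(30, 10)

train_scaler = StandardScaler().fit(X_train)

# Approach 1 (standard): standardize the experiment with the training statistics
X_exp_train_stats = train_scaler.transform(X_experiment)       # -> bad predictions for me

# Approach 2: standardize the experiment with its own statistics
X_exp_own_stats = StandardScaler().fit_transform(X_experiment) # -> good predictions for me
```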
So my question is: is it really necessary to standardize unseen data (whether test data or real-world data) with the training statistics? For example, splitting before standardization is strictly standard procedure; is using the training statistics on unseen data a convention of the same kind? I know that since the test and training data are drawn from the same distribution, using only the training mean and std should be fine. However, in my case the sample sizes are quite small, and now I am not sure I should use this method even for my test data, let alone for real-world data.
My other question is: why does the latter work? Why does the model classify the real-world data so well (across 6 classes!) when I transform that data with its own statistics?
I tried all the other normalization and scaling techniques in sklearn, but the result is the same. I also realize that, once the model is deployed, new data will arrive one sample at a time rather than as a matrix, so it would be impossible to use the real-world data's own statistics then, right?
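That deployment scenario is what the sketch below illustrates: a single sample carries no batch statistics, so the only consistent option I can see is to reuse a previously fitted scaler (`mlp` is a placeholder for the trained classifier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(160, 10)                # placeholder training features
train_scaler = StandardScaler().fit(X_train)

# A single incoming sample has no batch to estimate mean/std from,
# so the only available choice is the scaler fitted on the training data
new_sample = np.random.rand(1, 10)               # placeholder for one real-world sample
new_sample_scaled = train_scaler.transform(new_sample)
# prediction = mlp.predict(new_sample_scaled)    # `mlp` stands in for the trained MLP
```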
Please advise me on whether I can scale unseen data using its own statistics. Thank you!
Topic preprocessing feature-scaling classification python machine-learning
Category Data Science