Preprocessing for the final model to be deployed

Typically in an ML workflow, we import the data (X and y), split X and y into train, validation and test sets, preprocess the train, validation and test sets (scale, encode, impute NaN values, etc.), perform hyperparameter (HP) tuning, and after obtaining the best model with the best HPs, we fit the final model on the whole dataset (i.e. X and y).

Now the issue is that X and y themselves are not preprocessed; only the train, validation and test sets are. So when fitting the final model on X and y, we'll get an error because X and y haven't been encoded (or had the other preprocessing steps applied). How are we then supposed to train the final model on the whole dataset? Do we preprocess X and y before fitting the final model? And if so, won't that lead to data leakage / overfitting?
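
For concreteness, here's a rough sketch of the workflow I'm describing (assuming a scikit-learn setup; the toy data, preprocessing steps and model below are only placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real X and y
X = np.random.rand(200, 3)
y = np.random.randint(0, 2, size=200)

# Split into train / valid / test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Preprocess: fit on the training set, apply to valid and test
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_valid_s = scaler.transform(X_valid)
X_test_s = scaler.transform(X_test)

# ... HP tuning on (X_train_s, y_train) and (X_valid_s, y_valid) ...
best_model = LogisticRegression(C=1.0)  # best HPs found during tuning
best_model.fit(X_train_s, y_train)

# The problem: X itself was never scaled/encoded, so it's unclear how
# the final model is supposed to be fitted on the whole dataset here.
# best_model.fit(X, y)
```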

Any help will be much appreciated!

Topic data-leakage overfitting preprocessing python machine-learning

Category Data Science


In the new experiment the full data is the training set. There's no test set or validation set.

How are we then supposed to train the final model on the whole dataset? Do we preprocess X and y before fitting the final model?

Yes, and it's important to apply the exact same preprocessing method as was used on the original training set, now using the full data as the training set. Any difference would invalidate the performance measured in the first experiment.
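
For example, if the preprocessing steps and the model are wrapped together in a scikit-learn Pipeline (a minimal sketch; the steps and hyperparameters below are placeholders for whatever was selected in the first experiment), refitting on the full data re-estimates every preprocessing step on X, so nothing needs to be preprocessed separately beforehand:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# X, y stand for the full dataset from the question
X = np.random.rand(200, 3)
y = np.random.randint(0, 2, size=200)

final_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(C=1.0)),  # best HPs from the first experiment
])

# Exactly the same preprocessing method, now fitted on the full data
final_pipeline.fit(X, y)
```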

And if so won't it lead to data leakage/ overfitting?

The preprocessing steps are determined on the training set only, then the exact same steps can be applied to the test set (or validation set).
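
As a small illustration (a sketch using a StandardScaler as the preprocessing step): its parameters are estimated on the training set only and then reused unchanged on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(80, 3)
X_test = np.random.rand(20, 3)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std estimated on the training set only
X_test_scaled = scaler.transform(X_test)        # same mean/std reused: no information flows from the test set
```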

In this case there's no test set anymore, so there cannot be data leakage. Of course, the original test set can no longer be used as a test set, since it's now part of the training set.

There might be some overfitting, but it's not caused by using the full dataset. Of course, the first experiment should be used to check for overfitting before using the full data as the training set. Once the model is trained on the full data, there's no way to check for overfitting anymore (unless there is some additional unseen labelled data that can be used as a test set).
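
A sketch of that check in the first experiment (the data, pipeline and metric below are placeholders): compare the score on the training set with the score on the held-out test set; a large gap suggests overfitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 3)
y = np.random.randint(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)

# Compare train vs. test accuracy before the final refit on the full data
print("train accuracy:", pipe.score(X_train, y_train))
print("test accuracy :", pipe.score(X_test, y_test))
```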
