Preprocessing for the final model to be deployed

Typically in an ML workflow, we import the data (X and y), split X and y into train, validation and test sets, preprocess the train, validation and test sets (scale, encode, impute NaN values, etc.), perform hyperparameter (HP) tuning, and after obtaining the best model with the best HPs, we fit the final model on the whole dataset (i.e. X and y).

Now the issue is that X and y themselves are not preprocessed; only the train, validation and test sets are. So when fitting the final model on X and y, we'll get an error because X and y haven't been encoded (or had the other preprocessing steps applied). How are we then supposed to train the final model on the whole dataset? Do we preprocess X and y before fitting the final model? And if so, won't that lead to data leakage / overfitting?
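
For concreteness, here's a rough sketch of the workflow I'm describing (assuming a scikit-learn setup; the toy data, preprocessing steps and model below are only placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real X and y
X = np.random.rand(200, 3)
y = np.random.randint(0, 2, size=200)

# Split into train / valid / test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Preprocess: fit on the training set, apply to valid and test
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_valid_s = scaler.transform(X_valid)
X_test_s = scaler.transform(X_test)

# ... HP tuning on (X_train_s, y_train) and (X_valid_s, y_valid) ...
best_model = LogisticRegression(C=1.0)  # best HPs found during tuning
best_model.fit(X_train_s, y_train)

# The problem: X itself was never scaled/encoded, so it's unclear how
# the final model is supposed to be fitted on the whole dataset here.
# best_model.fit(X, y)
```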

Any help will be much appreciated!

Topic data-leakage overfitting preprocessing python machine-learning

Category Data Science


In the new experiment the full data is the training set. There's no test set or validation set.

How are we then supposed to train the final model on the whole dataset? Do we preprocess X and y before fitting the final model?

Yes, and it's important to apply the exact same preprocessing method as was used on the original training set, now using the full data as the training set. Any difference would invalidate the performance measured in the first experiment.
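
For example, if the preprocessing steps and the model are wrapped together in a scikit-learn Pipeline (a minimal sketch; the steps and hyperparameters below are placeholders for whatever was selected in the first experiment), refitting on the full data re-estimates every preprocessing step on X, so nothing needs to be preprocessed separately beforehand:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# X, y stand for the full dataset from the question
X = np.random.rand(200, 3)
y = np.random.randint(0, 2, size=200)

final_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(C=1.0)),  # best HPs from the first experiment
])

# Exactly the same preprocessing method, now fitted on the full data
final_pipeline.fit(X, y)
```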

And if so won't it lead to data leakage/ overfitting?

The preprocessing steps are determined on the training set only, then the exact same steps can be applied to the test set (or validation set).
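
As a small illustration (a sketch using a StandardScaler as the preprocessing step): its parameters are estimated on the training set only and then reused unchanged on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(80, 3)
X_test = np.random.rand(20, 3)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std estimated on the training set only
X_test_scaled = scaler.transform(X_test)        # same mean/std reused: no information flows from the test set
```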

In this case there's no test set anymore, so there cannot be data leakage. Of course, the original test set can no longer be used as a test set, since it's now part of the training set.

There might be some overfitting, but it's not caused by using the full dataset. Of course, the first experiment should be used to check for overfitting before using the full data as the training set. Once the model is trained on the full data, there's no way to check for overfitting anymore (unless there is some additional unseen labelled data that can be used as a test set).
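
A sketch of that check in the first experiment (the data, pipeline and metric below are placeholders): compare the score on the training set with the score on the held-out test set; a large gap suggests overfitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 3)
y = np.random.randint(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)

# Compare train vs. test accuracy before the final refit on the full data
print("train accuracy:", pipe.score(X_train, y_train))
print("test accuracy :", pipe.score(X_test, y_test))
```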
