Preprocessing for the final model to be deployed
Typically in an ML workflow, we import the data (X and y), split X and y into train, valid and test sets, preprocess the train, valid and test sets (scale, encode, impute NaN values, etc.), perform hyperparameter (HP) tuning, and, after finding the best model with the best hyperparameters, fit the final model to the whole dataset (i.e. X and y).
The issue is that X and y themselves are not preprocessed; only the train, valid and test splits are. So when fitting the final model on X and y, we will get an error, because X and y have not been encoded (or put through the other preprocessing steps). How are we then supposed to train the final model on the whole dataset? Do we preprocess X and y before fitting the final model? And if so, won't that lead to data leakage or overfitting?
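To make the question concrete, here is a minimal sketch of one arrangement I have seen suggested, assuming scikit-learn: the preprocessing steps live inside a Pipeline together with the model, so they are re-fit on whatever data the pipeline is fit on. The toy dataset, the column names (age, city), the LogisticRegression model and the C grid are all made up for illustration, and cross-validation on the training split stands in for a separate valid set.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical columns): one numeric feature with NaNs, one categorical.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "city": rng.choice(["a", "b", "c"], n),
})
X.loc[X.index % 25 == 0, "age"] = np.nan
y = (X["age"].fillna(40) + rng.normal(0, 5, n) > 40).astype(int)

# Hold out a test set; X and y stay raw (not preprocessed) throughout.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Preprocessing is part of the estimator, so it is fit only on the data
# passed to .fit() (each CV training fold, then the full training split).
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# HP tuning with cross-validation on the training split only.
search = GridSearchCV(model, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Performance estimate from data the search never saw.
test_score = search.score(X_test, y_test)

# Final model: same steps and best hyperparameters, re-fit on raw X and y.
final_model = clone(search.best_estimator_).fit(X, y)
```

With this layout the raw X and y can be passed straight to fit, since encoding, scaling and imputation happen inside the pipeline. As far as I understand, the test score was computed before the final refit, so the refit itself does not contaminate that estimate, but the final model then has no held-out score of its own.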
Any help will be much appreciated!
Topic data-leakage overfitting preprocessing python machine-learning
Category Data Science