Feature engineering before splitting

This is a sister post to the original closed post (here). Since the data transformation is fit after splitting, on the TRAINING data only, wouldn't such a transformation depend on how we subsample the data? We can get different transformation results when we pick different portions of the training data.

But I personally find it hard to convince myself: shouldn't a data transformation be as invariant and generalizable as possible across different subsamplings of the dataset?

Also, since the test portion of the data also represents real-world data, shouldn't we transform the data before splitting, so that we capture more of what the 'real world' data looks like and waste no data? I accept that during the model evaluation/training phase we fit the transformation on the training set only and reapply it to the test set at prediction time. But isn't it better if, for actual deployment, we fit the transformation on the whole dataset and train on all the data, instead of sticking to the 'post-splitting transformation' from the training phase?

Specifically, for example, if I fit LabelEncoder() from sklearn on the train set during evaluation, and then fit a new instance of LabelEncoder() on the full dataset for deployment, is that legitimate?
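To make the scenario concrete, here is a minimal sketch of the two fits I have in mind; the toy "city" column and the manual split are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["london", "paris", "berlin", "rome", "paris", "london"]})
train, test = df.iloc[:4], df.iloc[4:]

# Evaluation phase: fit on the training portion only and reuse the SAME
# fitted encoder on the test portion (this would raise an error if the
# test portion contained a label never seen during training).
enc_train = LabelEncoder()
train_codes = enc_train.fit_transform(train["city"])
test_codes = enc_train.transform(test["city"])

# Deployment phase: fit a NEW encoder on the full dataset.
enc_full = LabelEncoder()
full_codes = enc_full.fit_transform(df["city"])

# Caveat: the integer assigned to a given category can differ between the
# two encoders, so a model trained with enc_train cannot simply be fed
# codes produced by enc_full; the model would have to be retrained
# together with the new encoder.
```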

TIA.

Topic: transformation, features, feature-engineering, feature-selection

Category: Data Science


Yes, that's what most data scientists do in industry. They split the data into train and test sets to find the model and preprocessing that work best for them. Once they know which model and preprocessing work, they apply the same preprocessing and retrain the model with the best hyperparameters on the whole dataset. So you are thinking in the right direction, and this approach is widely used in industry.
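As a rough sketch of that workflow (the StandardScaler/LogisticRegression pipeline, the hyperparameter C=1.0, and the synthetic data are just placeholders for whatever preprocessing and model you end up selecting):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1) Model-selection phase: fit preprocessing + model on the training
#    split only and measure generalization on the held-out test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
pipe.fit(X_train, y_train)
print("held-out accuracy:", pipe.score(X_test, y_test))

# 2) Deployment phase: once the preprocessing and hyperparameters are
#    chosen, refit the same pipeline on ALL available data so no data is
#    wasted; the earlier test score remains your estimate of how this
#    final model will behave on unseen data.
final_model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
final_model.fit(X, y)
```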
