Feature engineering before splitting
This is a sister post to the original closed post (here). Since the data transformation step is fitted after the data split, on the TRAINING data only, I wonder: wouldn't such a transformation depend on how we subsample our data? We can get different transformation results when we pick different portions of the training data.
But I personally find it hard to convince myself otherwise: shouldn't a data transformation be as invariant and generalizable as possible across different subsamplings of the dataset?
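To make this concrete, here is a minimal sketch of what I mean, using StandardScaler as a stand-in for a generic transformation and made-up data (the numbers and names are just for illustration): fitting the same transformation on two different subsamples gives different fitted parameters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 1))  # made-up single feature

# Draw two different random subsamples of the same data.
idx_a = rng.choice(len(X), size=700, replace=False)
idx_b = rng.choice(len(X), size=700, replace=False)

# Fit the "same" transformation on each subsample.
scaler_a = StandardScaler().fit(X[idx_a])
scaler_b = StandardScaler().fit(X[idx_b])

# The fitted parameters differ, so the transformation depends on the subsample.
print(scaler_a.mean_, scaler_a.scale_)
print(scaler_b.mean_, scaler_b.scale_)
```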
Also, since the test portion of the data also represents real-world data, shouldn't we transform the data before splitting, so that we capture more of what the data looks like 'in the real world' and waste no data? I accept that during model evaluation/training we fit the transformation on the training set only and reapply it to the test set at prediction time. But isn't it better if, during actual deployment, we fit the transformation on the whole dataset and train on all the data, instead of sticking to the 'post-splitting transformation' from the model training phase?
Specifically, for example, if I apply LabelEncoder() from sklearn on the train set during evaluation, and then fit a new instance of LabelEncoder() on the full dataset for deployment, is this legitimate?
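A minimal sketch of the two approaches I am comparing (the 'city' column and the data are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy data; 'city' stands in for any categorical column I want to encode.
df = pd.DataFrame({"city": ["NY", "LA", "SF"] * 4,
                   "y": [0, 1] * 6})

train, test = train_test_split(df, test_size=0.25, random_state=0)

# Approach 1 (evaluation phase): fit the encoder on the training set only,
# then reuse the same fitted encoder on the test set.
enc_train = LabelEncoder().fit(train["city"])
train_encoded = enc_train.transform(train["city"])
test_encoded = enc_train.transform(test["city"])  # labels unseen in training would raise an error

# Approach 2 (what I am asking about for deployment): fit a fresh encoder
# on the full dataset, so every label is seen and no data is "wasted".
enc_full = LabelEncoder().fit(df["city"])
full_encoded = enc_full.transform(df["city"])
```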
TIA.
Topic: transformation, features, feature-engineering, feature-selection
Category: Data Science