Why is label encoding before the split considered data leakage?

I want to ask why label encoding before the train-test split is considered data leakage.

From my point of view, it is not. For example, if you encode "good" to 2, "neutral" to 1, and "bad" to 0, the mapping will be the same for both the train and test sets.

So, why do we have to split first and then do label encoding?

Topic: test, labelling, data-leakage, training, preprocessing

Category: Data Science


Imagine that after the split there is no "good" left in the training data. If you had done the encoding after the split, you would have no idea that a "good" can even occur; doing it before the split gives you exactly that knowledge. There you have your leakage.

Of course, as you mention in the comments, this is a problem. But that problem is simply the real world, where we do not have perfect information about the data our system will be fed in production. That is why we must evaluate the model on unseen data. If you encode before splitting, you are evaluating your model under a false premise of knowledge about that very unseen data.
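To make that concrete, here is a minimal sketch (the toy label lists and variable names are invented for illustration) using scikit-learn's LabelEncoder: fitting the encoder on all of the data before the split quietly bakes a test-only category into the pipeline, while fitting on the training part alone surfaces the unseen label exactly as it would in production.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["bad", "neutral", "bad", "neutral"]  # no "good" ended up in train
test_labels = ["good", "bad"]                        # "good" appears only in test

# Leaky version: fit on everything before the split.
leaky_enc = LabelEncoder().fit(train_labels + test_labels)
print(leaky_enc.classes_)    # ['bad' 'good' 'neutral'] -- "good" is already known

# Honest version: fit only on what the model is allowed to see.
honest_enc = LabelEncoder().fit(train_labels)
print(honest_enc.classes_)   # ['bad' 'neutral']
try:
    honest_enc.transform(test_labels)
except ValueError as err:
    # The unseen label surfaces here, just as it would with real production data.
    print("unseen label in the test data:", err)
```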


If you perform the encoding before the split, it leads to data leakage (train-test contamination): the encoder is fit on information from the test set as well, and that information flows into your model, which distorts the final results (good validation scores but poor performance in deployment).

Suppose the test data has a class that is not present in the train data. If you do the label encoding before the split, that class will already be known to the model, which is data leakage.

Once the categories in the train and validation data have been reconciled, you can perform fit_transform on the train data and then only transform the validation data, based on the encoding map learned from the train data, as sketched below.

Almost all feature engineering, such as standardisation and normalisation, should be done after the train-test split. Hope it helps.
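Here is a minimal sketch of that order of operations (the toy quality list is invented for illustration): split first, fit the encoder with fit_transform on the training part, and apply only transform to the validation part.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy categorical target, repeated enough that every class lands in the train part.
quality = ["neutral", "good", "bad"] * 10

# 1. Split first, so the validation part stays genuinely unseen.
train_y, valid_y = train_test_split(quality, test_size=0.3, random_state=0)

# 2. Fit the encoder on the training part only ...
enc = LabelEncoder()
train_encoded = enc.fit_transform(train_y)

# 3. ... then only transform the validation part with the mapping learned on train.
valid_encoded = enc.transform(valid_y)

print(enc.classes_)  # ['bad' 'good' 'neutral'] -> encoded as 0, 1, 2
```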
