Why leaky features are problematic

I want to know why leaky features are problematic in machine learning/data science. I'm reading a book that uses the Titanic dataset for illustration. It says that the column body (Body Identification Number) leaks data, since if we are building a model to predict whether a passenger died, knowing a priori that they had a body identification number would tell us they were already dead.

Logically, this makes sense. But suppose I have no prior knowledge of this dataset: I just keep the body feature and build a model, say a RandomForestClassifier. Even if I later discover it leaks data, so what? As long as my test set has this column, the model still runs and still gives me a prediction (a very good prediction, indeed).
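To make the "too good to be true" effect concrete, here is a minimal sketch using synthetic data as a stand-in for the Titanic case (the feature names and numbers are illustrative, not from the real dataset): every other column is pure noise, yet the model looks nearly perfect because of the one leaky column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Genuinely uninformative features (stand-ins for age, fare, etc.)
X_noise = rng.normal(size=(n, 5))
y = rng.integers(0, 2, size=n)  # 1 = died, 0 = survived

# Leaky feature: "has a body identification number" -- present only for the dead,
# so it is a copy of the target
body = y.copy()

X = np.column_stack([X_noise, body])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(acc)  # near-perfect accuracy, driven entirely by the leak
```

The five noise columns carry no signal at all, so any accuracy far above 50% here is the leak talking, not the model.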

UPDATE: I followed a linked thread, "Why does my model produce too good to be true output?", and had some more thoughts. Consider a very hypothetical situation: someone wants to mess with my predictions by deliberately attacking my data source and altering the data in the body column. In that case I can see that dropping the column before training makes sense, since tampering with it would completely fool the model. This got me thinking that data governance is equally important, perhaps even more important, than building a good ML model.

But I rarely see this practice in real-world projects. Usually the expectation is: given the data, train the best model. These datasets often have thousands of features and are already prepared by data engineers, so it's very likely that they accidentally include features that make validation scores unreliably high, even when one applies advanced validation techniques. This makes it hard for data scientists to properly train and validate models, because the only remedy is to comb through the data-generation process, which might be perceived as unproductive.
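Combing through the data-generation process is indeed the only reliable fix, but a cheap automated screen can at least flag candidates for review. One common heuristic (sketched below with made-up feature names; the threshold is an assumption, not a standard) is to fit a trivial model on each column alone and flag any column that by itself predicts the target almost perfectly:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def flag_suspect_features(X, y, names, threshold=0.95):
    """Flag columns that alone predict the target almost perfectly --
    a common signature of leakage."""
    suspects = []
    for j, name in enumerate(names):
        clf = DecisionTreeClassifier(max_depth=2, random_state=0)
        auc = cross_val_score(clf, X[:, [j]], y, cv=5, scoring="roc_auc").mean()
        if auc > threshold:
            suspects.append((name, round(auc, 3)))
    return suspects

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = np.column_stack([rng.normal(size=500), rng.normal(size=500), y])

suspects = flag_suspect_features(X, y, ["age", "fare", "body"])
print(suspects)  # only "body" is flagged
```

A flag here is not proof of leakage (some legitimate features are genuinely strong), but a standalone AUC near 1.0 is usually worth a conversation with whoever built the pipeline.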

Topic data-leakage

Category Data Science


Yes, the argument you are giving is valid on its face. But let's look at two scenarios and see why it does not help in the real world:

  1. If you already have a variable that perfectly predicts the target, why would you spend resources building an ML model at all? Remember that the objective of an ML model is not to score well on the test data but to perform well once deployed in production.

  2. In the real world, a leaky variable leads to bad predictions. The leaked information may simply not be available at prediction time (no body number exists before a passenger dies), so your model pipeline breaks or your model's performance drops drastically.
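Scenario 2 can be sketched directly (again with synthetic data; the column layout and constant placeholder are assumptions for illustration): the same model that looks excellent on a held-out set collapses to roughly chance once the leaky column is unavailable at prediction time.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)
X = np.column_stack([rng.normal(size=(n, 3)), y])  # last column is the leak

clf = RandomForestClassifier(random_state=0).fit(X[:600], y[:600])

# Held-out data where the leak is still present: looks excellent
score_holdout = clf.score(X[600:], y[600:])
print(score_holdout)

# "Production" data: body numbers don't exist at prediction time, so the
# column arrives as a constant placeholder and accuracy drops to ~chance
X_prod = X[600:].copy()
X_prod[:, -1] = 0
score_prod = clf.score(X_prod, y[600:])
print(score_prod)
```

The validation score was never measuring the model's real skill, only its access to the answer; removing that access exposes the gap.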
