Handling features highly correlated with the target variable

We work on a dataset with 1k features, where some features are temporal/non-linear aggregations of other features, e.g., one feature might be the salary s, while another is the mean salary over four months (s_4_m). We try to predict which employees are more likely to get a raise by applying a regression model to the salary. Still, our models are extremely biased toward features like s_4_m, which are highly correlated with the target variable.
Are there best practices for removing features that are highly correlated with the target variable? Arbitrarily defining a threshold seems wrong.
Any thoughts or experiences are very welcome.

Topic: correlation, databases

Category: Data Science


Removing the features that are highly correlated with the outcome makes no sense. Correlation means that you have predictive ability, and you want a model that makes accurate predictions. Therefore, keep the good predictors!


Technically yes: for every feature you can calculate its correlation with the target and remove the features above a particular threshold.
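For instance, a minimal sketch of this threshold-based filtering with pandas (the data, column names, and threshold are illustrative, loosely following the salary example in the question):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Toy data: "s_4_m" is a near-duplicate aggregate of "salary",
# "tenure" is unrelated noise.
salary = rng.normal(50_000, 10_000, n)
df = pd.DataFrame({
    "salary": salary,
    "s_4_m": salary + rng.normal(0, 500, n),
    "tenure": rng.integers(0, 20, n).astype(float),
})
target = pd.Series(salary * 0.001 + rng.normal(0, 2, n), name="raise_score")

# Absolute Pearson correlation of each feature with the target.
corr = df.corrwith(target).abs()

# Remove features above an (admittedly arbitrary) threshold.
threshold = 0.9
to_drop = corr[corr > threshold].index.tolist()
reduced = df.drop(columns=to_drop)
```

Here both salary features end up above the threshold while the unrelated one survives, which illustrates exactly the arbitrariness the question complains about: the cutoff, not the data, decides what gets dropped.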

But the more important question is: should you do this? There are two possibilities:

  • Either the feature won't be available in a production dataset, in particular because its calculation requires knowledge that is not available yet. For example, if the four-month salary feature takes into account future months, including the month when the employee actually gets the raise, then it's obvious that (1) this feature cannot be obtained before the employee gets the raise and (2) there's no point predicting a raise after it has been obtained. So in this case the feature should be removed, not because of the correlation but because the task is not meaningful with it. It's actually a case of data leakage: information which is not supposed to be available is.
  • Or the feature is normally available (e.g., it can be calculated) in a production dataset. In this case there's no good reason to remove it: why make the job harder for the model and accept lower performance when a perfectly valid feature could be used?
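The leakage scenario in the first bullet can be sketched numerically. In this toy setup (all numbers and column names are made up for illustration), an aggregate that includes the future month containing the raise correlates strongly with the label, while the same aggregate computed only from past months does not:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 500

# Monthly salaries; a raise shows up as a jump in the last month.
base = rng.normal(50_000, 1_000, n)
got_raise = rng.random(n) < 0.3
months = pd.DataFrame({f"m{i}": base + rng.normal(0, 200, n) for i in range(1, 4)})
months["m4"] = base + np.where(got_raise, 5_000, 0) + rng.normal(0, 200, n)

# Leaky aggregate: includes the future month in which the raise appears.
leaky_s_4_m = months[["m1", "m2", "m3", "m4"]].mean(axis=1)
# Safe aggregate: only months observed before the prediction is made.
safe_s_3_m = months[["m1", "m2", "m3"]].mean(axis=1)

y = got_raise.astype(float)
leaky_corr = abs(np.corrcoef(leaky_s_4_m, y)[0, 1])
safe_corr = abs(np.corrcoef(safe_s_3_m, y)[0, 1])
```

The suspiciously high correlation of the leaky aggregate is the symptom; the fix is to rebuild the feature from data available at prediction time, not to drop correlated features wholesale.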

So the main point is semantic: does the feature make sense for the intended task? A secondary point is whether there are too many features; this would be regular feature selection, and typically the features with the lowest correlation with the target are removed.
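That regular feature selection goes in the opposite direction: keep the k features most correlated with the target and discard the rest. A minimal pandas sketch (feature names and data are invented for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Five candidate features; the target depends mostly on f0 and f1.
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"f{i}" for i in range(5)])
y = 3 * X["f0"] + 2 * X["f1"] + rng.normal(scale=0.5, size=n)

# Rank features by absolute correlation with the target and keep the top k.
k = 2
scores = X.corrwith(y).abs().sort_values(ascending=False)
selected = scores.index[:k].tolist()
X_selected = X[selected]
```

The same ranking could be done with scikit-learn's SelectKBest, but the point stands either way: selection removes the *weakest* predictors, never the strongest.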
