Handling features that are highly correlated with the label
We work on a dataset with 1k features, where some features are temporal/non-linear aggregations of other features.
E.g., one feature might be the salary s, while another is the mean salary over four months (s_4_m).
We try to predict which employees are more likely to get a raise by applying a regression model on the salary. Still, our models are extremely biased toward features like s_4_m, which are highly correlated with the label.
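To make the problem concrete, here is a minimal sketch on synthetic data (not our real dataset). The salary ranges, the 2% monthly noise, and the unrelated `tenure` feature are assumptions made up for illustration. It shows how an aggregate that includes the label itself ends up almost perfectly correlated with it:

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

random.seed(0)
n_employees = 500

# Hypothetical data: four monthly salaries per employee, with small monthly noise.
base = [random.uniform(3000, 9000) for _ in range(n_employees)]
months = [[b * (1 + random.gauss(0, 0.02)) for _ in range(4)] for b in base]

s = [m[-1] for m in months]            # label: the most recent salary
s_4_m = [sum(m) / 4 for m in months]   # aggregate that contains the label itself
tenure = [random.uniform(0, 20) for _ in range(n_employees)]  # unrelated feature

# Correlation of each candidate feature with the label, strongest first.
features = {"s_4_m": s_4_m, "tenure": tenure}
corr_with_label = {name: pearson(col, s) for name, col in features.items()}
for name, r in sorted(corr_with_label.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {r:.3f}")
```

In this toy setup s_4_m dominates simply because the label is one of the four values being averaged, which is the kind of feature our models latch onto.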
Are there best practices for removing features that are highly correlated with the target variable? Arbitrarily defining a threshold seems wrong.
Any thoughts or experiences are very welcome.
Topic correlation databases
Category Data Science