Dropping highly correlated features

I am working on a classification project and ran into this situation after plotting a seaborn heatmap.

Column 0 is my target, which has 3 classes. To my knowledge, I should remove columns highly correlated with the target value. My questions are:

  1. Should I also remove features highly correlated with features other than the target? For example, I can see a very high correlation between columns 27 and 46. Should I remove one of them?

  2. This is the correlation heatmap for my whole dataset. Should I examine correlations and consider dropping columns this way, or should I do it only on the train set and leave the test set without dropping any columns? Logic dictates that these transformations should be done only on the train set, but I would prefer to be sure.

Topic: heatmap, correlation, classification, machine-learning

Category: Data Science


Q0:

To my knowledge I should remove column highly correlated with target value.

You should not remove a feature just because it is highly correlated with the target! A high correlation means it will likely be a very useful feature. You should, however, check that such features are "allowable" in your final use: that they will actually be available at prediction time on production data (i.e., they don't leak future information), etc.

Q1: Maybe. See Dave's answer linked in the comments. Interpretability is hard with highly correlated features, but pure prediction is not necessarily harmed. I think the underlying question is whether the difference between the two features is noise or signal. Try removing one, but also try regularization, and perhaps whitening.
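As a rough illustration of why regularization is an alternative to dropping one of a correlated pair, here is a minimal NumPy sketch (toy data, hypothetical coefficients; not from the original post). With two near-duplicate features, ordinary least squares sits close to a singular system, while a ridge penalty shares the weight stably across the pair:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)      # near-duplicate of x1 (like columns 27/46)
y = 2.0 * x1 + rng.normal(scale=0.3, size=n)  # the signal lives in x1 only
X = np.column_stack([x1, x2])

def ridge(X, y, lam):
    """Closed-form ridge regression: solve (X'X + lam*I) w = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w_ols = ridge(X, y, 0.0)   # lam=0: plain least squares on an ill-conditioned X'X
w_reg = ridge(X, y, 10.0)  # regularized: weights stay moderate and sum to ~2
```

The OLS weights on `x1` and `x2` can be large and of opposite sign (only their sum is pinned down by the data); the regularized weights split the coefficient sensibly between the two, which is the sense in which regularization can stand in for dropping a column.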

Q2: The analysis of which columns to drop should absolutely be done only on the training set. If you try the removal approach, drop the same columns from the test set before scoring. Your score is a measure of the goodness of the entire pipeline, column removal included.
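A minimal pandas sketch of that workflow, assuming toy column names `a`, `b`, `c` (stand-ins for the post's columns) and an arbitrary 0.95 threshold: the correlated pairs are identified on the training split only, and the resulting column list is then applied unchanged to the test split.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# toy data: "b" is nearly a copy of "a" (like the post's columns 27 and 46)
X = pd.DataFrame({"a": rng.normal(size=200)})
X["b"] = X["a"] + rng.normal(scale=0.05, size=200)
X["c"] = rng.normal(size=200)

X_train, X_test = X.iloc[:150], X.iloc[150:]

def correlated_columns(df, threshold=0.95):
    """For each pair with |corr| above threshold, mark the later column for dropping."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

to_drop = correlated_columns(X_train)       # decided on the TRAIN set only
X_train_r = X_train.drop(columns=to_drop)
X_test_r = X_test.drop(columns=to_drop)     # same columns removed from test
```

Keeping the first column of each pair is an arbitrary tie-break here; any fixed rule works, as long as the decision comes from the training data and the same columns are dropped everywhere.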
