Dropping highly correlated features

I am working on a classification project and ran into this situation after plotting a seaborn heatmap.

Column 0 is my target, which has 3 classes. To my knowledge, I should remove columns highly correlated with the target value. My questions are:

  1. Should I also remove features highly correlated with features other than the target? For example, I can see a very high correlation between columns 27 and 46. Should I remove one of them?

  2. This is the correlation heatmap for my whole dataset. Should I examine correlations and consider dropping columns this way, or should I do it only on the train set and leave the test set without dropping any columns? Logic dictates that these transformations should be done only on the train set, but I would prefer to be sure.

Topic: heatmap, correlation, classification, machine-learning

Category: Data Science


Q0:

To my knowledge I should remove column highly correlated with target value.

You should not remove a feature just because it is highly correlated with the target! A high correlation means it will likely be a very useful feature. You should, however, check that such features are "allowable" in your final use: that they will actually be available at prediction time on production data (i.e., they don't leak future information), etc.

Q1: Maybe. See Dave's answer linked in the comments. Interpretability is hard with highly correlated features, but pure prediction is not necessarily harmed. I think the underlying question is whether the difference between the two features is noise or signal. Try removing one, but also try regularization, and perhaps whitening.
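As a rough illustration of why regularization is an alternative to dropping one of a correlated pair, here is a minimal NumPy sketch (toy data, hypothetical coefficients; not from the original post). With two near-duplicate features, ordinary least squares sits close to a singular system, while a ridge penalty shares the weight stably across the pair:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)      # near-duplicate of x1 (like columns 27/46)
y = 2.0 * x1 + rng.normal(scale=0.3, size=n)  # the signal lives in x1 only
X = np.column_stack([x1, x2])

def ridge(X, y, lam):
    """Closed-form ridge regression: solve (X'X + lam*I) w = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w_ols = ridge(X, y, 0.0)   # lam=0: plain least squares on an ill-conditioned X'X
w_reg = ridge(X, y, 10.0)  # regularized: weights stay moderate and sum to ~2
```

The OLS weights on `x1` and `x2` can be large and of opposite sign (only their sum is pinned down by the data); the regularized weights split the coefficient sensibly between the two, which is the sense in which regularization can stand in for dropping a column.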

Q2: The analysis of which columns to drop should absolutely be done only on the training set. If you try the removal approach, drop the same columns from the test set before scoring. Your score is a measure of the goodness of the entire pipeline, column removal included.
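A minimal pandas sketch of that workflow, assuming toy column names `a`, `b`, `c` (stand-ins for the post's columns) and an arbitrary 0.95 threshold: the correlated pairs are identified on the training split only, and the resulting column list is then applied unchanged to the test split.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# toy data: "b" is nearly a copy of "a" (like the post's columns 27 and 46)
X = pd.DataFrame({"a": rng.normal(size=200)})
X["b"] = X["a"] + rng.normal(scale=0.05, size=200)
X["c"] = rng.normal(size=200)

X_train, X_test = X.iloc[:150], X.iloc[150:]

def correlated_columns(df, threshold=0.95):
    """For each pair with |corr| above threshold, mark the later column for dropping."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

to_drop = correlated_columns(X_train)       # decided on the TRAIN set only
X_train_r = X_train.drop(columns=to_drop)
X_test_r = X_test.drop(columns=to_drop)     # same columns removed from test
```

Keeping the first column of each pair is an arbitrary tie-break here; any fixed rule works, as long as the decision comes from the training data and the same columns are dropped everywhere.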
