Scaling and handling highly correlated features in tabular data for regression
I am working on a regression problem, trying to predict a target variable from seven predictor variables. I have a tabular dataset of 1400 rows. Before delving into the machine learning to build a predictor, I did an EDA (exploratory data analysis) and got the correlation coefficients (Pearson r) shown below. Note that I have included both the numerical predictor variables and the target variable.
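For context, this is roughly how I computed the correlations. The data below is synthetic and the column names `pv1`..`pv7` / `target` just mirror my setup; only the shape (1400 rows, seven predictors plus target) matches my real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for my dataset: 1400 rows, predictors pv1..pv7
# plus the target (real values and names differ).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(1400, 8)),
    columns=[f"pv{i}" for i in range(1, 8)] + ["target"],
)

# Pairwise Pearson r over all numeric columns; this matrix is what
# the heatmap visualises.
corr = df.corr(method="pearson")
print(corr.round(2))
```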
I am wondering about the following questions:
- We see that `pv3` is highly correlated with `pv6`, `pv7`, `pv4`, and `pv5`. Is it perhaps a good strategy to leave out `pv6`?
- Can we make any other obvious inferences from this heatmap?
- Another piece of domain information I have is that `pv7` is a renormalization of the target, yet its correlation with the target is only 0.42. Why is this the case? I have not scaled or normalised any of the data columns, and I do see that the scales of `pv7` and `target` are very different. Perhaps I should scale all the numerical columns before computing the correlations?
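One toy experiment I could run to probe the scaling idea (entirely synthetic numbers, not my data; the factor of 10 just simulates a column on a very different scale):

```python
import numpy as np

rng = np.random.default_rng(42)
target = rng.normal(size=1400)
noise = rng.normal(size=1400)

# Crude stand-in for a column that is a rescaled version of the target
# plus some noise, living on a much larger scale.
pv7 = 10.0 * target + noise

# Pearson r between target and pv7, before and after standardising pv7.
r_raw = np.corrcoef(target, pv7)[0, 1]
pv7_scaled = (pv7 - pv7.mean()) / pv7.std()
r_scaled = np.corrcoef(target, pv7_scaled)[0, 1]
print(r_raw, r_scaled)
```

If the two printed values come out essentially identical, that would suggest the 0.42 is not a scaling artifact, which makes the discrepancy with "renormalization of the target" even more puzzling to me.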
Looking forward to an interesting discussion here. Thanks in advance to this great community.
