Scaling and handling highly correlated features in tabular data for regression
I am working on a regression problem: predicting a target variable from seven predictor variables in a tabular dataset of 1400 rows. Before moving on to building a machine learning predictor, I did an EDA (exploratory data analysis) and computed the pairwise correlation coefficients (Pearson r) shown in the heatmap below. Note that I have included both the numerical predictor variables and the target variable.
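For reference, the heatmap was produced roughly along these lines (a minimal sketch with a synthetic frame standing in for my actual data; the column names `pv1`…`pv7` and `target` match my dataset, the values do not):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for my 1400-row dataset (actual values differ).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1400, 7)),
                  columns=[f"pv{i}" for i in range(1, 8)])
df["target"] = df["pv1"] + rng.normal(size=1400)

# Pairwise Pearson correlations over all numeric columns.
corr = df.corr(method="pearson")
print(corr.round(2))

# Plotted with: seaborn.heatmap(corr, annot=True, cmap="coolwarm")
```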
I am wondering about the following questions:
- We see that `pv3` is highly correlated with `pv6`, `pv7`, `pv4` and `pv5`. Is it perhaps a good strategy to leave out `pv6`?
- Can we make any other obvious inferences from this heatmap?
- Another piece of domain information I have is that `pv7` is a renormalization of the target, yet its correlation with the target is only 0.42. Why is this the case? I have not scaled or normalised any of the data columns, and I do see that the scales of `pv7` and `target` are very different. Should I perhaps scale all the numerical columns before computing the correlations?
Looking forward to an interesting discussion here. Thanks in advance to this great community.