Scaling and handling highly correlated features in tabular data for regression
I am working on a regression problem, trying to predict a target variable from seven predictor variables. I have a tabular dataset of 1400 rows. Before delving into the machine learning to build a predictor, I did an EDA (exploratory data analysis) and got the correlation coefficients (Pearson r) shown below. Note that I have included both the numerical predictor variables and the target variable.
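For context, this is roughly how I computed the correlations. The data below is synthetic and the column names `pv1`..`pv7` / `target` just mirror my setup; only the shape (1400 rows, seven predictors plus target) matches my real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for my dataset: 1400 rows, predictors pv1..pv7
# plus the target (real values and names differ).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(1400, 8)),
    columns=[f"pv{i}" for i in range(1, 8)] + ["target"],
)

# Pairwise Pearson r over all numeric columns; this matrix is what
# the heatmap visualises.
corr = df.corr(method="pearson")
print(corr.round(2))
```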
I am wondering about the following questions:
- We see that `pv3` is highly correlated with `pv6`, `pv7`, `pv4`, and `pv5`. Is it perhaps a good strategy to leave out `pv6`?
- Can we make any other obvious inferences from this heatmap?
- Another piece of domain information I have is that `pv7` is a renormalization of the target, yet its correlation with the target is only 0.42. Why is this the case? I have not scaled or normalised any of the data columns, and I do see that the scales of `pv7` and `target` are very different. Perhaps I should scale all the numerical columns before computing the correlations?
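One toy experiment I could run to probe the scaling idea (entirely synthetic numbers, not my data; the factor of 10 just simulates a column on a very different scale):

```python
import numpy as np

rng = np.random.default_rng(42)
target = rng.normal(size=1400)
noise = rng.normal(size=1400)

# Crude stand-in for a column that is a rescaled version of the target
# plus some noise, living on a much larger scale.
pv7 = 10.0 * target + noise

# Pearson r between target and pv7, before and after standardising pv7.
r_raw = np.corrcoef(target, pv7)[0, 1]
pv7_scaled = (pv7 - pv7.mean()) / pv7.std()
r_scaled = np.corrcoef(target, pv7_scaled)[0, 1]
print(r_raw, r_scaled)
```

If the two printed values come out essentially identical, that would suggest the 0.42 is not a scaling artifact, which makes the discrepancy with "renormalization of the target" even more puzzling to me.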
Looking forward to an interesting discussion here. Thanks in advance to this great community.
