Correlation with target variable for regression problem

Given the following dataframe

   age       job  salary
0    1    Doctor     100
1    2  Engineer     200
2    3    Lawyer     300    
...

with age as a numeric feature and job as a categorical feature, I want to test how strongly each one is related to salary, so that I can select the features (age and/or job) for predicting salary (a regression problem). Can I use the following APIs from sklearn (or another API)

sklearn.feature_selection.f_regression
sklearn.feature_selection.mutual_info_regression

to test it? If yes, what's the right method and syntax to test the correlation?

The following code creates the dataset:

import pandas as pd
df = pd.DataFrame({'age': [1, 2, 3], 'job': ['Doctor', 'Engineer', 'Lawyer'], 'salary': [100, 200, 300]})



You can use mutual_info_regression, which handles both continuous and discrete features. First encode job as integers, either manually or with LabelEncoder. Then compute the MI scores, passing the column index of job (here, 1) in discrete_features:

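from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import LabelEncoder
X, y = df.drop(columns='salary'), df['salary']
X['job'] = LabelEncoder().fit_transform(X['job'])  # encode job as integer codes
mutual_info_regression(X, y, discrete_features=[1])  # column 1 (job) is discrete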
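For a fuller picture, here is a minimal end-to-end sketch covering both functions from the question. It rests on several assumptions that are not in the original post: the data is synthetic (the three-row example is too small for the nearest-neighbour MI estimator at its default n_neighbors=3), the salary formula is made up for illustration, and job is one-hot encoded with pd.get_dummies before f_regression, because the F-test measures linear dependence and integer codes would impose an artificial ordering on the jobs.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression, mutual_info_regression
from sklearn.preprocessing import LabelEncoder

# Synthetic data: an assumption for illustration, not the asker's dataset.
rng = np.random.default_rng(0)
n = 200
jobs = rng.choice(["Doctor", "Engineer", "Lawyer"], size=n)
age = rng.integers(25, 60, size=n)
base = pd.Series(jobs).map({"Doctor": 300, "Engineer": 200, "Lawyer": 250}).to_numpy()
salary = base + 2.0 * age + rng.normal(0, 10, size=n)  # made-up relationship
df = pd.DataFrame({"age": age, "job": jobs, "salary": salary})

X, y = df.drop(columns="salary"), df["salary"]

# Mutual information: integer-code job and flag column 1 as discrete.
X_mi = X.copy()
X_mi["job"] = LabelEncoder().fit_transform(X_mi["job"])
mi = pd.Series(
    mutual_info_regression(X_mi, y, discrete_features=[1], random_state=0),
    index=X_mi.columns,
)

# F-test: one-hot encode job so that no ordering is implied between jobs.
X_f = pd.get_dummies(X, columns=["job"], dtype=float)
f_stat, p_values = f_regression(X_f, y)
f_table = pd.DataFrame({"F": f_stat, "p_value": p_values}, index=X_f.columns)

print(mi)       # one MI score per feature; higher means more dependence
print(f_table)  # one F statistic and p-value per (dummy) column
```

MI scores are non-negative and have no fixed upper bound, so they are best used to rank features against each other rather than read as a correlation coefficient. The F-test gives one row per dummy column, so job has to be judged from its dummies taken together.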
