Pearson correlation and the ANOVA F-test only measure the linear relationship between x and y.
Mutual information (MI) regression is an entropy-based method that can detect even non-linear relationships.
Let's look at two pairs of data, (x, y_1) and (x, y_2).
## 0. Data setup

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y_1 = 3.39 * x                                      # perfectly linear in x
y_2 = np.array([1, 1, -1, -1, 0, 0, -2, -2, 2, 2])  # non-linear in x, but structured
df_1 = pd.DataFrame([x, y_1]).T
df_2 = pd.DataFrame([x, y_2]).T
```
## 1. Pearson correlation

```python
df_1.corr()  # correlation of 1.0: the linear relation is fully captured
df_2.corr()  # correlation near 0 (~0.1): the non-linear relation is missed
```
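As a quick cross-check (a minimal, self-contained sketch assuming SciPy is available; `scipy.stats.pearsonr` is not part of the walkthrough above), the same two correlations can be computed directly:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y_1 = 3.39 * x                                      # linear in x
y_2 = np.array([1, 1, -1, -1, 0, 0, -2, -2, 2, 2])  # non-linear in x

r1, _ = pearsonr(x, y_1)  # perfect linear relation -> r of 1.0
r2, _ = pearsonr(x, y_2)  # strong but non-linear relation -> r near 0
print(r1, r2)
```

The near-zero r2 is the key point: Pearson correlation is blind to the structure in y_2.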
## 2. F-test

```python
from sklearn.feature_selection import f_regression, mutual_info_regression

f_test, _ = f_regression(x.reshape(10, 1), y_1)
f_test  # very large F statistic: the linear relation is detected
f_test, _ = f_regression(x.reshape(10, 1), y_2)
f_test  # small F statistic: the non-linear relation is missed
```
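For a single feature, the F statistic from `f_regression` is just a transformation of the Pearson correlation, F = r^2 / (1 - r^2) * (n - 2), which is why it shares the same linear-only blind spot. A minimal sketch verifying this on (x, y_2):

```python
import numpy as np
from sklearn.feature_selection import f_regression

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y_2 = np.array([1, 1, -1, -1, 0, 0, -2, -2, 2, 2], dtype=float)

f_test, _ = f_regression(x.reshape(10, 1), y_2)

# Recompute the F statistic by hand from the Pearson correlation.
r = np.corrcoef(x, y_2)[0, 1]
f_manual = r**2 / (1 - r**2) * (len(x) - 2)
print(f_test[0], f_manual)  # the two values agree
```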
## 3. Mutual information regression

```python
mutual_info_regression(x.reshape(10, 1), y_1, n_neighbors=1, random_state=123)  # non-zero MI for the linear relation
mutual_info_regression(x.reshape(10, 1), y_2, n_neighbors=1, random_state=123)  # non-zero MI for the non-linear relation too
```
## 4. Tree-based feature importance

Tree-based feature importance is also based on impurity (entropy-style) reduction, and it can be more reliable, since the KNN-based MI estimator depends on a well-chosen n_neighbors (k) value.
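A minimal sketch of the tree-based alternative, using scikit-learn's `RandomForestRegressor` with a hypothetical extra noise feature added for contrast (note that for regression, splits reduce variance rather than entropy, but the impurity-reduction idea is the same):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(123)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
noise = rng.normal(size=10)  # unrelated feature, for comparison
y_2 = np.array([1, 1, -1, -1, 0, 0, -2, -2, 2, 2], dtype=float)

X = np.column_stack([x, noise])
forest = RandomForestRegressor(n_estimators=200, random_state=123).fit(X, y_2)

# Importances sum to 1; x should get the larger share despite the
# non-linear relation, while the noise feature gets little.
print(forest.feature_importances_)
```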
We can observe that mutual information captures the relation of x not only with y_1, where it is simply linear, but also with y_2, where it is non-linear yet strong. Use MI if you only need a filter-based method; otherwise, tree-based feature importance is the better choice in most cases.
Check this link - F_test Vs MI