Chi-Squared test: ok for selecting significant features?

Question

V_sqrt

February 8, 2021, 16:48

I have a question about a contingency table and its results. I was analysing whether a name starting with a symbol is a useful feature, and got the following values:

Label           0.0   1.0
with_symb      1584   241
without_symb     16    14

This gives a p-value below 0.05, which lets me conclude that the two variables are associated. My question is whether this result from the chi-squared test is enough to justify including the feature in the model. I am currently selecting features one at a time based on the chi-squared test. Maybe there is a better way to select the most appropriate and significant features for the model. Any suggestions would be great.
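For reference, a test like this can be reproduced with SciPy. This is only a minimal sketch that assumes the four counts above are the complete table; the variable names are illustrative. It also prints the expected counts, because one of them falls below the usual rule-of-thumb value of 5, so Fisher's exact test is run as a cross-check.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Rows: with_symb, without_symb. Columns: label 0.0, label 1.0.
table = np.array([[1584, 241],
                  [16, 14]])

# For a 2x2 table, SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(table)

# Cross-check with an exact test, since one expected count is small.
odds_ratio, fisher_p = fisher_exact(table)

print("chi2:", chi2, "p-value:", p_value, "dof:", dof)
print("expected counts:")
print(expected)
print("Fisher exact p-value:", fisher_p)
```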

Topic chi-square-test correlation classification feature-selection

Category Data Science

Cryo · Accepted Answer · February 8, 2021, 16:48

I will raise several issues that could arise if you select features based on chi-squared tests:

Running the chi-squared test repeatedly can produce spurious results unless you correct for the number of tests you run.
You can end up including features that are correlated with each other, e.g. A is correlated with B, and both are correlated with the label. I am not sure, but I think this can lead to a model that performs worse with more features.
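To illustrate the first point, one simple way to correct for the number of tests is a Bonferroni correction, where each p-value is compared against alpha divided by the number of tests. This is a sketch of one possible correction, not the only one; the feature names and p-values below are made up for illustration.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Return, per feature, whether it passes the Bonferroni-corrected threshold."""
    n_tests = len(p_values)
    threshold = alpha / n_tests  # stricter cut-off the more tests are run
    return {name: p <= threshold for name, p in p_values.items()}

# Hypothetical p-values from four separate chi-squared tests.
p_values = {
    "starts_with_symbol": 1e-6,
    "feature_b": 0.03,
    "feature_c": 0.20,
    "feature_d": 0.004,
}

selected = bonferroni_select(p_values)
for name, keep in selected.items():
    print(name, "keep" if keep else "drop")
```

Note that `feature_b` would pass an uncorrected 0.05 threshold but not the corrected one (0.05 / 4 = 0.0125).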

I would try starting with all the features and removing the ones that are linearly correlated with each other. But this is just a suggestion.
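A rough sketch of that idea, assuming numeric, non-constant feature columns and an arbitrary cut-off of 0.9 on the absolute Pearson correlation (both the helper name and the threshold are my own choices, and the data is synthetic):

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Greedily keep a feature only if it is not strongly correlated
    with any feature that has already been kept."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # columns are features
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic example: b is almost a linear function of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.01, size=200)
c = rng.normal(size=200)
X = np.column_stack([a, b, c])

kept_names = drop_correlated(X, ["a", "b", "c"])
print(kept_names)
```

Which feature of a correlated pair survives depends on the column order, so it may be worth ordering the columns by how much you trust each feature.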

Also, mutual information can be used to estimate how well any given feature describes the label.
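For a categorical feature and a categorical label, mutual information can be computed directly from the contingency table. Below is a minimal sketch using the counts from the question; the function name is illustrative and the result is in nats. Zero means the feature tells you nothing about the label.

```python
import math

def mutual_information(table):
    """Mutual information (in nats) between the row and column variables
    of a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count == 0:
                continue  # empty cells contribute nothing
            p_joint = count / n
            p_indep = (row_totals[i] / n) * (col_totals[j] / n)
            mi += p_joint * math.log(p_joint / p_indep)
    return mi

# Rows: with_symb, without_symb. Columns: label 0.0, label 1.0.
table = [[1584, 241],
         [16, 14]]

mi = mutual_information(table)
print("mutual information (nats):", mi)
```

For a whole feature matrix, scikit-learn's `mutual_info_classif` in `sklearn.feature_selection` estimates the same quantity per feature, which makes it easier to rank features against each other.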
