Chi-Squared test: ok for selecting significant features?

I have a question about a contingency table and its results. I ran this analysis on whether a name starts with a symbol as a possible feature, and got the following counts:

Label          0.0   1.0
with_symb     1584   241
without_symb    16    14

The p-value lets me conclude that the two variables are associated (it is less than 0.05). Is this a sound result under the chi-squared test, and can I therefore include the feature in the model? I am selecting features one at a time based on the chi-squared test. Maybe there is a better way to select the most appropriate and significant features for the model. Any suggestions on this would be great.
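For reference, the test on the table above can be reproduced with a minimal sketch in plain Python. The helper name `chi2_2x2` and the `yates` flag are illustrative choices, not something from the original analysis; the p-value formula used here is valid only for a 2x2 table (1 degree of freedom).

```python
import math

def chi2_2x2(table, yates=False):
    """Pearson chi-squared test of independence for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    # Expected counts under independence: row total * column total / n
    expected = [[r * k / n for k in col_totals] for r in row_totals]
    stat = 0.0
    for obs_row, exp_row in zip(table, expected):
        for obs, exp in zip(obs_row, exp_row):
            diff = abs(obs - exp)
            if yates:
                # Optional continuity correction for 2x2 tables
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / exp
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected

# Counts from the question: rows = with_symb / without_symb, columns = label 0.0 / 1.0
table = [[1584, 241], [16, 14]]
stat, p_value, expected = chi2_2x2(table)
min_expected = min(min(row) for row in expected)
print(f"chi2 = {stat:.2f}, p = {p_value:.2e}, smallest expected count = {min_expected:.2f}")
```

One thing the sketch makes visible: the smallest expected count (without_symb, label 1.0) is 30 * 255 / 1855, about 4.1, which is below the common rule of thumb of 5. With a cell that small, an exact test such as `scipy.stats.fisher_exact` may be a safer check than the chi-squared approximation. Note also that `scipy.stats.chi2_contingency` applies the continuity correction by default on 2x2 tables, so its statistic corresponds to `yates=True` here.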

Topic chi-square-test correlation classification feature-selection

Category Data Science


I will raise several issues that can arise if you select features based on chi-squared tests:

  1. Repeated use of the chi-squared test can produce spurious significant results unless you correct for the number of tests you run (the multiple comparisons problem).

  2. You can end up including features that are correlated with each other, i.e. A is correlated with B, and both are correlated with the label. I am not sure, but I think this can lead to a model that performs worse with more features.
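As a rough illustration of the first point, one common correction is the Holm-Bonferroni procedure, sketched below in plain Python. The feature names and p-values are hypothetical placeholders, not results from the question, and Holm is only one of several possible corrections.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p-value (k = rank, 0-based) by (m - k), capped at 1
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease as the raw p-values increase
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical p-values from per-feature chi-squared tests (made up for illustration)
raw = {"starts_with_symbol": 1e-6, "feature_b": 0.03, "feature_c": 0.04, "feature_d": 0.20}
adjusted = dict(zip(raw, holm_adjust(list(raw.values()))))
selected = [name for name, p in adjusted.items() if p < 0.05]
print(adjusted)
print(selected)
```

In this made-up example, `feature_b` and `feature_c` pass the 0.05 threshold on their raw p-values but not after adjustment, which is exactly the kind of spurious selection the correction is meant to prevent.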

I would try starting with all the features and removing the ones that are linearly correlated with each other. But this is just a suggestion.
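A minimal sketch of that suggestion, assuming a greedy filter on the absolute Pearson correlation: a feature is kept only if it is not too correlated with any feature already kept. The 0.9 threshold, the function names, and the toy columns are all assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; assumes neither column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.9):
    """Greedy filter: keep a feature only if |r| with every kept feature is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy columns (not from the question)
features = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],  # almost a multiple of "a"
    "c": [1, 0, 1, 0, 0, 1],    # roughly unrelated to "a"
}
print(drop_correlated(features))
```

The result depends on the order in which features are visited, so in practice it may make sense to order them first, for example by how strongly each one is associated with the label.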

Also, mutual information can be used to estimate how much any given feature tells you about the label.
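For a categorical feature, mutual information can be computed directly from the same contingency table used for the chi-squared test. The sketch below does this in plain Python for the counts in the question; the function name is illustrative. For mixed or continuous features, scikit-learn's `mutual_info_classif` is an estimator worth looking at.

```python
import math

def mutual_information(table):
    """Mutual information (in bits) between the row and column variables of a count table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count == 0:
                continue  # a zero cell contributes nothing (0 * log 0 is taken as 0)
            p_xy = count / n
            # p_xy / (p_x * p_y) simplifies to count * n / (row_total * col_total)
            mi += p_xy * math.log2(count * n / (row_totals[i] * col_totals[j]))
    return mi

# Counts from the question: rows = with_symb / without_symb, columns = label 0.0 / 1.0
table = [[1584, 241], [16, 14]]
mi = mutual_information(table)
print(f"mutual information = {mi:.4f} bits")
```

Unlike a p-value, this number measures the size of the association rather than its statistical significance, so a feature that is present in only 30 of 1855 rows can be clearly significant and still carry little information about the label overall.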
