Statistical significance test of features between binary class labels

I have 667 features and a binary class label. Before applying a classification model (e.g. Naive Bayes or SVM), I want to find the features whose values differ significantly between the two classes, so that the model learns better.

As I understand it, if a feature's values overlap heavily between the two classes, that feature will contribute to poor classification.

Hence, I ran a two-sample t-test on each feature to measure how significantly it differs between the two classes.

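from scipy import stats

# df is a pandas DataFrame with a binary 'label' column;
# listofname is the list of feature column names.
failure = [1]

# Split the data into failure and non-failure groups so that each
# feature can be compared between the two groups.
df_failure = df.loc[df['label'].isin(failure)]
df_nonfailure = df.loc[~df['label'].isin(failure)]

# Welch's t-test (equal_var=False) for every feature; each entry of p
# holds the test statistic and the p-value for one feature.
p = []
for x in listofname:
    p.append(stats.ttest_ind(df_failure[x], df_nonfailure[x], equal_var=False))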

My question is: is this a good feature selection approach, as an alternative to recursive feature elimination or other wrapper methods? Are there any similar methods?



I would be hesitant to use a statistical test for feature importance.

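You do not mention the sample size, and p-values are driven by it. With a very small sample, a real difference between the classes may fail to reach significance; with a very large sample, even a negligible difference will be flagged as significant. Either extreme could distort the selection.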
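To illustrate the sample-size point, here is a minimal sketch that runs the same Welch t-test on identical summary statistics and changes only the number of samples per group. The helper name `welch_p_value`, the 0.05 standard deviation difference and the group sizes are made up for illustration; `scipy.stats.ttest_ind_from_stats` is used so that no random data is needed.

```python
from scipy import stats


def welch_p_value(mean_diff, std, n_per_group):
    """Two-sided Welch t-test p-value computed from summary statistics.

    Both groups share the same standard deviation and size; only the
    means differ, by `mean_diff`.
    """
    result = stats.ttest_ind_from_stats(
        mean1=mean_diff, std1=std, nobs1=n_per_group,
        mean2=0.0, std2=std, nobs2=n_per_group,
        equal_var=False,  # Welch's test, as in the question
    )
    return result.pvalue


# Same tiny difference between the classes, three different sample sizes.
p_values = {
    n: welch_p_value(mean_diff=0.05, std=1.0, n_per_group=n)
    for n in (20, 2_000, 200_000)
}

for n, p_val in p_values.items():
    print(f"n per group = {n:>7}: p = {p_val:.3g}")
```

The difference between the classes is the same in all three runs, yet only the largest sample yields a small p-value, so a fixed significance threshold would keep or drop the feature depending on how much data you happen to have.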

Also, feature selection should be part of the cross-validation strategy. If you perform feature selection on all of the data and then cross-validate, the validation data in each fold has already been used to choose the features. This can bias the performance estimate, making it look better than it really is.
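One way to keep the selection inside each fold, sketched here with scikit-learn, is to put the selector and the classifier in a single `Pipeline` and cross-validate the pipeline as a whole. Note the assumptions: the data is synthetic (`make_classification` stands in for your DataFrame), `k=20` is an arbitrary choice, and `f_classif` is a swapped-in univariate test (the ANOVA F-test, which for two classes corresponds to a pooled-variance t-test, not the Welch test from the question).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data: 667 features, only a few informative.
X, y = make_classification(
    n_samples=300,
    n_features=667,
    n_informative=10,
    n_redundant=0,
    random_state=0,
)

# The selector is a pipeline step, so it is refit on the training part of
# every fold and never sees that fold's validation samples.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=20)),  # k is an assumption
    ("clf", GaussianNB()),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print("mean accuracy over 5 folds:", scores.mean())
```

Because the univariate test is recomputed per fold, the selected features can differ between folds; the cross-validated score then reflects the whole procedure (selection plus classifier) rather than one fixed feature set.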
