Statistical significance test between features and a binary class label
I have 667 features and I want to find the ones that separate the two classes of a binary label significantly before I apply a classification model (e.g. Naive Bayes or SVM), so that the model learns better.
As far as I know, if a feature's values overlap between the two classes, this leads to poor classification.
Hence, I ran a two-sample t-test on each feature to measure how significantly it differs between the two classes.
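from scipy import stats

# df is the full DataFrame; listofname holds the feature column names.
p = []
failure = [1]

# Separate the failure and non-failure rows into two DataFrames so that
# each feature can be compared between the two groups.
df_failure = df.loc[df['label'].isin(failure)]
df_nonfailure = df.loc[~df['label'].isin(failure)]

# equal_var=False gives Welch's t-test (no equal-variance assumption).
# Each stored result holds the t statistic and the p-value.
for x in listofname:
    p.append(stats.ttest_ind(df_failure[x], df_nonfailure[x], equal_var=False))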
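To make the approach easier to reproduce, here is a minimal self-contained sketch of the same per-feature Welch t-test on synthetic data. The helper name `welch_pvalues`, the column names `f0`, `f1`, `f2`, and the generated data are assumptions made up for illustration; they are not from my real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats


def welch_pvalues(df, feature_names, label_col="label", positive=1):
    """Return one Welch t-test p-value per feature, sorted ascending."""
    is_pos = df[label_col] == positive
    pos = df.loc[is_pos, feature_names].to_numpy()
    neg = df.loc[~is_pos, feature_names].to_numpy()
    # ttest_ind runs column-wise along axis=0, so every feature is
    # tested in a single call; equal_var=False selects Welch's test.
    result = stats.ttest_ind(pos, neg, axis=0, equal_var=False)
    return pd.Series(result.pvalue, index=feature_names).sort_values()


# Synthetic data (illustrative only): just "f0" differs between classes.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
data = pd.DataFrame(rng.normal(size=(n, 3)), columns=["f0", "f1", "f2"])
data["f0"] += 2.0 * labels  # shift the mean of f0 for the failure class
data["label"] = labels

pvals = welch_pvalues(data, ["f0", "f1", "f2"])
print(pvals)
```

Because the result is sorted, the features with the smallest p-values come first, so a threshold or a top-k cut can be applied directly to the returned Series.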
My question is: is this a good feature selection approach, besides recursive feature elimination and other wrapper methods? Are there any similar methods out there?