If a categorical feature only occurs a few times in a data set, should I drop it?

I have a data set of mostly categorical variables. When I one-hot encoded them some of the features occur less than 3% of the time.

For instance, the Tech-support feature occurs only 928 times in a data set of 32,561 samples, i.e. about 2.9% of the time.

Is there a general cutoff point for when I should scrap these variables? I'm cleaning up this data set for binary logistic regression and an SVM.

Thank you!

Topic features one-hot-encoding logistic-regression svm

Category Data Science


Since you mentioned that one-hot encoding gives you 90 features, I would advise against dropping the rarely occurring values. 90 features is a small number for a machine-learning algorithm. Only if your dimensionality grew to, say, 400-500 features or more would I consider dropping the less frequent values.

Also, first consider feature engineering to reduce the number of features in your data. This is the most effective and least invasive way to reduce dimensionality.

After that consider feature selection and/or PCA.

Only then consider dropping the least frequent values, since dropping them discards potentially valuable information.


It depends on your use case. If your goal is something like "prevent people from dying" or "identify online customers", then any category with even marginal predictive power should be analysed. If instead your aim is something broad like improving general wellness or brand awareness, then the category can be dropped. In short: if your use case targets rare events, rare categories will matter; if it is general, they will not.


Indeed, it is often a good idea to remove boolean features which are very rare, but the problem is that choosing a threshold by intuition is not necessarily optimal. Whenever possible the optimal value should be determined experimentally, and that should typically be feasible for efficient methods such as logistic regression or SVM. The idea is simply to consider a range of candidate thresholds and run a grid search on a subset of the training set. The threshold is treated exactly like a hyperparameter of the learning method.
