How should I choose n_features in FeatureHasher in sklearn?

How should I choose n_features for FeatureHasher in scikit-learn? Assume I have a categorical feature with 1000 categories that I would like to hash.

Topic hashing-trick data scikit-learn machine-learning

Category Data Science


As mentioned in its documentation, it is advisable to use a power of 2 for n_features; otherwise, the features will not be mapped evenly to the columns. The documentation also suggests leaving n_features at its default of 2 ** 20 in a real-world setting, and selecting a lower value such as 2 ** 18 when memory or the size of the downstream model is an issue.
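
A minimal sketch of this usage, assuming a toy categorical feature with made-up names (dict input, n_features a power of 2):

```python
from sklearn.feature_extraction import FeatureHasher

# Toy samples: one categorical feature, dict input (names are hypothetical)
samples = [{"category": "cat_3"}, {"category": "cat_17"}, {"category": "cat_999"}]

# n_features chosen as a power of 2, well below the 2 ** 20 default
hasher = FeatureHasher(n_features=2 ** 10, input_type="dict")
X = hasher.transform(samples)
print(X.shape)  # (3, 1024), a sparse matrix
```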

Keep in mind that, as also stated in the documentation, a small number of features is likely to cause hash collisions, while a large number will cause larger coefficient dimensions in linear learners.
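
To see that trade-off in numbers, here is a small sketch (the category names are hypothetical) that counts how many of 1000 distinct categories end up sharing a column for a small and a large table size:

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# 1000 distinct category values (hypothetical names), one per sample
categories = [[f"cat_{i}"] for i in range(1000)]

for n in (2 ** 6, 2 ** 12):
    hasher = FeatureHasher(n_features=n, input_type="string", alternate_sign=False)
    X = hasher.transform(categories)
    # With alternate_sign=False every hit is +1, so a column is occupied
    # iff at least one category hashed into it.
    occupied = int((np.asarray(X.sum(axis=0)) > 0).sum())
    print(f"n_features={n}: {1000 - occupied} categories lost to collisions")
```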

My overall suggestion is to use a power of 2 for n_features. If you have a small number of features, you can treat the number of features (e.g., 30, as you mentioned in the comment below) as a hyperparameter and find the optimal value using cross-validation: test different powers of 2 such as 2, 4, 8, 16, and so on, depending on the size of your data, and keep whichever scores best (see the sketch after this paragraph). That is the best solution. Note, however, that the hashing trick is efficient when the number of input features is very large; with only 1000 categories, I would go with the other encoding methods available in the technical literature (e.g., one-hot encoding).
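
A rough sketch of that cross-validation approach, assuming a synthetic binary-classification task (the data, pipeline step names, and candidate values below are all made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data: a single categorical feature with 1000 possible values
rng = np.random.default_rng(0)
codes = rng.integers(0, 1000, size=5000)
X = [{"category": f"cat_{c}"} for c in codes]
y = (codes % 2).astype(int)  # toy label that depends on the category

pipe = Pipeline([
    ("hash", FeatureHasher(input_type="dict")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Treat n_features as a hyperparameter and test several powers of 2
grid = GridSearchCV(
    pipe,
    param_grid={"hash__n_features": [2 ** 8, 2 ** 10, 2 ** 12]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```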
