How to group multiple categories of a categorical variable before feeding the data to a machine learning algorithm?
I have a labelled dataset to which I wish to fit a classification model (say, a Decision Tree). One of the categorical variables (say, STATE) in the data has a lot of categories (say, 100 different STATES).
Using one-hot encoding on such a categorical variable would create many very sparse features, which can degrade the performance of the model. There are other encoding methods, of course, such as binary encoding, but they introduce bias in non-trivial ways.
Some articles suggest grouping categories (here, different STATES) by their weight of evidence (WoE), i.e., categories with similar WoE are treated as a single category. But they also say that this type of grouping makes more sense with a Logistic Regression model than with a Decision Tree.
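To make the WoE idea concrete, here is a minimal sketch of what I understand such grouping to look like for a binary target. The convention used (log of the category's share of positives over its share of negatives), the additive `smoothing` term, the equal-sized binning, and the function names `woe_by_category` and `group_by_woe` are my own assumptions for illustration, not something the articles prescribe.

```python
import math
from collections import Counter


def woe_by_category(categories, labels, smoothing=0.5):
    """Return {category: WoE} for a binary target (labels are 0 or 1).

    Assumed convention: WoE = ln(share of positives / share of negatives).
    `smoothing` is an assumed additive correction that avoids log(0)
    for categories with no positives or no negatives.
    """
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    cats = sorted(set(categories))
    pos_total = sum(pos.values()) + smoothing * len(cats)
    neg_total = sum(neg.values()) + smoothing * len(cats)
    return {
        c: math.log(
            ((pos[c] + smoothing) / pos_total)
            / ((neg[c] + smoothing) / neg_total)
        )
        for c in cats
    }


def group_by_woe(woe, n_groups):
    """Sort categories by WoE and cut them into n_groups bins of
    roughly equal size, so categories with similar WoE share a group id."""
    ordered = sorted(woe, key=woe.get)
    return {c: i * n_groups // len(ordered) for i, c in enumerate(ordered)}


# Toy data: four hypothetical states, four rows each.
states = ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4
target = [1, 1, 1, 0,  1, 1, 1, 0,  1, 0, 0, 0,  0, 0, 0, 0]
woe = woe_by_category(states, target)
groups = group_by_woe(woe, n_groups=2)
```

With this toy data, A and B (mostly positive) end up in one group and C and D (mostly negative) in the other. The group id would then replace STATE as the feature.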
My question: Is there any standard way of grouping different categories?
Further, in a Decision Tree, if we want to split on a categorical variable with $k$ categories, ideally we should consider all $\frac{1}{2}(2^{k} - 2) = 2^{k-1} - 1$ possible binary splits and then take the best one. This is computationally expensive. Is there a fast, approximate (greedy) way of doing this? If there is, then we wouldn't need to think about grouping the categories.
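To illustrate the counting, and the kind of shortcut I have in mind, here is a small sketch under stated assumptions: a binary target, Gini impurity as the split criterion, and per-category counts given as `(positives, total)`. The function names and the toy counts are hypothetical. The shortcut sorts categories by their positive rate and only tries the $k-1$ cut points along that ordering. As far as I know, this ordering is a classical result for CART with a binary target and finds the same best split as exhaustive search; for a multiclass target I would expect it to be only a heuristic.

```python
from itertools import combinations


def gini(pos, n):
    """Gini impurity of a node with `pos` positives among `n` rows."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)


def split_score(left, right, stats):
    """Size-weighted Gini impurity of a binary split (lower is better)."""
    weighted, total = 0.0, 0
    for side in (left, right):
        pos = sum(stats[c][0] for c in side)
        n = sum(stats[c][1] for c in side)
        weighted += n * gini(pos, n)
        total += n
    return weighted / total


def best_split_exhaustive(stats):
    """Try every binary partition of the categories.

    The first category is pinned to the left side so that mirror-image
    partitions are not counted twice, giving 2^(k-1) - 1 candidates.
    """
    cats = sorted(stats)
    first, rest = cats[0], cats[1:]
    best_score, n_splits = None, 0
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(cats) - left
            if not right:  # skip the trivial split with an empty side
                continue
            n_splits += 1
            score = split_score(left, right, stats)
            if best_score is None or score < best_score:
                best_score = score
    return best_score, n_splits


def best_split_ordered(stats):
    """Sort categories by positive rate and try only the k - 1 cut points."""
    ordered = sorted(stats, key=lambda c: stats[c][0] / stats[c][1])
    best_score = None
    for i in range(1, len(ordered)):
        score = split_score(set(ordered[:i]), set(ordered[i:]), stats)
        if best_score is None or score < best_score:
            best_score = score
    return best_score, len(ordered) - 1


# Hypothetical counts: {category: (positives, total rows)}
stats = {"A": (1, 10), "B": (8, 10), "C": (3, 10), "D": (9, 10), "E": (5, 10)}
exhaustive_score, exhaustive_count = best_split_exhaustive(stats)
ordered_score, ordered_count = best_split_ordered(stats)
```

For $k = 5$ the exhaustive search evaluates $2^{4} - 1 = 15$ partitions while the ordered search evaluates 4, and on this toy data both reach the same best score.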
Topic cart categorical-encoding classification categorical-data
Category Data Science