Modelling a classification problem with only categorical variables as input features: differences in model performance
I have input data with 100k rows and 8 input features, and I'm trying to predict y (binary 1/0).
All of the X are categorical variables (strictly nominal, not ordinal), some with 8 levels and some with 20.
The data is highly imbalanced: only 0.5% of y is 1.
I have cleaned up the data and applied one-hot encoding to all 8 input variables. I looked at some papers and saw examples using MCA (multiple correspondence analysis), but since the input dimensionality is small, I don't think MCA is necessary.
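The encoding step looks roughly like this (`df` and the `cat_1` ... `cat_8` column names are placeholders for my actual data):

```python
import pandas as pd

# df holds the 8 nominal input columns plus the binary target "y";
# cat_1 ... cat_8 are placeholder names for my real features
cat_cols = [f"cat_{i}" for i in range(1, 9)]

X = pd.get_dummies(df[cat_cols], columns=cat_cols)  # one dummy column per level
y = df["y"]
```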
Then I started modelling (80-20 split with stratified sampling). Since the data is highly imbalanced, I use the f1-score to assess model performance.
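The split and scoring step is along these lines (the `random_state` value is arbitrary, and `test_f1` is just a small helper I use below):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# 80-20 split, stratified on y so both sets keep the ~0.5% positive rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def test_f1(model):
    """Fit on the training split and return the f1-score on the test split."""
    model.fit(X_train, y_train)
    return f1_score(y_test, model.predict(X_test))
```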
Here is what I got (a rough sketch of how each model was fit and scored follows this list):

1: DecisionTreeClassifier: f1-score of 0.648 on the test set.
   - GridSearchCV on DecisionTreeClassifier: f1 of 0.6363 on the test set.
   - RandomizedSearchCV on DecisionTreeClassifier: f1 of 0.6402 on the test set.
2: AdaBoostClassifier: f1-score of 0.499.
3: RandomForestClassifier: 0.65.
4: GradientBoostingClassifier: 0.499.
5: ExtraTreesClassifier: 0.65.
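For reference, this is roughly how those numbers were produced, using the `test_f1` helper from above (the hyperparameter grid shown is a placeholder, not the exact one I searched):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    GradientBoostingClassifier,
    ExtraTreesClassifier,
)
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Plain decision tree plus tuned versions via grid / randomized search
param_grid = {"max_depth": [5, 10, 20, None], "min_samples_leaf": [1, 5, 20]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, scoring="f1", cv=5)
rand = RandomizedSearchCV(DecisionTreeClassifier(random_state=42),
                          param_grid, n_iter=10, scoring="f1", cv=5, random_state=42)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "DecisionTree + GridSearchCV": grid,
    "DecisionTree + RandomizedSearchCV": rand,
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "ExtraTrees": ExtraTreesClassifier(random_state=42),
}

for name, model in models.items():
    print(name, round(test_f1(model), 4))  # f1 on the held-out 20%
```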
I know the data is highly imbalanced, but why do the boosting models perform "much worse" compared with the other methods? As far as I remember, boosting models also use tree models as the base learners at each stage.
Here is a Medium post I came across: https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769, but it doesn't give a clear solution/answer.
Can someone give me some ideas on how to handle data like this? And what models work better for imbalanced data with only categorical variables as inputs?