Modelling a classification problem with only categorical variables as input features: differences in model performance
I have input data with 100k rows and 8 input features, and I'm trying to predict y (binary 1/0).
All of the X are categorical variables (strictly nominal, not ordinal), some with 8 levels and some with 20.
The data is highly imbalanced: only 0.5% of y is 1.
I have cleaned up the data and applied one-hot encoding to all 8 input variables. I looked at some papers and saw examples using MCA (multiple correspondence analysis), but since the input dimensionality is small, I don't think MCA is necessary.
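The encoding step looks roughly like this (`df` and the `cat_1` ... `cat_8` column names are placeholders for my actual data):

```python
import pandas as pd

# df holds the 8 nominal input columns plus the binary target "y";
# cat_1 ... cat_8 are placeholder names for my real features
cat_cols = [f"cat_{i}" for i in range(1, 9)]

X = pd.get_dummies(df[cat_cols], columns=cat_cols)  # one dummy column per level
y = df["y"]
```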
Then I started modelling (80-20 split with stratified sampling). Since the data is highly imbalanced, I use the f1-score to assess model performance.
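The split and scoring step is along these lines (the `random_state` value is arbitrary, and `test_f1` is just a small helper I use below):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# 80-20 split, stratified on y so both sets keep the ~0.5% positive rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def test_f1(model):
    """Fit on the training split and return the f1-score on the test split."""
    model.fit(X_train, y_train)
    return f1_score(y_test, model.predict(X_test))
```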
Here is what I got (a rough sketch of how each model was fit and scored follows this list):

1: DecisionTreeClassifier: f1-score of 0.648 on the test set.
   - GridSearchCV on DecisionTreeClassifier: f1 of 0.6363 on the test set.
   - RandomizedSearchCV on DecisionTreeClassifier: f1 of 0.6402 on the test set.
2: AdaBoostClassifier: f1-score of 0.499.
3: RandomForestClassifier: 0.65.
4: GradientBoostingClassifier: 0.499.
5: ExtraTreesClassifier: 0.65.
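For reference, this is roughly how those numbers were produced, using the `test_f1` helper from above (the hyperparameter grid shown is a placeholder, not the exact one I searched):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    GradientBoostingClassifier,
    ExtraTreesClassifier,
)
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Plain decision tree plus tuned versions via grid / randomized search
param_grid = {"max_depth": [5, 10, 20, None], "min_samples_leaf": [1, 5, 20]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, scoring="f1", cv=5)
rand = RandomizedSearchCV(DecisionTreeClassifier(random_state=42),
                          param_grid, n_iter=10, scoring="f1", cv=5, random_state=42)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "DecisionTree + GridSearchCV": grid,
    "DecisionTree + RandomizedSearchCV": rand,
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "ExtraTrees": ExtraTreesClassifier(random_state=42),
}

for name, model in models.items():
    print(name, round(test_f1(model), 4))  # f1 on the held-out 20%
```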
I know the data is highly imbalanced, but why do the boosting models perform "much worse" compared with the other methods? As far as I remember, boosting models also use tree models as the base learners at each stage.
Here is a Medium post I came across: https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769, but it doesn't give a clear solution/answer.
Can someone give me some ideas on how to handle data like this? And what models work better for imbalanced data with only categorical variables as inputs?