Multi-class classification with unbalanced data: good training results, poor prediction results

I have an unbalanced dataset with 11 classes, where one class makes up 30% of the data and the rest fall between 5% and 12%. I am not a hardcore programmer, so I am using the product from https://www.h2o.ai/. I trained GBM and DRF models with the option to balance the classes, and the training results look great (98-99% precision and recall according to the confusion matrix). However, when I evaluate on the validation set, the only class with decent accuracy is the 30% class; all the others have classification errors close to 100%. I'm not sure what approach to take. The 11 classes are 11 market segments, and even ~70% accuracy for each class would be good enough for my purposes.
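For reference, this is roughly how that setup looks in H2O's Python API. It is only a minimal sketch: the file name `segments.csv`, the target column `segment`, and the hyperparameter values are placeholders, not the actual configuration used.

```python
# Minimal sketch of a balanced-class GBM in H2O (file/column names are hypothetical)
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

data = h2o.import_file("segments.csv")          # hypothetical file name
data["segment"] = data["segment"].asfactor()    # target must be categorical

train, valid = data.split_frame(ratios=[0.8], seed=42)

gbm = H2OGradientBoostingEstimator(
    balance_classes=True,   # over/under-samples classes during training only
    ntrees=200,
    max_depth=5,
    seed=42,
)
gbm.train(
    y="segment",
    x=[c for c in data.columns if c != "segment"],
    training_frame=train,
    validation_frame=valid,
)

# Per-class errors on held-out data, not on the (rebalanced) training data
print(gbm.model_performance(valid=True).confusion_matrix())
```

Note that `balance_classes` only rebalances the training data, which is one reason the training confusion matrix can look far better than the validation one.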

Edit 1: Additional info: on the validation set the model predicts almost every sample as the 30% class, which is why the overall accuracy sits close to 30%. It behaves like a credit-card fraud detector gone wrong.
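One quick way to confirm that collapse onto the majority class is to look at the distribution of predicted labels on the validation frame. This continues the hypothetical `gbm` and `valid` objects from the sketch above:

```python
# Count how often each class is predicted on the validation frame;
# if almost all rows fall into the 30% class, the model has collapsed to it.
preds = gbm.predict(valid)
print(preds["predict"].table())
```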

Update 1: So I tried 2 more approaches

1) I turned it into a two-class problem by labelling everything other than the 30% class as "OTHER", and the results were still poor.

2) I removed the 30% class, kept the other classes as is, and trained a GBM. The results are scarily accurate with an 85%/15% train/validation split, but as soon as I run cross-validation the classification is really poor. (Rough H2O sketches of both experiments are below.)
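For illustration, both experiments could be set up roughly like this, reusing the hypothetical `data` frame and `segment` column from the first sketch; "SEGMENT_30" is a stand-in for the 30% class.

```python
# 1) Two-class version: the 30% class vs. everything else
#    (ifelse on a boolean column relabels row by row)
data["target2"] = (data["segment"] == "SEGMENT_30").ifelse("SEGMENT_30", "OTHER")
data["target2"] = data["target2"].asfactor()

# 2) Drop the 30% class and cross-validate on the remaining segments
rest = data[data["segment"] != "SEGMENT_30", :]

from h2o.estimators import H2OGradientBoostingEstimator
gbm_cv = H2OGradientBoostingEstimator(nfolds=5, balance_classes=True, seed=42)
gbm_cv.train(
    y="segment",
    x=[c for c in rest.columns if c not in ("segment", "target2")],
    training_frame=rest,
)

# Cross-validated per-class errors are the more trustworthy signal here
print(gbm_cv.model_performance(xval=True).confusion_matrix())
```

The cross-validated confusion matrix is worth comparing against the single-split one: if they disagree sharply, the single split is probably leaking information or is simply too optimistic.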

I'm not sure what's going on. Maybe I need to rethink my entire approach, redefine the problem, and come up with an entirely different hypothesis to begin with.

Topic h2o multiclass-classification class-imbalance classification

Category Data Science
