Unbalanced data classification

I used XGBoost to predict company bankruptcy on an extremely unbalanced dataset. Although I tried a class-weighting method as well as parameter tuning, the best result I could obtain is as follows:

Best Parameters: {'clf__gamma': 0.1, 'clf__scale_pos_weight': 30.736842105263158, 'clf__min_child_weight': 1, 'clf__max_depth': 9}
Best Score: 0.219278428798
Accuracy: 0.966850828729
AUC: 0.850038850039
F1 Measure: 0.4
Cohen Kappa: 0.383129792673
Precision: 0.444444444444
Recall: 0.363636363636

Confusion Matrix:
[[346   5]
 [  7   4]]

As the confusion matrix shows, my model cannot identify bankrupt companies very well, which results in poor performance measures such as precision, recall, Cohen's kappa, and the F measure. I also tried the BlaggingClassifier presented here, which gives the following result:

Best Parameters: {'clf__n_estimators': 64}
Best Score: 0.133676613659
Accuracy: 0.809392265193
AUC: 0.819606319606
F1 Measure: 0.188235294118
Cohen Kappa: 0.142886555487
Precision: 0.108108108108
Recall: 0.727272727273
Confusion Matrix:
[[285  66]
 [  3   8]]

As shown, it predicts the positive class well but does poorly on the negative class (too many false positives). Could you please let me know how to combine the results of these two classifiers to obtain a better overall result? A simple way to combine two classifiers is a convex linear combination of their predicted probabilities: t * p1 + (1 - t) * p2, where 0 ≤ t ≤ 1 and p1, p2 are the predicted probabilities of the two classifiers. I should then search for an optimal value of t over a grid, but I don't know how to do that.
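For concreteness, here is a minimal sketch of such a blend, assuming clf_xgb and clf_blag are the two already-fitted classifiers and (X_val, y_val) is a held-out validation set (all of these names are placeholders, and F1 is just one possible selection metric):

import numpy as np
from sklearn.metrics import f1_score

# Blend two fitted classifiers via a convex combination of their
# predicted probabilities; clf_xgb, clf_blag, X_val, y_val are placeholders.
p1 = clf_xgb.predict_proba(X_val)[:, 1]   # P(bankrupt) from XGBoost
p2 = clf_blag.predict_proba(X_val)[:, 1]  # P(bankrupt) from BlaggingClassifier

best_t, best_f1 = 0.0, -1.0
for t in np.linspace(0.0, 1.0, 101):      # grid over t in [0, 1]
    p_blend = t * p1 + (1 - t) * p2
    f1 = f1_score(y_val, (p_blend >= 0.5).astype(int))
    if f1 > best_f1:
        best_t, best_f1 = t, f1

print(f"best t = {best_t:.2f}, validation F1 = {best_f1:.3f}")

Note that t would have to be tuned on a validation set (or via cross-validation) rather than on the test set, otherwise the reported scores will be optimistic.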

I have read that anomaly detection methods, for example one-class SVM and Isolation Forest, can be used for extremely unbalanced datasets, so could you please show me how to do that, e.g., with example code? In general, I would appreciate it if you could let me know how to deal with this issue.
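To make the question concrete, this is the kind of usage I have in mind, a minimal sketch assuming X_train_majority holds only the non-bankrupt (majority) class and X_test is the test set; the 0.03 contamination/nu values are rough guesses at the bankruptcy rate:

from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Fit the detectors on the majority class only, then flag outliers
# in the test set as the positive (bankrupt) class.
iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=0)
iso.fit(X_train_majority)
y_pred_iso = (iso.predict(X_test) == -1).astype(int)   # -1 means outlier

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.03)
ocsvm.fit(X_train_majority)
y_pred_svm = (ocsvm.predict(X_test) == -1).astype(int)

Is this the right way to frame bankruptcy prediction as anomaly detection, or should the detectors be used differently?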

Topic: imbalance, python-3.x, class-imbalance, classification

Category: Data Science


For starters, I am not confident that combining the results will give you what you expect. Have you checked whether the true negatives remain the same in both cases?

Moreover, have you tried adjusting the XGBoost hyperparameters responsible for class balancing, such as max_delta_step or scale_pos_weight?
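A minimal sketch, assuming X_train and y_train are your training split (the usual heuristic for scale_pos_weight is the negative-to-positive ratio, and small max_delta_step values, e.g. 1-10, are commonly suggested for highly imbalanced logistic loss):

import xgboost as xgb

# Heuristic: scale_pos_weight = (# negative samples) / (# positive samples).
ratio = float((y_train == 0).sum()) / (y_train == 1).sum()

clf = xgb.XGBClassifier(
    max_depth=9,
    scale_pos_weight=ratio,  # up-weights the rare positive (bankrupt) class
    max_delta_step=1,        # caps each tree's weight update, which can
                             # stabilize training on highly skewed classes
)
clf.fit(X_train, y_train)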

You can also try sampling techniques, i.e., oversampling and undersampling. Or possibly train several XGBoost models, each on a different random undersample, and combine them, as in the sketch below.
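A minimal sketch of that ensemble idea, assuming X_train, y_train, and X_test are your splits and using RandomUnderSampler from the imbalanced-learn package:

import numpy as np
import xgboost as xgb
from imblearn.under_sampling import RandomUnderSampler

# Train one XGBoost model per random undersample of the majority class,
# then average the predicted probabilities across models.
probas = []
for seed in range(10):
    rus = RandomUnderSampler(random_state=seed)
    X_res, y_res = rus.fit_resample(X_train, y_train)
    model = xgb.XGBClassifier(n_estimators=100, random_state=seed)
    model.fit(X_res, y_res)
    probas.append(model.predict_proba(X_test)[:, 1])

y_pred = (np.mean(probas, axis=0) >= 0.5).astype(int)

Because every model sees a balanced subsample but a different slice of the majority class, the averaged ensemble tends to be less variance-prone than a single undersampled model.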

P.S. I am interested in the anomaly detection paper you mentioned; could you provide a link? Thanks.
