unbalanced data classification
I used XGBoost to predict company bankruptcy on an extremely unbalanced dataset. Although I tried the class-weighting method as well as parameter tuning, the best result I could obtain is as follows:
Best Parameters: {'clf__gamma': 0.1, 'clf__scale_pos_weight': 30.736842105263158, 'clf__min_child_weight': 1, 'clf__max_depth': 9}
Best Score: 0.219278428798
Accuracy: 0.966850828729
AUC: 0.850038850039
F1 Measure: 0.4
Cohen Kappa: 0.383129792673
Precision: 0.444444444444
Recall: 0.363636363636
Confusion Matrix:
[[346   5]
 [  7   4]]
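The weighting method referred to above can be sketched as follows; this is only an illustrative stand-in that uses scikit-learn's GradientBoostingClassifier with per-sample weights (playing the role of XGBoost's scale_pos_weight) on synthetic data, not the real bankruptcy features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

# Synthetic imbalanced data (~30:1) as a stand-in for the bankruptcy features.
rng = np.random.default_rng(0)
n = 600
y = (rng.random(n) < 1 / 31).astype(int)            # 1 = bankrupt (rare)
X = rng.normal(size=(n, 5)) + y[:, None] * 1.5      # shift minority so it is learnable

# Weighting method: up-weight the positive class by the imbalance ratio,
# analogous to setting scale_pos_weight = n_negative / n_positive in XGBoost.
ratio = (y == 0).sum() / max((y == 1).sum(), 1)
w = np.where(y == 1, ratio, 1.0)

clf = GradientBoostingClassifier(max_depth=3, random_state=0)
clf.fit(X, y, sample_weight=w)
print(confusion_matrix(y, clf.predict(X)))
```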
As the confusion matrix shows, my model cannot identify bankrupted companies very well, which results in poor performance measures such as precision, recall, Cohen's kappa, and the F measure. I also tried the BlaggingClassifier, which is presented here. It gives the following result:
Best Parameters: {'clf__n_estimators': 64}
Best Score: 0.133676613659
Accuracy: 0.809392265193
AUC: 0.819606319606
F1 Measure: 0.188235294118
Cohen Kappa: 0.142886555487
Precision: 0.108108108108
Recall: 0.727272727273
Confusion Matrix:
[[285  66]
 [  3   8]]
As shown, it predicts the positive class well but does poorly on the negative class (too many false positives). Could you please let me know how to combine the results of these two classifiers in order to obtain a better result? A simple way to combine two classifiers is to use a convex linear combination of their predicted probabilities: t * p1 + (1 - t) * p2, where 0 ≤ t ≤ 1 and p1, p2 are the predicted probabilities of the two classifiers. I should then search for an optimal value of t over a grid, but I don't know how to do that.
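A grid search over t might be sketched like this; the validation labels and probabilities here are synthetic placeholders, and in practice p1 and p2 would come from each fitted classifier's predict_proba on a held-out set:

```python
import numpy as np
from sklearn.metrics import f1_score

# Placeholder validation-set probabilities for the two classifiers;
# in the real setup use e.g. model.predict_proba(X_val)[:, 1].
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)
p1 = np.clip(y_val * 0.6 + rng.normal(0.2, 0.2, size=200), 0, 1)
p2 = np.clip(y_val * 0.4 + rng.normal(0.3, 0.2, size=200), 0, 1)

best_t, best_f1 = 0.0, -1.0
for t in np.linspace(0, 1, 101):       # grid over the mixing weight
    p = t * p1 + (1 - t) * p2          # convex combination of probabilities
    preds = (p >= 0.5).astype(int)     # 0.5 threshold (could also be tuned)
    score = f1_score(y_val, preds)
    if score > best_f1:
        best_t, best_f1 = t, score

print("best t:", best_t, "best F1:", best_f1)
```

The metric optimized over the grid should be the one that matters for the imbalanced problem (F1, kappa, etc.), evaluated on held-out data rather than the training set.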
I have read that anomaly-detection methods, for example one-class SVM and Isolation Forest, can be used for extremely unbalanced datasets, so could you please show me how to do this, e.g., with example code? In general, I would appreciate any advice on how to deal with this issue.
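For reference, a minimal sketch of both detectors on synthetic, heavily imbalanced data (the minority class stands in for bankrupt companies; all feature values are made up):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.metrics import recall_score

# Synthetic stand-in: the minority (bankrupt) class is treated as the anomaly.
rng = np.random.default_rng(42)
X_majority = rng.normal(0, 1, size=(350, 5))
X_minority = rng.normal(3, 1, size=(11, 5))
X = np.vstack([X_majority, X_minority])
y = np.array([0] * 350 + [1] * 11)   # 1 = bankrupt (anomaly)

# contamination ≈ expected share of anomalies in the data
iso = IsolationForest(contamination=11 / 361, random_state=0).fit(X)
# predict() returns -1 for anomalies, +1 for inliers; map to 1/0 labels
y_pred_iso = (iso.predict(X) == -1).astype(int)

# One-class SVM is typically fit on the majority class only
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(X_majority)
y_pred_svm = (ocsvm.predict(X) == -1).astype(int)

print("IsolationForest recall:", recall_score(y, y_pred_iso))
print("OneClassSVM recall:", recall_score(y, y_pred_svm))
```

Both methods only learn what "normal" looks like, so precision and recall on the minority class still need to be checked against held-out labels.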
Topic imbalance python-3.x class-imbalance classification
Category Data Science