calibrating classifier probabilities for unbalanced data when class ratios are unknown

Question

Graham501617

3 July 2020, 10:31

I've built a binary classification convolutional neural network, trained on simulated data with equal numbers of simulations for each class. I've obtained good results on a validation set with balanced classes, and I am using beta regression to calibrate the output probabilities [1]. The classifier will now be applied to an empirical dataset, where the classes are likely very unbalanced. If I knew the true class proportions in the empirical dataset, my approach would be to fit the calibration regression to simulations with those proportions (probably resampled from the training set, to avoid the burden of additional simulations). But the relative proportions of the classes in the empirical dataset are unknown.
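To make the resampling idea concrete, here is a minimal sketch of drawing a calibration subset with a chosen class ratio from a balanced pool of simulations. The function name `resample_to_ratio`, the 0/1 label encoding, the choice of label 1 as the rare class, and the 100:1 ratio are all illustrative assumptions, not part of my actual pipeline.

```python
import random


def resample_to_ratio(labels, neg_per_pos, seed=0):
    """Return shuffled indices whose class ratio is roughly neg_per_pos:1.

    Assumes labels are 0 (common class) or 1 (rare class). All common-class
    examples are kept; rare-class examples are subsampled without replacement.
    """
    rng = random.Random(seed)
    neg = [i for i, y in enumerate(labels) if y == 0]
    pos = [i for i, y in enumerate(labels) if y == 1]

    # Number of rare-class examples needed to hit the target ratio.
    n_pos = max(1, round(len(neg) / neg_per_pos))
    if n_pos > len(pos):
        raise ValueError("not enough rare-class examples for this ratio")

    keep = neg + rng.sample(pos, n_pos)
    rng.shuffle(keep)
    return keep


# Balanced pool of 2000 simulations, resampled to an assumed 100:1 ratio.
labels = [0] * 1000 + [1] * 1000
idx = resample_to_ratio(labels, neg_per_pos=100)
n_pos = sum(labels[i] for i in idx)
n_neg = len(idx) - n_pos
```

The calibration regression would then be fitted to the classifier scores at `idx` only. Note that at a ratio this skewed, very few rare-class examples remain in the subset, so the fit for that class may be noisy unless the pool is large.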

I also recognise that my validation dataset should match the class ratios in the empirical dataset. Based on knowledge of the empirical dataset, I suspect the class ratios will be something like 100:1 (or perhaps even more skewed). However, I would also like to apply the same CNN architecture (trained on different simulations) to other empirical datasets, where I have no knowledge of the ratios, other than that they will be highly skewed.

So I'm interested in general strategies for dealing with this problem. References to relevant papers would be greatly appreciated. My searches on this topic so far have mostly uncovered blog posts about how to learn from imbalanced datasets, which are usually quite introductory and not relevant for me anyway, as I'm training from simulations such that the classes can be chosen to have arbitrary proportions.

[1] http://proceedings.mlr.press/v54/kull17a.html

Topic probability-calibration class-imbalance classification

Category Data Science
