Account for imbalanced data in a Neural Network using prior distribution
I have a dataset with 4 classes, say their distribution in the training set is
$P_{prior}(C1) = 60\% $
$P_{prior}(C2) = 25\% $
$P_{prior}(C3) = 10\% $
$P_{prior}(C4) = 5\% $
After training a Neural Network (on a balanced dataset, i.e. after undersampling), I get the output for a new sample as
$P(C1) = 50\%$
$P(C2) = 10\%$
$P(C3) = 10\%$
$P(C4) = 30\%$
Usually we would just assign the sample to class 1, since it has the highest predicted probability. But if we compare the output to the prior distribution, we get the following ratios
$\tilde{P}(C1)=P(C1)/P_{prior}(C1) = 0.83$
$\tilde{P}(C2)=P(C2)/P_{prior}(C2) = 0.4$
$\tilde{P}(C3)=P(C3)/P_{prior}(C3) = 1 $
$\tilde{P}(C4)=P(C4)/P_{prior}(C4) = 6 $
Thus the probability of class 4 is six times greater than it was before we saw the data, while for class 1 we are actually less confident than we were before seeing the sample. I would therefore argue that we should assign the new sample to class 4 instead of class 1.
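For concreteness, here is a minimal sketch of the correction I have in mind (NumPy, using the example numbers above; the variable names are just placeholders):

```python
import numpy as np

# Class priors estimated from the (imbalanced) training set: C1..C4
prior = np.array([0.60, 0.25, 0.10, 0.05])

# Softmax output of the network (trained on the balanced, undersampled data)
# for a single new sample
output = np.array([0.50, 0.10, 0.10, 0.30])

# Plain argmax: pick the class with the highest predicted probability
plain_pred = np.argmax(output)      # -> 0 (class 1)

# Proposed correction: divide each output by the corresponding prior
ratio = output / prior              # -> [0.83, 0.4, 1.0, 6.0]
corrected_pred = np.argmax(ratio)   # -> 3 (class 4)

print(plain_pred, corrected_pred, ratio)
```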
Is that approach/thought wrong? And if it is, how would we account for class imbalance in a Neural Network on the prediction side (not in the network structure or training, e.g. by dropout etc.)?
Tags: probability-calibration, class-imbalance, neural-network
Category: Data Science