Imposing a similar metric on data segments when training a model

I am training a binary classifier on a dataset, using AUC as the evaluation metric. The dataset has two main groups (I will refer to them as the good and the bad population). One property of this dataset is that the bad population has a higher proportion of target = 1.

For this reason, a fairly trivial classifier could simply give higher scores to the bad population and lower scores to the good population. Its global AUC could be quite high, yet when looking at the AUC within each population separately, the AUC might be really low in both of them.
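To make this failure mode concrete, here is a small sketch (not part of the original question) that simulates two groups with different base rates and scores samples purely by group membership; the group sizes, base rates, and noise scale are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical data: 1000 samples per group; the "bad" group has a much
# higher base rate of target = 1 than the "good" group.
n = 1000
group = np.r_[np.zeros(n), np.ones(n)]        # 0 = good, 1 = bad
y = np.r_[rng.binomial(1, 0.05, n),           # ~5% positives in the good group
          rng.binomial(1, 0.40, n)]           # ~40% positives in the bad group

# A "dummy" score that only encodes group membership (tiny noise breaks ties)
# and carries no information about the target within either group.
score = group + rng.normal(0, 1e-3, 2 * n)

print("global AUC:", roc_auc_score(y, score))                           # ~0.75
print("AUC | good:", roc_auc_score(y[group == 0], score[group == 0]))   # ~0.5
print("AUC | bad: ", roc_auc_score(y[group == 1], score[group == 1]))   # ~0.5
```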

I want to avoid this behavior. I am willing to sacrifice some global AUC so that the AUC within each group is not very low. One idea I had was to use the harmonic mean of the two groups' AUCs as the metric instead of the overall AUC, but this may not guide the classifier in a natural way.
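For reference, a per-group harmonic-mean AUC is straightforward to compute with scikit-learn; the sketch below uses a hypothetical helper name and assumes the group labels are available at evaluation time:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def harmonic_mean_auc(y_true, y_score, groups):
    """Harmonic mean of the per-group AUCs (hypothetical metric).

    A model that separates the groups but not the targets within them
    scores close to 0.5 here, even if its global AUC is high.
    Each group must contain both classes for the per-group AUC to exist.
    """
    aucs = [roc_auc_score(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)]
    return len(aucs) / sum(1.0 / a for a in aucs)
```

Plugging such a metric into standard model-selection tooling is awkward because most scorers only receive the targets and scores, not the group labels, which is part of why this approach does not feel natural.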

Are there any papers, techniques, or software that can help me solve this problem in a more natural way?

Topic: binary-classification, machine-learning

Category: Data Science


Given that in your data there is a correlation between population type (good vs. bad) and the target, your model may learn undesirable associations between the two. The population type is therefore a confounding factor.

A natural tool for coping with confounders is causal inference. You can find an overview of causal inference in Judea Pearl's work, either this article or his book. A terser introduction can be found in Ferenc Huszár's blog, including an entry on controlling for confounders.

There are a few Python packages providing causal inference functionality, such as Microsoft's dowhy or Causalinference.
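As a rough illustration of how such a package is typically used, here is a minimal dowhy sketch. The CSV file and the column names (`feature_of_interest`, `population`, `target`) are hypothetical placeholders, and propensity score matching is just one of several backdoor-adjustment estimators dowhy offers:

```python
import pandas as pd
from dowhy import CausalModel

# Hypothetical dataset: "population" encodes good vs. bad and is declared
# as a common cause (confounder) of the treatment variable and the target.
df = pd.read_csv("my_dataset.csv")

model = CausalModel(
    data=df,
    treatment="feature_of_interest",  # hypothetical binary treatment column
    outcome="target",
    common_causes=["population"],
)

estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(
    estimand,
    method_name="backdoor.propensity_score_matching",
)
print(estimate.value)  # estimated effect after adjusting for the confounder
```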
