How and where to set weights for imbalanced, cost-sensitive learning in machine learning?
I am facing a binary classification task that is both slightly imbalanced and cost-sensitive. I wonder what is the best way (and where in the modeling pipeline, say, in sklearn) to take all of these considerations into account.
Class proportions: positive: 25%, negative: 75%. This could be addressed with `sklearn.utils.class_weight.compute_class_weight`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights are inversely proportional to class frequencies;
# the result is ordered by np.unique(y), i.e. [negative (0), positive (1)]
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
```
OK, but this only rebalances the class proportions; I should take misclassification cost into account as well. Let's say the cost of a false negative is 10x that of a false positive, so I guess I should further increase the weights in `class_weights` by upweighting the positive class by a factor of 10, right?
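For concreteness, here is a minimal sketch of what I mean, building on `class_weights` from above (the 10x cost ratio and the 0/1 label encoding are my assumptions):

```python
# assumption: a false negative costs 10x as much as a false positive
cost_ratio = 10

# class_weights follows np.unique(y) ordering: index 0 = negative, index 1 = positive
class_weight_dict = {0: class_weights[0], 1: class_weights[1] * cost_ratio}
```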
But there is another point in the pipeline where I could take care of this, namely in the evaluation metric, with F-beta for example, where recall is upweighted (F2, for instance). Does it have the same effect? Should I pick one method (F-beta for evaluation OR upweighting classes), or both of them simultaneously?
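To illustrate the evaluation-side alternative, a sketch assuming predicted labels `y_pred` for true labels `y_true` (both hypothetical names):

```python
from sklearn.metrics import fbeta_score

# F2 score: beta=2 treats recall as twice as important as precision,
# which mirrors the higher cost of false negatives
f2 = fbeta_score(y_true, y_pred, beta=2)
```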
Additionally, in case I upweight my classes with `compute_class_weight()`, I assume that no further class rebalancing should be applied downstream: when I use `RandomForestClassifier()`, its `class_weight` hyperparameter shouldn't be set to `'balanced'`, again, because this would further distort the weight proportions that are already set upstream. Is this correct?
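In other words, my plan would be to apply the weighting exactly once, via the precomputed dict (a sketch assuming the `class_weight_dict` from above and hypothetical `X_train`, `y_train`):

```python
from sklearn.ensemble import RandomForestClassifier

# pass the precomputed, cost-adjusted weights instead of class_weight='balanced',
# so no additional rebalancing is applied on top
clf = RandomForestClassifier(class_weight=class_weight_dict, random_state=0)
clf.fit(X_train, y_train)
```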
Tags: imbalance, weighted-data, evaluation, scikit-learn, machine-learning
Category: Data Science