Which metric should I use for classifying imbalanced data with fewer labels for the negative class?
From what I've read, I understand that when we have fewer positive class labels, it is better to use precision or recall as the evaluation metric. Which metric should I use when we have fewer negative samples?
I'm looking for an approach other than switching the labels.
Problem setting: I'm developing parametrized fragility functions for predicting damage to a structure (for example, trees). An example of a fragility function is here. The fragility function estimates the probability of exceeding a damage state given some parameters (say, wind load). The damage state can be expressed as a damage ratio (0-1, with 1 being fully damaged). We are interested in estimating the probability of exceeding a given damage ratio given the features; for example, the probability of any damage would be P(Damage_ratio > 0.0 | features). Logistic regression can be used to learn this curve from data after categorizing the continuous 0-1 damage ratio into damaged (- class)/no damage (+ class) at a particular threshold. As we move the threshold from 0 to 1, the dataset transforms from imbalanced data dominated by damaged cases, to a balanced state, and finally to another imbalanced dataset dominated by non-damage cases.
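To make the setup concrete, here is a minimal sketch of the thresholding step with scikit-learn. The data here is synthetic (a single wind-load feature and a noisy damage ratio) and the 0.5 threshold is just an illustration, not my real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: wind load as the single feature,
# with damage ratio increasing (noisily) with load.
wind_load = rng.uniform(0, 100, size=500)
damage_ratio = np.clip(wind_load / 100 + rng.normal(0, 0.15, size=500), 0, 1)

# Categorize the continuous damage ratio at one damage-state threshold.
threshold = 0.5
y = (damage_ratio > threshold).astype(int)  # 1 = exceeds the damage state

model = LogisticRegression()
model.fit(wind_load.reshape(-1, 1), y)

# Estimated P(Damage_ratio > threshold | wind load) for new loads.
probs = model.predict_proba(np.array([[20.0], [50.0], [80.0]]))[:, 1]
```

Repeating this for thresholds from 0 to 1 gives the family of fragility curves, and the class balance of `y` shifts exactly as described above.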
Now, when learning the model, AUC-ROC performs really well when the data is balanced. Precision performs well when the data is imbalanced with more no-damage cases (P(Damage_ratio > 0.1 | features)). These metrics don't do well for the case with few negative cases (P(Damage_ratio > 0.9 | features)). I tried switching the labels with very limited success. Are there any other metrics that perform well in an imbalanced-data setting?
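For reference, this is roughly how I have been evaluating candidate metrics that treat the two classes symmetrically (Matthews correlation and balanced accuracy) or that can be pointed at the rare class (PR-AUC via `average_precision_score`). The labels and scores below are synthetic, mimicking the few-negatives case at P(Damage_ratio > 0.9 | features):

```python
import numpy as np
from sklearn.metrics import (
    matthews_corrcoef,
    balanced_accuracy_score,
    average_precision_score,
)

rng = np.random.default_rng(1)

# Synthetic imbalanced case: ~95% positives (exceedance), few negatives.
y_true = rng.choice([0, 1], size=1000, p=[0.05, 0.95])
scores = np.clip(y_true * 0.8 + rng.normal(0.1, 0.2, size=1000), 0, 1)
y_pred = (scores > 0.5).astype(int)

# Matthews correlation: symmetric in the two classes, so it does not
# depend on which class is labeled positive.
mcc = matthews_corrcoef(y_true, y_pred)

# Balanced accuracy: mean of per-class recall, insensitive to class ratio.
bal_acc = balanced_accuracy_score(y_true, y_pred)

# PR-AUC focused on the rare (negative) class: flip labels and scores so
# the minority class plays the role of the positive class.
ap_rare = average_precision_score(1 - y_true, 1 - scores)
```

I'd be interested in whether metrics of this kind (or others) are considered reliable across the whole sweep of thresholds, or whether a single metric simply cannot cover both imbalanced regimes.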
Topic: class-imbalance, classification
Category: Data Science