Determining if a dataset is balanced
I'm learning about training sets and I have been provided with a set of labelled customer data that segments customers into one of two classes: A or B. The dataset also contains gender, age and profession attributes for each customer. The distribution of classes in the dataset is like this:
- 92% of customers are class A
- 8% of customers are class B
Based on my understanding, this is an unbalanced dataset because the distribution of classes is not equal. However, I'm confused as to how the other attributes play a role in determining whether or not this dataset is balanced. For example, if my dataset has equal distributions of gender, profession and age values, is the dataset still considered unbalanced because the value I'm trying to train my model to predict (class A or B) is unbalanced?
Alternatively, if my class distribution were equal, would my dataset be considered balanced regardless of the other attributes? For example, if my dataset had 90% female customers and 10% male customers, but the class distribution was 50% A and 50% B, would the dataset be considered balanced?
My main question is: when determining whether or not my dataset is balanced, should I be looking at the distribution of classes within the dataset, or at the distribution of the other attributes that may or may not be good predictors of the class?
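To make the two quantities I'm comparing concrete, here is a minimal sketch of how each distribution could be computed. The rows are made up to match the 92%/8% split above, and the field names `segment` and `gender` are hypothetical, not taken from my actual dataset.

```python
from collections import Counter


def proportions(rows, field):
    """Return {value: fraction of rows} for the given field."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Made-up data: 100 customers, hypothetical field names.
# Classes are split 92/8, while gender is split 50/50.
rows = (
    [{"segment": "A", "gender": "F"}] * 46
    + [{"segment": "A", "gender": "M"}] * 46
    + [{"segment": "B", "gender": "F"}] * 4
    + [{"segment": "B", "gender": "M"}] * 4
)

class_dist = proportions(rows, "segment")  # distribution of the target label
gender_dist = proportions(rows, "gender")  # distribution of one input attribute

print("class distribution:", class_dist)
print("gender distribution:", gender_dist)
```

In this toy case the attribute (`gender`) is evenly distributed while the label (`segment`) is not, which is exactly the situation I'm unsure how to describe.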
Topic imbalanced-data
Category Data Science