Determining if a dataset is balanced

I'm learning about training sets and I have been provided with a set of labelled customer data that segments customers into one of two classes: A or B. The dataset also contains gender, age and profession attributes for each customer. The distribution of classes in the dataset is like this:

  • 92% of customers are class A
  • 8% of customers are class B

Based on my understanding, this is an unbalanced dataset because the distribution of classes is not equal. However, I'm confused as to how the other attributes play a role in determining whether or not this dataset is balanced. For example, if my dataset has equal distributions of gender, profession and age values, is the dataset still considered unbalanced because the value I'm trying to train my model to predict (class A or B) is unbalanced?

Alternatively, if my class distribution was equal, would my dataset be considered balanced regardless of the other attributes? For example, if my dataset had 90% female customers and 10% male, but the class distribution was 50% A and 50% B, would the dataset be considered balanced?

My main question is, when determining whether or not my dataset is balanced, should I be looking at the distribution of classes within the dataset or the distribution of the other attributes that may/may not be good predictors of the class?

Topic imbalanced-data

Category Data Science


I'm not sure what environment you are working in. More information on that would help.

To answer your question: yes, the dataset is imbalanced. If you are building a supervised learning model, it helps to have roughly equal amounts of data for each label. You can confirm the imbalance by checking the frequency distribution of the labels in the dataset.
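As a minimal sketch of that frequency check, the snippet below counts how often each label occurs. The `labels` list is made-up data that mirrors the 92% / 8% split from the question, and `class_distribution` is a hypothetical helper name, not part of any library.

```python
from collections import Counter

# Hypothetical labels mirroring the 92% / 8% split described in the question.
labels = ["A"] * 92 + ["B"] * 8


def class_distribution(labels):
    """Return each class's share of the rows as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: count / total for cls, count in counts.items()}


distribution = class_distribution(labels)

# Ratio of the most common class to the least common class.
imbalance_ratio = max(distribution.values()) / min(distribution.values())

print(distribution)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```

A ratio close to 1 suggests the classes are roughly balanced; the larger the ratio, the stronger the imbalance.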

You can use the techniques below to look for relationships in the data, which helps in choosing the features (columns) used to predict class A or B.

  1. Correlation matrix: shows how strongly each column relates to the label column.
  2. Clustering algorithms: can give you a good visual representation of how the data is naturally grouped.
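To illustrate the correlation idea from the list above, here is a rough sketch that computes the Pearson correlation between one numeric feature and the label. The `ages` and `is_class_b` values are invented for illustration, and encoding class B as 1 and class A as 0 is an assumption. A full correlation matrix repeats this calculation for every pair of columns.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical customer ages and labels (assumption: 1 = class B, 0 = class A).
ages = [22, 25, 31, 38, 45, 52, 58, 63]
is_class_b = [0, 0, 0, 0, 0, 1, 0, 1]

age_label_corr = pearson(ages, is_class_b)
print(f"correlation between age and class B: {age_label_corr:.2f}")
```

Values near 1 or -1 indicate a strong linear relationship with the label, while values near 0 indicate a weak one. Categorical columns such as gender or profession would need to be encoded as numbers first.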

When deciding whether a dataset is imbalanced, you should look at the frequency distribution of your dependent variable (the output feature), because that is what the model is trying to predict, not the independent features.

The distribution of the independent features does not determine whether a dataset is imbalanced (though it matters a lot for other things, such as model selection and feature engineering/selection).
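The second scenario from the question (90% female and 10% male customers, with classes split 50/50) can be sketched as follows. The `rows` data is fabricated to match those proportions, and `shares` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical (gender, class) rows matching the example in the question:
# 90% female / 10% male, with classes A and B split evenly.
rows = (
    [("female", "A")] * 45
    + [("female", "B")] * 45
    + [("male", "A")] * 5
    + [("male", "B")] * 5
)


def shares(values):
    """Return each distinct value's share of the total as a fraction."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


gender_shares = shares(gender for gender, _ in rows)
class_shares = shares(label for _, label in rows)

print("gender:", gender_shares)  # skewed feature
print("class: ", class_shares)   # evenly split target
```

Here the gender feature is skewed, but the target is evenly split, so the dataset counts as balanced in the sense used above.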
