How to find whether a dataset is balanced or imbalanced?

I have a few datasets for experimenting with multi-class classification. These datasets are about 400 GB. Is there a principled, quantitative way to tell whether a dataset is balanced or imbalanced?

Topic imbalanced-learn sampling class-imbalance classification machine-learning

Category Data Science

I did it with matplotlib, by plotting the percentage of samples that each country (the target class) accounts for. The ratio is roughly 62:18:19, so the dataset is only mildly imbalanced and I treated it as balanced.


If you want the full code, leave a comment.

In R, do the following:

1. Convert the data frame to a tibble to show the data type of each column:

library(dplyr)  # provides as_tibble(), group_by(), group_nest() and %>%
df <- InsectSprays
df <- as_tibble(df)
> df
# A tibble: 72 x 2
   count spray
   <dbl> <fct>
 1    10 A
 2     7 A
 3    20 A
 4    14 A
 5    14 A
 6    12 A
 7    10 A
 8    23 A
 9    17 A
10    20 A
# ... with 62 more rows

2. Nest the rows by the k-level factor to see how many rows each level has. Here every level of spray has 12 rows, so the classes are balanced:

> df %>% group_by(spray) %>% group_nest()
# A tibble: 6 x 2
  spray               data
  <fct> <list<tbl_df[,1]>>
1 A               [12 x 1]
2 B               [12 x 1]
3 C               [12 x 1]
4 D               [12 x 1]
5 E               [12 x 1]
6 F               [12 x 1]

Ideally, every class in a multi-class problem is equally represented. With 4 classes, for example, the sample counts would be in the ratio n:n:n:n. Most classification datasets do not have exactly the same number of samples in each class, which is fine; a small difference often does not matter. But if the difference is large, for example 100:5:9:13, it does matter and the dataset is imbalanced.
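To put a number on this, one option is the imbalance ratio (largest class count divided by smallest) together with a normalized Shannon entropy of the class proportions, which is 1.0 for perfectly equal classes and moves toward 0 as one class dominates. The sketch below is only an illustration: the function names are made up here, and any cutoff applied to these numbers is a judgment call rather than a standard.

```python
import math


def imbalance_ratio(counts):
    """Largest class count divided by the smallest (1.0 means perfectly balanced)."""
    smallest = min(counts)
    if smallest == 0:
        return math.inf  # a class with no samples at all
    return max(counts) / smallest


def normalized_entropy(counts):
    """Shannon entropy of the class proportions, divided by log(k) so it lies in [0, 1].

    Assumes at least two classes.
    """
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(len(counts))


balanced = [50, 50, 50, 50]  # the ideal n:n:n:n case (n chosen arbitrarily)
skewed = [100, 5, 9, 13]     # the example ratio from the text

for name, counts in [("balanced", balanced), ("skewed", skewed)]:
    print(name, imbalance_ratio(counts), round(normalized_entropy(counts), 3))
```

For the 100:5:9:13 example the imbalance ratio is 20, and the normalized entropy comes out well below 1, which matches the intuition that this dataset is heavily skewed.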

As for reading 400 GB of data: depending on the file type, you can read it in chunks and keep only the target variable (the column holding the class labels) in a separate variable.
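A minimal sketch of that idea, assuming the data is a CSV file with a header row: stream the rows one at a time and keep only a running tally of the label column, so memory use depends on the number of classes rather than the file size. The function name `count_labels` and the column name `country` are placeholders, not anything from the original data.

```python
import csv
import io
from collections import Counter


def count_labels(lines, target_column):
    """Stream CSV rows and tally the target column without loading the whole file."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[target_column]] += 1
    return counts


# Tiny in-memory stand-in for a large file. With real data you would pass an
# open file handle instead, e.g. open("data.csv", newline="").
sample = io.StringIO("feature,country\n1.0,US\n2.5,IN\n0.3,US\n4.1,UK\n")
label_counts = count_labels(sample, "country")
print(label_counts.most_common())
```

If you already use pandas, `pandas.read_csv` with its `chunksize` and `usecols` parameters should give a similar result: read only the label column a chunk at a time and add up the per-chunk counts.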

You can then visualize the target variable with a bar chart, which shows the number of samples in each class. You can also calculate the class distribution (the proportion of samples per class) to get a better understanding of the data.
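The bar chart and the class distribution could be sketched roughly as follows. The counts are made up (they reuse the 100:5:9:13 ratio), the function names are hypothetical, and the plotting function is only defined here, not called, so the percentages can be computed even where matplotlib is not installed.

```python
def class_percentages(counts):
    """Convert {label: count} into {label: percentage of all samples}."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def plot_class_counts(counts, path="class_counts.png"):
    """Save a bar chart of per-class counts, annotated with percentages."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    labels = list(counts)
    values = [counts[label] for label in labels]
    shares = class_percentages(counts)

    fig, ax = plt.subplots()
    bars = ax.bar(labels, values)
    for bar, label in zip(bars, labels):
        # Write the percentage just above each bar.
        ax.annotate(f"{shares[label]:.1f}%",
                    (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                    ha="center", va="bottom")
    ax.set_xlabel("class")
    ax.set_ylabel("number of samples")
    fig.savefig(path)
    plt.close(fig)


example_counts = {"A": 100, "B": 5, "C": 9, "D": 13}  # made-up counts
shares = class_percentages(example_counts)
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

Calling `plot_class_counts(example_counts)` writes the chart to `class_counts.png`; one very tall bar next to several short ones is the visual signature of an imbalanced target.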

You can look at the number of samples in each class. Ideally, all classes have roughly equal proportions. If one class has considerably more samples than the rest, the model learns to predict that class more often than the others, which biases it toward the majority class and hurts its performance on the minority classes.
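One way to see the effect of a dominant class: a model that ignores the features and always predicts the largest class already scores well on plain accuracy. The short sketch below works that out for the 100:5:9:13 ratio mentioned earlier in the thread; the function name is hypothetical.

```python
def majority_baseline_accuracy(counts):
    """Accuracy of a model that always predicts the most frequent class."""
    return max(counts) / sum(counts)


skewed = [100, 5, 9, 13]      # imbalanced example
balanced = [50, 50, 50, 50]   # equal classes (count chosen arbitrarily)

print(f"skewed:   {majority_baseline_accuracy(skewed):.3f}")    # 0.787
print(f"balanced: {majority_baseline_accuracy(balanced):.3f}")  # 0.250
```

On the skewed counts this do-nothing model reaches about 79% accuracy while never predicting three of the four classes, which is why accuracy alone can hide the problem and per-class metrics are usually worth checking as well.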
