Undersample to a specific number of samples per class using Tomek links in imblearn

I have a dataset whose target-column classes are distributed as shown below.

    counts   percents
6     1507  27.045944
3     1301  23.348887
5      661  11.862886
4      588  10.552764
7      564  10.122039
8      432   7.753051
1      416   7.465901
2       61   1.094760
9       38   0.681981
10       4   0.071788

I would like to undersample my data so that each class has at most 588 samples; that is, classes 6, 3, and 5 should each have only ~588 samples after undersampling.
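To make the target concrete, here is a minimal sketch of the per-class counts I am aiming for. The counts are copied from the table above; `capped_targets` is a hypothetical helper written only for illustration, not part of imblearn.

```python
# Class counts copied from the distribution table above.
counts = {6: 1507, 3: 1301, 5: 661, 4: 588, 7: 564,
          8: 432, 1: 416, 2: 61, 9: 38, 10: 4}

CAP = 588  # maximum number of samples wanted per class


def capped_targets(class_counts, cap):
    """Hypothetical helper: cap every class count at `cap`."""
    return {label: min(n, cap) for label, n in class_counts.items()}


# Desired count for every class after undersampling.
targets = capped_targets(counts, CAP)

# Only classes above the cap actually need to be reduced.
to_reduce = {label: CAP for label, n in counts.items() if n > CAP}
print(to_reduce)
```

Only classes 6, 3, and 5 exceed the cap, so they are the only ones that should lose samples; every other class should stay untouched.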

Here's what I have tried, but to no avail:

from imblearn.under_sampling import TomekLinks

# callable that returns a dict of target counts per class
def tomek_xtrain_func(y_series):
    return {6:588, 3: 588, 5: 588}

# by passing a callable
tomek_for_xtrain = TomekLinks(sampling_strategy=tomek_xtrain_func, n_jobs=-1)
tomeked_X_train, tomeked_y_train = tomek_for_xtrain.fit_resample(X_train, y_train)

This raises the following ValueError:

ValueError: 'sampling_strategy' as a dict for cleaning methods is not supported. Please give a list of the classes to be targeted by the sampling.

So I resorted to passing a list of the targeted classes instead:

tomek_for_xtrain = TomekLinks(sampling_strategy=[6, 3, 5], n_jobs=-1)
tomeked_X_train, tomeked_y_train = tomek_for_xtrain.fit_resample(X_train, y_train)

But this produces the following distribution, which does not reduce the imbalance between the least frequent and most frequent classes to the level I want.

    counts   percents
6     1046  23.131358
3      937  20.720920
4      588  13.003096
7      564  12.472357
5      436   9.641751
8      432   9.553295
1      416   9.199469
2       61   1.348961
9       38   0.840336
10       4   0.088456
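As far as I understand it, this happens because Tomek links removal is a cleaning method: it only drops samples that form a Tomek link (two samples of different classes that are each other's nearest neighbor), so the number removed depends on the data and not on a requested count. Below is a simplified pure-Python sketch of that idea on toy data. It is not imblearn's implementation, and the function names and data are made up for illustration.

```python
import math


def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    others = (j for j in range(len(X)) if j != i)
    return min(others, key=lambda j: math.dist(X[i], X[j]))


def tomek_link_indices(X, y, target_classes):
    """Indices of samples from target_classes that belong to a Tomek link."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    removable = set()
    for i, j in enumerate(nn):
        # A Tomek link: mutual nearest neighbors with different labels.
        if nn[j] == i and y[i] != y[j] and y[i] in target_classes:
            removable.add(i)
    return sorted(removable)


# Toy data: five "A" samples and one "B" sample lying close to one "A".
X = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (2.8, 0.0), (5.0, 0.0), (5.9, 0.0)]
y = ["A", "A", "A", "B", "A", "A"]

removed = tomek_link_indices(X, y, target_classes={"A"})
print(removed)               # indices of "A" samples that would be dropped
print(y.count("A") - len(removed))  # "A" samples left after cleaning
```

In this toy case only the single "A" sample next to the "B" sample is removed, however many "A" samples I would like to drop, which matches the behaviour I see on my real data.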

How do I achieve this with an undersampling strategy that retains closely occurring samples (in X)?

