Under-sample to get a specific number of samples per class using Tomek links from imblearn

Question

Naveen Reddy Marthala

April 24, 2022, 07:05

I have a dataset whose target column has the class distribution shown below.

    counts   percents
6     1507  27.045944
3     1301  23.348887
5      661  11.862886
4      588  10.552764
7      564  10.122039
8      432   7.753051
1      416   7.465901
2       61   1.094760
9       38   0.681981
10       4   0.071788

I would like to under-sample my data so that no class has more than 588 samples; that is, classes 6, 3 and 5 should each have about 588 samples after under-sampling.

Here's what I have tried, but to no avail:

from imblearn.under_sampling import TomekLinks

# callable that returns a dict of target sample counts per class
def tomek_xtrain_func(y_series):
    return {6:588, 3: 588, 5: 588}

# by passing a callable
tomek_for_xtrain = TomekLinks(sampling_strategy=tomek_xtrain_func, n_jobs=-1)
tomeked_X_train, tomeked_y_train = tomek_for_xtrain.fit_resample(X_train, y_train)

This raises the following ValueError:

ValueError: 'sampling_strategy' as a dict for cleaning methods is not supported. Please give a list of the classes to be targeted by the sampling.

So I resorted to passing a list of the targeted classes, like this:

tomek_for_xtrain = TomekLinks(sampling_strategy=[6, 3, 5], n_jobs=-1)
tomeked_X_train, tomeked_y_train = tomek_for_xtrain.fit_resample(X_train, y_train)

However, this produces the following distribution, and it does not reduce the gap between the least frequent and the most frequent classes to the level I want.

    counts   percents
6     1046  23.131358
3      937  20.720920
4      588  13.003096
7      564  12.472357
5      436   9.641751
8      432   9.553295
1      416   9.199469
2       61   1.348961
9       38   0.840336
10       4   0.088456
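
For context on why the counts above do not land on a chosen number: a Tomek link is a pair of samples from different classes that are each other's nearest neighbour, and a cleaning method removes only samples that take part in such links. So the data decides how many samples are dropped, not a requested target. The following is a minimal pure-Python sketch of link detection on a toy one-dimensional dataset. The helper names `nearest` and `tomek_links` and the toy data are illustrative assumptions, not part of imblearn.

```python
def nearest(i, points):
    """Index of the point closest to points[i] (squared Euclidean distance)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
    )


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Toy data: only the points at 1.0 ("a") and 1.2 ("b") form a link.
points = [(0.0,), (1.0,), (1.2,), (5.0,), (5.1,)]
labels = ["a", "a", "b", "b", "b"]
print(tomek_links(points, labels))  # [(1, 2)]
```

In this toy case only one sample of class "a" could be removed, however many were requested.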

How do I achieve this using an under-sampling strategy that retains closely occurring samples (in X)?
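
One possible direction, sketched under assumptions rather than offered as a definitive answer: build the per-class target dict from the observed counts, then hand it to a sampler that accepts a dict. The helper `cap_strategy` below is hypothetical (not an imblearn function), and the label list is rebuilt from the counts in the question purely for illustration.

```python
from collections import Counter


def cap_strategy(y, cap):
    """Return {class: cap} for every class with more than `cap` samples."""
    counts = Counter(y)
    return {label: cap for label, n in counts.items() if n > cap}


# Class counts copied from the question, expanded into a flat label list.
question_counts = {6: 1507, 3: 1301, 5: 661, 4: 588, 7: 564,
                   8: 432, 1: 416, 2: 61, 9: 38, 10: 4}
y_demo = [label for label, n in question_counts.items() for _ in range(n)]

strategy = cap_strategy(y_demo, cap=588)
print(strategy)  # {6: 588, 3: 588, 5: 588}

# With imbalanced-learn installed, a dict like this is accepted by the
# controlled under-samplers (unlike cleaning methods such as TomekLinks), e.g.:
#   from imblearn.under_sampling import NearMiss
#   X_res, y_res = NearMiss(sampling_strategy=strategy).fit_resample(X_train, y_train)
```

Whether a distance-based selector such as `NearMiss` matches the intended notion of "closely occurring samples" is an assumption to check against the data. `RandomUnderSampler` accepts the same dict if random selection is acceptable.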

Topic imbalanced-learn sampling class-imbalance python

Category Data Science
