Undersampling to a specific number of samples per class using Tomek links in imblearn
I have a dataset whose target column has the class distribution shown below.
class  counts   percents
6        1507  27.045944
3        1301  23.348887
5         661  11.862886
4         588  10.552764
7         564  10.122039
8         432   7.753051
1         416   7.465901
2          61   1.094760
9          38   0.681981
10          4   0.071788
I would like to undersample my data so that no class has more than 588 samples; that is, classes 6, 3, and 5 should each end up with roughly 588 samples after undersampling.
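To make the target concrete, here is a minimal sketch that derives the per-class cap from the counts above. The helper name `capped_strategy` and its `cap` parameter are made up for illustration, and the label list is rebuilt from the table rather than taken from my real `y_train`.

```python
from collections import Counter

# Class counts copied from the table above.
class_counts = {6: 1507, 3: 1301, 5: 661, 4: 588, 7: 564,
                8: 432, 1: 416, 2: 61, 9: 38, 10: 4}

# Rebuild a label list with the same distribution (a stand-in for y_train).
y = [cls for cls, n in class_counts.items() for _ in range(n)]

def capped_strategy(labels, cap=588):
    """Return {class: cap} for every class that has more than `cap` samples."""
    counts = Counter(labels)
    return {cls: cap for cls, n in counts.items() if n > cap}

target = capped_strategy(y)
print(target)  # {6: 588, 3: 588, 5: 588}
```

Only the three classes above the cap appear in the result; it is the same dict that I hard-code in my attempt below.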
Here's what I have tried, but to no avail:
from imblearn.under_sampling import TomekLinks

# Callable that returns the desired number of samples per class as a dict
def tomek_xtrain_func(y_series):
    return {6: 588, 3: 588, 5: 588}

# Pass the callable as the sampling strategy
tomek_for_xtrain = TomekLinks(sampling_strategy=tomek_xtrain_func, n_jobs=-1)
tomeked_X_train, tomeked_y_train = tomek_for_xtrain.fit_resample(X_train, y_train)
This raises the following ValueError:
ValueError: 'sampling_strategy' as a dict for cleaning methods is not supported. Please give a list of the classes to be targeted by the sampling.
So I resorted to passing a list of the targeted classes instead:
tomek_for_xtrain = TomekLinks(sampling_strategy=[6, 3, 5], n_jobs=-1)
tomeked_X_train, tomeked_y_train = tomek_for_xtrain.fit_resample(X_train, y_train)
However, this produces the distribution below, which does not bring the ratio between the rarest class and the most frequent class down to what I want.
class  counts   percents
6        1046  23.131358
3         937  20.720920
4         588  13.003096
7         564  12.472357
5         436   9.641751
8         432   9.553295
1         416   9.199469
2          61   1.348961
9          38   0.840336
10          4   0.088456
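Spelling out the arithmetic on the two tables shows the gap. This is only a small sketch: the variable names are mine and the counts are copied from the tables above, not recomputed from the data.

```python
# Class counts before and after TomekLinks, copied from the two tables.
before = {6: 1507, 3: 1301, 5: 661, 4: 588, 7: 564,
          8: 432, 1: 416, 2: 61, 9: 38, 10: 4}
after = {6: 1046, 3: 937, 4: 588, 7: 564, 5: 436,
         8: 432, 1: 416, 2: 61, 9: 38, 10: 4}
cap = 588  # the per-class maximum I am aiming for

# How many samples each class lost (only classes that changed).
removed = {cls: before[cls] - after[cls]
           for cls in before if before[cls] != after[cls]}

# Classes that are still above the cap after cleaning.
still_over = {cls: n for cls, n in after.items() if n > cap}

# Ratio of the most frequent class to the rarest class.
ratio_before = max(before.values()) / min(before.values())
ratio_after = max(after.values()) / min(after.values())

print(removed)       # {6: 461, 3: 364, 5: 225}
print(still_over)    # {6: 1046, 3: 937}
print(ratio_before)  # 376.75
print(ratio_after)   # 261.5
```

So classes 6 and 3 remain well above 588, while class 5 has dropped below it to 436, and I have no control over how many samples are removed from each class.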
How can I achieve this with an under-sampling strategy that retains closely occurring samples (in X)?
Topic imbalanced-learn sampling class-imbalance python
Category Data Science