How does SMOTE work for dataset with only categorical variables?

Question

The Great

February 26, 2022 07:50

I have a small dataset of 977 rows with a class proportion of 77:23.

For the sake of metrics improvement, I have kept my minority class ('default') as class 1 (and 'not default' as class 0).

My input variables are all categorical (let's assume we don't have age and salary information), so this is what I tried:

a) Apply encoding like rare_encoding and ordinal_encoding to my dataset

b) Split into train and test split (with stratify = y)

c) Apply SMOTE to resample the training data only.

However, my question is how SMOTE resamples when there are only categorical variables, as below:

gender  degree    occupation    Country     status
 MALE    BE        ENGGINER      USA        default
 MALE    ME        RESEARCHER    UK         default
 FEMALE  BSc       Admin staff   NZ         default  
 FEMALE  MS        Scientist     sweden     default

Now, if my objective is to oversample the minority class using SMOTE, what will the synthetic samples look like? Will it just randomly shuffle gender, degree, occupation and country into different permutations and combinations?

Is there a simple explanation or tutorial that you can share for someone who would like to apply this technique?

My objective is to understand how SMOTE works for a dataset with only categorical variables.

Topic smote deep-learning neural-network classification machine-learning

Category Data Science

Jan Jitse Venselaar · Accepted Answer · February 20, 2022 14:32

SMOTE itself wouldn't work well for a categorical-only feature set, for a few reasons:

It works by interpolating between different points. Depending on how the data is encoded, you might end up with an undefined category (with one-hot encoding, you might get a point that is half one category and half another), or you might end up with a valid category that makes no sense from an interpolation point of view (for example, if you encode the country on a numerical scale like 1 -> US, 2 -> UK, 3 -> NZ, interpolating between US and NZ can land on UK, which is meaningless).
SMOTE uses k-nearest neighbors to select points to interpolate between. If you encode your categorical features using one-hot encoding, you typically end up with a lot of sparse dimensions (dimensions in which most points take only the value 0). Nearest-neighbor search typically won't perform very well in such a space, and points that are nearby in this space might not look a lot like each other.
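To make the interpolation problem concrete, here is a minimal sketch of the classic SMOTE interpolation step applied to encoded categories. The function name `smote_interpolate` and the column order of the one-hot vectors are assumptions made for illustration; this is not the imbalanced-learn implementation.

```python
import random


def smote_interpolate(x, neighbor, rng):
    """Classic SMOTE step: a random point on the segment from x to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


rng = random.Random(0)

# One-hot encoding of country; assumed column order is [US, UK, NZ].
us = [1.0, 0.0, 0.0]
nz = [0.0, 0.0, 1.0]
synthetic_onehot = smote_interpolate(us, nz, rng)
print(synthetic_onehot)  # partly US and partly NZ: not a valid one-hot row

# Ordinal encoding from the text: 1 -> US, 2 -> UK, 3 -> NZ.
synthetic_ordinal = smote_interpolate([1.0], [3.0], rng)
print(synthetic_ordinal)  # somewhere between 1 and 3, possibly near 2 (UK)
```

The one-hot result has two non-zero entries that sum to 1, so it does not correspond to any country, and the ordinal result can sit next to the code for UK even though neither parent point was from the UK.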

What you can do is use a modification of the SMOTE algorithm, called SMOTE-N (see https://imbalanced-learn.org/dev/over_sampling.html#smote-variants), which works when all features are categorical. This modifies the SMOTE algorithm to:

Use a different interpolation method: for each feature, select the most common category among the nearest neighbors.
Use a different distance metric (the Value Difference Metric) instead of Euclidean distance in the encoded space.

The imbalanced-learn documentation attributes this method to the original SMOTE paper (https://www3.nd.edu/~dial/publications/chawla2002smote.pdf), where it is found in Section 6.2. There is also SMOTE-NC, which is a combination of SMOTE and SMOTE-N for data that has both numerical and categorical features.
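As a rough illustration of the Value Difference Metric mentioned above, the sketch below computes the distance between two categories of a single feature as the sum over classes of |P(class | v1) - P(class | v2)|^k. The helper name `value_difference`, the exponent default `k=1`, and the toy data are assumptions for illustration only; a library implementation may differ in details.

```python
from collections import Counter, defaultdict


def value_difference(values, labels, v1, v2, k=1):
    """Distance between categories v1 and v2 of one feature.

    Two categories are close when they are associated with the classes
    in similar proportions. Both v1 and v2 must occur in `values`.
    """
    counts = defaultdict(Counter)  # category -> class -> count
    for value, label in zip(values, labels):
        counts[value][label] += 1
    n1 = sum(counts[v1].values())
    n2 = sum(counts[v2].values())
    return sum(
        abs(counts[v1][c] / n1 - counts[v2][c] / n2) ** k
        for c in set(labels)
    )


# Hypothetical toy data: one categorical feature and the class labels.
degree = ["BE", "BE", "ME", "ME", "BSc", "BSc"]
status = ["default", "default", "default",
          "not default", "not default", "not default"]

print(value_difference(degree, status, "BE", "ME"))
print(value_difference(degree, status, "BE", "BSc"))
```

In this toy data "BE" always goes with "default" and "BSc" never does, so they end up further apart than "BE" and "ME". The distance between two rows is then built from these per-feature distances.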

For your example, let's say that, for some reason, 3 of the points given:

 MALE    ME        RESEARCHER    UK         default
 FEMALE  BSc       Admin staff   NZ         default
 FEMALE  MS        Scientist     sweden     default

are considered near each other and are used for interpolation. Then a possible point added by SMOTE-N would be:

FEMALE (because that's the majority category)
MS (all 3 categories have equal frequency, so one is picked at random)
RESEARCHER (same as above)
NZ (same as above)
default (majority class)
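The per-feature vote in this example can be sketched in a few lines. This is a simplified illustration, not the imbalanced-learn code: the function name `majority_vote_sample` is hypothetical, the neighbors are taken as given (the neighbor search is skipped), and the label column is left out because every neighbor is 'default'.

```python
import random
from collections import Counter


def majority_vote_sample(neighbors, rng):
    """Build one synthetic row from a list of neighboring rows.

    For each column, take the most frequent category among the
    neighbors; if several categories tie, pick one of them at random.
    """
    synthetic = []
    for column in zip(*neighbors):
        counts = Counter(column)
        top = max(counts.values())
        candidates = sorted(v for v, n in counts.items() if n == top)
        synthetic.append(rng.choice(candidates))
    return synthetic


# The three minority-class rows from the example (features only).
neighbors = [
    ["MALE", "ME", "RESEARCHER", "UK"],
    ["FEMALE", "BSc", "Admin staff", "NZ"],
    ["FEMALE", "MS", "Scientist", "sweden"],
]

new_row = majority_vote_sample(neighbors, random.Random(0))
print(new_row)  # gender is always FEMALE; the other columns are tie-breaks
```

So the new point is not a random shuffle of all values in the dataset: each feature is drawn only from the categories seen among the selected neighbors, with the most frequent one winning.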
