SMOTE for multi-class balance changes the shape of my dataset

I have a dataset of shape (430, 17) with 13 imbalanced classes and 17 features. The end goal is to build a NN, which works when I feed it the imbalanced dataset. However, when I over-sample the minority classes using SMOTE in a Jupyter notebook, the classes do get balanced but the shape of the data also changes.

from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import OneHotEncoder
from imblearn.pipeline import Pipeline

steps = [('onehot', OneHotEncoder()), ('smt', SMOTE())]
pipeline = Pipeline(steps=steps)

X_res, y_res = pipeline.fit_resample(X, y)

The shape of y_res is (754,), up from (430,) for y, so the upsampling works. Also, by checking:

import numpy as np

unique, counts = np.unique(y_res, return_counts=True)
print(np.asarray((unique, counts)).T)

the classes have been balanced. However, the shape of X_res is now (754, 5553), whereas X had shape (430, 17). If I then fit this data in my NN, it of course doesn't work, since the input_dim of my input layer no longer matches.

My question is: did the SMOTE procedure add not only rows to balance the classes but also columns? Shouldn't I get X_res with shape (754, 17)? Because I need this data for a NN, it has to be NumPy arrays rather than pandas DataFrames, which also makes it hard to see where the 5553 columns come from.

I am new to Python and Jupyter, so I do not know how to solve this, and I would really appreciate any help :)

Topic smote jupyter multiclass-classification neural-network python

Category Data Science


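Your understanding is correct: data balancing techniques like SMOTE only add or remove rows (data points), not columns (features). The extra dimensions almost certainly come from the OneHotEncoder step in your pipeline. It treats each of the 17 feature columns as categorical and creates one new column per distinct value in that column, which is how 17 columns can turn into 5553. If your features are already numeric, drop the 'onehot' step and pass X directly to SMOTE; X_res should then have shape (754, 17).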
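To see where a number like 5553 can come from, here is a minimal sketch that counts how many columns a per-column one-hot encoding would produce. The helper name one_hot_width and the demo arrays are made up for illustration; they are not part of imblearn or scikit-learn, and the random data is only a stand-in for your real X.

```python
import numpy as np


def one_hot_width(X):
    """Count the columns a per-column one-hot encoding of X would produce.

    Hypothetical helper for illustration: one output column is created
    for every distinct value found in each input column.
    """
    X = np.asarray(X)
    return sum(len(np.unique(X[:, j])) for j in range(X.shape[1]))


# Tiny example: column 0 has 3 distinct values, column 1 has 2.
X_small = np.array([[1, 10],
                    [2, 10],
                    [3, 20]])
print(one_hot_width(X_small))

# Synthetic stand-in for a (430, 17) matrix of continuous features.
# The values are random and only illustrate the effect.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(430, 17))
print(X_demo.shape)
print(one_hot_width(X_demo))  # far more than 17 columns
```

Running one_hot_width on your own X should tell you whether the one-hot step accounts for the 5553 columns. If some of your features really are categorical, imblearn also provides SMOTENC, which takes a categorical_features argument, as an alternative to one-hot encoding before SMOTE.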
