Should I resample my dataset?

The dataset I have is text data consisting of path names. I am using a TF-IDF vectorizer and decision trees. The classes in my dataset are severely imbalanced: there are a few big classes with more than 500 samples each and some minor classes with fewer than 100 samples, some even with fewer than 20. This is real-world data, so the chance of the model seeing a minority class in actual deployment is rare as well. The problem I am having right now is that the model predicts the majority class for minority-class samples most of the time, which keeps my accuracy at around 45%. If I resample the data, I expect the accuracy to get even worse, since the model's ability to learn the majority classes would be reduced.

So I would like to ask: should I consider resampling my data, or does anyone have suggestions on how to improve the accuracy of my model? Any help is appreciated.

Topic decision-trees class-imbalance

Category Data Science


There are several techniques for dealing with imbalanced datasets. You can take a look at the documentation of imbalanced-learn (a Python package) to explore several oversampling and undersampling techniques that may help you. These techniques rebalance the class distribution, increasing the relative representation of the minority classes so that the ML model does not simply predict that every record belongs to a majority class.
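As a minimal sketch, random oversampling with imbalanced-learn could look like the following. A synthetic dataset from make_classification stands in for your TF-IDF matrix, and the class proportions are made-up values for illustration:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Synthetic imbalanced data standing in for the question's TF-IDF features.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           weights=[0.8, 0.15, 0.05], random_state=42)
print("Class counts before:", Counter(y))

# Randomly duplicate minority-class rows until every class is the same size.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print("Class counts after: ", Counter(y_res))
```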

When dealing with decision trees (and therefore with random forests) it is also possible to define class weights, so that wrong predictions on minority-class records are penalized more heavily. The ML model is therefore "forced to pay attention" to the minority-class records. With scikit-learn in Python, this can be done by setting the class_weight parameter, for example to the output of the compute_class_weight function.
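A minimal sketch of that, again using synthetic data in place of your TF-IDF features:

```python
import numpy as np

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data standing in for the question's TF-IDF features.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           weights=[0.8, 0.15, 0.05], random_state=42)

# "balanced" weights are inversely proportional to class frequencies.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The tree now pays a higher cost for misclassifying rare classes.
# (Passing class_weight="balanced" directly is an equivalent shortcut.)
clf = DecisionTreeClassifier(class_weight=dict(zip(classes, weights)),
                             random_state=42)
clf.fit(X, y)
```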


Resampling is generally a good idea when dealing with heavily imbalanced datasets. I would recommend SMOTE, which oversamples the minority classes by synthesizing new samples (imbalanced-learn also provides combined over- and undersampling variants such as SMOTEENN and SMOTETomek); if your dataset is small, oversampling is usually the safer choice. Another thing you could try is using class weights when training: models like LightGBM and neural networks can use weighted training losses to force the model to pay attention to under-represented classes. I have personally dealt with such cases and these methods definitely work.
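A minimal SMOTE sketch with imbalanced-learn, again on synthetic stand-in data. One caveat relevant to your classes with fewer than 20 samples: k_neighbors must be smaller than the smallest class size.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data standing in for the question's TF-IDF features.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           weights=[0.8, 0.15, 0.05], random_state=42)
print("Class counts before:", Counter(y))

# SMOTE interpolates between a sample and its nearest same-class neighbours,
# so k_neighbors must be smaller than the smallest class size (here 30).
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("Class counts after: ", Counter(y_res))
```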
