Should I resample my dataset?
My dataset is text data consisting of path names. I am using a TF-IDF vectorizer and decision trees. The classes are severely imbalanced: a few large classes have more than 500 samples each, while the minor classes have fewer than 100, and some have fewer than 20. This is real collected data, so the model will rarely see the minor classes in actual use either. The problem is that the model predicts the minor classes as a major class most of the time, which keeps my accuracy at around 45%. If I resample the data, I think the accuracy will get even worse, because the model's ability to learn the major classes will be reduced.
So I would like to ask whether I should consider resampling my data, or whether anyone has suggestions on how to improve the accuracy of my model. Any help is appreciated.
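To make the symptom concrete, here is a minimal sketch in plain Python of how overall accuracy can be compared with per-class recall when minor classes get predicted as a major class. The labels are made-up toy values rather than my real data, and `per_class_recall` is a helper name chosen only for illustration.

```python
from collections import Counter, defaultdict


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that appears in y_true."""
    total = Counter(y_true)          # number of true samples per class
    correct = defaultdict(int)       # number of correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct[t] += 1
    return {label: correct[label] / n for label, n in total.items()}


# Hypothetical toy labels: one large class and two small ones.
y_true = ["big"] * 8 + ["small_a"] * 2 + ["small_b"] * 2
# Hypothetical predictions that collapse every sample onto the large class.
y_pred = ["big"] * 12

# Overall accuracy is dominated by the large class.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall shows which classes are never predicted correctly.
recalls = per_class_recall(y_true, y_pred)

# Balanced accuracy here is the unweighted mean of the per-class recalls.
balanced_accuracy = sum(recalls.values()) / len(recalls)

print(f"accuracy: {accuracy:.3f}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
for label, r in sorted(recalls.items()):
    print(f"recall[{label}]: {r:.2f}")
```

In this toy case the overall accuracy looks much better than the per-class picture, since both small classes have zero recall. Comparing the two numbers before and after resampling might be a more informative check than overall accuracy alone.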
Topic decision-trees class-imbalance
Category Data Science