Low classification accuracy

I want to do a multi-class classification with 6 classes. The whole dataset has 12750 samples and 56 features, so every class has 2125 samples. Before prediction I reduced the number of outliers by winsorization (at the 1st and 99th percentiles) and I reduced skewness in every feature whose skewness was greater than 1 or less than -1 with the Yeo-Johnson transformation, which gave me this dataset:

https://i.stack.imgur.com/miy8i.png
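For reference, the preprocessing described above can be sketched like this with scipy and scikit-learn (a minimal sketch on a synthetic skewed dataset, not your actual data):

```python
import numpy as np
from scipy.stats.mstats import winsorize
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 3))  # heavily skewed toy features

# Winsorize each feature at the 1st and 99th percentiles
X_w = np.column_stack(
    [winsorize(X[:, j], limits=(0.01, 0.01)) for j in range(X.shape[1])]
)

# Yeo-Johnson transform to reduce skewness
pt = PowerTransformer(method="yeo-johnson")
X_t = pt.fit_transform(X_w)
```

In practice the transform would only be applied to the features that exceed the skewness threshold, and (as noted in the answer below about leakage) it should be fitted on the training split only.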

Later, of course, I split the dataset into 80% training data and 20% test data and standardised the training data. I tried random forest, XGBoost and decision tree classifiers, but I get almost 100% accuracy on the training set and 20-21% accuracy on the test set. Measures like increasing n_estimators don't help.
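The workflow described (80/20 split, scaler fitted on the training set only, then a tree ensemble) looks roughly like this sketch, using a synthetic stand-in for the 12750x56 dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the real dataset (6 classes, 56 features)
X, y = make_classification(n_samples=1200, n_features=56, n_informative=10,
                           n_classes=6, random_state=0)

# 80/20 split; stratify keeps the class balance identical in both sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
train_acc, test_acc = clf.score(X_tr, y_tr), clf.score(X_te, y_te)
```

Comparing `train_acc` and `test_acc` is exactly the train/test gap the question is about.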

So, my questions are:

How can I reduce this overfitting? Is it a problem with the dataset (should I reduce the number of features, or something like that?) or with the classifiers (are they too weak for this problem?)

Is the dataset too small for this problem (should I add more samples with a method like SMOTE?)? Do the classes have too few samples to work well?

Is it possible to get at least 60% accuracy after tuning hyperparameters (e.g. with GridSearchCV)?

P.S. I will add that the correlations with the target value are very weak (at most about ±6%) and the feature importances from the random forest range from 0.0 to 0.03. I don't know if this is a normal situation.

P.S.2 I tried changing the n_estimators parameter (values from 5 to 1500) and max_depth (from 1 to 100) and I see very little change in test accuracy (about ±3%).

Topic overfitting prediction multiclass-classification classification machine-learning

Category Data Science


A few comments:

  • First, you say that every class has 2125 samples: I'm surprised that it's perfectly balanced. Is this the real distribution of the data? It would be a mistake to artificially balance the classes, especially before the split between training and test set.
  • In principle the preprocessing should be done based only on the training set, but this is unlikely to cause a serious problem here.
  • Overfitting happens when the model captures things which happen by chance in the training data instead of real patterns, either because it doesn't have enough instances or because it's trying to find overly subtle patterns. Generating artificial instances doesn't always work, because it distorts the distribution of the data. So the simplest solution is to make the model simpler: reduce the number of features, reduce the number of parameters (e.g. the depth of the trees), etc.
  • Keep in mind that a random baseline classifier would obtain only 16.6% accuracy with 6 classes. So 20% is not that bad; it depends on how difficult the task is, i.e. how much the features help to determine the target. There's no guarantee at all that 60% can be reached with any data and any task. However, it should at least be possible not to overfit the model, i.e. to obtain fairly similar performance on the training and test sets.

There are a few things I can think of:

  1. Did you use stratified split for train/test sets?
  2. Try using a validation set and doing early stopping based on the validation loss.
  3. Try using N-fold cross validation.
  4. Add regularization to your models and use GridSearchCV to play with their parameters.
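Points 1, 3 and 4 can be combined in one GridSearchCV run over a random forest's regularizing parameters with stratified folds (a sketch on synthetic data; the grid values are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=6, random_state=0)

# parameters that constrain the trees and hence reduce overfitting
param_grid = {"max_depth": [3, 5, 10],
              "min_samples_leaf": [1, 5, 20],
              "max_features": ["sqrt", 0.5]}

# stratified folds keep the 6 classes balanced in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)
```

`search.best_params_` and `search.best_score_` then tell you how much regularization helps on held-out folds, which is a more honest estimate than a single train/test split.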

It is difficult to say anything more in particular without looking at your data.
