Is my model overfitting? Training accuracy 93%, test accuracy 82%

I am using an LGBM model for binary classification. After hyper-parameter tuning I get:

Training accuracy: 0.9340
Test accuracy: 0.8213

Can I say my model is overfitting? Or is this gap acceptable in the industry?

Also, to add to this: when I increase num_leaves for the same model, I am able to achieve:

Training accuracy: 0.8675
Test accuracy: 0.8137

Which of these results is acceptable and can be reported?

Topic lightgbm overfitting machine-learning

Category Data Science


You can perform cross-validation and observe the accuracy you get in the different folds. If the fold scores differ a lot, your model is overfitting; if they are close, your model is fine.
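A minimal sketch of that check, using scikit-learn's `cross_val_score`. A `DecisionTreeClassifier` on a synthetic dataset stands in here; in practice you would pass your own `LGBMClassifier` and data instead.

```python
# Sketch: k-fold cross-validation to inspect the accuracy spread across folds.
# The classifier and dataset below are stand-ins for your LGBMClassifier and data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
model = DecisionTreeClassifier(random_state=42)  # substitute: lightgbm.LGBMClassifier()

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("fold accuracies:", np.round(scores, 3))
print("mean / std:", scores.mean(), scores.std())
# A large standard deviation across folds means the model is very sensitive
# to which samples it trained on, which points toward overfitting.
```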


This question cannot be answered from the two accuracy numbers alone; there are other sides to consider.

What about your data? Is it balanced or not? What type of problem is it: binary or multi-class classification?

If you have a binary classification problem with imbalanced classes, say 95% of samples in the majority class and 5% in the other, then this accuracy is bad: a dumb classifier that always predicts the majority class already achieves 0.95 accuracy. In that case accuracy isn't the right metric to evaluate your model.
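The majority-class baseline is easy to verify directly. A tiny illustration with the 95/5 split described above:

```python
# Illustration: on a 95/5 imbalanced dataset, a "classifier" that always
# predicts the majority class already reaches 95% accuracy.
import numpy as np

y_true = np.array([0] * 95 + [1] * 5)   # 95% majority class, 5% minority class
y_pred = np.zeros_like(y_true)          # dumb classifier: always predict class 0

accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.95
```

Any real model therefore has to beat 0.95 on this data before its accuracy means anything.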

If your problem has 10 balanced classes, then 0.8+ accuracy is a nice result (compared to the 0.1 expected from random guessing).

Also, to judge whether your model is overfitting, the best thing to watch isn't accuracy (or recall, precision, etc.) but the validation loss, since that is what you actually minimize during training. If the training loss keeps decreasing while the validation loss starts to increase, your model is overfitting: it is no longer learning, just memorizing the training set.
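A sketch of that diagnostic: track training and validation log-loss per boosting round and find where the validation loss bottoms out. Scikit-learn's `GradientBoostingClassifier` with `staged_predict_proba` stands in for LightGBM here (LightGBM itself reports the same curves via `eval_set` and early stopping).

```python
# Sketch: per-round training vs validation log-loss for a boosted model.
# GradientBoostingClassifier is a stand-in for LightGBM's boosting rounds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)

# Loss after each boosting round, on both splits.
train_loss = [log_loss(y_tr, p) for p in model.staged_predict_proba(X_tr)]
val_loss = [log_loss(y_val, p) for p in model.staged_predict_proba(X_val)]

best_round = int(np.argmin(val_loss))
print("best validation round:", best_round)
# If train_loss keeps falling while val_loss rises after best_round,
# the later rounds are memorizing the training set: overfitting.
```

In LightGBM you would get the same effect by passing a validation set and an early-stopping callback, so training stops near the best round automatically.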


Just asking whether you have constrained the parameters to proper ranges: gamma, max depth, min child weight, subsample, objective, learning rate, n estimators. Each of these should be given a search range only after you understand what it does and how it impacts the model.
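One way to search such ranges is a randomized search with cross-validation. The ranges below are illustrative assumptions, not recommended values, and `GradientBoostingClassifier` again stands in for your LGBM model (LightGBM's scikit-learn API accepts analogous parameters such as `num_leaves` and `min_child_samples`).

```python
# Sketch: randomized search over illustrative (assumed) parameter ranges.
# Swap GradientBoostingClassifier for your LGBMClassifier in practice.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=1)

param_distributions = {
    "max_depth": [3, 5, 7],            # deeper trees fit more, risk overfitting
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 200],
    "subsample": [0.7, 0.9, 1.0],      # row subsampling acts as regularization
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_distributions,
    n_iter=5,          # small budget for the sketch; increase in practice
    cv=3,
    random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```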
