Why is the accuracy of my bagging model heavily affected by the random state?

The accuracy of my bagging decision tree model reaches 97% when I set the random seed to 5, but drops to only 92% when I set the random seed to 0. Can someone explain this large gap, and should I report the highest accuracy in my research paper or take the average with random seed=None?

Topic: bagging, random-forest, classification, machine-learning

Category: Data Science


Can someone explain this large gap

It simply means that there is quite high variance depending on which random set of instances is picked. How many times do you resample the instances in the bagging process? Increasing the number of bagged estimators will probably decrease the variance, as the sketch below illustrates. As mentioned in a comment, the most common reason for variance in performance is a sample that is too small (and/or too many features or classes), which is likely to cause your models to overfit.
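A minimal sketch of this effect, assuming scikit-learn and a stand-in dataset (your own X, y would replace it): it measures how much the test accuracy of a bagged decision tree moves across several seeds, and how that spread shrinks as the number of bagged estimators grows.

```python
# Sketch: variance of a bagged decision tree's accuracy across random seeds,
# for a small vs. larger number of bagged estimators.
import numpy as np
from sklearn.datasets import load_breast_cancer  # stand-in dataset for illustration
from sklearn.ensemble import BaggingClassifier   # default base estimator is a decision tree
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

for n_estimators in (10, 100):
    scores = []
    for seed in range(10):  # try several seeds instead of relying on a single one
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed)
        model = BaggingClassifier(n_estimators=n_estimators, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    print(f"{n_estimators} estimators: "
          f"mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
```

With more estimators the standard deviation across seeds typically drops, which is exactly the variance-reduction effect bagging is meant to provide.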

and should I report the highest accuracy in my research paper or take the average with random seed=None?

Never use the highest performance across random runs: that is cherry-picking and it doesn't reflect the true performance. The ability to set a random seed exists for reproducibility purposes, and selecting the seed which gives the best result is the opposite of the principle of reproducibility.

Since you're using bagging, you should decrease the variance (that's the whole point) by increasing the number of runs. If you can't do that for some reason, then don't use bagging: simply repeat the regular split-train-evaluate process $N$ times (with a fixed training/testing proportion) or use cross-validation, and report the average performance (preferably with its variance as well, e.g. the standard deviation), as in the sketch below.
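A minimal sketch of the reporting procedure, again assuming scikit-learn and a stand-in dataset: run cross-validation and report the mean accuracy together with its standard deviation, rather than a single seed-dependent number.

```python
# Sketch: report mean +/- standard deviation over cross-validation folds
# instead of the accuracy obtained with one cherry-picked random seed.
from sklearn.datasets import load_breast_cancer  # stand-in dataset for illustration
from sklearn.ensemble import BaggingClassifier   # bagged decision trees by default
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

scores = cross_val_score(BaggingClassifier(n_estimators=100), X, y, cv=10)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread over folds (or over $N$ repeated splits) gives a figure that does not depend on any particular random seed.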
