Is data subsampling appropriate for hyperparameter optimisation?

Fundamentally, under what circumstances is it reasonable to do HPO only on a subsample of the training set?

I am using Population Based Training to optimise hyperparameters for a sequence model. My dataset consists of 20M sequences, and I was wondering if it would make sense to optimise on a subsample due to a restricted budget.

Tags: hyperparameter-tuning, deep-learning, neural-network

Category: Data Science


Your subsample has to be representative of your original dataset.

To do so, since you are in a supervised setting, I would draw a random subsample that preserves the class distribution, i.e. stratified sampling (for instance, randomly taking 40% of each class).

Note: if some classes have very few examples, I would not subsample them at all. The risk is that even with random sampling you can lose information when a class is too small. And if your constraint is computation time, keeping the small classes intact while subsampling the large ones costs you almost nothing.
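For concreteness, here is a minimal sketch of that sampling scheme in Python/NumPy. The function name, the 40% rate, and the `min_class_size` threshold are illustrative assumptions, not values from the original answer:

```python
import numpy as np

def stratified_subsample(y, frac=0.4, min_class_size=1000, seed=0):
    """Return indices of a subsample that preserves the class distribution.

    Classes with at most `min_class_size` examples are kept in full;
    larger classes are randomly subsampled at rate `frac`.
    Both parameter values here are illustrative, not recommendations.
    """
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        if len(idx) <= min_class_size:
            keep.append(idx)  # small class: keep every example
        else:
            n = int(len(idx) * frac)  # large class: sample frac of it
            keep.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(keep)

# Hypothetical usage: y holds the class labels of the 20M sequences.
# sub_idx = stratified_subsample(y, frac=0.4)
# Run PBT on the subset indexed by sub_idx instead of the full dataset.
```

You would then run the PBT search on this subset only, and retrain the final model on the full 20M sequences with the winning hyperparameters.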
