Is data subsampling appropriate for hyperparameter optimisation?

Fundamentally, under what circumstances is it reasonable to do HPO only on a subsample of the training set?

I am using Population Based Training to optimise hyperparameters for a sequence model. My dataset consists of 20M sequences, and I was wondering if it would make sense to optimise on a subsample due to a restricted budget.

Tags: hyperparameter-tuning, deep-learning, neural-network

Category: Data Science


Your subsample has to be representative of your original dataset.

To do so, since you are in a supervised setting, I would draw a random subsample that preserves the class distribution, i.e. stratified sampling (for instance, randomly taking 40% of each class).

Note: if some classes have very few examples, I would not subsample them at all. The risk is that even with random sampling you can lose information when a class is too small. And if your constraint is computation time, keeping the small classes intact while subsampling the large ones costs you almost nothing.
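For concreteness, here is a minimal sketch of that sampling scheme in Python/NumPy. The function name, the 40% rate, and the `min_class_size` threshold are illustrative assumptions, not values from the original answer:

```python
import numpy as np

def stratified_subsample(y, frac=0.4, min_class_size=1000, seed=0):
    """Return indices of a subsample that preserves the class distribution.

    Classes with at most `min_class_size` examples are kept in full;
    larger classes are randomly subsampled at rate `frac`.
    Both parameter values here are illustrative, not recommendations.
    """
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        if len(idx) <= min_class_size:
            keep.append(idx)  # small class: keep every example
        else:
            n = int(len(idx) * frac)  # large class: sample frac of it
            keep.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(keep)

# Hypothetical usage: y holds the class labels of the 20M sequences.
# sub_idx = stratified_subsample(y, frac=0.4)
# Run PBT on the subset indexed by sub_idx instead of the full dataset.
```

You would then run the PBT search on this subset only, and retrain the final model on the full 20M sequences with the winning hyperparameters.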
