Subsampling the "right" amount of data to train an ML model
I am training a machine learning model (i.e., a classifier) on a large dataset. I know that I can get the same results using less data (about 30% of the full dataset), but I would like to avoid the trial-and-error process of finding the 'right' amount of data to retain.
Of course I could write a script that automatically tries different thresholds, but I was wondering whether there is a principled way of doing this. It seems strange that nobody has ever developed a proper solution, since this appears to be a very common problem.
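For concreteness, the brute-force loop I want to avoid looks roughly like the sketch below (synthetic data and logistic regression are just placeholders for my actual dataset and classifier): train on increasing fractions of the data and stop when the validation score plateaus.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder dataset standing in for my real data.
X = np.random.rand(10_000, 20)
y = np.random.randint(0, 2, size=10_000)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Trial-and-error loop: retrain on growing fractions of the training set
# and watch when the validation score stops improving.
for frac in [0.05, 0.1, 0.2, 0.3, 0.5, 1.0]:
    n = int(frac * len(X_train))
    clf = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    score = accuracy_score(y_val, clf.predict(X_val))
    print(f"{frac:.0%} of the data -> validation accuracy {score:.3f}")
```

This works, but it requires many training runs and a fixed grid of thresholds, which is exactly what I would like to avoid.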
Some additional criteria:
- I am subsampling a stream of data, so it would be better to find something that works in this setting (see the sketch after this list for what I mean by subsampling a stream)
- I would prefer to avoid training the classifier more than once, since training takes some time
- I would appreciate theoretically justified approaches
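By "subsampling a stream" I mean something along the lines of reservoir sampling, i.e. keeping a fixed-size uniform sample of the items seen so far; the sample size `k` is exactly the quantity I would like to choose in a principled way. A minimal sketch:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace a stored element with probability k / (i + 1), which keeps
            # every item seen so far equally likely to be in the sample.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: retain 1,000 items out of a stream of 100,000.
sample = reservoir_sample(range(100_000), k=1_000)
```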
Any suggestions or references?
Topic: sampling, classification, bigdata
Category: Data Science