Subsampling the “right” amount of data to train an ML model

I am training a machine learning model (i.e., a classifier) on a large dataset. I know that I can get the same results using only a fraction of the data (about 30% of it), but I would like to avoid the trial-and-error process of finding the 'right' amount of data to retain from the dataset.

Of course I could write a script that automatically tries different thresholds, but I was wondering whether there is any principled way of doing this. It seems strange that nobody has ever tried to create a proper solution, since this seems to be a very common problem.
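
To make that concrete, here is a minimal sketch of the kind of script I mean, using scikit-learn's learning_curve to sweep over training-set fractions; the classifier and the random data below are only placeholders for my real setup:

```python
# Sweep over training-set fractions and watch where the validation score
# plateaus. Everything here (data, classifier, metric) is a stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)  # placeholder data

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5, scoring="accuracy",
)
for n, s in zip(sizes, val_scores.mean(axis=1)):
    print(f"{int(n):5d} training samples -> mean CV accuracy {s:.3f}")
```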

Some additional criteria:

  • I am subsampling from a stream of data, so it would be better to find something that works in this setting (see the sketch just after this list)
  • I would prefer to avoid training the classifier more than once since it takes some time
  • I appreciate theoretically justified approaches
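
For the streaming case in particular, the one-pass building block I know of is reservoir sampling, which keeps a uniform random sample of fixed size k from a stream of unknown length. A minimal sketch follows (k and the stream are placeholders, and it still leaves open how large k should be, which is exactly my question):

```python
# Reservoir sampling (Algorithm R): maintain a uniform sample of size k
# over a stream, in one pass, without knowing the stream length upfront.
import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive draw from 0..i
            if j < k:
                reservoir[j] = item     # keep item with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(1_000_000), k=10_000)  # placeholder stream and k
print(len(sample))  # 10000
```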

Any suggestions or references?

Topic: sampling, classification, bigdata

Category: Data Science


There are many rule-based sampling techniques out there, as opposed to just sampling randomly and hoping it works. The idea is to sample proportionately so that the model does not learn spurious biases and does not leave out minority classes. Hope this helps!
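
For example, one simple rule-based scheme is a stratified (proportionate) subsample that preserves the class mix of the full dataset. A rough sketch with scikit-learn; the data and the 30% retention rate below are just placeholders:

```python
# Keep 30% of the data while preserving the original class proportions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(10_000, 5)                                     # placeholder features
y = np.random.choice([0, 1, 2], size=10_000, p=[0.7, 0.2, 0.1])   # imbalanced labels

X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=0.3, stratify=y, random_state=42
)
print(np.bincount(y_sub) / len(y_sub))  # class mix stays close to 0.7 / 0.2 / 0.1
```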

PS: Some of the articles on these techniques will prompt you to sign in after a few free reads. Just open them in incognito! ;)

Cheers!


Every machine learning problem is different, so there is no standard answer to your question. For the problem you're working on, maybe a 70-30 train-test split results in an optimal model that performs equally well on the test dataset, whereas for another problem that ratio just won't do the model any justice. It's all about experimentation.

While training your model, you are essentially trying to teach it the relations and dependencies among attributes that help it draw a clearer decision boundary between the classes of your response variable.

Too little training data may not achieve this, as the model may not learn the underlying structure of the data and hence how the attributes are linked to the response variable; too much training data can have adverse effects too, as you may then overfit. I would recommend starting from a 50-50 split, recording the model's performance on the metric you chose, and then repeating the exercise for 60-40, 70-30, and so on.
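
A rough sketch of that experiment; the classifier, data and metric below are only placeholders for your own:

```python
# Try several train/test splits and record the score for each one.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = np.random.rand(2000, 10), np.random.randint(0, 2, 2000)  # placeholder data

for train_frac in (0.5, 0.6, 0.7, 0.8):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_frac, stratify=y, random_state=0
    )
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    score = accuracy_score(y_te, model.predict(X_te))
    print(f"{round(train_frac * 100)}-{round((1 - train_frac) * 100)} split: accuracy {score:.3f}")
```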
