Where and how to do large scale supervised machine learning?

I'm a beginner in ML and I have a large dataset with 15 features and 6M rows, so it is challenging to work on it locally. I can train one model locally, but when I try hyperparameter tuning and cross-validation on my MacBook Pro, it runs out of memory and lacks the processing speed and capacity. I tried Spark but it gave poor results, so I would prefer the native Python ecosystem of pandas and sklearn.

So I want to know what my options are. How do professionals do it? Should I provision a VM in the cloud with high memory and CPU, or are there other cloud-based or SaaS platforms that I can check out?

Topic cloud supervised-learning pyspark random-forest scalability

Category Data Science


First, when working with big data it is usually more convenient to work with a random subset rather than the whole thing: during the design and testing stages there is rarely any need for the full data, since optimal performance is not needed yet.
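As a minimal sketch of the subsetting idea, assuming the data fits in a pandas DataFrame: the synthetic frame, the column names `f0`..`f14` and `label`, the helper `sample_rows` and the 5% fraction are all illustrative assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd


def sample_rows(df: pd.DataFrame, frac: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Return a reproducible random subset of the rows of df (hypothetical helper)."""
    return df.sample(frac=frac, random_state=seed)


# Synthetic stand-in for the real data: 15 numeric features plus a binary label.
rng = np.random.default_rng(0)
full = pd.DataFrame(
    rng.normal(size=(100_000, 15)),
    columns=[f"f{i}" for i in range(15)],
)
full["label"] = (full["f0"] + full["f1"] > 0).astype(int)

# Keep 5% of the rows for prototyping, tuning and debugging.
subset = sample_rows(full, frac=0.05)
print(subset.shape)  # (5000, 16)
```

Fixing the seed keeps the subset identical between runs, which makes experiments comparable while the pipeline is still being designed.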

Second, it's often useful to do an ablation study on the training set size in order to check that using the full data actually helps the model. Sometimes training the model on a subset gives the same results as training on all the available data, in which case there's no advantage in using all of it.
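One possible way to run this check with sklearn is a learning curve, which retrains the model on growing fractions of the data and reports the cross-validated score for each. The sketch below is only an illustration: the synthetic dataset, the random forest, the chosen fractions and the 3-fold setting are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the real data (15 features, as in the question).
X, y = make_classification(
    n_samples=5000, n_features=15, n_informative=8, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Train on 10%, 25%, 50% and 100% of each training fold and score on the held-out fold.
train_sizes, train_scores, val_scores = learning_curve(
    model,
    X,
    y,
    train_sizes=[0.1, 0.25, 0.5, 1.0],
    cv=3,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

mean_val = val_scores.mean(axis=1)
for n, score in zip(train_sizes, mean_val):
    print(f"{int(n):>6d} training rows -> mean CV accuracy {score:.3f}")
```

If the validation score stops improving well before the largest size, a subset of roughly that size is probably enough for this model.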

Finally, there are indeed cases where one needs to process a large dataset or run a long training process that cannot be done on a regular computer. There are various options depending on the environment:

  • Buy the required hardware (it's rarely the best option but it needs to be mentioned)
  • Use a commercial cloud service such as AWS
  • Some organizations have their own in-house computing servers/clusters. In particular, if you're a student it's likely that you have access to this kind of service through your university, so ask around (afaik most decent universities provide it nowadays).
