What is the easiest way to scale a data science project based on scikit stack?

This is an issue for all Data Scientists who have worked with this stack:

  • python
  • scikit-learn
  • scipy-stats
  • matplotlib
  • etc.

We are looking for ways to scale a project that is already implemented in the aforementioned stack to very large datasets with the minimum amount of work.

Counter-examples would be rewriting everything in the TensorFlow framework or switching to industry tools unrelated to Python.

Topic: scalability, big data

Category: Data Science


The easiest way (depending on the scale we're talking about) is to set n_jobs=-1 for algorithms that support parallelization (e.g. random forests, cross-validation, grid search). This takes advantage of all the cores on your machine. If that's not good enough, you should probably move to Spark.
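As a rough sketch of what that looks like (the dataset and parameter grid below are illustrative placeholders, not from the question):

```python
# Sketch: parallelizing training and grid search across all cores with n_jobs=-1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

# The forest fits its trees in parallel...
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)

# ...and the grid search evaluates parameter candidates in parallel as well.
grid = GridSearchCV(
    rf,
    param_grid={"max_depth": [None, 10, 20]},
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```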


You generally don't. Scikit-learn is primarily aimed at helping new data scientists get comfortable with data science quickly.

That being said, some strategies for scaling are discussed here: http://scikit-learn.org/stable/modules/scaling_strategies.html

This includes using out-of-core models, reducing data size with PCA, and various incremental learners.
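For instance, a minimal sketch of out-of-core learning with partial_fit, assuming you can stream your data in chunks (the chunk generator below is a placeholder for however you read from disk, e.g. pandas.read_csv with chunksize):

```python
# Sketch: incremental (out-of-core) training on data that doesn't fit in memory.
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])          # all class labels must be declared up front
clf = SGDClassifier(random_state=0)

def iter_chunks(n_chunks=10, chunk_size=1_000, n_features=20):
    # Placeholder: replace with your real data loader (e.g. reading CSV chunks).
    rng = np.random.RandomState(0)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, n_features)
        y = (X[:, 0] > 0).astype(int)
        yield X, y

for X_chunk, y_chunk in iter_chunks():
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```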

Besides that, your best bet is to use a beefier computer.

Also, remember that once a model is trained, it can be pickled and shared. Training and testing are usually the time- and CPU-consuming steps, so once you have a model, you should be able to run it on machines less beefy than the training machine.
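A quick sketch of that workflow, using joblib (which the scikit-learn docs recommend for persisting estimators); the model, file name and data are illustrative:

```python
# Sketch: persist a trained model so scoring can happen on a smaller machine.
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1_000, random_state=0)
model = LogisticRegression(max_iter=1_000).fit(X, y)

dump(model, "model.joblib")      # on the beefy training machine

model = load("model.joblib")     # later, on a modest inference machine
print(model.predict(X[:5]))
```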
