What is the easiest way to scale a data science project based on scikit stack?

This is an issue for all Data Scientists who have worked with this stack:

  • python
  • scikit-learn
  • scipy-stats
  • matplotlib
  • etc.

We are looking for ways to scale a project that is already implemented in the aforementioned stack to very large datasets with the minimum amount of work.

Counter-examples would be rewriting everything in the TensorFlow framework or switching to industry tools unrelated to Python.

Topic: scalability, big data

Category: Data Science


The easiest way (depending on the scale we're talking about) is to set n_jobs=-1 for algorithms that support parallelization (e.g. random forests, cross-validation, grid search). This takes advantage of all the cores on your machine. If that's not good enough, you should probably move to Spark.
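As a rough sketch of what that looks like (the dataset and parameter grid below are illustrative placeholders, not from the question):

```python
# Sketch: parallelizing training and grid search across all cores with n_jobs=-1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

# The forest fits its trees in parallel...
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)

# ...and the grid search evaluates parameter candidates in parallel as well.
grid = GridSearchCV(
    rf,
    param_grid={"max_depth": [None, 10, 20]},
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```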


You generally don't. Scikit-learn is primarily aimed at helping new data scientists get comfortable with data science quickly.

That being said, some strategies for scaling are discussed here: http://scikit-learn.org/stable/modules/scaling_strategies.html

This includes using out-of-core models, reducing data size with PCA, and various incremental learners.
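For instance, a minimal sketch of out-of-core learning with partial_fit, assuming you can stream your data in chunks (the chunk generator below is a placeholder for however you read from disk, e.g. pandas.read_csv with chunksize):

```python
# Sketch: incremental (out-of-core) training on data that doesn't fit in memory.
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])          # all class labels must be declared up front
clf = SGDClassifier(random_state=0)

def iter_chunks(n_chunks=10, chunk_size=1_000, n_features=20):
    # Placeholder: replace with your real data loader (e.g. reading CSV chunks).
    rng = np.random.RandomState(0)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, n_features)
        y = (X[:, 0] > 0).astype(int)
        yield X, y

for X_chunk, y_chunk in iter_chunks():
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```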

Besides that, your best bet is to use a beefier computer.

Also, remember that once a model is trained, it can be pickled and shared. Training and testing are usually the time- and CPU-consuming steps, so once you have a model, you should be able to run it on machines less beefy than the training machine.
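A quick sketch of that workflow, using joblib (which the scikit-learn docs recommend for persisting estimators); the model, file name and data are illustrative:

```python
# Sketch: persist a trained model so scoring can happen on a smaller machine.
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1_000, random_state=0)
model = LogisticRegression(max_iter=1_000).fit(X, y)

dump(model, "model.joblib")      # on the beefy training machine

model = load("model.joblib")     # later, on a modest inference machine
print(model.predict(X[:5]))
```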
