PySpark DataFrames to pandas and ML ops - does parallel execution hold?

If I convert a Spark DataFrame into a pandas DataFrame and subsequently apply pandas operations and sklearn models to the dataset in Databricks, will the pandas and sklearn operations be distributed across the cluster? Or do I have to use PySpark DataFrame operations and PySpark ML packages for the operations to be distributed?

Topic pyspark apache-spark pandas dataset machine-learning

Category Data Science


Short answer: NO.

The moment you convert the Spark DataFrame into a pandas DataFrame, the data is collected onto the driver node, and all subsequent operations (pandas, sklearn, etc.) run on that single machine rather than being distributed across the cluster. pandas and scikit-learn are not Spark-aware: they may use multiple cores locally, but they cannot spread work over the cluster's workers. In a nutshell, someone would have to rewrite all of sklearn to make it Spark-compatible.
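To illustrate where the boundary sits, here is a minimal sketch (the table and column names are hypothetical). Everything before `toPandas()` is distributed Spark work; everything after it runs locally on the driver:

```python
from sklearn.linear_model import LogisticRegression

# Distributed: this filter is executed in parallel across the cluster.
# (In a Databricks notebook, `spark` is the predefined SparkSession.)
spark_df = spark.table("features").filter("label IS NOT NULL")

# Collection point: toPandas() pulls the *entire* dataset into the
# driver's memory as a single pandas DataFrame.
pdf = spark_df.toPandas()

# Local: from here on, pandas and sklearn run only on the driver node.
X, y = pdf.drop(columns=["label"]), pdf["label"]
model = LogisticRegression().fit(X, y)
```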

That said, there has been some progress toward distributing scikit-learn workloads on a Spark cluster. Some expensive functionality, such as grid search, has been reimplemented on top of Spark and can be used together with native sklearn in Databricks; see this post, or the Joblib Apache Spark Backend (previously known as spark-sklearn). This blog post is also worth reading.
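As a concrete sketch of the Joblib Spark backend route (this assumes the `joblibspark` package is installed on the cluster), the cross-validated grid search is fanned out over Spark tasks while the estimator itself stays plain sklearn:

```python
from joblib import parallel_backend
from joblibspark import register_spark
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn import datasets

register_spark()  # make the "spark" backend available to joblib

X, y = datasets.load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=3)

# Each of the cv x |grid| model fits is dispatched as a Spark task
# instead of running serially on the driver.
with parallel_backend("spark", n_jobs=8):
    search.fit(X, y)

print(search.best_params_)
```

Note that only the independent fits are parallelized this way; training a single large model still happens on one node.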
