PySpark DataFrames to pandas and ML ops - does parallel execution hold?

If I convert a Spark DataFrame into a pandas DataFrame and subsequently apply pandas operations and sklearn models to the dataset in Databricks, will the pandas and sklearn operations be distributed across the cluster? Or do I have to use PySpark DataFrame operations and PySpark ML packages for the operations to be distributed?

Topic pyspark apache-spark pandas dataset machine-learning

Category Data Science


Short answer: NO.

The moment you convert the Spark DataFrame into a pandas DataFrame, the data is collected onto the driver node, and all subsequent operations (pandas, sklearn, etc.) run on that single machine rather than being distributed across the cluster. pandas and scikit-learn are not Spark-aware: they may use multiple cores locally, but they cannot spread work over the cluster's workers. In a nutshell, someone would have to rewrite all of sklearn to make it Spark-compatible.
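To illustrate where the boundary sits, here is a minimal sketch (the table and column names are hypothetical). Everything before `toPandas()` is distributed Spark work; everything after it runs locally on the driver:

```python
from sklearn.linear_model import LogisticRegression

# Distributed: this filter is executed in parallel across the cluster.
# (In a Databricks notebook, `spark` is the predefined SparkSession.)
spark_df = spark.table("features").filter("label IS NOT NULL")

# Collection point: toPandas() pulls the *entire* dataset into the
# driver's memory as a single pandas DataFrame.
pdf = spark_df.toPandas()

# Local: from here on, pandas and sklearn run only on the driver node.
X, y = pdf.drop(columns=["label"]), pdf["label"]
model = LogisticRegression().fit(X, y)
```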

That said, there has been some progress toward distributing scikit-learn workloads on a Spark cluster. Some expensive functionality, such as grid search, has been reimplemented on top of Spark and can be used together with native sklearn in Databricks; see this post, or the Joblib Apache Spark Backend (previously known as spark-sklearn). This blog post is also worth reading.
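As a concrete sketch of the Joblib Spark backend route (this assumes the `joblibspark` package is installed on the cluster), the cross-validated grid search is fanned out over Spark tasks while the estimator itself stays plain sklearn:

```python
from joblib import parallel_backend
from joblibspark import register_spark
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn import datasets

register_spark()  # make the "spark" backend available to joblib

X, y = datasets.load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=3)

# Each of the cv x |grid| model fits is dispatched as a Spark task
# instead of running serially on the driver.
with parallel_backend("spark", n_jobs=8):
    search.fit(X, y)

print(search.best_params_)
```

Note that only the independent fits are parallelized this way; training a single large model still happens on one node.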
