Spark: How to run PCA parallelized? Only one thread used

I use pySpark and set my configuration like following:

spark = (SparkSession.builder.master("local[*]")
        .config("spark.driver.memory", "20g")
        .config("spark.executor.memory", "10g")
        .config("spark.driver.cores", 30)
        .config("spark.num.executors", 8)
        .config("spark.executor.cores", 4)
        .getOrCreate())
sc = spark.sparkContext

If I then run PCA:

from pyspark.ml.feature import PCA

pca = PCA(k=50, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(train)

Only one thread is active, so the computation takes a long time.

How can I parallelize PCA in Spark?

I run on a local machine and have not configured a cluster.

Also, I did not install the recommended native linear algebra libraries, which is why the warning

WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK

appears.

Topic pca pyspark apache-spark bigdata machine-learning

Category Data Science


According to the MLlib Linear Algebra Acceleration Guide, LAPACK and related native libraries need to be installed and configured correctly to get the full speed-up from Spark MLlib.
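As a sketch of what that setup can look like on a Debian/Ubuntu machine (package names and the netlib-java coordinate are assumptions for the Spark 2.x line that your warning message suggests; check the acceleration guide for your exact Spark version):

```shell
# Assumption: Debian/Ubuntu package names for a native BLAS/LAPACK implementation
sudo apt-get install libopenblas-base liblapack3

# Assumption: Spark 2.x, where MLlib loads natives via com.github.fommil.netlib;
# pulling in the netlib-java "all" artifact lets it find the system libraries
pyspark --packages com.github.fommil.netlib:all:1.1.2
```

If the native libraries are picked up, the `WARN LAPACK: Failed to load implementation` message should no longer appear at startup.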

Additionally, the documentation mentions that there might not be a speed-up in every case. In your case, that could be because you are running on a local machine rather than on a cluster.
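Note also that in local mode the executor-related settings (`spark.num.executors`, `spark.executor.cores`, `spark.executor.memory`) have no effect: the number of worker threads is taken from the master URL, e.g. `local[8]` for 8 threads or `local[*]` for all available cores. A minimal launch sketch (the script name `my_pca_job.py` is a placeholder for your own script):

```shell
# Local mode: parallelism is controlled by the thread count in the master URL,
# not by executor settings
spark-submit --master "local[8]" --driver-memory 20g my_pca_job.py
```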
