Spark: How to run PCA in parallel? Only one thread used
I use pySpark and set up my session like this:

from pyspark.sql import SparkSession

spark = (SparkSession.builder.master("local[*]")
         .config("spark.driver.memory", "20g")
         .config("spark.executor.memory", "10g")
         .config("spark.driver.cores", 30)
         .config("spark.num.executors", 8)
         .config("spark.executor.cores", 4)
         .getOrCreate())
sc = spark.sparkContext
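For reference, here is the minimal local configuration that I believe should be equivalent (this is my assumption: in local[*] mode everything runs in a single JVM, so the executor settings above are probably ignored and the master string alone controls the thread count):

# Minimal sketch, assuming local[*] mode ignores spark.executor.* and
# spark.num.executors: parallelism then comes only from the master string.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")                     # one worker thread per CPU core
         .config("spark.driver.memory", "20g")   # single JVM, so only driver memory matters
         .getOrCreate())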
If I then run PCA:
from pyspark.ml.feature import PCA
pca = PCA(k=50, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(train)
During the fit, only one thread is active, so the computation takes a long time.
How can I parallelize PCA in Spark?
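Could the number of partitions be the issue? As a sketch of what I mean (the repartition target of 32 is an arbitrary number, not a recommendation):

# Sketch: each partition becomes one task, so a single input partition
# means a single busy thread even under local[*].
print(train.rdd.getNumPartitions())  # tasks available to the distributed part of fit()
train = train.repartition(32)        # arbitrary target; spreads rows over more partitions
model = pca.fit(train)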
I run on a single local machine and did not configure a cluster. Also, I have not installed the recommended native linear-algebra (BLAS/LAPACK) packages, which is why the warning

WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK

appears.
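Would installing a native BLAS fix the single-threaded part? My guess (an assumption on my part; I am not certain the Maven coordinate or the system package name below are right) is something like:

# Assumption: pulling in netlib-java's natives and installing a system
# OpenBLAS (e.g. apt-get install libopenblas-base on Debian/Ubuntu) would
# let the local linear algebra inside PCA use a multi-threaded native LAPACK.
spark = (SparkSession.builder.master("local[*]")
         .config("spark.jars.packages", "com.github.fommil.netlib:all:1.1.2")
         .getOrCreate())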
Tags: pca, pyspark, apache-spark, bigdata, machine-learning
Category: Data Science