Saving Large Spark ML Pipeline to HDFS
I'm having trouble saving a large (relative to spark.rpc.message.maxSize) Spark ML pipeline to HDFS. Specifically, when I try to save the fitted model to HDFS, the job fails with an error about Spark's maximum RPC message size:
scala> val mod = pipeline.fit(df)
mod: org.apache.spark.ml.PipelineModel = pipeline_936bcade4716
scala> mod.write.overwrite().save(modelPath.concat("model"))
18/01/08 10:00:32 WARN TaskSetManager: Stage 8 contains a task of very large size
(755610 KB). The maximum recommended task size is 100 KB.
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task
2606:0 was 777523713 bytes, which exceeds max allowed: spark.rpc.message.maxSize
(134217728 bytes). Consider increasing spark.rpc.message.maxSize
or using broadcast variables for large values.
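For context, below is a minimal sketch of how the pipeline is built and saved. The stages, column names, and paths are placeholder assumptions standing in for the real (much larger) pipeline:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pipeline-save").getOrCreate()

// Placeholder input data and stages; the real pipeline is what produces the ~740 MB model
val df = spark.read.parquet("hdfs:///data/training")
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val pipeline = new Pipeline().setStages(Array(indexer /* plus the large model stages */))

val modelPath = "hdfs:///models/"
val mod = pipeline.fit(df)

// This is the call that fails once the fitted PipelineModel exceeds spark.rpc.message.maxSize
mod.write.overwrite().save(modelPath.concat("model"))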
I'm making the following assumptions about the problem:
- It's not possible to decrease the size of the model, and
- It's not possible to increase the maximum message size to a point where the pipeline would fit in a single message (a sketch of the kind of config change I'm ruling out is shown below).
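For completeness, this is roughly how the RPC limit would be raised, which I'm assuming is not an option here; the value is in MB, must be set before the SparkContext is created, and is capped at 2047:

import org.apache.spark.sql.SparkSession

// Equivalent to passing --conf spark.rpc.message.maxSize=1024 to spark-submit
val spark = SparkSession.builder()
  .appName("pipeline-save")
  .config("spark.rpc.message.maxSize", "1024")
  .getOrCreate()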
Are there any methods that would allow me to save the pipeline successfully to HDFS?
Topic scala apache-spark apache-hadoop
Category Data Science