Saving Large Spark ML Pipeline to HDFS

I'm having trouble saving a large (relative to spark.rpc.message.maxSize) Spark ML pipeline to HDFS. Specifically, when I try to save the model to HDFS, the save fails with an error about Spark's maximum RPC message size:

    scala> val mod = pipeline.fit(df)
    mod: org.apache.spark.ml.PipelineModel = pipeline_936bcade4716
    scala> mod.write.overwrite().save(modelPath.concat("model"))
    18/01/08 10:00:32 WARN TaskSetManager: Stage 8 contains a task of very large size 
    (755610 KB). The maximum recommended task size is 100 KB.
    org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 
    2606:0 was 777523713 bytes, which exceeds max allowed: spark.rpc.message.maxSize 
    (134217728 bytes). Consider increasing spark.rpc.message.maxSize 
    or using broadcast variables for large values.

I'm making the following assumptions about the problem:

  1. It's not possible to decrease the size of the model AND
  2. It's not possible to increase the maximum message size to a point where the pipeline would fit in a single message.

Are there any methods that would allow me to save the pipeline successfully to HDFS?

Topic: scala, apache-spark, apache-hadoop

Category: Data Science


You can do the following:

A Pipeline can be made of other pipelines. Isn't that great? A Pipeline inherits from the Estimator class, and by definition a PipelineStage can be either an Estimator or a Transformer, so a Pipeline can itself appear as a stage inside another Pipeline.
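
As a minimal sketch of that nesting (the Tokenizer, HashingTF, and LogisticRegression stages here are purely hypothetical placeholders, not from the question):

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Hypothetical stages, for illustration only.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // Pipeline extends Estimator, and Estimator extends PipelineStage,
    // so a Pipeline can itself be used as a stage of another Pipeline.
    val featurePipeline = new Pipeline().setStages(Array[PipelineStage](tokenizer, hashingTF))
    val fullPipeline = new Pipeline().setStages(Array[PipelineStage](featurePipeline, lr))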

This way, you can build smaller pipelines, fit and save them separately, and then, in the consuming application or class, load them again, chain them back together as a single logical pipeline, and call transform on the DataFrame, as sketched below.
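
A minimal sketch of the save-and-reload side, assuming `pipeline`, `df`, and `modelPath` are the same objects as in the question; the group size of 2 and the `sub_<i>` sub-paths are arbitrary choices made up for this example:

    import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
    import org.apache.spark.sql.DataFrame

    // Split the original stages into smaller groups so that no single
    // serialized sub-model exceeds spark.rpc.message.maxSize.
    // The group size of 2 and the "sub_<i>" paths are arbitrary.
    val stageGroups: Seq[Array[PipelineStage]] = pipeline.getStages.grouped(2).toSeq

    var current = df
    stageGroups.zipWithIndex.foreach { case (stages, i) =>
      val subModel = new Pipeline().setStages(stages).fit(current)
      subModel.write.overwrite().save(s"$modelPath/sub_$i")
      // Feed the transformed output into the fit of the next group,
      // which is what a single Pipeline.fit would do internally.
      current = subModel.transform(current)
    }

    // In the consuming application: load the sub-models in order and apply
    // them one after another, which is equivalent to one big PipelineModel.
    def applySavedPipelines(data: DataFrame, numParts: Int, modelPath: String): DataFrame =
      (0 until numParts).foldLeft(data) { (d, i) =>
        PipelineModel.load(s"$modelPath/sub_$i").transform(d)
      }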
