Lightweight execution of Spark MLLib models

I have training data in a Hive database which I am using to build a Spark MLLib model. I am using simple linear regression models and the PySpark API.

I have a job set up to train this model every day so that the model is always up to date. (The real-world use case is that I am predicting vehicle unloading times, and my model must always be recently trained, since the characteristics of the vehicles and locations change over time.)

However, when I use my model for inference I want to do it from an existing Java codebase. I need fast inference for individual data points, not batch inference, so I need a lightweight, low-latency way of computing predictions.

One solution that I have found is to export the parameters of my MLLib model into PMML or another representation, and then re-implement the inference code in pure Java without any of the boilerplate that would come with Spark. So I have a function like this:

    public static double predict(double[] features) {
        // Linear model: bias plus the dot product of the exported weights and the features.
        double prediction = bias;
        for (int i = 0; i < weights.length; i++) {
            prediction += weights[i] * features[i];
        }
        return prediction;
    }

where the weights array and bias are updated every day with values exported from the freshly trained MLLib model.
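As a minimal sketch of that refresh step (assuming a hypothetical export format of one number per line, bias on the first line, written by the daily training job), the Java side might look like this:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class LinearModelParams {
        static double bias;
        static double[] weights;

        // Reload the coefficients exported by the daily training job.
        // Assumed format (hypothetical): one number per line, bias on the first line.
        static void reload(String path) throws java.io.IOException {
            List<String> lines = Files.readAllLines(Paths.get(path));
            bias = Double.parseDouble(lines.get(0));
            weights = lines.stream()
                    .skip(1)
                    .mapToDouble(Double::parseDouble)
                    .toArray();
        }
    }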

However, this seems inefficient, since the logic of the model is now reproduced unnecessarily in the Java code, and it also restricts me to the kinds of simple models that can be represented this way. For example, I could not do this for random forest regression.

Ideally, I would like a lightweight inference call to Spark MLLib within Java, without any of the overhead of Spark sessions, servers, APIs, URLs, etc.

Is there such a lightweight Spark Java function that I could use to run inference on a single instance? I can't imagine it is an uncommon situation: needing the benefit of Hive and parallel processing for training, but only fast, simple inference with minimal overhead.

Tags: java, apache-spark



Is running

    local[*]

for your Spark master not lightweight and fast enough, with 8 cores blasting away on an Intel Core i7? Or you could find a beefier machine to run it. Do you need sub-minute response times? If the problem is small enough to run on a single computer, Spark is not bad for performance. It uses BLAS libraries, and you can try different BLAS backends to see if you get some extra boost.

There is little benefit in converting to Java. Which libraries are you planning to use for your algorithms? In the end, they will run on the same JVM that Spark is already using, and to its fullest extent: Spark's standalone mode will grab as much of the available local memory as it needs while leaving enough for the operating system and other running apps.

Note that all MLlib routines run on a JVM, even when PySpark is the programming language/interface. PySpark communicates with the JVM over a socket via Py4J to submit the Spark jobs.
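To illustrate the point, here is a minimal sketch of single-point scoring from Java against a local master, with no cluster or server involved. It assumes a model saved by the daily training job at a hypothetical path, and Spark 3.0+, where predict() on a single feature vector is public:

    import org.apache.spark.ml.linalg.Vectors;
    import org.apache.spark.ml.regression.LinearRegressionModel;
    import org.apache.spark.sql.SparkSession;

    public class LocalScorer {
        public static void main(String[] args) {
            // Single-JVM Spark session: no cluster, external server or REST API.
            SparkSession spark = SparkSession.builder()
                    .appName("unload-time-scoring")
                    .master("local[*]")
                    .getOrCreate();

            // Hypothetical path written by the daily PySpark training job.
            LinearRegressionModel model = LinearRegressionModel.load("/models/unload_time_lr");

            // Since Spark 3.0, predict() accepts a single feature vector,
            // so scoring one data point needs no DataFrame round trip.
            double prediction = model.predict(Vectors.dense(3.2, 17.0, 1.0));
            System.out.println("Predicted unloading time: " + prediction);

            spark.stop();
        }
    }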

Update: you can disable the Spark web UI via

    --conf spark.ui.enabled=false

on the command line or

    conf.set("spark.ui.enabled", "false")

That saves on startup time and also reduces the number of ports in use.
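If you create the session through SparkSession.builder rather than a SparkConf, the equivalent (just a sketch of the same option) would be:

    SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .config("spark.ui.enabled", "false") // skip starting the web UI
            .getOrCreate();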


You can export the model (weights + hyperparameters) from Spark to a common format and then run the same model in another language/framework.

https://docs.databricks.com/spark/latest/mllib/model-export.html

The link above has examples for Python/Spark; the same approach should work for Java as well.
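For the Java side, one commonly used option for consuming an exported PMML file is the third-party JPMML-Evaluator library. The sketch below is based on its 1.5-era API (class names such as LoadingModelEvaluatorBuilder and FieldName differ in later releases), and the file path and feature names are hypothetical:

    import java.io.File;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.dmg.pmml.FieldName;
    import org.jpmml.evaluator.Evaluator;
    import org.jpmml.evaluator.EvaluatorUtil;
    import org.jpmml.evaluator.FieldValue;
    import org.jpmml.evaluator.InputField;
    import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;

    public class PmmlScorer {
        public static void main(String[] args) throws Exception {
            // Load the PMML document exported from the daily training job (hypothetical path).
            Evaluator evaluator = new LoadingModelEvaluatorBuilder()
                    .load(new File("/models/unload_time.pmml"))
                    .build();
            evaluator.verify();

            // Raw input values for one data point; the keys must match the
            // field names declared in the PMML document (hypothetical names here).
            Map<String, Object> raw = new LinkedHashMap<>();
            raw.put("distance_km", 3.2);
            raw.put("pallets", 17.0);

            // Convert the raw values into the typed field values the evaluator expects.
            Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
            for (InputField inputField : evaluator.getInputFields()) {
                FieldName name = inputField.getName();
                arguments.put(name, inputField.prepare(raw.get(name.getValue())));
            }

            // Score a single record in-process: no Spark session or server involved.
            Map<FieldName, ?> results = evaluator.evaluate(arguments);
            Object prediction = EvaluatorUtil.decode(
                    results.get(evaluator.getTargetFields().get(0).getName()));
            System.out.println("Predicted unloading time: " + prediction);
        }
    }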
