BERT in production

I've created a BERT model. What are the options for deploying it? Is it possible to use it with Spark, Hadoop, or Docker?

Topic: bert, apache-spark, apache-hadoop

Category: Data Science


You can certainly apply it with Spark. There is no reason you can't use PyTorch in a Spark job; just add it as a dependency when you submit the job. Spark's pandas UDFs are useful for scoring large models, as they let you score in mini-batches. See https://spark.apache.org/docs/3.0.0-preview/sql-pyspark-pandas-with-arrow.html#scalar-iterator
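As a rough illustration, here is what a Scalar Iterator pandas UDF for BERT scoring might look like in Spark 3. The model name, the `text` column, and scoring the positive class are assumptions for the sketch, not part of the question:

```python
from typing import Iterator

import pandas as pd
import torch
from pyspark.sql.functions import pandas_udf
from transformers import AutoModelForSequenceClassification, AutoTokenizer


@pandas_udf("float")
def score_text(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the tokenizer and model once per executor process, not once per row.
    # "bert-base-uncased" is a placeholder; use your own fine-tuned model.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    model.eval()
    for batch in batches:
        inputs = tokenizer(batch.tolist(), padding=True,
                           truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Emit one score per input row; here, probability of class 1 (an assumption).
        yield pd.Series(torch.softmax(logits, dim=-1)[:, 1].numpy())


# Usage, assuming df has a string column named "text":
# scored = df.withColumn("score", score_text("text"))
```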

One complication is that while you can use GPUs from Spark 2.x, you can't allocate GPUs as schedulable resources. So you may end up with multiple tasks on one GPU and need to tune a bit to reduce contention. Spark 3, however, adds GPU-aware resource scheduling.
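For reference, requesting GPUs with Spark 3's resource scheduling looks roughly like the sketch below; the discovery-script path and the one-GPU-per-executor, one-GPU-per-task layout are assumptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bert-scoring")
    # Spark 3 GPU-aware scheduling: one GPU per executor and one per task,
    # so tasks no longer contend for the same device.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    # Spark ships an example GPU discovery script; the path below assumes a
    # standard Spark install location.
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/spark/examples/src/main/scripts/getGpusResources.sh")
    .getOrCreate()
)
```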

Hadoop itself isn't a thing that runs computations. If you mean MapReduce, that's effectively obsolete for this kind of workload; if you mean Spark running on a Hadoop cluster, see above.

Docker is also an option: just package up your scoring code and run it on a cluster. You don't get the same help with data movement and access that you would in Spark; that's all up to you. But it can certainly work; a minimal image might look like the sketch below.
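For illustration only, a minimal Dockerfile for a scoring container; the file names (`score.py`, `requirements.txt`) and the base image are assumptions, not anything from the question:

```dockerfile
# Minimal sketch: package the scoring code and its dependencies.
FROM python:3.9-slim

WORKDIR /app

# requirements.txt (assumed to exist) would pin torch, transformers, etc.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# score.py is a hypothetical entry point that loads the model and scores input.
COPY score.py .

ENTRYPOINT ["python", "score.py"]
```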
