Best practices for scoring hundreds of models on the same massive dataset?
I have 500+ models predicting various outcomes, and a massive database of 400M+ individuals with about 5,000 possible independent variables.
Currently, my scoring process takes about 5 days. It works by chunking the 400M+ records into 100k-person pieces, spinning up n threads (each assigned a particular subset of the 500+ models), and running this way until all records are scored by all models. Each thread is a Python process that submits R code (i.e. loads an R .rds model and its associated dataset transformation logic).
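For concreteness, here is a stripped-down sketch of what the orchestration does today (the paths, the `score_chunk.R` script, and the worker count are placeholders, not my actual code):

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def score_chunk(chunk_path: str, model_paths: list[str]) -> None:
    """Score one 100k-person chunk against a subset of the R models."""
    for rds_path in model_paths:
        # score_chunk.R (hypothetical) loads the .rds model, applies its
        # dataset transformations, and writes predictions next to the chunk.
        subprocess.run(
            ["Rscript", "score_chunk.R", rds_path, chunk_path],
            check=True,
        )

def score_all(chunk_paths: list[str], model_subsets: list[list[str]]) -> None:
    # One Python process per (chunk, model-subset) pair, roughly as today.
    with ProcessPoolExecutor(max_workers=32) as pool:
        futures = [
            pool.submit(score_chunk, chunk, models)
            for chunk in chunk_paths
            for models in model_subsets
        ]
        for f in futures:
            f.result()  # surface any R errors rather than swallowing them
```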
This process takes way too long, is severely error-prone (more an indicator of the tangled web of code it has become), is expensive (it requires a massive cloud instance), and only allows models built in R. I want to be basically agnostic about the language a model comes from, but at minimum I need to support both Python and R – that is a non-negotiable requirement.
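To illustrate the kind of language-agnosticism I'm after, here is a rough sketch of a common scoring interface that could wrap both Python and R models behind the same contract (everything here – the `Scorer` protocol, the `score.R` script, the joblib pickles, the "score" output column – is a hypothetical illustration, not an existing setup):

```python
import subprocess
import tempfile
from typing import Protocol

import joblib
import pandas as pd

class Scorer(Protocol):
    """Minimal contract every model must satisfy, regardless of language."""
    def predict(self, df: pd.DataFrame) -> pd.Series: ...

class PythonScorer:
    """Wraps a pickled sklearn-style Python model."""
    def __init__(self, pickle_path: str):
        self.model = joblib.load(pickle_path)

    def predict(self, df: pd.DataFrame) -> pd.Series:
        return pd.Series(self.model.predict(df), index=df.index)

class RScorer:
    """Wraps an .rds model behind the same interface by shelling out to R."""
    def __init__(self, rds_path: str):
        self.rds_path = rds_path

    def predict(self, df: pd.DataFrame) -> pd.Series:
        with tempfile.TemporaryDirectory() as tmp:
            in_csv, out_csv = f"{tmp}/in.csv", f"{tmp}/out.csv"
            df.to_csv(in_csv, index=False)
            # score.R (hypothetical): readRDS(model) -> transform ->
            # predict -> write a one-column "score" CSV.
            subprocess.run(
                ["Rscript", "score.R", self.rds_path, in_csv, out_csv],
                check=True,
            )
            return pd.read_csv(out_csv)["score"]
```

With something like this, the scheduler would only ever see `Scorer.predict`, and adding a third language would just mean adding another wrapper class.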
Does anyone with experience in a similar problem domain have any advice on how this process could be re-architected to 1) run more efficiently (from a cost point of view) and 2) support both Python and R models?