Best practices for scoring hundreds of models on the same massive dataset?
I have 500+ models predicting various outcomes, and a massive database of 400M+ individuals with about 5,000 possible independent variables.
Currently, my scoring process takes about 5 days. It works by chunking the 400M+ records into 100k-person pieces, spinning up n threads (each assigned a particular subset of the 500+ models), and running this way until all records are scored by all models. Each thread is a Python process that submits R code (i.e. loads an R .rds model and its associated dataset transformation logic).
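For concreteness, here is a stripped-down sketch of what the orchestration does today (the paths, the `score_chunk.R` script, and the worker count are placeholders, not my actual code):

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def score_chunk(chunk_path: str, model_paths: list[str]) -> None:
    """Score one 100k-person chunk against a subset of the R models."""
    for rds_path in model_paths:
        # score_chunk.R (hypothetical) loads the .rds model, applies its
        # dataset transformations, and writes predictions next to the chunk.
        subprocess.run(
            ["Rscript", "score_chunk.R", rds_path, chunk_path],
            check=True,
        )

def score_all(chunk_paths: list[str], model_subsets: list[list[str]]) -> None:
    # One Python process per (chunk, model-subset) pair, roughly as today.
    with ProcessPoolExecutor(max_workers=32) as pool:
        futures = [
            pool.submit(score_chunk, chunk, models)
            for chunk in chunk_paths
            for models in model_subsets
        ]
        for f in futures:
            f.result()  # surface any R errors rather than swallowing them
```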
This process takes way too long, is severely error-prone (more an indicator of the tangled web of code it has become), is expensive (it requires a massive cloud instance), and only allows models built in R. I want to be basically agnostic about the language a model comes from, but at minimum I need to support both Python and R – that is a non-negotiable requirement.
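To illustrate the kind of language-agnosticism I'm after, here is a rough sketch of a common scoring interface that could wrap both Python and R models behind the same contract (everything here – the `Scorer` protocol, the `score.R` script, the joblib pickles, the "score" output column – is a hypothetical illustration, not an existing setup):

```python
import subprocess
import tempfile
from typing import Protocol

import joblib
import pandas as pd

class Scorer(Protocol):
    """Minimal contract every model must satisfy, regardless of language."""
    def predict(self, df: pd.DataFrame) -> pd.Series: ...

class PythonScorer:
    """Wraps a pickled sklearn-style Python model."""
    def __init__(self, pickle_path: str):
        self.model = joblib.load(pickle_path)

    def predict(self, df: pd.DataFrame) -> pd.Series:
        return pd.Series(self.model.predict(df), index=df.index)

class RScorer:
    """Wraps an .rds model behind the same interface by shelling out to R."""
    def __init__(self, rds_path: str):
        self.rds_path = rds_path

    def predict(self, df: pd.DataFrame) -> pd.Series:
        with tempfile.TemporaryDirectory() as tmp:
            in_csv, out_csv = f"{tmp}/in.csv", f"{tmp}/out.csv"
            df.to_csv(in_csv, index=False)
            # score.R (hypothetical): readRDS(model) -> transform ->
            # predict -> write a one-column "score" CSV.
            subprocess.run(
                ["Rscript", "score.R", self.rds_path, in_csv, out_csv],
                check=True,
            )
            return pd.read_csv(out_csv)["score"]
```

With something like this, the scheduler would only ever see `Scorer.predict`, and adding a third language would just mean adding another wrapper class.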
Does anyone with experience in a similar problem domain have any advice on how this process could be re-architected to 1) run more efficiently (from a cost point of view) and 2) support both Python and R models?