Scalable training/updating of many small LSTM models

Question


NMR

August 31, 2016, 13:52

I have many thousands of devices, each with its own LSTM model for anomaly prediction. The devices behave wildly differently, so unfortunately I don't think a shared global model is feasible. Periodically I will update each device's model with new data from that device: perhaps once per day I will load the additional daily batch of readings and use stateful LSTM training to update the model.

Each of these models is very small: it is trained on at most 10-20 thousand readings over its whole history and fits in memory on even a modest GPU. Training from scratch is relatively quick, and updating with a single batch should be quicker still, since there are typically only 48 new readings/day (30-minute intervals), rising to as many as 1440 readings/day (one-minute intervals).
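To get a feel for the scale, the daily workload can be estimated with some simple arithmetic. The sketch below is only a back-of-the-envelope helper: the function name `workers_needed` and every number in the example (10,000 devices, 2 seconds per update, a one-hour window) are hypothetical placeholders, not measurements.

```python
import math


def workers_needed(n_devices, seconds_per_update, window_seconds):
    """Number of parallel workers needed to finish all updates in the window.

    Assumes updates are independent and each takes about the same time.
    """
    total_seconds = n_devices * seconds_per_update
    return math.ceil(total_seconds / window_seconds)


# Hypothetical example values, chosen only for illustration.
n = workers_needed(n_devices=10_000, seconds_per_update=2.0, window_seconds=3600)
print(n)
```

With real timings for one update (including model load and save), the same calculation shows whether a handful of GPUs or a larger pool of CPU workers is needed.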

My question is: what is an optimal architecture for handling all of these models and updates? They are all independent and small, so I don't see a need for anything distributed, but there are so many of them that I am unsure how to proceed. I am thinking of something like an AWS 'large' GPU cluster, with each GPU loading a serialized model from the DB, updating it with a new batch of data, and writing it back. However, I suspect that having to repeatedly compile the models will be a bottleneck, and that being able to do only 4 at a time is woefully insufficient.
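A minimal sketch of that load/update/write-back cycle, assuming the serialized models live as blobs in a SQL table. Everything here is a stand-in: the table `models`, the column `payload`, and the functions `update_state` and `update_device` are hypothetical names, an in-memory SQLite database replaces the real DB, and the "model" is just a running sum rather than an LSTM.

```python
import pickle
import sqlite3


def update_state(state, readings):
    """Placeholder for one incremental training step (NOT a real LSTM update)."""
    return {
        "n": state["n"] + len(readings),
        "total": state["total"] + sum(readings),
    }


def update_device(conn, device_id, readings):
    """Load one device's serialized state, update it, and write it back."""
    row = conn.execute(
        "SELECT payload FROM models WHERE device_id = ?", (device_id,)
    ).fetchone()
    # A device seen for the first time starts from an empty state.
    state = pickle.loads(row[0]) if row else {"n": 0, "total": 0.0}
    state = update_state(state, readings)
    conn.execute(
        "INSERT OR REPLACE INTO models (device_id, payload) VALUES (?, ?)",
        (device_id, pickle.dumps(state)),
    )
    conn.commit()
    return state


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (device_id TEXT PRIMARY KEY, payload BLOB)")
update_device(conn, "dev-1", [1.0, 2.0, 3.0])
final = update_device(conn, "dev-1", [4.0])
```

In a real setup the placeholder step would deserialize the network weights and run the training call; the cost of that step, rather than the database round trip, is what the compilation concern above is about.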

Alternatively, it may be possible to do the 'initial' training on some kind of powerful GPU rig and then do the batch updates on CPUs, which would make it easier to run many instances in parallel. Is this an example of where Spark might be useful?
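Because the per-device updates are independent, running them in parallel on CPUs amounts to mapping one function over a list of jobs. The sketch below uses Python's standard `concurrent.futures`; `update_one` and `run_updates` are hypothetical names, and the job body is a placeholder that only counts readings. A thread pool is used here purely to keep the example self-contained; for CPU-bound training, `ProcessPoolExecutor` from the same module exposes the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor


def update_one(job):
    """Placeholder for 'load model, train on the day's batch, save'."""
    device_id, readings = job
    return device_id, len(readings)


def run_updates(jobs, executor):
    """Fan independent per-device jobs out over the executor's workers."""
    return dict(executor.map(update_one, jobs))


# 100 hypothetical devices, each with 48 readings (one day at 30-minute intervals).
jobs = [(f"dev-{i}", [0.0] * 48) for i in range(100)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = run_updates(jobs, pool)
```

The same shape (a list of device IDs mapped over a pure per-device function) carries over to a cluster scheduler if a single machine turns out not to be enough.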

Any advice is appreciated, including 'your premise is absurd, why would you do this!' This is only one of many ways to do the anomaly detection, so if the architecture is infeasible, I will abandon the LSTMs altogether and try something simpler.

Topic apache-spark parallel scalability machine-learning

Category Data Science
