Machine learning models in a production environment

Let's say a model was trained on date $dt1$ using the available labelled data, split into training and test sets, i.e. $train_{dt1}$ and $test_{dt1}$. This model is then deployed in production and makes predictions on new incoming data. After $X$ days, a batch of labelled data has been collected between $dt1$ and $dt1 + X$ days; let's call it $Data_x$. In my current approach, I take random samples out of $Data_x$ (for example, an 80/20 split), so:

  • $80\%$ of $Data_x$ = $train_x$ — new data used to fine-tune the existing model trained on $dt1$
  • $20\%$ of $Data_x$ = $test_x$ — new data appended to $test_{dt1}$

This fine-tuning process is repeated as time passes.

By doing this I get an ever-expanding test set, and I avoid retraining the whole model (essentially, I can throw away the old data since the model has already learnt from it). Each new model is just a fine-tuned version of the previous one.
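The workflow above can be sketched with scikit-learn. This is a minimal illustration, not the asker's actual setup: the data is synthetic, and `SGDClassifier` stands in for any model that supports incremental updates via `partial_fit`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Labelled data available at dt1 (synthetic stand-in), split into train/test.
X_dt1, y_dt1 = rng.randn(500, 4), rng.randint(0, 2, 500)
X_train, X_test, y_train, y_test = train_test_split(
    X_dt1, y_dt1, test_size=0.2, random_state=0)

# SGDClassifier supports incremental learning via partial_fit;
# classes must be declared on the first call.
model = SGDClassifier(random_state=0)
model.partial_fit(X_train, y_train, classes=np.array([0, 1]))

# Data_x: labelled data collected between dt1 and dt1 + X days.
X_new, y_new = rng.randn(200, 4), rng.randint(0, 2, 200)
X_tr_x, X_te_x, y_tr_x, y_te_x = train_test_split(
    X_new, y_new, test_size=0.2, random_state=0)

# Fine-tune on 80% of Data_x; the old training data can now be discarded.
model.partial_fit(X_tr_x, y_tr_x)

# Append the remaining 20% to the ever-expanding test set.
X_test = np.vstack([X_test, X_te_x])
y_test = np.concatenate([y_test, y_te_x])

print(X_test.shape)  # test set grows from 100 to 140 rows
```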

I have some questions regarding this approach:

  1. Are there any obvious drawbacks in doing this?
  2. Would the model ever need to be completely retrained (forgetting everything learnt before and training from scratch on new train/test splits) after some time, or can the approach described above continue indefinitely?
  3. What should the condition be for swapping the existing deployed model with the newly fine-tuned one?



It mainly depends on the kind of learning your ML algorithm does.

  • Offline learning: retraining the whole thing is wise, since some algorithms require the full dataset to form better hypotheses.
  • Online learning: your model can be fine-tuned on the most recent data, with the model updated as the data arrives.
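One quick, hedged way to check which camp a scikit-learn estimator falls into (assuming scikit-learn is the library in use): online-capable learners expose `partial_fit`, while offline learners such as random forests do not and must be refit on the full accumulated dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

# Online learners expose partial_fit for incremental updates;
# offline learners lack it and need a full retrain.
print(hasattr(SGDClassifier(), "partial_fit"))           # True
print(hasattr(RandomForestClassifier(), "partial_fit"))  # False
```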


I think this is a good approach in general. However:

  • How well fine-tuning your model (online learning) works depends a lot on the algorithm and the model. Depending on your algorithm, it might be wise to retrain the whole thing.

  • Your sample space might change over time. If you have enough data, retraining every few days/weeks/months on only the last year's worth of data might be better. If your old samples no longer represent the current situation well, including them might hurt performance more than the extra samples help.

  • The biggest condition is whether the new model has been tested and how much downtime the swap involves; in general, swapping more often is better, and the process can be automated.
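A common pattern for the swap condition is a champion/challenger check: evaluate both the deployed model and the fine-tuned candidate on the shared held-out test set and swap only if the candidate is at least as good. The sketch below assumes scikit-learn; `should_swap`, its `margin` parameter, and the models are hypothetical names for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(1)
# Synthetic stand-ins for training data and the shared held-out test set.
X, y = rng.randn(300, 4), rng.randint(0, 2, 300)
X_test, y_test = rng.randn(100, 4), rng.randint(0, 2, 100)

champion = SGDClassifier(random_state=0).fit(X, y)    # currently deployed
challenger = SGDClassifier(random_state=1).fit(X, y)  # e.g. fine-tuned variant

def should_swap(champion, challenger, X_test, y_test, margin=0.0):
    """Swap only if the challenger beats the deployed model on the
    shared test set by at least `margin` (0.0 = at least as good)."""
    old_score = accuracy_score(y_test, champion.predict(X_test))
    new_score = accuracy_score(y_test, challenger.predict(X_test))
    return new_score >= old_score + margin

# Deploy whichever model wins on the held-out data.
deployed = challenger if should_swap(champion, challenger, X_test, y_test) else champion
```

A small positive `margin` guards against swapping on noise; an automated pipeline can run this check after every fine-tuning round.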
