Machine learning models in a production environment
Let's say a model was trained on date $dt_1$ using the available labelled data, split into training and test sets, i.e. $train_{dt_1}$ and $test_{dt_1}$. This model is then deployed in production and makes predictions on new incoming data. After $X$ days, a batch of labelled data has been collected between $dt_1$ and $dt_1 + X$; let's call it $Data_X$. In my current approach, I take random samples out of $Data_X$ (e.g. an 80/20 split), so that:
- $80\%$ of $Data_X$ = $train_X$ (new data used to fine-tune the existing model trained on $dt_1$)
- $20\%$ of $Data_X$ = $test_X$ (new data added to $test_{dt_1}$)
This fine-tuning process is repeated as time passes.
By doing this, I get an ever-expanding test set, and I avoid retraining the whole model (essentially, I can throw away the old data once the model has learnt from it). Each new model is just a fine-tuned version of the previous one.
I have some questions regarding this approach:
- Are there any obvious drawbacks in doing this?
- Would the model ever need to be completely retrained (forgetting everything that was learnt before, and training the model with new train/test splits) after some time, or can the approach described above continue indefinitely?
- What should be the condition for swapping the existing deployed model with the newly fine-tuned model?
Topic data-product model-selection cross-validation machine-learning
Category Data Science