Incremental training and AutoML for big datasets
I built an NLP sentence classifier that uses word-embedding vectors as features.
The training dataset is large (100k sentences), and each sentence has 930 features.
I found the best model using an AutoML library (auto-sklearn); training required 40 GB of RAM and 60 hours. The best model is an ensemble of the top N models found by the library.
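For context, my current setup looks roughly like this (a minimal sketch with placeholder budgets; exact parameter names such as `memory_limit` vary slightly between auto-sklearn versions):

```python
import autosklearn.classification

# X_train: (100_000, 930) feature matrix of embedding vectors; y_train: labels.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,  # total search budget in seconds (placeholder)
    per_run_time_limit=360,        # cap on any single candidate model
    memory_limit=8192,             # MB allowed per model-fitting subprocess
    n_jobs=4,                      # evaluate candidate models in parallel
)
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)
```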
Occasionally, I need to add some data to the training set and update the model. Since this AutoML library doesn't support incremental training, every time I have to retrain from scratch, consuming ever more memory and time.
How can I address this issue? How can I do incremental training? Should I stop using this library? Would parallelizing the training reduce the memory and time usage?
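For reference, this is the kind of incremental update I have in mind, sketched with scikit-learn's `partial_fit` on a single `SGDClassifier` standing in for the ensemble (random placeholder data; `loss="log_loss"` is named `"log"` in older scikit-learn versions):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def batches(X, y, batch_size=1000):
    """Yield consecutive mini-batches of X and y."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# Placeholder data matching the question's shapes: 100k sentences, 930 features.
X_train = np.random.rand(100_000, 930)
y_train = np.random.randint(0, 2, size=100_000)

clf = SGDClassifier(loss="log_loss")  # logistic regression trained with SGD

classes = np.unique(y_train)  # all classes must be declared on the first call
for X_batch, y_batch in batches(X_train, y_train):
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Later, when new labeled sentences arrive, update in place instead of retraining.
X_new = np.random.rand(500, 930)
y_new = np.random.randint(0, 2, size=500)
clf.partial_fit(X_new, y_new)
```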
Topic automl training machine-learning
Category Data Science