Incremental training and Auto Machine Learning for big datasets

I built an NLP sentence classifier that uses word-embedding vectors as features.

The training dataset is large (100k sentences), and every sentence is represented by 930 features.

I found the best model using an automated machine learning library (auto-sklearn); training required 40 GB of RAM and 60 hours. The best model is an ensemble of the top N models found by the library.

Occasionally I need to add data to the training set and update the model. Since this AutoML library isn't suited to incremental training, each update means a complete retraining, consuming more and more memory and time.

How can I address this issue? How can I train incrementally? Should I stop using this library? Would parallelizing the training reduce memory and time usage?

Topic: automl, training, machine-learning

Category: Data Science


First of all, with auto-sklearn you can do the following:

    import autosklearn.classification

    automl = autosklearn.classification.AutoSklearnClassifier()
    automl.fit(X_train, y_train, dataset_name='X_train',
               feat_type=feature_types)

    # Inspect the models that make up the final ensemble
    print(automl.show_models())

so you can extract the instances of the best models from the first fit. However, in order to learn incrementally, a scikit-learn model has to implement the partial_fit method. Naive Bayes variants and a number of other algorithms offer this functionality, so you are out of luck if none of them appear in the output of show_models(); in that case you would have to build your own automated ML pipeline targeted at models that support partial_fit.
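As a minimal sketch (assuming your new data arrives in batches already encoded as 930-dimensional embedding vectors; the batch generator below is purely illustrative), incremental training with partial_fit looks like this:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Hypothetical generator yielding (X_chunk, y_chunk) batches of new
    # sentences already encoded as 930-dimensional embedding vectors.
    def iter_batches():
        for _ in range(10):
            X_chunk = np.random.rand(1000, 930)
            y_chunk = np.random.randint(0, 2, size=1000)
            yield X_chunk, y_chunk

    classes = np.array([0, 1])  # all class labels must be declared up front
    clf = SGDClassifier()       # linear model trained with SGD, supports partial_fit

    # Each call updates the existing weights with one batch only, so memory
    # usage is bounded by the batch size rather than by the full dataset.
    for X_chunk, y_chunk in iter_batches():
        clf.partial_fit(X_chunk, y_chunk, classes=classes)

When new labelled sentences arrive later, you call partial_fit again on just the new batch instead of retraining on the whole corpus.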

An alternative is Spark: MLlib has some cool streaming (incremental learning) algorithms such as StreamingKMeans, StreamingLinearRegressionWithSGD, StreamingLogisticRegressionWithSGD and, more generally, StreamingLinearAlgorithm.
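As a rough sketch of the Spark route (the input directory and the CSV layout of a label followed by 930 feature values are assumptions for illustration), a streaming logistic regression could be wired up like this:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="IncrementalSentenceClassifier")
    ssc = StreamingContext(sc, batchDuration=10)

    # Assumed CSV layout: label, then the 930 embedding values per line.
    def parse(line):
        values = [float(v) for v in line.split(',')]
        return LabeledPoint(values[0], values[1:])

    # Hypothetical directory where new batches of labelled sentences land.
    training_stream = ssc.textFileStream("hdfs:///path/to/new_batches").map(parse)

    model = StreamingLogisticRegressionWithSGD(numIterations=10)
    model.setInitialWeights([0.0] * 930)  # one weight per embedding feature

    # Each micro-batch updates the existing weights instead of retraining from scratch.
    model.trainOn(training_stream)

    ssc.start()
    ssc.awaitTermination()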

To conclude, if these are your constraints I would not use auto-sklearn, and would instead choose an alternative that supports incremental and/or parallel training.
