Incremental Learning with sklearn: warm_start, partial_fit(), fit()

I have built an ML model with the goal of making predictions for targets of the following week. In general, new data will come in and be processed at the end of each week and be in the same data structure as before. In other words, the same number of features, same classes for classification, etc.

Instead of re-training the model from scratch for each week's predictions, I am considering applying an incremental learning approach so that past learning is not entirely discarded and the model would (presumably) increase in performance over time. I'm working with sklearn on Python 3. There were only a handful of posts on StackOverflow regarding this, but many of the answers seem inconsistent (possibly due to updates with sklearn's API?).

The documentation here and here suggests that incremental/online learning is possible with certain ML implementations - implying that the new datasets could be treated as mini-batches, with the model trained incrementally by saving/loading it and calling .partial_fit() with the same model parameters.

Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. 1

Unlike fit, repeatedly calling partial_fit does not clear the model, but updates it with respect to the data provided. The portion of data provided to partial_fit may be called a mini-batch. Each mini-batch must be of consistent shape, etc. In iterative estimators, partial_fit often only performs a single iteration. 2
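
In code, the weekly update I'm imagining looks roughly like this (just a sketch; the estimator, file path and helper name are placeholders):

```python
import joblib
from sklearn.linear_model import SGDClassifier

CLASSES = [0, 1]  # all classes must be passed on the first call to partial_fit

def weekly_update(X_new, y_new, model_path="model.joblib"):
    try:
        model = joblib.load(model_path)   # reuse last week's model
    except FileNotFoundError:
        model = SGDClassifier()           # first week: start from scratch
    model.partial_fit(X_new, y_new, classes=CLASSES)
    joblib.dump(model, model_path)
    return model
```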

However, the documentation here is throwing me off.

partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed. 3

There are cases where you want to use warm_start to fit on different, but closely related data. 3

For the problem I am tackling, ideally model parameters should be adjusted based on cross-validation, and new datasets should be weighted more heavily than old ones due to concept drift. However, ignoring this for now:

  1. In (3), what exactly does (more-or-less) constant...different but closely related data mean? Since the data structure of new datasets is the same, should I be calling estimator(warm_start=True).fit(#new df) or estimator.partial_fit(#new df)?
  2. For iterative estimators such as sklearn.linear_model.SGDClassifier, only one epoch is run when using .partial_fit(). If I want $k$ epochs, would calling it $k$ times on the same dataset (see the sketch after this list) be the same as calling .fit() with $k$ epochs to begin with?
  3. Do dedicated libraries such as creme offer any advantage for incremental learning?
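
To illustrate what I mean in question 2, here is a rough sketch (the dataset and the choice of five epochs are arbitrary; I'm not claiming the two are identical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, random_state=0)

# One call to .fit() with k = 5 epochs (tol=None so all 5 passes run).
clf_fit = SGDClassifier(max_iter=5, tol=None, shuffle=False, random_state=0)
clf_fit.fit(X, y)

# k = 5 separate calls to .partial_fit(), one pass over the data each time.
clf_pf = SGDClassifier(random_state=0)
for _ in range(5):
    clf_pf.partial_fit(X, y, classes=np.unique(y))
```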

Topic online-learning scikit-learn python machine-learning

Category Data Science


Answering my own question after some investigation:

  • warm_start=True and calling .fit() sequentially should not be used for incremental learning on new datasets with potential concept drift. It simply uses the previously fitted model's parameters to initialize a new fit, and will likely be overwritten if the new data is sufficiently different (i.e. signals are different). After a few mini-batches with large enough sample size (datasets in my case), the overall performance converges to exactly that of simply re-initializing the model. My guess is that this method should be used for the primary purpose of reducing training time when fitting the same dataset, or when there is no significant concept drift in new data.
  • partial_fit, on the other hand, does update the existing model and can be used for incremental learning (especially for datasets too large to fit into memory, fed in as mini-batches). However, on datasets with potential concept drift or high noise, it performed worse than disregarding past observations and simply re-fitting on each new dataset without any incremental learning. A sketch comparing the two strategies follows this list.
  • For SGDClassifier, calling partial_fit repeatedly on the same data does make a difference: each call performs one additional pass, so it is not a no-op.
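
For reference, here is roughly the kind of comparison I ran (a sketch; the synthetic weekly batches stand in for the real datasets):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Pretend each element is one week's worth of new data.
weekly_batches = [make_classification(n_samples=1000, random_state=week)
                  for week in range(10)]
classes = np.array([0, 1])

# Strategy A: warm_start=True, calling .fit() on each new dataset.
# Each call re-optimizes on that week's data, starting from last week's weights.
clf_warm = SGDClassifier(warm_start=True, random_state=0)
for X, y in weekly_batches:
    clf_warm.fit(X, y)

# Strategy B: partial_fit, one incremental pass per new dataset.
clf_inc = SGDClassifier(random_state=0)
for X, y in weekly_batches:
    clf_inc.partial_fit(X, y, classes=classes)
```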

Edit (2022)
This post/answer has gotten a lot more views than expected, and I thought I'd expand a bit on my previous answer.

Let's say your ML model is a very simple linear regression, $$ y = Wx + b $$ where $W, b$ are the weights and biases, and $x$ the input/features.

And let's say that you've trained the model so that you've obtained some estimates for $\hat{W}, \hat{b}$ on some initial dataset $D_0$. Now you've obtained another dataset $D_1$.

Using warm_start=True and .fit() simply uses $\hat{W}, \hat{b}$ as an initialization for the parameters to be optimized on $D_1$. This can reduce the training time, especially if the datasets $D_1$ and $D_0$ are assumed to be generated from the same underlying data generating process ("more-or-less constant" in the docs).

On the other hand, partial_fit is for incrementally updating the parameters. So if you trained the model on $D_0$ and then called partial_fit on $D_1$, this would be conceptually similar to training a fresh model on the two datasets combined.
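
Here is a small sketch of that difference (assuming SGDRegressor as the linear model; $D_0$ and $D_1$ are just two random draws from the same process):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X0, X1 = rng.randn(200, 3), rng.randn(200, 3)
true_W, true_b = np.array([1.0, -2.0, 0.5]), 0.1
y0, y1 = X0 @ true_W + true_b, X1 @ true_W + true_b

# warm_start: the estimates from D_0 only serve as the starting point
# for a fresh optimization on D_1.
reg_warm = SGDRegressor(warm_start=True, random_state=0)
reg_warm.fit(X0, y0)          # estimates W_hat, b_hat on D_0
reg_warm.fit(X1, y1)          # re-fits on D_1, initialized at those estimates

# partial_fit: D_1 incrementally updates the model trained on D_0.
reg_inc = SGDRegressor(random_state=0)
reg_inc.fit(X0, y0)
reg_inc.partial_fit(X1, y1)

print(reg_warm.coef_, reg_warm.intercept_)
print(reg_inc.coef_, reg_inc.intercept_)
```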

The distinction can be pretty subtle, so here's an example. Let's say you are training a classifier on the Iris dataset. Now suppose you went out and collected more data on the flowers. If you think there is concept drift (the flowers have evolved and are slightly different), or perhaps you only care about the new data, then using warm_start lets you train the model on the new data faster than training from scratch with random initialization.

On the other hand, let's say you are building a music recommendation system where users provide feedback on whether the recommendation was good or not. Then you can use partial_fit to incrementally update the model as the live data comes in.


According to the glossary entry for partial_fit:

Generally, estimator parameters should not be modified between calls to partial_fit, although partial_fit should validate them as well as the new mini-batch of data. In contrast, warm_start is used to repeatedly fit the same estimator with the same data but varying parameters.

So the practical implication of this is that warm_start is best used for a parameter search on a fixed dataset, the reasoning being that the warm start should increase the speed of convergence. This more or less echoes the previous answer by @oW_, but I wanted to expand on the reasoning behind the use case.
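
A minimal sketch of that use case (the estimator and the grid of C values are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Same estimator, same data, varying regularization strength.
# With warm_start=True, each fit starts from the previous solution,
# which typically speeds up convergence along the parameter path.
clf = LogisticRegression(warm_start=True, solver="lbfgs", max_iter=500)
for C in [0.01, 0.1, 1.0, 10.0]:
    clf.set_params(C=C)
    clf.fit(X, y)
    print(C, clf.score(X, y))
```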


Just to add another, hopefully clarifying example: You may have fitted 100 trees in a random forest model and you want to add 10 more. Then you can achieve this by setting estimator.set_params(n_estimators=110, warm_start=True) and calling the fit method of the already fitted estimator. It typically would not make sense to fit the first 100 trees on one part of the data and the next 10 trees on a different part. Warm start doesn't change the first 100 trees.
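
In code, that looks roughly like this (the dataset is just a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

rf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
rf.fit(X, y)                        # fits the first 100 trees
print(len(rf.estimators_))          # 100

rf.set_params(n_estimators=110)     # warm_start is already True
rf.fit(X, y)                        # fits only the 10 additional trees
print(len(rf.estimators_))          # 110
```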

Similarly for GradientBoostingClassifier you can add more boosted trees using warm_start. You wouldn't want an additional boosted tree to be fitted on a different mini-batch. This would result in a chaotic learning process.
