Is it possible to detect drift with real-time predictions?

I have been reading up on detecting data drift and concept drift. I found this library, but it seems all of its methods detect concept drift and take as input whether each prediction was correct or not (i.e., they require ground truth). Is this the correct assumption?

Then I stumbled on Kullback-Leibler divergence and Jensen-Shannon divergence. Can I use these methods to detect data drift in real time? (For example: a request comes into my model's API and a prediction is made; I then take the features and pass them to the function calculating the drift.)

Some of my concerns: do I need the full training data to compare against? As I understand it, these algorithms need equally sized samples to compare, so would I need a data set the same size as my training data? Even an explanation of which inputs are used to detect data drift vs. concept drift vs. covariate shift would be helpful.

Topic concept-drift data machine-learning

Category Data Science


You can detect drift in new predictions, though probably not in real time: you accumulate predictions so that you can detect relevant drift patterns rather than just outliers.

I suggest you take a look at the package drifter_ml. In its list of supported approaches for classification, there is a section called "Against New Predictions", which contains the following methods:

  • proportion of predictions per class
  • class imbalance tests
  • probability distribution similarity tests
  • calibration tests

As you can understand from their descriptions, you don't need the full training data, but only some statistics of it, or a representative subset, so that you have something to compare against.
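As a concrete illustration of the first item, here is a minimal sketch (my own illustration, not drifter_ml's actual API) that compares the class proportions of recent predictions against the proportions stored from training, using a chi-square goodness-of-fit test. Only the per-class training counts need to be kept, not the training data itself:

```python
from collections import Counter

from scipy.stats import chisquare


def class_proportion_drift(train_counts, new_preds, alpha=0.05):
    """Chi-square test of new prediction counts vs. training proportions.

    train_counts: dict mapping class label -> count seen at training time.
    new_preds:    iterable of predicted labels from the serving period.
    Returns (drift_detected, p_value).
    """
    total_train = sum(train_counts.values())
    new_counts = Counter(new_preds)
    classes = sorted(train_counts)
    observed = [new_counts.get(c, 0) for c in classes]
    n_new = sum(observed)
    # Expected counts if the serving class mix matched training.
    expected = [train_counts[c] / total_train * n_new for c in classes]
    _, p_value = chisquare(observed, f_exp=expected)
    return p_value < alpha, p_value
```

A batch whose class mix matches training yields a high p-value (no drift flagged), while a strongly skewed batch is flagged.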


You may or may not use ground truth to detect drift.

According to Google:

What is data drift? Data drift is one of the top reasons model accuracy degrades over time. For machine learning models, data drift is the change in model input data that leads to model performance degradation. Monitoring data drift helps detect these model performance issues.

In predictive analytics and machine learning, concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes.

Covariate shift is the change in the distribution of the covariates specifically, that is, the independent variables.

So I see data drift and covariate shift as very similar, if not equivalent. According to these definitions:

  • You need the ground truth to measure concept drift.
  • You don't need the ground truth to measure data drift.

In order to measure data drift:

  • You may or may not need all the training data. If you model the predictors (say, I fit a Gaussian to my feature $x_i$, with mean $\mu_i$ and standard deviation $\sigma_i$) and save the parameters of their distributions, that summary may be enough, and you don't need the full training data.
  • I don't think you need the same sample size for the serving data at all.
  • Data drift detection needs to be done in batches, so you can do it with your API as long as you store the incoming features and analyze them after a period of serving time. It doesn't make sense to say that a single observation has drifted, except in very extreme cases.
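The summary-statistics idea in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not a standard API: the function name `mean_drift` is hypothetical, only a per-feature mean and standard deviation saved from training are needed, and the accumulated serving batch can be any size. The batch mean is compared with a two-sided z-test:

```python
import math


def mean_drift(train_mu, train_sigma, serving_batch, alpha=0.01):
    """Two-sided z-test of a serving batch's mean against the training
    distribution, summarized only by its mean and standard deviation.

    Returns (drift_detected, z_score).
    """
    n = len(serving_batch)
    batch_mean = sum(serving_batch) / n
    # Standard error of the batch mean if the feature still followed
    # the training distribution.
    z = (batch_mean - train_mu) / (train_sigma / math.sqrt(n))
    # p-value from the standard normal CDF.
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return p_value < alpha, z
```

The same accumulate-then-compare pattern applies to the KL or Jensen-Shannon divergence mentioned in the question: save a binned histogram of each feature at training time and compare each serving batch's histogram against it, so neither the raw training data nor equal sample sizes are required.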
