Monitoring machine learning models in production

I am looking for tools that allow me to monitor machine learning models once they have gone to production. I would like to monitor:

  1. Long-term changes: shifts in the feature distributions relative to training time, which would suggest retraining the model.
  2. Short-term changes: bugs in the features (radical changes in distribution).
  3. Changes in the performance of the model with respect to a given metric.

I have been looking around the Internet, but I haven't found any in-depth analysis of these cases. Can you provide me with techniques, books, references, or software?

Topic data-product machine-learning

Category Data Science


There are a few good startups and open-source tools that offer solutions for ML monitoring (I actually work at a startup in this field).

Here you can find a few comparison resources that compare some of them across different features. I recommend the Airtable by Ori at the top of the list, and mlops.toys (an open-source project created by some of my colleagues, so maybe I'm biased, but I love it).

The MLCompendium is, in general, a good source of information on many subjects in the ML field.

I really can't recommend the best tool for you because it depends on your exact needs:

  • Are you looking for monitoring along the way, as part of a full pipeline tool, or for a dedicated, more advanced tool specifically for monitoring to extend your existing pipeline?
  • Do you work with tabular data? NLP? Vision?
  • What is the frequency of your predictions?
  • Do you need to monitor all your data or just a segment of it?
  • etc...

In addition, this short blog post a colleague of mine wrote on Concept Drift Detection Methods may help you as well. You can find many more articles on the subject in the link to the MLCompendium I attached above.


The changes in distribution with respect to training time are sometimes referred to as concept drift.

It seems to me that the amount of information available online about concept drift is not very large. You may start with its Wikipedia page or some blog posts, like this and this.

In terms of research, you may want to take a look at the scientific production of João Gama, or at chapter 3 of his book.

Regarding software packages, a quick search reveals a couple of Python libraries on GitHub, like tornado and concept-drift.

Update: recently I came across ML-drifter, a Python library that seems to match the requirements of the question for scikit-learn models.


There are all kinds of solutions right now. Broadly, you can divide them into two categories:

  1. Monitoring features as part of a bigger AI platform
  2. A dedicated monitoring solution

A few factors to examine before choosing between the two options:

  • What is the scale of your ML model usage?
  • What is the impact of your models? Are they part of your core business, or only an enrichment / niche part of your business?
  • What is your DS team size?
  • How many platforms do you use to deploy models to production? Do you have only one standard way to deploy?

The general theme is: the bigger your ML operation, and the more you need it to be agnostic to the deployment platform, the more you should go for a dedicated solution. If your ML operations are still very limited and your serving platform already has a few monitoring features in place, that might be good enough for now.

When examining a specific solution, consider the following points:

Integration - How complicated is it?

Measurement - Does it offer stability measurement for your data (inputs / inferences / labels)?

Performance analysis - Does it give you the ability to close the loop and see performance analytics? (Note that in most cases, even if you can get performance metrics, you probably won't be able to base your monitoring on them alone, because in reality such performance information is usually available only after a delay, well after the predictions were made.)

Resolution - Can the system detect and measure these metrics at a higher resolution (sub-segments of your entire dataset)? In many cases, drift or technical issues occur only in a specific subset of your data.

Alerts - Does the solution also include a statistical alerting mechanism? Ultimately, it's hard to track all the KPIs mentioned above manually, and every dataset behaves differently, so fixed thresholds are hard to define (a minimal sketch of one adaptive-threshold approach appears after this checklist).

Dashboard - Does the solution contain a clear UI dashboard?

API - Can you consume these production insights directly from an API? That can be very useful for building automation on top of it.
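
To make the alerting point more concrete, here is a minimal sketch of one possible adaptive-threshold approach, assuming pandas/NumPy and a daily drift metric such as a population stability index; the window size, multiplier, and numbers below are illustrative placeholders, not recommendations.

```python
# Hypothetical sketch: adaptive alerting on a monitored drift metric.
# Instead of a fixed threshold, flag values that deviate from the metric's
# own recent history (rolling mean +/- k standard deviations).
import numpy as np
import pandas as pd

def adaptive_alerts(metric: pd.Series, window: int = 30, k: float = 3.0) -> pd.Series:
    """Return a boolean Series marking points where the metric leaves its
    recent 'normal' band. `window` and `k` are tuning knobs, not universal defaults."""
    rolling = metric.rolling(window, min_periods=window)
    mean, std = rolling.mean().shift(1), rolling.std().shift(1)  # use only past data
    upper, lower = mean + k * std, mean - k * std
    return (metric > upper) | (metric < lower)

# Example: daily drift-metric values for one feature (made-up numbers)
psi = pd.Series(np.r_[np.random.normal(0.05, 0.01, 60), np.random.normal(0.25, 0.02, 5)])
print(psi[adaptive_alerts(psi)])  # the last few, shifted days should be flagged
```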

By the way, here is a blog post I wrote about the different elements that should be covered when monitoring ML, along with a review of current solutions.


While reading this Nature paper: Explainable AI for Trees: From Local Explanations to Global Understanding, I found that section 2.7.4, "Local model monitoring reveals previously invisible problems with deployed machine learning models", says the following:

Deploying machine learning models in practice is challenging because of the potential for input features to change after deployment. It is hard to detect when such changes occur, so many bugs in machine learning pipelines go undetected, even in core software at top tech companies [78]. We demonstrate that local model monitoring helps debug model deployments by decomposing the loss among the model’s input features and so identifying problematic features (if any) directly. This is a significant improvement over simply speculating about the cause of global model performance fluctuations

They then run three experiments with the Shapley values provided by TreeExplainer:

  1. We intentionally swapped the labels of operating rooms 6 and 13 two-thirds of the way through the dataset to mimic a typical feature pipeline bug. The overall loss of the model’s predictions gives no indication that a problem has occurred (Figure 5A), whereas the SHAP monitoring plot for room 6 feature clearly shows when the labeling error begins

  2. Figure 5C shows a spike in error for the general anesthesia feature shortly after the deployment window begins. This spike corresponds to a subset of procedures affected by a previously undiscovered temporary electronic medical record configuration problem (Methods 17).

  3. Figure 5D shows an example of feature drift over time, not of a processing error. During the training period and early in deployment, using the ‘atrial fibrillation’ feature lowers the loss; however, the feature becomes gradually less useful over time and ends up hurting the model. We found this drift was caused by significant changes in atrial fibrillation ablation procedure duration, driven by technology and staffing changes

Current deployment practice is to monitor the overall loss of a model over time, and potentially statistics of input features. TreeExplainer enables us to instead directly allocate a model’s loss among individual features
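
To illustrate the idea (not the paper's exact setup), here is a rough sketch of loss-based SHAP monitoring using the open-source shap package with an XGBoost model; the dataset, the time-ordered split, and the windowing are placeholder assumptions.

```python
# Sketch of loss-based monitoring with SHAP, assuming the `shap` and `xgboost`
# packages; the data and split here are stand-ins, not the paper's setup.
import numpy as np
import shap
import xgboost
from sklearn.model_selection import train_test_split

X, y = shap.datasets.adult()            # stand-in tabular dataset
y = y.astype(int)
X_train, X_deploy, y_train, y_deploy = train_test_split(X, y, shuffle=False)

model = xgboost.XGBClassifier().fit(X_train, y_train)

# Explain the model's *loss* rather than its output, so each SHAP value
# says how much a feature contributed to the error of each prediction.
explainer = shap.TreeExplainer(
    model,
    X_train.sample(200, random_state=0),          # background data
    feature_perturbation="interventional",
    model_output="log_loss",
)
loss_shap = explainer.shap_values(X_deploy, y_deploy)

# A feature whose mean loss-SHAP rises over "time" (here, row order) is
# becoming harmful -- the kind of drift the paper's third experiment describes.
for i, chunk in enumerate(np.array_split(loss_shap, 10)):
    worst = np.argmax(chunk.mean(axis=0))
    print(f"window {i}: feature contributing most to the loss = {X.columns[worst]}")
```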


You can have a look at Anodot's MLWatcher. A few of the highlights of this tool are as follows:

  • MLWatcher collects and monitors metrics from machine learning models in production.
  • This open source Python agent is free to use, simply connect to your BI service to visualize the results.
  • To detect anomalies in these metrics, either set rule-based alerting, or sync to an ML anomaly detection solution, such as Anodot, to execute at scale.
  • It monitors the distributions of input features.

You can have a look at their complete features here.


What you're describing is known as concept drift, and there are quite a few software startups bringing solutions to market (us included - happy to show you what we have).

  1. A very simplistic way of detecting drift is to monitor the differences between the distributions of the prediction-time (production) dataset and the training dataset, using a Kolmogorov-Smirnov test or the Wasserstein distance (see the sketch after this list).

  2. For radical changes in distribution, you might build a model of the dataset's normal patterns and use an outlier detector to flag genuinely radical changes to the distribution while filtering out false positives.

  3. This is an interesting use case - are you able to share an example?
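
To make point 1 concrete, here is a minimal sketch using scipy; train_df and prod_df are hypothetical placeholders for your training-time and production feature tables, and the significance level is just an illustrative default.

```python
# Per-feature drift check: compare each numeric feature's production
# distribution against its training distribution.
import pandas as pd
from scipy.stats import ks_2samp, wasserstein_distance

def drift_report(train_df: pd.DataFrame, prod_df: pd.DataFrame, alpha: float = 0.01) -> pd.DataFrame:
    rows = []
    for col in train_df.select_dtypes("number").columns:
        train, prod = train_df[col].dropna(), prod_df[col].dropna()
        ks_stat, p_value = ks_2samp(train, prod)
        rows.append({
            "feature": col,
            "ks_statistic": ks_stat,
            "p_value": p_value,
            "wasserstein": wasserstein_distance(train, prod),
            "drift_suspected": p_value < alpha,
        })
    return pd.DataFrame(rows).sort_values("ks_statistic", ascending=False)
```

Keep in mind that with large production samples the KS test will flag even tiny, harmless shifts as statistically significant, which is why the Wasserstein distance is reported alongside the p-value.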


If I understand your query correctly, you are looking for MLflow, where you can track your experiments and visualize them using its APIs.
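
For completeness, here is a minimal sketch of logging metrics with MLflow's tracking API; the tracking URI, experiment name, and metric values below are placeholders. MLflow is primarily an experiment-tracking tool, so for the production-monitoring use case you would typically log these metrics from a scheduled monitoring job.

```python
# Minimal MLflow tracking sketch; URI, names, and values are placeholders.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")   # or a local file:// path
mlflow.set_experiment("fraud-model-production-monitoring")

with mlflow.start_run(run_name="daily-check"):
    # Log whatever your monitoring job computes: drift statistics,
    # live performance metrics, data-quality counters, etc.
    mlflow.log_param("model_version", "v12")
    mlflow.log_metric("ks_statistic_amount", 0.07)
    mlflow.log_metric("wasserstein_amount", 1.3)
    mlflow.log_metric("rolling_auc", 0.81)
```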
