Meaningfully compare target vs observed TPR & FPR

Suppose I have a binary classifier $f$ which acts on an input $x$. Given a threshold $t$, the predicted binary output is defined as: $$ \widehat{y} = \begin{cases} 1, & f(x) \geq t \\ 0, & f(x) < t \end{cases} $$ I then compute the $TPR$ (true positive rate) and $FPR$ (false positive rate) metrics on the hold-out test set (call it $S_1$), as sketched after the list below:

  • $TPR_{S_1} = \Pr(\widehat{y} = 1 | y = 1, S_1)$
  • $FPR_{S_1} = \Pr(\widehat{y} = 1 | y = 0, S_1)$
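For concreteness, here is a minimal sketch of how these quantities are computed for a labelled set such as $S_1$; the scores and labels below are hypothetical placeholders:

```python
# Thresholded predictions and TPR / FPR on a labelled set (e.g. the hold-out S1).
# Scores and labels are hypothetical placeholders.
import numpy as np

t = 0.5
scores = np.array([0.9, 0.4, 0.7, 0.2, 0.6, 0.1])   # f(x) for each sample
y_true = np.array([1,   0,   1,   0,   0,   1  ])   # human labels

y_hat = (scores >= t).astype(int)                    # y_hat = 1 iff f(x) >= t

tpr = np.sum((y_hat == 1) & (y_true == 1)) / np.sum(y_true == 1)  # P(y_hat=1 | y=1)
fpr = np.sum((y_hat == 1) & (y_true == 0)) / np.sum(y_true == 0)  # P(y_hat=1 | y=0)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```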

Next, say I deploy the classifier to production (i.e. acting on real-world data), and after two weeks I collect the results and have the data labelled (by a human, assume no errors in labelling). Call this data set $S_2$. I observe the following:

  • total samples during this period = $|S_2|$ = $N$ negative + $P$ positive
  • $TPR_{S_2} = \Pr(\widehat{y} = 1 | y = 1, S_2)$
  • $FPR_{S_2} = \Pr(\widehat{y} = 1 | y = 0, S_2)$

My question is this:

Under what conditions / assumptions can I meaningfully compare the target TPR and FPR (as computed on the hold-out set $S_1$) to the observed TPR and FPR (as computed on the production data set $S_2$)? Or, at the very least, is there a relation between the TPR and FPR on $S_1$ and those on $S_2$? Does it even make sense to compare them?

My intuition is that the input distributions in $S_1$ and $S_2$ should be similar, but I need some help formalizing this concept.

Any tips and literature suggestions are greatly appreciated!

Topic binary-classification model-evaluations mlops

Category Data Science


You can run A/B tests on the TPRs and FPRs to test your assumption that they come from the same distribution. The null hypothesis is that the true positives (and, separately, the false positives) obtained from the $S_1$ and $S_2$ results come from the same distribution. A simple t-test could be a good start (or a two-proportion z-test if you have a lot of samples); see the sketch below.
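As an illustration, here is a minimal sketch of such a test on the TPRs, assuming you have the raw counts of positives and true positives from each set (the counts below are made up):

```python
# Two-proportion z-test comparing TPR on the hold-out set S1 vs production S2.
# Counts are illustrative placeholders; substitute your own.
from statsmodels.stats.proportion import proportions_ztest

tp_s1, pos_s1 = 180, 200   # true positives and total positives in S1 (hypothetical)
tp_s2, pos_s2 = 410, 500   # true positives and total positives in S2 (hypothetical)

# H0: TPR_S1 == TPR_S2; H1: they differ
stat, p_value = proportions_ztest(count=[tp_s1, tp_s2], nobs=[pos_s1, pos_s2])
print(f"z = {stat:.3f}, p-value = {p_value:.3f}")
# Repeat with false-positive / total-negative counts to compare the FPRs.
```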


Your points totally make sense, as monitoring model performance is key in an MLOps strategy. About these points, I would say:

  • the metric you might want to monitor to measure your model's performance is ROC AUC (which is computed from your TPR and FPR across all thresholds)
  • yes, it makes sense to compare the ROC AUC from the training phase (on your hold-out set $S_1$) versus new ROC AUC values computed on production predictions (once you have the true labels to validate, of course); see the sketch after this list
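A minimal sketch of that comparison, assuming you keep the raw scores $f(x)$ and the human labels for both sets (the arrays below are hypothetical placeholders):

```python
# Compare ROC AUC on the hold-out set S1 versus labelled production data S2.
# y_*_true are the human labels, score_* are the raw classifier outputs f(x);
# all values are hypothetical placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

y_s1_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])
score_s1  = np.array([0.2, 0.8, 0.7, 0.3, 0.9, 0.4, 0.6, 0.1])
y_s2_true = np.array([0, 1, 0, 0, 1, 1, 0, 1])
score_s2  = np.array([0.3, 0.7, 0.5, 0.2, 0.6, 0.9, 0.4, 0.8])

auc_s1 = roc_auc_score(y_s1_true, score_s1)
auc_s2 = roc_auc_score(y_s2_true, score_s2)
print(f"hold-out AUC = {auc_s1:.3f}, production AUC = {auc_s2:.3f}")
print(f"relative drop = {(auc_s1 - auc_s2) / auc_s1:.1%}")
```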

About monitoring when a model should be retrained or might be underperforming:

  • check that your input distributions are still the same between training and inference in production; drift here is known as data drift, and it can be detected with the Population Stability Index (PSI) or with statistical tests such as the Kolmogorov-Smirnov test, on both univariate and multivariate inputs (see the sketch after this list)
  • it is also interesting to track possible concept drift (i.e. a change in the relationship between the inputs and the target variable)
  • check your metric value (ROC AUC in this case) and decide a threshold below which you consider your model to be underperforming (e.g. is the production metric 10% worse than its value at training time?)
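Here is a minimal drift-check sketch for a single input feature, using a two-sample Kolmogorov-Smirnov test and a simple PSI implementation; the feature arrays are synthetic placeholders standing in for the $S_1$ and $S_2$ inputs:

```python
# Data-drift check on one input feature: two-sample Kolmogorov-Smirnov test
# plus a simple Population Stability Index (PSI). Feature arrays are synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
feature_train = rng.normal(loc=0.0, scale=1.0, size=5_000)   # stands in for S1 inputs
feature_prod  = rng.normal(loc=0.3, scale=1.1, size=5_000)   # stands in for S2 inputs

# Kolmogorov-Smirnov test: H0 = both samples come from the same distribution
ks_stat, ks_p = ks_2samp(feature_train, feature_prod)
print(f"KS statistic = {ks_stat:.3f}, p-value = {ks_p:.3g}")

def psi(expected, actual, bins=10):
    """Population Stability Index, with bin edges taken from the expected sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac, a_frac = np.clip(e_frac, 1e-6, None), np.clip(a_frac, 1e-6, None)
    return np.sum((a_frac - e_frac) * np.log(a_frac / e_frac))

print(f"PSI = {psi(feature_train, feature_prod):.3f}  (rule of thumb: > 0.2 = notable drift)")
```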
