Meaningfully compare target vs observed TPR & FPR
Suppose I have a binary classifier $f$ which acts on an input $x$. Given a threshold $t$, the predicted binary output is defined as: $$ \widehat{y} = \begin{cases} 1, & f(x) \geq t \\ 0, & f(x) < t \end{cases} $$ I then compute the $TPR$ (true positive rate) and $FPR$ (false positive rate) metrics on the hold-out test set (call it $S_1$), as sketched in the snippet after these definitions:
- $TPR_{S_1} = \Pr(\widehat{y} = 1 | y = 1, S_1)$
- $FPR_{S_1} = \Pr(\widehat{y} = 1 | y = 0, S_1)$
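For concreteness, here is a minimal sketch of how I compute these on $S_1$ (the synthetic data, variable names such as `scores_s1`, and the threshold value are placeholders, not my actual pipeline):

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """TPR = P(y_hat = 1 | y = 1), FPR = P(y_hat = 1 | y = 0)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)
    fpr = np.mean(y_pred[y_true == 0] == 1)
    return tpr, fpr

# Synthetic stand-in for the hold-out set S1: scores_s1 plays the role of f(x),
# y_s1 the true labels.
rng = np.random.default_rng(0)
y_s1 = rng.integers(0, 2, size=1000)
scores_s1 = np.clip(0.3 * y_s1 + rng.normal(0.4, 0.2, size=1000), 0, 1)

t = 0.5                                   # decision threshold
y_hat_s1 = (scores_s1 >= t).astype(int)   # thresholding rule from above
tpr_s1, fpr_s1 = tpr_fpr(y_s1, y_hat_s1)
print(f"TPR_S1 = {tpr_s1:.3f}, FPR_S1 = {fpr_s1:.3f}")
```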
Next, say I deploy the classifier to production (i.e., it now acts on real-world data), and after two weeks I collect the results and have the data labelled (by a human; assume no errors in labelling). Call this data set $S_2$. I observe the following (the bookkeeping is sketched in the snippet after this list):
- total samples during this period: $|S_2| = N + P$, where $N$ is the number of negatives and $P$ the number of positives
- $TPR_{S_2} = \Pr(\widehat{y} = 1 | y = 1, S_2)$
- $FPR_{S_2} = \Pr(\widehat{y} = 1 | y = 0, S_2)$
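Continuing the sketch above (reusing `tpr_fpr`, `rng`, and `t`), this is the computation I have in mind for $S_2$; again, the data and names are placeholders:

```python
# Synthetic stand-in for the production set S2: y_s2 are the human-provided labels,
# y_hat_s2 the predictions the deployed classifier actually made.
y_s2 = rng.integers(0, 2, size=5000)
scores_s2 = np.clip(0.25 * y_s2 + rng.normal(0.42, 0.2, size=5000), 0, 1)
y_hat_s2 = (scores_s2 >= t).astype(int)

P = int(np.sum(y_s2 == 1))   # positives observed in production
N = int(np.sum(y_s2 == 0))   # negatives observed in production
tpr_s2, fpr_s2 = tpr_fpr(y_s2, y_hat_s2)
print(f"|S2| = {N + P} (N = {N}, P = {P})")
print(f"TPR_S2 = {tpr_s2:.3f}, FPR_S2 = {fpr_s2:.3f}")
```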
My question is this:
Under what conditions/assumptions can I meaningfully compare the target TPR and FPR (as computed on the hold-out set $S_1$) with the observed TPR and FPR (as computed on the production data set $S_2$)? Or, at the very least, is there a relation between the TPR and FPR on $S_1$ and those on $S_2$? Does it even make sense to compare them?
My intuition is that the input distributions in $S_1$ and $S_2$ should be similar, but I need some help formalizing this concept.
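To make that intuition concrete, the following is the kind of check I currently have in mind; a per-feature two-sample Kolmogorov-Smirnov test is just one possible example, and the feature matrices here are synthetic placeholders. Formalizing when such a check justifies comparing the metrics is exactly what I am unsure about.

```python
from scipy import stats
import numpy as np

# Hypothetical feature matrices for S1 and S2 (rows = samples, columns = features);
# S2 is shifted slightly to mimic production drift.
rng = np.random.default_rng(1)
X_s1 = rng.normal(0.0, 1.0, size=(1000, 5))
X_s2 = rng.normal(0.1, 1.0, size=(5000, 5))

# Naive per-feature check: two-sample KS test between the S1 and S2 marginals.
for j in range(X_s1.shape[1]):
    res = stats.ks_2samp(X_s1[:, j], X_s2[:, j])
    print(f"feature {j}: KS statistic = {res.statistic:.3f}, p-value = {res.pvalue:.3g}")
```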
Any tips and literature suggestions are greatly appreciated!
Topic binary-classification model-evaluations mlops
Category Data Science