How can I measure the reliability of the specificity of a model with very small train, test, and validation datasets?

Stats newbie here. I have a small dataset of 646 samples that I've trained a reasonably performant model on (~99% test and val accuracy). To complicate things a little bit, the classes are somewhat unbalanced. It's a binary classification problem.

Here is my confusion matrix on the training data:

[[387   1]
 [  1  73]]

on testing data:

[[74  1]
 [ 0 10]]

on validation data:

[[85  1]
 [ 0 13]]

  1. Training Specificity: 0.986
  2. Testing Specificity: 0.909
  3. Validation Specificity: 0.928

My read is that the testing and validation specificities are quite a bit lower than the training specificity. However, given that only one sample is missed in each of the testing and validation datasets, what is my real-world specificity? Is there a better measure of generalizability? Is there something akin to a p-value that captures how reliable the specificity estimate is, given how small the negative class is?

Thanks!

Topic: generalization, statistics, machine-learning

Category: Data Science


The stand-in for real-world data is the test dataset. The data should be split so that the training and validation portions are seen repeatedly (for fitting and model selection), while the test set is evaluated only once, at the very end. If the model is robust, it will perform well on that held-out test set, under the assumption that the test data is as close as possible to the real-world data the model will face.
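
A rough illustration of that kind of split, as a sketch only: synthetic data stands in for the 646 real samples, and scikit-learn is assumed to be available. The test set is carved off first and touched once, while the remaining data can be re-split or cross-validated as often as needed.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Hypothetical stand-in for the real data: 646 samples with an
    # imbalanced binary label, roughly matching the ratio in the question.
    X, y = make_classification(n_samples=646, n_features=20,
                               weights=[0.85, 0.15], random_state=0)

    # Carve off the test set first; it is only evaluated once, at the very end.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.15, stratify=y, random_state=0)

    # The remainder is split into train and validation; these can be reused
    # repeatedly for fitting, tuning, or cross-validation.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.15, stratify=y_rest, random_state=0)

    print(len(X_train), len(X_val), len(X_test))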

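On the "something akin to a p-value" part of the question: one common option is to treat specificity as a binomial proportion (each actual negative is either classified correctly or not) and report a confidence interval for it. The sketch below uses the test-set counts from the question, 10 correct out of 11 (matching the reported 0.909), and assumes statsmodels is installed:

    from statsmodels.stats.proportion import proportion_confint

    # Counts taken from the question's test confusion matrix:
    # 10 of the 11 negative-class samples were classified correctly.
    correct, total = 10, 11

    print(f"point estimate: {correct / total:.3f}")

    # 95% confidence intervals for the true specificity.
    # "beta" is the exact Clopper-Pearson interval, "wilson" is the Wilson score interval.
    for method in ("beta", "wilson"):
        low, high = proportion_confint(correct, total, alpha=0.05, method=method)
        print(f"{method:>7}: ({low:.3f}, {high:.3f})")

With only eleven negatives the interval is wide (the lower bound of the exact interval comes out near 0.59), which is the quantitative way of saying that a single missed sample moves the estimate a lot.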