How can I measure the reliability of the specificity of a model with very small train, test, and validation datasets?
Stats newbie here. I have a small dataset of 646 samples on which I've trained a reasonably performant model (~99% test and validation accuracy). It's a binary classification problem, and to complicate things a bit, the classes are somewhat imbalanced.
Here is my confusion matrix on the training data:
[[387   1]
 [  1  73]]
on the testing data:
[[74  1]
 [ 0 10]]
on the validation data:
[[85  1]
 [ 0 13]]
- Training specificity: 0.986
- Testing specificity: 0.909
- Validation specificity: 0.928
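For reference, here is roughly how I'm computing those specificity values, as a minimal Python sketch. It assumes the matrices above are printed with predicted labels in rows and actual labels in columns (the opposite of scikit-learn's default orientation), and that the minority class is treated as the negative class; if either assumption is off, the cells used for TN and FP change accordingly.

import numpy as np

# Confusion matrices as posted above, read with predicted labels in rows
# and actual labels in columns (note: sklearn's confusion_matrix uses the
# opposite orientation, actual labels in rows / predicted in columns).
train_cm = np.array([[387,  1],
                     [  1, 73]])
test_cm  = np.array([[74,  1],
                     [ 0, 10]])
val_cm   = np.array([[85,  1],
                     [ 0, 13]])

def specificity(cm, negative_class=1):
    """Specificity = TN / (TN + FP), treating `negative_class` as the negative label."""
    tn = cm[negative_class, negative_class]   # actual negative, predicted negative
    fp = cm[:, negative_class].sum() - tn     # actual negative, predicted positive
    return tn / (tn + fp)

for name, cm in [("train", train_cm), ("test", test_cm), ("val", val_cm)]:
    print(f"{name}: {specificity(cm):.4f}")
# train: 0.9865, test: 0.9091, val: 0.9286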
My thought is that the testing and validation specificity look low compared to the training specificity. However, given that only one sample is missed in each of the testing and validation sets, what is my real-world specificity? Is there a better measure of generalizability? Is there something akin to a p-value or confidence interval that reflects how reliable the specificity estimate is, given how few samples are in the negative class?
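To make that last question more concrete: is a binomial confidence interval on the specificity the kind of thing I should be computing? Below is a rough sketch of what I have in mind, using the test-set counts (10 of 11 negatives correct) and the Wilson score interval from statsmodels; I'm not sure this is the statistically appropriate tool, which is really what I'm asking.

from statsmodels.stats.proportion import proportion_confint

# Test set: 10 of the 11 actual negatives were classified correctly (specificity ~ 0.909)
tn, n_negatives = 10, 11

# Wilson score interval for the true specificity, 95% confidence
ci_low, ci_high = proportion_confint(count=tn, nobs=n_negatives,
                                     alpha=0.05, method="wilson")
print(f"specificity = {tn / n_negatives:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")

My rough understanding is that with only around a dozen negatives this interval will be quite wide, which may itself be the answer, but I'd appreciate confirmation that this is the right way to quantify the reliability.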
Thanks!
Topic: generalization, statistics, machine-learning
Category: Data Science