Discrepancy between cross-validation and un-seen data predictions
I am facing an issue with an imbalanced dataset. The dataset contains 20% targets and 80% non-targets. When I give unseen data to the trained model, I would expect a confusion matrix like the one below.
[[1200    0]
 [   0  240]]
In reality I am getting the confusion matrix below. As you can see, it classifies very few of the targets correctly.
[[1133   67]
 [ 227   13]]
The training and validation curves of the CNN model look like the plot below.
Any thoughts on why so few targets get classified correctly even though training and validation go quite well? Am I missing something here? I tried changing the CNN model parameters (kernel size, dropout, number of CNN layers, early stopping, etc.). However, I do not see much change.
I read the post below on Stack Exchange about data leakage. However, (hopefully) that should not be the case with my code. Why k-fold cross validation (CV) overfits? Or why discrepancy occurs between CV and test set?
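One thing I have been checking: with a 80/20 split, plain accuracy can look good while the minority class is almost entirely missed. Below is a minimal diagnostic sketch (plain Python, no framework needed) that computes balanced accuracy from my actual confusion matrix and "balanced" class weights following scikit-learn's `n_samples / (n_classes * count)` convention; the specific numbers are from the matrices above, and passing such weights to `model.fit(class_weight=...)` in Keras is one common mitigation, though I am not certain it is the root cause here.

```python
def class_weights(counts):
    """'Balanced' weights: n_samples / (n_classes * count_c) per class c."""
    n = sum(counts)
    k = len(counts)
    return [n / (k * c) for c in counts]

def balanced_accuracy(cm):
    """Mean of per-class recalls; robust to imbalance, unlike raw accuracy."""
    return sum(row[i] / sum(row) for i, row in enumerate(cm)) / len(cm)

# Confusion matrix actually obtained on unseen data (from the question).
cm = [[1133, 67],   # non-targets: 1133 correct,  67 false alarms
      [ 227, 13]]   # targets:     227 missed,    13 detected

raw_acc = (1133 + 13) / 1440        # looks deceptively decent (~0.80)
bal_acc = balanced_accuracy(cm)     # exposes the problem (~0.50, i.e. chance)
weights = class_weights([1200, 240])  # 80/20 split -> [0.6, 3.0]
```

With `bal_acc` near 0.5 the classifier is effectively guessing on the minority class, even though `raw_acc` is around 0.80, which would also explain why the aggregate training/validation curves look fine.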
Topic cnn prediction tensorflow cross-validation confusion-matrix
Category Data Science