Discrepancy between cross-validation and unseen-data predictions

I am facing an issue with an imbalanced dataset: it contains 20% targets and 80% non-targets. I expect the confusion matrix below when I give unseen data to the trained model.

[[1200    0]
 [   0  240]]

In reality I am getting the confusion matrix below. As you can see, it classifies very few of the targets correctly.

[[1133   67]
 [ 227   13]]
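
For reference, matrices like these are what scikit-learn's confusion_matrix produces (rows are true classes, columns are predicted classes). A minimal sketch of how they can be computed, assuming a trained binary classifier `model` and held-out arrays `X_test`/`y_test` (illustrative names, not my actual code):

    # Sketch: confusion matrix for a binary classifier.
    # `model`, `X_test`, `y_test` are assumed/illustrative names.
    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_prob = model.predict(X_test)                  # predicted probabilities
    y_pred = (np.ravel(y_prob) >= 0.5).astype(int)  # 0.5 threshold

    # Rows = true classes, columns = predicted classes
    print(confusion_matrix(y_test, y_pred))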

The training and validation curves of the CNN model look like the ones below.

[figure: training and validation accuracy/loss curves]

Any thoughts on why so few targets get classified correctly even though training and validation go quite well? Am I missing something here? I tried changing the CNN model parameters (kernel size, dropout, number of CNN layers, early stopping, etc.); however, I do not see much change.
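
For context, a minimal sketch of the kind of model and tuning knobs I mean; the architecture, input shape, and hyperparameters here are illustrative assumptions, not my exact code:

    # Illustrative CNN with the parameters mentioned above
    # (kernel size, dropout, layer count, early stopping).
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu",
                               input_shape=(64, 64, 1)),  # assumed shape
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, kernel_size=3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #           epochs=200, callbacks=[early_stop])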

I read the post below on Stack Exchange about data leakage; however, (hopefully) that should not be the case with my code: "Why k-fold cross validation (CV) overfits? Or why discrepancy occurs between CV and test set?"



I figured out the issue with my code: it was "data leakage".

My earlier sequence of pre-processing steps was as follows (sketched in code after the list):

  1. Load the data
  2. Augment the data
  3. Balance the data
  4. Shuffle the data
  5. Split the data between "Train" and "Validation"
  6. Saturate the "Train" outliers to a constant value
  7. Saturate the "Validation" outliers to a constant value
  8. Normalize the data between 0 and 1
  9. Feed the "Train" and "Validation" data to the model
  10. Plot the training and validation graph
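
In code, the leak looks roughly like this; a minimal sketch with illustrative names (`X`, `y`, a toy augment step), omitting the saturation step for brevity:

    # Sketch of the LEAKY ordering above (illustrative, simplified).
    # Problem 1: augmenting/balancing before the split lets near-copies
    #            of the same sample land in both train and validation.
    # Problem 2: normalization statistics are computed on all the data,
    #            so the validation set influences the training inputs.
    import numpy as np
    from sklearn.model_selection import train_test_split

    def augment(X, y):
        # toy augmentation: duplicate every sample with small noise
        X_new = X + 0.01 * np.random.randn(*X.shape)
        return np.concatenate([X, X_new]), np.concatenate([y, y])

    X_aug, y_aug = augment(X, y)                     # before the split!
    X_train, X_val, y_train, y_val = train_test_split(
        X_aug, y_aug, test_size=0.2, shuffle=True)

    lo, hi = X_aug.min(), X_aug.max()                # stats include val
    X_train = (X_train - lo) / (hi - lo)
    X_val = (X_val - lo) / (hi - lo)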

As you can see in the above sequence (and the sketch), information was leaking from the validation data into the training data: augmenting and balancing before the split let near-copies of the same samples end up on both sides, and the normalization saw the whole dataset. Hence I was getting a deceptively perfect picture of a converging model (the graph posted in the original question). I modified the sequence of pre-processing as below:

  1. Load the data
  2. Split the data between "Train" and "Validation"
  3. Augment the "Train" data
  4. Balance the "Train" data
  5. Shuffle the "Train" data
  6. Shuffle the "Validation" data
  7. Saturate the "Train" outliers to a constant value
  8. Saturate the "Validation" outliers to a constant value
  9. Normalize the "Train" data between 0 and 1
  10. Normalize the "Validation" data between 0 and 1
  11. Feed the "Train" and "Validation" data to the model
  12. Plot the training and validation graph
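
The same steps as a minimal sketch (same illustrative names as before): split first, then derive every data-dependent quantity, including the normalization bounds, from the training set alone:

    # Sketch of the CORRECTED ordering (illustrative, simplified).
    import numpy as np
    from sklearn.model_selection import train_test_split

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, shuffle=True,
        stratify=y)                           # keep the 80/20 class ratio

    X_train, y_train = augment(X_train, y_train)   # train only; augment()
                                                   # as in the sketch above

    lo, hi = X_train.min(), X_train.max()     # stats from train only
    X_train = (X_train - lo) / (hi - lo)
    X_val = np.clip((X_val - lo) / (hi - lo),
                    0.0, 1.0)                 # reuse train stats; clip val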

Now the graph gives me a much more realistic picture of the accuracy and loss of the CNN; see below. It looks like after 100 iterations the accuracy plateaus (around 70%) while the loss blows up. I do not know yet how to fix the exploding loss, but at least I am not getting a false picture any more.

[figure: training and validation accuracy/loss curves after the fix]
