Is data leakage giving me misleading results? Independent test set says no!
TLDR:
I evaluated a classification model using 10-fold CV with data leakage between the training and test folds. The results were great. I then fixed the data leakage and the results were garbage. I then tested the model on an independent new dataset and the results were similar to the evaluation performed with data leakage.
What does this mean? Was my data leakage not relevant? Can I trust my model evaluation and report that performance?
Extended version:
I'm developing a binary classification model using a 108326x125 dataset (observations x features) with a class imbalance of ~1:33 (1 positive observation for every 33 negative ones). However, those 108326 observations come from only 95 subjects, which means there is more than one observation per subject.
Initially, during model training and evaluation, I performed cross-validation (CV) using the following command (classes is a column array with the class of each observation):
cross_validation = cvpartition(classes,'KFold',10,'Stratify',true);
and obtained good performance in terms of the metrics that interested me the most (recall and precision). The model was an ensemble (boosted trees).
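For reference, computing those metrics from this (leaky) partition looks roughly like the sketch below (simplified and illustrative: X stands for the feature matrix, 'LogitBoost' is just an example boosting method, the metrics are pooled over folds, and the positive class is assumed to be the second class in confusionmat's ordering):
% Cross-validate the ensemble on the (leaky) partition and pool the out-of-fold predictions
cv_model = fitcensemble(X, classes, 'Method', 'LogitBoost', 'CVPartition', cross_validation);
predicted = kfoldPredict(cv_model);
% Confusion matrix: rows = true class, columns = predicted class
C = confusionmat(classes, predicted);
tp = C(2,2); fn = C(2,1); fp = C(1,2);  % assumes the positive class is listed second
recall = tp / (tp + fn);
precision = tp / (tp + fp);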
However, with the above CV partition I have data leakage, since observations from the same subject can end up in both the training and test sets of a given CV iteration. Besides the data leakage, I suspected my model could be overfitting, since the optimized value of the maximum number of leaves per tree in the ensemble is around 350, whereas this parameter is usually kept somewhere between 8 and 32 (to prevent each tree from becoming a strong learner, which might lead to overfitting).
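For illustration only, a deliberately weak learner can be enforced in MATLAB via templateTree's MaxNumSplits (a tree with at most k splits has at most k+1 leaves); X and the 'LogitBoost' method below are placeholders, not necessarily how tree size is parameterized in my actual pipeline:
% Sketch: cap each tree at 31 splits (i.e. at most 32 leaves)
weak_tree = templateTree('MaxNumSplits', 31);
constrained_ensemble = fitcensemble(X, classes, 'Method', 'LogitBoost', ...
    'Learners', weak_tree, 'NumLearningCycles', 300);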
I then performed a different CV partition where I split by subject ID, which solves the data leakage problem. However, with this partition the class distribution can become very different between the training and test sets (around 30 subjects have no positive observations, and in the extreme case some test folds can end up with 0 positive observations), which affects my evaluation of the model's performance. To mitigate this, I performed repeated 5x10-CV (5 repetitions of 10-fold CV).
With this partition, I've tested several different model types (including the exact same model and hyperparameters as the one evaluated with data leakage), such as MATLAB's fitcensemble and KNN and Python's XGBoost (with hyperparameter optimization for all of them), and no matter what I do, I simply cannot reach acceptable performance with this approach. Therefore, my first questions are:
1. Is there something wrong with this partitioning that might be influencing my model evaluation? (see code below)
2. Do you have any suggestions for improving this CV partitioning?
Finally, to confirm that my initial model evaluation (with data leakage in the partition) was misleading me, I tested the model on an independent new dataset (albeit much smaller) and the performance was good (similar to the one obtained through the leaky CV partition)!
What does this mean? Was my data leakage not relevant? Can I trust my model evaluation and report that performance?
Thank you in advance!
*Code for the subject-based partition:
% Shuffle the subject list to randomize the partition process
data_name_list = data_name_list(randperm(length(data_name_list)));
% Get array containing the corresponding fold of each subject ('histcounts' splits subjects as
% uniformly as possible when using BinWidth like this)
[~,~,fold_of_subject] = histcounts(1:length(data_name_list), 'BinWidth', length(data_name_list)/num_folds);
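The subject-level folds are then mapped back to observation-level train/test indices roughly as sketched below (subject_of_observation is a hypothetical per-observation subject-ID array that I haven't shown here; the whole procedure is re-run 5 times with a fresh permutation for the repeated 5x10-CV):
% Sketch: build observation-level train/test masks from the subject-level folds
% (subject_of_observation is a hypothetical per-observation subject-ID array)
for k = 1:num_folds
    test_subjects = data_name_list(fold_of_subject == k);
    test_idx = ismember(subject_of_observation, test_subjects);
    train_idx = ~test_idx;
    % ... train on the train_idx rows, evaluate on the test_idx rows ...
end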