Is data leakage giving me misleading results? Independent test set says no!

TLDR:

I evaluated a classification model using 10-fold CV that had data leakage between the training and test folds. The results were great. I then fixed the data leakage and the results were garbage. I then tested the model on an independent new dataset and the results were similar to those of the evaluation performed with data leakage.

What does this mean? Was my data leakage not relevant? Can I trust my model evaluation and report that performance?


Extended version:

I'm developing a binary classification model using a 108326x125 dataset (observations x features) with a class imbalance of ~1:33 (1 positive observation for every 33 negative observations). However, those 108326 observations come from only 95 subjects, which means there is more than one observation per subject.

Initially, during model training and evaluation, I performed cross-validation (CV) using the command (classes is a column array with the class of each observation):

cross_validation = cvpartition(classes,'KFold',10,'Stratify',true);

and obtained good performance in terms of the metrics that interested me the most (recall and precision). The model was an ensemble (boosted trees).
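For context, this is how recall and precision can be computed on a single test fold (a minimal sketch, not my actual code; y_test and y_pred are placeholder names and I'm assuming the classes are coded 0/1 with 1 as the positive class):

% Sketch: recall and precision from one test fold (placeholder variable names)
C = confusionmat(y_test, y_pred, 'Order', [0 1]);   % rows = true class, columns = predicted class
tp = C(2,2); fp = C(1,2); fn = C(2,1);
recall    = tp / (tp + fn);   % fraction of actual positives that were found
precision = tp / (tp + fp);   % fraction of positive predictions that were correct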

However, with the above CV partition I have data leakage, since observations from the same subject can end up simultaneously in the training and test sets of a given CV iteration. Besides the data leakage, I suspected my model could be overfitting, since the optimal value found for each tree's maximum number of leaves in the ensemble is around 350, while this parameter is usually set somewhere between 8 and 32 (to prevent each tree from becoming a strong learner, which can lead to overfitting).
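To quantify that leakage, a check like the following counts, for each fold, how many test-fold subjects also appear in that fold's training set (a sketch only; subject_of_obs is a placeholder cell array holding each observation's subject ID and is not part of my actual code):

% Sketch: measure the subject overlap produced by the observation-level partition
cross_validation = cvpartition(classes,'KFold',10,'Stratify',true);
for k = 1:cross_validation.NumTestSets
    train_subj = unique(subject_of_obs(training(cross_validation,k)));
    test_subj  = unique(subject_of_obs(test(cross_validation,k)));
    n_shared   = numel(intersect(train_subj, test_subj));
    fprintf('Fold %d: %d of %d test subjects also appear in training\n', ...
        k, n_shared, numel(test_subj));
end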

Then I built a different CV partition where the split is made by subject ID, which solves the data leakage problem. However, with this approach the class distribution can become very different between the training and test sets (around 30 subjects have no positive observations, so in the extreme case a test fold may end up with 0 positive observations), which affects my evaluation of the model's performance. To mitigate this, I performed repeated 5x10 CV.

With this partition, I've tested several different model types (including the exact same model and hyperparameters as the one with data leakage), such as MATLAB's fitcensemble and KNN and Python's XGBoost (with hyperparameter optimization in all of them), and no matter what I do, I simply cannot reach acceptable performance with this approach. Therefore, my first questions are:

1. Is there something wrong with this partitioning that might be influencing my model evaluation? (see code below)

2. Do you have any suggestion to improve this CV partitioning?

Finally, to confirm that my initial model evaluation (with data leakage in the partition) was misleading me, I tested the model on an independent new dataset (though much smaller) and the performance was good (similar to the one obtained through the CV partition with data leakage)!

What does this mean? Was my data leakage not relevant? Can I trust my model evaluation and report that performance?

Thank you in advance!


Code for the subject-based partition:

% Randomize subjects list order to introduce randomization to the partition process
data_name_list = data_name_list(randperm(length(data_name_list)));

% Get array containing the corresponding fold of each subject ('histcounts' splits subjects as
% uniformly as possible when using BinWidth like this)
[~,~,fold_of_subject] = histcounts(1:length(data_name_list), 'BinWidth', length(data_name_list)/num_folds);
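One variation I considered for the fold assignment (a sketch only, with assumed variable names: subject_of_obs holds each observation's subject ID and the positive class is labelled 1) is to shuffle subjects with and without positive observations separately and deal them round-robin into the folds, so every fold gets a similar share of positive subjects:

% Sketch: subject-level folds that spread positive subjects evenly (assumed names)
subject_has_pos = false(length(data_name_list),1);
for s = 1:length(data_name_list)
    obs_of_s = strcmp(subject_of_obs, data_name_list{s});   % observations of subject s
    subject_has_pos(s) = any(classes(obs_of_s) == 1);        % subject has >= 1 positive observation
end

% Shuffle positive and negative subjects separately, then deal them round-robin into folds
fold_of_subject = zeros(length(data_name_list),1);
pos_idx = find(subject_has_pos);  pos_idx = pos_idx(randperm(numel(pos_idx)));
neg_idx = find(~subject_has_pos); neg_idx = neg_idx(randperm(numel(neg_idx)));
fold_of_subject(pos_idx) = mod(0:numel(pos_idx)-1, num_folds) + 1;
fold_of_subject(neg_idx) = mod(0:numel(neg_idx)-1, num_folds) + 1;

% Map subject folds back to observations to build train/test masks for a given fold k
fold_of_obs = zeros(length(classes),1);
for s = 1:length(data_name_list)
    fold_of_obs(strcmp(subject_of_obs, data_name_list{s})) = fold_of_subject(s);
end
% test_mask = (fold_of_obs == k);   train_mask = ~test_mask;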



First off, the subject-based partition is a great start towards solving the leakage; you just need a couple more steps.

From what I understand, you have imbalanced data (1:33), and that is what is causing the great scores on the new dataset. I'm assuming this new, much smaller dataset has a similar ratio of about 1:33. That means that if your model constantly predicts the output as negative, it will be right most of the time (33 out of every 34 predictions).

This is a common scenario in domains like fraud detection, healthcare data, spam filtering, etc. My advice would be to handle the imbalanced data first, either by under-sampling (removing some negative subjects) or over-sampling (creating more positive subjects).

I would suggest under-sampling if you have plenty of data; if not, you can try an over-sampling technique like SMOTE to generate synthetic data for the minority class (positive subjects).
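As a rough illustration (a sketch only, with made-up variable names, and with the resampling applied inside the training fold only so you don't reintroduce leakage), subject-level under-sampling could look like this in MATLAB:

% Sketch: subject-level under-sampling applied to the training fold only (assumed names)
% train_subjects        - subject IDs in the current training fold
% train_subj_has_pos    - logical flag: does that subject have any positive observations?
% subject_of_obs_train  - subject ID of each training observation
% target_ratio          - desired number of negative subjects per positive subject, e.g. 5
pos_subj = train_subjects(train_subj_has_pos);
neg_subj = train_subjects(~train_subj_has_pos);
n_keep   = min(numel(neg_subj), target_ratio * numel(pos_subj));
neg_keep = neg_subj(randperm(numel(neg_subj), n_keep));   % random subset of negative subjects
kept_subjects = [pos_subj(:); neg_keep(:)];

% Keep only the observations belonging to the kept subjects
keep_mask   = ismember(subject_of_obs_train, kept_subjects);
X_train_bal = X_train(keep_mask, :);
y_train_bal = y_train(keep_mask);

(Since you are already using fitcensemble, its 'RUSBoost' method, which combines boosting with random under-sampling of the majority class, may also be worth a look.)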
