How does a data leakage work?

I'm working with panel data - every row represents a timestamp (observation) and there are multiple rows for a single timestamp (around 20 rows each). I have a total of 8719 unique timestamps.

Obs_temp is the target column. 1 column represents the hour. Every timestamp has 20 different observations (with different feature values but same target value).

When I randomly split the data into train test and predict, Random forest and KNN scored 0.55 and 0.0002 MAE respectively. (Baseline MAE=1.97) Which I was expecting since 20 rows for the same timestamp could end up on both train and the test set.

When I dropped all the columns related to time, they still manage to score almost perfectly. So my question is, how does Random Forest knows that it already has some of the test observations in train set?

Edit: Sample dataset was updated with rows with the same timestamp (filtered on 2 different timestamps).

Topic data-leakage regression random-forest time-series

Category Data Science


My reccomendation will be to see what are the important features of the model. Probably use SHAP if you are on random forest https://github.com/slundberg/shap

You might see that one features is extremely highly relevant. This can be a data leakage.

One of the goals of explainable AI is model debugging.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.