How does a data leakage work?
I'm working with panel data - every row represents a timestamp (observation) and there are multiple rows for a single timestamp (around 20 rows each). I have a total of 8719 unique timestamps.
Obs_temp is the target column. 1 column represents the hour. Every timestamp has 20 different observations (with different feature values but same target value).
When I randomly split the data into train test and predict, Random forest and KNN scored 0.55 and 0.0002 MAE respectively. (Baseline MAE=1.97) Which I was expecting since 20 rows for the same timestamp could end up on both train and the test set.
When I dropped all the columns related to time, they still manage to score almost perfectly. So my question is, how does Random Forest knows that it already has some of the test observations in train set?
Edit: Sample dataset was updated with rows with the same timestamp (filtered on 2 different timestamps).
Topic data-leakage regression random-forest time-series
Category Data Science