Does this cause data leakage in time series? # Need help understanding time series data
Does this cause data leakage in time series? I have already read the related question, "data leakage when scaling time series".
Data leakage is when information from outside the training dataset is used to create the model.
Assume the input window is the past 3 days and the prediction horizon is 2 days.
Does this lead to data leakage in time series? I am not sure about this.
In both figures, the test Y comes after the train / valid Y, but the test X overlaps the train / valid time series, because the sliding window can use the whole dataset.
According to the definition of data leakage, the model still never sees the future Y of the test set, so I think there is no leakage.
1st sample: 1 to 3 days input = 4 to 5 days output
2nd sample: 2 to 4 days input = 5 to 6 days output
3rd sample: 3 to 5 days input = 6 to 7 days output
4th sample: 4 to 6 days input = 7 to 8 days output
5th sample: 5 to 7 days input = 8 to 9 days output
6th sample (test): 7 to 9 days input = 10 to 11 days output
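To make the setup concrete, here is a minimal sketch of how these samples could be generated with a sliding window. It assumes 11 days numbered 1 to 11; the function name `make_windows` and the parameters `n_in` and `n_out` are hypothetical names chosen for illustration, not from any library.

```python
def make_windows(days, n_in=3, n_out=2):
    """Slide a window over `days` and return a list of (input, output) pairs."""
    samples = []
    for start in range(len(days) - n_in - n_out + 1):
        x = days[start:start + n_in]                   # past n_in days
        y = days[start + n_in:start + n_in + n_out]    # next n_out days
        samples.append((x, y))
    return samples

days = list(range(1, 12))        # assumption: days 1..11
samples = make_windows(days)

train_valid = samples[:5]        # samples 1 to 5 in the list above
test = samples[6]                # the window 6-8 -> 9-10 is skipped on purpose

for i, (x, y) in enumerate(train_valid, start=1):
    print(f"sample {i}: input days {x} -> output days {y}")
print(f"test sample: input days {test[0]} -> output days {test[1]}")
```

Skipping the window that starts on day 6 is what keeps the test output (days 10 to 11) from sharing any day with the train / valid outputs (days 4 to 9).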
Also, the validation X and Y overlap the training time series. Does that lead to leakage too? Should I shift the validation data by one more step (as I did for the test sample)?
Note that no shuffling is done here, since I know that shuffling would lead to data leakage in time series.
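The overlap I am asking about can be listed explicitly. The sketch below is only an illustration: it assumes samples 1 to 4 are training and sample 5 is validation (I have not fixed that split above), and the helper `days_covered` is a hypothetical name. It only reports which days are shared between splits; it does not decide by itself whether that sharing counts as leakage.

```python
def days_covered(samples, part):
    """Return the set of days used as inputs ("x") or outputs ("y")."""
    idx = 0 if part == "x" else 1
    return {d for sample in samples for d in sample[idx]}

# Assumed split: samples 1-4 train, sample 5 validation, shifted sample as test.
train = [([1, 2, 3], [4, 5]), ([2, 3, 4], [5, 6]),
         ([3, 4, 5], [6, 7]), ([4, 5, 6], [7, 8])]
valid = [([5, 6, 7], [8, 9])]
test = [([7, 8, 9], [10, 11])]

train_y = days_covered(train, "y")
valid_y = days_covered(valid, "y")

# Days that are targets in validation and also targets in training.
valid_y_overlap = sorted(valid_y & train_y)
# Days that are targets in test and also targets in train / valid.
test_y_overlap = sorted(days_covered(test, "y") & (train_y | valid_y))
# Days that are inputs in test but targets in train / valid.
test_x_overlap = sorted(days_covered(test, "x") & (train_y | valid_y))

print("valid Y shared with train Y:", valid_y_overlap)
print("test Y shared with train/valid Y:", test_y_overlap)
print("test X shared with train/valid Y:", test_x_overlap)
```

Under this assumed split, the validation target shares day 8 with the last training target, while the shifted test target shares no day with any earlier target.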
Topic data-leakage rnn preprocessing time-series data-mining
Category Data Science