Look-ahead bias when predicting a time series from features

I am building some ML models (RF, RNN, MLP) to predict a time series value 'y' from features 'X', not from the time series 'y' itself. My question is about the bias I might be introducing by doing a simple random train-test split for fitting and evaluation, which mixes data from different days (past and future) instead of splitting by time. Is that valid for this prediction task? Or, even though I am not using the time series itself to predict future values, am I still introducing bias because the features are time series too, since I get them daily? I tried both ways and got far better results with the simple random train-test split, so I got suspicious.

Topic bias rnn time-series

Category Data Science


The answer depends on whether there is autocorrelation in the y target variable which is not accounted for in your X regressors. In other words: if you know all the X values and you are trying to predict the current time step y value, would it help you at all to know the previous time steps' y values? If so, there is autocorrelation in y that is not reducible to your X features, and your simple random train-test-split is not advisable.
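One rough way to probe this is to fit your model on X alone, keep the residuals (actual y minus predicted y) in chronological order, and check whether neighbouring residuals are correlated. The sketch below computes a plain lag-k autocorrelation in pure Python. The function name `lag_autocorrelation` and the example residuals are made up for illustration, and this is a heuristic rather than a formal test: a value far from zero only suggests that y carries time structure your features do not explain.

```python
def lag_autocorrelation(values, lag=1):
    """Sample autocorrelation of a chronologically ordered sequence at the given lag."""
    n = len(values)
    if lag < 1 or lag >= n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    # Total variation around the mean (the lag-0 term).
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0  # a constant series has no variation to correlate
    # Co-variation between each point and the point `lag` steps earlier.
    num = sum((values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Hypothetical residuals from a model fitted on X only, oldest first.
# They drift slowly up and then down, so neighbouring errors look alike.
drifting_residuals = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
residual_autocorr = lag_autocorrelation(drifting_residuals, lag=1)
```

If the lag-1 value is clearly positive, as it is for the drifting residuals here, then knowing yesterday's error tells you something about today's, which is exactly the situation where a random split is likely to flatter the model.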

A simple example: suppose X is just temperature, and y is depth of snow on the ground each morning in, say, some particular spot in northern Canada. Even though X is a pretty good predictor of y, the relevant autocorrelation still holds, in the sense that even if I know X, my prediction of today's y value will be much, much better if I also know yesterday's y value.

Of course, in real-world situations we rarely know for sure whether the kind of autocorrelation described above is present. If you suspect it might be, play it safe and use a train/test or cross-validation method that respects the time ordering by training only on data that precedes the test split(s).
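As a minimal sketch of such a split, assuming your rows are already sorted from oldest to newest, the generator below yields expanding-window folds in which every training index comes before every test index. The name `expanding_window_splits` and its parameters are my own for illustration. If you use scikit-learn, `TimeSeriesSplit` in `sklearn.model_selection` provides the same kind of split.

```python
def expanding_window_splits(n_samples, n_splits=3, test_size=None):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Rows are assumed to be sorted oldest to newest. Each fold trains on
    everything before its test block, so no future rows leak into training.
    """
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    if test_size is None:
        test_size = n_samples // (n_splits + 1)
    if test_size < 1 or n_splits * test_size >= n_samples:
        raise ValueError("not enough samples for the requested splits")
    # The test blocks tile the end of the series; the first one starts here.
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        start = first_test_start + k * test_size
        train_indices = list(range(start))
        test_indices = list(range(start, start + test_size))
        yield train_indices, test_indices


# Ten daily rows, three folds of two test days each.
splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
```

Fit on the rows in `train_indices`, score on the rows in `test_indices`, and average the scores across folds. If that number is much worse than the one from a random split, the gap is a reasonable sign that the random split was leaking information across time.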
