train-test split on forecasting a time series using external features

Question

tsjm

February 15, 2022, 07:48

I have a question about the train-test split when forecasting a time series using features instead of the time series itself. I know that I should use a time-based train-test split if I use lagged values of the time series as predictors, but I am wondering whether that is also the case if I use an external feature. Suppose I try to forecast watermelon consumption using only the temperature (feature X) instead of using the watermelon time series itself. Leaving aside that it might be better to use the time series, would it be valid to do a normal (random) train-test split for the feature-based forecast, so that I could train on days from November (temperature, watermelon consumption) and test on unseen data that was technically gathered earlier (let's say September)?

I am just wondering about the validity of the random train-test split. I know that the month itself might matter and not just the temperature, but this is only a simple example to clarify my concern.

Thanks in advance.

Topic features time-series

Category Data Science

etiennedm · Accepted Answer · February 15, 2022, 07:48

The simple answer is no, you should not.

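When performing a forecasting task, you don't want your trained model to have any information about the future it has to forecast. Otherwise, it might exploit that information, and the test score would overstate how well the model performs on genuinely unseen future data.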
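As a minimal sketch of a time-based split for the watermelon example, assuming each row is a hypothetical `(day, temperature, consumption)` tuple and using made-up toy values (the function name `chronological_split` is my own, not from any library):

```python
from datetime import date, timedelta


def chronological_split(rows, test_fraction=0.2):
    """Split rows so that every test day is later than every training day.

    Each row is assumed to be a (day, temperature, consumption) tuple.
    """
    ordered = sorted(rows, key=lambda row: row[0])  # oldest day first
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical toy data: 10 consecutive days with invented values.
start = date(2021, 9, 1)
rows = [(start + timedelta(days=i), 20.0 + i, 100 + 5 * i) for i in range(10)]

train, test = chronological_split(rows)

# The model would be fit on `train` only; `test` holds strictly later days.
print("last training day:", train[-1][0])
print("first test day:   ", test[0][0])
```

The split is on the date, not on the feature: even when temperature is the only predictor, the held-out rows are the most recent ones.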

Now, if you are 100% sure that the later data have no relation to the earlier data, you could do a random train/test split (this applies equally whether you use the time series directly or external features). For example, you could build a weather forecast model on data from 1990 and test it on data from 1960 (I guess; I am no weather expert).
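To make the difference concrete, here is a rough sketch that counts how many test days fall on or before the last training day under each kind of split. The 90-day range, the 80/20 ratio, the seed, and the helper name are all assumptions chosen for illustration:

```python
import random
from datetime import date, timedelta


def count_test_days_before_training_end(train_days, test_days):
    """Count test days that fall on or before the last training day."""
    last_train = max(train_days)
    return sum(1 for day in test_days if day <= last_train)


# Hypothetical 90 consecutive days of observations.
days = [date(2021, 9, 1) + timedelta(days=i) for i in range(90)]

# Random split: shuffle a copy, then hold out 20% of the days.
rng = random.Random(0)
shuffled = days[:]
rng.shuffle(shuffled)
random_train, random_test = shuffled[:72], shuffled[72:]

# Time-based split: hold out the final 18 days.
time_train, time_test = days[:72], days[72:]

leaky_random = count_test_days_before_training_end(random_train, random_test)
leaky_time = count_test_days_before_training_end(time_train, time_test)

print("random split, test days not after training end:", leaky_random)
print("time-based split, test days not after training end:", leaky_time)
```

With the time-based split the count is zero by construction. With the random split, some test days precede training days, which is only harmless under the independence assumption described above.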

In any case, if you can do without future data, I would say that is better, since you can then be sure there is no data leak.
