Time series test data dilemma

I’m trying to build a model to predict the sales of a product for the next few days.

My question is whether I should use the tail of the series as the test set and train models on the rest of the data, or create the test set by picking dates at random, as is usual in machine learning.

The material I have read on classical time series models (ARIMA) recommends the first approach (using the last days as the test set), but it feels odd to do this when applying a machine learning model.

What is the correct approach? Are there advantages or disadvantages to either one?



You could also split your data like this:

~~~~ train ~~~~ test ~~~~ train ~~~~ test ~~~~ ...

Each consecutive train/test pair is then used for one run: you train the model on the train block and evaluate it on the test block that follows it. You can then tune your model's hyperparameters based on the average loss it achieves over all the individual test sets, given the current value of the hyperparameter. Make sure you do not feed any of the test data into the training procedure, not even as part of a lookback window. This is crucial. Compared to the other suggested approach, mine has the advantage that the training set size is always the same. The other approach has the advantage that you train the model on more data.
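To make the alternating split concrete, here is a minimal sketch in plain Python. The function names (`blocked_splits`, `average_test_loss`, `last_value_forecast`, `mae`), the block sizes, and the toy `sales` list are all made up for illustration, and the "model" is just a naive last-value forecast standing in for whatever you actually train.

```python
from statistics import mean


def blocked_splits(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices) for consecutive, non-overlapping blocks."""
    block = train_size + test_size
    start = 0
    while start + block <= n_samples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + block))
        yield train, test
        start += block  # next pair starts after the current test block


def average_test_loss(series, train_size, test_size, fit_predict, loss):
    """Train and evaluate once per block pair, then average the test losses."""
    losses = []
    for train_idx, test_idx in blocked_splits(len(series), train_size, test_size):
        train = [series[i] for i in train_idx]
        test = [series[i] for i in test_idx]
        # Only the train block is passed to the model; test values are never seen.
        predictions = fit_predict(train, len(test))
        losses.append(loss(test, predictions))
    return mean(losses)


def last_value_forecast(train, horizon):
    """Hypothetical stand-in model: repeat the last training value."""
    return [train[-1]] * horizon


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Toy daily sales figures, invented for the example.
sales = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 21, 20]
splits = list(blocked_splits(len(sales), train_size=4, test_size=2))
avg_loss = average_test_loss(sales, 4, 2, last_value_forecast, mae)
```

For hyperparameter tuning you would call `average_test_loss` once per candidate value and keep the one with the lowest average. Any trailing samples that do not fill a complete train/test pair are simply left unused in this sketch.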


You can still do cross-validation with time series, but don't just pick data points at random. A rolling window in which the training set grows at each step is a good way to do it, like:

1st: ~~~~ train ~~~~ train ~~~~ test
2nd: ~~~~ train ~~~~ train ~~~~ train ~~~~ test
3rd: ~~~~ train ~~~~ train ~~~~ train ~~~~ train ~~~~ test

Here each test block should be about the same length as your forecast horizon.
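The scheme above can be sketched as a small index generator. This is an illustrative version in plain Python; the name `expanding_window_splits` and the sample sizes are assumptions made for the example. scikit-learn's `TimeSeriesSplit` offers a similar kind of split if you prefer a library implementation.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices); training starts at 0 and grows each fold."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train = list(range(test_start))  # everything before the test block
        test = list(range(test_start, test_start + test_size))
        yield train, test


# 10 observations, 3 folds, test blocks of 2 (e.g. a 2-day forecast horizon).
folds = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every test block lies strictly after its training data, so no future observation leaks into training, and the last fold ends on the most recent observations.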
