Timeseries LSTM: does test data need to come after training data?

I have one single, very long time series. I want to train an LSTM to distinguish between two behaviours (A or B) at every timestep (sequence-to-sequence).

Because the time series is very long, I plan to extract shorter, partially-overlapping subsequences and use each of them as one training input for the LSTM.
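Roughly, I build them like this (a minimal sketch; the window length and stride are just example values):

    import numpy as np

    def make_windows(series, labels, window=100, stride=50):
        """Cut one long series into partially overlapping subsequences.

        series: (T, n_features) array; labels: (T,) array of 0 (A) / 1 (B).
        Returns arrays of shape (n_windows, window, n_features) and
        (n_windows, window), i.e. one label per timestep (sequence-to-sequence).
        """
        xs, ys = [], []
        for start in range(0, len(series) - window + 1, stride):
            xs.append(series[start:start + window])
            ys.append(labels[start:start + window])
        return np.stack(xs), np.stack(ys)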

In my train/validation/test split, do I have to use older subsequences for training and newer ones for validation and test? Or can I treat them as independent samples and simply shuffle them randomly, given that the LSTM starts each subsequence with empty memory anyway?

The reason I ask is that, due to how the time series was collected, the first half contains mostly behaviour A while the second half contains mostly behaviour B. A chronological split would therefore train mostly on A and test mostly on B, which does not reflect the fact that, in production, the system will see both periods of predominant A and periods of predominant B.

Tags: sequence-to-sequence, lstm, rnn, time-series

Category: Data Science


[EDIT] In case you have very long sequences, you can also try attention-based models, which mitigate the vanishing-gradient problem of recurrence.
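As a minimal sketch of what such a model could look like (the feature count and all layer sizes below are arbitrary example values, and a real model would also need positional information):

    import tensorflow as tf

    n_features = 8  # arbitrary example value
    inputs = tf.keras.Input(shape=(None, n_features))
    # Self-attention: every timestep attends to every other one directly,
    # so gradients do not have to flow step by step through a recurrence.
    x = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)(inputs, inputs)
    # One probability of behaviour B per timestep (sequence-to-sequence).
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")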


If I understood correctly, you have one of the following cases:

  • labeled samples with tag A or B per date-time index, having as informative attributes the present value plus some lag values of interest --> this would be a standard classification approach, with no time ordering being necessary (see the lag-feature sketch after this list)

  • sliding window of samples (what you mean by subsequences) --> here you should respect the time ordering, at least for the validation set, so that you evaluate your LSTM in a realistic scenario, on future, never-seen sequences.
    With this second approach, you can indeed shuffle the training batches (for instance via the Keras shuffle(BUFFER_SIZE).batch(BATCH_SIZE) functionality; info here), while leaving the validation set unshuffled, as follows:

      import tensorflow as tf

      BATCH_SIZE = 256
      BUFFER_SIZE = 10000

      # Training windows: cache, shuffle whole windows (never the timesteps
      # inside them), then batch; repeat() allows iterating over many epochs.
      train_data_multi = tf.data.Dataset.from_tensor_slices((x_train_multi, y_train_multi))
      train_data_multi = train_data_multi.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE).repeat()

      # Validation windows: batched but deliberately NOT shuffled, so the
      # held-out period keeps its chronological order.
      val_data_multi = tf.data.Dataset.from_tensor_slices((x_val_multi, y_val_multi))
      val_data_multi = val_data_multi.batch(BATCH_SIZE).repeat()
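For the first case above, a minimal sketch of building such lag features with pandas (the toy data, column names and number of lags are just illustrative assumptions):

    import numpy as np
    import pandas as pd

    # Toy frame standing in for the real series: one feature "x" plus the
    # behaviour label (0 = A, 1 = B) per date-time index.
    df = pd.DataFrame({"x": np.random.randn(1000),
                       "behaviour": np.random.randint(0, 2, 1000)},
                      index=pd.date_range("2020-01-01", periods=1000, freq="h"))

    # Each row becomes an independent sample: present value + a few lags.
    n_lags = 3
    for lag in range(1, n_lags + 1):
        df[f"x_lag{lag}"] = df["x"].shift(lag)
    df = df.dropna()  # the first n_lags rows have no complete lag history

    # Every row now carries its own history, so rows can be shuffled freely
    # for any standard classifier.
    X, y = df.drop(columns=["behaviour"]), df["behaviour"]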
    

You can find a complete worked-out example here.

You can also make use of the Keras time-series data preprocessing helper, which lets you decide whether or not to shuffle the extracted windows.
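If the helper meant here is tf.keras.utils.timeseries_dataset_from_array, a minimal sketch for this sequence-to-sequence setup might look as follows (the array shapes, window length and batch size are made-up example values):

    import numpy as np
    import tensorflow as tf

    # Toy stand-ins for the real series: per-timestep features and 0/1 labels.
    series = np.random.randn(1000, 8).astype("float32")
    labels = np.random.randint(0, 2, size=(1000, 1)).astype("float32")

    def windows(array):
        # batch_size=None yields one window at a time, in chronological order.
        return tf.keras.utils.timeseries_dataset_from_array(
            data=array, targets=None, sequence_length=100,
            sequence_stride=50, batch_size=None)

    # Zip feature and label windows so they stay aligned, then shuffle the
    # (window, label-window) pairs for training; skip .shuffle(...) for the
    # validation split to keep it in chronological order.
    train_ds = (tf.data.Dataset.zip((windows(series), windows(labels)))
                .shuffle(1000).batch(256))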
