Problems to understand how to create the input data for time series forecasting with a recurrent neural network in Keras
I just started to use recurrent neural networks (RNN) with Keras for time-series forecasting and I found this tutorial Forecasting with RNN. I have difficulties understanding how to build the training data both regarding the syntax and the format of the input data.
Here is the code:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from matplotlib import pyplot as plt
# Read the data for the parameters from a csv file
df = pd.read_csv(C:/Users/Python/Data/tutorial_electricityPrice.csv, sep =,)
#Delete the first column as it is not used in the tutorial for forecasting
del df['datetime']
data = df.values
n_steps = 168
series_reshaped = np.array([data[i:i + (n_steps+24)].copy() for i in range(len(data) - (n_steps+24))])
X_train = series_reshaped[:43800, :n_steps]
X_valid = series_reshaped[43800:52560, :n_steps]
X_test = series_reshaped[52560:, :n_steps]
Y = np.empty((61134, n_steps, 24))
for step_ahead in range(1, 24 + 1):
Y[..., step_ahead - 1] = series_reshaped[..., step_ahead:step_ahead + n_steps, 0]
Y_train = Y[:43800]
Y_valid = Y[43800:52560]
Y_test = Y[52560:]
np.random.seed(42)
tf.random.set_seed(42)
model6 = keras.models.Sequential([
keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 6]),
keras.layers.SimpleRNN(20, return_sequences=True),
keras.layers.TimeDistributed(keras.layers.Dense(24))
])
model6.compile(loss=mean_squared_error, optimizer=adam, metrics=['mean_absolute_percentage_error'])
history = model6.fit(X_train, Y_train, epochs=10,batch_size=64,
validation_data=(X_valid, Y_valid))
So in this case 168 hours of the past are used (n_steps) to make a prediction for the next 24 hours of electricity prices. 6 features are used.
I have problems both understanding the format and the syntax for creating the inputdata of the RNN.
Format question
I uploaded a screenshot of the dimensions of the data-arrays from Spyder. So basically we have the full data array 'series_reshaped' with the size (61134, 192, 6). Then we have the input data X_train with the size (43800, 168, 6). The first dimension are the timeslot, the second dimension are the past timeslots that are used for prediction and the third timension are the 6 features for every of the 168 past timeslots. Then we have the labels Y_train with the size (43800, 168, 24). Here I do not understand why we have 168 in the second dimension. As far as I understood for each of the 168 past values * 6 features of the input data, we have 24 target values. So why is the second dimension then not 168*6 = 1008? Because we have a mapping of 1008 inputs to 24 outputs?
Syntax question
I do not really understand how these lines work in Python:
for step_ahead in range(1, 24 + 1):
Y[..., step_ahead - 1] = series_reshaped[..., step_ahead:step_ahead + n_steps, 0]
Why does this create an Y array of the dimension (61134, 168, 24) or transfer the correct data into it? The index step_ahead only takes values from 1 to 24 and now we assign to 24 entries of the second dimension of the array Y 168 values from the past values of the series_reshaped. So why do we only assign the values to the 24 entries of the second dimension of Y and not to the full 168 entries? And why are we looking into the past data of the series_reshaped array (second dimension). For me these lines are extremely confusing altough they apparently do the right thing. Can anyone tell me a little bit more about the syntax of those lines?
Generally, I'd appreciate every comment and would be quite thankful for your help.
Update
Related questions: Hi all, as I still have problems with those question I would like to ask some related questions
- About the creation of the input data: how can I know which structure the input data should have? And how can I then derive something like this code
for step_ahead in range(1, 24 + 1): Y[..., step_ahead - 1] = series_reshaped[..., step_ahead:step_ahead + n_steps, 0]
2)At the end of the training in the tutorial they use the following code for the prediciton
Y_pred = model6.predict(X_test)
last_list=[]
for i in range (0, len(Y_pred)):
last_list.append((Y_pred[i][0][23]))
So they take Y_pred[i][0][23] to construct the 1-dimensional list with the predicted values. Why do they take [0][23] and not for example [1][14]? They want to predict 24 hours in advance. Can I just always take Y_pred[i][0][23] ?
- I still do not understand one of my inital questions: Why is the label dataset Y for training [Batch, 168, 24] if return sequence =true? We use the past 168 values to forecast 24 hours. We use 168*6 features for forecasting. For each element in the batch (each timeslot) we then have an output of 24 hours. So we should have the training data with dimension [Batch, 24] and not [Batch, 168, 24]. For every timeslot in the batch we need 168 past values. How is it then possible to map 24 hours of predictions to every 168 of the past values?
Reminder: My bountry expires in three days and unfortunately I have not received another more comprehensive answer. I'd highly appreciate any new answer that might explain the input data for time series forecasting with a recurrent neural network.
Topic recurrent-neural-network forecasting keras time-series
Category Data Science