How to apply stacking cross-validation to time-series data?

Normally, a stacking algorithm uses K-fold cross-validation to generate out-of-fold (OOF) predictions, which are then used as training inputs for the level-2 model.
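Concretely, the standard K-fold OOF step looks something like this (a minimal sketch with dummy data and a single base model; all names are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

X = np.random.uniform(0, 1, (100, 5))
y = np.random.uniform(0, 1, (100,))

# level 1: out-of-fold predictions for every row via K-fold
oof = cross_val_predict(RandomForestRegressor(), X, y, cv=KFold(n_splits=5))

# level 2: meta-model trained on the OOF predictions
meta = LinearRegression().fit(oof.reshape(-1, 1), y)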

In the case of time-series data (say, stock movement prediction), K-fold cross-validation can't be used, and time-series validation (the one suggested in the sklearn library) is the suitable way to evaluate model performance. But with that scheme, no prediction is made on the first fold and no training is done on the last fold, so the out-of-fold predictions don't cover the whole training set. How do we use the stacking cross-validation technique for time-series data?

Topic ensemble-learning cross-validation time-series

Category Data Science


The standard TimeSeriesSplit from sklearn does not work with StackingRegressor, because StackingRegressor uses cross_val_predict under the hood. This results in errors like:

cross_val_predict only works for partitions
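A minimal reproduction of the failure, assuming dummy data (the exact error wording may vary across sklearn versions):

import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

X = np.random.uniform(0, 1, (100, 5))
y = np.random.uniform(0, 1, (100,))

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor())],
    final_estimator=LinearRegression(),
    cv=TimeSeriesSplit(n_splits=3),  # first fold is never predicted, so this is not a partition
)
stack.fit(X, y)  # ValueError: cross_val_predict only works for partitions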

To build a stacking model with time-series data and sklearn models, you simply have to write these few lines of code...

import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# generate dummy data
X = np.random.uniform(0,1, (1000,10))
y = np.random.uniform(0,1, (1000,))

# initialize two models to be stacked
rf = RandomForestRegressor()
gb = GradientBoostingRegressor()

# generate cross-val-prediction with rf and gb using TimeSeriesSplit
cross_val_predict = np.vstack([
    np.column_stack([
        rf.fit(X[id_train], y[id_train]).predict(X[id_test]),
        gb.fit(X[id_train], y[id_train]).predict(X[id_test]),
        y[id_test]  # we add in the last position the corresponding fold labels
    ])
    for id_train,id_test in TimeSeriesSplit(n_splits=3).split(X)
])  # (test_size*n_splits, n_models_to_stack+1)

# final fit rf and gb with all the available data
rf.fit(X,y)
gb.fit(X,y)

# fit a linear stacking on cross_val_predict
stacking = LinearRegression()
stacking.fit(cross_val_predict[:,:-1], cross_val_predict[:,-1])

# how to generate predictions on new unseen data
X_new = np.random.uniform(0,1, (30,10))
pred = stacking.predict(
    np.column_stack([
        rf.predict(X_new),
        gb.predict(X_new)
    ])
)
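Note the asymmetry here: the level-1 models rf and gb are refit on all available data before being used for inference, while the level-2 model is fit only on their out-of-fold predictions. This prevents the stacker from being trained on the base models' in-sample (and therefore overly optimistic) outputs.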

TL;DR

Time-series algorithms assume that data points are ordered. Traditional K-Fold cannot be used for time series because it does not take into account the order in which data points appear. One approach to validating time-series algorithms is Time Based Splitting.

K-Fold vs Time Based Splitting

The two graphs below show the difference between K-Fold and Time Based Splitting. From them, the following characteristics can be observed (the code sketch after the figure makes the contrast concrete):

- K-Fold always uses all data points, while Time Based Splitting only uses a fraction of them.
- K-Fold lets any data point appear in the test set, while Time Based Splitting only allows the test set to have higher indexes than the training set.
- K-Fold will, at some point, use the first data point for testing and the last data point for training; Time Based Splitting will never use the first data point for testing and never use the last data point for training.

[Figure: TimeSeriesSplit plot]
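A quick sketch that prints the index splits of each scheme on toy data makes these properties visible:

import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(6, 2)

# K-Fold: every index lands in a test set exactly once, regardless of order
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print("KFold TRAIN:", train_idx, "TEST:", test_idx)

# Time Based Splitting: test indexes always come after the training indexes
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("TimeSeriesSplit TRAIN:", train_idx, "TEST:", test_idx)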

Scikit-learn implementation

Scikit-learn has an implementation of this algorithm called TimeSeriesSplit.

Looking at their documentation, you will find the following example:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=5)

for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

>> TRAIN: [0] TEST: [1]
>> TRAIN: [0 1] TEST: [2]
>> TRAIN: [0 1 2] TEST: [3]
>> TRAIN: [0 1 2 3] TEST: [4]
>> TRAIN: [0 1 2 3 4] TEST: [5]
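Note that, unlike cross_val_predict, cross_val_score accepts TimeSeriesSplit directly, so evaluating a single model's performance is straightforward (dummy data; default scoring for regressors is R²):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

X = np.random.uniform(0, 1, (1000, 10))
y = np.random.uniform(0, 1, (1000,))

# one score per fold; the first fold is only ever used for training
scores = cross_val_score(RandomForestRegressor(), X, y, cv=TimeSeriesSplit(n_splits=5))
print(scores.mean(), scores.std())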
