There are many counter-examples to using temporal data from the year before to infer the missing temporal values of the year after. I suggest you take a look at the Darts package, which is tailored to time series.
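As a rough illustration of what that looks like in Darts (not your actual pipeline: the toy monthly data, the column names `date`/`value`, the gap length `m = 6`, and the `ExponentialSmoothing` model are all placeholder choices), you can fit a forecasting model on the known history and ask it to forecast the gap:

```python
# Placeholder data and model choices, for illustration only.
import numpy as np
import pandas as pd
from darts import TimeSeries
from darts.models import ExponentialSmoothing

# Toy monthly history standing in for the known values before the gap.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=48, freq="MS")
values = 10 + np.sin(np.arange(48) * 2 * np.pi / 12) + rng.normal(0, 0.1, 48)
df = pd.DataFrame({"date": dates, "value": values})

# Wrap the known history in a Darts TimeSeries and fit a forecasting model.
series = TimeSeries.from_dataframe(df, time_col="date", value_cols="value")
model = ExponentialSmoothing()
model.fit(series)

# Forecast the m missing values that follow the known history.
m = 6
imputed = model.predict(m)
print(imputed.values().flatten())
```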
As a suggestion, say that you have to infer $m$ consecutive missing values; you can proceed as follows. Suppose that you have trained a forecasting model $f(\cdot)$ that forecasts the $(n+1)$-th value, say $\hat{v}$, from a generic sequence of $n$ values, say $\langle v_1,v_2,\ldots,v_n \rangle$, that is:
$$
\hat{v} = f(\langle v_1,v_2,\ldots,v_n \rangle).
$$
To predict the first missing value, say $\hat{v}_1$, out of $m$, call:
$$
\hat{v}_1 = f(\langle v_1,v_2,\ldots,v_n \rangle)
$$
where the sequence $\langle v_1,v_2,\ldots,v_n \rangle$ represents the last $n$ known values before the first missing value. Now, recursively, having the predicted sequence $\langle \hat{v}_1, \ldots, \hat{v}_{i-1} \rangle$, one can predict the $i$-th missing value out of $m$, for $1 < i \le m$, by calling:
$$
\hat{v}_i = f(\langle v_i,v_{i+1},\ldots,v_n,\hat{v}_1,\hat{v}_2,\ldots,\hat{v}_{i-1} \rangle).
$$
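To make the recursion concrete, here is a small Python sketch of the loop. The `one_step_forecast` function is only a stand-in for your trained model $f(\cdot)$ (a naive window mean, just so the example runs); the window size $n$, the gap length $m$, and the toy data are placeholder choices.

```python
import numpy as np

def one_step_forecast(window):
    """Stand-in for a trained model f(.): here just the mean of the last n values."""
    return float(np.mean(window))

def recursive_impute(known, n, m, forecast=one_step_forecast):
    """Fill m consecutive missing values that follow `known`, feeding each new
    prediction back into the n-value input window of the next step."""
    history = list(known[-n:])           # last n known values before the gap
    predictions = []
    for _ in range(m):
        v_hat = forecast(history[-n:])   # predict the next value from the last n
        predictions.append(v_hat)
        history.append(v_hat)            # the prediction joins the window
    return predictions

# Example: 24 observed values followed by a gap of m = 5 missing values.
observed = np.sin(np.arange(24) / 3.0)
print(recursive_impute(observed, n=12, m=5))
```

Note that many forecasting models (including several in Darts) perform this kind of rolling multi-step prediction internally, so you often get the same behaviour simply by calling `predict(m)` with a horizon larger than one.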
There are pros and cons to this approach. An advantage is that we do not need to assume any explicit model of the missing data and infer from it; we only reuse the forecasting model. A disadvantage is that the error accumulates as we incrementally infer the missing values, since each new prediction is conditioned on earlier predictions (i.e., on inferred missing values) rather than on observed data; that is, the larger $m$ is, the larger the error on the last imputed values tends to be.