How to pad real-valued sequences

I have several sequences of univariate real-valued time-series data. The sequences are of different lengths and right now I cannot batch them and feed them to a network. What is the correct procedure to pad these sequences? Is it even possible in this case since I can't use any number as a special symbol?

UPDATE 1

I'm working with arbitrary univariate time-series data (not tied to one specific domain, with unbounded range). As an example of one such series, consider a standardized stock dataset (only the first 10 elements are shown):

import numpy as np
d = np.array([-0.37807043, 0.14321786, -0.37807043, 0.13478392, 0.18733381,
              1.19576774, 0.25675156, 0.26064414, 0.30930144, 0.38650436])

Topic sequence-to-sequence tensorflow

Category Data Science


If I understand you correctly, you have several univariate time series that you want to stack into a single multivariate one, but you cannot because they have different lengths. I think you might find this guide on masking and padding a useful starting point. In your case I think masking is a must, since you will have batches in which one or more series need to be padded. An alternative is to train an imputer model first and use it to impute the missing stretches of a series from the series that remain.
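To make the masking idea concrete, here is a minimal NumPy sketch. It is only an illustration under my own assumptions, not the linked guide's code: the helper names pad_with_mask and masked_mse are hypothetical, and the pad value 0.0 is arbitrary. The point is that the mask, not the pad value, marks which time steps are real, so no number has to be reserved as a special symbol.

```python
import numpy as np

def pad_with_mask(sequences, pad_value=0.0):
    """Right-pad 1-D sequences to a common length; return (batch, mask)."""
    lengths = np.array([len(s) for s in sequences])
    max_len = lengths.max()
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.float64)
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s
    # mask[i, t] is True where batch[i, t] holds real data, False where padded
    mask = np.arange(max_len)[None, :] < lengths[:, None]
    return batch, mask

def masked_mse(pred, target, mask):
    """Mean squared error computed over real (unpadded) time steps only."""
    sq_err = (pred - target) ** 2
    return (sq_err * mask).sum() / mask.sum()

# Two toy series of different lengths (values are illustrative)
seqs = [np.array([-0.378, 0.143, -0.378]),
        np.array([0.135, 0.187, 1.196, 0.257, 0.261])]
batch, mask = pad_with_mask(seqs)
```

Because the loss is multiplied by the mask, the padded positions contribute nothing, whatever value they hold. Framework masking layers apply the same idea inside the network.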


How you pad it (and even whether you do so) depends on what you expect of the data. Padding imposes boundary conditions on the data, which will induce artifacts in any transform you apply. How bad this effect is depends on how well suited your data is to a particular padding method.

Padding methods include zero padding and periodic extension (a periodic boundary).
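As a small illustration of these two options, the sketch below uses NumPy's np.pad. The toy series and the target length are made up for the example, and the padding is kept shorter than the series itself.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy series
target_len = 8                           # assumed common batch length
n_pad = target_len - len(x)

# Zero padding: append zeros after the last real sample
zero_padded = np.pad(x, (0, n_pad), mode="constant", constant_values=0.0)

# Periodic padding: continue the series by wrapping around to its start
periodic_padded = np.pad(x, (0, n_pad), mode="wrap")
```

Zero padding introduces a jump at the end of the series unless it already ends near zero, while periodic padding introduces a jump wherever the last and first samples differ.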

Padding doesn't have to be done in the time domain. For example, interpolating in the frequency domain and transforming back lets you extrapolate the series.
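One possible reading of this is Fourier resampling. That is an assumption, since the answer gives no details. The idea is to zero-pad the spectrum of a short series and transform back, which stretches the series onto a finer grid of a chosen common length instead of appending artificial samples. The helper name fourier_resample is hypothetical. The sketch implicitly treats the series as periodic and ignores the Nyquist-bin subtlety for even lengths; scipy.signal.resample is a more careful implementation of the same idea.

```python
import numpy as np

def fourier_resample(x, target_len):
    """Resample x to target_len samples via its real FFT spectrum."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    # irfft zero-pads (or truncates) the spectrum to produce target_len samples;
    # the factor target_len / n compensates for irfft's 1/target_len scaling.
    return np.fft.irfft(spectrum, n=target_len) * (target_len / n)

# One full cosine cycle sampled at 8 points, resampled to 16 points
x = np.cos(2 * np.pi * np.arange(8) / 8)
y = fourier_resample(x, 16)
```

Note that this changes the effective sampling rate of the series, so it only makes sense if the model does not rely on a fixed time step.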

If your analysis has a finite history (e.g. FIR filters), you can isolate the time regions where padding is unnecessary and draw comparisons from those.
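For an FIR filter this can be sketched with np.convolve. Its "valid" mode returns only the outputs computed entirely from real samples, whereas "same" mode implicitly zero-pads at both ends. The 3-tap moving average and the toy series are assumptions chosen for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy series
h = np.ones(3) / 3                       # 3-tap moving-average FIR filter

# Only outputs where the filter window lies fully inside the data (no padding)
valid = np.convolve(x, h, mode="valid")

# Same length as x, but the edge outputs depend on implicit zero padding
same = np.convolve(x, h, mode="same")
```

The valid output has len(x) - len(h) + 1 samples and matches the interior of the same-mode output, so comparisons restricted to that region are free of padding artifacts.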
