Storage of N-dimensional matrices (tensors) as part of machine learning pipelines

Question

Storage of N-dimensional matrices (tensors) as part of machine learning pipelines

user855

2022年2月22日 21:20

I'm an infra person working on a storage product. I've been googling quite a bit to find an answer to the following question but unable to do so. Hence, I am attemping to ask the question here.

I am aware that relational data or structured data can often be represented in 2-dimensional tables like DataFrames and that can be used for ML training input. If we want to store the DataFrames they can easily be stored as tables in a data-store.

I have also come to know that Tensors (N-dimenional matrices) are used these days for Deep learning tasks. My question is:

Is there a need to store these Tensors back to the disk as part of the overall ML pipeline process? Or do people usually read back the 2-D data frames and then start all the way from there.
What is the format that is used these days for persisting Tensors to the disk.

My understanding is that there is no Tensor storage format. Instead people just load the source data from disk (be a 2-D data frame or Images/videos etc.) and then recompute the Tensors if need be instead of persisting it to disk. Is that correct?

Topic dataframe tensorflow apache-spark apache-hadoop

Category Data Science

bogovicj · Accepted Answer · 2022年2月22日 21:20

Is there a need to store these Tensors back to the disk as part of the overall ML pipeline process? Or do people usually read back the 2-D data frames and then start all the way from there.

Yes, but it's application dependent. You're right, if your 2d data is an image, its usually stored as an images. If your 3D data is video, it's usually stored as video. But there are lots of other kinds of high-dimensional data.

My understanding is that there is no Tensor storage format.

I guess it depends on what you mean by tensor. If you mean n-dimensional array data, there exist many formats / standards. Here are a few:

I know many people / software applications that use hdf5. People in the python world recently like zarr. There is lots of interoperability though, e.g. the zarr / n5 / tensorstore implementations can read each others' data.

Storage of N-dimensional matrices (tensors) as part of machine learning pipelines

About