How to efficiently store very large sparse 3D matrices

To train a CNN, I have stacked arrays of images over observations ([observations x width x length]). The dataset is very sparse ($95\%$ zeros). What would be an efficient way of storing these matrices in terms of

  • format (e.g. pickle, parquet)
  • structure (e.g. scipy.sparse.csr_matrix, List of Lists)

Topic: cnn, data-formats, bigdata

Category: Data Science


Sparse matrix compression is a massively efficient way of storing sparse data. The scipy package provides a variety of formats for this in scipy.sparse. However, none of them supports arrays with more than two dimensions.
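A minimal sketch of that limitation (the exact exception type and message may vary by scipy version):

import numpy as np
from scipy import sparse as sp

# 2-D input: scipy.sparse handles this fine
csr = sp.csr_matrix(np.zeros((4, 5)))

# 3-D input: scipy.sparse rejects it
try:
    sp.csr_matrix(np.zeros((4, 5, 6)))
except (TypeError, ValueError) as e:
    print("scipy.sparse rejects 3-D input:", e)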

I have found the sparse package handy: it supports coordinate list (COO) compression for higher-dimensional arrays, as in my use case.

Sparse matrix compression with COO:

# Imports: numpy for the dense array, sparse (pydata/sparse) for COO
import numpy as np
import sparse

# Load the stacked image array [observations x width x length]
A = np.load('array.npy', allow_pickle=True)

# Fraction of zero entries, reported as a percentage
sparsity = 1 - (np.count_nonzero(A) / A.size)
print("Sparsity of A: %s%%" % np.round(sparsity * 100, 1))
Sparsity of A: 99.6%

# Convert the dense array to a coordinate-list (COO) sparse array
S = sparse.COO(A)

# Compare in-memory sizes
print('Size of A in bytes: %s' % A.nbytes)
Size of A in bytes: 16563527400
print('Size of S in bytes: %s' % S.nbytes)
Size of S in bytes: 249330624
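For training, individual observations can be densified on demand, since COO arrays support numpy-style indexing. A sketch (the batch size of 32 is an illustrative assumption):

# Hypothetical mini-batch densification: slicing a COO array returns
# another COO array; only the slice is converted back to dense.
batch = S[0:32].todense()   # dense numpy array, shape (32, width, length)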

On disk:

array.npy --> 15.43 GB
array_after.npy  --> 16.40 MB
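To persist the compressed array, the sparse package also provides save_npz/load_npz. A minimal sketch (the filename 'array_sparse.npz' is illustrative):

import sparse

# Save the COO array in sparse's native .npz container
sparse.save_npz('array_sparse.npz', S)

# Reload it later without ever materializing the dense array
S = sparse.load_npz('array_sparse.npz')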
