How to efficiently store very large sparse 3D matrices

To train a CNN, I have stacked arrays of images over observations ([observations x width x length]). The dataset is very sparse ($95\%$ zeros). What would be an efficient way of storing these matrices in terms of

  • format (e.g. pickle, parquet)
  • structure (e.g. scipy.sparse.csr_matrix, List of Lists)

Topic: cnn, data-formats, bigdata

Category: Data Science


Sparse matrix compression is a massively efficient way of storing sparse data. The scipy package provides a variety of formats for this in scipy.sparse. However, none of them supports arrays with more than two dimensions.
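A minimal sketch of that limitation (the exact exception type and message may vary by scipy version):

import numpy as np
from scipy import sparse as sp

# 2-D input: scipy.sparse handles this fine
csr = sp.csr_matrix(np.zeros((4, 5)))

# 3-D input: scipy.sparse rejects it
try:
    sp.csr_matrix(np.zeros((4, 5, 6)))
except (TypeError, ValueError) as e:
    print("scipy.sparse rejects 3-D input:", e)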

I have found the sparse package handy: it supports coordinate list (COO) compression for higher-dimensional arrays, as in my use case.

Sparse matrix compression with COO:

# Imports: numpy for the dense array, sparse (pydata/sparse) for COO
import numpy as np
import sparse

# Load the stacked image array [observations x width x length]
A = np.load('array.npy', allow_pickle=True)

# Fraction of zero entries, reported as a percentage
sparsity = 1 - (np.count_nonzero(A) / A.size)
print("Sparsity of A: %s%%" % np.round(sparsity * 100, 1))
Sparsity of A: 99.6%

# Convert the dense array to a coordinate-list (COO) sparse array
S = sparse.COO(A)

# Compare in-memory sizes
print('Size of A in bytes: %s' % A.nbytes)
Size of A in bytes: 16563527400
print('Size of S in bytes: %s' % S.nbytes)
Size of S in bytes: 249330624
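For training, individual observations can be densified on demand, since COO arrays support numpy-style indexing. A sketch (the batch size of 32 is an illustrative assumption):

# Hypothetical mini-batch densification: slicing a COO array returns
# another COO array; only the slice is converted back to dense.
batch = S[0:32].todense()   # dense numpy array, shape (32, width, length)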

On disk:

array.npy --> 15.43 GB
array_after.npy  --> 16.40 MB
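To persist the compressed array, the sparse package also provides save_npz/load_npz. A minimal sketch (the filename 'array_sparse.npz' is illustrative):

import sparse

# Save the COO array in sparse's native .npz container
sparse.save_npz('array_sparse.npz', S)

# Reload it later without ever materializing the dense array
S = sparse.load_npz('array_sparse.npz')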
