CountVectorizer fit_transform + cosine similarity MemoryError

count_matrix = count.fit_transform(off_data3['bag_of_words'])

The resulting count_matrix has shape:

count_matrix.shape  # (476147, 482824)

cosine_sim = cosine_similarity(count_matrix, count_matrix)

I think the matrix is too large, which is causing this memory error:
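A quick back-of-the-envelope check supports that suspicion: with 476,147 rows, the pairwise similarity matrix has n² entries, and even before densification the sparse product must allocate index arrays of comparable size. Assuming a float64 dense result, the memory needed is roughly:

```python
n = 476147                       # number of rows in count_matrix
bytes_needed = n * n * 8         # n x n similarity matrix, 8 bytes per float64
print(bytes_needed / 1e12)       # ≈ 1.8 (terabytes)
```

So the full similarity matrix cannot fit in RAM on any ordinary machine, regardless of how the computation is expressed.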

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input> in <module>

~/venv/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
   1034
   1035     K = safe_sparse_dot(X_normalized, Y_normalized.T,
-> 1036                         dense_output=dense_output)
   1037
   1038     return K

~/venv/lib/python3.6/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    135     """
    136     if sparse.issparse(a) or sparse.issparse(b):
--> 137         ret = a * b
    138         if dense_output and hasattr(ret, "toarray"):
    139             ret = ret.toarray()

~/venv/lib/python3.6/site-packages/scipy/sparse/base.py in __mul__(self, other)
    479         if self.shape[1] != other.shape[0]:
    480             raise ValueError('dimension mismatch')
--> 481         return self._mul_sparse_matrix(other)
    482
    483     # If it's a list or whatever, treat it like a matrix

~/venv/lib/python3.6/site-packages/scipy/sparse/compressed.py in _mul_sparse_matrix(self, other)
    514                                         maxval=nnz)
    515         indptr = np.asarray(indptr, dtype=idx_dtype)
--> 516         indices = np.empty(nnz, dtype=idx_dtype)
    517         data = np.empty(nnz, dtype=upcast(self.dtype, other.dtype))
    518

MemoryError:

Any tips on how to avoid this memory error when working with a large matrix?

Topic data-analysis cosine-distance nlp machine-learning

Category Data Science


It's not clear to me what your data is or what you are trying to do with it, but from what I gather you are trying to calculate the cosine similarity for each pair in a Cartesian product, right?

If so, then you might want to use "blocking" to reduce the number of comparisons; see https://datascience.stackexchange.com/a/54582/64377.
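A related way to keep memory bounded, sketched below, is to compute the similarity matrix one block of rows at a time and keep only the top-k matches per row, so the full n×n matrix is never materialized. This is not the blocking-by-key technique from the linked answer; it is a chunking approach that addresses the same memory problem. The function name and parameters here are illustrative, not from any library:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

def top_k_cosine(X, k=10, block_size=1000):
    """Yield (row_index, top_k_column_indices, scores) for each row of
    sparse matrix X, computing cosine similarity block by block."""
    Xn = normalize(X)  # L2-normalize rows: cosine similarity = dot product
    for start in range(0, Xn.shape[0], block_size):
        block = Xn[start:start + block_size]
        # only a (block_size, n_rows) slice is ever dense in memory
        sims = (block @ Xn.T).toarray()
        # column indices of the k largest similarities in each row
        top = np.argpartition(sims, -k, axis=1)[:, -k:]
        for i, cols in enumerate(top):
            yield start + i, cols, sims[i, cols]

# usage on a small random sparse matrix
X = sparse.random(50, 20, density=0.3, random_state=0, format="csr")
for row, cols, scores in top_k_cosine(X, k=3, block_size=16):
    pass  # each row's 3 most similar rows and their scores
```

With your shape, a block size of around 1000 keeps each dense slice near 476147 × 1000 × 8 bytes ≈ 3.8 GB; tune `block_size` down if that is still too large for your machine.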
