How to create a big data frame in Python

I have a sparse matrix, $X$, created by TfidfVectorizer, and its shape is $(500000, 200000)$. I want to convert $X$ to a data frame, but I always get a memory error.

I tried

pd.DataFrame(X.toarray(), columns=tokens)

and

pd.read_csv(X.toarray().astype(np.float32), columns=tokens, chunksize=...)

And it seems that the error occurs when I convert $X$ to a dense NumPy array with X.toarray(), which makes sense: a dense $500000 \times 200000$ float64 array needs $500000 \times 200000 \times 8 \approx 8 \times 10^{11}$ bytes, i.e. about 800 GB.

Can someone suggest an easy solution for this? Is there any way I can create a sparse data frame from $X$ without a memory error?

I have been running my code on Google Colab Pro, and I think it provides less than 100 GB of RAM.

Topic: sparse dataframe, tfidf, python

Category: Data Science


Apart from working with the sparse array itself, you can also shrink the matrix by setting max_df and min_df, or max_features, on TfidfVectorizer.
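As a minimal sketch (the thresholds and the corpus variable are placeholders, not from the question):

from sklearn.feature_extraction.text import TfidfVectorizer

# Limit the vocabulary so the resulting matrix has far fewer columns:
# max_features keeps only the 50,000 most frequent terms, while
# min_df and max_df drop very rare and very common terms.
vectorizer = TfidfVectorizer(max_features=50000, min_df=5, max_df=0.8)
X = vectorizer.fit_transform(corpus)  # corpus: your list of documents
tokens = vectorizer.get_feature_names_out()

With shape $(500000, 50000)$ instead of $(500000, 200000)$, the matrix has a quarter of the columns before you even touch pandas.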


I have had to deal with huge data frames like the one you mention. In my case the problem was "solved" by storing the data frame as a pickle with pd.to_pickle() instead of as a CSV.

Memory usage dropped by about 60%.
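A minimal sketch of the idea (df and the file name are placeholders):

import pandas as pd

df.to_pickle("X.pkl")          # writes the frame, preserving dtypes
df = pd.read_pickle("X.pkl")   # reloads it exactly as it was saved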

I also recently heard about a format called Feather.

For reference:

https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
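Feather is used through pandas in much the same way (it requires pyarrow to be installed; the file name is again a placeholder):

import pandas as pd

df.to_feather("X.feather")        # columnar binary format, fast to read and write
df = pd.read_feather("X.feather")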



You can use pandas.DataFrame.sparse.from_spmatrix. It creates a DataFrame populated by pd.arrays.SparseArray columns from a SciPy sparse matrix.

Pandas used to have an explicit SparseDataFrame class, but modern versions have no such concept anymore; there is only the normal pd.DataFrame, populated by sparse data.
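A minimal sketch using the question's variables (X and tokens come from the TfidfVectorizer step):

import pandas as pd

# Build the DataFrame without ever densifying X;
# each column is backed by a pd.arrays.SparseArray.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=tokens)
print(df.sparse.density)  # fraction of cells that are non-zero

Keep in mind that operations that are not sparse-aware may convert columns back to dense, so the memory savings hold only as long as you stick to sparse-friendly operations.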
