How to create a big data frame in Python

I have a sparse matrix, $X$, created by TfidfVectorizer, and its shape is $(500000, 200000)$. I want to convert $X$ to a data frame, but I always get a memory error.

I tried

pd.DataFrame(X.toarray(), columns=tokens)

and

pd.read_csv(X.toarray().astype(np.float32), columns=tokens, chunksize=...)

And it seems that the error occurs when I convert $X$ to a dense NumPy array with X.toarray(), which makes sense: a dense $500000 \times 200000$ float64 array needs $500000 \times 200000 \times 8 \approx 8 \times 10^{11}$ bytes, i.e. about 800 GB.

Can someone suggest an easy solution for this? Is there any way I can create a sparse data frame from $X$ without a memory error?

I have been running my code on Google Colab Pro, and I think it provides less than 100 GB of RAM.

Topic: sparse dataframe, tfidf, python

Category: Data Science


Apart from working with the sparse array itself, you can also shrink the matrix by setting max_df and min_df, or max_features, on TfidfVectorizer.
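As a minimal sketch (the thresholds and the corpus variable are placeholders, not from the question):

from sklearn.feature_extraction.text import TfidfVectorizer

# Limit the vocabulary so the resulting matrix has far fewer columns:
# max_features keeps only the 50,000 most frequent terms, while
# min_df and max_df drop very rare and very common terms.
vectorizer = TfidfVectorizer(max_features=50000, min_df=5, max_df=0.8)
X = vectorizer.fit_transform(corpus)  # corpus: your list of documents
tokens = vectorizer.get_feature_names_out()

With shape $(500000, 50000)$ instead of $(500000, 200000)$, the matrix has a quarter of the columns before you even touch pandas.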


I have had to deal with huge data frames like the one you mention. In my case the problem was "solved" by storing the data frame as a pickle with pd.to_pickle() instead of as a CSV.

Memory usage dropped by about 60%.
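A minimal sketch of the idea (df and the file name are placeholders):

import pandas as pd

df.to_pickle("X.pkl")          # writes the frame, preserving dtypes
df = pd.read_pickle("X.pkl")   # reloads it exactly as it was saved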

I also recently heard about a format called Feather.

For reference:

https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
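Feather is used through pandas in much the same way (it requires pyarrow to be installed; the file name is again a placeholder):

import pandas as pd

df.to_feather("X.feather")        # columnar binary format, fast to read and write
df = pd.read_feather("X.feather")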



You can use pandas.DataFrame.sparse.from_spmatrix. It creates a DataFrame populated by pd.arrays.SparseArray columns from a SciPy sparse matrix.

Pandas used to have an explicit SparseDataFrame class, but modern versions have no such concept anymore; there is only the normal pd.DataFrame, populated by sparse data.
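A minimal sketch using the question's variables (X and tokens come from the TfidfVectorizer step):

import pandas as pd

# Build the DataFrame without ever densifying X;
# each column is backed by a pd.arrays.SparseArray.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=tokens)
print(df.sparse.density)  # fraction of cells that are non-zero

Keep in mind that operations that are not sparse-aware may convert columns back to dense, so the memory savings hold only as long as you stick to sparse-friendly operations.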
