Efficiently modify a large csv file in Pandas

I have a csv file and would like to do the following modification on it:

import pandas

df = pandas.read_csv('some_file.csv')
df.index = df.index.map(lambda x: x[:-1])
df.to_csv('some_file.csv')

This takes the index, removes the last character, and then saves it again.

I have multiple problems with this solution since my csv is quite large (around 500GB).

First of all, reading and then writing seems not to be very efficient since every line will be fully overwritten, which is not necessary, right?

Furthermore, due to a lack of RAM, I opened this csv in chunks using pandas.read_csv's chunksize option. However, I do not think it is a good idea to save every individual chunk and append them to one long csv - especially if I use multiprocessing, since the structure of the csv would be completely messed up.

Is there a better solution to this problem?

Thank you very much in advance.

Topic dataframe csv pandas python bigdata

Category Data Science


I suggest reading only the index column from the CSV file, doing your modification on it, and copying it back, instead of reading the entire CSV file.

You can do that with:

df = pd.read_csv("sample.csv", usecols=[0])  # read only the first column (the index)

Instead of df.index.map(), use pandas' .str accessor so that the slicing is vectorized. That will speed up processing on each chunk.
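For example, here is a minimal sketch that combines chunked reading with the vectorized .str slicing and writes the chunks out sequentially to a new file. The chunk size is an assumption, and it assumes the index is the first column and the file has a header row:

import pandas as pd

# Read the csv in chunks; index_col=0 assumes the index is the first column.
reader = pd.read_csv("some_file.csv", index_col=0, chunksize=1_000_000)

with open("some_file_new.csv", "w") as f_out:
    for i, chunk in enumerate(reader):
        # Vectorized slicing via the .str accessor instead of index.map()
        chunk.index = chunk.index.str[:-1]
        # Write the header only for the first chunk, then append the rest in order
        chunk.to_csv(f_out, header=(i == 0))

Because the chunks are read and written sequentially in a single process, the row order of the output matches the input, so the structure of the csv is preserved.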

First of all, reading and then writing seems not to be very efficient since every line will be fully overwritten, which is not necessary, right?

Unless the entries in your index are of uniform length (so the final character is always the same number of bytes from the start of the line), I think it is necessary to rewrite every line. Just one of the unfortunate downsides of working with a text-based format like csv.

I could be wrong - maybe there is some more intelligent way to rewrite csv files - but the time and effort needed to find a library capable of doing so, and then to integrate it with pandas, would probably be greater than the time it takes to process the 500GB file the traditional way.


You can read the file line by line, process each line, and write it to a new file line by line. This is probably not the most efficient way, but it will certainly solve the RAM issue.

For example:

with open("my_file.csv") as f_in, open("new_file.csv", "w") as f_out:
    for line in f_in:
        new_line = line  # do your processing here
        f_out.write(new_line)
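As a more concrete (still hedged) sketch of the processing step, assuming the index is the first comma-separated field, the file starts with a header row, and no field contains embedded commas or quotes (otherwise the csv module would be safer):

with open("my_file.csv") as f_in, open("new_file.csv", "w") as f_out:
    header = next(f_in)  # copy the header row unchanged
    f_out.write(header)
    for line in f_in:
        # Split off the first field, assumed to be the index
        index, rest = line.split(",", 1)
        # Drop the last character of the index and write the line back out
        f_out.write(index[:-1] + "," + rest)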
