Do I load all files at once or one at a time?

I currently have 1700+ CSV files. Each of them is in the same format and structure, give or take a row or possibly a column at the end. Each CSV is ≈ 3.8 MB.

I need to perform a transformation on each file:

  1. Extract one data set, perform a summation and column select, then store the result inside a folder.
  2. Extract an array (no need for column names here), then store it inside a folder.
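
For concreteness, here is a minimal Julia sketch of the kind of per-file transformation I mean (the column names :value, :x and :y are placeholders for my real schema):

```julia
using CSV, DataFrames, DelimitedFiles

function process_file(path::AbstractString, outdir::AbstractString)
    df = CSV.read(path, DataFrame)

    # 1. Column select + summation, stored as its own CSV
    #    (:value stands in for the real column).
    summary = combine(df, :value => sum => :total)
    CSV.write(joinpath(outdir, "summary_" * basename(path)), summary)

    # 2. Raw numeric array, no column names needed
    #    (:x and :y stand in for the real columns).
    arr = Matrix(df[:, [:x, :y]])
    writedlm(joinpath(outdir, "array_" * basename(path)), arr, ',')
end
```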

From an algorithmic POV, is it better to load each file, perform the actions needed and then move on to the next file?

OR

Do I load all files, perform the action on all files and then store to hard drive?

I know how to do the actual process; I am after a 20,000-foot POV of dynamic programming/optimisation.

Thanks.

Topic: julia, optimization, dataset processing

Category: Data Science


Since the process can be run independently on every file (or batch of files), I would tend to recommend processing the files one by one, for the sake of scalability:

  • Depending on the details of the task, there could be some minor optimizations available only by processing the whole data at once, e.g. things related to I/O and memory usage, but I can't think of any significant improvement obtainable this way.
  • By itself, processing the files independently won't bring any performance improvement either, but it has a massive advantage in terms of scalability: the process can be distributed easily across multiple cores, with any number of files handled on each core. This offers flexibility both for the current data (if you can afford multiple cores/machines, the process can be faster) and, importantly, for any future version of the data, even if it grows very large (a sketch follows this list).
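
As a rough illustration of the second point, here is one way the per-file job could be fanned out over local cores with Julia's Distributed standard library (process_file, the data/ and out/ paths, and the worker count are assumptions, not a prescription):

```julia
using Distributed
addprocs(4)  # one worker per core you can spare; adjust freely

@everywhere using CSV, DataFrames

# Hypothetical per-file job: summation + column select on a
# placeholder column :value, written out as its own CSV.
@everywhere function process_file(path, outdir)
    df = CSV.read(path, DataFrame)
    summary = combine(df, :value => sum => :total)
    CSV.write(joinpath(outdir, "summary_" * basename(path)), summary)
end

files = readdir("data"; join=true)        # the 1700+ input CSVs
pmap(f -> process_file(f, "out"), files)  # workers pull files as they finish
```

pmap load-balances automatically, so a few slow files won't stall the rest, and the same pattern extends to multiple machines by passing host specifications to addprocs.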
