Do I load all files at once or one at a time?

I currently have 1700+ CSV files. Each of them is in the same format and structure, give or take a row or possibly a column at the end. Each CSV is ≈ 3.8 MB.

I need to perform a transformation on each file:

  1. Extract one data set, perform a summation and column select, then store the result inside a folder.
  2. Extract an array (no need for column names here), then store it inside a folder.
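
For concreteness, here is a minimal Julia sketch of the kind of per-file transformation I mean (the column names :value, :x and :y are placeholders for my real schema):

```julia
using CSV, DataFrames, DelimitedFiles

function process_file(path::AbstractString, outdir::AbstractString)
    df = CSV.read(path, DataFrame)

    # 1. Column select + summation, stored as its own CSV
    #    (:value stands in for the real column).
    summary = combine(df, :value => sum => :total)
    CSV.write(joinpath(outdir, "summary_" * basename(path)), summary)

    # 2. Raw numeric array, no column names needed
    #    (:x and :y stand in for the real columns).
    arr = Matrix(df[:, [:x, :y]])
    writedlm(joinpath(outdir, "array_" * basename(path)), arr, ',')
end
```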

From an algorithmic POV, is it better to load each file, perform the actions needed and then move on to the next file?

OR

Do I load all files, perform the action on all files and then store to hard drive?

I know how to do the actual process; I am after a 20,000-foot POV of dynamic programming/optimisation.

Thanks.

Topic: julia, optimization, dataset processing

Category: Data Science


Since the process can be run independently on every file (or batch of files), I would tend to recommend processing the files one by one, for the sake of scalability:

  • Depending on the details of the task, there could be some minor optimizations available only by processing the whole data at once, e.g. things related to I/O and memory usage, but I can't think of any significant improvement obtainable this way.
  • By itself, processing the files independently won't bring any performance improvement either, but it has a massive advantage in terms of scalability: the process can be distributed easily across multiple cores, with any number of files handled on each core. This offers flexibility both for the current data (if you can afford multiple cores/machines, the process can be faster) and, importantly, for any future version of the data, even if it grows very large (a sketch follows this list).
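
As a rough illustration of the second point, here is one way the per-file job could be fanned out over local cores with Julia's Distributed standard library (process_file, the data/ and out/ paths, and the worker count are assumptions, not a prescription):

```julia
using Distributed
addprocs(4)  # one worker per core you can spare; adjust freely

@everywhere using CSV, DataFrames

# Hypothetical per-file job: summation + column select on a
# placeholder column :value, written out as its own CSV.
@everywhere function process_file(path, outdir)
    df = CSV.read(path, DataFrame)
    summary = combine(df, :value => sum => :total)
    CSV.write(joinpath(outdir, "summary_" * basename(path)), summary)
end

files = readdir("data"; join=true)        # the 1700+ input CSVs
pmap(f -> process_file(f, "out"), files)  # workers pull files as they finish
```

pmap load-balances automatically, so a few slow files won't stall the rest, and the same pattern extends to multiple machines by passing host specifications to addprocs.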
