How to work with hundreds of CSVs, each with millions of rows?

So I'm doing a project on the COVID-19 Tweets dataset from IEEE DataPort, and I plan to analyse the tweets over the period from March 2020 to date. The thing is, there are more than 300 CSVs, one for each date, each with millions of rows. I need to hydrate all of these tweets before I can filter through them, and hydrating just one CSV took more than two hours today.

I wanted to know if there's a more efficient way to go about this. Is it possible to combine all the CSVs into one and hydrate that single (very large) file, or do I have to settle for a smaller dataset if each file takes this long?

I'm just starting out with real-life data, so consider me a true beginner; any help would be appreciated. Thanks!

Topics: text-filter, csv, sentiment-analysis, databases, machine-learning

Category: Data Science


Warning: I don't have any experience with the Twitter API, so I don't know if it imposes any specific limitations.

In general with big data, assuming the goal is to apply a repetitive process, it's preferable to split the data into small chunks so that the work can be distributed across multiple cores, ideally on a computing cluster.

However, in my experience there's a trade-off between convenience and speed: splitting the data and re-assembling it (the MapReduce pattern) requires more manipulation, which means either more manual work (and possible errors) or more code to write. This is why I usually consider a distributed design only if it's easy to implement or the sequential computation would simply take too long.

In the case you describe, processing a single CSV containing all the data would take around 300 × 2 = 600 hours, which is roughly 25 days... so I would instead try to process the files directly in parallel. For example, with only 8 cores it would take around 600 / 8 = 75 hours, but with 40 cores it's only around 15 hours, etc.
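To make the parallel option concrete, here is a minimal sketch using Python's multiprocessing module. The hydrate_one() helper and the data/ folder are hypothetical placeholders: the function body would call whatever hydration tool you use (e.g. twarc). One caveat: if the Twitter API rate-limits you per account, adding more workers won't necessarily speed up the hydration step itself.

```python
from multiprocessing import Pool
from pathlib import Path

def hydrate_one(csv_path: Path) -> Path:
    """Hydrate the tweet IDs from one CSV and write the result next to it."""
    out_path = csv_path.with_name(csv_path.stem + "_hydrated.jsonl")
    # ... call your hydration tool here (e.g. twarc) and write to out_path ...
    return out_path

if __name__ == "__main__":
    csv_files = sorted(Path("data").glob("*.csv"))
    # One worker process per core; each worker picks up the next pending file.
    with Pool(processes=8) as pool:
        for finished in pool.imap_unordered(hydrate_one, csv_files):
            print(f"done: {finished}")
```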

An additional advantage of processing in batches is that if an error occurs along the way, you don't need to re-process all the data, only the batches that failed.
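For example, if each batch writes its output to its own file, a re-run can simply skip the files that already have output. A sketch of that idea, using the same hypothetical naming scheme as above:

```python
from pathlib import Path

def pending_files(data_dir: str = "data") -> list[Path]:
    """Return only the CSVs whose hydrated output doesn't exist yet."""
    todo = []
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        out_path = csv_path.with_name(csv_path.stem + "_hydrated.jsonl")
        if not out_path.exists():
            todo.append(csv_path)
    return todo
```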
