How do I create a dataset that is too large for RAM from many CSV files?
I have been handed about 40 GB of CSV files that I need to turn into a database. The files are arranged in a directory structure, and a file's location within that structure encodes its relationship to the other CSV files:
/base
- ancillary_information.csv
- /Run1
    - /Scenario A
        one.csv
        two.csv
        ...
    - /Scenario B
        one.csv
        ...
- /Run2
    - /Scenario C
        one.csv
        ...
    - /Scenario D
        one.csv
        ...
- /Run3
    ...
Each CSV file holds the data from a single sensor and contains 4-8 features, each of which is a time series. In addition to the file structure itself, there is an ancillary_information.csv file with a few labels for the RunX and Scenario Y directories. These labels are not time series; they are dates/times and strings.
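To make the shapes concrete, this is roughly what one sensor file and the ancillary file look like when read with pandas (the column names here are made up; mine differ):

    import pandas as pd

    # One sensor file: a timestamp plus a handful of time-series features.
    sensor = pd.read_csv("base/Run1/Scenario A/one.csv")
    # e.g. columns: ['timestamp', 'feat_1', 'feat_2', 'feat_3', 'feat_4']

    # The ancillary file: one row per Run/Scenario with a few labels
    # (dates/times and strings).
    labels = pd.read_csv("base/ancillary_information.csv")
    # e.g. columns: ['run', 'scenario', 'label_B', 'start_time']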
I'd like to know how to combine these CSV files into a database that I can search and aggregate. Examples of the desired functionality (roughly sketched in pandas terms just after this list):
- show me all the time series for feature F, if label B has value blah
- give me the maximum value of feature F from each one.csv (suitably renamed within the database)
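In code, if everything lived in one long-format table (call it big, with hypothetical columns run, scenario, sensor, feature, timestamp, value plus the labels), I mean queries roughly like:

    # 1) all time series for feature F, where label B has the value "blah"
    subset = big[(big["feature"] == "F") & (big["label_B"] == "blah")]

    # 2) maximum value of feature F from each (suitably renamed) one.csv
    maxima = (
        big[big["feature"] == "F"]
        .groupby(["run", "scenario", "sensor"])["value"]
        .max()
    )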
I am a Python person, so pandas is my go-to for working with data, but I cannot figure out how to build something on disk that preserves a pandas-like relationship between the different DataFrames. Since pandas is in-memory, I guess that isn't directly possible. For example, I can use HDFStore to store a single DataFrame that I can later load back into memory. So I could do that for each Scenario, keep all the data in one HDF5 file, and even use the HDF5 metadata to hold the labels. But the DataFrames aren't related to each other, except through the HDF5 key structure. In that case I've gained pandas functionality at the Scenario level, but above that level I've simply replicated the file-structure relationships, minus the ancillary_information.csv file. I want those higher levels to be properly related, too.
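For concreteness, this is the sort of HDFStore layout I mean, with the labels stuffed into per-key metadata (a pattern I believe works when PyTables is installed, but the keys are still just a copy of the directory tree; the label names are hypothetical):

    import pandas as pd

    with pd.HDFStore("all_data.h5") as store:
        # One key per Run/Scenario/sensor, mirroring the directory layout
        # (spaces dropped from the key).
        df = pd.read_csv("base/Run1/Scenario A/one.csv")
        store.put("/Run1/ScenarioA/one", df, format="table")

        # Attach the ancillary labels for this Run/Scenario as metadata
        # on that key.
        store.get_storer("/Run1/ScenarioA/one").attrs.labels = {
            "label_B": "blah",
            "start_time": "2020-01-01 00:00:00",
        }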
Searching for answers to this problem returns innumerable results on how to get data out of a too-large-for-memory dataset, but not on how to get it in. Even when I find packages like Dask, the storage examples are trivial, so I still can't figure out how to stitch together the pieces of data I can handle one at a time, or whether Dask can even do it.
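For example, the furthest I have got with Dask is something like the following, which globs every sensor CSV into one lazy DataFrame and keeps the originating file path so the Run/Scenario relationship isn't lost (assuming I'm reading the docs correctly; feat_1 is a made-up column name):

    import dask.dataframe as dd

    # Lazily read every sensor CSV under /base into one DataFrame.
    # include_path_column adds a 'path' column giving each row's source file.
    # (This assumes every CSV has the same columns, which mine do not.)
    ddf = dd.read_csv("base/Run*/Scenario */*.csv", include_path_column=True)

    # e.g. the maximum of one feature per source file
    maxima = ddf.groupby("path")["feat_1"].max().compute()

But I don't see how the ancillary labels fit into this, or whether it is the intended pattern at all.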
I can also find recommendations for pyarrow and Parquet, and blogs gushing over this or that speed/compression benchmark, but those all use trivial examples as well.
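For what it's worth, the least-trivial pattern I can imagine is to loop over the tree with plain pandas (each file fits in memory on its own), melt each sensor into a long format so every file has the same schema, tag the rows with their Run/Scenario/sensor and the ancillary labels, and append everything to one partitioned Parquet dataset on disk. I don't know whether this is the right approach, hence the question; the column names are again made up:

    from pathlib import Path
    import pandas as pd

    base = Path("base")
    # hypothetical ancillary columns: run, scenario, plus the label columns
    labels = pd.read_csv(base / "ancillary_information.csv")

    for csv_path in base.glob("Run*/Scenario */*.csv"):
        run, scenario, sensor = csv_path.parts[-3], csv_path.parts[-2], csv_path.stem

        # one sensor at a time, so each read fits comfortably in memory
        wide = pd.read_csv(csv_path)

        # melt the 4-8 feature columns into a common long schema:
        # timestamp, feature, value
        long = wide.melt(id_vars=["timestamp"], var_name="feature", value_name="value")
        long["run"], long["scenario"], long["sensor"] = run, scenario, sensor

        # attach the ancillary labels for this Run/Scenario
        long = long.merge(labels, on=["run", "scenario"], how="left")

        # append to a single on-disk dataset, partitioned by run/scenario
        long.to_parquet("dataset", engine="pyarrow",
                        partition_cols=["run", "scenario"])

If that is a sane layout, I assume the query side would then be something like pd.read_parquet("dataset", filters=[("run", "==", "Run1")]) or dask.dataframe.read_parquet("dataset"), but I'd like to know whether this is how people actually do it.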
How do I combine my data into a useful dataset?