How do I create a dataset from many CSV files that is too large for RAM

I have been handed about 40 GB of CSV files that I need to turn into a database. The files are arranged in a directory structure whose layout encodes the relationships between the different CSV files.

/base
  - ancillary_information.csv
  - /Run1
    - /Scenario A
      one.csv
      two.csv
      ...
    - /Scenario B
      one.csv
      ...
  - /Run2
    - /Scenario C
      one.csv
      ...
    - /Scenario D
      one.csv
      ...
  - /Run3
    ...

Each CSV file comes from a single sensor and contains 4-8 features, each of which is a time series. In addition to the file structure itself, there is an ancillary_information.csv file that holds a few labels for the RunX and Scenario Y directories. These labels are not time series; they are dates/times and strings.

I'd like to know how to combine these CSV files into a database that I can use to search and aggregate. Examples of desired functionality:

  • show me all the time series for feature F, if label B has value blah
  • give me the maximum value of feature F from each one.csv (suitably renamed within the database)

I am a python person, so pandas is my go-to for using data, but I cannot figure out how to build DataFrames on disk that keep a pandas-like relationship between them. Since pandas is in-memory, I guess that isn't possible. For example, I can use HDFStore to store a single DataFrame that I can get into memory. So I could do that for each Scenario and have one HDF5 file for all the data, and even use the metadata in the HDF5 file to include the labels. But the DataFrames aren't related, except through the HDF5 data structure. In that case I've gained pandas functionality at the Scenario level, but at the higher levels I've simply replicated the file-structure relationships, minus the ancillary_information.csv file. I want those higher levels to be related better, too.
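To make that concrete, this is roughly what I mean (just a sketch, with made-up file names, keys, and labels):

import pandas as pd

with pd.HDFStore("all_runs.h5") as store:      # one HDF5 file for everything
    df = pd.read_csv("/base/Run1/Scenario A/one.csv")
    key = "/Run1/ScenarioA/one"                # the key mirrors the directory tree
    store.put(key, df, format="table")
    # tuck the labels from ancillary_information.csv into the node's metadata
    store.get_storer(key).attrs.labels = {"date": "2021-06-01", "label_b": "blah"}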

Searching for answers on this problem returns innumerable results on how I should deal with getting the data out of a too-large-for-memory dataset, but not how to get it in. Even when I find packages like Dask, the storage examples are trivial, so I still can't figure out how to stitch together the pieces of data I can handle, or whether it can even do that.

I can also find recommendations for pyarrow and Parquet, and blogs gushing over this or that speed/compression benchmark, but those all have trivial examples too.

How do I combine my data into a useful dataset?

Tags: dataset, bigdata

Category: Data Science


This might not be the answer you want, but I think your best bet here will be to use SQL. There are some nice ways to integrate SQLite with pandas: pandas has built-in helpers (to_sql and read_sql) for moving data between DataFrames and a database.

From my limited understanding, SQL is perfect for this sort of situation. You can use it to join the different CSV tables against each other on certain columns, and a database like SQLite works off the hard drive rather than holding everything in RAM. It is built to handle datasets of this size.
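Something along these lines should work for getting the data in (an untested sketch; the paths, chunk size, and table/column names are just placeholders, and it assumes the per-sensor CSVs share the same columns):

import sqlite3
from pathlib import Path

import pandas as pd

con = sqlite3.connect("sensors.db")   # the database lives on disk, not in RAM

# walk the Run*/Scenario*/ tree and append every CSV, chunk by chunk
for csv_path in Path("/base").glob("Run*/*/*.csv"):
    run = csv_path.parts[-3]          # e.g. "Run1"
    scenario = csv_path.parts[-2]     # e.g. "Scenario A"
    sensor = csv_path.stem            # e.g. "one"
    for chunk in pd.read_csv(csv_path, chunksize=100_000):  # never the whole 40 GB in memory
        chunk["run"] = run
        chunk["scenario"] = scenario
        chunk["sensor"] = sensor
        # if different sensors have different feature columns, use one table per sensor type instead
        chunk.to_sql("readings", con, if_exists="append", index=False)

# the labels go in their own table, to be joined on run/scenario later
labels = pd.read_csv("/base/ancillary_information.csv")
labels.to_sql("labels", con, if_exists="replace", index=False)

con.close()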

For example, this query would select all the columns in csv1, join csv2 to csv1 using column2 (in csv2) and column1 (in csv1), and return a combined dataset which you could pull back into pandas (depending on the size of the result).

SELECT *
  FROM csv1
  LEFT JOIN csv2 ON csv2.column2 = csv1.column1;

Beautifully simple, right? I have not personally used this combination with pandas (I learned SQL through DataQuest), but if you have the flexibility to learn a bit of SQL for this project, I think it would make your life a LOT easier.
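And to get results back out, pandas can run a query like the one above and hand you a DataFrame directly. Again just a sketch, with the table and column names (readings, labels, timestamp, feature_f, label_b) carried over as assumptions from the ingest example:

import sqlite3

import pandas as pd

con = sqlite3.connect("sensors.db")

# e.g. "all the time series for feature F where label B has the value 'blah'"
query = """
SELECT r.run, r.scenario, r.sensor, r.timestamp, r.feature_f
  FROM readings AS r
  LEFT JOIN labels AS l
    ON l.run = r.run AND l.scenario = r.scenario
 WHERE l.label_b = 'blah'
"""

df = pd.read_sql(query, con)   # only the matching rows come back into memory
con.close()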
