How do you do data management?

I have a few million records (small CSV/JSON files) from different sources, with about 50k added every day. All on my local host.

Until now, I have been using a simple file structure to manage them, but it's getting cumbersome. Ideally, I'd like to query files by their metadata (source, type, etc.) and pipe the results into my ML pipeline (TFX). I'd like to keep them local if possible.

Does anyone have a solution that you think would work well?

All the best!

Johnny



So after many experiments, this is what I landed on:

Raw CSV files get converted to Parquet, and the Parquet files are stored in MinIO.
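The ingest step looks roughly like this (a minimal sketch, assuming a local MinIO server on localhost:9000; the bucket name, object paths, and credentials are placeholders for whatever your setup uses):

```python
# Convert a raw CSV to Parquet locally, then upload it to MinIO.
import pandas as pd          # pip install pandas pyarrow
from minio import Minio      # pip install minio

def csv_to_minio(csv_path: str, object_name: str) -> None:
    # Write the CSV out as Parquet first (pandas uses pyarrow under the hood).
    parquet_path = csv_path.rsplit(".", 1)[0] + ".parquet"
    pd.read_csv(csv_path).to_parquet(parquet_path, index=False)

    # Push the Parquet file to the local MinIO instance.
    client = Minio(
        "localhost:9000",
        access_key="minioadmin",   # default dev credentials; change in practice
        secret_key="minioadmin",
        secure=False,              # plain HTTP is fine for a localhost setup
    )
    if not client.bucket_exists("datasets"):
        client.make_bucket("datasets")
    client.fput_object("datasets", object_name, parquet_path)

# Example: file a daily batch under a per-source prefix, so the object path
# itself encodes the metadata (source, date) you want to query by later.
csv_to_minio("records_2023-01-01.csv", "source_a/records_2023-01-01.parquet")
```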

A few considerations:

  1. Parquet files are fast to read, and because the schema travels with the file, I don't have to constantly change schema handling in my code
  2. I can use Apache Drill to query the Parquet files stored in MinIO directly, and then use Superset for analysis (see the sketch after this list)
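For the query side, here's a hedged sketch using Drill's REST API (Drill's web server defaults to port 8047). It assumes you've configured an S3 storage plugin named `s3` in Drill that points at the MinIO endpoint; the bucket and object paths below are the same placeholders as above:

```python
# Query Parquet objects in MinIO through Apache Drill's REST API.
import requests  # pip install requests

DRILL_URL = "http://localhost:8047/query.json"

def drill_query(sql: str) -> list:
    # Drill accepts a JSON payload with the query type and SQL text,
    # and returns the result rows as a list of dicts.
    resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
    resp.raise_for_status()
    return resp.json()["rows"]

# Read a Parquet object straight out of MinIO, no ETL step needed.
rows = drill_query(
    "SELECT * FROM s3.`datasets/source_a/records_2023-01-01.parquet` LIMIT 10"
)
for row in rows:
    print(row)
```

Superset can then point at the same Drill instance as a SQL database, so the analysis layer needs no extra plumbing.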

Cheers
