Best way to store large amounts of time series data? Relational database (SQL) or a NoSQL route? Also an additional Python/pandas question inside
I'm working on a project and I'm stuck on how to store my data. I have an approach I want to propose but am unsure whether it's feasible; if it isn't, I'd appreciate any help pointing me in the right direction.
As I said, I'm working on a small project. For this project I want to store two 2-dimensional arrays every 20 seconds or so, with the time (seconds since the Unix epoch) as the key/reference for each set of readings. Essentially, I want to be able to read/retrieve/check/process the data for a specific time, looked up by that timestamp. The two 2-dimensional arrays are just value pairs, so effectively I'm storing two sets of value pairs every 20 seconds or so. At each sample period, one of these arrays is always exactly 1000 entries long, whereas the other can vary from a couple of entries to a couple hundred.
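To make the shape concrete, here's a rough sketch of what a single sample looks like in Python (the names and random values are just placeholders I made up for illustration):

```python
import time
import random

# One sample, taken every ~20 seconds, keyed by seconds since the Unix epoch.
timestamp = int(time.time())

# First 2D array: always exactly 1000 (x, y) value pairs per sample.
array_a = [(i, random.random()) for i in range(1000)]

# Second 2D array: anywhere from a couple to a couple hundred (x, y) pairs.
array_b = [(i, random.random()) for i in range(random.randint(2, 200))]

sample = {
    "timestamp": timestamp,  # key/reference for this set of readings
    "array_a": array_a,
    "array_b": array_b,
}
```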
Now, I plan to collect this data continuously for the foreseeable future. I'm using Python, and I was originally thinking of just using a JSON, CSV, etc. format to store this data. However, no matter how far I looked, I couldn't find a way to append data to such a file without either loading the entire file into memory first, or keeping everything in memory and exporting to a file once every 24 hours or so. I don't want to do either of these because the data will grow quite big pretty quickly, and I don't want to keep all of it in memory until I write it, or reload the file every 20 seconds, since that would obviously use a lot of memory. Ideally I'd run this as a background process on a Raspberry Pi, so memory is a factor.
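For context, this is the kind of buffer-everything-then-dump approach I'm trying to avoid (names are made up; the point is that the buffer holds a whole day of samples in RAM before anything hits disk):

```python
import json
import time

FLUSH_INTERVAL = 24 * 60 * 60  # dump to disk once every 24 hours
buffer = {}                    # timestamp -> sample, all held in RAM
last_flush = time.time()

def record(sample):
    """Accumulate samples in memory; only write when the flush interval passes."""
    global last_flush
    buffer[sample["timestamp"]] = sample
    if time.time() - last_flush >= FLUSH_INTERVAL:
        # A full day's worth of readings written in one go; holding this buffer
        # (or re-reading the file later to append more) is what worries me on a Pi.
        with open(f"readings_{int(last_flush)}.json", "w") as f:
            json.dump(buffer, f)
        buffer.clear()
        last_flush = time.time()
```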
Based on this, I figured I would use a relational database like MySQL and make each set of readings a new table, with the time (seconds since the Unix epoch) as the table name. This avoids keeping a large volume of data in memory until it's written out, because I can just create a new table every 20 seconds and have a main table that references each of the sub-tables. It would also be really nice because I could query the data as it's being logged, without having to export it first. Is this a good way to handle this data? I already have a second MariaDB instance set up and running for this, but then I started to doubt whether creating a new table every 20 seconds is smart. I have very little experience with relational databases, and I couldn't find out whether this is even OK to do, and beyond it being OK, whether it's the right way to do it.
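Here's roughly the schema I had in mind, sketched with the standard-library sqlite3 module just so it runs anywhere (the MariaDB statements would be essentially the same); I haven't actually built this, since I'm not sure the table-per-sample idea is sane:

```python
import sqlite3
import time

conn = sqlite3.connect("readings.db")

# Main table that references each per-sample sub-table by name.
conn.execute(
    "CREATE TABLE IF NOT EXISTS samples (timestamp INTEGER PRIMARY KEY, table_name TEXT)"
)

def store_sample(timestamp, array_a, array_b):
    """One new table per 20-second sample, named after the epoch timestamp."""
    table_name = f"readings_{timestamp}"
    conn.execute(f"CREATE TABLE {table_name} (array_id INTEGER, x REAL, y REAL)")
    conn.executemany(f"INSERT INTO {table_name} VALUES (1, ?, ?)", array_a)
    conn.executemany(f"INSERT INTO {table_name} VALUES (2, ?, ?)", array_b)
    conn.execute("INSERT INTO samples VALUES (?, ?)", (timestamp, table_name))
    conn.commit()

# Example: store one sample of dummy value pairs keyed by the current epoch seconds.
store_sample(int(time.time()), [(1, 2.0), (2, 3.5)], [(1, 0.1)])
```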
Then I stumbled upon the HDF5 format, and it seemed like my solution: from my limited knowledge of the format, it looks like a perfect way to store this type of data. However, this brings me back to the issue of having to load the file every time I want to append something to it. Does anyone know of a way around this? If there is, then this is a viable option, and it leads to my final question, which I could not find the answer to: can an element of a pandas DataFrame (the library I intend to use if this is the preferred option) be a reference to another DataFrame? I ask because of the way I planned to implement this: I was thinking of having one DataFrame hold all the timestamps and act as the master DataFrame/array, then create a new DataFrame with all the readings every 20 seconds and keep a reference to each of these new DataFrames in the master/timestamp DataFrame. If there is a better implementation than the one I suggested, I'm all ears.
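To show what I mean, here's a rough sketch of the structure I had in mind. I've used a plain dict as the "master" here, because whether a DataFrame element can actually hold a reference to another DataFrame is exactly what I'm unsure about:

```python
import time
import pandas as pd

# Stand-in for the master timestamp structure: epoch seconds -> readings DataFrame.
# Ideally this would itself be a DataFrame holding references to the per-sample frames.
master = {}

def add_sample(array_a, array_b):
    """Every ~20 seconds: build a readings DataFrame and register it under its timestamp."""
    ts = int(time.time())
    readings = pd.DataFrame(
        {
            "array": ["a"] * len(array_a) + ["b"] * len(array_b),
            "x": [p[0] for p in array_a] + [p[0] for p in array_b],
            "y": [p[1] for p in array_a] + [p[1] for p in array_b],
        }
    )
    master[ts] = readings

# Example usage with dummy value pairs.
add_sample(array_a=[(1, 2.0), (2, 3.5)], array_b=[(1, 0.1)])
```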
So my question is: based on my requirements, which method should I go with, the SQL or the NoSQL approach? I haven't started this part of my project yet because I can't decide which option is best, given my limited knowledge of databases and of storing large datasets. I'm open to any and all suggestions, even ones not listed here.