Storing a large dataset for processing and analysis

I am new to data engineering and want to know the best way to store more than 3000 GB of data for further processing and analysis. I am specifically looking for open-source tools. I have explored many storage formats. The dataset I want to store is heart rate pulse data generated by a sensor.


It depends on the format of the data. Here is a brief overview of your options, with their pros and cons:

  • CSV - easily processed and shared, and searchable from the terminal with grep, but it becomes impractical beyond a few dozen GB. You could split your dataset into several CSV files. (JSON falls into this category too.)

  • SQL database - if the data is structured and follows a schema, a traditional SQL database (like PostgreSQL) can be a good option. SQL provides an expressive way to retrieve data, and a PostgreSQL database can handle 3 TB of data given appropriate hardware and configuration. Many programming languages offer ways to integrate with a SQL database such as PostgreSQL or SQLite.

  • NoSQL database - if the data is not structured or does not follow a schema, tools like MongoDB or Elasticsearch can store "key/values" or "documents". A NoSQL database can handle 3 TB of data given appropriate hardware and cluster configuration.

  • Time-series database - you mention heart rate pulse data, which is likely to be time-series data. You might look at databases that specialize in storing time series. InfluxDB would be my go-to if the time dimension is the defining feature of the problem you want to solve.
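
As a rough illustration of two of the options above (splitting the data into several CSV files, then querying it through SQL), here is a minimal sketch that uses only the Python standard library. It assumes each reading is a (Unix timestamp, beats per minute) pair, which the question does not confirm. The file names, the heart_rate table, and the ts/bpm columns are made up for the example. SQLite stands in for a SQL database only because it ships with Python; at 3 TB you would more likely point the same idea at PostgreSQL.

```python
import csv
import sqlite3
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def shard_by_day(rows, out_dir):
    """Stream (unix_ts, bpm) rows into one CSV file per UTC day."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}  # day string -> (open file, csv writer)
    try:
        for ts, bpm in rows:
            day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
            if day not in handles:
                # Hypothetical naming scheme: heart_rate_YYYY-MM-DD.csv
                f = (out_dir / f"heart_rate_{day}.csv").open("w", newline="")
                writer = csv.writer(f)
                writer.writerow(["ts", "bpm"])
                handles[day] = (f, writer)
            handles[day][1].writerow([ts, bpm])
    finally:
        for f, _ in handles.values():
            f.close()
    return sorted(out_dir.glob("heart_rate_*.csv"))


def load_into_sqlite(csv_paths, db_path=":memory:"):
    """Load the CSV shards into a table indexed on the timestamp."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS heart_rate "
        "(ts INTEGER NOT NULL, bpm INTEGER NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_heart_rate_ts ON heart_rate (ts)")
    for path in csv_paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            conn.executemany(
                "INSERT INTO heart_rate (ts, bpm) VALUES (?, ?)",
                ((int(r["ts"]), int(r["bpm"])) for r in reader),
            )
    conn.commit()
    return conn


def mean_bpm(conn, start_ts, end_ts):
    """Average heart rate over the half-open interval [start_ts, end_ts)."""
    row = conn.execute(
        "SELECT AVG(bpm) FROM heart_rate WHERE ts >= ? AND ts < ?",
        (start_ts, end_ts),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    # Made-up sample readings spanning two days.
    samples = [(0, 60), (1, 62), (86400, 70), (86401, 74)]
    with tempfile.TemporaryDirectory() as tmp:
        paths = shard_by_day(samples, tmp)
        conn = load_into_sqlite(paths)
        print(len(paths), mean_bpm(conn, 0, 86400))
        conn.close()
```

Sharding by day keeps each file small enough for tools like grep, and the index on ts is what makes time-range queries cheap once the data sits in a database. Note that the sketch keeps one file handle open per day, so a very long recording would need handles closed as days complete.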


Note: Since you say you're getting started with data engineering, the book Designing Data-Intensive Applications, by Martin Kleppmann, offers valuable guidance on how to build data pipelines and select the appropriate tools.


It depends on the use case: read-heavy vs. write-heavy vs. analytics workloads, and so on. Nonetheless, you may want to explore Hadoop if you have not done so already.
