At what data volumes do ingestion tools like Apache NiFi, Flume, or Storm, or tools like Logstash, become relevant?

I'm working at a company that has two legacy data warehouses, which have evolved into unmaintainable monoliths over time. Therefore, they are in dire need of reform.

I'm investigating a reform of the current data architecture into one that is more in line with the principles of a data mesh, as advocated in the influential article by Zhamak Dehghani (probably well-known material to data professionals here).

The first data warehouse, say DWH-A, mainly consists of data coming directly from the operational databases of the core company application. It is updated weekly through an FTP dump from the operational databases, and every update contains roughly 2 GB of data. The DWH has grown to a respectable size of roughly 300 GB over the course of 5 years.

The second data warehouse, say DWH-B, consists of a wide variety of data coming from all kinds of APIs and other data sources. It is updated continuously through API calls and has a size of roughly 100 GB.

Both data warehouses are built mainly with T-SQL and hosted on MS SQL Server. Currently, all data is inserted either from the operational databases (through SSIS) or from APIs (through SSIS in combination with ZappySys).

As I've been given the task of upgrading the current way of doing things, and since I believe SSIS is a rather superfluous and cumbersome way of inserting data, I'm looking for other ways of ingesting data into a data store that is more in line with the principles of a data mesh (so no monolithic data warehouse).

To this end, I came across tools like Apache NiFi, Flume, Storm, Kafka, and Logstash. All of these tools seem really powerful in their own right and suited to handling humongous amounts of data. However, given the volume of data I'm handling, I wonder whether such tools are truly relevant for my company. I don't want to kill a mosquito with a bazooka and unnecessarily complicate things. I could also simply build some Python scripts that run in our K8s cluster and periodically retrieve data and write it into our data storage.
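For scale, here is a minimal sketch of what I have in mind with such a periodic Python job; the API endpoint, staging table, and connection details are hypothetical placeholders, and I'm assuming `requests` and `pyodbc` for the HTTP and SQL Server sides:

```python
import os

import pyodbc
import requests

# Hypothetical source API and target table; names are placeholders, not our real system.
API_URL = "https://example.com/api/orders"
# e.g. "DRIVER={ODBC Driver 18 for SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=..."
CONN_STR = os.environ["MSSQL_CONN_STR"]


def ingest() -> None:
    # Pull the latest batch from the source API.
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    rows = response.json()  # assumed to be a list of {"id": ..., "amount": ..., "created_at": ...}

    # Write the batch into a staging table on SQL Server.
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        cursor.fast_executemany = True
        cursor.executemany(
            "INSERT INTO staging.orders (id, amount, created_at) VALUES (?, ?, ?)",
            [(r["id"], r["amount"], r["created_at"]) for r in rows],
        )
        conn.commit()


if __name__ == "__main__":
    # Run once per invocation; scheduling (e.g. a Kubernetes CronJob) handles periodicity.
    ingest()
```

Scheduled as a Kubernetes CronJob, a script like this would cover the periodic retrieve/write pattern described above without any dedicated ingestion framework.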

Summarizing the background into one question:

At what data volumes do ingestion tools like Apache NiFi, Flume, or Storm, or tools like Logstash, become relevant?

Any advice would be greatly appreciated.

Topic: data-engineering
