ETL and Data Engineering - is it purely the knowledge of tools or is there theory behind it?

I would like to better understand what a good Data Engineer must know and what they do. Job descriptions mostly list required tools, such as Python. If it is possible to separate Data Engineering from Data Science, on what principles is Data Engineering based, and what is the result of Data Engineering? Is it creating some data structures? If so, what might these structures be? Are there standards or best practices?

Topic data-engineering data-analysis etl databases

Category Data Science


There certainly is theory, or at least competing methodologies, behind ETL and Data Warehousing. For a start, look at the Inmon vs Kimball methodologies.

In a nutshell (I could talk for days on this subject), the methodology of Bill Inmon (the Father of Data Warehousing) revolves around building a large, loosely third-normal-form data warehouse from multiple sources, from which business-domain-centric reporting star schemas can be quickly built and disposed of as needed. Kimball, by contrast, concentrates on building (through some staging steps) directly into the reporting structures.
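To make the two kinds of structure concrete, here is a minimal sketch using Python's built-in sqlite3 module. All table and column names (customer, product, order_line, dim_customer, dim_product, fact_sales) and the sample rows are made up for illustration; real warehouses are far larger and would normally use surrogate keys and a proper staging area. It only shows a small normalized store and a reporting star schema derived from it, roughly the Inmon-style flow described above.

```python
import sqlite3


def build_demo(conn):
    """Create a tiny normalized store and derive a star schema from it.

    Every name and value here is hypothetical, chosen only for illustration.
    """
    # Normalized (roughly third normal form) tables: each fact is stored once.
    conn.executescript("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE product (
            product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE order_line (
            order_id INTEGER,
            customer_id INTEGER REFERENCES customer(customer_id),
            product_id INTEGER REFERENCES product(product_id),
            quantity INTEGER,
            unit_price REAL);
        INSERT INTO customer VALUES (1, 'Acme', 'EU'), (2, 'Globex', 'US');
        INSERT INTO product VALUES
            (10, 'Widget', 'Hardware'), (11, 'Manual', 'Books');
        INSERT INTO order_line VALUES
            (100, 1, 10, 3, 2.0), (101, 2, 10, 1, 2.0), (102, 1, 11, 2, 5.0);
    """)
    # Star schema for reporting: one fact table surrounded by dimension tables.
    # For simplicity the source IDs are reused as dimension keys.
    conn.executescript("""
        CREATE TABLE dim_customer AS
            SELECT customer_id AS customer_key, name, region FROM customer;
        CREATE TABLE dim_product AS
            SELECT product_id AS product_key, name, category FROM product;
        CREATE TABLE fact_sales AS
            SELECT customer_id AS customer_key,
                   product_id AS product_key,
                   quantity,
                   quantity * unit_price AS revenue
            FROM order_line;
    """)
    conn.commit()


def revenue_by_region(conn):
    """Typical reporting query: join the fact table to one dimension."""
    rows = conn.execute("""
        SELECT d.region, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_customer AS d ON f.customer_key = d.customer_key
        GROUP BY d.region
        ORDER BY d.region
    """).fetchall()
    return dict(rows)
```

Calling build_demo on an in-memory connection (sqlite3.connect(":memory:")) and then revenue_by_region gives the revenue per region. In a Kimball-style build, the dim_ and fact_ tables would instead be loaded more or less directly from staged source data, without the intermediate normalized layer.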

In my experience, whilst the Inmon philosophy looks the more sensible, Inmon-based projects, at least those I've been involved with, tended to fail a lot more often than Kimball-based ones, primarily due to the time and effort required to build the large Data Warehouse before any business value can be seen.

There is a lot more to it, and I've probably let my own experience and opinions taint the purity of the methodologies (you can google for a larger discussion), but I mention it largely to illustrate that, even in the simple (hah) process of moving and consolidating data, many a religious war has been fought :) Also be aware that most of my practical DW experience dates from about a decade ago, so the field has probably moved on.


First of all I just want to say that I am not a data engineer and there is definitely someone out there that can answer this better than me.

I do think that there is a lot of theory behind data engineering, and it is very interesting. I too used to think it was boring and was more interested in just data science/machine learning. I am not sure I can say exactly what principles data engineering is based on, but it is about how to best store data, how to access data, and how to create underlying systems for more efficient computing. The first paper I read that really got me interested in this stuff was the original paper for Spark.

I also just did a quick google for data engineering PhD and came across this. There is a lot of interesting new research going on with how to store data using "nano-structures". There's also an area of research in quantum databases, which seems like a really interesting database abstraction.

I would be interested in hearing a more informed and complete answer from someone else who is in this field! In fact it might be useful to post this question on another stack exchange site.
