Successful ETL Automation: Libraries, Review Papers, Use Cases

I'm curious if anyone can point to some successful extract, transform, load (ETL) automation libraries, papers, or use cases for somewhat inhomogeneous data?

I would be interested to see any existing libraries dealing with scalable ETL solutions. Ideally these would be capable of ingesting 1-5 petabytes of data containing 50 billion records from 100 inhomogeneous data sets in tens or hundreds of hours, running on 4,096 cores (256 i2.8xlarge AWS machines). I really do mean ideally, as I would be interested to hear about a system with even 10% of this functionality to help reduce our team's ETL load.

Otherwise, I would be interested in any books, review articles, or high-quality research papers on the subject. I have done a literature review and have found only lower-quality conference proceedings with dubious claims.

I've seen a few commercial products advertised, but again, these make dubious claims without much evidence of their efficacy.

The datasets are rectangular and arrive as fixed-width files, CSV, TSV, or PSV. The number of fields ranges from 6 to 150, and the fields contain mostly text-based information about entities. Cardinality is large for individual information (e.g. address), but smaller for specific details like car type (van, SUV, sedan).

Mappings from abbreviated data to human-readable formats are commonly needed, as is transformation of records into first normal form.
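For concreteness, here is a minimal sketch (in pandas, with made-up column names and a made-up code table) of the two deterministic transformations just described: mapping abbreviated codes to human-readable values, and exploding a repeating group into first normal form. It is only meant to illustrate the shape of the problem, not our actual schemas.

```python
import pandas as pd

# Hypothetical illustration of the two transformations described above:
# (1) mapping abbreviated codes to human-readable values, and
# (2) flattening a repeating group into first normal form.
# Column names and the code table are made up for illustration.

CAR_TYPE_MAP = {"VN": "van", "SU": "suv", "SD": "sedan"}

df = pd.DataFrame({
    "entity_id": [1, 2],
    "car_type": ["VN", "SD"],
    "phones": ["555-0100;555-0101", "555-0199"],  # repeating group in one field
})

# (1) Abbreviation -> human-readable mapping.
df["car_type"] = df["car_type"].map(CAR_TYPE_MAP)

# (2) First normal form: one phone number per row.
df_1nf = (
    df.assign(phone=df["phones"].str.split(";"))
      .explode("phone")
      .drop(columns=["phones"])
)
print(df_1nf)
```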

As is likely obvious to the cognoscenti, I am looking for techniques that move beyond deterministic methods, using some sort of semi-supervised or supervised learning model.

I know this is a tall order, but I'm curious to assess the state-of-the-art before embarking on some ETL automation tasks to help guide how far to set our sights.

Thanks for your help!

Topic: etl, normalization, data-cleaning

Category: Data Science


Just about any ETL tool can manage fixed-width, CSV, TSV, or PSV input, and just about any tool should be able to manage 50 billion records. The limiting part of the question really has to do with what your destination format is and what disk throughput you need.

Expected throughput on an i2.4xlarge is around 250 Mb/s (roughly 31 MB/s). If an i2.8xlarge is double that, then 32 machines give about 2 GB/s in aggregate, which works out to roughly 138 hours to write a single petabyte. That is not counting the time and bandwidth needed to bring in the source data in the first place. Unless my math is completely off, that means 30 petabytes take about 6 months to get written to disk.
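To make the arithmetic explicit, here is the same back-of-the-envelope calculation as a quick script. The per-instance throughput figures are assumptions rather than measured benchmarks, and the 32-machine count is the one used above.

```python
# Back-of-the-envelope check of the write-time estimate above.
# Assumes ~250 Mbit/s (~31 MB/s) per i2.4xlarge, doubled for an i2.8xlarge,
# across 32 machines; these are assumed figures, not benchmarks.

PER_4XLARGE_MB_S = 250 / 8          # 250 Mbit/s -> ~31.25 MB/s
PER_8XLARGE_MB_S = PER_4XLARGE_MB_S * 2
MACHINES = 32

aggregate_mb_s = PER_8XLARGE_MB_S * MACHINES   # ~2,000 MB/s in aggregate
petabyte_mb = 1_000_000_000                    # 1 PB in MB (decimal units)

hours_per_pb = petabyte_mb / aggregate_mb_s / 3600
print(f"~{hours_per_pb:.0f} hours per petabyte")            # ~139 hours
months_for_30_pb = hours_per_pb * 30 / 24 / 30              # 30 PB, ~30-day months
print(f"~{months_for_30_pb:.1f} months for 30 PB")          # ~5.8 months
```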

It seems odd that you are looking to normalize that much data or turn it into a human-readable format (it's only going to get bigger), and even odder that you'd want to leverage machine learning as part of a transformation/load of that size.

Your solution will need to be on local hardware in order to keep costs reasonable.

I couldn't recommend a system (commercial or open source) that would scale to the degree necessary to perform this kind of ETL on 30 petabytes in a matter of days. At that scale, I'd be looking into lots of memory, RAM-backed/fronted SSDs, and custom development on FPGAs for the actual transformations. Of course, if my math on the write timing is wrong, this whole answer is invalid.


I don't think you'll find anything that meets all of your requirements, but here are some things to look at:

  • Automated ETL mapping: There is a tool called Karma, started by a team at USC's Information Sciences Institute. It learns from your ETL mappings and helps automate future mappings. It's the only open-source tool I'm aware of that helps automate the ETL process, but I would be very interested to hear if there are others out there.
  • Large-scale ETL: There are many tools you could look at for the scalability you need. I can personally recommend looking at Storm and Spark. Storm is excellent for stringing a chain of processing steps together that, given enough resources, can compute in near real time on streaming data. Spark, not too dissimilarly, has a streaming component with a similar use case, but standard batch Spark may fit your needs better if the data you need to ETL is a fixed set to be processed once (see the sketch after this list).
  • Data storage: You may also need to consider where all this data will live for the lifetime of the ETL process. You may need something like Kafka to handle large streams of data, or perhaps HDFS to store a static collection of files.
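As a rough illustration of the batch-Spark route, here is a minimal PySpark sketch of the kind of pass described in the question: ingest a delimited format, join in an abbreviation lookup, and explode a multi-value field into first normal form. The paths, column names, and lookup table are placeholders, and this says nothing about how Karma or any commercial product would approach the same job.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal PySpark sketch of a batch ETL pass over delimited files.
# Paths, delimiters, column names, and the lookup table are hypothetical.

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Ingest one of the delimited formats (here PSV); fixed-width files would
# need to be sliced out of a single text column with substring() instead.
raw = (
    spark.read
    .option("header", True)
    .option("sep", "|")
    .csv("s3://your-bucket/source/psv/")
)

# Abbreviation -> human-readable mapping as a small lookup table joined in.
car_types = spark.createDataFrame(
    [("VN", "van"), ("SU", "suv"), ("SD", "sedan")],
    ["car_type_code", "car_type"],
)
mapped = raw.join(F.broadcast(car_types), "car_type_code", "left")

# First normal form: explode a delimited multi-value field into rows.
normalized = mapped.withColumn(
    "phone", F.explode(F.split(F.col("phones"), ";"))
).drop("phones")

normalized.write.mode("overwrite").parquet("s3://your-bucket/normalized/")
```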
