Successful ETL Automation: Libraries, Review papers, Use Cases
Can anyone point to successful extract, transform, load (ETL) automation libraries, papers, or use cases for somewhat inhomogeneous data?
I would be interested to see any existing libraries dealing with scalable ETL solutions. Ideally these would be capable of ingesting 1-5 petabytes of data containing 50 billion records from 100 inhomogeneous data sets in tens or hundreds of hours running on roughly 4,096 cores (256 i2.8xlarge AWS instances). I really do mean ideally; I would be glad to hear about a system with 10% of this functionality if it helps reduce our team's ETL load.
Failing that, I would be interested in books, review articles, or high-quality research papers on the subject. I have done a literature review and have found only lower-quality conference proceedings with dubious claims.
I've seen a few commercial products advertised, but again, these make dubious claims without much evidence of efficacy.
The datasets are rectangular and can take the form of fixed-width files, CSV, TSV, or PSV. The number of fields ranges from 6 to 150, containing mostly text-based information about entities. Cardinality is high for fields holding individual-level information (e.g., address), but lower for categorical details like car type (van, SUV, sedan).
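To make the ingestion side concrete, here is a minimal sketch of the kind of format-dispatching reader I have in mind for these rectangular sources; the file names, formats, and column widths are hypothetical, not from the actual datasets.

```python
import pandas as pd

def read_rectangular(path, fmt, fixed_widths=None):
    """Load one rectangular source file into a DataFrame of strings."""
    if fmt == "csv":
        return pd.read_csv(path, dtype=str)
    if fmt == "tsv":
        return pd.read_csv(path, sep="\t", dtype=str)
    if fmt == "psv":
        return pd.read_csv(path, sep="|", dtype=str)
    if fmt == "fixed":
        # fixed_widths is a list of column widths, e.g. [10, 25, 8]
        return pd.read_fwf(path, widths=fixed_widths, dtype=str)
    raise ValueError(f"unknown format: {fmt}")

# Example usage with a made-up file name:
# df = read_rectangular("vehicles_2015.psv", "psv")
```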
Mappings from abbreviated data to human-readable formats are commonly needed, as is transformation of records to first normal form.
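As a toy illustration of those two transformations (the abbreviation table, column names, and values below are invented, not taken from the real data):

```python
import pandas as pd

# Hypothetical code-to-label mapping for an abbreviated field.
CAR_TYPE_MAP = {"VN": "van", "SU": "suv", "SD": "sedan"}

df = pd.DataFrame({
    "owner_id": [1, 2],
    "car_type": ["VN", "SD"],
    # A repeating group packed into one field violates first normal form.
    "phones": ["555-0100;555-0101", "555-0199"],
})

# Map abbreviated codes to human-readable values.
df["car_type"] = df["car_type"].map(CAR_TYPE_MAP)

# Move toward first normal form: one phone number per row.
df_1nf = (
    df.assign(phone=df["phones"].str.split(";"))
      .explode("phone")
      .drop(columns=["phones"])
)
print(df_1nf)
```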
As is likely obvious to the cognoscenti, I am looking for techniques that move beyond deterministic methods, using some sort of semi-supervised or supervised learning model.
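For example, one learned approach I can imagine (purely a sketch with invented training strings and labels) is classifying raw field values into semantic types, so that columns arriving from a new source can be mapped automatically instead of by hand-written rules:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: raw cell values labeled with a semantic column type.
train_values = ["123 Main St", "45 Oak Ave", "van", "sedan", "suv",
                "PO Box 99", "minivan"]
train_labels = ["address", "address", "car_type", "car_type", "car_type",
                "address", "car_type"]

# Character n-gram features plus a linear classifier.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_values, train_labels)

# Predict the semantic type of values from an unseen column.
print(clf.predict(["672 Elm Street", "coupe"]))
```

I am curious whether published work goes meaningfully beyond this kind of per-column classification toward end-to-end mapping and cleaning.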
I know this is a tall order, but I'm curious to assess the state-of-the-art before embarking on some ETL automation tasks to help guide how far to set our sights.
Thanks for your help!
Topic: etl, normalization, data-cleaning
Category: Data Science