What is the best practice for testing an ETL pipeline?

In traditional software development practice, a piece of code goes through several stages of testing (unit tests, integration tests, user acceptance tests) before going into production, to make sure the software is stable.

An ETL pipeline, as a piece of code, should also go through these testing stages to build a healthy system.

However, due to the nature of the ETL process, traditional testing techniques may not be applicable.

Are there any references or guidelines that focus specifically on testing ETL pipelines?

Topic etl reference-request

Category Data Science


I have been writing and testing ETL pipelines for a few years now and there are typically two types of pipelines.

  1. Code-only pipelines written in Python or whatever

  2. GUI pipelines using tools like SSIS or Informatica

The first type you can test like any other code, with unit and integration tests. The second you can only really test with integration tests, i.e. you deploy the pipeline and run it in a test environment.
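As a rough sketch of what a unit test for a code-only pipeline can look like: if the transform step is a plain function that takes rows in and gives rows out, it can be tested without touching a database. The `transform` function, the field names, and the sample rows below are made up for illustration; they are not taken from any real pipeline.

```python
from datetime import date


def transform(rows):
    """Hypothetical transform step: drop rows without an id, tidy names, parse dates."""
    cleaned = []
    for row in rows:
        if row.get("id") is None:
            continue  # a row without a key cannot be loaded, so skip it
        cleaned.append(
            {
                "id": int(row["id"]),
                "name": row["name"].strip().title(),
                "signup_date": date.fromisoformat(row["signup_date"]),
            }
        )
    return cleaned


def test_transform_cleans_and_filters():
    # Small hand-written input covering one good row and one row to be dropped
    raw = [
        {"id": "1", "name": "  ada lovelace ", "signup_date": "2019-07-01"},
        {"id": None, "name": "no key", "signup_date": "2019-07-02"},
    ]
    assert transform(raw) == [
        {"id": 1, "name": "Ada Lovelace", "signup_date": date(2019, 7, 1)}
    ]
```

A test runner such as pytest would pick up `test_transform_cleans_and_filters` by its name. The extract and load steps, which talk to real systems, are what the integration tests then cover.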

I expand on how to unit test code pipelines here: https://the.agilesql.club/2019/07/how-do-we-test-etl-pipelines-part-one-unit-tests/

Essentially, you will want unit tests and integration tests, plus monitoring, which is a form of continuous testing in production :)
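One way to picture monitoring as continuous testing is a check that runs after every load and reports anything suspicious. This is only a minimal sketch: `check_load`, its parameters, and the two checks it performs (row count and null fields) are assumptions chosen for illustration, and a real pipeline would pick checks that match its own data.

```python
def check_load(source_count, target_rows, required_fields):
    """Hypothetical post-load check: return a list of problems (empty means healthy)."""
    problems = []
    # The number of rows loaded should match the number of rows extracted
    if len(target_rows) != source_count:
        problems.append(
            f"row count mismatch: source={source_count}, target={len(target_rows)}"
        )
    # Required fields should never be null in the target
    for field in required_fields:
        missing = sum(1 for row in target_rows if row.get(field) is None)
        if missing:
            problems.append(f"{missing} row(s) with null {field}")
    return problems


# Example run against made-up data: one row went missing and one email is null
loaded = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
problems = check_load(3, loaded, ["id", "email"])
```

In production the returned list would typically be sent to whatever alerting is already in place, so a bad load is noticed when it happens.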
