How to build a data analysis pipeline

I have a series of scripts. Some are in R, some in Python, and others in SAS. I have built them so that one script outputs a .csv file, the next script reads it and outputs another .csv file, and so on...

I want to create a script that runs each of them in order so that the final output is generated automatically.

What method would be best for this, and can anyone point me to any examples of this kind of setup?

Normally, I would simply write a bash script (or PowerShell on Windows) and string the commands together. However, this approach is rather fragile: intermediate files get overwritten, and when a long end-to-end batch breaks partway through, you have to start the whole run over.

I tend to use a workflow package like Luigi or Airflow when stringing dependent tasks together. The idea behind Luigi is that you break each action into a task, and each task defines three methods:

  1. requires() - what needs to exist before this task runs?
  2. output() - where is the output going?
  3. run() - what does the task do?

So essentially you chain a bunch of tasks together and define each run() to call your previously built scripts using something like subprocess. In requires() you reference the previous step, and in output() you point to where the file is being written, as sketched below.
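Here is a minimal sketch of two chained tasks following that pattern. The script and file names (extract.py, clean.R, raw.csv, clean.csv) are placeholders for your own; requires()/output()/run() and luigi.LocalTarget are standard Luigi.

```python
import subprocess

import luigi


class ExtractStep(luigi.Task):
    """First step: a Python script that writes raw.csv (placeholder names)."""

    def output(self):
        # Where is the output going?
        return luigi.LocalTarget("raw.csv")

    def run(self):
        # What is the task? Call the existing script; it writes raw.csv itself.
        subprocess.run(["python", "extract.py"], check=True)


class CleanStep(luigi.Task):
    """Second step: an R script that reads raw.csv and writes clean.csv."""

    def requires(self):
        # What needs to exist before this task runs?
        return ExtractStep()

    def output(self):
        return luigi.LocalTarget("clean.csv")

    def run(self):
        subprocess.run(["Rscript", "clean.R"], check=True)


if __name__ == "__main__":
    # e.g. python pipeline.py CleanStep --local-scheduler
    luigi.run()
```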

The pro of doing this is that if your process breaks at task 50 out of 100, you don't have to rerun the first 49 tasks: Luigi walks down the dependency tree until it finds a requirement that isn't fulfilled and resumes from there.
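For instance, assuming the tasks above live in a module named pipeline (a hypothetical name), you can also kick off the whole chain programmatically. Luigi checks each task's output target and only runs the ones whose file is missing:

```python
import luigi

from pipeline import CleanStep  # hypothetical module/task names from the sketch above

# Luigi resolves requires() recursively and only runs tasks whose
# output() target does not already exist, so completed steps are skipped.
luigi.build([CleanStep()], local_scheduler=True)
```

So after a failure you just launch the same command again and it picks up where it stopped.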

See also: calling R from Python with Luigi.
