Diagramming a data science workflow?
I'm working on a consulting project for a tech client, and I'm unsure of the best way to present an advanced analytics workflow. The presentation to the panel will focus on results, but in this case it is also warranted to show a visual of the process behind the scenes.
Specifically, I need to show the following:
1) A raw data file is used as input to a cleaning script, which performs imputation and adds/drops variables based on some criteria. Some of the added variables come from joins on external files.
2) The cleaned data is passed to a script which creates variable subsets dependent on remaining granularity and subsets of interest. This outputs many cleaned data sets.
3) The new data sets are passed to an external computing cluster, which builds a model on each data set, and outputs a table with performance metrics.
4) The top-performing models are re-run locally to score observations.
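For reference, the four steps above can be written down as a directed graph, independent of whichever notation ends up on the slide. Below is a minimal sketch, assuming Python and Graphviz's DOT text format. The node labels are just my shorthand for the steps, and `EDGES` and `to_dot` are hypothetical names rather than part of any library.

```python
# Hypothetical node labels summarizing steps 1-4; they are shorthand,
# not names taken from any real tool or script.
EDGES = [
    ("raw data file", "cleaning script"),
    ("external files", "cleaning script"),
    ("cleaning script", "cleaned data"),
    ("cleaned data", "subsetting script"),
    ("subsetting script", "cleaned data sets"),
    ("cleaned data sets", "computing cluster"),
    ("computing cluster", "performance metrics table"),
    ("performance metrics table", "local re-run of top models"),
    ("local re-run of top models", "scored observations"),
]


def to_dot(edges, name="workflow"):
    """Render a list of (source, target) pairs as Graphviz DOT text."""
    lines = [f"digraph {name} {{", "  rankdir=LR;"]  # left-to-right layout
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dot(EDGES))
```

If the printed text is saved as `workflow.dot`, Graphviz's command-line tool can render it with `dot -Tpng workflow.dot -o workflow.png`. This only captures the flow itself; it does not answer the question of which notation the panel would expect.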
Currently, I am leaning toward UML, as the panel is likely to be familiar with it and appreciate its use. Still, I would like to know whether data science has an established, convenient standard for diagramming workflows.
Some perspective:
Information systems has a long-established diagramming tradition at both the conceptual and hardware levels, with ERDs and data flow diagrams of all kinds. Software development has also put heavy emphasis on diagramming through flowcharts and UML diagrams. Business analytics relies on BPMN to show business processes.
However, when trying to find a diagramming standard in the realm of data science, all I could find were references to the many beautiful and clear visualizations we make of our own findings. We certainly don't shy away from standard methods in visualization (and often for good reason). So, is there such a method for documenting our own workflow, as there is in other fields?
Topic software-development visualization
Category Data Science