Is it possible to automate generating reproducibility documentation?

Question


blunders

May 15, 2014 02:02

First, I think it's worth stating what I mean by replication and reproducibility:

Replication of analysis A results in an exact copy of all inputs and processes that are supplied, and results in identical outputs in analysis B.
Reproducibility of analysis A results in inputs, processes, and outputs that are semantically identical to analysis A, without access to the exact inputs and processes.

Putting aside how easy it might be to replicate a given build, especially an ad-hoc one, to me replication is always possible if it's planned for and worth doing. That said, what is unclear to me is how to execute a data science workflow that allows for reproducibility.

The closest comparison I can think of is documentation generators that generate software documentation intended for programmers. The main difference I see is that, in theory, if two sets of analysis ran the "reproducibility documentation generator", the resulting documentation should match.

Another issue is that while I get the concept of reproducibility documentation, I am having a hard time imagining what it would look like in usable form without just being a guide to replicating the analysis.

Lastly, the whole intent of this is to understand whether it's possible to "bake in" reproducibility documentation as you build out a stack, not after the stack is built.
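To make the "bake in" idea concrete, here is a minimal sketch of one way it might look in Python. Everything in it is a hypothetical illustration rather than an existing tool: the `documented` decorator, the `STEPS` registry, and the toy `sample` and `summarise` steps are all made-up names. Each step records its method-level description and parameters (but not the data) as it runs, so two analyses that follow the same technique on different inputs should emit matching documents.

```python
import functools
import json

# Hypothetical registry of method-level steps, filled in as the pipeline runs.
STEPS = []

def documented(description):
    """Record a step's name, description and keyword parameters on each call."""
    def wrap(func):
        @functools.wraps(func)
        def inner(data, **params):
            # Only the technique is logged, never the data itself.
            STEPS.append({"step": func.__name__,
                          "description": description,
                          "params": params})
            return func(data, **params)
        return inner
    return wrap

@documented("Keep every k-th observation (systematic sample)")
def sample(data, k=2):
    return data[::k]

@documented("Summarise the sample by its arithmetic mean")
def summarise(data):
    return sum(data) / len(data)

def run_analysis(data):
    """Run the toy pipeline and return (result, reproducibility document)."""
    STEPS.clear()
    result = summarise(sample(data, k=2))
    return result, json.dumps(STEPS, indent=2, sort_keys=True)

# Two analyses with different inputs but the same technique.
result_a, doc_a = run_analysis([1, 2, 3, 4, 5, 6])
result_b, doc_b = run_analysis([10, 20, 30, 40])
print(doc_a == doc_b)  # the generated documents can be compared directly
```

The design choice here is that the document describes what was done rather than what it was done to, which is what separates it from a replication guide under the definitions above.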

So, is it possible to automate generating reproducibility documentation, and if so, how, and what would it look like?

UPDATE: Please note that this is the second draft of this question and that Christopher Louden was kind enough to let me edit the question after I realized it was likely the first draft was unclear. Thanks!

Topic processing

Category Data Science

Christopher Louden · Accepted Answer · May 14, 2014 22:03

To be reproducible without being just a replication, you would need to redo the experiment with new data, following the same technique as before. The workflow is not as important as the techniques used: sample the data in the same way and use the same type of models. It doesn't matter if you switch from one language to another, so long as the models and the data manipulations are the same.

This kind of reproduction shows that the results you got in the first experiment are less likely to have been a fluke.
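As a rough illustration of redoing an experiment with new data and the same technique, the sketch below applies one procedure to two independently drawn samples and checks that the estimates agree. The data are simulated, and the function name `run_experiment`, the distribution parameters, the sample size, and the tolerance are all assumptions chosen only for the example.

```python
import random
import statistics

def run_experiment(seed, n=2000):
    """Draw a fresh simulated sample and apply the same technique to it."""
    rng = random.Random(seed)
    # Same sampling scheme every time; only the drawn data differ.
    sample = [rng.gauss(5.0, 2.0) for _ in range(n)]
    # Same "model" every time: estimate the mean.
    return statistics.fmean(sample)

original = run_experiment(seed=1)
reproduction = run_experiment(seed=2)

# The two runs share no data, so close agreement supports the first result.
agree = abs(original - reproduction) < 0.5
print(original, reproduction, agree)
```

Rerunning with the same seed would be a replication in the question's terms; changing the seed stands in for collecting new data.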

