What is the best practice for data folder structuring?

I work for a small data science consultancy firm, and we are trying to standardize our project folder structure. We started from the Cookiecutter Data Science structure, which is a great base.

However, one of the discussion points concerns the subfolders of the data folder, which is structured as:

  • Raw
  • Interim
  • Processed

Let's think about the following situations:

  1. The client gives you a manually extracted csv file -> This obviously goes into Raw
  2. You have access to SQL databases and make a no-modification extract -> Still into Raw I guess?
  3. Because of very large databases, you create a semi-complex SQL query as base for a feature -> Is this Raw or Interim?

What are the best practices you apply? What would you recommend?

PS: Links to GitHub projects that follow this kind of structure are very welcome.

What are the best practices you apply? What would you recommend?

My practice is "kind of structured but quite different every time", and I don't recommend it ;)

I suspect that I'm not the only one but I don't have any stats.

Thanks for the link to Cookiecutter; this looks interesting! After reading a bit about it, it looks to me like a key criterion is this: "anyone should be able to reproduce the final products with only the code in src and the data in data/raw."
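
As a rough illustration of that criterion, here is a minimal sketch of a single entry point that rebuilds everything downstream of data/raw (the file names, table columns and steps are hypothetical, not part of the Cookiecutter template itself):

```python
# src/rebuild.py -- hypothetical entry point: everything below data/raw
# is treated as disposable and regenerated from raw data plus code.
from pathlib import Path

import pandas as pd

DATA = Path("data")
RAW, INTERIM, PROCESSED = DATA / "raw", DATA / "interim", DATA / "processed"


def rebuild() -> None:
    """Recreate interim and processed data from data/raw only."""
    INTERIM.mkdir(parents=True, exist_ok=True)
    PROCESSED.mkdir(parents=True, exist_ok=True)

    # Example step: clean a raw extract into an interim table.
    orders = pd.read_csv(RAW / "orders.csv")  # hypothetical raw file
    orders = orders.dropna(subset=["order_id"])
    orders.to_csv(INTERIM / "orders_clean.csv", index=False)

    # Example step: build the final modelling table from the interim data.
    features = orders.groupby("customer_id", as_index=False)["amount"].sum()
    features.to_csv(PROCESSED / "customer_features.csv", index=False)


if __name__ == "__main__":
    rebuild()
```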

Based on that I would argue that:

  2. You have access to SQL databases and make a no-modification extract -> Still into Raw I guess?

Yep, because this data cannot be obtained by processing some other part of the raw data.
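For example, a no-modification extract could be written directly into data/raw, with the extraction script kept under src so the pull can be repeated. A sketch, assuming the connection string, table and file names (they are placeholders, not from the question):

```python
# src/data/extract_raw.py -- hypothetical script: dump a table as-is into data/raw
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

RAW = Path("data/raw")
RAW.mkdir(parents=True, exist_ok=True)

# Connection string and table name are placeholders.
engine = create_engine("postgresql://user:password@dbhost/clientdb")

# No transformation at all: SELECT * and write straight to disk.
df = pd.read_sql("SELECT * FROM orders", engine)
df.to_csv(RAW / "orders_extract.csv", index=False)
```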

  3. Because of very large databases, you create a semi-complex SQL query as base for a feature -> Is this Raw or Interim?

Assuming this query is run against the raw SQL database, the resulting data goes into interim, because it can be regenerated by running the query again. The query itself should be stored in src/features, I think.
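
A minimal sketch of that split, with the SQL file versioned under src/features and its result cached in data/interim (paths, query file and connection string are hypothetical):

```python
# src/features/build_base_table.py -- hypothetical: run the feature query
# stored next to this script and cache its result in data/interim.
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

QUERY_FILE = Path("src/features/base_feature_query.sql")  # the semi-complex query, kept in src
INTERIM = Path("data/interim")
INTERIM.mkdir(parents=True, exist_ok=True)

engine = create_engine("postgresql://user:password@dbhost/clientdb")  # placeholder connection

# The output is disposable: it lives in interim because re-running this
# script against the raw database regenerates it exactly.
base = pd.read_sql(QUERY_FILE.read_text(), engine)
base.to_csv(INTERIM / "feature_base_table.csv", index=False)
```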
