What is the best practice for data folder structuring?

I work for a small data science consultancy firm, and we are trying to standardize our project folder structure. We started from the Cookiecutter Data Science structure, which is a great base.

However, one of the discussion points concerns the subfolders of the data folder, which is structured as:

  • Raw
  • Interim
  • Processed

Let's think about the following situations:

  1. The client gives you a manually extracted csv file -> This obviously goes into Raw
  2. You have access to SQL databases and make a no-modification extract -> Still into Raw I guess?
  3. Because of very large databases, you create a semi-complex SQL query as base for a feature -> Is this Raw or Interim?

What are the best practices you apply? What would you recommend?

PS: Links to GitHub projects that follow this kind of structure are very welcome.

What are the best practices you apply? What would you recommend?

My practice is "kind of structured but quite different every time", and I don't recommend it ;)

I suspect that I'm not the only one but I don't have any stats.

Thanks for the link to Cookiecutter; this looks interesting! After reading a bit about it, it looks to me like a key criterion is this: "anyone should be able to reproduce the final products with only the code in src and the data in data/raw."
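
As a rough illustration of that criterion, here is a minimal sketch of a single entry point that rebuilds everything downstream of data/raw (the file names, table columns and steps are hypothetical, not part of the Cookiecutter template itself):

```python
# src/rebuild.py -- hypothetical entry point: everything below data/raw
# is treated as disposable and regenerated from raw data plus code.
from pathlib import Path

import pandas as pd

DATA = Path("data")
RAW, INTERIM, PROCESSED = DATA / "raw", DATA / "interim", DATA / "processed"


def rebuild() -> None:
    """Recreate interim and processed data from data/raw only."""
    INTERIM.mkdir(parents=True, exist_ok=True)
    PROCESSED.mkdir(parents=True, exist_ok=True)

    # Example step: clean a raw extract into an interim table.
    orders = pd.read_csv(RAW / "orders.csv")  # hypothetical raw file
    orders = orders.dropna(subset=["order_id"])
    orders.to_csv(INTERIM / "orders_clean.csv", index=False)

    # Example step: build the final modelling table from the interim data.
    features = orders.groupby("customer_id", as_index=False)["amount"].sum()
    features.to_csv(PROCESSED / "customer_features.csv", index=False)


if __name__ == "__main__":
    rebuild()
```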

Based on that I would argue that:

  2. You have access to SQL databases and make a no-modification extract -> Still into Raw I guess?

Yep, because this data cannot be obtained by processing some other part of the raw data.
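For example, a no-modification extract could be written directly into data/raw, with the extraction script kept under src so the pull can be repeated. A sketch, assuming the connection string, table and file names (they are placeholders, not from the question):

```python
# src/data/extract_raw.py -- hypothetical script: dump a table as-is into data/raw
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

RAW = Path("data/raw")
RAW.mkdir(parents=True, exist_ok=True)

# Connection string and table name are placeholders.
engine = create_engine("postgresql://user:password@dbhost/clientdb")

# No transformation at all: SELECT * and write straight to disk.
df = pd.read_sql("SELECT * FROM orders", engine)
df.to_csv(RAW / "orders_extract.csv", index=False)
```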

  3. Because of very large databases, you create a semi-complex SQL query as base for a feature -> Is this Raw or Interim?

Assuming this query is run against the raw SQL database, the resulting data goes into interim, because it can be regenerated by running the query again. The query itself should be stored in src/features, I think.
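
A minimal sketch of that split, with the SQL file versioned under src/features and its result cached in data/interim (paths, query file and connection string are hypothetical):

```python
# src/features/build_base_table.py -- hypothetical: run the feature query
# stored next to this script and cache its result in data/interim.
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

QUERY_FILE = Path("src/features/base_feature_query.sql")  # the semi-complex query, kept in src
INTERIM = Path("data/interim")
INTERIM.mkdir(parents=True, exist_ok=True)

engine = create_engine("postgresql://user:password@dbhost/clientdb")  # placeholder connection

# The output is disposable: it lives in interim because re-running this
# script against the raw database regenerates it exactly.
base = pd.read_sql(QUERY_FILE.read_text(), engine)
base.to_csv(INTERIM / "feature_base_table.csv", index=False)
```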
