How to partition data effectively?

I have a pipeline which outputs model scores to S3. I need to partition the data by model_type and date. Which of the following is the most efficient way to partition the data?

  1. s3://bucket/data/model_type=foo/dt=YYYY-MM-DD/a.csv
  2. s3://bucket/data/dt=YYYY-MM-DD/model_type=foo/a.csv



When partitioning data, keep in mind that too many partitions are bad practice, while partitioning only by model type may leave you with too few. So before you decide on the best partitioning scheme, think about the following:

  1. How will you access the data, and which access patterns are most frequent? If you usually filter by model first and then by date, go for the first layout.
  2. If you usually filter by date first, go for the second one.
  3. Also, instead of creating a partition for each day, you could create biweekly partitions, which may keep the partition count down while still letting you prune by date range.
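
The options above can be sketched in Python as a rough illustration. The helper names, the `dt_bucket` key, and the anchor date used to align the 14-day buckets are all assumptions made for this example, not part of the question or of any S3 API:

```python
from datetime import date, timedelta

# Hypothetical values: the base path follows the question's example,
# everything else is illustrative.
BASE = "s3://bucket/data"
ANCHOR = date(1970, 1, 5)  # arbitrary Monday used to align the 14-day buckets


def bucket_start(d: date, days: int = 14) -> date:
    """Return the first day of the fixed-length bucket that contains d."""
    offset = (d - ANCHOR).days % days
    return d - timedelta(days=offset)


def prefix_model_first(model_type: str, d: date) -> str:
    """Layout 1: suits queries that pick a model, then a date range."""
    return f"{BASE}/model_type={model_type}/dt={d.isoformat()}/"


def prefix_date_first(model_type: str, d: date) -> str:
    """Layout 2: suits queries that pick a date, then one or more models."""
    return f"{BASE}/dt={d.isoformat()}/model_type={model_type}/"


def prefix_biweekly(model_type: str, d: date) -> str:
    """Coarser variant of layout 1: one partition per 14-day bucket."""
    start = bucket_start(d)
    return f"{BASE}/model_type={model_type}/dt_bucket={start.isoformat()}/"
```

With the biweekly variant, several days share one partition, so the exact date would presumably need to stay in the data itself (or in the file name) for day-level filtering to remain possible.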
